Contributor

@CarloMariaProietti CarloMariaProietti commented Dec 11, 2025

Fix #1492
The idea is the following:
ValueColumnInternal is an internal interface that holds the statistic values, so they are not exposed in the public API.
Implementations of ValueColumnInternal contain the actual cache.

Each statistic (for the moment only max) needs two caches, because computing it can give different results depending on the skipNaN boolean parameter.

I implemented the solution by overloading aggregateSingleColumn; the overload wraps the original aggregateSingleColumn so that it can use the caches.

For the moment only max is covered, but it would be easy to do the same for min, sum, mean and median.
Something similar could be done for percentile and std.
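
To make the two-cache idea concrete, here is a minimal plain-Kotlin sketch of it. It is not the code from this PR: `TwoSlotCache`, `getOrCompute` and `maxOrNullOf` are hypothetical names, and the max helper works on a plain `List<Double>` instead of going through the DataFrame aggregators.

```kotlin
// Hypothetical stand-in for a per-statistic cache: one slot per skipNaN setting.
class TwoSlotCache {
    private var computedSkippingNaN = false
    private var computedNotSkippingNaN = false
    private var valueSkippingNaN: Any? = null
    private var valueNotSkippingNaN: Any? = null

    // Returns the cached value for this skipNaN setting, computing it at most once.
    // Separate "computed" flags are needed because null is a valid result.
    fun getOrCompute(skipNaN: Boolean, compute: () -> Any?): Any? {
        if (skipNaN) {
            if (!computedSkippingNaN) {
                valueSkippingNaN = compute()
                computedSkippingNaN = true
            }
            return valueSkippingNaN
        }
        if (!computedNotSkippingNaN) {
            valueNotSkippingNaN = compute()
            computedNotSkippingNaN = true
        }
        return valueNotSkippingNaN
    }
}

// Plain-Kotlin stand-in for a max aggregation (assumption: not the library's aggregator).
// With skipNaN = false, maxOrNull() returns NaN as soon as any element is NaN.
fun maxOrNullOf(values: List<Double>, skipNaN: Boolean): Double? {
    val candidates = if (skipNaN) values.filterNot { it.isNaN() } else values
    return candidates.maxOrNull()
}
```

The two slots exist because the same column can legitimately have two different maxima: one that ignores NaN values and one that propagates them.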

git
Collaborator

please remove this from the commit

import kotlin.reflect.KType
import kotlin.reflect.full.withNullability

public class WrappedStatistic(
Collaborator

this class should not be public, should it?


public fun <T : Comparable<T>> DataColumn<T?>.maxOrNull(skipNaN: Boolean = skipNaNDefault): T? =
Aggregators.max<T>(skipNaN).aggregateSingleColumn(this)
if (this is ValueColumnInternal<*>) {
Collaborator

I'm not so fond of this solution, as it requires a lot of refactoring in other functions, plus it does not work when you write df.max { myCol }, as I mentioned in #1492 (comment)

Collaborator

Instead, I'd do this check inside the original `aggregateSingleColumn()`. Each `Aggregator` has a name which you could use to query the `ValueColumnInternal` for the right `WrappedStatistic` if they are stored in a `Map<String, WrappedStatistic>` in `ValueColumnImpl`. Though I suppose each `Aggregator` will also need to store any other provided arguments like `skipNaN: Boolean` and `percentile: Double` when needed... In a `Map<String, Any?>` maybe?

That way we could store our "Statistics Cache" in ValueColumnImpl as a

Map<String, Map<Map<String, Any?>, Any?>>

so the result cache could look like:

{
    "max" : {
        { "skipNaN": true } : 312.4
    },
    "min" : {},
    "std" : {
        { "std": 0.9, "skipNaN": false } : Double.NaN,
        { "std": 0.9, "skipNaN": true } : 12.3
    }
}

The challenge may lie in doing this neatly ;P
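
One way the nested map above might look in plain Kotlin is sketched below. This is only an illustration under assumptions: `StatisticsCache` and `getOrCompute` are hypothetical names, and the lookup is shown in isolation rather than wired into `aggregateSingleColumn()` or `ValueColumnImpl`.

```kotlin
// Hypothetical cache shaped like Map<String, Map<Map<String, Any?>, Any?>>:
// outer key = aggregator name, inner key = the arguments the aggregator was created with.
class StatisticsCache {
    private val results = mutableMapOf<String, MutableMap<Map<String, Any?>, Any?>>()

    fun getOrCompute(name: String, params: Map<String, Any?>, compute: () -> Any?): Any? {
        val perStatistic = results.getOrPut(name) { mutableMapOf() }
        // containsKey distinguishes a cached null result from "not computed yet".
        if (!perStatistic.containsKey(params)) {
            perStatistic[params] = compute()
        }
        return perStatistic[params]
    }
}
```

Kotlin maps compare by content, so two parameter maps with the same entries hit the same cache slot regardless of insertion order. The cost is an extra map allocation per lookup, which may or may not matter here.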

)

internal interface ValueColumnInternal<T> : ValueColumn<T> {
val max: WrappedStatistic
Collaborator

I would make this a var and nullable, so we can initialize it to null and don't need to instantiate a class for each statistic when a column is created.
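
A rough sketch of what the nullable `var` could look like, using hypothetical stand-in types (`StatHolder`, `ColumnSketch` and `cachedMax` are not part of the PR or the library):

```kotlin
// Hypothetical holder for a computed statistic; the real class in the PR has more fields.
class StatHolder(val value: Any?)

// Hypothetical column stand-in backed by a plain list.
class ColumnSketch(val values: List<Int>) {
    // Starts as null, so creating a column allocates no statistic objects.
    var max: StatHolder? = null

    // Creates the holder on first use and reuses it afterwards.
    fun cachedMax(): Any? {
        val holder = max ?: StatHolder(values.maxOrNull()).also { max = it }
        return holder.value
    }
}
```

Wrapping the value in a holder keeps "not computed yet" (a null holder) separate from "computed, and the result is null" (a holder with a null value).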

import kotlin.reflect.KType
import kotlin.reflect.full.withNullability

public class WrappedStatistic(
Collaborator

this class should not be public, should it?

Also, I think that if you make the other one a var, this can be a data class with vars. It's a bit more Kotlin-like :)

public var wasComputedNotSkippingNaN: Boolean = false,
public var statisticComputedSkippingNaN: Any? = null,
public var statisticComputedNotSkippingNaN: Any? = null,
)
Collaborator

what about std and percentile that take extra arguments?

Development

Successfully merging this pull request may close these issues.

Lazy statistics for columns

2 participants