Lazy statistics for ValueColumn #1636

CarloMariaProietti · 2025-12-11T18:52:37Z

Fix #1492
The idea is the following:
ValueColumnInternal is an internal interface that holds the statistic values, so they are not exposed publicly.
Implementations of ValueColumnInternal contain the actual cache.

Each stat (for the moment only max) needs two caches, because the computed result can differ depending on the skipNaN boolean parameter.

I implemented the solution by overloading aggregateSingleColumn; the overload wraps the original aggregateSingleColumn so that the caches can be used.

For the moment there is only max, but it would be easy to do the same for min, sum, mean and median.
Something similar could be done for percentile and std.
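To make the two-cache idea concrete, here is a minimal sketch that does not use the actual DataFrame classes. CachedStat, CachingColumn, computeMax and cachedMaxOrNull are hypothetical names chosen for illustration; they only mimic the shape of WrappedStatistic and the wrapping overload described above.

```kotlin
// Hypothetical stand-in for WrappedStatistic: one slot per skipNaN setting.
// Separate "computed" flags are needed because null is a valid result (empty column).
internal class CachedStat {
    var computedSkippingNaN = false
    var computedNotSkippingNaN = false
    var valueSkippingNaN: Double? = null
    var valueNotSkippingNaN: Double? = null
}

// Hypothetical column type; the real ValueColumnImpl is not modeled here.
internal class CachingColumn(val values: List<Double>) {
    val maxCache = CachedStat()
    var computations = 0 // counts real computations, for demonstration only
}

// Uncached max. With skipNaN = false, any NaN in the input makes the result NaN.
internal fun computeMax(values: List<Double>, skipNaN: Boolean): Double? {
    val input = if (skipNaN) values.filterNot { it.isNaN() } else values
    if (input.any { it.isNaN() }) return Double.NaN
    return input.maxOrNull()
}

// Wrapper that consults the cache first and only falls back to computeMax on a miss.
internal fun CachingColumn.cachedMaxOrNull(skipNaN: Boolean): Double? {
    val cache = maxCache
    if (skipNaN) {
        if (!cache.computedSkippingNaN) {
            cache.valueSkippingNaN = computeMax(values, skipNaN = true)
            cache.computedSkippingNaN = true
            computations++
        }
        return cache.valueSkippingNaN
    }
    if (!cache.computedNotSkippingNaN) {
        cache.valueNotSkippingNaN = computeMax(values, skipNaN = false)
        cache.computedNotSkippingNaN = true
        computations++
    }
    return cache.valueNotSkippingNaN
}
```

Calling cachedMaxOrNull(skipNaN = true) twice on the same column computes the max once; calling it with skipNaN = false afterwards triggers a second, separate computation.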

Jolanrensen · 2025-12-12T15:17:14Z

git

please remove this from the commit

Jolanrensen · 2025-12-12T15:20:32Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/columns/ValueColumnImpl.kt

 import kotlin.reflect.KType
 import kotlin.reflect.full.withNullability

+public class WrappedStatistic(


this class should not be public, should it?

Jolanrensen · 2025-12-12T15:24:42Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/max.kt


 public fun <T : Comparable<T>> DataColumn<T?>.maxOrNull(skipNaN: Boolean = skipNaNDefault): T? =
-    Aggregators.max<T>(skipNaN).aggregateSingleColumn(this)
+    if (this is ValueColumnInternal<*>) {


I'm not so fond of this solution, as it requires a lot of refactoring in other functions, plus it does not work when you write df.max { myCol }, as I mentioned in #1492 (comment)

Instead, I'd do this check inside the original aggregateSingleColumn(). Each Aggregator has a name which you could use to query the ValueColumnInternal for the right WrappedStatistic if they are stored in a Map<String, WrappedStatistic> in ValueColumnImpl. Though I suppose each Aggregator will also need to store any other provided arguments like skipNaN: Boolean and percentile: Double when needed... In a Map<String, Any?> maybe?

That way we could store our "Statistics Cache" in ValueColumnImpl as a

Map<String, Map<Map<String, Any?>, Any?>>

so the result cache could look like:

{
  "max" : { { "skipNaN": true } : 312.4 },
  "min" : {},
  "std" : {
    { "std": 0.9, "skipNaN": false } : Double.NaN,
    { "std": 0.9, "skipNaN": true } : 12.3
  }
}

The challenge may lie in doing this neatly ;P
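As a rough illustration of the nested-map cache described above, something like the following could work. This is only a sketch: StatisticsCache, getOrCompute and misses are hypothetical names, and how an Aggregator would expose its name and arguments is left open here.

```kotlin
// Hypothetical cache: aggregator name -> (parameter map -> cached result).
// Read-only maps created with mapOf compare by content, so they can serve as keys.
internal class StatisticsCache {
    private val results = mutableMapOf<String, MutableMap<Map<String, Any?>, Any?>>()

    // Number of times a value actually had to be computed (for demonstration only).
    var misses = 0
        private set

    fun getOrCompute(name: String, params: Map<String, Any?>, compute: () -> Any?): Any? {
        val perAggregator = results.getOrPut(name) { mutableMapOf() }
        // containsKey instead of getOrPut: a cached result may legitimately be null,
        // and getOrPut would treat a stored null as "absent" and recompute.
        if (!perAggregator.containsKey(params)) {
            misses++
            perAggregator[params] = compute()
        }
        return perAggregator[params]
    }
}
```

With this shape, max with skipNaN = true and max with skipNaN = false land under different parameter keys, and aggregators with extra arguments (std, percentile) just add more entries to the parameter map.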

Jolanrensen · 2025-12-12T15:27:33Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/columns/ValueColumnImpl.kt

+)
+
+internal interface ValueColumnInternal<T> : ValueColumn<T> {
+    val max: WrappedStatistic


I would make this a var and nullable, so we can initialize it to null and don't need to instantiate a class for each statistic when a column is created.
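A small sketch of what that could look like, using hypothetical stand-ins (MaxStatistic, LazyStatColumn) rather than the real WrappedStatistic and ValueColumnImpl: the slot starts as null and the holder object is only allocated the first time the statistic is requested.

```kotlin
// Hypothetical holder for a cached max value.
internal data class MaxStatistic(
    var computed: Boolean = false,
    var value: Double? = null,
)

// Hypothetical column: no statistic object exists until max is first requested.
internal class LazyStatColumn(val values: List<Double>) {
    var max: MaxStatistic? = null

    fun maxOrNull(): Double? {
        // Allocate the holder lazily and remember it in the nullable var.
        val stat = max ?: MaxStatistic().also { max = it }
        if (!stat.computed) {
            stat.value = values.maxOrNull()
            stat.computed = true
        }
        return stat.value
    }
}
```

Columns that never have a statistic requested then carry only a null reference per statistic.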

Jolanrensen · 2025-12-12T15:28:30Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/columns/ValueColumnImpl.kt

 import kotlin.reflect.KType
 import kotlin.reflect.full.withNullability

+public class WrappedStatistic(


this class should not be public, should it?

Also, I think if you make the other one a var, this can be a data class with vars. It's a bit more Kotlin-like :)

Jolanrensen · 2025-12-12T15:33:23Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/columns/ValueColumnImpl.kt

+    public var wasComputedNotSkippingNaN: Boolean = false,
+    public var statisticComputedSkippingNaN: Any? = null,
+    public var statisticComputedNotSkippingNaN: Any? = null,
+)


what about std and percentile that take extra arguments?

CarloMariaProietti added 7 commits November 25, 2025 19:07

First Idea

9e986ce

still working on solution

276c9be

work in progress

bacc395

need to test

4d5d714

one red test

bedea0e

need to clean

c0adc08

cleaning

38b26c3

CarloMariaProietti mentioned this pull request Dec 11, 2025

Lazy statistics for columns #1492

Open

Jolanrensen reviewed Dec 12, 2025

Jolanrensen requested changes Dec 12, 2025
