-
Notifications
You must be signed in to change notification settings - Fork 77
Lazy statistics for ValueColumn #1636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove this from the commit
| import kotlin.reflect.KType | ||
| import kotlin.reflect.full.withNullability | ||
|
|
||
| public class WrappedStatistic( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this class should not be public, should it?
|
|
||
| public fun <T : Comparable<T>> DataColumn<T?>.maxOrNull(skipNaN: Boolean = skipNaNDefault): T? = | ||
| Aggregators.max<T>(skipNaN).aggregateSingleColumn(this) | ||
| if (this is ValueColumnInternal<*>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so fond of this solution, as it requires a lot of refactoring in other functions, plus it does not work when you write df.max { myCol }, as I mentioned in #1492 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead. I'd do this check inside the original aggregateSingleColumn(). Each Aggregator has a name which you could use to query the ValueColumnInternal for the right WrappedStatistic if they are stored in a Map<String, WrappedStatistic>inValueColumnImpl. Though I suppose each Aggregatorwill also need to store any other provided arguments likeskipNaN: Booleanandpercentile: Doublewhen needed... In aMap<String, Any?>` maybe?
That way we could store our "Statistics Cache" in ValueColumnImpl as a
Map<String, Map<Map<String, Any?>, Any?>>so the result cache could look like:
{
"max" : {
{ "skipNaN": true } : 312.4
},
"min" : {},
"std" : {
{ "std": 0.9, "skipNaN": false } : Double.NaN,
{ "std": 0.9, "skipNaN": true } : 12.3
}
}
The challenge may lie in doing this neatly ;P
| ) | ||
|
|
||
| internal interface ValueColumnInternal<T> : ValueColumn<T> { | ||
| val max: WrappedStatistic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make this a var and nullable, so we can initialize it to null and don't need to instantiate a class for each statistic when a column is created.
| import kotlin.reflect.KType | ||
| import kotlin.reflect.full.withNullability | ||
|
|
||
| public class WrappedStatistic( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this class should not be public, should it?
Also, I think, if you make the other a var, this can be a data class with var's. It's a bit more kotlin-like :)
| public var wasComputedNotSkippingNaN: Boolean = false, | ||
| public var statisticComputedSkippingNaN: Any? = null, | ||
| public var statisticComputedNotSkippingNaN: Any? = null, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about std and percentile that take extra arguments?
Fix #1492
The idea is the following:
ValueColumnInternal is an interface for statistic values, which in this way are not exposed as public.
Implementations of ValueColumnInternal contain the actual cache.
It was necessary to have two caches for each stat (for the moment only max) because computing the stat may give different outputs basing on skipNaN boolean parameter.
I implemented the solution by overloading
aggregateSingleColumn, this overload exploits the originalaggregateSingleColumnby wrapping it so that it is possible to exploit caches.For the moment there is only
max, however it would be easy to do the same withmin,sum,meanandmedian.For
percentileandstdit could be done something similar.