-
Notifications
You must be signed in to change notification settings - Fork 77
Fast sorting of columns/table in Kotlin Notebook #1639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
fdbca04 to
cf3a7dc
Compare
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/KotlinNotebookPluginUtils.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/KotlinNotebookPluginUtils.kt
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/KotlinNotebookPluginUtils.kt
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/KotlinNotebookPluginUtils.kt
Show resolved
Hide resolved
cf3a7dc to
7795911
Compare
eaac104 to
c2f96cd
Compare
| /** | ||
| * Sorts a dataframe-like object by multiple columns. | ||
| * If a column type is not comparable, sorting by string representation is applied instead. | ||
| * Sorts DataFrames by their size because looking at the smallest / biggest groups after groupBy is very popular. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and lists
| testImplementation(projects.dataframeCsv) | ||
| } | ||
|
|
||
| benchmark { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to see first benchmark in our project!
| val finalComparator = if (isDesc[0]) comparator.reversed() else comparator | ||
|
|
||
| val permutation = Array(column.size()) { it } | ||
| Arrays.parallelSort(permutation, finalComparator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot is it thread-safe to use parallesort here?
ForkJoinPool under the hood uses the common thread pool and also, thinking am I - should it live inside Kotlin kernel thread (I have no idea about how it's organized in Kotlin Notebook), where exactly the child threads will be created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements significant performance optimizations for sorting DataFrames in Kotlin Notebook environments, achieving a ~10-100x speedup (from 250-500ms to 30-50ms for 1 million rows). The optimization combines three key techniques: (1) removing reflection from the sorting loop, (2) implementing lazy materialization via SortedDataFrameView to avoid materializing entire sorted DataFrames when only a subset is needed for rendering, and (3) using Arrays.parallelSort for single-column sorts.
Key Changes:
- Introduced
SortedDataFrameViewwrapper that defers row materialization until accessed - Optimized single-column sorting path using
Arrays.parallelSortwith index-based comparators - Added support for sorting DataFrame and List columns by their size (row count/length)
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/KotlinNotebookPluginUtils.kt | Core implementation: adds SortedDataFrameView for lazy materialization, refactors sorting logic with optimized single-column path, adds comparators for DataFrame/List types by size |
| core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/KotlinNotebookPluginUtilsTests.kt | New test file covering sorting of lists and DataFrames by size with various edge cases |
| core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/SortingBenchmark.kt | New benchmark suite measuring sorting performance across different column types and sizes |
| core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/DataFrameGet.kt | Simplifies getRows to delegate to get methods for consistency |
| core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/DataFrame.kt | Moves get(IntRange) and get(Iterable<Int>) implementations from getRows to interface default methods |
| core/build.gradle.kts | Adds kotlinx.benchmark plugin and configuration for performance testing |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| fun `sort empty lists`() { | ||
| val lists = listOf(listOf(1, 2), emptyList(), listOf(1), emptyList()) | ||
| val df = dataFrameOf("listColumn" to lists) | ||
|
|
||
| val res = KotlinNotebookPluginUtils.sortByColumns(df, listOf(listOf("listColumn")), desc = listOf(true)) | ||
|
|
||
| res["listColumn"].values() shouldBe listOf(listOf(1, 2), listOf(1), emptyList(), emptyList()) | ||
| } |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test verifies that the result is an empty list twice, but doesn't test preservation of order (stability) for empty lists. Since the test is named sort empty lists, it would be helpful to verify that when sorting descending, the two empty lists maintain their original relative order (stability), similar to the test on line 47-54. Consider modifying the test to use tagged empty lists (e.g., different container objects) to verify stability.
core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/KotlinNotebookPluginUtilsTests.kt
Show resolved
Hide resolved
| // Not sure how to have generic logic that would produce Comparator<Int> and Comparator<DataRow> without overhead | ||
| // For now Comparator<DataRow> is needed for fallback case of sorting multiple columns. Although it's now impossible in UI | ||
| // Please make sure to change both this and createColumnComparator |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on line 119 states "Although it's now impossible in UI", but this doesn't clarify what "this" refers to or why it's impossible. It appears to refer to multi-column sorting being disabled in the UI, but this should be stated more explicitly. Consider clarifying the comment to explain: "Although multi-column sorting is currently disabled in the Kotlin Notebook Plugin UI, this fallback case is retained for API compatibility and potential future re-enablement."
| // Not sure how to have generic logic that would produce Comparator<Int> and Comparator<DataRow> without overhead | |
| // For now Comparator<DataRow> is needed for fallback case of sorting multiple columns. Although it's now impossible in UI | |
| // Please make sure to change both this and createColumnComparator | |
| // Not sure how to have generic logic that would produce Comparator<Int> and Comparator<DataRow> without overhead. | |
| // For now Comparator<DataRow> is needed as a fallback for multi-column sorting. Although multi-column sorting | |
| // is currently disabled in the Kotlin Notebook Plugin UI, this fallback is retained for API compatibility and | |
| // potential future re-enablement. Please make sure to change both this and createColumnComparator. |
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/KotlinNotebookPluginUtils.kt
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/KotlinNotebookPluginUtils.kt
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/KotlinNotebookPluginUtils.kt
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/DataFrameGet.kt
Show resolved
Hide resolved
| fun `sort lists by size descending`() { | ||
| val random = Random(123) | ||
| val lists = List(20) { List(random.nextInt(1, 100)) { it } } + null | ||
| val df = dataFrameOf("listColumn" to lists) | ||
|
|
||
| val res = KotlinNotebookPluginUtils.sortByColumns(df, listOf(listOf("listColumn")), desc = listOf(true)) | ||
|
|
||
| res["listColumn"].values() shouldBe lists.sortedByDescending { it?.size ?: 0 } | ||
| } |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test only verifies sorting lists by size but doesn't test edge cases like: (1) all nulls, (2) mix of empty lists and non-empty lists with nulls, or (3) very large lists. Consider adding test cases for these edge cases to ensure the sorting behavior is well-defined and consistent.
Two options. First, default - Kotlin Kernel lives in its own JVM process. Second is attached mode where Kernel lives inside intellij process and uses common pool from there. But IMO in both cases it's no different from using Stream.parallelStream which is fairly widespread and, to my knowledge, recommended |
I suggest to look at this PR by individual commits to see how code evolved
First iteration is simple and managed to achieve sub 1s speed for 1 m rows, which is acceptable
Later i introduced wrapper that exploits the fact that Kotlin Notebook takes small part of sorted dataframe, so after sorting we can avoid materializing the entire thing
And, since we can only sort 1 column at a time now, added special case of sorting just 1 column. Comparator can be a bit faster. + It was easy to apply Arrays.parallelSort to that, which all combined gave us another magnitude of speedup
So now sorting of 1 m rows is within ~30-50 ms on my PC, depends on how much core you have.
Bonus: it's now possible to sort columns with DataFrame and List values by their size
Intermediate results after removing reflection from sorting loop:
Final results with lazy materialization + improved comparators for optimal bytecode where possible + Arrays.parallelSort:
fixes #1436