Work in progress: DbType2 #1632
Conversation
Commits (titles truncated):
- …ableColumnMetadata` in `generateTypeInformation()`. This contains potential pre- and post-processing logic for any type
- …recursive preprocessing
```kotlin
/** the name of the class of the DuckDB JDBC driver */
override val driverClassName: String = "org.duckdb.DuckDBDriver"

override fun generateTypeInformation(tableColumnMetadata: TableColumnMetadata): AnyTypeInformation =
```
Please consider: ColumnValuesConverter, JdbcColumnConverter, ...
I think it's important to convey that we, most importantly, want to convert values from JDBC classes to something Kotlin-friendly - providing KType / ColumnSchema is just a side effect, right? ValuePreprocessor and ColumnPostprocessor are good in this regard
Actually, no, the main use case of this function is building the ColumnSchema, aka the type information for DataFrame, given some JDBC TableColumnMetadata, so we know what kind of DataFrame will pop out when some SQL is read.
Actually, most values can directly go from JDBC into a ValueColumn, no conversion needed. However, in the cases where conversion needs to be done (like Date -> Kotlin Date, Struct -> DataRow), this conversion can be provided in the TypeInformation class that's returned, either in the form of a value-preprocessor, or, if you need all values to be gathered first, in the form of a column-postprocessor. Still, they are strictly tied to the TableColumnMetadata.
Does that make sense? :) I've tried several names already for this TypeInformation class (DbTypeInformation, DbColumnType...), but none really fit, or they become too large. But I'd like to convey that it does not háve to convert. If you do want to convert, you can create a TypeInformation instance with typeInformationWithPreprocessingFor(...).
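To make that concrete, here is a minimal sketch of the idea, with simplified stand-ins for `TypeInformation`, `typeInformationWithPreprocessingFor`, and `TableColumnMetadata` (the real classes in this PR carry more information, and their signatures may differ):

```kotlin
import java.sql.Date
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// simplified stand-ins; the real classes in this PR have more to them
data class TableColumnMetadata(val name: String, val sqlTypeName: String)

class TypeInformation<T>(
    val kType: KType,
    val preprocessor: ((Any?) -> T?)? = null,
)

typealias AnyTypeInformation = TypeInformation<*>

inline fun <reified T> typeInformationWithPreprocessingFor(
    noinline preprocess: (Any?) -> T?,
): TypeInformation<T> = TypeInformation(typeOf<T>(), preprocess)

// a Db implementation keeps the JDBC type, the conversion, and the
// resulting column type together in one place:
fun generateTypeInformation(column: TableColumnMetadata): AnyTypeInformation =
    when (column.sqlTypeName) {
        // most values can go straight into a ValueColumn, no conversion needed
        "INTEGER" -> TypeInformation<Int>(typeOf<Int>())

        // conversion case: JDBC hands us java.sql.Date, we want java.time.LocalDate
        "DATE" -> typeInformationWithPreprocessingFor<java.time.LocalDate> { value ->
            (value as Date?)?.toLocalDate()
        }

        else -> TypeInformation<Any?>(typeOf<Any?>())
    }
```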
We could also split up providing the type information and actually converting them. However, I liked the idea of providing the type information and converting in one place, because it forces you to write a JDBC type, the conversion, and the schema that's created at once, keeping the logic together.
If you'd ignore what I made thus far, how would you make an API like this where you can define how JDBC types are mapped (and potentially converted) to column types?
Ok, if I'm thinking from scratch: it looks like you're referring to ResultSet (a kind of Row) -> value extraction, which apparently can also be database-specific?
So people might need to specify what class their JDBC driver will return to our "generic value extractor" of some sort. We're not converting anything at this step, just calling rs.getObject, and we need to know what to expect.
If my description is accurate, can we call it JdbcDriverSpec? Can it be a separate entity from "converters"?
ExpectedTypes?
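Something along these lines is what I have in mind; purely a sketch, none of these names exist in the codebase:

```kotlin
import java.sql.ResultSet
import kotlin.reflect.KClass

// hypothetical "JdbcDriverSpec": declares what rs.getObject() is expected to
// return per database type name; it does no converting itself
interface JdbcDriverSpec {
    val driverClassName: String

    /** the class the driver hands back for a given database type name */
    fun expectedClassFor(sqlTypeName: String): KClass<*>
}

object DuckDbDriverSpec : JdbcDriverSpec {
    override val driverClassName = "org.duckdb.DuckDBDriver"

    override fun expectedClassFor(sqlTypeName: String): KClass<*> =
        when (sqlTypeName) {
            "INTEGER" -> Int::class
            "VARCHAR" -> String::class
            "DATE" -> java.sql.Date::class
            "STRUCT" -> java.sql.Struct::class
            else -> Any::class
        }
}

// the generic value extractor stays dumb; it just reads and trusts the spec
fun readValue(rs: ResultSet, columnIndex: Int): Any? = rs.getObject(columnIndex)
```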
Yes, it can definitely be a separate step.
However, the converters will still need access to the original TableColumnMetadata to function properly, and they might do some duplicate logic:
So, we have a couple cases:
- The first case is the easiest: We just want to read a column from JDBC, no conversion necessary. So the Db implementation could simply give a KType representing the type of the values coming from that column, and a `ValueColumn<ThatType>` will be created.
- The second case is a bit more complicated: We want to read a column of values from JDBC, but they need to be converted, like Dates. In this case, the implementation could again give a KType representing the type of the values coming from that column. A preprocessor could take those values and KType and convert them (and return a new KType) before a `DataColumn<NewType>` is created. The preprocessor might need the original `TableColumnMetadata` though, as it might have information that cannot be represented by just the KType.
- The next case is comparable to the previous one, but now we want to post-process the column, like in a column of arrays where we want to infer their common type. So the implementation will give a KType of `java.sql.Array`, it may be preprocessed, a `ValueColumn` is created, and then the postprocessor can do its magic, converting the `ValueColumn` with values to any sort of column it likes. It might need the original `TableColumnMetadata` and `KType`, and return the new column and ColumnSchema.
- The final case is where a structured column needs to be created: We read a `STRUCT` column from JDBC; the first step returns the KType `java.sql.Struct`. ~~The preprocessor can convert each value to a `DataRow<*>` based on the KType and `TableColumnMetadata`, so a `DataColumn<DataRow<*>>`, aka a `ColumnGroup`, can be created. Though, we still need to report the new `ColumnSchema.Group` somewhere so we can do `TableColumnMetadata -> ColumnSchema` without reading actual data. Maybe in the postprocessor?~~ The preprocessor should do nothing, making a `ValueColumn<Struct>`; then the post-processor should turn it into a `ColumnGroup`, returning the resulting `ColumnSchema.Group` as well.
In the final case, we can see the TableColumnMetadata needing to be parsed up to 3 separate times: in the first step, to generate a KType we're not even using, as it's just typeOf<AnyRow>(); in the second step, to convert each value to a correct DataRow; and in the final step, to produce the right ColumnSchema.Group. This can be quite tedious, as TableColumnMetadata can contain recursive types as well... It's also hard to track the logic of a single type across multiple separate functions...
But maybe we can find a hybrid between these two approaches? Allowing logic to be grouped, but functions to be separate?
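For example, a rough sketch of such a hybrid (all names hypothetical): one handler object per database type, so the logic stays grouped while each step remains a separate function:

```kotlin
import kotlin.reflect.KType
import org.jetbrains.kotlinx.dataframe.AnyCol

// stand-in for the real metadata class
data class TableColumnMetadata(val name: String, val sqlTypeName: String)

// hypothetical hybrid: one handler object per database type keeps the JDBC
// type, the conversion, and the schema logic together, while each step stays
// a separate function that can be called on its own
interface ColumnTypeHandler {
    /** step 1: metadata -> KType, enough for TableColumnMetadata -> ColumnSchema without reading data */
    fun kType(column: TableColumnMetadata): KType

    /** step 2 (optional): per-value conversion while reading the ResultSet */
    fun preprocess(value: Any?, column: TableColumnMetadata): Any? = value

    /** step 3 (optional): whole-column pass, e.g. ValueColumn<Struct> -> ColumnGroup */
    fun postprocess(column: AnyCol, metadata: TableColumnMetadata): AnyCol = column
}
```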
> The final case is where a structured column needs to be created: We read a STRUCT column from JDBC; the first step returns the KType `java.sql.Struct`. The preprocessor can convert each value to a `DataRow<*>` based on the KType and TableColumnMetadata, so a `DataColumn<DataRow<*>>`, aka a `ColumnGroup`, can be created. Though, we still need to report the new `ColumnSchema.Group` somewhere so we can do `TableColumnMetadata -> ColumnSchema` without reading actual data. Maybe in the postprocessor?
As a side note: in the recent `Map.toDataFrame` PR I noticed that `map.toDataRow`, which uses type inference, hits a very obvious bottleneck in reflective type inference that is called for each row and each value individually. In that case, column-based creation of the ColumnGroup reduces the time from 17s to 1.5s!
I think the efficient transformation of a ResultSet with Struct values should maybe be done in the ColumnPostprocessor? Like `DataColumn -> ColumnGroup`.
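Roughly like this; a sketch that assumes `DataColumn.createValueColumn`, `DataColumn.createColumnGroup`, and `dataFrameOf` from the library, with `attributeNames` standing in for what would come from TableColumnMetadata:

```kotlin
import java.sql.Struct
import kotlin.reflect.typeOf
import org.jetbrains.kotlinx.dataframe.DataColumn
import org.jetbrains.kotlinx.dataframe.api.dataFrameOf

// column-based sketch: transpose the struct attributes first, then build the
// ColumnGroup, so type inference runs once per column instead of once per value
fun structsToColumnGroup(
    name: String,
    structs: List<Struct?>,
    attributeNames: List<String>, // stand-in: would come from TableColumnMetadata
): DataColumn<*> {
    val columns = attributeNames.mapIndexed { i, attrName ->
        val values = structs.map { it?.attributes?.get(i) }
        DataColumn.createValueColumn(attrName, values, typeOf<Any?>())
    }
    return DataColumn.createColumnGroup(name, dataFrameOf(columns))
}
```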
> However, the converters will still need access to the original TableColumnMetadata to function properly, and they might do some duplicate logic:
Makes sense
:o
Yes, it makes sense to postpone it to the post-processing step indeed! A DataRow<*> is a DataFrame with one row after all, so forming a DF with 1000 rows will create 1000 intermediate DFs with type inference. We'd need #1541 to be able to do this efficiently.
But this just shows it's good to have both pre- and post-processing :) we need both.
When a FrameColumn is created, though, it does make sense to create the DataFrames in the preprocessing step.
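For example (a sketch; `sqlArrayToFrame` is hypothetical, and the real hook would be the value-preprocessor in TypeInformation):

```kotlin
import java.sql.Array as SqlArray
import kotlin.reflect.typeOf
import org.jetbrains.kotlinx.dataframe.AnyFrame
import org.jetbrains.kotlinx.dataframe.DataColumn
import org.jetbrains.kotlinx.dataframe.api.dataFrameOf

// hypothetical preprocessor for the FrameColumn case: each JDBC array becomes
// its own one-column DataFrame, which is exactly what a FrameColumn stores
fun sqlArrayToFrame(value: SqlArray?): AnyFrame? {
    val elements = (value?.array as? Array<*>)?.toList() ?: return null
    val column = DataColumn.createValueColumn("value", elements, typeOf<Any?>())
    return dataFrameOf(listOf(column))
}

// afterwards, the column reader can collect the frames into a FrameColumn:
// DataColumn.createFrameColumn(name, frames)
```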
Fixes #1273
Fixes #1587
Fixes #461
Might fix #462?
Follows up on #1266 and #462
Work in progress, so more information will come when the design is finalized.