Description
We have this documentation on DecimalArray:
```rust
/// These are just the maximal ranges for each scalar type, but it is perfectly legal to store
/// values with precision that does not match this exactly. For example, a valid DecimalArray with
/// precision=39 may store its values in an `i8` if all of the actual values fit into it.
///
/// Similarly, a `DecimalArray` can be built that stores a set of precision=2 values in a
/// `Buffer<i256>`.
```
This presents a problem (and is the cause of many of the recent fuzzer crashes): if we try to patch values or use `fill_null` with a decimal value that is valid according to the logical DType but not representable in the storage type, we simply panic!
This is highly concerning! The physical type has leaked, and it now governs what we are allowed to do instead of the logical type.
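To make the mismatch concrete, here is a minimal standalone sketch (plain Rust, not Vortex code) of how a value can be valid for the logical type yet unrepresentable in the physical buffer:

```rust
fn main() {
    // A decimal with precision=3 logically admits any unscaled value in
    // -999..=999, but an array may physically store its values in an i8
    // buffer, which only covers -128..=127.
    let precision: u32 = 3;
    let logical_max: i64 = 10i64.pow(precision) - 1; // 999
    let storage_max: i64 = i8::MAX as i64; // 127

    // A fill_null/patch value that is valid for the logical DType...
    let fill_value: i64 = 500;
    assert!(fill_value <= logical_max);
    // ...but cannot be written into the i8 storage without overflow.
    assert!(fill_value > storage_max);
}
```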
I see three ways forward:
1. Keep this behavior. I think this is a bad idea, as it is completely different from how `PrimitiveArray` works with `BitPackedArray`, for example.
2. Reallocate the entire array whenever we run into this, so that the new storage type can fit the new values. I also think this is a bad idea in terms of performance.
3. Require that the storage type must always be able to fit any value in the domain of the logical `DecimalDType`.
For option 3, this constraint means we no longer need to check every single value during validation; a single check in the constructor that the storage type fits the DType is enough.
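That constructor-time check could look roughly like the following sketch (function name and signature are assumptions for illustration, not the real Vortex constructor):

```rust
// Returns true iff a signed integer of `storage_bits` bits can represent
// every unscaled value allowed by `precision` decimal digits, so that no
// per-value validation is ever needed afterwards.
fn storage_fits(precision: u32, storage_bits: u32) -> bool {
    // Max unscaled magnitude for this precision is 10^precision - 1.
    let logical_max = match 10i128.checked_pow(precision) {
        Some(p) => p - 1,
        None => return false, // exceeds even i128's range
    };
    let storage_max = if storage_bits >= 128 {
        i128::MAX
    } else {
        (1i128 << (storage_bits - 1)) - 1
    };
    logical_max <= storage_max
}

fn main() {
    // precision=2 values (max 99) fit in an i8 (max 127)...
    assert!(storage_fits(2, 8));
    // ...but precision=3 values (max 999) do not.
    assert!(!storage_fits(3, 8));
}
```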
We are about to do a huge migration to the new Vortex operator world (where everything is lazily computed/executed), which means we don't necessarily need to fix this in the old world: we can just make sure we implement the correct behavior in the new world.
In the new world, we need to make sure that the DecimalVector the array tree executes into holds a DVector that can store every possible value given by the DecimalDType's PrecisionScale.
In practice, this will mean finding the smallest integer storage type for the given PrecisionScale. Note we don't necessarily need to use the minimal storage type, though that will be the most helpful for performance.
The more important thing is that we do not allow executing a DecimalArray into a DecimalVector that cannot fit all possible values according to the PrecisionScale.
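Selecting the smallest storage type could be sketched as follows (the function name is hypothetical, and `256` here stands in for a library-provided `i256` rather than a builtin Rust type):

```rust
// Choose the narrowest signed-integer width whose range covers every
// unscaled value that the given decimal precision permits.
fn smallest_storage_bits(precision: u32) -> u32 {
    for bits in [8u32, 16, 32, 64, 128] {
        // Max unscaled magnitude for this precision is 10^precision - 1.
        let fits = match 10i128.checked_pow(precision) {
            Some(p) => p - 1 <= max_signed(bits),
            None => false, // exceeds i128: nothing in this list fits
        };
        if fits {
            return bits;
        }
    }
    // precision > 38 needs i256 (as used for decimal256).
    256
}

// Largest value representable by a signed integer of the given width.
fn max_signed(bits: u32) -> i128 {
    if bits >= 128 {
        i128::MAX
    } else {
        (1i128 << (bits - 1)) - 1
    }
}

fn main() {
    assert_eq!(smallest_storage_bits(2), 8); // 99 fits in i8
    assert_eq!(smallest_storage_bits(3), 16); // 999 needs i16
    assert_eq!(smallest_storage_bits(18), 64); // ~1e18 fits in i64
    assert_eq!(smallest_storage_bits(39), 256); // beyond i128
}
```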
Also note that we need to handle old files that were written in the old world with this bad logic.
Fuzzer Crashes
- Fuzzing Crash: Sum operation returns null for decimal256 array #5590
- Fuzzing Crash: fill_null on DecimalArray fails when fill value has null decimal_value #5744
- Fuzzing Crash: DecimalArray fill_null with null fill value #5806
- Fuzzing Crash: DecimalArray fill_null panics on null scalar value #5807
- Fuzzing Crash: Sum operation returns null for valid DecimalArray ChunkedArray #5810
- Fuzzing Crash: Decimal Sum Overflow in ChunkedArray #5811