Overwrite and Overwrite-partition behavior definition for write parquet/csv/json #5739
stayrascal started this conversation in General
Hi, I'm trying to define our expected behavior for overwrite and overwrite-partition when writing parquet/csv/json.
Overwrite mode
I think the overwrite behavior itself is clear: we remove all of the old files. But should we consider multiple Daft jobs overwriting the same path concurrently? I don't think we support this case today; no error is thrown, but there is a potential dirty-data or lost-data problem.
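To make the hazard concrete, here is a minimal sketch, assuming write_parquet accepts a write_mode argument along the lines discussed in this thread (the bucket path is illustrative):

```python
import daft

df = daft.from_pydict({"part": ["a1", "a2"], "value": [1, 2]})

# Overwrite mode: all pre-existing files under the target path are removed and
# replaced by the files produced by this write.
df.write_parquet("s3://my-bucket/tbl", write_mode="overwrite")

# If a second job runs the same call at the same time, nothing coordinates the
# two delete-and-write sequences, so each job can delete the other's freshly
# written files, leaving mixed or missing data behind.
```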
IMO, since neither Daft nor object stores provide transactional capabilities, it's hard to support concurrent writes to the same path. We would have to depend on a third-party tool, e.g. DynamoDB, to coordinate the transaction. It could also hurt write performance, e.g. writing into a staging dir to isolate the result and then renaming the staging dir to the final dir; object stores have to implement rename as copy + delete, with no atomicity guarantee.
So perhaps we should suggest that users adopt a lake table format to handle concurrent-write requirements, e.g. Iceberg, Delta Lake, Hudi, or Lance?
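To illustrate the suggestion, a hedged sketch assuming Daft's write_deltalake API with a mode argument (the table URI is illustrative):

```python
import daft

df = daft.from_pydict({"part": ["a1", "a2"], "value": [1, 2]})

# Table formats commit every write through a transaction log, so two
# concurrent overwrites either serialize or fail cleanly instead of silently
# mixing or losing files.
df.write_deltalake("s3://my-bucket/delta_tbl", mode="overwrite")
```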
Overwrite partition mode
The current overwrite-partition behavior only overwrites the partitions that appear in the new data. For example, given three existing partition folders part=a1, part=a2, and part=a3, writing a dataframe that only contains part=a1 and part=a2 rewrites those two folders and leaves part=a3 untouched.
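A sketch of that example, assuming write_parquet accepts partition_cols plus a write_mode value named along the lines of "overwrite-partitions" (exact names and paths are illustrative):

```python
import daft

# The first write creates the folders part=a1, part=a2, and part=a3.
full = daft.from_pydict({"part": ["a1", "a2", "a3"], "value": [1, 2, 3]})
full.write_parquet("/tmp/tbl", partition_cols=["part"], write_mode="overwrite")

# The second write only carries data for part=a1 and part=a2, so in
# overwrite-partition mode only those two folders are rewritten; part=a3 is
# left as-is.
delta = daft.from_pydict({"part": ["a1", "a2"], "value": [10, 20]})
delta.write_parquet(
    "/tmp/tbl", partition_cols=["part"], write_mode="overwrite-partitions"
)
```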
I'm not sure whether all of the above behaviors are intended; they make sense to me, but we should document these semantics somewhere.
BTW, Spark SQL provides a parameter, `spark.sql.sources.partitionOverwriteMode`, to control whether the other partitions get deleted. Our overwrite-partition mode corresponds to its 'dynamic' mode and our overwrite mode to its 'static' mode, right?
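For reference, the Spark knob looks like this (standard PySpark, not Daft):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "static" (the default): overwriting a partitioned path deletes all existing
# partitions first. "dynamic": only the partitions present in the incoming
# data are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([("a1", 1), ("a2", 2)], ["part", "value"])
df.write.mode("overwrite").partitionBy("part").parquet("/tmp/spark_tbl")
```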
Overwrite with an empty dataframe
Currently, if the written dataframe is empty, we still generate an empty parquet/csv file, regardless of whether the write mode is overwrite, overwrite-partition, or append.
May I ask why we create an empty file? Is it to keep the read behavior compatible, because if no files exist, daft.read_xxx fails with a no-data-files-found error instead of returning an empty dataframe?
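For context, a sketch of the read-side difference this question is pointing at (daft.read_parquet; the paths are illustrative assumptions):

```python
import daft

# A path that resolves to a single empty parquet file: the read succeeds and
# yields an empty dataframe, because the file still carries a schema.
empty_df = daft.read_parquet("/tmp/tbl/empty.parquet")
assert empty_df.count_rows() == 0

# A path that matches no data files at all: the read fails because there is no
# file to take a schema from, instead of returning an empty dataframe.
try:
    daft.read_parquet("/tmp/tbl_with_no_files/*.parquet")
except Exception as exc:
    print("read failed:", exc)
```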
Writing an empty file also leads to some weird and inconsistent behavior.
So would there be any negative impact if we stopped generating the empty file?