Overwrite and Overwrite-partition behavior definition for write parquet/csv/json #5739
stayrascal started this conversation in General
Hi, I'm trying to define our expected behavior for overwrite and overwrite-partition when writing parquet/csv/json.
Overwrite mode
I think the overwrite behavior itself is clear: we remove all of the old files. But should we consider multiple Daft jobs overwriting the same path concurrently? I don't think we support this case today; no error is thrown, but there is a potential dirty-data or lost-data problem.
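To make the hazard concrete, here is a minimal sketch, assuming write_parquet accepts a write_mode argument along the lines discussed in this thread (the bucket path is illustrative):

```python
import daft

df = daft.from_pydict({"part": ["a1", "a2"], "value": [1, 2]})

# Overwrite mode: all pre-existing files under the target path are removed and
# replaced by the files produced by this write.
df.write_parquet("s3://my-bucket/tbl", write_mode="overwrite")

# If a second job runs the same call at the same time, nothing coordinates the
# two delete-and-write sequences, so each job can delete the other's freshly
# written files, leaving mixed or missing data behind.
```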
IMO, since neither Daft nor object stores provide transactional capabilities, it's hard to support concurrent writes to the same path. We would have to depend on a third-party tool, e.g. DynamoDB, to coordinate the transaction. It could also hurt write performance, e.g. writing into a staging dir to isolate the result and then renaming the staging dir to the final dir; object stores have to implement rename as copy + delete, with no atomicity guarantee.
So perhaps we should suggest that users adopt a lake table format to handle concurrent-write requirements, e.g. Iceberg, Delta Lake, Hudi, or Lance?
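To illustrate the suggestion, a hedged sketch assuming Daft's write_deltalake API with a mode argument (the table URI is illustrative):

```python
import daft

df = daft.from_pydict({"part": ["a1", "a2"], "value": [1, 2]})

# Table formats commit every write through a transaction log, so two
# concurrent overwrites either serialize or fail cleanly instead of silently
# mixing or losing files.
df.write_deltalake("s3://my-bucket/delta_tbl", mode="overwrite")
```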
Overwrite partition mode
The current overwrite-partition behavior only overwrites the partitions that appear in the new data. For example, given three existing partition folders part=a1, part=a2, and part=a3, writing a dataframe that only contains part=a1 and part=a2 rewrites those two folders and leaves part=a3 untouched.
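A sketch of that example, assuming write_parquet accepts partition_cols plus a write_mode value named along the lines of "overwrite-partitions" (exact names and paths are illustrative):

```python
import daft

# The first write creates the folders part=a1, part=a2, and part=a3.
full = daft.from_pydict({"part": ["a1", "a2", "a3"], "value": [1, 2, 3]})
full.write_parquet("/tmp/tbl", partition_cols=["part"], write_mode="overwrite")

# The second write only carries data for part=a1 and part=a2, so in
# overwrite-partition mode only those two folders are rewritten; part=a3 is
# left as-is.
delta = daft.from_pydict({"part": ["a1", "a2"], "value": [10, 20]})
delta.write_parquet(
    "/tmp/tbl", partition_cols=["part"], write_mode="overwrite-partitions"
)
```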
I'm not sure whether all of the above behaviors are intended; they make sense to me, but we should document these semantics somewhere.
BTW, Spark SQL provides a parameter, `spark.sql.sources.partitionOverwriteMode`, to control whether the other partitions get deleted. Our overwrite-partition mode corresponds to its 'dynamic' mode and our overwrite mode to its 'static' mode, right?
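For reference, the Spark knob looks like this (standard PySpark, not Daft):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "static" (the default): overwriting a partitioned path deletes all existing
# partitions first. "dynamic": only the partitions present in the incoming
# data are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([("a1", 1), ("a2", 2)], ["part", "value"])
df.write.mode("overwrite").partitionBy("part").parquet("/tmp/spark_tbl")
```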
Overwrite with an empty dataframe
Currently, if the written dataframe is empty, we still generate an empty parquet/csv file, regardless of whether the write mode is overwrite, overwrite-partition, or append.
May I ask why we create an empty file? Is it to keep the read behavior compatible, because if no files exist, daft.read_xxx fails with a no-data-files-found error instead of returning an empty dataframe?
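For context, a sketch of the read-side difference this question is pointing at (daft.read_parquet; the paths are illustrative assumptions):

```python
import daft

# A path that resolves to a single empty parquet file: the read succeeds and
# yields an empty dataframe, because the file still carries a schema.
empty_df = daft.read_parquet("/tmp/tbl/empty.parquet")
assert empty_df.count_rows() == 0

# A path that matches no data files at all: the read fails because there is no
# file to take a schema from, instead of returning an empty dataframe.
try:
    daft.read_parquet("/tmp/tbl_with_no_files/*.parquet")
except Exception as exc:
    print("read failed:", exc)
```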
Writing an empty file also leads to some weird and inconsistent behavior.
So would there be any negative impact if we stopped generating the empty file?