-
Notifications
You must be signed in to change notification settings - Fork 354
Open
Labels
Description
Summary
We need configurable retry policies and error tolerance in Daft to improve job reliability and reduce costs: (1) configurable retry exceptions and counts for frequent worker preemptions, (2) tolerance for failed rows to avoid expensive full re-runs.
Use Case
We are running Daft workloads on elastic Kubernetes resources where:
- Worker pods are frequently preempted - The default 4 retries are exhausted quickly, causing jobs to fail unnecessarily
- Unexpected exceptions tolerance - Some input files may be corrupted/invalid, or other unexpected exceptions may occur. Since we allow a small amount of data loss and the cost of completely re-running the job is too high (e.g., large GPU resources), these failed rows should be skipped rather than causing the entire pipeline to fail or triggering endless retries
Proposed Solution
Add configuration options to @daft.cls and @daft.method decorators to support:
1. Configurable Retry Exceptions
Allow users to specify which exceptions should trigger retries:
@daft.cls(
gpus=1,
max_concurrency=1,
use_process=True,
# New: Specify which exceptions should trigger retries
retry_exceptions=[OSError,CustomRetryableException],
)2. Configurable Retry Count
Allow users to set custom retry counts (currently hardcoded to 4):
@daft.cls(
gpus=1,
max_concurrency=1,
use_process=True,
# New: Set maximum retry attempts
max_retries=100, # or -1 for infinite retries
retry_exceptions=[OSError,CustomRetryableException],
)3. Error Tolerance (Skip Failed Data)
Allow users to specify a maximum number of failed rows that can be skipped:
@daft.cls(
gpus=1,
max_concurrency=1,
use_process=True,
max_retries=100,
retry_exceptions=[OSError, CustomRetryableException],
# New: Maximum number of failed rows that can be tolerated (skipped)
max_tolerated_failed_rows=10, # Skip up to 10 failed rows before failing the entire job
)Describe alternatives you've considered
No response
Additional Context
No response
Would you like to implement a fix?
No