
Configurable Retry Policy and Error Tolerance for Daft UDFs #5726

@dragongu

Description

Summary

We need configurable retry policies and error tolerance in Daft to improve job reliability and reduce cost: (1) configurable retry exceptions and retry counts to survive frequent worker preemptions, and (2) tolerance for a bounded number of failed rows to avoid expensive full re-runs.

Use Case

We are running Daft workloads on elastic Kubernetes resources where:

  1. Worker pods are frequently preempted. The default of 4 retries is exhausted quickly, causing jobs to fail unnecessarily.
  2. Unexpected exceptions should be tolerated. Some input files may be corrupted or invalid, and other unexpected exceptions can occur. Because we can accept a small amount of data loss, and completely re-running the job is too expensive (e.g., on large GPU resources), these failed rows should be skipped rather than failing the entire pipeline or triggering endless retries.

Proposed Solution

Add configuration options to the @daft.cls and @daft.method decorators to support:

1. Configurable Retry Exceptions

Allow users to specify which exceptions should trigger retries:

@daft.cls(
    gpus=1,
    max_concurrency=1,
    use_process=True,
    # New: Specify which exceptions should trigger retries
    retry_exceptions=[OSError, CustomRetryableException],
)
class MyUDF: ...
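
To show how this would fit together end to end, here is a sketch of a UDF raising a user-defined retryable error. Only the proposed retry_exceptions parameter comes from this proposal; the exception class, the fetch helper, the method body, and the exact @daft.method signature are placeholders:

import daft

class CustomRetryableException(Exception):
    """Transient failure that is safe to retry (user-defined placeholder)."""

@daft.cls(
    gpus=1,
    max_concurrency=1,
    use_process=True,
    retry_exceptions=[OSError, CustomRetryableException],  # proposed parameter
)
class Embedder:
    @daft.method
    def embed(self, path: str):
        resp = fetch(path)       # fetch() stands in for the user's I/O call
        if resp.status == 503:   # transient service error: worth retrying
            raise CustomRetryableException(path)
        return resp.body

The scheduler would re-run the task only when one of the listed exception types escapes the method; any other exception fails the task as it does today.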

2. Configurable Retry Count

Allow users to set custom retry counts (currently hardcoded to 4):

@daft.cls(
    gpus=1,
    max_concurrency=1,
    use_process=True,
    # New: Set maximum retry attempts
    max_retries=100,  # or -1 for infinite retries
    retry_exceptions=[OSError, CustomRetryableException],
)
class MyUDF: ...
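
To pin down the intended semantics, here is a minimal pure-Python sketch (not Daft code; the run_with_retries helper and its backoff policy are made up for illustration) of how max_retries and retry_exceptions would interact:

import time

def run_with_retries(task, retry_exceptions, max_retries):
    """Retry task only on the listed exception types, up to
    max_retries attempts; max_retries=-1 means retry forever."""
    attempt = 0
    while True:
        try:
            return task()
        except tuple(retry_exceptions):
            attempt += 1
            if max_retries != -1 and attempt > max_retries:
                raise  # retry budget exhausted: surface the original error
            time.sleep(min(2 ** attempt, 60))  # capped exponential backoff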

3. Error Tolerance (Skip Failed Data)

Allow users to specify a maximum number of failed rows that can be skipped:

@daft.cls(
    gpus=1,
    max_concurrency=1,
    use_process=True,
    max_retries=100,
    retry_exceptions=[OSError, CustomRetryableException],
    # New: Maximum number of failed rows that can be tolerated (skipped)
    max_tolerated_failed_rows=10,  # skip up to 10 failed rows before failing the entire job
)
class MyUDF: ...
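
For comparison, the closest workaround available today is to catch errors inside the UDF body and return None for failed rows, then filter them out downstream. Unlike the proposed max_tolerated_failed_rows, this places no bound on total data loss and hides how many rows failed. A rough sketch, where process_one and the caught exception types are placeholders:

import daft

@daft.cls(gpus=1, max_concurrency=1, use_process=True)
class TolerantUDF:
    @daft.method
    def run(self, path: str):
        try:
            return process_one(path)  # process_one stands in for the real work
        except (OSError, ValueError):
            return None  # mark the row as failed instead of raising

# Downstream, failed rows can then be dropped, e.g.:
#   df = df.where(daft.col("result").not_null())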

Describe alternatives you've considered

No response

Additional Context

No response

Would you like to implement a fix?

No
