@drazisil-codecov
Contributor

Remove timeout logic from shell script that was killing worker processes
before Celery could properly reject tasks during warm shutdown. Now the
script simply forwards SIGTERM to the worker and waits for Celery to
complete its graceful shutdown sequence, allowing tasks with acks_late=True
to be properly rejected and redelivered.
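
The forward-and-wait behaviour described above can be sketched roughly as follows. This is an illustration only, not the contents of the actual script: the helper name `forward_term_and_wait` and the `child_pid` variable are hypothetical, and the sketch assumes a POSIX-style shell in which `wait` returns early when a trapped signal arrives.

```shell
#!/bin/sh
# Sketch only: `forward_term_and_wait` is a hypothetical helper, not the
# function used in the real worker script.

forward_term_and_wait() {
    # Run the command in the background so this shell stays free to handle signals.
    "$@" &
    child_pid=$!

    # On SIGTERM, pass the signal to the child and then block until the child
    # has finished its own shutdown. No timeout and no SIGKILL are involved:
    # the child decides when it is done.
    trap 'kill -TERM "$child_pid" 2>/dev/null; wait "$child_pid" 2>/dev/null || true' TERM

    # This wait returns early if the trap fires; by the time execution resumes
    # here, the handler above has already waited for the child to exit.
    wait "$child_pid" 2>/dev/null || true
    trap - TERM
}
```

Called as `forward_term_and_wait some-long-running-command`, the wrapper stays alive for as long as the child needs, so any upper bound on shutdown time has to come from outside the script.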

…y handle graceful shutdown

@linear

linear bot commented Dec 3, 2025

Comment on lines +59 to 61
# Wait for the worker to exit (Celery will handle graceful shutdown)
wait "$worker_pid" 2>/dev/null || true
fi

Bug: With the timeout removed from worker.sh, Celery's 60s graceful shutdown can outlast K8s's default 30s terminationGracePeriodSeconds, so shutdowns may be interrupted by SIGKILL.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

The removal of the timeout in apps/worker/worker.sh causes Celery's graceful shutdown to be interrupted. Celery's worker_soft_shutdown_timeout is 60 seconds, but Kubernetes' default terminationGracePeriodSeconds is 30 seconds. Without the shell script's timeout, the worker waits indefinitely, but K8s will send SIGKILL after 30 seconds, interrupting Celery's 60-second shutdown. This can lead to tasks with acks_late=True not being properly rejected or redelivered, potentially causing task loss or duplicate execution.

💡 Suggested Fix

Either explicitly configure K8s terminationGracePeriodSeconds > worker_soft_shutdown_timeout, reintroduce a shell timeout < K8s grace period, or reduce worker_soft_shutdown_timeout to match K8s default.
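
The timing constraint behind this suggestion can be written down as a small check. The sketch below is only an illustration: `shutdown_fits_grace_period` is a hypothetical helper, and the two numbers are the 60-second shutdown timeout and 30-second default grace period quoted in the analysis above, not values read from any real configuration.

```shell
#!/bin/sh
# Sketch only: `shutdown_fits_grace_period` is a hypothetical helper.
# Arguments are whole seconds:
#   $1  time the worker may spend on graceful shutdown
#   $2  time the orchestrator waits after SIGTERM before sending SIGKILL

shutdown_fits_grace_period() {
    shutdown_seconds=$1
    grace_seconds=$2
    # Graceful shutdown can only complete if it is strictly shorter than
    # the grace period.
    [ "$shutdown_seconds" -lt "$grace_seconds" ]
}

# Values quoted in the review comment: 60 s shutdown, 30 s grace period.
if shutdown_fits_grace_period 60 30; then
    echo "ok: shutdown fits inside the grace period"
else
    echo "risk: SIGKILL may arrive before shutdown completes"
fi
```

With the quoted values the check fails, which is the situation the comment warns about; whether it applies depends on the grace period actually configured for the deployment.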

🤖 Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: apps/worker/worker.sh#L59-L61



Contributor Author

We currently have this set to 2500

@sentry

sentry bot commented Dec 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.87%. Comparing base (240de92) to head (b9ced4b).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #593   +/-   ##
=======================================
  Coverage   93.87%   93.87%           
=======================================
  Files        1284     1284           
  Lines       46528    46528           
  Branches     1522     1522           
=======================================
  Hits        43676    43676           
  Misses       2542     2542           
  Partials      310      310           
Flag Coverage Δ
workerintegration 58.60% <ø> (ø)
workerunit 91.22% <ø> (ø)
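
As a sanity check on the table above, the project percentage can be reproduced from the hit, miss, and partial counts. This sketch assumes coverage is computed as hits divided by the sum of hits, misses, and partials; `coverage_percent` is a hypothetical helper, not part of any Codecov tooling.

```shell
#!/bin/sh
# Sketch only: `coverage_percent` is a hypothetical helper.
# Assumes coverage = hits / (hits + misses + partials).

coverage_percent() {
    hits=$1
    misses=$2
    partials=$3
    awk -v h="$hits" -v m="$misses" -v p="$partials" \
        'BEGIN { printf "%.2f\n", 100 * h / (h + m + p) }'
}

# Counts from the coverage diff: 43676 hits, 2542 misses, 310 partials.
coverage_percent 43676 2542 310   # prints 93.87
```

The three counts also add up to the 46528 lines listed in the diff, which is consistent with that assumption.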


@codecov-notifications

codecov-notifications bot commented Dec 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


@codecov-eu

codecov-eu bot commented Dec 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


@drazisil-codecov drazisil-codecov added this pull request to the merge queue Dec 3, 2025
Merged via the queue into main with commit dc83cf0 Dec 3, 2025
40 of 41 checks passed
@drazisil-codecov drazisil-codecov deleted the CCMRG-1910 branch December 3, 2025 17:52