-
Notifications
You must be signed in to change notification settings - Fork 10
CCMRG-1910: Simplify worker shutdown to forward signals and let Celery handle graceful shutdown #593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…y handle graceful shutdown Remove timeout logic from shell script that was killing worker processes before Celery could properly reject tasks during warm shutdown. Now the script simply forwards SIGTERM to the worker and waits for Celery to complete its graceful shutdown sequence, allowing tasks with acks_late=True to be properly rejected and redelivered.
| # Wait for the worker to exit (Celery will handle graceful shutdown) | ||
| wait "$worker_pid" 2>/dev/null || true | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Removal of timeout in worker.sh causes Celery's 60s graceful shutdown to exceed K8s's default 30s terminationGracePeriodSeconds, leading to interrupted shutdowns.
Severity: CRITICAL | Confidence: High
🔍 Detailed Analysis
The removal of the timeout in apps/worker/worker.sh causes Celery's graceful shutdown to be interrupted. Celery's worker_soft_shutdown_timeout is 60 seconds, but Kubernetes' default terminationGracePeriodSeconds is 30 seconds. Without the shell script's timeout, the worker waits indefinitely, but K8s will send SIGKILL after 30 seconds, interrupting Celery's 60-second shutdown. This can lead to tasks with acks_late=True not being properly rejected or redelivered, potentially causing task loss or duplicate execution.
💡 Suggested Fix
Either explicitly configure K8s terminationGracePeriodSeconds > worker_soft_shutdown_timeout, reintroduce a shell timeout < K8s grace period, or reduce worker_soft_shutdown_timeout to match K8s default.
🤖 Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: apps/worker/worker.sh#L59-L61
Potential issue: The removal of the timeout in `apps/worker/worker.sh` causes Celery's
graceful shutdown to be interrupted. Celery's `worker_soft_shutdown_timeout` is 60
seconds, but Kubernetes' default `terminationGracePeriodSeconds` is 30 seconds. Without
the shell script's timeout, the worker waits indefinitely, but K8s will send SIGKILL
after 30 seconds, interrupting Celery's 60-second shutdown. This can lead to tasks with
`acks_late=True` not being properly rejected or redelivered, potentially causing task
loss or duplicate execution.
Did we get this right? 👍 / 👎 to inform future reviews.
Reference ID: 5225578
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently have this set to 2500
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #593 +/- ##
=======================================
Coverage 93.87% 93.87%
=======================================
Files 1284 1284
Lines 46528 46528
Branches 1522 1522
=======================================
Hits 43676 43676
Misses 2542 2542
Partials 310 310
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Remove timeout logic from shell script that was killing worker processes
before Celery could properly reject tasks during warm shutdown. Now the
script simply forwards SIGTERM to the worker and waits for Celery to
complete its graceful shutdown sequence, allowing tasks with acks_late=True
to be properly rejected and redelivered.