
[BUG] SidecarSet controller stuck in infinite "update in flight" loop when Pod is deleted during update #2253

@liochou0-ui

Description

What happened:
I observed that the SidecarSet controller sometimes gets stuck in an infinite loop and stops updating any Pods. This happens when a Pod matched by the SidecarSet is deleted (e.g., by the workload controller or manually) during the rolling update process.

The controller logs repeatedly print the following message indefinitely:
sidecarset <name> matched pods has some update in flight: [<deleted-pod-name>], will sync later

Even after waiting for a long time (more than 10 minutes), the state does not recover. I have to restart the kruise-controller-manager to clear this state.

What you expected to happen:
The controller should handle the Pod deletion gracefully. If a Pod is deleted, the "update expectation" for that Pod should either be cleared immediately or time out after a reasonable duration (e.g., 5 minutes), allowing the SidecarSet to continue reconciling other healthy Pods.
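
To make the expected behavior concrete, below is a minimal Go sketch of an expectation store with a timeout. This is not the actual Kruise implementation; all names (`updateExpectations`, `Expect`, `Observe`, `Satisfied`) and the 5-minute default are hypothetical and only illustrate the semantics described above.

```go
// A minimal sketch of a timeout-aware "update expectation" store, assuming a
// hypothetical design; this is NOT the actual Kruise implementation.
package main

import (
	"fmt"
	"sync"
	"time"
)

// updateExpectations remembers which Pods still have a sidecar update in flight.
type updateExpectations struct {
	mu       sync.Mutex
	inFlight map[string]time.Time // pod key -> time the update was issued
	timeout  time.Duration
}

func newUpdateExpectations(timeout time.Duration) *updateExpectations {
	return &updateExpectations{inFlight: map[string]time.Time{}, timeout: timeout}
}

// Expect records that an update for the Pod has been issued.
func (e *updateExpectations) Expect(podKey string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.inFlight[podKey] = time.Now()
}

// Observe clears the expectation, e.g. when the Pod is updated or deleted.
func (e *updateExpectations) Observe(podKey string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	delete(e.inFlight, podKey)
}

// Satisfied reports whether reconciliation may proceed: either nothing is in
// flight, or every remaining expectation has exceeded the timeout and is
// dropped, so the controller never waits forever on a deleted Pod.
func (e *updateExpectations) Satisfied() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	for key, since := range e.inFlight {
		if time.Since(since) < e.timeout {
			return false
		}
		delete(e.inFlight, key) // expired: stop waiting on this Pod
	}
	return true
}

func main() {
	exp := newUpdateExpectations(5 * time.Minute)
	exp.Expect("default/pod-a")
	fmt.Println(exp.Satisfied()) // false: pod-a is still in flight
	exp.Observe("default/pod-a") // e.g. the Pod's delete event was handled
	fmt.Println(exp.Satisfied()) // true: reconciliation can continue
}
```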

How to reproduce it (as minimally and precisely as possible):

  1. Create a SidecarSet and a Workload (e.g., CloneSet) with multiple Pods.
  2. Trigger a rolling update for the SidecarSet (e.g., update the sidecar image).
  3. While the SidecarSet is updating, continuously delete some of the Pods that are being updated (simulating a conflict or aggressive scale-down); a client-go sketch for this step follows the list.
  4. Observe the SidecarSet controller logs. You may see it get stuck waiting for a Pod that no longer exists.
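
For step 3, a small client-go program can perform the continuous deletion. This is only a repro sketch: the namespace ("default"), label selector ("app=sample"), and kubeconfig path are assumptions that must match your workload, and it assumes a recent client-go where List/Delete take a context.

```go
// repro_delete_pods.go: repeatedly delete Pods matched by the SidecarSet while
// its rolling update is in progress, to surface the stuck expectation.
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 20; i++ {
		// Assumed label selector for the CloneSet's Pods; adjust as needed.
		pods, err := cs.CoreV1().Pods("default").List(context.TODO(),
			metav1.ListOptions{LabelSelector: "app=sample"})
		if err != nil {
			log.Fatal(err)
		}
		if len(pods.Items) == 0 {
			break
		}
		// Delete the first matched Pod to race with the in-flight sidecar update.
		name := pods.Items[0].Name
		if err := cs.CoreV1().Pods("default").Delete(context.TODO(), name,
			metav1.DeleteOptions{}); err != nil {
			log.Printf("delete %s: %v", name, err)
		} else {
			log.Printf("deleted %s", name)
		}
		time.Sleep(2 * time.Second)
	}
}
```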

Anything else we need to know?:
This looks like a race condition: the update expectation for the Pod is registered, but the Pod deletion event either fails to clear that expectation or is processed before the expectation is recorded. Since Kruise v1.3 appears to have no timeout mechanism for UpdateExpectations in sidecarset_processor.go, the controller waits forever.
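
For reference, one hedged way to close this race would be to clear the expectation from the Pod informer's delete handler, including the tombstone case. The sketch below reuses the hypothetical `updateExpectations` store from the earlier sketch and is not taken from the Kruise codebase.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// registerPodDeleteHandler wires a Pod informer to the hypothetical
// expectation store sketched earlier, so that a Pod deletion always clears
// its in-flight update expectation instead of leaving it dangling.
func registerPodDeleteHandler(informer cache.SharedIndexInformer, exp *updateExpectations) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				// Deletions may arrive as tombstones when the final state was missed.
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				if pod, ok = tombstone.Obj.(*corev1.Pod); !ok {
					return
				}
			}
			// Drop the expectation so the SidecarSet reconciler does not keep
			// waiting for a Pod that no longer exists.
			exp.Observe(pod.Namespace + "/" + pod.Name)
		},
	})
}
```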

Questions:
  1. Has this issue been fixed in newer versions (v1.4/v1.5/v1.6)?
  2. If not, is there any plan to address this race condition or to introduce a timeout mechanism for expectations?
  3. If this behavior is by design (to ensure strict consistency), what is the recommended way to clear such a stuck state without restarting the controller?
Thanks

Environment:

  • Kruise version: v1.3
  • Kubernetes version (use kubectl version): v1.17
  • Install details (e.g. helm install args): default helm installation

Labels: kind/bug (Something isn't working)
