What happened:
I observed that the SidecarSet controller sometimes gets stuck in an infinite loop and stops updating any Pods. This happens when a Pod matched by the SidecarSet is deleted (e.g., by the workload controller or manually) during the rolling update process.
The controller repeatedly logs the following message and never makes further progress:
sidecarset <name> matched pods has some update in flight: [<deleted-pod-name>], will sync later
Even after waiting more than 10 minutes, the state does not recover; I have to restart the kruise-controller-manager to clear it.
What you expected to happen:
The controller should handle the Pod deletion gracefully. If a Pod is deleted, the "update expectation" for that Pod should either be cleared immediately or time out after a reasonable duration (e.g., 5 minutes), allowing the controller to continue reconciling the SidecarSet's other healthy Pods.
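To illustrate the timeout idea, here is a minimal sketch of a TTL on in-flight update expectations. This is not the actual Kruise code; the expectationStore/Satisfied names, the map layout, and the 5-minute TTL are assumptions made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// updateExpectation records when an in-flight sidecar update was requested
// for a Pod. All names in this sketch are hypothetical.
type updateExpectation struct {
	recordedAt time.Time
}

type expectationStore struct {
	mu          sync.Mutex
	inFlight    map[string]updateExpectation // keyed by Pod name
	expireAfter time.Duration                // e.g. 5 minutes, as suggested above
}

// Satisfied reports whether all in-flight expectations have either been
// observed (and removed elsewhere) or are older than the TTL; expired entries
// are dropped so reconciliation can proceed instead of re-queuing forever.
func (s *expectationStore) Satisfied(now time.Time) (bool, []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	var pending []string
	for name, exp := range s.inFlight {
		if now.Sub(exp.recordedAt) > s.expireAfter {
			delete(s.inFlight, name) // the Pod is likely gone; stop waiting for it
			continue
		}
		pending = append(pending, name)
	}
	return len(pending) == 0, pending
}

func main() {
	store := &expectationStore{
		inFlight: map[string]updateExpectation{
			"demo-pod-0": {recordedAt: time.Now().Add(-10 * time.Minute)},
		},
		expireAfter: 5 * time.Minute,
	}
	ok, pending := store.Satisfied(time.Now())
	fmt.Println(ok, pending) // true []: the stale entry expired instead of blocking forever
}
```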
How to reproduce it (as minimally and precisely as possible):
- Create a SidecarSet and a Workload (e.g., CloneSet) with multiple Pods.
- Trigger a rolling update for the SidecarSet (e.g., update the sidecar image).
- While the SidecarSet is updating, continuously delete some of the Pods that are being updated, simulating a conflict or an aggressive scale-down (a rough deletion-loop helper is sketched after this list).
- Observe the SidecarSet controller logs. You may see it get stuck waiting for a Pod that no longer exists.
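For reference, this is roughly the deletion loop I used while reproducing. It is only a sketch built on a recent client-go; the default namespace, the app=demo label selector, and the KUBECONFIG environment variable are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig pointed to by $KUBECONFIG (assumption for this sketch).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	for i := 0; i < 20; i++ {
		// List Pods matched by the SidecarSet; namespace and selector are assumptions.
		pods, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{LabelSelector: "app=demo"})
		if err != nil {
			panic(err)
		}
		if len(pods.Items) > 0 {
			// Delete one matched Pod while the sidecar rolling update is in flight.
			name := pods.Items[0].Name
			if err := client.CoreV1().Pods("default").Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
				fmt.Println("delete failed:", err)
			} else {
				fmt.Println("deleted", name)
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```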
Anything else we need to know?:
It looks like a race condition: the update expectation for the Pod is registered, but the Pod deletion event is processed in a way that fails to clear it (or the expectation is registered after the deletion has already been processed). Since Kruise v1.3 does not seem to have a timeout mechanism for UpdateExpectations in sidecarset_processor.go, the controller waits forever.
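To make the "cleared immediately" alternative concrete, here is a minimal sketch of clearing the expectation from the Pod delete event path. This is not the actual sidecarset_processor.go code; the expectations/observePodDeleted names and the map layout are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// expectations tracks, per SidecarSet, the Pods whose sidecar update is still
// expected to be observed. All names in this sketch are hypothetical.
type expectations struct {
	mu       sync.Mutex
	inFlight map[string]map[string]struct{} // SidecarSet name -> set of Pod names
}

// observePodDeleted is what I would expect the Pod delete event handler to
// call: drop the expectation even though the update was never observed, so
// the reconciler stops waiting for a Pod that no longer exists.
func (e *expectations) observePodDeleted(sidecarSet, pod string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if pods, ok := e.inFlight[sidecarSet]; ok {
		delete(pods, pod)
	}
}

func main() {
	e := &expectations{inFlight: map[string]map[string]struct{}{
		"demo-sidecarset": {"demo-pod-0": {}},
	}}
	e.observePodDeleted("demo-sidecarset", "demo-pod-0")
	fmt.Println(len(e.inFlight["demo-sidecarset"])) // 0: nothing left to block reconciliation
}
```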
Questions:
Has this issue been fixed in newer versions (v1.4/v1.5/v1.6)?
If not, is there any plan to address this race condition or introduce a timeout mechanism for expectations?
If this behavior is by design (to ensure strict consistency), what is the recommended way to handle such stuck states without restarting the controller?
Thanks
Environment:
- Kruise version: v1.3
- Kubernetes version (use kubectl version): v1.17
- Install details (e.g. helm install args): default helm installation