-
Notifications
You must be signed in to change notification settings - Fork 231
Description
Summary
If the source-controller's informer cache is stale for a few (~3) seconds during a reconcile, the GitRepository can end up with a valid artifact but observedGeneration: 0. This makes it look permanently broken to flux get, kstatus, and downstream dependsOn Kustomizations, and it does not self-heal once the transient staleness resolves.
Analysis
The source-controller's SerialPatcher issues three patches per reconcile via Helper.Patch(), aggregated into a single error:
return kerrors.NewAggregate([]error{
h.patchStatusConditions(ctx, obj, ...), // 1. conditions (OptimisticLock)
h.patch(ctx, obj, ...), // 2. metadata/spec (no OCC)
h.patchStatus(ctx, obj, ...), // 3. status subresource (no OCC)
})These patch ops use different optimistic concurrency levels. patchStatusConditions uses MergeFromWithOptimisticLock, which embeds the base object's resourceVersion in the patch payload. The other two use plain MergeFrom, which applies unconditionally without any version check.
This means patchStatusConditions is the only one that will conflict under informer staleness. If the cache is stale, the API server rejects the patch based on observed RV. The method retries 5 times with exponential backoff, but if staleness persists for more than a few (~3) seconds, all retries fail with a 409 and the error propagates upwards.
Because the three patches are aggregated, one failing and two succeeding produces a partial write: patchStatus succeeds, but patchStatusConditions fails to write the conditions. The object on the API server ends up with a valid artifact but no conditions.
Helper.Patch() is called not only in the deferred SummarizeAndPatch, but also during intermediate sp.Patch() calls within the reconcile pipeline itself. When the timeout propagates out, it raises a *serror.Generic
error. ComputeReconcileResult receives this error and uses a switch statement to decide whether to set observedGeneration. In this case, the non-ignored *serror.Generic does not set it. The result is an object with a valid artifact but still has observedGeneration: 0.
How it gets stuck
On the next reconcile, the controller reads the object back. The observedGeneration: 0 state means obj.Generation != obj.Status.ObservedGeneration, which triggers progressive status patching early in the reconcile pipeline.
This sp.Patch() call goes through Helper.Patch() again, including patchStatusConditions. If the informer is no longer stale, this succeeds.
ComputeReconcileResult now gets nil error and calls addPatchOptionWithStatusObservedGeneration.
But this function has a guard: len(obj.GetConditions()) > 0. This check runs inside ComputeReconcileResult
(line 201,
which calls ComputeReconcileResult, whose guard is at reconcile.go line 229) BEFORE SetSummary adds conditions
(line 209).
The in-memory object has no conditions at this point, the guard fails silently, and observedGeneration remains 0.
The controller cycles on this state indefinitely (stuck).
Trigger
a few seconds of continuous informer cache staleness during a single reconcile, causing patchStatusConditions to timeout on repeated 409s.
Suggested fix
The root issue is that transient informer staleness prevents observedGeneration from being set, even though the actual reconciliation work succeeded.
The error switch in
ComputeReconcileResult handles *serror.Stalling, *serror.Waiting, *serror.Generic, and nil, but does not explicitly handle the case where a *serror.Generic wraps a transient patch timeout. Adding a case that recognizes this situation and still sets observedGeneration (since the reconcile pipeline itself succeeded) would eliminate this fixed point in the overall reconciliation flow.
Context
I found this with a simulation framework I'm developing called kamera using simulated informer cache lag. Happy to cut a PR to fix!