Skip to content

Bug: GitRepository reconciliation flow gets permanently stuck after transient informer staleness #2012

@tgoodwin

Description

@tgoodwin

Summary

If the source-controller's informer cache is stale for a few (~3) seconds during a reconcile, the GitRepository can end up with a valid artifact but observedGeneration: 0. This makes it look permanently broken to flux get, kstatus, and downstream dependsOn Kustomizations, and it does not self-heal once the transient staleness resolves.

Analysis

The source-controller's SerialPatcher issues three patches per reconcile via Helper.Patch(), aggregated into a single error:

return kerrors.NewAggregate([]error{
    h.patchStatusConditions(ctx, obj, ...),  // 1. conditions (OptimisticLock)
    h.patch(ctx, obj, ...),                  // 2. metadata/spec (no OCC)
    h.patchStatus(ctx, obj, ...),            // 3. status subresource (no OCC)
})

These patch ops use different optimistic concurrency levels. patchStatusConditions uses MergeFromWithOptimisticLock, which embeds the base object's resourceVersion in the patch payload. The other two use plain MergeFrom, which applies unconditionally without any version check.

This means patchStatusConditions is the only one that will conflict under informer staleness. If the cache is stale, the API server rejects the patch based on observed RV. The method retries 5 times with exponential backoff, but if staleness persists for more than a few (~3) seconds, all retries fail with a 409 and the error propagates upwards.

Because the three patches are aggregated, one failing and two succeeding produces a partial write: patchStatus succeeds, but patchStatusConditions fails to write the conditions. The object on the API server ends up with a valid artifact but no conditions.

Helper.Patch() is called not only in the deferred SummarizeAndPatch, but also during intermediate sp.Patch() calls within the reconcile pipeline itself. When the timeout propagates out, it raises a *serror.Generic
error. ComputeReconcileResult receives this error and uses a switch statement to decide whether to set observedGeneration. In this case, the non-ignored *serror.Generic does not set it. The result is an object with a valid artifact but still has observedGeneration: 0.

How it gets stuck

On the next reconcile, the controller reads the object back. The observedGeneration: 0 state means obj.Generation != obj.Status.ObservedGeneration, which triggers progressive status patching early in the reconcile pipeline.
This sp.Patch() call goes through Helper.Patch() again, including patchStatusConditions. If the informer is no longer stale, this succeeds.

ComputeReconcileResult now gets nil error and calls addPatchOptionWithStatusObservedGeneration.
But this function has a guard: len(obj.GetConditions()) > 0. This check runs inside ComputeReconcileResult
(line 201,
which calls ComputeReconcileResult, whose guard is at reconcile.go line 229) BEFORE SetSummary adds conditions
(line 209).
The in-memory object has no conditions at this point, the guard fails silently, and observedGeneration remains 0.

The controller cycles on this state indefinitely (stuck).

Trigger

a few seconds of continuous informer cache staleness during a single reconcile, causing patchStatusConditions to timeout on repeated 409s.

Suggested fix

The root issue is that transient informer staleness prevents observedGeneration from being set, even though the actual reconciliation work succeeded.

The error switch in
ComputeReconcileResult handles *serror.Stalling, *serror.Waiting, *serror.Generic, and nil, but does not explicitly handle the case where a *serror.Generic wraps a transient patch timeout. Adding a case that recognizes this situation and still sets observedGeneration (since the reconcile pipeline itself succeeded) would eliminate this fixed point in the overall reconciliation flow.

Context

I found this with a simulation framework I'm developing called kamera using simulated informer cache lag. Happy to cut a PR to fix!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions