
feat(ledger): add v3 reconciler with Raft StatefulSet support#434

Open
gfyrag wants to merge 5 commits into main from feat/ledger-v3-support

Conversation

@gfyrag
Contributor

@gfyrag gfyrag commented Mar 25, 2026

Summary

  • Add version-branched reconciler for Ledger v3 (Raft consensus + Pebble embedded storage)
  • v3 deploys a StatefulSet (instead of Deployment) with a headless service for Raft peer discovery
  • Gateway compatibility: ClusterIP service maps port 8080 → container port 9000
  • All v3-specific config (replicas, PVC sizes, Pebble/Raft tunables) driven by Settings CRD — no changes to the Ledger CRD type

What's new

| Resource | Purpose |
| --- | --- |
| `StatefulSet/ledger` | Raft cluster nodes with OrderedReady policy |
| `Service/ledger-raft` (headless) | Peer DNS discovery for Raft |
| `Service/ledger` (ClusterIP) | Gateway-facing, port 8080→9000 |
| 3 PVCs per pod | `wal` (5Gi), `data` (10Gi), `cold-cache` (10Gi) |

Settings keys

| Key | Default | Description |
| --- | --- | --- |
| `ledger.v3.replicas` | `3` | Raft node count (must be odd) |
| `ledger.v3.cluster-id` | `"default"` | Cluster ID |
| `ledger.v3.persistence.{wal,data,cold-cache}.size` | `5Gi`/`10Gi`/`10Gi` | PVC sizes |
| `ledger.v3.persistence.{wal,data,cold-cache}.storage-class` | `""` | Storage classes |
| `ledger.v3.pebble.*` | `""` | Pebble tunables |
| `ledger.v3.raft.*` | `""` | Raft tunables |

Version gate

```go
isV3 := !semver.IsValid(version) || semver.Compare(version, "v3.0.0-alpha") >= 0
```

v2 code path is completely untouched.
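The pre-release anchor matters here: `v3.0.0` compares greater than `v3.0.0-alpha`, so gating on the `-alpha` pre-release catches every v3 build, and the `!semver.IsValid(version)` clause routes unparseable versions (e.g. `latest`) down the v3 path. A later commit in this PR switches the gate to a major-version check; the sketch below is a stdlib-only, illustrative stand-in for `golang.org/x/mod/semver.Major` (not the library's implementation) showing the intended semantics:

```go
package main

import (
	"fmt"
	"strings"
)

// majorOf extracts the "vN" major prefix from a semver-ish string,
// e.g. "v3.0.0-alpha" -> "v3". Simplified stand-in for semver.Major.
func majorOf(version string) string {
	if !strings.HasPrefix(version, "v") {
		return ""
	}
	rest := version[1:]
	// The major component ends at the first '.', '-' (pre-release), or '+' (build metadata).
	if i := strings.IndexAny(rest, ".-+"); i >= 0 {
		rest = rest[:i]
	}
	if rest == "" {
		return ""
	}
	for _, r := range rest {
		if r < '0' || r > '9' {
			return "" // not a valid numeric major
		}
	}
	return "v" + rest
}

func main() {
	for _, v := range []string{"v3.0.0-alpha", "v3.1.2", "v2.2.0", "latest"} {
		fmt.Printf("%s -> major %q, isV3=%v\n", v, majorOf(v), majorOf(v) == "v3")
	}
}
```

With a `semver.Major(version) == "v3"` style gate, invalid strings like `latest` no longer fall into the v3 path, which is why the later commit notes it "avoids breaking v2 tests".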

Test plan

  • go build ./... and go vet ./... pass
  • Create a Ledger CR with version: v3.0.0 in a Stack, verify StatefulSet, headless Service, ClusterIP Service, and PVCs are created
  • Verify gateway routes traffic to ledger via port 8080
  • Verify existing v2 Ledger CRs still reconcile correctly (no regression)
  • Test with odd replica counts (1, 3, 5) and verify even counts are rejected
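The odd/even replica cases in the test plan reduce to a one-branch validator. This is a hypothetical sketch (the helper name and error wording are illustrative, not the operator's actual code) of the quorum rule Raft requires:

```go
package main

import "fmt"

// validateReplicas enforces the Raft quorum requirement exercised by the
// test plan: the node count must be a positive odd number.
func validateReplicas(n int32) error {
	if n < 1 || n%2 == 0 {
		return fmt.Errorf("ledger.v3.replicas must be a positive odd number, got %d", n)
	}
	return nil
}

func main() {
	for _, n := range []int32{1, 3, 5, 0, 2, 4} {
		fmt.Println(n, validateReplicas(n))
	}
}
```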

🤖 Generated with Claude Code

Ledger v3 uses Raft consensus with Pebble embedded storage instead of
PostgreSQL. This adds a version-branched reconciler that creates a
StatefulSet (with headless service for peer discovery) when the ledger
version is >= v3.0.0-alpha.

Key changes:
- Version gate in Reconcile(): v3+ skips Database/migrations entirely
- StatefulSet with OrderedReady policy, 3 PVCs (wal, data, cold-cache)
- Headless service (ledger-raft) for Raft peer DNS discovery
- ClusterIP service mapping port 8080→9000 for gateway compatibility
- Pod entrypoint script: computes node-id from ordinal, bootstrap/join
- Settings-driven: replicas, PVC sizes, storage classes, Pebble/Raft tunables
- RBAC marker for apps/statefulsets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
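The node-id/bootstrap logic described in the entrypoint bullet can be sketched as pure decision logic. The real entrypoint is a shell script generated by the operator; `nodeRole` below is a hypothetical Go rendering of the same idea — derive the node id from the StatefulSet ordinal in the pod name, let pod-0 bootstrap when no Raft checkpoint exists yet, and have every other pod join pod-0:

```go
package main

import (
	"fmt"
	"strings"
)

// nodeRole mirrors the entrypoint decision: the ordinal is the suffix after
// the last '-' in the pod name ("ledger-2" -> "2"); pod-0 bootstraps on a
// fresh cluster, all other ordinals join pod-0.
func nodeRole(podName string, hasCheckpoint bool) (nodeID, role string) {
	ordinal := podName[strings.LastIndex(podName, "-")+1:]
	nodeID = "node-" + ordinal
	if ordinal == "0" && !hasCheckpoint {
		return nodeID, "bootstrap"
	}
	return nodeID, "join"
}

func main() {
	fmt.Println(nodeRole("ledger-0", false)) // fresh pod-0
	fmt.Println(nodeRole("ledger-0", true))  // pod-0 restarting with state
	fmt.Println(nodeRole("ledger-2", false)) // follower
}
```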
@gfyrag gfyrag requested a review from a team as a code owner March 25, 2026 15:58
@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)
  • helm/operator/templates/gen/rbac.authorization.k8s.io_v1_clusterrole_formance-manager-role.yaml is excluded by !**/gen/**, !**/*.yaml, !**/gen/**

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** overrides the default block on the dist directory by removing that pattern from both lists.
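As a rough illustration of the override, a `.coderabbit.yaml` fragment along these lines should re-include the generated files skipped above. The key path follows CodeRabbit's documented `reviews.path_filters` setting, but verify it against the current schema before relying on it:

```yaml
# Hypothetical .coderabbit.yaml fragment (verify against current CodeRabbit docs).
reviews:
  path_filters:
    - "**/gen/**"   # explicitly re-include paths blocked by the default !**/gen/**
```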

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7111619a-aec7-486d-9073-beb3601dc89a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

When a Stack version is semver-major v3, reconciliation immediately delegates to a new v3 flow that installs a Gateway, a headless Raft Service, and a StatefulSet with per-pod PVCs, probes, and env/command wiring driven by v3-specific settings. RBAC and owner metadata are updated to cover StatefulSets.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Version-gated reconciliation & RBAC**<br>`internal/resources/ledgers/init.go` | Detects semver major v3 and short-circuits to `reconcileV3(...)`. Adds `apps/statefulsets` kubebuilder RBAC and registers `apps/v1.StatefulSet` as an owner for `v1beta1.Ledger`. |
| **V3 reconciliation implementation**<br>`internal/resources/ledgers/v3.go` | New v3 reconciler: resolves image, creates Gateway with health endpoint, creates headless Raft Service (`ledger-raft`), and installs/updates a StatefulSet named `ledger` with ordered pod management, odd-replicas validation, three PVC templates (`wal`, `data`, `cold-cache`), pod template, container ports, probes, env vars, preStop deregistration, and bootstrapping/join command logic. |
| **Tests: Ledger v3 controller**<br>`internal/tests/ledger_v3_controller_test.go` | New Ginkgo/Gomega tests asserting v3 behavior: headless Service, StatefulSet ownership and config (default 3 replicas, OrderedReady, svcName), PVC templates/mounts, container ports/probes/lifecycle, env vars (`CLUSTER_ID`, `POD_NAME`/`POD_NAMESPACE`), Gateway health, no Database resource, and settings-driven overrides. |
| **Docs: Ledger v3 & settings reference**<br>`docs/04-Modules/03-Ledger.md`, `docs/09-Configuration reference/01-Settings.md` | Adds Ledger v3 documentation and settings: Raft-based architecture, resources created (headless Service, StatefulSet, PVCs), new settings (replicas, cluster-id, PVC sizes/storageClass, Pebble and Raft tuning knobs), and notes about odd quorum and PostgreSQL not being required. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Controller as Controller
    participant Init as Reconciler\n(init.go)
    participant V3 as ReconcileV3\n(v3.go)
    participant K8s as Kubernetes API

    Controller->>Init: Trigger reconciliation
    Init->>Init: Parse version -> isV3?
    alt isV3 == true
        Init->>V3: Delegate to v3 reconciliation
        V3->>K8s: Resolve image/config
        K8s-->>V3: Image/config
        V3->>K8s: Create/Update Gateway (Service + HTTP health)
        V3->>K8s: Create/Update headless Service (ledger-raft)
        V3->>K8s: Create/Update StatefulSet (pod template, PVCs, probes)
        K8s-->>V3: Resources applied
        V3-->>Init: v3 reconcile result
    else isV3 == false
        Init->>K8s: Proceed with existing v2 reconciliation
        K8s-->>Init: v2 resources reconciled
    end
    Init-->>Controller: Reconciliation complete
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A hop, a stitch — v3 springs bright,
Pods line up in ordered flight,
Raft peers whisper, volumes keep,
Health checks hum while pebble sleeps.
I nibble configs — the cluster dreams.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(ledger): add v3 reconciler with Raft StatefulSet support' accurately and specifically describes the main change: adding a v3 reconciler with Raft-based StatefulSet support. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing a detailed summary of v3 features, Settings keys, version gating logic, and a test plan aligned with the code changes. |



Document all v3-specific Settings keys in both:
- Settings reference table (01-Settings.md)
- Ledger module page (03-Ledger.md) with architecture overview,
  YAML examples, and tables for persistence/Pebble/Raft tunables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/resources/ledgers/init.go`:
- Around line 44-47: The early return when isV3 is true skips v2 cleanup; before
calling reconcileV3(ctx, stack, ledger, version) invoke a v2-cleanup routine
(e.g., cleanupV2Resources or similar) that explicitly deletes v2-only objects:
the old v2 StatefulSet(s), Services, the legacy "ledger" and "ledger-worker"
workloads, and the reindex CronJob; place this call immediately before the
return in the isV3 branch so reconcileV3 can assume v2 resources are removed
(use existing stack/ledger helpers or add a small helper function to perform the
deletes).

In `@internal/resources/ledgers/v3.go`:
- Around line 270-273: The addresses for ADVERTISE_ADDR and BOOTSTRAP_ADDR
currently hardcode the cluster domain (".cluster.local"); change them to not
include the fixed domain and instead either append only the ".svc" suffix or
read a configurable env var (e.g., CLUSTER_DOMAIN or KUBE_DNS_DOMAIN) and use
that when building the hostnames. Update the two format strings that construct
ADVERTISE_ADDR and BOOTSTRAP_ADDR (which use headlessSvc, POD_NAME,
POD_NAMESPACE, v3PortRaft, v3PortGRPC) to omit ".cluster.local" or use the env
var, and apply the same fix to other occurrences (e.g., the similar code in
internal/resources/ledgers/reindex.go).
- Around line 106-139: The StatefulSet reconcile currently rebuilds and sets
spec.VolumeClaimTemplates every run (via buildV3VolumeClaimTemplates and the
core.CreateOrUpdate for appsv1.StatefulSet), but Kubernetes treats
volumeClaimTemplates as immutable and will reject updates; remove volume claim
template reconciliation or make persistence settings create-only: stop updating
spec.VolumeClaimTemplates in the CreateOrUpdate callback (leave existing
templates untouched if the StatefulSet already exists) and/or stop watching
ledger.v3.persistence in internal/resources/ledgers/init.go so changes to
ledger.v3.persistence are treated as create-time-only, or alternatively
implement explicit PVC resize/migration logic outside the StatefulSet
reconciliation flow if you need to change storage after creation.
- Around line 308-327: The code currently uses resource.MustParse(sizeStr) when
building the PersistentVolumeClaim in the function that reads sizeStr from
settings.GetStringOrDefault; replace this with resource.ParseQuantity(sizeStr),
check the returned error, and return that error (or a wrapped validation error)
instead of allowing a panic, so the PVC construction
(corev1.PersistentVolumeClaim / corev1.PersistentVolumeClaimSpec) only proceeds
when the quantity is valid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0e0d012e-99af-42d6-9146-d5029484c003

📥 Commits

Reviewing files that changed from the base of the PR and between fafb30a and 9ca91be.

📒 Files selected for processing (2)
  • internal/resources/ledgers/init.go
  • internal/resources/ledgers/v3.go

Comment on lines +44 to +47

```go
isV3 := !semver.IsValid(version) || semver.Compare(version, "v3.0.0-alpha") >= 0
if isV3 {
	return reconcileV3(ctx, stack, ledger, version)
}
```

⚠️ Potential issue | 🟠 Major

Clean up v2-only resources before handing off to the v3 reconciler.

This early return skips every v2-side delete path. On an in-place v2→v3 upgrade, reconcileV3 only creates the new StatefulSet/Services, so old resources like the v2 ledger / ledger-worker workloads and the reindex CronJob can survive the version flip and keep running.

Suggested direction

```diff
 isV3 := !semver.IsValid(version) || semver.Compare(version, "v3.0.0-alpha") >= 0
 if isV3 {
+	if err := cleanupV2Resources(ctx, stack, ledger); err != nil {
+		return err
+	}
 	return reconcileV3(ctx, stack, ledger, version)
 }
```

Comment on lines +106 to +139

```go
volumeClaims, err := buildV3VolumeClaimTemplates(ctx, stackName)
if err != nil {
	return err
}

podTemplate, err := buildV3PodTemplate(ctx, stack, ledger, image)
if err != nil {
	return err
}

headlessSvcName := "ledger-raft"
stsName := "ledger"

_, _, err = core.CreateOrUpdate[*appsv1.StatefulSet](ctx, types.NamespacedName{
	Name:      stsName,
	Namespace: stackName,
},
	func(t *appsv1.StatefulSet) error {
		t.Spec = appsv1.StatefulSetSpec{
			Replicas:            &replicas,
			ServiceName:         headlessSvcName,
			PodManagementPolicy: appsv1.OrderedReadyPodManagement,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{
					"app.kubernetes.io/name": "ledger",
				},
			},
			Template:             *podTemplate,
			VolumeClaimTemplates: volumeClaims,
		}
		return nil
	},
	core.WithController[*appsv1.StatefulSet](ctx.GetScheme(), ledger),
)
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd "v3.go|init.go" internal/resources/ledgers/

Repository: formancehq/operator

Length of output: 130


🏁 Script executed:

cat -n internal/resources/ledgers/v3.go | head -150

Repository: formancehq/operator

Length of output: 5264


🏁 Script executed:

cat -n internal/resources/ledgers/init.go

Repository: formancehq/operator

Length of output: 5306


🏁 Script executed:

cat -n internal/resources/ledgers/v3.go | sed -n '140,350p'

Repository: formancehq/operator

Length of output: 7868


🏁 Script executed:

grep -n "buildV3VolumeClaimTemplates" internal/resources/ledgers/v3.go -A 30

Repository: formancehq/operator

Length of output: 2234


🌐 Web query:

Kubernetes StatefulSet volumeClaimTemplates immutable update validation

💡 Result:

In Kubernetes StatefulSets, the volumeClaimTemplates field in the spec is immutable after creation. Attempts to update it trigger validation that forbids changes to spec fields other than 'replicas', 'ordinals', 'template', 'updateStrategy', 'revisionHistoryLimit', 'persistentVolumeClaimRetentionPolicy', and 'minReadySeconds'. This is enforced by the StatefulSet controller's update validation logic, as evidenced by consistent error messages across sources and the ongoing KEP #4650 (as of 2026-01) to enable limited updates via a feature gate (UpdateVolumeClaimTemplate), which remains in alpha/open status. Workarounds for changes like PVC size include: - Edit existing PVCs directly (if StorageClass allows expansion). - Delete the StatefulSet with --cascade=orphan (preserves Pods/PVCs), update the manifest, and re-apply. Official docs do not list immutable fields explicitly but confirm via API reference and behavior.



Persistence settings won't reconcile after StatefulSet is created.

internal/resources/ledgers/init.go watches Settings, but this code rebuilds spec.volumeClaimTemplates on each reconcile. Kubernetes treats volumeClaimTemplates as immutable on update and rejects spec changes outside a strict allowlist (replicas, ordinals, template, updateStrategy, revisionHistoryLimit, persistentVolumeClaimRetentionPolicy, minReadySeconds), so changing ledger.v3.persistence.* settings will fail with validation errors. Either remove persistence settings from the watch or enforce them as create-time-only with no further reconciliation, otherwise implement explicit PVC resize/migration logic outside the StatefulSet template.
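The create-only behavior the reviewer suggests comes down to one branch on whether the object already exists (a live object has a non-empty `resourceVersion`). This sketch uses plain strings instead of the real `corev1.PersistentVolumeClaim` values so it stays self-contained; the function name is illustrative:

```go
package main

import "fmt"

// chooseVolumeClaimTemplates models the create-only rule: on create (empty
// resourceVersion) use the desired templates; on update, keep whatever the
// live StatefulSet already has, because Kubernetes rejects any change to
// spec.volumeClaimTemplates after creation.
func chooseVolumeClaimTemplates(resourceVersion string, existing, desired []string) []string {
	if resourceVersion == "" {
		return desired // create: safe to set
	}
	return existing // update: leave untouched
}

func main() {
	fmt.Println(chooseVolumeClaimTemplates("", nil, []string{"wal", "data", "cold-cache"}))
	fmt.Println(chooseVolumeClaimTemplates("12345", []string{"wal", "data", "cold-cache"}, []string{"wal-bigger"}))
}
```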


Comment on lines +270 to +273

```go
fmt.Sprintf(`ADVERTISE_ADDR="${POD_NAME}.%s.${POD_NAMESPACE}.svc.cluster.local:%d"`, headlessSvc, v3PortRaft),
// First pod (ordinal 0) bootstraps if no checkpoint exists yet, otherwise normal start.
// Other pods join pod-0.
fmt.Sprintf(`BOOTSTRAP_ADDR="ledger-0.%s.${POD_NAMESPACE}.svc.cluster.local:%d"`, headlessSvc, v3PortGRPC),
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, examine the actual code file and surrounding context
cat -n internal/resources/ledgers/v3.go | sed -n '260,280p'

Repository: formancehq/operator

Length of output: 1174


🏁 Script executed:

# Search for related DNS/domain configurations in the codebase
rg -i "cluster\.local|cluster-domain|cluster.domain" --type go

Repository: formancehq/operator

Length of output: 1079


🏁 Script executed:

# Check if there are any existing configurable domain patterns
rg -i "domain|dns|hostname" internal/resources/ledgers/ -A 2 -B 2 --type go | head -50

Repository: formancehq/operator

Length of output: 518


🌐 Web query:

Kubernetes kubelet --cluster-domain custom DNS suffix configuration

💡 Result:

The Kubernetes kubelet supports configuring a custom DNS cluster domain suffix using the --cluster-domain command-line flag or, preferably, the clusterDomain field in the KubeletConfiguration file. Command-line flag (deprecated but functional): kubelet --cluster-domain=your.custom.domain This configures the kubelet to append the specified domain to container DNS search paths, enabling short names like my-svc.my-ns to resolve as my-svc.my-ns.your.custom.domain. Default is typically cluster.local in most distributions. Recommended configuration file method (use --config=/path/to/config.yaml): apiVersion: kubelet.config.k8s.io/v1beta1 kind: KubeletConfiguration clusterDomain: your.custom.domain clusterDNS: - 10.96.0.10 # Cluster DNS IP, e.g., CoreDNS service IP The kubelet generates /etc/resolv.conf in pods with search domains including the pod namespace, svc, and clusterDomain (e.g., search myns.svc.your.custom.domain svc.your.custom.domain your.custom.domain). With kubeadm, include KubeletConfiguration in the init/join config file, or set ClusterConfiguration.networking.dnsDomain which propagates to kubelet. Restart kubelet after changes. This affects pod/service DNS resolution cluster-wide; all kubelets must use the same value for consistency.



Don't hardcode cluster.local into Raft peer addresses.

Kubernetes allows operators to configure a custom DNS cluster domain via the kubelet --cluster-domain flag (default is cluster.local). Building ADVERTISE_ADDR and BOOTSTRAP_ADDR as pod.svc.cluster.local will fail peer discovery on clusters using non-default domains. Make the cluster domain configurable or use namespace-local service names that don't rely on the domain suffix.

Note: This pattern is repeated elsewhere in the codebase (e.g., internal/resources/ledgers/reindex.go) and should be addressed consistently.
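One way to follow this suggestion is to centralize address construction in a helper that treats the cluster domain as optional. `raftAddr` and its parameters are illustrative, not the PR's actual code: with an empty domain the pod's DNS search path resolves the `<pod>.<svc>.<ns>.svc` form, and an operator-supplied domain (e.g. from a `CLUSTER_DOMAIN` env var) can be appended when configured:

```go
package main

import "fmt"

// raftAddr builds a Raft peer address without hardcoding "cluster.local".
// clusterDomain is optional; when empty, the domain-less ".svc" name relies
// on the pod's resolv.conf search path.
func raftAddr(pod, headlessSvc, namespace, clusterDomain string, port int) string {
	host := fmt.Sprintf("%s.%s.%s.svc", pod, headlessSvc, namespace)
	if clusterDomain != "" {
		host = host + "." + clusterDomain
	}
	return fmt.Sprintf("%s:%d", host, port)
}

func main() {
	fmt.Println(raftAddr("ledger-0", "ledger-raft", "stack1", "", 7000))
	fmt.Println(raftAddr("ledger-0", "ledger-raft", "stack1", "cluster.local", 7000))
}
```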


Comment on lines +308 to +327

```go
sizeStr, err := settings.GetStringOrDefault(ctx, stackName, s.defaultSize, strings.Split(s.sizeKey, ".")...)
if err != nil {
	return nil, err
}

storageClass, err := settings.GetStringOrEmpty(ctx, stackName, strings.Split(s.storageClassKey, ".")...)
if err != nil {
	return nil, err
}

pvc := corev1.PersistentVolumeClaim{
	ObjectMeta: metav1.ObjectMeta{
		Name: s.name,
	},
	Spec: corev1.PersistentVolumeClaimSpec{
		AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
		Resources: corev1.VolumeResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceStorage: resource.MustParse(sizeStr),
			},
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /tmp && find . -type f -name "v3.go" -path "*/ledgers/*" 2>/dev/null | head -20

Repository: formancehq/operator

Length of output: 1882


🏁 Script executed:

find . -type f -name "v3.go" 2>/dev/null | head -20

Repository: formancehq/operator

Length of output: 97


🏁 Script executed:

find . -type f -path "*/resources/ledgers/v3.go" 2>/dev/null

Repository: formancehq/operator

Length of output: 97


🏁 Script executed:

git ls-files | grep -E "ledgers.*v3\.go"

Repository: formancehq/operator

Length of output: 95


🏁 Script executed:

git ls-files | grep -E "v3\.go" | head -10

Repository: formancehq/operator

Length of output: 95


🏁 Script executed:

sed -n '308,327p' internal/resources/ledgers/v3.go

Repository: formancehq/operator

Length of output: 733


🏁 Script executed:

sed -n '300,330p' internal/resources/ledgers/v3.go

Repository: formancehq/operator

Length of output: 1159


🌐 Web query:

k8s.io resource.MustParse panic behavior vs ParseQuantity

💡 Result:

In the k8s.io/apimachinery/pkg/api/resource package, MustParse(str string) Quantity parses the given string into a Quantity and panics if parsing fails (e.g., invalid format or numeric issues). It explicitly calls ParseQuantity internally and panics on any non-nil error with a message like "cannot parse '%v': %v". In contrast, ParseQuantity(str string) (Quantity, error) performs the same parsing but returns an error (such as ErrFormatWrong for invalid format or ErrNumeric for unable to parse numeric part) instead of panicking, allowing the caller to handle failures gracefully. Use MustParse only when the input is guaranteed valid (e.g., tests or hardcoded values). Use ParseQuantity for untrusted input to avoid panics. Note: Recent issues (as of late 2025) show edge cases near math.MaxInt64 where MustParse could produce invalid quantities without panicking, but this is a parsing bug being fixed, not a change in panic behavior for invalid inputs.



🌐 Web query:

kubernetes apimachinery resource.MustParse documentation

💡 Result:

resource.MustParse is a function in the Kubernetes apimachinery package at k8s.io/apimachinery/pkg/api/resource. It parses a string into a Quantity and panics on error, intended for cases where the input is known to be valid, such as tests. Signature: func MustParse(str string) Quantity It calls ParseQuantity internally and panics with fmt.Errorf("cannot parse '%v': %v", str, err) if parsing fails. Quantity represents fixed-point numbers for resources like CPU and memory, supporting formats: BinarySI (e.g., "5Gi"), DecimalSI (e.g., "5G"), DecimalExponent (e.g., "12e6"). Example: memorySize := resource.MustParse("5Gi") fmt.Printf("memorySize = %v (%v)\n", memorySize.Value, memorySize.Format) Usage in resource specs: Resources: v1.ResourceRequirements{ Limits: v1.ResourceList{ "cpu": resource.MustParse("1"), "memory": resource.MustParse("1Gi"), }, Requests: v1.ResourceList{ "cpu": resource.MustParse("500m"), "memory": resource.MustParse("512Mi"), }, },



🏁 Script executed:

grep -n "MustParse" internal/resources/ledgers/v3.go

Repository: formancehq/operator

Length of output: 125


🏁 Script executed:

grep -n "resource\." internal/resources/ledgers/v3.go | head -20

Repository: formancehq/operator

Length of output: 125


🏁 Script executed:

head -50 internal/resources/ledgers/v3.go

Repository: formancehq/operator

Length of output: 1540


Replace resource.MustParse with resource.ParseQuantity to avoid panics on invalid PVC size settings.

The sizeStr value comes from user-controlled settings via GetStringOrDefault, which means a typo in ledger.v3.persistence.*.size will cause the reconciler to panic instead of returning a validation error. Use resource.ParseQuantity to handle invalid input gracefully.

Safer parsing

```diff
 	for _, s := range specs {
 		sizeStr, err := settings.GetStringOrDefault(ctx, stackName, s.defaultSize, strings.Split(s.sizeKey, ".")...)
 		if err != nil {
 			return nil, err
 		}
+		sizeQty, err := resource.ParseQuantity(sizeStr)
+		if err != nil {
+			return nil, fmt.Errorf("invalid %s value %q: %w", s.sizeKey, sizeStr, err)
+		}

 		storageClass, err := settings.GetStringOrEmpty(ctx, stackName, strings.Split(s.storageClassKey, ".")...)
 		if err != nil {
 			return nil, err
 		}
 @@
 			Spec: corev1.PersistentVolumeClaimSpec{
 				AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
 				Resources: corev1.VolumeResourceRequirements{
 					Requests: corev1.ResourceList{
-						corev1.ResourceStorage: resource.MustParse(sizeStr),
+						corev1.ResourceStorage: sizeQty,
 					},
 				},
 			},
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```go
sizeStr, err := settings.GetStringOrDefault(ctx, stackName, s.defaultSize, strings.Split(s.sizeKey, ".")...)
if err != nil {
	return nil, err
}
sizeQty, err := resource.ParseQuantity(sizeStr)
if err != nil {
	return nil, fmt.Errorf("invalid %s value %q: %w", s.sizeKey, sizeStr, err)
}
storageClass, err := settings.GetStringOrEmpty(ctx, stackName, strings.Split(s.storageClassKey, ".")...)
if err != nil {
	return nil, err
}
pvc := corev1.PersistentVolumeClaim{
	ObjectMeta: metav1.ObjectMeta{
		Name: s.name,
	},
	Spec: corev1.PersistentVolumeClaimSpec{
		AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
		Resources: corev1.VolumeResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceStorage: sizeQty,
```


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/04-Modules/03-Ledger.md`:
- Around line 104-110: Update the "Ledger v3" section to clarify that the global
"Requirements" above apply to v2 (not v3) by adding a short note or adjusting
headings; specifically modify the "Ledger v3" heading/content to include a
sentence like "Note: the top-level requirements apply to Ledger v2 — Ledger v3
uses embedded Pebble storage and does not require PostgreSQL or a message
broker." Ensure you reference the "Ledger v3" section and the existing
"Requirements" wording so readers won't misconfigure installs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 22a15665-0d3d-4966-a6ee-1cff0073e5a3

📥 Commits

Reviewing files that changed from the base of the PR and between 9ca91be and dd00278.

📒 Files selected for processing (2)
  • docs/04-Modules/03-Ledger.md
  • docs/09-Configuration reference/01-Settings.md
✅ Files skipped from review due to trivial changes (1)
  • docs/09-Configuration reference/01-Settings.md

Comment on lines +104 to +110

```md
## Ledger v3

Ledger v3 is an architecturally different version that uses Raft consensus with Pebble embedded storage instead of PostgreSQL. When the version is `>= v3.0.0-alpha`, the operator deploys a **StatefulSet** instead of a Deployment.

### Requirements

Ledger v3 does **not** require PostgreSQL or a message broker. Storage is fully embedded (Pebble LSM).
```

⚠️ Potential issue | 🟡 Minor

Clarify version-scoped requirements to avoid contradictory guidance.

Lines 106 and 110 correctly describe v3, but the page still starts with global requirements that say PostgreSQL is required. Please add a short note in this new v3 section (or adjust headings) that the top requirements apply to v2, so readers don’t misconfigure installs.


- Add 20 Ginkgo integration tests for v3 StatefulSet reconciler
- Prefix all v3 settings with "module." for consistency
- Fix version gate to use semver.Major == "v3" (avoids breaking v2 tests)
- Remove redundant services.Create (GatewayHTTPAPI handles ClusterIP service)
- Update docs with correct settings key paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
internal/resources/ledgers/v3.go (3)

300-318: ⚠️ Potential issue | 🟠 Major

Avoid resource.MustParse on Settings input.

sizeStr is user-controlled here. A typo like 20GG will make this reconcile path panic instead of returning a validation error.

🛡️ Safer quantity parsing
 	sizeStr, err := settings.GetStringOrDefault(ctx, stackName, s.defaultSize, strings.Split(s.sizeKey, ".")...)
 	if err != nil {
 		return nil, err
 	}
+	sizeQty, err := resource.ParseQuantity(sizeStr)
+	if err != nil {
+		return nil, fmt.Errorf("invalid %s value %q: %w", s.sizeKey, sizeStr, err)
+	}
 
 	storageClass, err := settings.GetStringOrEmpty(ctx, stackName, strings.Split(s.storageClassKey, ".")...)
 	if err != nil {
 		return nil, err
 	}
@@
 				Resources: corev1.VolumeResourceRequirements{
 					Requests: corev1.ResourceList{
-						corev1.ResourceStorage: resource.MustParse(sizeStr),
+						corev1.ResourceStorage: sizeQty,
 					},
 				},
 			},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/resources/ledgers/v3.go` around lines 300 - 318, The code currently
calls resource.MustParse(sizeStr) when building the PersistentVolumeClaim ( PVC
creation using s.name and the sizeStr obtained via settings.GetStringOrDefault),
which will panic on invalid user input; replace MustParse with
resource.ParseQuantity and handle the error: call
resource.ParseQuantity(sizeStr), return a clear validation/error from the
reconcile function if parsing fails (with context including
s.sizeKey/stackName), and only use the returned Quantity in the PVC Spec when
parsing succeeds.
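The fix above swaps the panicking parser for one that returns an error. A self-contained sketch of that validate-don't-panic pattern, using a regex stand-in for resource.ParseQuantity (the real parser lives in k8s.io/apimachinery and accepts more forms than this):

```go
package main

import (
	"fmt"
	"regexp"
)

// sizeRe is an illustrative stand-in for resource.ParseQuantity: it accepts
// values like "5Gi" or "500M" and rejects typos such as "20GG".
var sizeRe = regexp.MustCompile(`^[0-9]+(Ki|Mi|Gi|Ti|Pi|Ei|[kMGTPE])?$`)

// validateSize returns an error naming the offending settings key instead of
// panicking the reconcile loop on bad user input.
func validateSize(key, val string) error {
	if !sizeRe.MatchString(val) {
		return fmt.Errorf("invalid %s value %q: not a parseable quantity", key, val)
	}
	return nil
}

func main() {
	fmt.Println(validateSize("module.ledger.v3.persistence.wal.size", "5Gi"))
	fmt.Println(validateSize("module.ledger.v3.persistence.wal.size", "20GG"))
}
```

The point is only the control flow: parse once, surface a keyed error, and never let a Settings value reach a Must* helper.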

111-126: ⚠️ Potential issue | 🟠 Major

volumeClaimTemplates can't be treated as a live-reconciled setting.

internal/resources/ledgers/init.go watches Settings changes, and this callback rewrites t.Spec.VolumeClaimTemplates on every reconcile. Kubernetes rejects StatefulSet updates that mutate spec.volumeClaimTemplates, so changing module.ledger.v3.persistence.* after the first create will wedge reconciliation. Either make these settings create-time-only or manage PVC expansion outside the StatefulSet template.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/resources/ledgers/v3.go` around lines 111 - 126, The reconcile
currently overwrites t.Spec.VolumeClaimTemplates inside the CreateOrUpdate
callback for the StatefulSet (see the anonymous func(t *appsv1.StatefulSet) in
CreateOrUpdate[*appsv1.StatefulSet]), which mutates spec.volumeClaimTemplates
and will be rejected by Kubernetes on updates; change the logic so
VolumeClaimTemplates are only set at creation time (or omitted from the Update
path): detect whether t.ObjectMeta.Generation or t.Spec.VolumeClaimTemplates is
already present and skip replacing t.Spec.VolumeClaimTemplates on updates, or
move PVC size/expansion handling out of the StatefulSet template into separate
PVC patch/expand code; also adjust the Settings watcher in
internal/resources/ledgers/init.go to treat module.ledger.v3.persistence.* as
create-time-only or trigger your separate PVC expansion flow instead of
reconciling the StatefulSet template.

262-265: ⚠️ Potential issue | 🟠 Major

Don't hardcode .cluster.local into peer addresses.

Clusters can run with a non-default DNS suffix, so these FQDNs break Raft discovery outside cluster.local. Make the cluster domain configurable or derive addresses without baking in the suffix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/resources/ledgers/v3.go` around lines 262 - 265, The peer FQDNs are
hardcoded with ".cluster.local" which breaks clusters using a different DNS
suffix; change the ADVERTISE_ADDR and BOOTSTRAP_ADDR string templates to not
embed ".cluster.local" and instead append a cluster domain variable (e.g., use
an env var like CLUSTER_DOMAIN or the Kubernetes-provided
KUBERNETES_SERVICE_DNS_DOMAIN with a sensible fallback) when building the
addresses; update the templates that reference ADVERTISE_ADDR and BOOTSTRAP_ADDR
(using headlessSvc, v3PortRaft, v3PortGRPC, POD_NAME and POD_NAMESPACE) to
include the cluster domain variable rather than the literal ".cluster.local" so
the domain is configurable at runtime.
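A minimal sketch of the configurable-domain address construction described above (service name, namespace, and port values are illustrative, not taken from the operator):

```go
package main

import "fmt"

// peerAddr builds a Raft peer address without baking in ".cluster.local".
// An empty clusterDomain yields a namespace-local ".svc" name that resolves
// regardless of the cluster's DNS suffix; a non-empty one (e.g. read from a
// CLUSTER_DOMAIN env var) is appended explicitly.
func peerAddr(pod, headlessSvc, namespace, clusterDomain string, port int) string {
	host := fmt.Sprintf("%s.%s.%s.svc", pod, headlessSvc, namespace)
	if clusterDomain != "" {
		host += "." + clusterDomain
	}
	return fmt.Sprintf("%s:%d", host, port)
}

func main() {
	fmt.Println(peerAddr("ledger-0", "ledger-raft", "stack1", "", 4100))
	fmt.Println(peerAddr("ledger-0", "ledger-raft", "stack1", "cluster.local", 4100))
}
```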
internal/resources/ledgers/init.go (1)

44-46: ⚠️ Potential issue | 🟠 Major

Clean up v2 resources before taking the v3 fast-path.

This early return still skips the existing v2 delete path, so an in-place v2→v3 upgrade can leave the old Deployment, worker, and reindex CronJob running beside the new StatefulSet.

🧹 Suggested direction
 isV3 := semver.IsValid(version) && semver.Major(version) == "v3"
 if isV3 {
+	if err := cleanupV2Resources(ctx, stack, ledger); err != nil {
+		return err
+	}
 	return reconcileV3(ctx, stack, ledger, version)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/resources/ledgers/init.go` around lines 44 - 46, The early return
for the v3 fast-path (the isV3 check using semver.IsValid/semver.Major that
currently calls reconcileV3 and returns) skips the existing v2 delete path and
leaves v2 resources running; before returning from the v3 branch, invoke the v2
cleanup/delete logic (the existing v2 delete path that removes the v2
Deployment, worker, and reindex CronJob) so v2 resources are removed during an
in-place v2→v3 upgrade—i.e., call or reuse the v2 cleanup function/logic
immediately before or inside the isV3 branch and only then proceed to return
reconcileV3(ctx, stack, ledger, version).
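The ordering this suggestion asks for can be sketched with stubs that record call order (cleanupV2Resources here is a hypothetical stand-in for the operator's real v2 delete path):

```go
package main

import "fmt"

var steps []string

// Stubs standing in for the operator's real reconcile and cleanup functions.
func cleanupV2Resources() error { steps = append(steps, "cleanup-v2"); return nil }
func reconcileV3() error        { steps = append(steps, "reconcile-v3"); return nil }
func reconcileV2() error        { steps = append(steps, "reconcile-v2"); return nil }

// reconcileLedger runs v2 cleanup before the v3 fast-path returns, so an
// in-place v2→v3 upgrade does not leave the old Deployment, worker, and
// reindex CronJob running beside the new StatefulSet.
func reconcileLedger(isV3 bool) error {
	if isV3 {
		if err := cleanupV2Resources(); err != nil {
			return err
		}
		return reconcileV3()
	}
	return reconcileV2()
}

func main() {
	_ = reconcileLedger(true)
	fmt.Println(steps)
}
```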
🧹 Nitpick comments (2)
internal/tests/ledger_v3_controller_test.go (2)

163-169: Assert the generated ledger Service here too.

reconcileV3 now relies on the GatewayHTTPAPI reconciler to materialize the ClusterIP Service, so only checking the CR leaves the 8080 → targetPort http / container port 9000 compatibility path untested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/tests/ledger_v3_controller_test.go` around lines 163 - 169, The test
currently only asserts the GatewayHTTPAPI CR but not the generated ClusterIP
Service; update the test after loading the GatewayHTTPAPI (httpAPI) to also load
the Service resource named core.GetObjectName(stack.Name, "ledger") into a
v1.Service struct and assert that its spec.ports contains a port 8080 that
targets port named "http" (or targetPort 9000) so the 8080→targetPort mapping
created by the GatewayHTTPAPI reconciler (reconcileV3 path) is verified; use the
same Eventually/LoadResource pattern (or Expect) as used for httpAPI and
reference types GatewayHTTPAPI, reconcileV3, and the Service object to locate
where to add the assertion.

186-204: Add the even-replica rejection case alongside this happy path.

installV3StatefulSet explicitly rejects even replica counts, but this suite never proves that validation. A module.ledger.v3.replicas=4 case would lock in the behavior promised by the PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/tests/ledger_v3_controller_test.go` around lines 186 - 204, Add a
parallel negative test that verifies even replica counts are rejected: create a
new Context using the same replicasSetting variable but set value "4" for the
key "module.ledger.v3.replicas" and in JustBeforeEach attempt to
Create(replicasSetting) and assert it returns an error (or that no StatefulSet
named "ledger" is ever created via LoadResource). Reference the existing symbols
replicasSetting, Create, Delete, LoadResource and the installV3StatefulSet
behavior to locate where to add an It block that expects creation to fail (or
that *sts.Spec.Replicas is not set) when the value is even.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/tests/ledger_v3_controller_test.go`:
- Around line 171-175: Modify the "Should NOT create a Database object" test to
first wait for a positive v3 signal (e.g., assert existence of the StatefulSet
or the headless Service created by the controller) before asserting Database
absence: replace the immediate Consistently call that uses LoadResource("",
core.GetObjectName(stack.Name, "ledger"), &v1beta1.Database{}) with a two-step
approach—1) Block until the StatefulSet or headless Service is observed (use the
test helper that GETs the StatefulSet/Service by name and expects it to
Eventually(Succeed)), then 2) run the Consistently loop that calls
LoadResource(..., &v1beta1.Database{}) and ShouldNot(Succeed()) to ensure the
Database remains absent for the short window; keep references to LoadResource,
core.GetObjectName(stack.Name, "ledger"), v1beta1.Database and the test "Should
NOT create a Database object" so the change is localized.

---

Duplicate comments:
In `@internal/resources/ledgers/init.go`:
- Around line 44-46: The early return for the v3 fast-path (the isV3 check using
semver.IsValid/semver.Major that currently calls reconcileV3 and returns) skips
the existing v2 delete path and leaves v2 resources running; before returning
from the v3 branch, invoke the v2 cleanup/delete logic (the existing v2 delete
path that removes the v2 Deployment, worker, and reindex CronJob) so v2
resources are removed during an in-place v2→v3 upgrade—i.e., call or reuse the
v2 cleanup function/logic immediately before or inside the isV3 branch and only
then proceed to return reconcileV3(ctx, stack, ledger, version).

In `@internal/resources/ledgers/v3.go`:
- Around line 300-318: The code currently calls resource.MustParse(sizeStr) when
building the PersistentVolumeClaim ( PVC creation using s.name and the sizeStr
obtained via settings.GetStringOrDefault), which will panic on invalid user
input; replace MustParse with resource.ParseQuantity and handle the error: call
resource.ParseQuantity(sizeStr), return a clear validation/error from the
reconcile function if parsing fails (with context including
s.sizeKey/stackName), and only use the returned Quantity in the PVC Spec when
parsing succeeds.
- Around line 111-126: The reconcile currently overwrites
t.Spec.VolumeClaimTemplates inside the CreateOrUpdate callback for the
StatefulSet (see the anonymous func(t *appsv1.StatefulSet) in
CreateOrUpdate[*appsv1.StatefulSet]), which mutates spec.volumeClaimTemplates
and will be rejected by Kubernetes on updates; change the logic so
VolumeClaimTemplates are only set at creation time (or omitted from the Update
path): detect whether t.ObjectMeta.Generation or t.Spec.VolumeClaimTemplates is
already present and skip replacing t.Spec.VolumeClaimTemplates on updates, or
move PVC size/expansion handling out of the StatefulSet template into separate
PVC patch/expand code; also adjust the Settings watcher in
internal/resources/ledgers/init.go to treat module.ledger.v3.persistence.* as
create-time-only or trigger your separate PVC expansion flow instead of
reconciling the StatefulSet template.
- Around line 262-265: The peer FQDNs are hardcoded with ".cluster.local" which
breaks clusters using a different DNS suffix; change the ADVERTISE_ADDR and
BOOTSTRAP_ADDR string templates to not embed ".cluster.local" and instead append
a cluster domain variable (e.g., use an env var like CLUSTER_DOMAIN or the
Kubernetes-provided KUBERNETES_SERVICE_DNS_DOMAIN with a sensible fallback) when
building the addresses; update the templates that reference ADVERTISE_ADDR and
BOOTSTRAP_ADDR (using headlessSvc, v3PortRaft, v3PortGRPC, POD_NAME and
POD_NAMESPACE) to include the cluster domain variable rather than the literal
".cluster.local" so the domain is configurable at runtime.

---

Nitpick comments:
In `@internal/tests/ledger_v3_controller_test.go`:
- Around line 163-169: The test currently only asserts the GatewayHTTPAPI CR but
not the generated ClusterIP Service; update the test after loading the
GatewayHTTPAPI (httpAPI) to also load the Service resource named
core.GetObjectName(stack.Name, "ledger") into a v1.Service struct and assert
that its spec.ports contains a port 8080 that targets port named "http" (or
targetPort 9000) so the 8080→targetPort mapping created by the GatewayHTTPAPI
reconciler (reconcileV3 path) is verified; use the same Eventually/LoadResource
pattern (or Expect) as used for httpAPI and reference types GatewayHTTPAPI,
reconcileV3, and the Service object to locate where to add the assertion.
- Around line 186-204: Add a parallel negative test that verifies even replica
counts are rejected: create a new Context using the same replicasSetting
variable but set value "4" for the key "module.ledger.v3.replicas" and in
JustBeforeEach attempt to Create(replicasSetting) and assert it returns an error
(or that no StatefulSet named "ledger" is ever created via LoadResource).
Reference the existing symbols replicasSetting, Create, Delete, LoadResource and
the installV3StatefulSet behavior to locate where to add an It block that
expects creation to fail (or that *sts.Spec.Replicas is not set) when the value
is even.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 37140e16-8592-48e3-a693-5ce75bf95db7

📥 Commits

Reviewing files that changed from the base of the PR and between dd00278 and dd2cca4.

⛔ Files ignored due to path filters (1)
  • config/rbac/role.yaml is excluded by !**/*.yaml
📒 Files selected for processing (5)
  • docs/04-Modules/03-Ledger.md
  • docs/09-Configuration reference/01-Settings.md
  • internal/resources/ledgers/init.go
  • internal/resources/ledgers/v3.go
  • internal/tests/ledger_v3_controller_test.go
✅ Files skipped from review due to trivial changes (2)
  • docs/04-Modules/03-Ledger.md
  • docs/09-Configuration reference/01-Settings.md

Comment on lines +171 to +175
It("Should NOT create a Database object", func() {
	Consistently(func() error {
		return LoadResource("", core.GetObjectName(stack.Name, "ledger"), &v1beta1.Database{})
	}).ShouldNot(Succeed())
})

⚠️ Potential issue | 🟡 Minor

Wait for a positive v3 signal before asserting Database absence.

This can pass before the controller has finished its first reconcile. First wait for the StatefulSet or headless Service to exist, then keep asserting that the Database object stays absent for a short window.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/tests/ledger_v3_controller_test.go` around lines 171 - 175, Modify
the "Should NOT create a Database object" test to first wait for a positive v3
signal (e.g., assert existence of the StatefulSet or the headless Service
created by the controller) before asserting Database absence: replace the
immediate Consistently call that uses LoadResource("",
core.GetObjectName(stack.Name, "ledger"), &v1beta1.Database{}) with a two-step
approach—1) Block until the StatefulSet or headless Service is observed (use the
test helper that GETs the StatefulSet/Service by name and expects it to
Eventually(Succeed)), then 2) run the Consistently loop that calls
LoadResource(..., &v1beta1.Database{}) and ShouldNot(Succeed()) to ensure the
Database remains absent for the short window; keep references to LoadResource,
core.GetObjectName(stack.Name, "ledger"), v1beta1.Database and the test "Should
NOT create a Database object" so the change is localized.

On scale-down, the preStop lifecycle hook:
1. Calls POST /_admin/deregister to remove the node from the Raft cluster
2. Cleans the WAL directory so future re-joins start as fresh learners

This ensures clean Raft membership management during StatefulSet scaling.
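A preStop hook implementing those two steps might look like the following sketch. The deregister endpoint and port 9000 come from this PR's description; the WAL mount path and the use of curl are assumptions, not taken from the actual manifests:

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Step 1: remove this node from the Raft cluster.
          curl -fsS -X POST "http://127.0.0.1:9000/_admin/deregister" || true
          # Step 2: clear the WAL so a future re-join starts as a fresh learner.
          rm -rf /data/wal/*
```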

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
internal/resources/ledgers/v3.go (3)

323-342: ⚠️ Potential issue | 🟠 Major

Replace resource.MustParse for setting-derived PVC sizes.

sizeStr comes from Settings, so invalid values can panic the reconciler here. Use resource.ParseQuantity and return a validation error instead.

Safer quantity parsing
 	for _, s := range specs {
 		sizeStr, err := settings.GetStringOrDefault(ctx, stackName, s.defaultSize, strings.Split(s.sizeKey, ".")...)
 		if err != nil {
 			return nil, err
 		}
+		sizeQty, err := resource.ParseQuantity(sizeStr)
+		if err != nil {
+			return nil, fmt.Errorf("invalid %s value %q: %w", s.sizeKey, sizeStr, err)
+		}
@@
 				Resources: corev1.VolumeResourceRequirements{
 					Requests: corev1.ResourceList{
-						corev1.ResourceStorage: resource.MustParse(sizeStr),
+						corev1.ResourceStorage: sizeQty,
 					},
 				},
 			},
 		}
#!/bin/bash
set -euo pipefail

# Show setting-derived size flow and parse call site
rg -n 'GetStringOrDefault\(.*persistence.*size|MustParse\(|ParseQuantity\(' internal/resources/ledgers/v3.go -C3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/resources/ledgers/v3.go` around lines 323 - 342, Replace the call to
resource.MustParse(sizeStr) inside the PVC creation (the code around sizeStr and
the corev1.PersistentVolumeClaimSpec in v3.go) with safe parsing: call
resource.ParseQuantity(sizeStr), handle the returned (q, err) and if err != nil
return a validation error (wrap or return err) instead of panicking; ensure the
parsed Quantity is used to populate the Requests map for corev1.ResourceStorage
so reconciler never panics on invalid setting-derived sizes.

269-273: ⚠️ Potential issue | 🟠 Major

Do not hardcode cluster.local in peer addresses.

Using fixed .svc.cluster.local breaks discovery on clusters with a custom DNS domain. Prefer namespace-local .svc addresses or make cluster domain configurable.

Safer address construction
-		fmt.Sprintf(`ADVERTISE_ADDR="${POD_NAME}.%s.${POD_NAMESPACE}.svc.cluster.local:%d"`, headlessSvc, v3PortRaft),
+		fmt.Sprintf(`ADVERTISE_ADDR="${POD_NAME}.%s.${POD_NAMESPACE}.svc:%d"`, headlessSvc, v3PortRaft),
 ...
-		fmt.Sprintf(`BOOTSTRAP_ADDR="ledger-0.%s.${POD_NAMESPACE}.svc.cluster.local:%d"`, headlessSvc, v3PortGRPC),
+		fmt.Sprintf(`BOOTSTRAP_ADDR="ledger-0.%s.${POD_NAMESPACE}.svc:%d"`, headlessSvc, v3PortGRPC),
#!/bin/bash
set -euo pipefail
rg -n 'cluster\.local' internal/resources/ledgers/v3.go -C2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/resources/ledgers/v3.go` around lines 269 - 273, The peer address
strings currently hardcode ".svc.cluster.local" when building ADVERTISE_ADDR and
BOOTSTRAP_ADDR; change them to avoid the fixed cluster domain by either using
the namespace-local ".svc" suffix or reading a configurable CLUSTER_DOMAIN env
var. Update the fmt.Sprintf calls that build ADVERTISE_ADDR and BOOTSTRAP_ADDR
(the lines that reference headlessSvc, v3PortRaft and v3PortGRPC) to produce
"%s.${POD_NAMESPACE}.svc:%d" or "%s.${POD_NAMESPACE}.svc.%s:%d" driven by
CLUSTER_DOMAIN, and ensure any new env var is read early so the same variable is
used for both ADVERTISE_ADDR and BOOTSTRAP_ADDR.

111-127: ⚠️ Potential issue | 🟠 Major

Avoid reconciling volumeClaimTemplates after StatefulSet creation.

VolumeClaimTemplates is effectively immutable for normal StatefulSet updates. Rebuilding it from Settings in this mutate block can cause forbidden updates once the StatefulSet exists.

#!/bin/bash
set -euo pipefail

# Verify StatefulSet mutate path is always assigning VolumeClaimTemplates
rg -n 'CreateOrUpdate\[\*appsv1.StatefulSet\]|VolumeClaimTemplates|buildV3VolumeClaimTemplates' internal/resources/ledgers/v3.go -C3

# Verify persistence settings are consumed by v3 reconciler
rg -n 'module\.ledger\.v3\.persistence|buildV3VolumeClaimTemplates' internal/resources/ledgers/v3.go -C2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/resources/ledgers/v3.go` around lines 111 - 127, The mutate block
currently unconditionally overwrites t.Spec.VolumeClaimTemplates causing
forbidden updates; modify the closure passed to
CreateOrUpdate[*appsv1.StatefulSet] so that you only assign VolumeClaimTemplates
when the StatefulSet is being created or has no existing templates (e.g., check
t.ObjectMeta.CreationTimestamp.IsZero() or len(t.Spec.VolumeClaimTemplates) ==
0) and otherwise leave t.Spec.VolumeClaimTemplates untouched; continue to use
buildV3VolumeClaimTemplates to construct the templates but only set them
conditionally inside the function literal for t *appsv1.StatefulSet.
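The create-time-only guard can be sketched with minimal stand-in types (only the fields the check inspects are modeled; the real callback operates on *appsv1.StatefulSet):

```go
package main

import "fmt"

type pvcTemplate struct{ name, size string }

// statefulSet models just the fields the guard inspects.
type statefulSet struct {
	creationTimestampIsZero bool // true only while the object is being created
	volumeClaimTemplates    []pvcTemplate
}

// mutate assigns volumeClaimTemplates only on first creation (or when none
// exist yet); on updates it leaves them untouched, since Kubernetes rejects
// StatefulSet updates that change spec.volumeClaimTemplates. Size changes
// would instead go through a separate PVC-expansion flow.
func mutate(t *statefulSet, desired []pvcTemplate) {
	if t.creationTimestampIsZero || len(t.volumeClaimTemplates) == 0 {
		t.volumeClaimTemplates = desired
	}
}

func main() {
	fresh := &statefulSet{creationTimestampIsZero: true}
	mutate(fresh, []pvcTemplate{{"wal", "5Gi"}})
	existing := &statefulSet{volumeClaimTemplates: []pvcTemplate{{"wal", "5Gi"}}}
	mutate(existing, []pvcTemplate{{"wal", "20Gi"}}) // ignored on update
	fmt.Println(fresh.volumeClaimTemplates, existing.volumeClaimTemplates)
}
```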
internal/tests/ledger_v3_controller_test.go (1)

185-189: ⚠️ Potential issue | 🟡 Minor

Wait for a positive v3 signal before asserting Database absence.

This assertion can pass before reconciliation starts. First wait for StatefulSet/headless Service creation, then run Consistently on Database absence.

Suggested test stabilization
 It("Should NOT create a Database object", func() {
+	sts := &appsv1.StatefulSet{}
+	Eventually(func() error {
+		return LoadResource(stack.Name, "ledger", sts)
+	}).Should(Succeed())
+
 	Consistently(func() error {
 		return LoadResource("", core.GetObjectName(stack.Name, "ledger"), &v1beta1.Database{})
 	}).ShouldNot(Succeed())
 })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/tests/ledger_v3_controller_test.go` around lines 185 - 189, The test
currently asserts Database absence immediately which can race with
reconciliation; modify the test ("Should NOT create a Database object") to first
wait for the positive v3 signal by polling for the created StatefulSet and
headless Service (e.g., use LoadResource/Expect/Eventually against
core.GetObjectName(stack.Name, "<statefulset-name>") and the headless Service)
and only after those resources exist run the Consistently check calling
LoadResource(..., &v1beta1.Database{}) to assert it does not appear; keep the
same resource-name construction via core.GetObjectName and the existing
Consistently block but gate it behind an initial Eventually/Wait for the
StatefulSet and headless Service to stabilize.
🧹 Nitpick comments (1)
internal/tests/ledger_v3_controller_test.go (1)

200-218: Add a negative-path test for even replica settings.

Current coverage proves custom odd replicas, but not enforcement failure for even values. Add a module.ledger.v3.replicas=4 case and assert reconcile does not produce the v3 StatefulSet.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/tests/ledger_v3_controller_test.go` around lines 200 - 218, Add a
negative-path test that ensures even replica values are rejected: create a new
Context similar to the existing "with custom replicas setting" but set
replicasSetting = settings.New(uuid.NewString(), "module.ledger.v3.replicas",
"4", stack.Name), call Create(replicasSetting) in JustBeforeEach and
Delete(replicasSetting) in AfterEach, then assert reconcile does not produce the
v3 StatefulSet by calling LoadResource(stack.Name, "ledger", sts) and expecting
it to return an error or for Eventually(func(g Gomega) error { return
LoadResource(stack.Name, "ledger", sts) }).ShouldNot(Succeed()) (or equivalent
negative assertion) so the test verifies no StatefulSet is created for an even
replica value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/resources/ledgers/v3.go`:
- Around line 94-96: The current validation only enforces oddness and allows
non-positive values like -1; update the guard around the replicas variable to
first check replicas > 0 and return a clear error (e.g.,
"module.ledger.v3.replicas must be positive, got %d") before checking oddness,
so the function handling replica validation (the block using the replicas
variable) fails fast with an explicit positive-count error and then retains the
existing oddness check.

---

Duplicate comments:
In `@internal/resources/ledgers/v3.go`:
- Around line 323-342: Replace the call to resource.MustParse(sizeStr) inside
the PVC creation (the code around sizeStr and the
corev1.PersistentVolumeClaimSpec in v3.go) with safe parsing: call
resource.ParseQuantity(sizeStr), handle the returned (q, err) and if err != nil
return a validation error (wrap or return err) instead of panicking; ensure the
parsed Quantity is used to populate the Requests map for corev1.ResourceStorage
so reconciler never panics on invalid setting-derived sizes.
- Around line 269-273: The peer address strings currently hardcode
".svc.cluster.local" when building ADVERTISE_ADDR and BOOTSTRAP_ADDR; change
them to avoid the fixed cluster domain by either using the namespace-local
".svc" suffix or reading a configurable CLUSTER_DOMAIN env var. Update the
fmt.Sprintf calls that build ADVERTISE_ADDR and BOOTSTRAP_ADDR (the lines that
reference headlessSvc, v3PortRaft and v3PortGRPC) to produce
"%s.${POD_NAMESPACE}.svc:%d" or "%s.${POD_NAMESPACE}.svc.%s:%d" driven by
CLUSTER_DOMAIN, and ensure any new env var is read early so the same variable is
used for both ADVERTISE_ADDR and BOOTSTRAP_ADDR.
- Around line 111-127: The mutate block currently unconditionally overwrites
t.Spec.VolumeClaimTemplates causing forbidden updates; modify the closure passed
to CreateOrUpdate[*appsv1.StatefulSet] so that you only assign
VolumeClaimTemplates when the StatefulSet is being created or has no existing
templates (e.g., check t.ObjectMeta.CreationTimestamp.IsZero() or
len(t.Spec.VolumeClaimTemplates) == 0) and otherwise leave
t.Spec.VolumeClaimTemplates untouched; continue to use
buildV3VolumeClaimTemplates to construct the templates but only set them
conditionally inside the function literal for t *appsv1.StatefulSet.

In `@internal/tests/ledger_v3_controller_test.go`:
- Around line 185-189: The test currently asserts Database absence immediately
which can race with reconciliation; modify the test ("Should NOT create a
Database object") to first wait for the positive v3 signal by polling for the
created StatefulSet and headless Service (e.g., use
LoadResource/Expect/Eventually against core.GetObjectName(stack.Name,
"<statefulset-name>") and the headless Service) and only after those resources
exist run the Consistently check calling LoadResource(..., &v1beta1.Database{})
to assert it does not appear; keep the same resource-name construction via
core.GetObjectName and the existing Consistently block but gate it behind an
initial Eventually/Wait for the StatefulSet and headless Service to stabilize.

---

Nitpick comments:
In `@internal/tests/ledger_v3_controller_test.go`:
- Around line 200-218: Add a negative-path test that ensures even replica values
are rejected: create a new Context similar to the existing "with custom replicas
setting" but set replicasSetting = settings.New(uuid.NewString(),
"module.ledger.v3.replicas", "4", stack.Name), call Create(replicasSetting) in
JustBeforeEach and Delete(replicasSetting) in AfterEach, then assert reconcile
does not produce the v3 StatefulSet by calling LoadResource(stack.Name,
"ledger", sts) and expecting it to return an error or for Eventually(func(g
Gomega) error { return LoadResource(stack.Name, "ledger", sts)
}).ShouldNot(Succeed()) (or equivalent negative assertion) so the test verifies
no StatefulSet is created for an even replica value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0ba146dd-79d7-439f-ad46-58146905853a

📥 Commits

Reviewing files that changed from the base of the PR and between dd2cca4 and 24a53f1.

📒 Files selected for processing (2)
  • internal/resources/ledgers/v3.go
  • internal/tests/ledger_v3_controller_test.go

Comment on lines +94 to +96
if replicas%2 == 0 {
	return fmt.Errorf("module.ledger.v3.replicas must be odd, got %d", replicas)
}

⚠️ Potential issue | 🟠 Major

Reject non-positive replica counts explicitly.

Odd-only validation allows invalid negative values (e.g., -1). Enforce replicas > 0 together with oddness to fail fast with a clear message.

Proposed guard
-	if replicas%2 == 0 {
-		return fmt.Errorf("module.ledger.v3.replicas must be odd, got %d", replicas)
+	if replicas <= 0 || replicas%2 == 0 {
+		return fmt.Errorf("module.ledger.v3.replicas must be a positive odd number, got %d", replicas)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-	if replicas%2 == 0 {
-		return fmt.Errorf("module.ledger.v3.replicas must be odd, got %d", replicas)
-	}
+	if replicas <= 0 || replicas%2 == 0 {
+		return fmt.Errorf("module.ledger.v3.replicas must be a positive odd number, got %d", replicas)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/resources/ledgers/v3.go` around lines 94 - 96, The current
validation only enforces oddness and allows non-positive values like -1; update
the guard around the replicas variable to first check replicas > 0 and return a
clear error (e.g., "module.ledger.v3.replicas must be positive, got %d") before
checking oddness, so the function handling replica validation (the block using
the replicas variable) fails fast with an explicit positive-count error and then
retains the existing oddness check.
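Taken together, the proposed guard can be exercised as a small standalone check:

```go
package main

import "fmt"

// validateReplicas combines the positivity and oddness checks from the
// suggestion above: Raft quorum needs a positive odd node count, so zero,
// negative, and even values all fail fast with one clear error.
func validateReplicas(replicas int) error {
	if replicas <= 0 || replicas%2 == 0 {
		return fmt.Errorf("module.ledger.v3.replicas must be a positive odd number, got %d", replicas)
	}
	return nil
}

func main() {
	for _, n := range []int{1, 3, 4, 0, -1} {
		fmt.Printf("%d: %v\n", n, validateReplicas(n))
	}
}
```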

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>