Add provisioning controller, REST API, and STS credential brokering #332

Merged
benben merged 22 commits into main from eric/provisioning-controller on Mar 25, 2026

@EDsCODE EDsCODE commented Mar 19, 2026

Summary

Provisioning controller — reconciles Duckling CRs via K8s dynamic client:

  • State machine: pending → provisioning → ready, deleting → deleted
  • Polls CR status for S3, Aurora, secrets, IAM readiness
  • 30-min timeout, 10-min grace period for transient Crossplane sync errors
  • CAS updates to prevent state machine races
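
The state machine and CAS guard above can be sketched as follows. This is a minimal in-memory illustration, not the PR's code: the `State`, `Store`, and `CompareAndSwap` identifiers are hypothetical, and the `failed → pending` retry edge is an assumption inferred from "retry after failure" in the test plan.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// State is a warehouse lifecycle state. Names follow the PR summary;
// the Go identifiers here are illustrative, not the PR's actual types.
type State string

const (
	Pending      State = "pending"
	Provisioning State = "provisioning"
	Ready        State = "ready"
	Failed       State = "failed"
	Deleting     State = "deleting"
	Deleted      State = "deleted"
)

// allowed maps each state to its legal successors. The failed -> pending
// retry edge is an assumption, not something the PR text spells out.
var allowed = map[State][]State{
	Pending:      {Provisioning, Failed},
	Provisioning: {Ready, Failed, Deleting},
	Ready:        {Deleting},
	Failed:       {Pending, Deleting},
	Deleting:     {Deleted},
}

// ErrConflict reports that another writer changed the state first.
var ErrConflict = errors.New("state changed concurrently")

// Store is an in-memory stand-in for the config store.
type Store struct {
	mu     sync.Mutex
	states map[string]State
}

func NewStore() *Store { return &Store{states: make(map[string]State)} }

// Set unconditionally records a state (used here to seed an org).
func (s *Store) Set(org string, st State) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.states[org] = st
}

// CompareAndSwap moves org from `from` to `to` only if the stored state is
// still `from`. A SQL-backed store could express the same check as an
// UPDATE with a WHERE clause on the expected state.
func (s *Store) CompareAndSwap(org string, from, to State) error {
	if !canTransition(from, to) {
		return fmt.Errorf("illegal transition %s -> %s", from, to)
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.states[org] != from {
		return ErrConflict
	}
	s.states[org] = to
	return nil
}

func canTransition(from, to State) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	s := NewStore()
	s.Set("acme", Pending)
	fmt.Println(s.CompareAndSwap("acme", Pending, Provisioning)) // first writer wins
	fmt.Println(s.CompareAndSwap("acme", Pending, Provisioning)) // stale read is rejected
}
```

The point of the guard is that a reconciler acting on a stale read gets `ErrConflict` instead of silently overwriting a newer state.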

REST API on :8080 (admin + provisioning + dashboard):

  • POST /api/v1/orgs/:id/provision — creates Duckling CR
  • POST /api/v1/orgs/:id/deprovision — tears down infrastructure
  • GET /api/v1/orgs/:id/warehouse/status — returns current status
  • Orgs auto-created on first provision call
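
A caller of these endpoints might build its requests roughly like this. The sketch uses only the standard library; the helper names and the `min_acu` JSON field are assumptions (the PR text only confirms that `max_acu` must be greater than zero), and the base URL is a placeholder.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

const internalSecretHeader = "X-Duckgres-Internal-Secret"

// provisionBody is a hypothetical request payload. Only the max_acu > 0
// rule comes from the PR; the exact field set is an assumption.
type provisionBody struct {
	MinACU float64 `json:"min_acu"`
	MaxACU float64 `json:"max_acu"`
}

// orgURL builds /api/v1/orgs/:id/<suffix> with the org ID path-escaped.
func orgURL(baseURL, orgID, suffix string) string {
	return baseURL + "/api/v1/orgs/" + url.PathEscape(orgID) + "/" + suffix
}

// newProvisionRequest builds POST /api/v1/orgs/:id/provision.
func newProvisionRequest(baseURL, orgID, secret string, body provisionBody) (*http.Request, error) {
	if body.MaxACU <= 0 {
		return nil, fmt.Errorf("max_acu must be > 0, got %v", body.MaxACU)
	}
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, orgURL(baseURL, orgID, "provision"), bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set(internalSecretHeader, secret)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// newStatusRequest builds GET /api/v1/orgs/:id/warehouse/status.
func newStatusRequest(baseURL, orgID, secret string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, orgURL(baseURL, orgID, "warehouse/status"), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set(internalSecretHeader, secret)
	return req, nil
}

func main() {
	req, err := newProvisionRequest("http://localhost:8080", "acme", "dev-secret",
		provisionBody{MinACU: 0.5, MaxACU: 2})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```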

Port layout:

  • :5432 — PostgreSQL wire protocol (unchanged)
  • :8080 — API server (admin CRUD, provisioning, dashboard) — new
  • :9090 — metrics + health (unchanged, used by Prometheus scrapes and K8s probes)

Auth: X-Duckgres-Internal-Secret header (aligns with PostHog's internal service pattern). Dashboard cookie auth preserved for browser sessions.
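
As a rough illustration of header auth with a cookie fallback, here is a dependency-free sketch on `net/http` (the PR itself uses Gin). The `requireAuth` name, the `sessionOK` hook, and the `session` cookie name are all assumptions made for the example.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const internalSecretHeader = "X-Duckgres-Internal-Secret"

// requireAuth admits a request if it carries the shared secret header or,
// failing that, a valid dashboard session. sessionOK is a hypothetical hook
// standing in for the existing cookie check.
func requireAuth(secret string, sessionOK func(*http.Request) bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := r.Header.Get(internalSecretHeader)
		// Constant-time comparison avoids leaking the secret through timing.
		if got != "" && subtle.ConstantTimeCompare([]byte(got), []byte(secret)) == 1 {
			next.ServeHTTP(w, r)
			return
		}
		if sessionOK != nil && sessionOK(r) {
			next.ServeHTTP(w, r)
			return
		}
		http.Error(w, "unauthorized", http.StatusUnauthorized)
	})
}

// newDemoHandler wires the middleware around a trivial 200 handler.
// The "session" cookie name and "valid" value are placeholders.
func newDemoHandler(secret string) http.Handler {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	cookieCheck := func(r *http.Request) bool {
		c, err := r.Cookie("session")
		return err == nil && c.Value == "valid"
	}
	return requireAuth(secret, cookieCheck, ok)
}

// statusFor sends one GET through h and returns the response code.
func statusFor(h http.Handler, header, cookie string) int {
	req := httptest.NewRequest(http.MethodGet, "/api/v1/orgs/acme/warehouse/status", nil)
	if header != "" {
		req.Header.Set(internalSecretHeader, header)
	}
	if cookie != "" {
		req.AddCookie(&http.Cookie{Name: "session", Value: cookie})
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newDemoHandler("dev-secret")
	fmt.Println(statusFor(h, "dev-secret", "")) // header auth
	fmt.Println(statusFor(h, "", "valid"))      // cookie fallback
	fmt.Println(statusFor(h, "wrong", ""))      // rejected
}
```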

CR spec matches current XRD: spec.metadataStore.aurora.{minACU, maxACU}, no extra fields. Status reads: bucketName, auroraEndpoint, auroraPassword, Crossplane conditions.

STS AssumeRole credential brokering — workers in the shared control plane don't have per-org IAM roles. When an org uses AWS S3, the control plane calls sts:AssumeRole on arn:aws:iam::{accountId}:role/duckling-{orgID} to mint short-lived credentials and passes them to workers during activation:

  • S3SessionToken added to DuckLakeConfig + buildConfigSecret
  • Existing 5-min credential refresh loop handles rotation
  • Falls back to aws_sdk (pod identity) when STS broker not configured
  • Config: --aws-account-id / --aws-region
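
The brokering decision described above might look roughly like the sketch below. It does not call AWS: `AssumeRoler` is a hypothetical seam where an `aws-sdk-go-v2` STS client would sit, and `resolveS3Auth`, `needsRefresh`, the session-name format, and `fakeBroker` are illustrative names rather than the PR's identifiers. Only the role ARN format and the `aws_sdk` fallback come from the summary.

```go
package main

import "fmt"

// S3Credentials mirrors the credential fields handed to a worker.
type S3Credentials struct {
	AccessKeyID     string
	SecretAccessKey string
	SessionToken    string
}

// AssumeRoler is a hypothetical seam over the STS client, so the decision
// logic can be exercised without AWS access.
type AssumeRoler interface {
	AssumeRole(roleARN, sessionName string) (S3Credentials, error)
}

// S3Auth is what the activator passes on: brokered credentials ("sts") or
// an instruction to use the ambient provider chain ("aws_sdk").
type S3Auth struct {
	Provider string
	Creds    S3Credentials
}

// roleARNForOrg builds the deterministic role ARN from the summary.
func roleARNForOrg(accountID, orgID string) string {
	return fmt.Sprintf("arn:aws:iam::%s:role/duckling-%s", accountID, orgID)
}

// resolveS3Auth uses the broker when one is configured and otherwise falls
// back to pod identity. The session name format is an assumption.
func resolveS3Auth(broker AssumeRoler, accountID, orgID string) (S3Auth, error) {
	if broker == nil {
		return S3Auth{Provider: "aws_sdk"}, nil
	}
	creds, err := broker.AssumeRole(roleARNForOrg(accountID, orgID), "duckgres-"+orgID)
	if err != nil {
		return S3Auth{}, fmt.Errorf("assume role for org %s: %w", orgID, err)
	}
	return S3Auth{Provider: "sts", Creds: creds}, nil
}

// needsRefresh treats any credentials carrying a session token as temporary,
// so the periodic refresh loop re-mints them before they expire.
func needsRefresh(a S3Auth) bool { return a.Creds.SessionToken != "" }

// fakeBroker returns canned temporary credentials for the demo.
type fakeBroker struct{}

func (fakeBroker) AssumeRole(roleARN, sessionName string) (S3Credentials, error) {
	return S3Credentials{AccessKeyID: "ASIAEXAMPLE", SecretAccessKey: "example", SessionToken: "example-token"}, nil
}

func main() {
	brokered, _ := resolveS3Auth(fakeBroker{}, "123456789012", "acme")
	fallback, _ := resolveS3Auth(nil, "123456789012", "acme")
	fmt.Println(brokered.Provider, needsRefresh(brokered))
	fmt.Println(fallback.Provider, needsRefresh(fallback))
}
```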

Depends on #340 (team → org refactor).

Companion PR

TODOs before production

  • Create secret
  • Add internal-secret property to posthog-duckgres-tokens-mw-{dev,prod-us} in AWS Secrets Manager (replacing admin-token)
  • Add DUCKGRES_INTERNAL_SECRET property to posthog-django-shared-secrets in AWS Secrets Manager (same value as above, all 3 envs)
  • Set up DUCKGRES_INTERNAL_SECRET + wire DUCKGRES_API_URL into PostHog web deployment — PostHog/charts#9291
  • Add port 8080 to duckgres Helm chart Service/Deployment — PostHog/charts#9290
  • Add sts:AssumeRole permission on duckling-* to the control plane's IAM role — PostHog/posthog-cloud-infra#7203
  • Add CP role trust to Duckling composition + set DUCKGRES_AWS_ACCOUNT_ID/DUCKGRES_AWS_REGION on duckgres deployment — PostHog/charts#9292

Test plan

  • Unit tests: fake K8s dynamic client, all state transitions, CAS updates
  • Provisioning API tests: provision, deprovision, retry after failure, auto-create org
  • go build -tags kubernetes . and go vet pass
  • Local QA: OrbStack K8s + mock Duckling CRD, full provision → ready → deprovision flow from PostHog UI

🤖 Generated with Claude Code

Rename the multi-tenant "team" concept to "org" throughout:
- DB tables: duckgres_teams → duckgres_orgs, duckgres_team_users → duckgres_org_users
- DB column: team_name → org_id
- Go types: Team → Org, TeamUser → OrgUser, TeamConfig → OrgConfig,
  TeamStack → OrgStack, TeamRouter → OrgRouter, etc.
- API routes: /teams → /orgs
- K8s labels: duckgres/team → duckgres/org
- Prometheus metrics: duckgres_team_* → duckgres_org_*
- Files: team_router.go → org_router.go, team_reserved_pool.go →
  org_reserved_pool.go, teams.html → orgs.html

Note: config store DB must be recreated (GORM AutoMigrate creates
new tables but does not rename existing ones).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds end-to-end provisioning flow: PostHog calls a REST API to initiate
provisioning, the controller drives Crossplane via Duckling CRs to create
per-team AWS resources (Aurora, S3, IAM), and updates the config store as
resources come up. The team router gates worker stack creation on warehouse
readiness and uses per-team namespace/SA from the warehouse config.

New packages:
- controlplane/provisioner: reconciliation loop + K8s dynamic client for
  Duckling CRs (pending→provisioning→ready, deleting→deleted, failure handling)
- controlplane/provisioning: production-facing REST API on separate port
  (POST /teams/:id/provision, POST /teams/:id/deprovision, GET /teams/:id/warehouse)

Key design decisions:
- Provisioning API runs on separate port (:9091) from admin (:9090)
- Separate bearer token (--provisioning-token) for provisioning vs admin
- Controller uses WarehouseStore interface for testability
- CAS (compare-and-swap) updates prevent state races
- Synced=False tolerance (5min grace period) for Crossplane transients
- Teams are auto-created on first provision call
- Non-K8s builds get stub (provisioning package has no build tag)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EDsCODE EDsCODE force-pushed the eric/provisioning-controller branch from 8a0d5b1 to c7efd6e Compare March 23, 2026 05:10
@EDsCODE EDsCODE changed the base branch from main to eric/team-to-org-refactor March 23, 2026 05:11
Base automatically changed from eric/team-to-org-refactor to main March 23, 2026 18:16
EDsCODE and others added 5 commits March 23, 2026 11:25
Resolve conflicts from the team→org rename (#340), shared warm-worker
activation (#341), and kind CI migration (#342). Preserve provisioning
controller additions (warehouse readiness gating, provisioning API
server, provisioning states) on top of the new org naming and
warm-worker activation features.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fold the separate admin (:9090) and provisioning (:9091) HTTP servers
into a single unified API server on :8080. Both share the same Gin
engine, auth middleware, and /api/v1 route group.

Remove ProvisioningPort, ProvisioningToken config fields, CLI flags,
and env vars (DUCKGRES_PROVISIONING_TOKEN, DUCKGRES_PROVISIONING_PORT).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move aurora config under spec.metadataStore.aurora to match the
Crossplane XRD definition. Remove orgID from spec (not in schema).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Authorization: Bearer with X-Duckgres-Internal-Secret header
to align with PostHog's internal service auth pattern. Rename config:
- --admin-token → --internal-secret
- DUCKGRES_ADMIN_TOKEN → DUCKGRES_INTERNAL_SECRET
- cfg.AdminToken → cfg.InternalSecret

Dashboard cookie auth preserved as fallback for browser sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EDsCODE EDsCODE changed the title Add provisioning controller and REST API for managed warehouses Add provisioning controller, REST API, and org refactor Mar 24, 2026
The Crossplane composition was refactored: K8s workloads (Deployment,
Service, Namespace, etc.) are now managed by the duckgres Helm chart,
not Crossplane. The Duckling CR only provisions AWS infrastructure.

- Remove `image` from CR spec and provision API (no longer in XRD)
- Simplify DucklingStatus to only fields the XRD provides:
  bucketName, auroraEndpoint, auroraPassword, conditions
- Remove status fields that no longer exist: namespace, region,
  serviceAccountName, iamRoleArn, duckgres*, auroraPort, etc.
- Simplify reconcileProvisioning: track S3, Aurora, secrets, and
  IAM (via Ready condition) — no longer track warehouse_database
- Ready = infrastructure ready (S3 + Aurora + secrets + IAM)
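
One way to read the readiness rule above is as a check over the simplified status. The sketch below is an assumption-heavy illustration: the struct is a hand-rolled stand-in for the PR's DucklingStatus, and the mapping of each field to a component (bucketName to S3, auroraEndpoint to Aurora, auroraPassword to secrets) is a guess; only "IAM via Ready condition" is stated in the commit.

```go
package main

import "fmt"

// Condition is the subset of a Crossplane status condition used here.
type Condition struct {
	Type   string
	Status string
}

// DucklingStatus holds the simplified status fields named in the commit.
type DucklingStatus struct {
	BucketName     string
	AuroraEndpoint string
	AuroraPassword string
	Conditions     []Condition
}

// conditionTrue reports whether the named condition exists with status "True".
func conditionTrue(conds []Condition, typ string) bool {
	for _, c := range conds {
		if c.Type == typ {
			return c.Status == "True"
		}
	}
	return false
}

// pendingComponents lists what is still missing. An empty result means the
// infrastructure is ready. The field-to-component mapping is an assumption.
func pendingComponents(st DucklingStatus) []string {
	var missing []string
	if st.BucketName == "" {
		missing = append(missing, "s3")
	}
	if st.AuroraEndpoint == "" {
		missing = append(missing, "aurora")
	}
	if st.AuroraPassword == "" {
		missing = append(missing, "secrets")
	}
	if !conditionTrue(st.Conditions, "Ready") {
		missing = append(missing, "iam")
	}
	return missing
}

func main() {
	st := DucklingStatus{
		BucketName: "example-bucket",
		Conditions: []Condition{{Type: "Synced", Status: "True"}},
	}
	fmt.Println(pendingComponents(st))
}
```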

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EDsCODE EDsCODE changed the title Add provisioning controller, REST API, and org refactor Add provisioning controller and REST API Mar 24, 2026
Workers in the shared control plane don't have per-org IAM roles via
pod identity. When an org uses AWS S3 (provider=aws), the control
plane now calls STS AssumeRole on the org's duckling IAM role to mint
short-lived credentials, and passes them to the worker during activation.

- New STSBroker wrapping aws-sdk-go-v2/service/sts
- Add S3SessionToken to DuckLakeConfig + buildConfigSecret SESSION_TOKEN
- SharedWorkerActivator uses STS broker when available, falls back to
  aws_sdk (pod identity) when not configured
- Deterministic role ARN: arn:aws:iam::{accountId}:role/duckling-{orgID}
- Config: --aws-account-id / DUCKGRES_AWS_ACCOUNT_ID,
  --aws-region / DUCKGRES_AWS_REGION
- needsCredentialRefresh returns true for STS temporary creds

Prerequisite IAM changes (separate):
- Control plane role needs sts:AssumeRole on duckling-* roles
- Duckling role trust policy needs to allow the control plane role

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EDsCODE EDsCODE changed the title Add provisioning controller and REST API Add provisioning controller, REST API, and STS credential brokering Mar 24, 2026
EDsCODE and others added 7 commits March 24, 2026 11:19
- Fix K8s manifest port 9090→8080 to match code change (fixes CI)
- Add ProvisioningStartedAt timestamp for accurate timeout tracking
- Bump Synced=False grace period from 5→10 min (Aurora cold starts)
- Add warehouseStatusResponse DTO to avoid leaking internal config
- Allow deprovision from provisioning state (not just ready/failed)
- Validate max_acu > 0 on provision request
- Remove dead namespace override code in createOrgStack
- Fix InternalSecret alignment in config_resolution.go
- Fix fakeStore missing metadata_store_port/kind/engine handlers
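
The timeout and grace-period rules can be sketched as a pure function over elapsed durations. The 30-minute and 10-minute values come from the PR; the function and constant names are hypothetical, and treating the boundary as inclusive (`>=`) is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// ProvisioningTimeout bounds the whole provisioning attempt,
	// measured from ProvisioningStartedAt.
	ProvisioningTimeout = 30 * time.Minute
	// SyncGracePeriod is how long Synced=False is tolerated before it is
	// treated as a real failure rather than a Crossplane transient.
	SyncGracePeriod = 10 * time.Minute
)

// Verdict is the reconciler's decision for a provisioning warehouse.
type Verdict string

const (
	KeepWaiting Verdict = "keep-waiting"
	MarkFailed  Verdict = "mark-failed"
)

// evaluate decides whether to keep polling. sinceStart is the time elapsed
// since provisioning began; syncedFalseFor is how long the CR has
// continuously reported Synced=False (zero when it is synced).
func evaluate(sinceStart, syncedFalseFor time.Duration) (Verdict, string) {
	switch {
	case sinceStart >= ProvisioningTimeout:
		return MarkFailed, "provisioning timed out"
	case syncedFalseFor >= SyncGracePeriod:
		return MarkFailed, "sync error outlasted grace period"
	default:
		return KeepWaiting, ""
	}
}

func main() {
	started := time.Now().Add(-12 * time.Minute)
	verdict, reason := evaluate(time.Since(started), 3*time.Minute)
	fmt.Println(verdict, reason)
}
```

Measuring from a stored start timestamp rather than from the last update keeps the timeout stable across status writes.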

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Dump pod status, describe, and logs on deploy-multitenant-kind failure
- Fix control-plane-multitenant-local.yaml port 9090→8080
- Add image/aurora_min_acu/aurora_max_acu columns to kind seed SQL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin and provisioning APIs share the /api/v1 router group.
Gin requires the same wildcard name on a given path segment —
:name (admin) and :id (provisioning) on /orgs/:param conflicted,
causing a panic at startup.

Standardize on :id across both APIs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e conflict

Admin API already registers GET /orgs/:id/warehouse (full warehouse
config). Provisioning status endpoint moved to /orgs/:id/warehouse/status
(lifecycle-only DTO) to avoid duplicate route panic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pod was Running but never became Ready without a readiness probe,
causing kubectl wait --for=condition=available to time out.
Probe hits GET /health on the API server port (8080).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve readiness probe conflict: take main's probe timing
(2s interval, failureThreshold 15) with our port name (api).
Update local manifest probe port admin → api to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts from warm-pool observability PR (#344) which removed
SharedWarmWorkers flag. Keep AWSAccountID/AWSRegion and stsBroker
additions. Fix NewOrgReservedPool call sites in new warm_pool_metrics_test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EDsCODE EDsCODE marked this pull request as ready for review March 24, 2026 20:46
EDsCODE and others added 5 commits March 24, 2026 13:49
The metrics server (:9090) stays running for Prometheus scraping and
health probes. The API server (:8080) serves admin API, provisioning
API, and dashboard only. Previously the API server replaced the metrics
server, killing /metrics during the switchover.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The STS broker no longer constructs role ARNs from account ID + org name.
Instead it receives the full IAM role ARN from the config store's
WorkerIdentity.IAMRoleARN field (populated from the Duckling status).

Removed:
- --aws-account-id flag / DUCKGRES_AWS_ACCOUNT_ID env var
- STSBroker.accountID field
- STSBroker.RoleARNForOrg() method
- Fallback ARN construction in shared_worker_activator

Kept:
- --aws-region / DUCKGRES_AWS_REGION (still needed for STS client)
Align DucklingStatus and parseDucklingStatus with the new Duckling XRD:
  status.metadataStore: type, endpoint, password, user, database
  status.dataStore: type, bucketName
  status.iamRoleArn

Controller now writes:
- metadata_store_username and metadata_store_database_name from status
- worker_identity_iam_role_arn from status.iamRoleArn
- metadata_store_kind from status.metadataStore.type

Create also includes dataStore.type: s3bucket in the Duckling CR spec.
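
Reading the status layout above out of an unstructured object could look like this sketch. It is a standalone stand-in for the PR's parseDucklingStatus, whose real body is not shown here; the struct field names and the `nestedString` helper are assumptions (in a Kubernetes build, `unstructured.NestedString` from k8s.io/apimachinery serves a similar purpose).

```go
package main

import "fmt"

// DucklingStatus collects the status fields listed in the commit message.
type DucklingStatus struct {
	MetadataStoreType string
	Endpoint          string
	Password          string
	User              string
	Database          string
	DataStoreType     string
	BucketName        string
	IAMRoleARN        string
}

// nestedString walks nested maps along path and returns the string at the
// end, or "" if any step is missing or has the wrong type.
func nestedString(obj map[string]any, path ...string) string {
	var cur any = obj
	for _, key := range path {
		m, ok := cur.(map[string]any)
		if !ok {
			return ""
		}
		cur = m[key]
	}
	s, _ := cur.(string)
	return s
}

// parseDucklingStatus extracts the known fields; absent ones stay empty so
// the caller can treat them as "not ready yet".
func parseDucklingStatus(obj map[string]any) DucklingStatus {
	return DucklingStatus{
		MetadataStoreType: nestedString(obj, "status", "metadataStore", "type"),
		Endpoint:          nestedString(obj, "status", "metadataStore", "endpoint"),
		Password:          nestedString(obj, "status", "metadataStore", "password"),
		User:              nestedString(obj, "status", "metadataStore", "user"),
		Database:          nestedString(obj, "status", "metadataStore", "database"),
		DataStoreType:     nestedString(obj, "status", "dataStore", "type"),
		BucketName:        nestedString(obj, "status", "dataStore", "bucketName"),
		IAMRoleARN:        nestedString(obj, "status", "iamRoleArn"),
	}
}

func main() {
	// Placeholder values; a real object would come from the dynamic client.
	cr := map[string]any{
		"status": map[string]any{
			"metadataStore": map[string]any{"type": "aurora", "endpoint": "db.example.internal"},
			"dataStore":     map[string]any{"type": "s3bucket", "bucketName": "example-bucket"},
			"iamRoleArn":    "arn:aws:iam::123456789012:role/duckling-acme",
		},
	}
	fmt.Printf("%+v\n", parseDucklingStatus(cr))
}
```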
@benben benben force-pushed the eric/provisioning-controller branch from 1555923 to 71b8e83 Compare March 25, 2026 14:41
@benben benben force-pushed the eric/provisioning-controller branch from 71b8e83 to fdcb66c Compare March 25, 2026 14:43
@benben benben merged commit 3733075 into main Mar 25, 2026
18 checks passed
@benben benben deleted the eric/provisioning-controller branch March 25, 2026 15:37