Add provisioning controller, REST API, and STS credential brokering#332
Merged
Conversation
Rename the multi-tenant "team" concept to "org" throughout:

- DB tables: duckgres_teams → duckgres_orgs, duckgres_team_users → duckgres_org_users
- DB column: team_name → org_id
- Go types: Team → Org, TeamUser → OrgUser, TeamConfig → OrgConfig, TeamStack → OrgStack, TeamRouter → OrgRouter, etc.
- API routes: /teams → /orgs
- K8s labels: duckgres/team → duckgres/org
- Prometheus metrics: duckgres_team_* → duckgres_org_*
- Files: team_router.go → org_router.go, team_reserved_pool.go → org_reserved_pool.go, teams.html → orgs.html

Note: the config store DB must be recreated (GORM AutoMigrate creates new tables but does not rename existing ones).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the end-to-end provisioning flow: PostHog calls a REST API to initiate provisioning, the controller drives Crossplane via Duckling CRs to create per-team AWS resources (Aurora, S3, IAM), and updates the config store as resources come up. The team router gates worker stack creation on warehouse readiness and uses the per-team namespace/SA from the warehouse config.

New packages:
- controlplane/provisioner: reconciliation loop + K8s dynamic client for Duckling CRs (pending→provisioning→ready, deleting→deleted, failure handling)
- controlplane/provisioning: production-facing REST API on a separate port (POST /teams/:id/provision, POST /teams/:id/deprovision, GET /teams/:id/warehouse)

Key design decisions:
- Provisioning API runs on a separate port (:9091) from admin (:9090)
- Separate bearer token (--provisioning-token) for provisioning vs. admin
- Controller uses the WarehouseStore interface for testability
- CAS (compare-and-swap) updates prevent state races
- Synced=False tolerance (5 min grace period) for Crossplane transients
- Teams are auto-created on the first provision call
- Non-K8s builds get a stub (the provisioning package has no build tag)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
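The CAS updates mentioned above can be sketched as follows. This is a hypothetical in-memory stand-in for the WarehouseStore interface (the real store is database-backed, and the method names here are illustrative); it shows why compare-and-swap prevents races between the reconciler and concurrent API calls:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// memStore is a hypothetical in-memory stand-in for the config store;
// the real WarehouseStore is database-backed.
type memStore struct {
	mu     sync.Mutex
	states map[string]string // orgID -> lifecycle state
}

var errStale = errors.New("state changed concurrently")

// CompareAndSwapState transitions an org's warehouse state only if it is
// still in the expected prior state, so a slow reconciler cannot clobber
// a state written by a concurrent deprovision call.
func (s *memStore) CompareAndSwapState(orgID, from, to string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.states[orgID] != from {
		return errStale
	}
	s.states[orgID] = to
	return nil
}

func main() {
	s := &memStore{states: map[string]string{"org-1": "pending"}}
	fmt.Println(s.CompareAndSwapState("org-1", "pending", "provisioning")) // <nil>
	fmt.Println(s.CompareAndSwapState("org-1", "pending", "ready"))       // state changed concurrently
}
```

The second call fails because the first already moved the org out of `pending`; the reconciler re-reads and retries instead of overwriting.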
Resolve conflicts from the team→org rename (#340), shared warm-worker activation (#341), and kind CI migration (#342). Preserve provisioning controller additions (warehouse readiness gating, provisioning API server, provisioning states) on top of the new org naming and warm-worker activation features. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fold the separate admin (:9090) and provisioning (:9091) HTTP servers into a single unified API server on :8080. Both share the same Gin engine, auth middleware, and /api/v1 route group. Remove ProvisioningPort, ProvisioningToken config fields, CLI flags, and env vars (DUCKGRES_PROVISIONING_TOKEN, DUCKGRES_PROVISIONING_PORT). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move aurora config under spec.metadataStore.aurora to match the Crossplane XRD definition. Remove orgID from spec (not in schema). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Authorization: Bearer with an X-Duckgres-Internal-Secret header to align with PostHog's internal service auth pattern.

Rename config:
- --admin-token → --internal-secret
- DUCKGRES_ADMIN_TOKEN → DUCKGRES_INTERNAL_SECRET
- cfg.AdminToken → cfg.InternalSecret

Dashboard cookie auth is preserved as a fallback for browser sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Crossplane composition was refactored: K8s workloads (Deployment, Service, Namespace, etc.) are now managed by the duckgres Helm chart, not Crossplane. The Duckling CR only provisions AWS infrastructure.

- Remove `image` from the CR spec and provision API (no longer in the XRD)
- Simplify DucklingStatus to only the fields the XRD provides: bucketName, auroraEndpoint, auroraPassword, conditions
- Remove status fields that no longer exist: namespace, region, serviceAccountName, iamRoleArn, duckgres*, auroraPort, etc.
- Simplify reconcileProvisioning: track S3, Aurora, secrets, and IAM (via the Ready condition) — no longer track warehouse_database
- Ready = infrastructure ready (S3 + Aurora + secrets + IAM)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Workers in the shared control plane don't have per-org IAM roles via
pod identity. When an org uses AWS S3 (provider=aws), the control
plane now calls STS AssumeRole on the org's duckling IAM role to mint
short-lived credentials, and passes them to the worker during activation.
- New STSBroker wrapping aws-sdk-go-v2/service/sts
- Add S3SessionToken to DuckLakeConfig + buildConfigSecret SESSION_TOKEN
- SharedWorkerActivator uses STS broker when available, falls back to
aws_sdk (pod identity) when not configured
- Deterministic role ARN: arn:aws:iam::{accountId}:role/duckling-{orgID}
- Config: --aws-account-id / DUCKGRES_AWS_ACCOUNT_ID,
--aws-region / DUCKGRES_AWS_REGION
- needsCredentialRefresh returns true for STS temporary creds
Prerequisite IAM changes (separate):
- Control plane role needs sts:AssumeRole on duckling-* roles
- Duckling role trust policy needs to allow the control plane role
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
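The brokering flow above can be sketched with simplified types (the real STSBroker wraps aws-sdk-go-v2/service/sts; the interface, fake client, and field names here are illustrative stand-ins so the flow can be shown without AWS access):

```go
package main

import (
	"fmt"
	"time"
)

// Credentials holds the short-lived keys handed to a worker at activation.
// The session token is what buildConfigSecret writes as SESSION_TOKEN;
// the exact shape here is illustrative.
type Credentials struct {
	AccessKeyID     string
	SecretAccessKey string
	SessionToken    string
	Expiration      time.Time
}

// stsAPI abstracts the AssumeRole call (aws-sdk-go-v2's sts.Client in the
// real broker) so it can be faked below.
type stsAPI interface {
	AssumeRole(roleARN, sessionName string) (Credentials, error)
}

// STSBroker mints per-org temporary credentials by assuming the org's
// duckling IAM role.
type STSBroker struct{ client stsAPI }

func (b *STSBroker) CredentialsForOrg(orgID, roleARN string) (Credentials, error) {
	return b.client.AssumeRole(roleARN, "duckgres-"+orgID)
}

// needsCredentialRefresh: STS credentials are temporary (they carry a
// session token), so they must be re-minted rather than cached forever.
func needsCredentialRefresh(c Credentials) bool {
	return c.SessionToken != ""
}

// fakeSTS stands in for AWS in this sketch.
type fakeSTS struct{}

func (fakeSTS) AssumeRole(roleARN, sessionName string) (Credentials, error) {
	return Credentials{
		AccessKeyID:     "ASIA-example",
		SecretAccessKey: "secret",
		SessionToken:    "token-for-" + roleARN,
		Expiration:      time.Now().Add(time.Hour),
	}, nil
}

func main() {
	b := &STSBroker{client: fakeSTS{}}
	creds, _ := b.CredentialsForOrg("org-1", "arn:aws:iam::123456789012:role/duckling-org-1")
	fmt.Println(needsCredentialRefresh(creds)) // true
}
```

Interfacing over the STS client is the same move the commit describes for WarehouseStore: it keeps the activator testable without real AWS credentials.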
- Fix K8s manifest port 9090→8080 to match the code change (fixes CI)
- Add ProvisioningStartedAt timestamp for accurate timeout tracking
- Bump the Synced=False grace period from 5→10 min (Aurora cold starts)
- Add warehouseStatusResponse DTO to avoid leaking internal config
- Allow deprovision from the provisioning state (not just ready/failed)
- Validate max_acu > 0 on provision requests
- Remove dead namespace override code in createOrgStack
- Fix InternalSecret alignment in config_resolution.go
- Fix fakeStore missing metadata_store_port/kind/engine handlers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Dump pod status, describe output, and logs on deploy-multitenant-kind failure
- Fix control-plane-multitenant-local.yaml port 9090→8080
- Add image/aurora_min_acu/aurora_max_acu columns to the kind seed SQL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin and provisioning APIs share the /api/v1 router group. Gin requires the same wildcard name on a given path segment — :name (admin) and :id (provisioning) on /orgs/:param conflicted, causing a panic at startup. Standardize on :id across both APIs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
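Gin's rule can be illustrated with a small checker. This is not Gin's actual implementation — just a hypothetical sketch of the constraint that a given path position may carry only one wildcard name:

```go
package main

import (
	"fmt"
	"strings"
)

// wildcardConflict mimics Gin's routing rule: at a given path position,
// every route must use the same wildcard name. Registering /orgs/:name
// and /orgs/:id therefore conflicts and panics at engine startup.
func wildcardConflict(routes []string) bool {
	seen := map[string]string{} // normalized prefix -> wildcard name there
	for _, route := range routes {
		prefix := ""
		for _, seg := range strings.Split(strings.Trim(route, "/"), "/") {
			if strings.HasPrefix(seg, ":") {
				if prev, ok := seen[prefix]; ok && prev != seg {
					return true // e.g. :name vs :id at the same position
				}
				seen[prefix] = seg
				seg = ":*" // normalize so deeper segments compare equal
			}
			prefix += "/" + seg
		}
	}
	return false
}

func main() {
	fmt.Println(wildcardConflict([]string{
		"/api/v1/orgs/:name/warehouse", // admin API before the fix
		"/api/v1/orgs/:id/provision",   // provisioning API
	})) // true
	fmt.Println(wildcardConflict([]string{
		"/api/v1/orgs/:id/warehouse", // both standardized on :id
		"/api/v1/orgs/:id/provision",
	})) // false
}
```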
The admin API already registers GET /orgs/:id/warehouse (full warehouse config). The provisioning status endpoint moved to /orgs/:id/warehouse/status (lifecycle-only DTO) to avoid a duplicate-route panic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without a readiness probe, the pod was Running but never became Ready, causing kubectl wait --for=condition=available to time out. The probe hits GET /health on the API server port (8080).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve readiness probe conflict: take main's probe timing (2s interval, failureThreshold 15) with our port name (api). Update local manifest probe port admin → api to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
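The resulting probe stanza looks roughly like this — the path, port name, and timing values come from the commit messages above; the surrounding container spec is not shown and is illustrative:

```yaml
# Readiness probe on the duckgres control-plane container (sketch).
# The named port "api" maps to the unified API server on 8080.
readinessProbe:
  httpGet:
    path: /health
    port: api
  periodSeconds: 2
  failureThreshold: 15
```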
Resolve conflicts from warm-pool observability PR (#344) which removed SharedWarmWorkers flag. Keep AWSAccountID/AWSRegion and stsBroker additions. Fix NewOrgReservedPool call sites in new warm_pool_metrics_test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The metrics server (:9090) stays running for Prometheus scraping and health probes. The API server (:8080) serves admin API, provisioning API, and dashboard only. Previously the API server replaced the metrics server, killing /metrics during the switchover. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The STS broker no longer constructs role ARNs from account ID + org name. Instead it receives the full IAM role ARN from the config store's WorkerIdentity.IAMRoleARN field (populated from the Duckling status).

Removed:
- --aws-account-id flag / DUCKGRES_AWS_ACCOUNT_ID env var
- STSBroker.accountID field
- STSBroker.RoleARNForOrg() method
- Fallback ARN construction in shared_worker_activator

Kept:
- --aws-region / DUCKGRES_AWS_REGION (still needed for the STS client)
Align DucklingStatus and parseDucklingStatus with the new Duckling XRD:

- status.metadataStore: type, endpoint, password, user, database
- status.dataStore: type, bucketName
- status.iamRoleArn

The controller now writes:
- metadata_store_username and metadata_store_database_name from status
- worker_identity_iam_role_arn from status.iamRoleArn
- metadata_store_kind from status.metadataStore.type

Create also includes dataStore.type: s3bucket in the Duckling CR spec.
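Since the controller reads Duckling CRs through the dynamic client, status arrives as an unstructured `map[string]any`. A sketch of the parsing (field names follow the XRD as listed above; the Go struct and helper names are hypothetical, not the real code):

```go
package main

import "fmt"

// DucklingStatus mirrors the fields the new XRD publishes.
type DucklingStatus struct {
	MetadataStoreType     string
	MetadataStoreEndpoint string
	MetadataStorePassword string
	MetadataStoreUser     string
	MetadataStoreDatabase string
	DataStoreType         string
	BucketName            string
	IAMRoleARN            string
}

// nestedString walks a decoded unstructured object (what the K8s dynamic
// client returns) and pulls out a string leaf, tolerating missing keys.
func nestedString(obj map[string]any, path ...string) string {
	cur := any(obj)
	for _, key := range path {
		m, ok := cur.(map[string]any)
		if !ok {
			return ""
		}
		cur = m[key]
	}
	s, _ := cur.(string)
	return s
}

// parseDucklingStatus extracts the XRD status fields the controller
// writes back to the config store.
func parseDucklingStatus(status map[string]any) DucklingStatus {
	return DucklingStatus{
		MetadataStoreType:     nestedString(status, "metadataStore", "type"),
		MetadataStoreEndpoint: nestedString(status, "metadataStore", "endpoint"),
		MetadataStorePassword: nestedString(status, "metadataStore", "password"),
		MetadataStoreUser:     nestedString(status, "metadataStore", "user"),
		MetadataStoreDatabase: nestedString(status, "metadataStore", "database"),
		DataStoreType:         nestedString(status, "dataStore", "type"),
		BucketName:            nestedString(status, "dataStore", "bucketName"),
		IAMRoleARN:            nestedString(status, "iamRoleArn"),
	}
}

func main() {
	status := map[string]any{
		"metadataStore": map[string]any{"type": "aurora", "endpoint": "db.example", "user": "duckgres", "database": "ducklake"},
		"dataStore":     map[string]any{"type": "s3bucket", "bucketName": "duckling-org-1"},
		"iamRoleArn":    "arn:aws:iam::123456789012:role/duckling-org-1",
	}
	fmt.Println(parseDucklingStatus(status).BucketName) // duckling-org-1
}
```

Tolerating missing keys matters because status fields appear incrementally as Crossplane brings resources up.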
Summary
- Provisioning controller — reconciles Duckling CRs via the K8s dynamic client.
- REST API on `:8080` (admin + provisioning + dashboard):
  - `POST /api/v1/orgs/:id/provision` — creates the Duckling CR
  - `POST /api/v1/orgs/:id/deprovision` — tears down infrastructure
  - `GET /api/v1/orgs/:id/warehouse/status` — returns current status
- Port layout:
  - `:5432` — PostgreSQL wire protocol (unchanged)
  - `:8080` — API server (admin CRUD, provisioning, dashboard) — new
  - `:9090` — metrics + health (unchanged, used by Prometheus scrapes and K8s probes)
- Auth — `X-Duckgres-Internal-Secret` header (aligns with PostHog's internal service pattern). Dashboard cookie auth preserved for browser sessions.
- CR spec matches the current XRD — `spec.metadataStore.aurora.{minACU, maxACU}`, no extra fields. Status reads: `bucketName`, `auroraEndpoint`, `auroraPassword`, Crossplane conditions.
- STS AssumeRole credential brokering — workers in the shared control plane don't have per-org IAM roles. When an org uses AWS S3, the control plane calls `sts:AssumeRole` on `arn:aws:iam::{accountId}:role/duckling-{orgID}` to mint short-lived credentials and passes them to workers during activation:
  - `S3SessionToken` added to `DuckLakeConfig` + `buildConfigSecret`
  - Falls back to `aws_sdk` (pod identity) when the STS broker is not configured
  - Config: `--aws-account-id` / `--aws-region`

Depends on #340 (team → org refactor).
Companion PR
TODOs before production
- Add `internal-secret` property to `posthog-duckgres-tokens-mw-{dev,prod-us}` in AWS Secrets Manager (replacing `admin-token`)
- Add `DUCKGRES_INTERNAL_SECRET` property to `posthog-django-shared-secrets` in AWS Secrets Manager (same value as above, all 3 envs)
- Add `DUCKGRES_INTERNAL_SECRET` + wire `DUCKGRES_API_URL` into the PostHog web deployment — PostHog/charts#9291
- Grant `sts:AssumeRole` permission on `duckling-*` to the control plane's IAM role — PostHog/posthog-cloud-infra#7203
- Set `DUCKGRES_AWS_ACCOUNT_ID` / `DUCKGRES_AWS_REGION` on the duckgres deployment — PostHog/charts#9292

Test plan
- `go build -tags kubernetes .` and `go vet` pass

🤖 Generated with Claude Code