Claude/kubernetes migration plan kq jw d #359

Merged
t0mdavid-m merged 18 commits into main from
claude/kubernetes-migration-plan-KQJwD
Apr 4, 2026

@t0mdavid-m
Member

@t0mdavid-m t0mdavid-m commented Apr 4, 2026

Summary by CodeRabbit

  • Chores
    • Updated Docker build process to use full Dockerfile configuration
    • Enhanced Kubernetes CI/CD pipeline to properly handle Traefik ingress routing
    • Refactored configuration management to use settings overrides
    • Updated deployment image pull policies and startup procedures
    • Added Traefik-based routing configuration for improved traffic management

claude and others added 18 commits March 5, 2026 19:58
Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
kustomization.yaml is a Kustomize config file, not a standard K8s resource,
so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
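The exclusion described in the commit above could look roughly like the following workflow step. This is a sketch only: the step name, the extra `-strict` and `-summary` flags, and the `k8s/` path argument are assumptions, and it presumes `kubeconform` is already installed on the runner.

```yaml
# Hypothetical step in .github/workflows/k8s-manifests-ci.yml
- name: Validate manifests with kubeconform
  run: |
    # kustomization.yaml is Kustomize config, not a Kubernetes resource,
    # so it is excluded by a filename regex instead of being validated.
    kubeconform -strict -summary \
      -ignore-filename-pattern 'kustomization\.yaml$' \
      k8s/
```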
The integration-test job now uses a matrix with Dockerfile_simple and
Dockerfile. Each matrix entry checks if its Dockerfile exists before
running — all steps are guarded with an `if` condition so they skip
gracefully when a Dockerfile is absent. This allows downstream forks
that only have one Dockerfile to pass CI without errors.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
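A minimal sketch of the guarded matrix described in the commit above. The step names, the `check` step id, and the image tag are illustrative assumptions, not taken from the actual workflow.

```yaml
jobs:
  integration-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        dockerfile: [Dockerfile_simple, Dockerfile]
    steps:
      - uses: actions/checkout@v4
      - name: Check whether this Dockerfile exists
        id: check   # hypothetical step id
        run: |
          if [ -f "${{ matrix.dockerfile }}" ]; then
            echo "exists=true" >> "$GITHUB_OUTPUT"
          else
            echo "exists=false" >> "$GITHUB_OUTPUT"
          fi
      # Every later step repeats this guard, so the matrix entry is
      # skipped (not failed) when the Dockerfile is absent.
      - name: Build image
        if: steps.check.outputs.exists == 'true'
        run: docker build -f "${{ matrix.dockerfile }}" -t template-app:test .
```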
- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with
  cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU)
  so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
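For illustration, a PersistentVolumeClaim matching the commit above might look like this. The claim name is hypothetical; the access mode, storage class, and size are the values stated in the commit message.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspaces          # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce         # block storage, mountable from one node at a time
  storageClassName: cinder-csi
  resources:
    requests:
      storage: 500Gi
```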
The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which
requires all pods mounting it to run on the same node. Without explicit
affinity rules, the scheduler was failing silently, leaving pods in
Pending state with no events.

Adds a `volume-group: workspaces` label and podAffinity with
requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment,
rq-worker deployment, and cleanup cronjob. This ensures the scheduler
explicitly co-locates all workspace-consuming pods on the same node.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
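The co-location rule from the commit above can be sketched as the following pod template fragment. The `topologyKey` is an assumption (the commit message does not name one); `kubernetes.io/hostname` is the usual choice for "same node".

```yaml
# Pod template fragment, repeated in each workspace-mounting workload
metadata:
  labels:
    volume-group: workspaces
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              volume-group: workspaces
          # Assumed topology key: pods must share a node hostname.
          topologyKey: kubernetes.io/hostname
```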
The controller pod being Ready doesn't guarantee the admission webhook
service is accepting connections. Add a polling loop that waits for the
webhook endpoint to have an IP assigned before applying the Ingress
resource, preventing "connection refused" errors during kustomize apply.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
The kustomize overlay deploys into the openms namespace, but the
verification steps (Redis wait, Redis ping, deployment checks) were
querying the default namespace, causing "no matching resources found".

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
Replace the unreliable endpoint-IP polling with a retry loop on
kubectl apply (up to 5 attempts with backoff). This handles the race
where the ingress-nginx admission webhook has an endpoint IP but isn't
yet accepting TCP connections.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
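A rough sketch of such a retry loop as a workflow step. The step name and the linear backoff interval are assumptions; only the five-attempt limit comes from the commit message above.

```yaml
- name: Apply manifests, retrying while the admission webhook warms up
  run: |
    for attempt in 1 2 3 4 5; do
      if kubectl apply -k k8s/overlays/template-app; then
        exit 0
      fi
      echo "kubectl apply failed (attempt ${attempt}/5)"
      # Assumed backoff: wait 5s, 10s, 15s, ... between attempts.
      sleep $((attempt * 5))
    done
    echo "kubectl apply still failing after 5 attempts" >&2
    exit 1
```

Retrying the apply itself tests the condition that actually matters (the webhook accepting connections), which is why it is more reliable than polling for an endpoint IP.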
Kustomize namePrefix renames the Redis service to template-app-redis,
but the REDIS_URL env var in streamlit and rq-worker deployments still
referenced the unprefixed name "redis", causing the rq-worker to
CrashLoopBackOff with "Name or service not known".

Add JSON patches in the overlay to set the correct prefixed hostname.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
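The overlay patch described above might look like the fragment below. Several details are assumptions made for illustration: the prefix value is inferred from the renamed service, `REDIS_URL` is taken to be the first `env` entry of the first container, and the port and database number in the URL are guesses.

```yaml
# k8s/overlays/template-app/kustomization.yaml (fragment)
namePrefix: template-app-
patches:
  - target:
      kind: Deployment
      name: rq-worker
    patch: |-
      # Assumes REDIS_URL is env[0] of containers[0].
      - op: replace
        path: /spec/template/spec/containers/0/env/0/value
        value: redis://template-app-redis:6379/0
```

A matching patch would be needed for the streamlit Deployment, since both reference the Redis hostname.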
The cluster uses Traefik, not nginx, so the nginx Ingress annotations
are ignored. Add a Traefik IngressRoute with PathPrefix(/) catch-all
routing and sticky session cookie for Streamlit session affinity.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
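As a sketch, a Traefik IngressRoute of the kind this commit describes could be written as follows. The `apiVersion`, entry point, resource name, and cookie name are assumptions; the service name and port 8501 follow the review summary further down the page.

```yaml
apiVersion: traefik.io/v1alpha1   # older Traefik releases use traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: streamlit                 # hypothetical name
spec:
  entryPoints:
    - web                         # assumed entry point
  routes:
    - match: PathPrefix(`/`)      # catch-all route
      kind: Rule
      services:
        - name: streamlit
          port: 8501
          sticky:
            cookie:
              name: streamlit_session   # hypothetical cookie name
```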
…ests

kubeconform doesn't know the Traefik IngressRoute CRD schema, and the
kind cluster in integration tests doesn't have Traefik installed. Skip
the IngressRoute in kubeconform validation and filter it out with yq
before applying to the kind cluster.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
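One way to express the two exclusions above, as hedged workflow steps. The actual workflow may use a filename pattern instead of `-skip`; the step names are illustrative, and `yq` here means the Go implementation (mikefarah/yq v4).

```yaml
- name: Validate manifests, skipping the Traefik CRD
  run: |
    # -skip takes a comma-separated list of kinds to leave unvalidated.
    kubectl kustomize k8s/overlays/template-app \
      | kubeconform -strict -summary -skip IngressRoute
- name: Apply to kind without the IngressRoute
  run: |
    # Drop every document whose kind is IngressRoute before applying.
    kubectl kustomize k8s/overlays/template-app \
      | yq eval 'select(.kind != "IngressRoute")' - \
      | kubectl apply -f -
```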
Kustomize namePrefix doesn't rewrite service references inside CRDs,
so the IngressRoute was pointing to 'streamlit' instead of
'template-app-streamlit', causing Traefik to return 404.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
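A sketch of an overlay patch that fixes the service reference described above. It assumes the route and service of interest are the first entries in their lists, and it targets by kind only so that it does not depend on how the IngressRoute itself is named.

```yaml
# k8s/overlays/template-app/kustomization.yaml (fragment)
patches:
  - target:
      kind: IngressRoute
    patch: |-
      # namePrefix does not touch this field, so set it explicitly.
      - op: replace
        path: /spec/routes/0/services/0/name
        value: template-app-streamlit
```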
The ConfigMap was replacing the entire settings.json, losing keys like
"version" and "repository-name" that the app expects (causing KeyError).
Now the ConfigMap only contains deployment-specific overrides, which are
merged into the Docker image's base settings.json at container startup
using jq.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
Addresses CodeRabbit review: if jq merge fails, the container should
not start with unmerged settings.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
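The two commits above (merge at startup, then fail fast when the merge breaks) could be combined in a container spec along these lines. The jq filter, the temporary file, the volume name `config`, and the final `streamlit run app.py` command are assumptions; the two `/app/...` paths come from the review summary further down the page.

```yaml
# Deployment pod spec fragment
containers:
  - name: streamlit
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Abort startup if any command fails, including the jq merge.
        set -e
        # '*' merges objects recursively; keys in the overrides win.
        jq -s '.[0] * .[1]' /app/settings.json /app/settings-overrides.json \
          > /tmp/settings.json
        mv /tmp/settings.json /app/settings.json
        exec streamlit run app.py   # hypothetical entry command
    volumeMounts:
      - name: config                # assumed ConfigMap volume name
        mountPath: /app/settings-overrides.json
        subPath: settings-overrides.json
```

Writing to a temporary file first avoids truncating `/app/settings.json` while jq is still reading it.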
With imagePullPolicy: IfNotPresent, rollout restarts reuse the cached image even when a
new version has been pushed with the same tag. Setting imagePullPolicy: Always ensures
Kubernetes pulls the latest image on every pod start.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
Switch CI to build the full Docker image with OpenMS and TOPP tools,
not the lightweight pyOpenMS-only image.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
@t0mdavid-m t0mdavid-m merged commit b185cf0 into main Apr 4, 2026
9 of 10 checks passed
@coderabbitai

coderabbitai bot commented Apr 4, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 077d7b7d-6d1a-4c9a-b5b7-c8e6598f6a25

📥 Commits

Reviewing files that changed from the base of the PR and between 93170fa and 6d42f0f.

📒 Files selected for processing (8)
  • .github/workflows/build-and-push-image.yml
  • .github/workflows/k8s-manifests-ci.yml
  • k8s/base/configmap.yaml
  • k8s/base/kustomization.yaml
  • k8s/base/rq-worker-deployment.yaml
  • k8s/base/streamlit-deployment.yaml
  • k8s/base/traefik-ingressroute.yaml
  • k8s/overlays/template-app/kustomization.yaml

📝 Walkthrough

Walkthrough

This PR updates the deployment infrastructure by switching to a full Dockerfile, introducing Traefik-based routing via an IngressRoute resource, refactoring the ConfigMap to use a settings-overrides pattern, and implementing runtime settings merging in pod startup scripts. CI workflows are updated to exclude Traefik resources from validation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| GitHub Actions Workflows: `.github/workflows/build-and-push-image.yml`, `.github/workflows/k8s-manifests-ci.yml` | Switched Docker build from Dockerfile_simple to Dockerfile; updated CI validation and deployment steps to exclude Traefik IngressRoute resources using kubeconform ignoring rules and yq filtering. |
| Kubernetes Configuration: `k8s/base/configmap.yaml`, `k8s/base/kustomization.yaml` | Renamed ConfigMap data key from settings.json to settings-overrides.json with simplified content (online_deployment: true only); added traefik-ingressroute.yaml to Kustomize resources. |
| Kubernetes Deployments: `k8s/base/rq-worker-deployment.yaml`, `k8s/base/streamlit-deployment.yaml` | Updated imagePullPolicy to Always; added startup script logic to merge /app/settings.json and /app/settings-overrides.json using jq; changed config mount to point to settings-overrides.json instead of settings.json. |
| Kubernetes Routing: `k8s/base/traefik-ingressroute.yaml` | Added new Traefik IngressRoute resource routing requests matching PathPrefix(`/`) to the `streamlit` service (port 8501) with sticky session support via cookie-based routing. |
| Kubernetes Overlay Patches: `k8s/overlays/template-app/kustomization.yaml` | Extended patch rules to configure Redis URI for streamlit and rq-worker Deployments; added patch to update IngressRoute service name for the overlay environment. |

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Traefik
    participant Service as Streamlit Service
    participant Pod as Streamlit Pod
    
    Client->>Traefik: HTTP Request (PathPrefix: /)
    Traefik->>Service: Route via IngressRoute<br/>(sticky session cookie)
    Service->>Pod: Forward to Pod
    Pod->>Pod: Startup: Merge settings<br/>(base + overrides)
    Pod->>Pod: Launch Streamlit App
    Pod->>Service: Ready to serve
    Service->>Traefik: Response
    Traefik->>Client: HTTP Response
sequenceDiagram
    participant Pod as Pod Startup
    participant ConfigMap
    participant Filesystem
    
    Pod->>ConfigMap: Read settings-overrides.json
    Pod->>Filesystem: Read /app/settings.json<br/>(base settings)
    Pod->>Pod: Merge using jq<br/>(base + overrides)
    Pod->>Filesystem: Write merged result<br/>to /app/settings.json
    Pod->>Pod: Launch application<br/>with merged settings

Possibly Related PRs

  • PR #220: Replaced/removed Dockerfile_simple and corresponding environment configurations; this PR completes the migration by switching the build pipeline to use the full Dockerfile.
  • PR #358: Contains identical changes to Kubernetes manifests (ConfigMap key rename, Traefik IngressRoute addition, deployment settings merge logic) and CI workflow Traefik exclusions.
  • PR #347: Introduced the Kubernetes and workflow changes that this PR modifies (Dockerfile selection, Traefik resource handling, and ConfigMap refactoring approach).

Poem

🐰 A Traefik path now routes the way,
Settings merge at startup's display,
Overrides dance with base so fine,
Docker's full strength, no more benign,
The warren's config now aligned! 🌿
