
Add Modal GPU CI workflow #100

Draft
ryan-williams wants to merge 5 commits into main from modal-ci

Conversation


@ryan-williams ryan-williams commented Mar 17, 2026

Summary

Add Modal GPU infrastructure for CI, benchmarking, and training experiments.

CI (modal/ci.py + .github/workflows/gpu-e2e-modal.yml)

  • GPU e2e test on Modal L4, parallel to existing EC2-based gpu-e2e.yml
  • Faster cold start (~30s vs ~3-5min), simpler setup (no runner registration, no OIDC)
  • Produces identical results to EC2 L4 (val_loss=0.364269)
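Per the commit notes, the workflow just calls `modal run` from `ubuntu-latest` with the two token secrets. A minimal sketch of what `gpu-e2e-modal.yml` might look like (step names and checkout/setup details are assumptions):

```yaml
name: GPU e2e (Modal)
on: [pull_request]
jobs:
  gpu-e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install modal
      - run: modal run modal/ci.py
        env:
          MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
          MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
```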

Training (modal/train.py)

  • Full training entrypoint for experiments on Modal GPUs (L4/A100/H100)
  • Uses electrai-data Volume with dataset_4 (2,885 samples, ~205 GiB)
  • Configurable model size, epochs, learning rate, WandB logging
  • Checkpoint persistence via electrai-checkpoints Volume
  • Replaces Lambda Labs for experiments (better GPU availability)
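The scaffolding for such an entrypoint might look like the sketch below — app name, mount paths, and the `train` signature are assumptions, not the PR's actual code; only the Volume/secret names come from the description above.

```python
import modal

app = modal.App("electrai-train")
data = modal.Volume.from_name("electrai-data")
ckpts = modal.Volume.from_name("electrai-checkpoints")

@app.function(
    gpu="L4",  # or "A100" / "H100"
    volumes={"/data": data, "/ckpts": ckpts},
    secrets=[modal.Secret.from_name("wandb-credentials")],
)
def train(epochs: int = 2, lr: float = 1e-3):
    ...  # load dataset_4 from /data, log to WandB, checkpoint to /ckpts
```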

Benchmark (modal/benchmark.py + .github/workflows/gpu-benchmark-modal.yml)

  • Mirrors gpu-benchmark.yml but runs on Modal
  • Configurable sample count (subsample from 2,885 or use all)
  • modal run modal/benchmark.py --gpu A100 --samples 50 --epochs 5
  • Validated via push-triggered run on modal-benchmark branch (50 samples, 5 epochs, L4)
  • Note: gpu-benchmark-modal.yml uses workflow_dispatch only, so it won't be dispatchable until this PR merges to main

Data pipeline

  • modal/populate_volume.py: S3 → Modal Volume sync
  • Data provenance: Globus (Della) → S3 (s3://openathena/electrai/mp/chg_datasets/dataset_4/) → Modal Volume
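The S3 → Volume leg of that pipeline can be sketched as below (assumptions: function names and the `/data/dataset_4` mount path are illustrative, not the actual `populate_volume.py` contents; the bucket/prefix are from the provenance line above):

```python
from pathlib import Path

def volume_relpath(key: str, prefix: str) -> str:
    """Map an S3 key under `prefix` to its relative path inside the Volume."""
    assert key.startswith(prefix)
    return key[len(prefix):].lstrip("/")

def sync(bucket: str = "openathena",
         prefix: str = "electrai/mp/chg_datasets/dataset_4/",
         dest: str = "/data/dataset_4") -> int:
    """Download every object under `prefix` into the Volume mount at `dest`."""
    import boto3  # deferred: only needed inside the Modal container
    s3 = boto3.client("s3")
    n = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            target = Path(dest) / volume_relpath(obj["Key"], prefix)
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj["Key"], str(target))
            n += 1
    return n
```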

Image construction

  • Dependencies read from pyproject.toml via pip_install_from_pyproject (no duplication)
  • retries=0 to prevent crash loops during iteration
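Both bullets together might look roughly like this (a sketch, not the PR's actual code; the Python version and app name are assumptions):

```python
import modal

# Dependency pins live only in pyproject.toml — no duplicated requirements list.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install_from_pyproject("pyproject.toml")
)

app = modal.App("electrai-ci", image=image)

@app.function(gpu="L4", retries=0)  # retries=0: a broken iteration fails once, no crash loop
def run_ci() -> None:
    ...
```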

Secrets required

  • MODAL_TOKEN_ID / MODAL_TOKEN_SECRET — repo secrets (set)
  • wandb-credentials — Modal secret with WANDB_API_KEY (set)
  • aws-credentials — Modal secret for populate_volume.py (set, uses SSO session token)

Test plan

  • modal run modal/ci.py — val_loss matches linux-gpu expected values
  • modal run modal/train.py --epochs 2 — trains on Volume data with WandB
  • GHA gpu-e2e-modal.yml triggers on PR, passes
  • modal/populate_volume.py — 5,771 files synced to Volume
  • modal run modal/benchmark.py — 50 samples, 5 epochs, L4, green
  • GHA gpu-benchmark-modal.yml — push-triggered iteration, green
  • Betsy validates training workflow as Lambda Labs replacement

ryan-williams and others added 3 commits March 25, 2026 16:02
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- `modal/ci.py`: Modal app that runs `run_training()` on an L4 GPU,
  checks val_loss against expected values. Tested locally — produces
  identical results to EC2 L4 (val_loss=0.364269).
- `.github/workflows/gpu-e2e-modal.yml`: GHA workflow that calls
  `modal run` from `ubuntu-latest`. Requires `MODAL_TOKEN_ID` and
  `MODAL_TOKEN_SECRET` secrets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- `modal/train.py`: Full training on Modal GPUs with `electrai-data`
  Volume (dataset_4, 2,885 samples). Supports GPU selection (L4/A100/H100),
  custom configs, WandB logging, checkpoint persistence.
- `modal/populate_volume.py`: Sync S3 → Modal Volume via `boto3`.
- Update spec with training docs, data provenance, secrets.

Data pipeline: Globus (Della) → S3 → Modal Volume (complete).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mirrors `gpu-benchmark.yml` but runs on Modal instead of EC2:
- Uses `electrai-data` Volume (2,885 samples from dataset_4)
- Configurable GPU type, model size, sample count, epochs
- WandB logging with same tags/config as EC2 benchmark
- Subsample support for quick benchmarks vs full dataset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ming

- Fix `RhoData`: `strip()` + filter blank lines in filelist parsing
  (was treating trailing newlines as empty sample IDs → `.CHGCAR`)
- Add GHA summary step to Modal benchmark workflow (config table +
  results, matching EC2 benchmark format)
- WandB run names include dataset/samples/timestamp for local runs,
  GHA run number for CI runs
- Print parseable `BENCHMARK_*` output for GHA step summary
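The `RhoData` filelist fix in the first bullet amounts to something like the following (hypothetical function name; the actual class method isn't shown here):

```python
def parse_filelist(text: str) -> list[str]:
    """Parse sample IDs from a filelist, stripping whitespace and dropping blank
    lines — so a trailing newline no longer yields an empty ID that turned into
    a bare `.CHGCAR` path."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```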

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>