
Add Modal GPU CI workflow #100

Draft
ryan-williams wants to merge 5 commits into main from modal-ci

Conversation


@ryan-williams ryan-williams commented Mar 17, 2026

Summary

Add Modal GPU infrastructure for CI, benchmarking, and training experiments.

CI (modal/ci.py + .github/workflows/gpu-e2e-modal.yml)

  • GPU e2e test on Modal L4, parallel to existing EC2-based gpu-e2e.yml
  • Faster cold start (~30s vs ~3-5min), simpler setup (no runner registration, no OIDC)
  • Produces identical results to EC2 L4 (val_loss=0.364269)
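Per the commit notes, the workflow just calls `modal run` from `ubuntu-latest` with the two token secrets. A minimal sketch of what `gpu-e2e-modal.yml` might look like (step names and checkout/setup details are assumptions):

```yaml
name: GPU e2e (Modal)
on: [pull_request]
jobs:
  gpu-e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install modal
      - run: modal run modal/ci.py
        env:
          MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
          MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
```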

Training (modal/train.py)

  • Full training entrypoint for experiments on Modal GPUs (L4/A100/H100)
  • Uses electrai-data Volume with dataset_4 (2,885 samples, ~205 GiB)
  • Configurable model size, epochs, learning rate, WandB logging
  • Checkpoint persistence via electrai-checkpoints Volume
  • Replaces Lambda Labs for experiments (better GPU availability)
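The scaffolding for such an entrypoint might look like the sketch below — app name, mount paths, and the `train` signature are assumptions, not the PR's actual code; only the Volume/secret names come from the description above.

```python
import modal

app = modal.App("electrai-train")
data = modal.Volume.from_name("electrai-data")
ckpts = modal.Volume.from_name("electrai-checkpoints")

@app.function(
    gpu="L4",  # or "A100" / "H100"
    volumes={"/data": data, "/ckpts": ckpts},
    secrets=[modal.Secret.from_name("wandb-credentials")],
)
def train(epochs: int = 2, lr: float = 1e-3):
    ...  # load dataset_4 from /data, log to WandB, checkpoint to /ckpts
```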

Benchmark (modal/benchmark.py + .github/workflows/gpu-benchmark-modal.yml)

  • Mirrors gpu-benchmark.yml but runs on Modal
  • Configurable sample count (subsample from 2,885 or use all)
  • modal run modal/benchmark.py --gpu A100 --samples 50 --epochs 5
  • Validated via push-triggered run on modal-benchmark branch (50 samples, 5 epochs, L4)
  • Note: gpu-benchmark-modal.yml uses workflow_dispatch only, so it won't be dispatchable until this PR merges to main

Data pipeline

  • modal/populate_volume.py: S3 → Modal Volume sync
  • Data provenance: Globus (Della) → S3 (s3://openathena/electrai/mp/chg_datasets/dataset_4/) → Modal Volume
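The S3 → Volume leg of that pipeline can be sketched as below (assumptions: function names and the `/data/dataset_4` mount path are illustrative, not the actual `populate_volume.py` contents; the bucket/prefix are from the provenance line above):

```python
from pathlib import Path

def volume_relpath(key: str, prefix: str) -> str:
    """Map an S3 key under `prefix` to its relative path inside the Volume."""
    assert key.startswith(prefix)
    return key[len(prefix):].lstrip("/")

def sync(bucket: str = "openathena",
         prefix: str = "electrai/mp/chg_datasets/dataset_4/",
         dest: str = "/data/dataset_4") -> int:
    """Download every object under `prefix` into the Volume mount at `dest`."""
    import boto3  # deferred: only needed inside the Modal container
    s3 = boto3.client("s3")
    n = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            target = Path(dest) / volume_relpath(obj["Key"], prefix)
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj["Key"], str(target))
            n += 1
    return n
```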

Image construction

  • Dependencies read from pyproject.toml via pip_install_from_pyproject (no duplication)
  • retries=0 to prevent crash loops during iteration
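Both bullets together might look roughly like this (a sketch, not the PR's actual code; the Python version and app name are assumptions):

```python
import modal

# Dependency pins live only in pyproject.toml — no duplicated requirements list.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install_from_pyproject("pyproject.toml")
)

app = modal.App("electrai-ci", image=image)

@app.function(gpu="L4", retries=0)  # retries=0: a broken iteration fails once, no crash loop
def run_ci() -> None:
    ...
```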

Secrets required

  • MODAL_TOKEN_ID / MODAL_TOKEN_SECRET — repo secrets (set)
  • wandb-credentials — Modal secret with WANDB_API_KEY (set)
  • aws-credentials — Modal secret for populate_volume.py (set, uses SSO session token)

Test plan

  • modal run modal/ci.py — val_loss matches linux-gpu expected values
  • modal run modal/train.py --epochs 2 — trains on Volume data with WandB
  • GHA gpu-e2e-modal.yml triggers on PR, passes
  • modal/populate_volume.py — 5,771 files synced to Volume
  • modal run modal/benchmark.py — 50 samples, 5 epochs, L4, green
  • GHA gpu-benchmark-modal.yml — push-triggered iteration, green
  • Betsy validates training workflow as Lambda Labs replacement

ryan-williams and others added 3 commits March 25, 2026 16:02
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- `modal/ci.py`: Modal app that runs `run_training()` on an L4 GPU,
  checks val_loss against expected values. Tested locally — produces
  identical results to EC2 L4 (val_loss=0.364269).
- `.github/workflows/gpu-e2e-modal.yml`: GHA workflow that calls
  `modal run` from `ubuntu-latest`. Requires `MODAL_TOKEN_ID` and
  `MODAL_TOKEN_SECRET` secrets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- `modal/train.py`: Full training on Modal GPUs with `electrai-data`
  Volume (dataset_4, 2,885 samples). Supports GPU selection (L4/A100/H100),
  custom configs, WandB logging, checkpoint persistence.
- `modal/populate_volume.py`: Sync S3 → Modal Volume via `boto3`.
- Update spec with training docs, data provenance, secrets.

Data pipeline: Globus (Della) → S3 → Modal Volume (complete).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mirrors `gpu-benchmark.yml` but runs on Modal instead of EC2:
- Uses `electrai-data` Volume (2,885 samples from dataset_4)
- Configurable GPU type, model size, sample count, epochs
- WandB logging with same tags/config as EC2 benchmark
- Subsample support for quick benchmarks vs full dataset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ming

- Fix `RhoData`: `strip()` + filter blank lines in filelist parsing
  (was treating trailing newlines as empty sample IDs → `.CHGCAR`)
- Add GHA summary step to Modal benchmark workflow (config table +
  results, matching EC2 benchmark format)
- WandB run names include dataset/samples/timestamp for local runs,
  GHA run number for CI runs
- Print parseable `BENCHMARK_*` output for GHA step summary
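The `RhoData` filelist fix in the first bullet amounts to something like the following (hypothetical function name; the actual class method isn't shown here):

```python
def parse_filelist(text: str) -> list[str]:
    """Parse sample IDs from a filelist, stripping whitespace and dropping blank
    lines — so a trailing newline no longer yields an empty ID that turned into
    a bare `.CHGCAR` path."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```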

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>