Initial version porting the eden configs over to the new evo2 recipe #1502

Open
jstjohn wants to merge 3 commits into main from jstjohn/evo2_llama_configs_and_savanna_convert
Conversation

@jstjohn
Collaborator

@jstjohn jstjohn commented Mar 9, 2026

Description

This PR adds Eden (Llama 3.1) model support, Savanna/Vortex checkpoint converters, and a standardized model naming convention to the Megatron Bridge–based Evo2 recipe (bionemo-recipes/recipes/evo2_megatron/).

Eden (Llama 3.1) model support

  • New eden_provider.py defining EdenModelProvider and size-specific subclasses (eden_7b through eden_35b) that inherit from Llama31ModelProvider.
  • train.py now dispatches to gpt_forward_step for Eden models and automatically disables fp32_residual_connection (incompatible with standard TE LayerNormLinear layers — Hyena handles this via manual dtype casting, but GPT/Llama does not).
  • infer.py now initializes ProcessGroupCollection for non-Hyena providers (required by GPTModelProvider.provide()) and uses StaticInferenceContext instead of HyenaInferenceContext for Eden models. The flash_decode attribute is guarded to Hyena-only.
  • predict.py already worked architecture-agnostically via dynamic model loading; no changes required.
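The dispatch and config guard described above can be sketched as follows. This is an illustrative stand-in, not the actual train.py code: `ProviderConfig`, `select_forward_step`, and `sanitize_config` are hypothetical names, and the real recipe works with Megatron Bridge provider classes rather than a plain dataclass.

```python
from dataclasses import dataclass


@dataclass
class ProviderConfig:
    """Minimal stand-in for a model provider config (hypothetical)."""
    architecture: str                      # e.g. "hyena" or "llama"
    fp32_residual_connection: bool = False


def hyena_forward_step(batch):
    # Placeholder for the Hyena-specific forward step.
    return "hyena"


def gpt_forward_step(batch):
    # Placeholder for the standard GPT forward step used by Eden.
    return "gpt"


def select_forward_step(config: ProviderConfig):
    """Eden (Llama 3.1) providers take the GPT forward-step path."""
    if config.architecture == "hyena":
        return hyena_forward_step
    return gpt_forward_step


def sanitize_config(config: ProviderConfig) -> ProviderConfig:
    # fp32_residual_connection is incompatible with standard TE
    # LayerNormLinear layers; Hyena handles the dtype cast manually,
    # GPT/Llama does not, so force the flag off for non-Hyena providers.
    if config.architecture != "hyena":
        config.fp32_residual_connection = False
    return config
```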

Checkpoint converters

  • savanna_to_mbridge.py — converts ARC Savanna .pt checkpoints (local or downloaded from Hugging Face via hf_hub_download) into MBridge distributed checkpoint format.
  • mbridge_to_vortex.py — exports MBridge checkpoints to ARC's single-file Vortex inference format, handling MLP weight splitting, Hyena filter pole/residue computation, and TE layernorm key remapping.
  • Both are registered as console scripts (evo2_convert_savanna_to_mbridge, evo2_export_mbridge_to_vortex).
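Two of the transformations the exporter performs — TE layernorm key remapping and fused-MLP weight splitting — can be illustrated in miniature. The key names and helper functions below are hypothetical; the real mapping between TE and Vortex state-dict layouts is larger, and the splitting operates on tensors rather than lists.

```python
# Hypothetical suffix mapping from TE-style keys to Vortex-style keys.
TE_TO_VORTEX = {
    "mlp.linear_fc1.layer_norm_weight": "pre_mlp_layernorm.weight",
    "self_attention.linear_qkv.layer_norm_weight": "input_layernorm.weight",
}


def remap_keys(state_dict, remap=TE_TO_VORTEX):
    """Rename keys whose suffix matches the map; pass others through."""
    out = {}
    for key, value in state_dict.items():
        for old, new in remap.items():
            if key.endswith(old):
                key = key[: -len(old)] + new
                break
        out[key] = value
    return out


def split_fused_fc1(rows):
    # A fused [gate; up] projection is split into its two halves
    # (shown here on a plain list standing in for the leading tensor dim).
    half = len(rows) // 2
    return rows[:half], rows[half:]
```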

Model naming convention

The previous model size keys (1b, 7b, 40b, 7b_arc_longcontext, …) were ambiguous — 7b referred to Striped Hyena while 7B referred to Llama. This PR replaces them with explicit, architecture-prefixed keys:

  • evo2_* for models matching public ARC checkpoints (e.g. evo2_1b_base, evo2_7b, evo2_40b_base). The _base suffix denotes the 8K-context checkpoint; keys without it denote the 1M-context version.
  • striped_hyena_*_nv for NVIDIA-modified Hyena variants.
  • eden_* for Llama 3.1 variants.
  • Added evo2_20b config based on arcinstitute/savanna_evo2_20b.
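The convention above is mechanical enough to check with a few lines. This sketch is illustrative only: the recipe keeps its model keys in provider registries, not a regex, and the 8K/1M distinction is stated here for the evo2_* family.

```python
import re

# Hypothetical validator for the architecture-prefixed key format.
MODEL_KEY = re.compile(r"^(evo2|striped_hyena|eden)_\d+b\w*$")


def context_length(model_key: str) -> int:
    """Per the evo2_* convention above: `_base` = 8K context, else 1M."""
    if not MODEL_KEY.match(model_key):
        raise ValueError(f"unrecognized model key: {model_key!r}")
    return 8_192 if model_key.endswith("_base") else 1_048_576
```

For example, `context_length("evo2_1b_base")` yields the 8K context and the old ambiguous key `7b` is rejected outright.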

Documentation updates

  • README.md — added model naming convention tables, Vortex export section with round-trip example, updated all CLI examples to new model keys.
  • checkpoint/README.md — updated --model-size documentation.
  • Both Jupyter notebooks (zeroshot_brca1.ipynb, fine-tuning-tutorial.ipynb) — updated MODEL_SIZE and --model-size references.

Usage

Training an Eden model:

torchrun --nproc-per-node 1 --no-python train_evo2 \
  --model-size eden_7b --num-layers 2 --max-steps 5 \
  --mock-data --seq-length 64 --mixed-precision-recipe bf16_mixed \
  --no-activation-checkpointing

Converting Savanna checkpoint to MBridge:

evo2_convert_savanna_to_mbridge \
  --savanna-ckpt-path arcinstitute/savanna_evo2_1b_base \
  --mbridge-ckpt-dir /tmp/mbridge_1b \
  --model-size evo2_1b_base \
  --tokenizer-path tokenizers/nucleotide_fast_tokenizer_256

Exporting MBridge to Vortex:

evo2_export_mbridge_to_vortex \
  --mbridge-ckpt-dir /tmp/mbridge_1b/iter_0000001 \
  --output-path /tmp/evo2_1b_vortex.pt \
  --model-size evo2_1b_base

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from CodeRabbit, comment on the pull request with one of its review commands. See https://docs.coderabbit.ai/reference/review-commands for the full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Signed-off-by: John St. John <jstjohn@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f788ed66-1d69-4368-887a-18890c069374

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

jstjohn added 2 commits March 9, 2026 20:45
Signed-off-by: John St. John <jstjohn@nvidia.com>
…tions

Signed-off-by: John St. John <jstjohn@nvidia.com>