Feature/huggingface migration #181

Open
avantikalal wants to merge 35 commits into main from feature/huggingface-migration

@avantikalal avantikalal commented Mar 4, 2026

Summary

Addresses #180

This PR migrates the gReLU model zoo from Weights & Biases to HuggingFace as the default backend. This is a breaking change for v1.1.0.

Key Changes

  • New HuggingFace API in grelu.resources:

    • list_models() / list_datasets() - browse the model zoo
    • load_model(repo_id, filename) - load models from HuggingFace
    • download_model() / download_dataset() - download files
    • get_model_info() / get_dataset_info() - get metadata including file lists
    • get_datasets_by_model() / get_models_by_dataset() / get_base_models() - lineage queries
  • Legacy wandb support preserved at grelu.resources.wandb

  • Deprecation errors guide users from the old API to the new one:

    # Old API raises DeprecationError with migration instructions
    grelu.resources.load_model(project="X", model_name="Y")
    # DeprecationError: grelu.resources.load_model() API has changed.
    #   - New (HuggingFace): load_model(repo_id='Genentech/X-model', filename='model.ckpt')
    #   - Legacy (wandb): use grelu.resources.wandb.load_model(...)
  • Updated pretrained models (BorzoiPretrainedModel, EnformerPretrainedModel) to download from HuggingFace

  • Modified some tests to use a mock genome file instead of downloading hg38 (which is slow and requires network access).
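
As a rough illustration of the deprecation behavior listed above, the sketch below shows one way a function could detect old-style keyword arguments and raise an error carrying the migration message. It is not the code in this PR: `DeprecationError` is defined locally as a stand-in, `load_model_sketch` is a hypothetical name, and the return value is a placeholder for the real download-and-load step.

```python
class DeprecationError(Exception):
    """Stand-in exception type (assumption; the PR's own class may differ)."""


# Keyword names used by the old wandb-style call shown in the PR description.
LEGACY_KWARGS = {"project", "model_name"}


def load_model_sketch(repo_id=None, filename=None, **kwargs):
    """Hypothetical wrapper: reject legacy arguments with a migration hint."""
    if LEGACY_KWARGS & set(kwargs):
        raise DeprecationError(
            "grelu.resources.load_model() API has changed.\n"
            "  - New (HuggingFace): load_model(repo_id='Genentech/X-model', "
            "filename='model.ckpt')\n"
            "  - Legacy (wandb): use grelu.resources.wandb.load_model(...)"
        )
    if kwargs:
        raise TypeError(f"Unexpected arguments: {sorted(kwargs)}")
    if repo_id is None or filename is None:
        raise TypeError("repo_id and filename are required")
    # Placeholder: a real implementation would download and load the checkpoint.
    return repo_id, filename
```

Raising an error rather than a warning makes the break explicit: old call sites fail immediately with instructions instead of silently resolving to a different backend.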

Migration Guide

  # Before (v1.0)
  import grelu.resources
  model = grelu.resources.load_model(project="human-atac-catlas", model_name="model")

  # After (v1.1.0)
  import grelu.resources
  model = grelu.resources.load_model(repo_id="Genentech/human-atac-catlas-model", filename="model.ckpt")

  # Or use get_model_info() to see available files first
  grelu.resources.get_model_info("Genentech/borzoi-model")["files"]
  # ['human_rep0.ckpt', 'human_rep1.ckpt', ...]
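
When a repository holds several checkpoints, it can help to filter the file listing before choosing a `filename`. The helper below is a minimal sketch that assumes the `"files"` entry is a flat list of filename strings, as the example above suggests. `checkpoint_files` is a hypothetical name and the sample listing is illustrative only.

```python
def checkpoint_files(files):
    """Return the .ckpt filenames from a repo file listing, sorted by name."""
    return sorted(name for name in files if name.endswith(".ckpt"))


# Illustrative listing; a real one would come from get_model_info(...)["files"].
sample_files = ["README.md", "human_rep1.ckpt", "human_rep0.ckpt"]
print(checkpoint_files(sample_files))  # ['human_rep0.ckpt', 'human_rep1.ckpt']
```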

Test Plan

  • All new HuggingFace resource tests pass
  • Deprecation error tests verify messages guiding users to the new API
  • All 8 tutorials run successfully with the new API
  • wandb tests marked with @pytest.mark.wandb (skipped by default)
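
One common way to skip marked tests by default is a `conftest.py` hook. The sketch below is an assumption about how that could be wired, not a copy of this PR's configuration: the `--run-wandb` flag and the `should_skip_wandb` helper are hypothetical names.

```python
def should_skip_wandb(marker_names, run_wandb):
    """Skip tests carrying the 'wandb' marker unless explicitly enabled."""
    return "wandb" in marker_names and not run_wandb


def pytest_addoption(parser):
    # --run-wandb is a hypothetical flag name, not taken from the PR.
    parser.addoption(
        "--run-wandb",
        action="store_true",
        default=False,
        help="run tests marked with @pytest.mark.wandb",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "wandb: test requires Weights & Biases access")


def pytest_collection_modifyitems(config, items):
    import pytest  # imported lazily so the helper above works without pytest

    run_wandb = config.getoption("--run-wandb")
    skip = pytest.mark.skip(reason="wandb tests are skipped unless --run-wandb is given")
    for item in items:
        names = {marker.name for marker in item.iter_markers()}
        if should_skip_wandb(names, run_wandb):
            item.add_marker(skip)
```

An alternative with the same effect is to deselect the marker through pytest's `-m "not wandb"` expression in the project's default options.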

avantikalal and others added 30 commits March 3, 2026 12:47
Documents the breaking API changes to migrate gReLU model zoo from
wandb to HuggingFace as the default backend. Key decisions:
- New HuggingFace-native API with full repo IDs
- Legacy wandb functions moved to grelu.resources.wandb
- Lineage functions use HF model card metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
11 tasks covering dependency updates, code refactoring, tests,
and documentation updates for the model zoo migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add TDD-style tests for the new HuggingFace API before implementation.
Tests cover list_models, list_datasets, download_model, download_dataset,
load_model, get_datasets_by_model, get_base_models, and verify utility
functions continue to work. All tests use mocking since functions don't
exist yet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace wandb-based model zoo functions with HuggingFace Hub API.
This provides a simpler, more standard interface for accessing
gReLU models and datasets hosted on HuggingFace.

New functions:
- list_models(), list_datasets(): List repos in gReLU collection
- download_model(), download_dataset(): Download files from HF
- load_model(): Download and load a LightningModel
- get_model_info(), get_dataset_info(): Get repository metadata
- get_datasets_by_model(), get_base_models(): Parse model card links
- get_models_by_dataset(): Find models using a dataset

Legacy wandb functions remain available via grelu.resources.wandb.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
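
The lineage queries described above could be built on model card metadata roughly as follows. This sketch assumes the card's metadata has already been parsed into a dictionary and that lineage is stored under the standard Hub fields `datasets` and `base_model`. The PR may record it differently, and `lineage_from_card` and the example IDs are hypothetical.

```python
def _as_list(value):
    """Normalize a missing, scalar, or list-valued metadata field to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def lineage_from_card(card_metadata):
    """Extract dataset and base-model repo IDs from parsed model card metadata."""
    return {
        "datasets": _as_list(card_metadata.get("datasets")),
        "base_models": _as_list(card_metadata.get("base_model")),
    }


# Hypothetical metadata, for illustration only.
example = {"datasets": ["example-org/example-data"], "base_model": "example-org/base-model"}
print(lineage_from_card(example))
```

Normalizing both fields to lists keeps callers simple, since `base_model` may be either a single string or a list on the Hub.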
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all tutorials to use the new HuggingFace-based resource functions:
- 1_inference.ipynb: load_model with repo_id for borzoi-model
- 2_finetune.ipynb: download_dataset for tutorial-2-data
- 3_train.ipynb: download_dataset with filename for microglia-scatac data
- 4_design.ipynb: load_model with repo_id for human-atac-catlas-model
- 5_variant.ipynb: load_model and download_dataset for variant tutorial
- 7_simulations.ipynb: load_model for catlas and enformer models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Helpful error messages when users try old wandb-style API,
guiding them to HuggingFace API or legacy wandb submodule.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4 tasks: tests, stub functions, load_model detection, verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…age in tutorials

- visualize.py: Fix strand mapping that returned NaN for non-string values.
  Now handles: "+"/"-" strings, 1/-1 integers, "."/"*" (unstranded)
- tutorial 2: Use download_dataset path directly (returns file, not dir)
- tutorial 5: Specify filename parameter for variants.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
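
A strand normalizer covering the encodings listed in the commit above might look like the sketch below. It is not the code in `visualize.py`: the function name, the choice of `"."` as the unstranded output, and the handling of `None`/NaN as unstranded are all assumptions.

```python
import math


def normalize_strand(value):
    """Map mixed strand encodings to '+', '-', or '.' (unstranded)."""
    # Assumption: missing values are treated as unstranded.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "."
    if isinstance(value, str):
        text = value.strip()
        if text in ("+", "1", "+1"):
            return "+"
        if text in ("-", "-1"):
            return "-"
        if text in (".", "*", ""):
            return "."
        raise ValueError(f"Unrecognized strand string: {value!r}")
    # Integer-like encodings (including NumPy integers) compare equal to 1 / -1.
    if value == 1:
        return "+"
    if value == -1:
        return "-"
    raise ValueError(f"Unrecognized strand value: {value!r}")
```

Raising on unknown values, instead of returning NaN, surfaces bad input at the point of conversion and not later in plotting.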
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cess

- Update DEFAULT_HF_COLLECTION with correct collection slug
- Pass token=False to api.get_collection() to avoid 403 errors when
  user has cached credentials that don't have org access

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
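
The anonymous-access fix above can be sketched as a small helper around `HfApi.get_collection`. The `api` object is passed in so the function can be exercised without network access. `repos_in_collection` is a hypothetical name, and the slug in the usage comment is a placeholder, not the project's real collection.

```python
def repos_in_collection(api, collection_slug, item_type="model"):
    """List repo IDs of one type in a Hub collection without sending a token."""
    # token=False asks huggingface_hub not to attach locally cached credentials,
    # so a stale or under-privileged login cannot cause a 403 on a public collection.
    collection = api.get_collection(collection_slug, token=False)
    return [item.item_id for item in collection.items if item.item_type == item_type]


# Usage (requires huggingface_hub and network access):
#   from huggingface_hub import HfApi
#   repos_in_collection(HfApi(), "<collection-slug>", item_type="dataset")
```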
Users can now see all available files in a HuggingFace repo when
calling these functions, making it easier to find the right filename
parameter for load_model() or download_dataset().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These tests are already in test_resources.py where they belong.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
avantikalal and others added 4 commits March 4, 2026 14:37
- Replace 36 mock-based tests with 12 simple integration tests
- Tests call real HuggingFace API using dedicated test repos:
  - Genentech/test-model for model download/load tests
  - Genentech/test-data for dataset download tests
  - Genentech/human-atac-catlas-model for lineage tests
- Consolidate 8 deprecation tests into 2
- Simpler, more maintainable test code (520 → 139 lines)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove DEFAULT_WANDB_HOST import and wandb login code that is no longer
needed after HuggingFace migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update test_lightning.py to use tests/files/test.fa instead of
downloading hg38 genome. This prevents network failures during
test collection in GitHub Actions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add test_genome.fa with chr10 and chr21 (2000bp each) for GC matching
- Add genome_file property to CustomGenome class for compatibility
- Update test_get_gc_matched_intervals to use bundled test genome
- Avoids network dependency on hg38 download in CI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
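
A small bundled genome like the one described above could be generated with a script along these lines. This is only a sketch: the actual contents of `test_genome.fa` are not shown in the PR, so the random sequences, the seed, the line width, and the function name are assumptions.

```python
import random


def write_test_genome(path, chroms=("chr10", "chr21"), length=2000, seed=0, width=60):
    """Write a tiny FASTA file with one random sequence per chromosome."""
    rng = random.Random(seed)  # fixed seed keeps the file reproducible
    with open(path, "w") as handle:
        for chrom in chroms:
            sequence = "".join(rng.choice("ACGT") for _ in range(length))
            handle.write(f">{chrom}\n")
            # Wrap sequence lines at a fixed width, as is conventional for FASTA.
            for start in range(0, length, width):
                handle.write(sequence[start:start + width] + "\n")
    return path
```

Using uniform random bases gives roughly 50% GC content everywhere, which may or may not suit a GC-matching test. A real fixture might deliberately vary GC content across regions.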
Default to False to avoid PyTorch 2.6+ UnpicklingError when loading
checkpoints containing numpy arrays.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>