Feature/huggingface migration #181

Open

avantikalal wants to merge 35 commits into main from feature/huggingface-migration

Collaborator

avantikalal commented Mar 4, 2026

Summary

Addresses #180

This PR migrates the gReLU model zoo from Weights & Biases to HuggingFace as the default backend. This is a breaking change for v1.1.0.

Key Changes

New HuggingFace API in grelu.resources:
- list_models() / list_datasets() - browse the model zoo
- load_model(repo_id, filename) - load models from HuggingFace
- download_model() / download_dataset() - download files
- get_model_info() / get_dataset_info() - get metadata including file lists
- get_datasets_by_model() / get_models_by_dataset() / get_base_models() - lineage queries
Legacy wandb support preserved at grelu.resources.wandb

Deprecation errors guide users from the old API to the new one:

# Old API raises DeprecationError with migration instructions
grelu.resources.load_model(project="X", model_name="Y")
# DeprecationError: grelu.resources.load_model() API has changed.
#   - New (HuggingFace): load_model(repo_id='Genentech/X-model', filename='model.ckpt')
#   - Legacy (wandb): use grelu.resources.wandb.load_model(...)
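
The behavior shown above could be implemented roughly as follows. This is a minimal sketch, not the PR's actual code: the `DeprecationError` class definition, the `LEGACY_KWARGS` set, and the placeholder return value are all assumptions made for illustration.

```python
class DeprecationError(Exception):
    """Hypothetical exception type; the class used in the PR may differ."""


# Keyword arguments that only the old wandb-style call accepted (assumed set).
LEGACY_KWARGS = {"project", "model_name"}


def load_model(repo_id=None, filename=None, **kwargs):
    """Reject old-style calls with a migration hint before doing any work."""
    if LEGACY_KWARGS.intersection(kwargs):
        raise DeprecationError(
            "grelu.resources.load_model() API has changed.\n"
            "  - New (HuggingFace): load_model(repo_id='Genentech/X-model', "
            "filename='model.ckpt')\n"
            "  - Legacy (wandb): use grelu.resources.wandb.load_model(...)"
        )
    if kwargs:
        raise TypeError(f"Unexpected keyword arguments: {sorted(kwargs)}")
    if repo_id is None or filename is None:
        raise TypeError("repo_id and filename are required")
    # Placeholder for the real download-and-load step.
    return (repo_id, filename)
```

Checking for the legacy keywords first means an old call fails with instructions rather than with a generic `TypeError` about unexpected arguments.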

Updated pretrained models (BorzoiPretrainedModel, EnformerPretrainedModel) to download from HuggingFace
Modified some tests to use a mock genome file instead of downloading hg38 (which is slow and requires network access).

Migration Guide

  # Before (v1.0)
  import grelu.resources
  model = grelu.resources.load_model(project="human-atac-catlas", model_name="model")

  # After (v1.1.0)
  import grelu.resources
  model = grelu.resources.load_model(repo_id="Genentech/human-atac-catlas-model", filename="model.ckpt")

  # Or use get_model_info() to see available files first
  grelu.resources.get_model_info("Genentech/borzoi-model")["files"]
  # ['human_rep0.ckpt', 'human_rep1.ckpt', ...]
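
For scripts with many old-style calls, a small helper can translate project names. The sketch below assumes the `Genentech/<project>-model` naming pattern seen in the examples above holds for every model, which this PR does not state; `legacy_to_repo_id` is a hypothetical helper, not part of gReLU. Confirm the result against `list_models()` before relying on it.

```python
def legacy_to_repo_id(project: str, org: str = "Genentech") -> str:
    """Map an old wandb project name to a HuggingFace repo ID (assumed pattern)."""
    return f"{org}/{project}-model"


# The one old-to-new pair given in the migration guide.
assert legacy_to_repo_id("human-atac-catlas") == "Genentech/human-atac-catlas-model"
```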

Test Plan

All new HuggingFace resource tests pass
Deprecation error tests verify messages guiding users to the new API
All 8 tutorials run successfully with new API
wandb tests marked with @pytest.mark.wandb (skipped by default)
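
One common way to skip marked tests by default is a pair of hooks in `conftest.py`. The sketch below is an assumption about how this could be wired up, not the PR's actual configuration; the `--run-wandb` flag name and the `wandb_items` helper are hypothetical.

```python
def pytest_addoption(parser):
    # Hypothetical opt-in flag; the PR may use a different mechanism.
    parser.addoption(
        "--run-wandb",
        action="store_true",
        default=False,
        help="run tests marked with @pytest.mark.wandb",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "wandb: test needs the legacy wandb backend")


def wandb_items(items):
    """Return the collected test items that carry the wandb marker."""
    return [item for item in items if item.get_closest_marker("wandb") is not None]


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-wandb"):
        return
    # Imported lazily so wandb_items() can be used without pytest installed.
    import pytest

    skip = pytest.mark.skip(reason="needs --run-wandb")
    for item in wandb_items(items):
        item.add_marker(skip)
```

With this in place, a plain `pytest` run skips the wandb tests and `pytest --run-wandb` includes them.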

Links

https://huggingface.co/collections/Genentech/grelu-model-zoo

avantikalal and others added 30 commits

March 3, 2026 12:47


          Add design doc for HuggingFace model zoo migration

abadcf6

Documents the breaking API changes to migrate gReLU model zoo from
wandb to HuggingFace as the default backend. Key decisions:
- New HuggingFace-native API with full repo IDs
- Legacy wandb functions moved to grelu.resources.wandb
- Lineage functions use HF model card metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          Add implementation plan for HuggingFace migration

8b35d25

11 tasks covering dependency updates, code refactoring, tests,
and documentation updates for the model zoo migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          feat: add huggingface_hub dependency

6d7b4db

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          refactor: extract shared utils to resources/utils.py

571ae03

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          refactor: move wandb functions to resources/wandb.py

e3e7c75

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          test: add failing tests for HuggingFace resource functions

91c4bca

Add TDD-style tests for the new HuggingFace API before implementation.
Tests cover list_models, list_datasets, download_model, download_dataset,
load_model, get_datasets_by_model, get_base_models, and verify utility
functions continue to work. All tests use mocking since functions don't
exist yet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          feat: implement HuggingFace API for model zoo access

38d0d36

Replace wandb-based model zoo functions with HuggingFace Hub API.
This provides a simpler, more standard interface for accessing
gReLU models and datasets hosted on HuggingFace.

New functions:
- list_models(), list_datasets(): List repos in gReLU collection
- download_model(), download_dataset(): Download files from HF
- load_model(): Download and load a LightningModel
- get_model_info(), get_dataset_info(): Get repository metadata
- get_datasets_by_model(), get_base_models(): Parse model card links
- get_models_by_dataset(): Find models using a dataset

Legacy wandb functions remain available via grelu.resources.wandb.
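
As an illustration of the lineage queries, the sketch below reads lineage from model card metadata. It assumes the cards use the standard `datasets` and `base_model` metadata keys; the commit does not say which fields or links are actually parsed, and `get_lineage` is a hypothetical helper.

```python
def _as_list(value):
    """Normalize a metadata field that may be missing, a string, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def get_lineage(card_metadata: dict) -> dict:
    """Extract training datasets and base models from model card metadata."""
    return {
        "datasets": _as_list(card_metadata.get("datasets")),
        "base_models": _as_list(card_metadata.get("base_model")),
    }
```

The input dictionary could be obtained with `huggingface_hub.ModelCard.load(repo_id).data.to_dict()`.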

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          feat: update pretrained models to download from HuggingFace

fc293ce

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          test: mark wandb tests with pytest marker, skip by default

c2b9e3f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          docs: add breaking changes notice for HuggingFace migration

17ef096

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          docs: rewrite model zoo tutorial for HuggingFace API

b17e15d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          docs: update tutorials for new HuggingFace API

adc7bec

Update all tutorials to use the new HuggingFace-based resource functions:
- 1_inference.ipynb: load_model with repo_id for borzoi-model
- 2_finetune.ipynb: download_dataset for tutorial-2-data
- 3_train.ipynb: download_dataset with filename for microglia-scatac data
- 4_design.ipynb: load_model with repo_id for human-atac-catlas-model
- 5_variant.ipynb: load_model and download_dataset for variant tutorial
- 7_simulations.ipynb: load_model for catlas and enformer models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          Add design doc for deprecation messages

8ef94c8

Helpful error messages when users try old wandb-style API,
guiding them to HuggingFace API or legacy wandb submodule.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          Add implementation plan for deprecation messages

c502ae2

4 tasks: tests, stub functions, load_model detection, verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          test: add failing tests for deprecation error messages

76d7783

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          feat: add deprecation stubs for removed wandb functions

6c6e9e1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          feat: add deprecation detection to load_model() for old kwargs

ee57811

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          fix: specify filename for borzoi model in tutorial 1

e2e49c7


          fix: handle strand formats in plot_tracks and fix download_dataset usage in tutorials

2d9de4b

- visualize.py: Fix strand mapping that returned NaN for non-string values.
  Now handles: "+"/"-" strings, 1/-1 integers, "."/"*" (unstranded)
- tutorial 2: Use download_dataset path directly (returns file, not dir)
- tutorial 5: Specify filename parameter for variants.txt
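
The strand handling described above (later reverted in 6c7cfd7 for a separate PR) can be sketched as a lookup table. The output convention of "+", "-", and "." is an assumption, and `normalize_strand` is a hypothetical name rather than the function in visualize.py.

```python
# Accepted inputs mapped to a canonical symbol (assumed output convention).
STRAND_ALIASES = {"+": "+", "-": "-", 1: "+", -1: "-", ".": ".", "*": "."}


def normalize_strand(value):
    """Return '+', '-', or '.' and fail loudly instead of yielding NaN."""
    try:
        return STRAND_ALIASES[value]
    except (KeyError, TypeError):
        raise ValueError(f"Unrecognized strand value: {value!r}") from None
```

Raising on unknown values surfaces bad input immediately, whereas a mapping that silently returns NaN hides the problem until plotting.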

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          docs: update tutorial 1 with successful execution outputs

a1733e4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          docs: change version to v1.1.0 in README

30e79a7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          docs: add link to HuggingFace model zoo in README

5df3d36

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          fix: update HuggingFace collection ID and disable token for public access

0214d3f

- Update DEFAULT_HF_COLLECTION with correct collection slug
- Pass token=False to api.get_collection() to avoid 403 errors when
  user has cached credentials that don't have org access
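
The anonymous-access fix can be sketched as follows. `list_collection_repos` is a hypothetical helper and the collection slug is left as a parameter; only the `get_collection(..., token=False)` call reflects what the commit describes.

```python
def list_collection_repos(api, collection_slug, item_type="model"):
    """Return the repo IDs of one item type from a public collection."""
    # token=False makes the request anonymous, ignoring any cached login.
    collection = api.get_collection(collection_slug, token=False)
    return [
        item.item_id
        for item in collection.items
        if item.item_type == item_type
    ]
```

In practice `api` would be a `huggingface_hub.HfApi()` instance. Passing it in as an argument keeps the helper easy to test without network access.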

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          feat: add files list to get_model_info and get_dataset_info

8f9dcff

Users can now see all available files in a HuggingFace repo when
calling these functions, making it easier to find the right filename
parameter for load_model() or download_dataset().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          reran tutorials

63c09d4


          chore: remove planning docs from repo

8ace7f1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          change description

c0c4632


          revert: remove visualize.py strand fix (will be separate PR)

6c7cfd7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          docs: update test file docstring

4b7c06d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          refactor: remove duplicate utility tests from test_resources_hf.py

These tests are already in test_resources.py where they belong.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

avantikalal and others added 4 commits

March 4, 2026 14:37


          test: replace mock tests with integration tests for HuggingFace API

c5e1cea

- Replace 36 mock-based tests with 12 simple integration tests
- Tests call real HuggingFace API using dedicated test repos:
  - Genentech/test-model for model download/load tests
  - Genentech/test-data for dataset download tests
  - Genentech/human-atac-catlas-model for lineage tests
- Consolidate 8 deprecation tests into 2
- Simpler, more maintainable test code (520 → 139 lines)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          fix: remove wandb dependency from test_models.py

95bd8e5

Remove DEFAULT_WANDB_HOST import and wandb login code that is no longer
needed after HuggingFace migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          fix: use bundled test genome instead of downloading hg38 in CI

c9633eb

Update test_lightning.py to use tests/files/test.fa instead of
downloading hg38 genome. This prevents network failures during
test collection in GitHub Actions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


          fix: use bundled test genome for GC matching test

- Add test_genome.fa with chr10 and chr21 (2000bp each) for GC matching
- Add genome_file property to CustomGenome class for compatibility
- Update test_get_gc_matched_intervals to use bundled test genome
- Avoids network dependency on hg38 download in CI
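
A small FASTA file like the one described above can be generated with the standard library alone. This is a sketch under assumptions: the contents of the real test_genome.fa are not shown in this PR, so the random sequences, fixed seed, and 60-character line width here are illustrative choices.

```python
import random


def write_test_fasta(path, chroms=("chr10", "chr21"), length=2000, seed=0, width=60):
    """Write a tiny FASTA file with one random sequence per chromosome."""
    rng = random.Random(seed)  # fixed seed keeps the file reproducible
    with open(path, "w") as handle:
        for chrom in chroms:
            seq = "".join(rng.choice("ACGT") for _ in range(length))
            handle.write(f">{chrom}\n")
            # Wrap the sequence into fixed-width lines.
            for start in range(0, length, width):
                handle.write(seq[start:start + width] + "\n")
    return path
```

Mixing all four bases gives the GC-matching test regions with varying GC content, which a constant sequence would not.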

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

avantikalal requested review from MuhammedHasan, gokceneraslan, jkanche and suragnair

March 5, 2026 00:20


          fix: add weights_only parameter to wandb.load_model

852e480

Default to False to avoid PyTorch 2.6+ UnpicklingError when loading
checkpoints containing numpy arrays.
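
For context, PyTorch 2.6 changed the default of `weights_only` in `torch.load` from False to True, which rejects pickled objects outside an allowlist. A minimal sketch of the pattern follows; `load_checkpoint` is a hypothetical wrapper, not the signature of `grelu.resources.wandb.load_model`.

```python
def load_checkpoint(path, weights_only=False, map_location="cpu"):
    """Load a checkpoint, defaulting to full unpickling for older files."""
    # Imported lazily so the module can be imported without torch installed.
    import torch

    # weights_only=False runs the full unpickler: use only on trusted files.
    return torch.load(path, map_location=map_location, weights_only=weights_only)
```

Callers who load checkpoints from untrusted sources can pass `weights_only=True` to restore the stricter behavior.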

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
