Skip to content

aryateja2106/nested-learning

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

33 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Nested Learning: Implementation from Scratch

Paper PDF License Python Docker LeCoder cGPU

Less Code, More Reproduction โ€” a LeCoder project
Built to learn Google Research's Nested Learning paper end-to-end and invite othersโ€”researchers, developers, product folks, and the simply curiousโ€”to explore, fork, and improve together.

Paper & Blog:


๐Ÿ› ๏ธ Built with LeCoder cGPU

This project was developed and tested using LeCoder cGPU CLI (GitHub)โ€”a production-grade command-line tool for seamless Google Colab GPU access.

Why LeCoder cGPU?
While building this implementation, we needed a robust way to:

  • Run experiments on A100 GPUs without leaving the terminal
  • Manage multiple Colab sessions for parallel experiments
  • Automate workflows with structured JSON output
  • Integrate GPU training into our development workflow

What we built:
LeCoder cGPU provides enterprise-grade features including:

  • ๐Ÿ” Secure OAuth2 authentication
  • ๐Ÿ““ Notebook management via Drive API
  • ๐Ÿš€ Remote code execution with kernel mode
  • ๐Ÿ“Š Execution history and monitoring
  • ๐Ÿ”„ Multi-session support (Colab Pro)
  • ๐Ÿ“ File transfer and synchronization
  • ๐Ÿค– AI agent integration (JSON output)

See it in action:
Check out our Enterprise Experiment Guide to see how we used LeCoder cGPU to run A100-accelerated training experiments with custom CUDA kernels.

Try it yourself:

# Install from npm (published package)
npm install -g lecoder-cgpu

# Authenticate
lecoder-cgpu auth

# Run experiment
./run_lecoder_experiment.sh full

๐ŸŽฏ What is Nested Learning?

In plain English: Nested Learning is a new way of thinking about deep learning models. Instead of viewing neural networks as fixed architectures, it treats them as nested optimization problems where different parts update at different speedsโ€”like having fast, short-term memory and slow, long-term memory working together.

For researchers: Nested Learning (NL) views models as nested, multi-level optimization problems, each with its own "context flow" and update frequency. Key insights:

  • Optimizers as associative memories: Adam, SGD with momentum compress gradients into memory.
  • Uniform architecture: Feedforward networks with different update clocks.
  • Pre-training as in-context learning over long contexts.
  • Continuum Memory System (CMS): Spectrum of fast/slow memories for long-/short-term storage.

๐Ÿš€ Quick Start

Option 1: Docker (Easiest)

# Clone the repository
git clone https://github.com/aryateja2106/nested-learning.git
cd nested-learning

# Run the interactive demo
docker compose up
# Opens at http://localhost:7860

Option 2: UV (Recommended for Development)

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Setup environment
uv venv .venv && source .venv/bin/activate
UV_PYTHON=.venv/bin/python uv pip install --python .venv/bin/python -r requirements.txt

# Run tests
uv run pytest tests/test_components.py

# Launch demo
uv run python demo/app.py

# Train a small model
uv run python train_hope.py --config small --steps 500 --batch-size 8

Option 3: Standard pip

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python demo/app.py

๐Ÿ““ Try the Notebook

Open notebooks/quickstart.ipynb in Jupyter or upload to Google Colab. It runs a quick sanity check in under 2 minutes (works on CPU, faster on GPU).


๐Ÿ—๏ธ Architecture Overview (HOPE)

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    Hope Architecture                         โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚           Self-Modifying Titans                      โ”‚   โ”‚
โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”           โ”‚   โ”‚
โ”‚  โ”‚  โ”‚  M_key   โ”‚  โ”‚  M_value โ”‚  โ”‚ M_memory โ”‚           โ”‚   โ”‚
โ”‚  โ”‚  โ”‚ (adapt)  โ”‚  โ”‚ (adapt)  โ”‚  โ”‚  (adapt) โ”‚           โ”‚   โ”‚
โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜           โ”‚   โ”‚
โ”‚  โ”‚         โ†“ Delta Gradient Descent (DGD)               โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚                           โ†“                                 โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚         Continuum Memory System (CMS)                โ”‚   โ”‚
โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚   โ”‚
โ”‚  โ”‚  โ”‚ MLP^f1  โ”‚โ†’ โ”‚ MLP^f2  โ”‚โ†’ โ”‚ MLP^f3  โ”‚โ†’ โ”‚ MLP^fk  โ”‚ โ”‚   โ”‚
โ”‚  โ”‚  โ”‚ (high)  โ”‚  โ”‚  (mid)  โ”‚  โ”‚  (low)  โ”‚  โ”‚ (lowest)โ”‚ โ”‚   โ”‚
โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Hope combines Self-Modifying Titans (adaptive memory that learns how to learn) with Continuum Memory System (multi-frequency memory blocks).


๐Ÿ“š Key Components

  • Delta Gradient Descent (DGD): Updates weights with adaptive decay tied to current input.
    (W_{t+1} = W_t (I - ฮท'_t x_t x_t^T) - ฮท'_t โˆ‡_y L(W_t; x_t) โŠ— x_t)
  • Continuum Memory System (CMS): Spectrum of MLP blocks with different update frequencies (fast โ†” slow).
  • Multi-scale Momentum Muon (M3): Fast + slow momentum with Newton-Schulz orthogonalization.
  • Self-Modifying Titans: Generates and updates its own memory valuesโ€”meta-learning at inference time.

๐Ÿ”ฌ For Researchers & Developers

Training Presets

# Quick test (runs in minutes)
uv run python train_hope.py --config small  --steps 500  --optimizer adamw

# Balanced run
uv run python train_hope.py --config medium --steps 2000 --optimizer m3

# Full training (GPU recommended)
uv run python train_hope.py --config large  --steps 5000 --optimizer dgd

Using Components

from src.core.optimizers import DeltaGradientDescent, M3Optimizer
from src.core.memory import ContinuumMemorySystem
from src.models.hope import Hope, HopeConfig

# Create model
config = HopeConfig(
    d_model=512,
    num_layers=6,
    cms_num_levels=4,
    cms_base_chunk_size=16
)
model = Hope(config)

# Use novel optimizers
optimizer = M3Optimizer(model.parameters(), lr=1e-4)

โ˜๏ธ Colab GPU via LeCoder cGPU CLI

Production-grade CLI for Colab GPU accessโ€”built alongside this project to enable seamless GPU-accelerated development.

Quick Start

# Install LeCoder cGPU CLI from source
cd lecoder-cgpu && npm install && npm link && cd ..

# Authenticate
lecoder-cgpu auth

# Run enterprise experiment (A100 optimized)
./run_lecoder_experiment.sh train a100 1000

Complete Workflow Example

# 1. Check authentication and GPU availability
lecoder-cgpu status --json

# 2. Create experiment notebook
lecoder-cgpu notebook create "HOPE-Experiment" --template gpu

# 3. Upload project files
lecoder-cgpu copy ./src /content/nested-learning/src

# 4. Run training with structured output
lecoder-cgpu run --json --mode kernel "
import torch
from src.models.hope import Hope, HopeConfig
# ... your code ...
"

# 5. Monitor execution history
lecoder-cgpu logs --stats

# 6. Check GPU utilization
lecoder-cgpu status --json

Enterprise Experiment Suite

Run the complete enterprise continual learning pipeline:

# Quick GPU test and benchmark
./run_lecoder_experiment.sh quick

# Full training experiment (A100 optimized)
./run_lecoder_experiment.sh train a100 1000

# CUDA performance benchmark
./run_lecoder_experiment.sh benchmark

# Complete workflow showcase
./run_lecoder_experiment.sh full

Multi-Session Management (Colab Pro)

# List active sessions
lecoder-cgpu sessions list --stats

# Run parallel experiments
lecoder-cgpu --session <id1> run "python exp1.py"
lecoder-cgpu --session <id2> run "python exp2.py"

# Switch between sessions
lecoder-cgpu sessions switch <session-id>

JSON Output for Automation

# Get structured results for AI agents
lecoder-cgpu run --json --mode kernel "your_code_here"

# Query execution history
lecoder-cgpu logs --stats --json

# Monitor runtime status
lecoder-cgpu status --json

๐Ÿ“š Full Documentation: See LeCoder cGPU Integration Guide for complete workflow, benchmarks, and best practices.

Tested on: A100 (Colab Pro+), T4 (Colab Pro), L4 (Free tier). Small configs run on CPU.


๐Ÿ’ผ Enterprise Use Case: Continual Learning Pipeline

Real-world Application: Customer Intelligence System

This implementation includes a complete enterprise use case demonstrating how HOPE enables continual learning for business applications.

Business Problem

Traditional ML models suffer from catastrophic forgettingโ€”when learning new patterns, they forget previous knowledge. This is critical for:

  • Customer support systems that need to remember previous interactions
  • Market analysis tools that adapt to changing conditions
  • Enterprise AI that learns continuously without expensive retraining

Our Solution

The Enterprise Continual Learning Pipeline (src/experiments/enterprise_pipeline.py) demonstrates:

  1. Long-term Memory: CMS maintains customer pattern memory across different update frequencies
  2. Real-time Adaptation: Self-Modifying Titans adapt to new feedback patterns instantly
  3. No Catastrophic Forgetting: DGD optimizer prevents knowledge loss when learning new segments

Performance Benchmarks (A100)

Metric CPU A100 Speedup
Training throughput ~50 tokens/s ~5,000 tokens/s 100x
Memory update latency ~10ms ~0.1ms 100x
Full training (1000 steps) ~2 hours ~2 minutes 60x

Run Enterprise Experiment

# A100-optimized training with CUDA acceleration
python -m src.experiments.enterprise_pipeline --config a100 --steps 1000

# Or via LeCoder cGPU CLI
./run_lecoder_experiment.sh train a100 1000

See: Enterprise Experiment Guide for complete documentation.


๐Ÿ“‚ Project Structure

src/
  core/              # optimizers (DGD, M3), CMS
  models/            # Titans, Hope architecture
  experiments/       # enterprise pipeline, CUDA kernels
train_hope.py        # training entrypoint with presets (AMP on)
demo/app.py          # Gradio interactive demo
tests/               # unit tests
notebooks/           # quickstart notebook
docs/
  ALGORITHMS.md      # algorithm notes
  LECODER_CGPU_GUIDE.md  # LeCoder cGPU integration guide
run_lecoder_experiment.sh  # enterprise experiment runner
requirements.txt

๐ŸŽ Bonus: The Skill That Built This

This implementation was created using the LeCoder Paper-to-Code Skillโ€”a methodology for AI agents to systematically implement research papers from scratch.

Want to use it? The skill is available in .claude/skills/paper-to-code/. Download it as a ZIP and upload to Claude or other AI agents to implement your own papers.

What's included:

  • Complete workflow: PDF โ†’ Markdown โ†’ Algorithm โ†’ Code โ†’ Test
  • Deep-dive guides on paper analysis and implementation patterns
  • Best practices for packaging and testing

๐Ÿค Contributing

  • Fork, open issues/PRs, or share logs/resultsโ€”all backgrounds welcome.
  • Keep PRs small and include pytest output when touching code paths.
  • Curious how this was built? The skill that created this is includedโ€”check .claude/skills/paper-to-code/.

๐Ÿ™ Acknowledgments

  • Research: "Nested Learning: The Illusion of Deep Learning Architecture" (Behrouz, Razaviyayn, Zhong, Mirrokni).
  • Blog: Google Research introduction (link above).
  • Tools:
    • LeCoder cGPU CLI - Production-grade CLI for Colab GPU access (built alongside this project)
    • cgpu - Original inspiration for Colab-from-terminal access
  • Inspiration: Open-source efforts that make cutting-edge research runnable and teachable.

๐Ÿ”’ Security Note

Past commits contained credentials that are now removed; rotate/regenerate any exposed keys. .gitignore excludes common secret patternsโ€”please keep secrets out of the repo.


๐Ÿ“œ Citation

@inproceedings{behrouz2025nested,
  title={Nested Learning: The Illusion of Deep Learning Architecture},
  author={Behrouz, Ali and Razaviyayn, Meisam and Zhong, Peilin and Mirrokni, Vahab},
  booktitle={NeurIPS},
  year={2025}
}

โญ If this helped you, please star the repo! It helps others discover this implementation and encourages more open-source research reproductions.

About

Open source implementation of google's nested learning paper

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors