JMonde/distributed-gpu-training

Distributed GPU Training System

A distributed neural network training system that scales training across multiple GPUs and machines.

Features

  • Multiple Parallelism Strategies: Data Parallel, Model Parallel, and Pipeline Parallel training
  • Ring AllReduce: Efficient gradient synchronization using the Ring AllReduce algorithm
  • Fault Tolerance: Automatic checkpointing and recovery from failures
  • Real-time Monitoring: Live dashboard with GPU utilization, training metrics, and cluster health
  • Elastic Scaling: Add or remove nodes dynamically
  • Mock Implementation: Full mock database and simulated training for development
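
The Ring AllReduce synchronization listed above can be illustrated with a single-process simulation. This is a sketch of the algorithm's two phases (scatter-reduce, then all-gather), not the repository's implementation; the worker and chunk layout are assumptions made for illustration.

```python
# Single-process sketch of Ring AllReduce: n workers arranged in a ring,
# each holding a gradient vector split into n chunks. Illustrative only --
# not this repository's implementation.

def ring_allreduce(grads):
    """Sum-reduce the workers' gradient vectors; every worker gets the total.

    grads: list of n per-worker vectors, each of length n (one chunk each).
    """
    n = len(grads)
    chunks = [list(g) for g in grads]  # chunks[i][c]: worker i's copy of chunk c

    # Phase 1, scatter-reduce: at step t, worker i sends chunk (i - t) % n to
    # worker (i + 1) % n, which adds it to its own copy. After n - 1 steps,
    # worker i holds the complete sum for chunk (i + 1) % n.
    for t in range(n - 1):
        sends = [(i, (i - t) % n, chunks[i][(i - t) % n]) for i in range(n)]
        for i, c, val in sends:
            chunks[(i + 1) % n][c] += val

    # Phase 2, all-gather: circulate the completed chunks around the ring so
    # every worker ends up with every fully reduced chunk.
    for t in range(n - 1):
        sends = [(i, (i + 1 - t) % n, chunks[i][(i + 1 - t) % n]) for i in range(n)]
        for i, c, val in sends:
            chunks[(i + 1) % n][c] = val
    return chunks

print(ring_allreduce([[1.0, 2.0], [3.0, 4.0]]))  # both workers end with [4.0, 6.0]
```

Each worker transmits roughly 2(n-1)/n of its gradient volume per reduction, so per-worker bandwidth stays nearly constant as the cluster grows; this is why the ring schedule is the standard choice for gradient synchronization.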

Quick Start

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • npm or yarn

Installation

# Install Python dependencies
pip install -r requirements.txt

# Install Node.js dependencies
cd dashboard
npm install

Running with Mock Database

# Set mock database environment variable
export MOCK_DB=true

# Start the API server
python -m uvicorn api.main:app --reload --host 0.0.0.0 --port 8000

# In another terminal, start the dashboard
cd dashboard
npm run dev

Access the Dashboard

Open your browser to http://localhost:3000

API Endpoints

Training Jobs

| Method | Endpoint | Description |
| ------ | -------- | ----------- |
| POST | `/api/v1/training/jobs` | Create new training job |
| GET | `/api/v1/training/jobs` | List all training jobs |
| GET | `/api/v1/training/jobs/:id` | Get job details |
| PUT | `/api/v1/training/jobs/:id/pause` | Pause running job |
| PUT | `/api/v1/training/jobs/:id/resume` | Resume paused job |
| DELETE | `/api/v1/training/jobs/:id` | Cancel training job |
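
With the server from Quick Start running, a job could be created against the first endpoint above. The request fields below (`name`, `strategy`, `num_gpus`, `batch_size`) are illustrative assumptions, not the API's documented schema; consult the route handlers in `api/routes/` for the actual fields.

```python
import json

# Hypothetical body for POST /api/v1/training/jobs. Every field name here is
# an assumption for illustration, not the documented schema.
job_request = {
    "name": "resnet50-baseline",
    "strategy": "data_parallel",   # or "model_parallel" / "pipeline_parallel"
    "num_gpus": 8,
    "batch_size": 256,
}

# With the API server running (see Quick Start), the job could be submitted
# with any HTTP client, e.g.:
#   requests.post("http://localhost:8000/api/v1/training/jobs", json=job_request)
print(json.dumps(job_request, indent=2))
```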

Cluster Management

| Method | Endpoint | Description |
| ------ | -------- | ----------- |
| GET | `/api/v1/cluster/nodes` | List cluster nodes |
| POST | `/api/v1/cluster/nodes` | Add worker node |
| DELETE | `/api/v1/cluster/nodes/:id` | Remove worker node |
| GET | `/api/v1/cluster/gpu-status` | Get all GPU statuses |

Metrics & Monitoring

| Method | Endpoint | Description |
| ------ | -------- | ----------- |
| GET | `/api/v1/metrics/jobs/:id` | Get job metrics |
| GET | `/api/v1/metrics/cluster/summary` | Cluster-wide metrics |
| WS | `/api/v1/metrics/stream/:jobId` | Real-time metrics stream |
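
The WebSocket endpoint pushes metrics as training progresses; a client might decode each frame like the following. The frame's field names are illustrative assumptions, not the server's documented message format.

```python
import json

# Hypothetical frame as it might arrive over /api/v1/metrics/stream/:jobId.
# The field names here are assumptions for illustration only.
frame = '{"job_id": "job-42", "step": 1200, "loss": 0.73, "gpu_util": [0.91, 0.88]}'

def handle_metrics(raw):
    """Decode one streamed frame into a one-line status summary."""
    m = json.loads(raw)
    avg_util = sum(m["gpu_util"]) / len(m["gpu_util"])
    return f'{m["job_id"]} step={m["step"]} loss={m["loss"]:.2f} util={avg_util:.2f}'

print(handle_metrics(frame))
```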

Checkpoints

| Method | Endpoint | Description |
| ------ | -------- | ----------- |
| GET | `/api/v1/checkpoints/jobs/:id` | List job checkpoints |
| DELETE | `/api/v1/checkpoints/:id` | Delete checkpoint |
| POST | `/api/v1/checkpoints/:id/restore` | Restore from checkpoint |
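
The endpoints above rest on periodic snapshots of training state. A minimal sketch of crash-safe checkpointing, assuming a checkpoint carries a step counter plus model state (the repository's actual on-disk format may differ):

```python
import json, os, tempfile

# Minimal crash-safe checkpoint sketch. Assumes a checkpoint is a step
# counter plus model state; the repository's real format may differ.

def save_checkpoint(path, step, state):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)   # atomic rename: a crash never leaves a torn file

def restore_checkpoint(path):
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(path, 1200, {"w": [0.1, 0.2]})
print(restore_checkpoint(path))  # (1200, {'w': [0.1, 0.2]})
```

Writing to a temporary file and then renaming means a worker that dies mid-save can always recover from the previous intact checkpoint.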

Experiments

| Method | Endpoint | Description |
| ------ | -------- | ----------- |
| POST | `/api/v1/experiments` | Create benchmark experiment |
| GET | `/api/v1/experiments` | List experiments |
| POST | `/api/v1/experiments/:id/analyze` | Run analysis on results |

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         User Interface                          │
│                    (CLI / Web Dashboard / API)                  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Training Coordinator                       │
│              (Central Training Manager Service)                 │
└─────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
              ▼               ▼               ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│   Worker Node 1  │ │   Worker Node 2  │ │   Worker Node N  │
│   (GPU x4)       │ │   (GPU x4)       │ │   (GPU xM)       │
└──────────────────┘ └──────────────────┘ └──────────────────┘

Parallelism Strategies

Data Parallelism

Each GPU gets a complete copy of the model and processes a different subset of the batch. Gradients are averaged via AllReduce.

GPU0 → model copy → batch subset 0 → gradients 0
GPU1 → model copy → batch subset 1 → gradients 1
GPU2 → model copy → batch subset 2 → gradients 2
GPU3 → model copy → batch subset 3 → gradients 3
                ↓
         AllReduce(gradients)
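
The flow above can be condensed into a toy single-process sketch. A made-up one-parameter model stands in for the network; only the split / compute / average structure mirrors the real strategy.

```python
# Toy sketch of the data-parallel step shown above. The one-parameter model
# and loss are invented for illustration.

def local_gradients(shard, weight):
    # Toy loss 0.5 * (weight * x)**2 per sample => gradient weight * x * x.
    return sum(weight * x * x for x in shard)

def data_parallel_step(batch, weight, world_size):
    shards = [batch[i::world_size] for i in range(world_size)]  # one shard per GPU
    grads = [local_gradients(s, weight) for s in shards]        # computed in parallel
    return sum(grads) / world_size       # what AllReduce-then-average produces

# Averaged gradient equals the full-batch gradient (15.0) / world size:
print(data_parallel_step([1.0, 2.0, 3.0, 4.0], 0.5, 2))  # 7.5
```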

Model Parallelism

The model is split across multiple GPUs, with each GPU holding a portion of the layers.

GPU0 → layers 0-3
GPU1 → layers 4-7
GPU2 → layers 8-11
GPU3 → layers 12-15
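
The split above can be sketched as a partition of 16 toy layers into 4 contiguous groups, one per "GPU", with activations handed from group to group. The layers are stand-ins invented for illustration.

```python
# Sketch of model parallelism: 16 toy layers partitioned into 4 contiguous
# groups; in a real system each group would live on its own device.

def make_layers(n):
    # Toy layer i just adds i to its input; a stand-in for a real layer.
    return [lambda x, i=i: x + i for i in range(n)]

def partition(layers, num_gpus):
    per_gpu = len(layers) // num_gpus
    return [layers[g * per_gpu:(g + 1) * per_gpu] for g in range(num_gpus)]

def forward(parts, x):
    for part in parts:        # each partition would run on its own GPU;
        for layer in part:    # only activations cross device boundaries
            x = layer(x)
    return x

parts = partition(make_layers(16), 4)
print([len(p) for p in parts])  # [4, 4, 4, 4] -> layers 0-3, 4-7, 8-11, 12-15
print(forward(parts, 0))        # 0 + 1 + ... + 15 = 120
```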

Pipeline Parallelism

Different GPUs process different pipeline stages with the 1F1B (One-Forward-One-Backward) schedule.

GPU0 → Stage 0 (layers 0-3)
GPU1 → Stage 1 (layers 4-7)
GPU2 → Stage 2 (layers 8-11)
GPU3 → Stage 3 (layers 12-15)
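
The per-stage ordering under 1F1B can be generated schematically. This sketch reproduces the textbook schedule shape (warmup forwards, steady-state forward/backward alternation, draining backwards); it is not the repository's scheduler.

```python
# Schematic 1F1B (one-forward-one-backward) order for one pipeline stage.
# Textbook schedule shape only -- not this repository's scheduler.

def one_f_one_b(stage, num_stages, num_microbatches):
    """Return the ("F"/"B", microbatch) ops executed by `stage`, in order."""
    warmup = min(num_stages - 1 - stage, num_microbatches)  # fill the pipeline
    ops = [("F", m) for m in range(warmup)]
    f, b = warmup, 0
    while f < num_microbatches:       # steady state: one forward, one backward
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    while b < num_microbatches:       # drain: remaining backwards
        ops.append(("B", b)); b += 1
    return ops

# Stage 0 of 4 with 6 microbatches: 3 warmup forwards, then F/B alternation,
# then 3 draining backwards.
print(one_f_one_b(0, 4, 6))
```

The last stage (here stage 3) has no warmup and alternates F/B from the first microbatch, which is what bounds the number of in-flight activations each stage must hold.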

Testing

Run API Tests

pytest tests/api/ -v --cov=api --cov-report=html

Run Distributed Training Tests

pytest tests/distributed/ -v

Run Frontend Tests

cd dashboard
npm run test

Project Structure

distributed-gpu-training/
├── api/                    # FastAPI application
│   ├── routes/            # API route handlers
│   ├── middleware/        # Authentication middleware
│   └── websocket/         # WebSocket handlers
├── cluster/               # Cluster management
├── distributed/           # Distributed training core
├── parallelism/           # Parallelism implementations
├── training/              # Training loop and checkpointing
├── monitoring/            # Metrics collection
├── db/                    # Database layer
├── dashboard/             # React frontend
├── tests/                 # Test suites
└── scripts/               # Utility scripts

License

MIT License
