
LLM Inference Deployment

AI-Powered Inference Platform - Deploy OpenAI's GPT-OSS-20B on AWS EC2 with GPU acceleration using llama.cpp.


Project Overview

This project provides a complete deployment solution for serving OpenAI's GPT-OSS-20B model via an OpenAI-compatible API on AWS EC2 infrastructure:

| Model | Purpose | Port |
|---|---|---|
| GPT-OSS-20B | General-purpose reasoning, math, coding, and tool use | 8080 |

The model runs as a systemd service with automatic restart, GPU offloading, and API key authentication.


High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AWS EC2 Instance                         │
│                  (g4dn.xlarge / g5.xlarge)                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│           ┌─────────────────────┐                           │
│           │  llama-gpt-oss      │   systemd                 │
│           │  (Port 8080)        │   service                 │
│           └────────┬────────────┘                           │
│                    │                                        │
│           ┌────────▼────────────┐                           │
│           │   llama.cpp Server  │                           │
│           │   (CUDA accelerated)│                           │
│           └────────┬────────────┘                           │
│                    │                                        │
│           ┌────────▼────────────┐                           │
│           │    NVIDIA GPU       │                           │
│           │   (T4 / A10 / etc)  │                           │
│           └─────────────────────┘                           │
│                                                             │
│  Model:   /home/ubuntu/llm-deployment/models/               │
│           └── gpt-oss-20b.gguf                              │
└─────────────────────────────────────────────────────────────┘

Main Components

Directory Structure

Math_Phy_Rooman/
├── setup.sh                    # Main deployment script (root level)
├── config/
│   └── llama.env               # API key configuration
├── llama-gpt-oss.service       # Systemd service for GPT-OSS-20B
├── scripts/
│   ├── download_models.sh      # Model download helper
│   ├── health_check.sh         # Endpoint health verification
│   ├── start.sh                # Start service
│   └── stop.sh                 # Stop service
├── llm-deployment/             # Self-contained deployment package
│   ├── setup.sh                # Comprehensive setup script
│   ├── test-api.sh             # API endpoint testing
│   ├── llama-gpt-oss.service   # Systemd service (GPT-OSS-20B)
│   ├── api-key.env             # API key configuration
│   └── README.md               # Deployment-specific docs
└── huggingface.pem             # AWS SSH key (gitignored)

Key Files

| File | Purpose |
|---|---|
| setup.sh | Installs dependencies, builds llama.cpp with CUDA, configures the systemd service |
| llama-gpt-oss.service | Systemd unit file for the GPT-OSS-20B server |
| config/llama.env | Environment file containing API_KEY for authentication |
| scripts/download_models.sh | Downloads the GGUF model from Hugging Face |
| llm-deployment/test-api.sh | Validates that the API endpoint is responding correctly |

Key Features & Functionality

  • GPU Acceleration: Full CUDA support with all model layers offloaded to GPU (-ngl 99)
  • OpenAI-Compatible API: Drop-in replacement for OpenAI's /v1/chat/completions endpoint
  • API Key Authentication: Secure endpoint with Bearer token authentication
  • Auto-Recovery: Systemd service automatically restarts on failure
  • Mixture-of-Experts: GPT-OSS-20B activates only 3.6B parameters per token for efficient inference
  • 128K Context Support: Native 128K token context window (default configured to 8192)
  • Jinja Templates: Chat template support enabled via --jinja flag

About GPT-OSS-20B

| Specification | Value |
|---|---|
| Developer | OpenAI |
| Total Parameters | 21B |
| Active Parameters | 3.6B per token (MoE) |
| Architecture | Transformer with Mixture-of-Experts |
| Native Quantization | MXFP4 (~4.25 bits/param) |
| Model Size on Disk | ~12-13 GB |
| Context Window | Up to 128K tokens |
| License | Apache 2.0 |

GPT-OSS-20B matches or exceeds OpenAI o3-mini on many benchmarks including math (AIME), coding (Codeforces), and general reasoning (MMLU).


Prerequisites & Dependencies

AWS Infrastructure

  • Instance Type: g4dn.xlarge, g5.xlarge, or similar with NVIDIA GPU
  • AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0 (Ubuntu 20.04/22.04+)
  • Storage: Minimum 50GB for OS + llama.cpp + model
  • Security Group Ports:
    • 22 (SSH)
    • 8080 (GPT-OSS-20B API)

Software Dependencies (installed by setup.sh)

  • build-essential, cmake, git, curl, wget, make, g++
  • NVIDIA CUDA Toolkit (if not pre-installed on AMI)
  • llama.cpp (cloned and built automatically)

Model File (must be downloaded separately)

  • gpt-oss-20b.gguf — the MXFP4 GGUF (~12-13 GB) from Hugging Face; see Step 4 below


Local Setup & Usage Instructions

Step 1: Transfer Repository to EC2

# From your local machine
scp -i huggingface.pem -r llm-deployment ubuntu@<EC2_PUBLIC_IP>:~/

Step 2: SSH and Prepare Environment

# Connect to EC2 instance
ssh -i huggingface.pem ubuntu@<EC2_PUBLIC_IP>

# Navigate to deployment folder
cd ~/llm-deployment

# Fix Windows line endings (if applicable)
sed -i 's/\r$//' setup.sh test-api.sh llama-gpt-oss.service api-key.env

# Make scripts executable
chmod +x setup.sh test-api.sh

Step 3: Run Setup Script

sudo ./setup.sh

The script will:

  1. Install build dependencies and CUDA toolkit
  2. Clone and compile llama.cpp with GPU support
  3. Verify model file exists
  4. Configure and enable systemd service
  5. Optionally start the service

Step 4: Download Model (if not present)

cd ~/llm-deployment/models
wget -O gpt-oss-20b.gguf https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf
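Because the MXFP4 GGUF is roughly 12-13 GB, a truncated download is a common failure mode. A quick size sanity check can catch this before the service is started; the helper below is a sketch, not part of the repo's scripts:

```shell
# Hypothetical helper: fail if a downloaded file is smaller than expected.
# The MXFP4 GGUF should be roughly 12 GB; anything far smaller is truncated.
check_min_size() {
  local file="$1" min_bytes="$2"
  local actual
  actual=$(stat -c%s "$file")   # GNU stat, as shipped on Ubuntu
  [ "$actual" -ge "$min_bytes" ]
}

# After the wget above:
# check_min_size gpt-oss-20b.gguf $((11 * 1024**3)) || echo "download looks truncated"
```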

Step 5: Start Service

sudo systemctl start llama-gpt-oss

Configuration Options

Environment Variables

| Variable | Location | Description |
|---|---|---|
| API_KEY | /etc/default/llama-cpp | Bearer token for API authentication |

To update the API key:

sudo nano /etc/default/llama-cpp
# Edit: API_KEY=your_new_secure_key
sudo systemctl restart llama-gpt-oss
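The key is an arbitrary shared secret, so any high-entropy string will do. One common way to generate a strong one (an illustration; setup.sh does not do this for you) is:

```shell
# Generate a 64-character hex secret suitable for use as API_KEY.
openssl rand -hex 32
```

Paste the result into `/etc/default/llama-cpp` as `API_KEY=<value>` and restart the service as shown above.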

Service Parameters (in systemd file)

| Parameter | Default | Description |
|---|---|---|
| --ctx-size | 8192 | Maximum context window in tokens |
| -ngl | 99 | Number of model layers offloaded to the GPU (99 = all) |
| --parallel | 1 | Number of concurrent request slots |
| --host | 0.0.0.0 | Listen address |
| --port | 8080 | Service port |
| --jinja | enabled | Chat template support |
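Putting these parameters together, the unit file has roughly the shape below. This is a sketch only: the binary path, model path, and the use of `--api-key` are assumptions about how setup.sh lays things out, so check the actual `llama-gpt-oss.service` in the repo before relying on it.

```ini
# llama-gpt-oss.service (sketch; exact paths depend on setup.sh)
[Unit]
Description=llama.cpp server for GPT-OSS-20B
After=network.target

[Service]
EnvironmentFile=/etc/default/llama-cpp
ExecStart=/home/ubuntu/llama.cpp/build/bin/llama-server \
    --model /home/ubuntu/llm-deployment/models/gpt-oss-20b.gguf \
    --ctx-size 8192 -ngl 99 --parallel 1 \
    --host 0.0.0.0 --port 8080 --jinja \
    --api-key ${API_KEY}
Restart=always
User=ubuntu

[Install]
WantedBy=multi-user.target
```

After editing a unit file, run `sudo systemctl daemon-reload` before restarting the service.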

Common Workflows & Commands

Service Management

# Check status
sudo systemctl status llama-gpt-oss

# Start/Stop/Restart
sudo systemctl start llama-gpt-oss
sudo systemctl stop llama-gpt-oss
sudo systemctl restart llama-gpt-oss

# View logs
sudo journalctl -u llama-gpt-oss -f

API Usage (OpenAI-Compatible)

# Query GPT-OSS-20B
curl -X POST "http://<EC2_PUBLIC_IP>:8080/v1/chat/completions" \
    -H "Authorization: Bearer <YOUR_API_KEY>" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "What is the derivative of x^3?"}],
        "max_tokens": 256
    }'
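The response follows the OpenAI chat-completions JSON shape, so the assistant's text sits at `.choices[0].message.content`. A small helper (hypothetical, and using python3 so it works even where jq is not installed) pulls it out of a piped response:

```shell
# Hypothetical helper: print the assistant's reply from a chat-completions
# response read on stdin (OpenAI-compatible JSON shape).
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example with a canned response of the same shape:
echo '{"choices":[{"message":{"role":"assistant","content":"3x^2"}}]}' | extract_reply
# prints: 3x^2
```

Pipe the curl command above into `extract_reply` to see just the model's answer.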

Testing Endpoint

# Run automated test
./test-api.sh <YOUR_API_KEY>

# Or run health check
./scripts/health_check.sh

Technical Specifications

| Specification | Value |
|---|---|
| Backend | llama.cpp |
| API Format | OpenAI /v1/chat/completions |
| GPU Support | NVIDIA CUDA |
| Model | OpenAI GPT-OSS-20B (MoE) |
| Quantization | MXFP4 (native, ~4.25-bit) |
| Context Size | 8192 tokens (configurable up to 128K) |
| Process Manager | systemd |

License

Model: Apache 2.0 (OpenAI GPT-OSS-20B)


Contributing

Contribution guidelines are not specified in this repository.
