
LLM Inference Deployment

AI-Powered Inference Platform - Deploy OpenAI's GPT-OSS-20B on AWS EC2 with GPU acceleration using llama.cpp.


Project Overview

This project provides a complete deployment solution for serving OpenAI's GPT-OSS-20B model via an OpenAI-compatible API on AWS EC2 infrastructure:

| Model | Purpose | Port |
|---|---|---|
| GPT-OSS-20B | General-purpose reasoning, math, coding, and tool use | 8080 |

The model runs as a systemd service with automatic restart, GPU offloading, and API key authentication.


High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AWS EC2 Instance                         │
│                  (g4dn.xlarge / g5.xlarge)                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│           ┌─────────────────────┐                           │
│           │  llama-gpt-oss      │   systemd                 │
│           │  (Port 8080)        │   service                 │
│           └────────┬────────────┘                           │
│                    │                                        │
│           ┌────────▼────────────┐                           │
│           │   llama.cpp Server  │                           │
│           │   (CUDA accelerated)│                           │
│           └────────┬────────────┘                           │
│                    │                                        │
│           ┌────────▼────────────┐                           │
│           │    NVIDIA GPU       │                           │
│           │   (T4 / A10 / etc)  │                           │
│           └─────────────────────┘                           │
│                                                             │
│  Model:   /home/ubuntu/llm-deployment/models/               │
│           └── gpt-oss-20b.gguf                              │
└─────────────────────────────────────────────────────────────┘

Main Components

Directory Structure

Math_Phy_Rooman/
├── setup.sh                    # Main deployment script (root level)
├── config/
│   └── llama.env               # API key configuration
├── llama-gpt-oss.service       # Systemd service for GPT-OSS-20B
├── scripts/
│   ├── download_models.sh      # Model download helper
│   ├── health_check.sh         # Endpoint health verification
│   ├── start.sh                # Start service
│   └── stop.sh                 # Stop service
├── llm-deployment/             # Self-contained deployment package
│   ├── setup.sh                # Comprehensive setup script
│   ├── test-api.sh             # API endpoint testing
│   ├── llama-gpt-oss.service   # Systemd service (GPT-OSS-20B)
│   ├── api-key.env             # API key configuration
│   └── README.md               # Deployment-specific docs
└── huggingface.pem             # AWS SSH key (gitignored)

Key Files

| File | Purpose |
|---|---|
| setup.sh | Installs dependencies, builds llama.cpp with CUDA, configures the systemd service |
| llama-gpt-oss.service | Systemd unit file for the GPT-OSS-20B server |
| config/llama.env | Environment file containing API_KEY for authentication |
| scripts/download_models.sh | Downloads the GGUF model from Hugging Face |
| llm-deployment/test-api.sh | Validates that the API endpoint is responding correctly |

Key Features & Functionality

  • GPU Acceleration: Full CUDA support with all model layers offloaded to GPU (-ngl 99)
  • OpenAI-Compatible API: Drop-in replacement for OpenAI's /v1/chat/completions endpoint
  • API Key Authentication: Secure endpoint with Bearer token authentication
  • Auto-Recovery: Systemd service automatically restarts on failure
  • Mixture-of-Experts: GPT-OSS-20B activates only 3.6B parameters per token for efficient inference
  • 128K Context Support: Native 128K token context window (default configured to 8192)
  • Jinja Templates: Chat template support enabled via --jinja flag

About GPT-OSS-20B

| Specification | Value |
|---|---|
| Developer | OpenAI |
| Total Parameters | 21B |
| Active Parameters | 3.6B per token (MoE) |
| Architecture | Transformer with Mixture-of-Experts |
| Native Quantization | MXFP4 (~4.25 bits/param) |
| Model Size on Disk | ~12-13 GB |
| Context Window | Up to 128K tokens |
| License | Apache 2.0 |

GPT-OSS-20B matches or exceeds OpenAI o3-mini on many benchmarks including math (AIME), coding (Codeforces), and general reasoning (MMLU).


Prerequisites & Dependencies

AWS Infrastructure

  • Instance Type: g4dn.xlarge, g5.xlarge, or similar with NVIDIA GPU
  • AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0 (Ubuntu 20.04/22.04+)
  • Storage: Minimum 50GB for OS + llama.cpp + model
  • Security Group Ports:
    • 22 (SSH)
    • 8080 (GPT-OSS-20B API)

Software Dependencies (installed by setup.sh)

  • build-essential, cmake, git, curl, wget, make, g++
  • NVIDIA CUDA Toolkit (if not pre-installed on AMI)
  • llama.cpp (cloned and built automatically)

Model File (must be downloaded separately)

  • gpt-oss-20b.gguf — the MXFP4 GGUF (~12-13 GB) from Hugging Face; see Step 4 below


Local Setup & Usage Instructions

Step 1: Transfer Repository to EC2

# From your local machine
scp -i huggingface.pem -r llm-deployment ubuntu@<EC2_PUBLIC_IP>:~/

Step 2: SSH and Prepare Environment

# Connect to EC2 instance
ssh -i huggingface.pem ubuntu@<EC2_PUBLIC_IP>

# Navigate to deployment folder
cd ~/llm-deployment

# Fix Windows line endings (if applicable)
sed -i 's/\r$//' setup.sh test-api.sh llama-gpt-oss.service api-key.env

# Make scripts executable
chmod +x setup.sh test-api.sh

Step 3: Run Setup Script

sudo ./setup.sh

The script will:

  1. Install build dependencies and CUDA toolkit
  2. Clone and compile llama.cpp with GPU support
  3. Verify model file exists
  4. Configure and enable systemd service
  5. Optionally start the service

Step 4: Download Model (if not present)

cd ~/llm-deployment/models
wget -O gpt-oss-20b.gguf https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf
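Because the MXFP4 GGUF is roughly 12-13 GB, a truncated download is a common failure mode. A quick size sanity check can catch this before the service is started; the helper below is a sketch, not part of the repo's scripts:

```shell
# Hypothetical helper: fail if a downloaded file is smaller than expected.
# The MXFP4 GGUF should be roughly 12 GB; anything far smaller is truncated.
check_min_size() {
  local file="$1" min_bytes="$2"
  local actual
  actual=$(stat -c%s "$file")   # GNU stat, as shipped on Ubuntu
  [ "$actual" -ge "$min_bytes" ]
}

# After the wget above:
# check_min_size gpt-oss-20b.gguf $((11 * 1024**3)) || echo "download looks truncated"
```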

Step 5: Start Service

sudo systemctl start llama-gpt-oss

Configuration Options

Environment Variables

| Variable | Location | Description |
|---|---|---|
| API_KEY | /etc/default/llama-cpp | Bearer token for API authentication |

To update the API key:

sudo nano /etc/default/llama-cpp
# Edit: API_KEY=your_new_secure_key
sudo systemctl restart llama-gpt-oss
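The key is an arbitrary shared secret, so any high-entropy string will do. One common way to generate a strong one (an illustration; setup.sh does not do this for you) is:

```shell
# Generate a 64-character hex secret suitable for use as API_KEY.
openssl rand -hex 32
```

Paste the result into `/etc/default/llama-cpp` as `API_KEY=<value>` and restart the service as shown above.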

Service Parameters (in systemd file)

| Parameter | Default | Description |
|---|---|---|
| --ctx-size | 8192 | Maximum context window in tokens |
| -ngl | 99 | Number of model layers offloaded to the GPU (99 = all) |
| --parallel | 1 | Number of concurrent request slots |
| --host | 0.0.0.0 | Listen address |
| --port | 8080 | Service port |
| --jinja | enabled | Chat template support |
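Putting these parameters together, the unit file has roughly the shape below. This is a sketch only: the binary path, model path, and the use of `--api-key` are assumptions about how setup.sh lays things out, so check the actual `llama-gpt-oss.service` in the repo before relying on it.

```ini
# llama-gpt-oss.service (sketch; exact paths depend on setup.sh)
[Unit]
Description=llama.cpp server for GPT-OSS-20B
After=network.target

[Service]
EnvironmentFile=/etc/default/llama-cpp
ExecStart=/home/ubuntu/llama.cpp/build/bin/llama-server \
    --model /home/ubuntu/llm-deployment/models/gpt-oss-20b.gguf \
    --ctx-size 8192 -ngl 99 --parallel 1 \
    --host 0.0.0.0 --port 8080 --jinja \
    --api-key ${API_KEY}
Restart=always
User=ubuntu

[Install]
WantedBy=multi-user.target
```

After editing a unit file, run `sudo systemctl daemon-reload` before restarting the service.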

Common Workflows & Commands

Service Management

# Check status
sudo systemctl status llama-gpt-oss

# Start/Stop/Restart
sudo systemctl start llama-gpt-oss
sudo systemctl stop llama-gpt-oss
sudo systemctl restart llama-gpt-oss

# View logs
sudo journalctl -u llama-gpt-oss -f

API Usage (OpenAI-Compatible)

# Query GPT-OSS-20B
curl -X POST "http://<EC2_PUBLIC_IP>:8080/v1/chat/completions" \
    -H "Authorization: Bearer <YOUR_API_KEY>" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "What is the derivative of x^3?"}],
        "max_tokens": 256
    }'
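The response follows the OpenAI chat-completions JSON shape, so the assistant's text sits at `.choices[0].message.content`. A small helper (hypothetical, and using python3 so it works even where jq is not installed) pulls it out of a piped response:

```shell
# Hypothetical helper: print the assistant's reply from a chat-completions
# response read on stdin (OpenAI-compatible JSON shape).
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example with a canned response of the same shape:
echo '{"choices":[{"message":{"role":"assistant","content":"3x^2"}}]}' | extract_reply
# prints: 3x^2
```

Pipe the curl command above into `extract_reply` to see just the model's answer.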

Testing Endpoint

# Run automated test
./test-api.sh <YOUR_API_KEY>

# Or run health check
./scripts/health_check.sh

Technical Specifications

| Specification | Value |
|---|---|
| Backend | llama.cpp |
| API Format | OpenAI /v1/chat/completions |
| GPU Support | NVIDIA CUDA |
| Model | OpenAI GPT-OSS-20B (MoE) |
| Quantization | MXFP4 (native, ~4.25-bit) |
| Context Size | 8192 tokens (configurable up to 128K) |
| Process Manager | systemd |

License

Model: Apache 2.0 (OpenAI GPT-OSS-20B)


Contributing

Contribution guidelines are not specified in this repository.
