# AI-Powered Inference Platform

Deploy OpenAI's GPT-OSS-20B on AWS EC2 with GPU acceleration using llama.cpp.
This project provides a complete deployment solution for serving OpenAI's GPT-OSS-20B model via an OpenAI-compatible API on AWS EC2 infrastructure:
| Model | Purpose | Port |
|---|---|---|
| GPT-OSS-20B | General-purpose reasoning, math, coding, and tool use | 8080 |
The model runs as a systemd service with automatic restart, GPU offloading, and API key authentication.
```
┌─────────────────────────────────────────────────────────────┐
│                     AWS EC2 Instance                        │
│                 (g4dn.xlarge / g5.xlarge)                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────────┐                                   │
│   │   llama-gpt-oss     │  systemd                          │
│   │   (Port 8080)       │  service                          │
│   └────────┬────────────┘                                   │
│            │                                                │
│   ┌────────▼────────────┐                                   │
│   │  llama.cpp Server   │                                   │
│   │  (CUDA accelerated) │                                   │
│   └────────┬────────────┘                                   │
│            │                                                │
│   ┌────────▼────────────┐                                   │
│   │    NVIDIA GPU       │                                   │
│   │   (T4 / A10 / etc)  │                                   │
│   └─────────────────────┘                                   │
│                                                             │
│   Model: /home/ubuntu/llm-deployment/models/                │
│          └── gpt-oss-20b.gguf                               │
└─────────────────────────────────────────────────────────────┘
```
```
Math_Phy_Rooman/
├── setup.sh                   # Main deployment script (root level)
├── config/
│   └── llama.env              # API key configuration
├── llama-gpt-oss.service      # Systemd service for GPT-OSS-20B
├── scripts/
│   ├── download_models.sh     # Model download helper
│   ├── health_check.sh        # Endpoint health verification
│   ├── start.sh               # Start service
│   └── stop.sh                # Stop service
├── llm-deployment/            # Self-contained deployment package
│   ├── setup.sh               # Comprehensive setup script
│   ├── test-api.sh            # API endpoint testing
│   ├── llama-gpt-oss.service  # Systemd service (GPT-OSS-20B)
│   ├── api-key.env            # API key configuration
│   └── README.md              # Deployment-specific docs
└── huggingface.pem            # AWS SSH key (gitignored)
```
| File | Purpose |
|---|---|
| `setup.sh` | Installs dependencies, builds llama.cpp with CUDA, configures the systemd service |
| `llama-gpt-oss.service` | Systemd unit file for the GPT-OSS-20B server |
| `config/llama.env` | Environment file containing `API_KEY` for authentication |
| `scripts/download_models.sh` | Downloads the GGUF model from Hugging Face |
| `llm-deployment/test-api.sh` | Validates that the API endpoint is responding correctly |
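As a rough sketch of what the `llama-gpt-oss.service` unit might contain (binary path, model path, and flags below are assumptions based on the defaults documented later, not copied from the repository):

```ini
[Unit]
Description=llama.cpp server for GPT-OSS-20B
After=network.target

[Service]
# EnvironmentFile supplies API_KEY; paths here are illustrative assumptions
EnvironmentFile=/etc/default/llama-cpp
ExecStart=/home/ubuntu/llama.cpp/build/bin/llama-server \
    --model /home/ubuntu/llm-deployment/models/gpt-oss-20b.gguf \
    --host 0.0.0.0 --port 8080 --ctx-size 8192 -ngl 99 --jinja \
    --api-key ${API_KEY}
Restart=always
User=ubuntu

[Install]
WantedBy=multi-user.target
```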
- GPU Acceleration: Full CUDA support with all model layers offloaded to the GPU (`-ngl 99`)
- OpenAI-Compatible API: Drop-in replacement for OpenAI's `/v1/chat/completions` endpoint
- API Key Authentication: Secures the endpoint with Bearer token authentication
- Auto-Recovery: The systemd service automatically restarts on failure
- Mixture-of-Experts: GPT-OSS-20B activates only 3.6B parameters per token for efficient inference
- 128K Context Support: Native 128K-token context window (default configured to 8192)
- Jinja Templates: Chat template support enabled via the `--jinja` flag
| Specification | Value |
|---|---|
| Developer | OpenAI |
| Total Parameters | 21B |
| Active Parameters | 3.6B per token (MoE) |
| Architecture | Transformer with Mixture-of-Experts |
| Native Quantization | MXFP4 (~4.25 bits/param) |
| Model Size on Disk | ~12-13GB |
| Context Window | Up to 128K tokens |
| License | Apache 2.0 |
GPT-OSS-20B matches or exceeds OpenAI o3-mini on many benchmarks including math (AIME), coding (Codeforces), and general reasoning (MMLU).
- Instance Type: `g4dn.xlarge`, `g5.xlarge`, or similar with an NVIDIA GPU
- AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0 (Ubuntu 20.04/22.04+)
- Storage: Minimum 50 GB for the OS + llama.cpp + model
- Security Group Ports: `22` (SSH), `8080` (GPT-OSS-20B API)
- Build tools: `build-essential`, `cmake`, `git`, `curl`, `wget`, `make`, `g++`
- NVIDIA CUDA Toolkit (if not pre-installed on the AMI)
- llama.cpp (cloned and built automatically)
- Model: `gpt-oss-20b.gguf` (~12-13 GB) from `ggml-org/gpt-oss-20b-GGUF`
```bash
# From your local machine
scp -i huggingface.pem -r llm-deployment ubuntu@<EC2_PUBLIC_IP>:~/

# Connect to the EC2 instance
ssh -i huggingface.pem ubuntu@<EC2_PUBLIC_IP>

# Navigate to the deployment folder
cd ~/llm-deployment

# Fix Windows line endings (if applicable)
sed -i 's/\r$//' setup.sh test-api.sh llama-gpt-oss.service api-key.env

# Make the scripts executable
chmod +x setup.sh test-api.sh

# Run the setup script
sudo ./setup.sh
```

The script will:
- Install build dependencies and CUDA toolkit
- Clone and compile llama.cpp with GPU support
- Verify model file exists
- Configure and enable systemd service
- Optionally start the service
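The line-ending fix matters because a script saved with Windows CRLF endings carries a stray `\r` on every line, which confuses the shell. A self-contained demonstration of the same `sed` repair:

```shell
# Simulate a script saved with CRLF line endings, then repair it
printf 'echo ok\r\n' > /tmp/crlf-demo.sh
sed -i 's/\r$//' /tmp/crlf-demo.sh   # same fix as in the deployment steps
bash /tmp/crlf-demo.sh               # prints "ok" once the \r is stripped
```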
```bash
cd ~/llm-deployment/models
wget -O gpt-oss-20b.gguf https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf
sudo systemctl start llama-gpt-oss
```

| Variable | Location | Description |
|---|---|---|
| `API_KEY` | `/etc/default/llama-cpp` | Bearer token for API authentication |
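The value of `API_KEY` is arbitrary; one common way to generate a strong random key (assuming `openssl` is installed) is:

```shell
# Generate a 64-hex-character random API key
openssl rand -hex 32
```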
To update the API key:

```bash
sudo nano /etc/default/llama-cpp
# Edit: API_KEY=your_new_secure_key
sudo systemctl restart llama-gpt-oss
```

| Parameter | Default | Description |
|---|---|---|
| `--ctx-size` | 8192 | Maximum context window in tokens |
| `-ngl` | 99 | GPU layers to offload (99 = offload all) |
| `--parallel` | 1 | Number of concurrent requests |
| `--host` | 0.0.0.0 | Listen address |
| `--port` | 8080 | Service port |
| `--jinja` | enabled | Chat template support |
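To change one of these parameters without editing the unit file in place, a systemd drop-in can override `ExecStart`. The sketch below writes the drop-in to a temporary directory purely for illustration; on the server it would live at `/etc/systemd/system/llama-gpt-oss.service.d/override.conf`, and the binary and model paths shown are assumptions, not taken from the repository:

```shell
# Sketch: drop-in raising --ctx-size to 32768 (paths are assumptions)
mkdir -p /tmp/llama-override-demo
cat > /tmp/llama-override-demo/override.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/home/ubuntu/llama.cpp/build/bin/llama-server \
    --model /home/ubuntu/llm-deployment/models/gpt-oss-20b.gguf \
    --host 0.0.0.0 --port 8080 -ngl 99 --jinja --ctx-size 32768
EOF
cat /tmp/llama-override-demo/override.conf
```

After copying the file into place, `sudo systemctl daemon-reload && sudo systemctl restart llama-gpt-oss` applies the change.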
```bash
# Check status
sudo systemctl status llama-gpt-oss

# Start/stop/restart
sudo systemctl start llama-gpt-oss
sudo systemctl stop llama-gpt-oss
sudo systemctl restart llama-gpt-oss

# View logs
sudo journalctl -u llama-gpt-oss -f
```

```bash
# Query GPT-OSS-20B
curl -X POST "http://<EC2_PUBLIC_IP>:8080/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is the derivative of x^3?"}],
    "max_tokens": 256
  }'
```

```bash
# Run the automated test
./test-api.sh <YOUR_API_KEY>

# Or run the health check
./scripts/health_check.sh
```

| Specification | Value |
|---|---|
| Backend | llama.cpp |
| API Format | OpenAI /v1/chat/completions |
| GPU Support | NVIDIA CUDA |
| Model | OpenAI GPT-OSS-20B (MoE) |
| Quantization | MXFP4 (native ~4.25-bit) |
| Context Size | 8192 tokens (configurable up to 128K) |
| Process Manager | systemd |
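The completions endpoint returns OpenAI-format JSON, so the assistant's reply can be extracted with `jq` (assuming `jq` is installed); shown here against a canned response rather than a live server:

```shell
# Extract the assistant message from a /v1/chat/completions-style response
resp='{"choices":[{"message":{"role":"assistant","content":"3x^2"}}]}'
echo "$resp" | jq -r '.choices[0].message.content'   # prints 3x^2
```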
- Model: Apache 2.0 (OpenAI GPT-OSS-20B)
- Repository code: no license specified.