A smart proxy server for Cerebras API with intelligent key rotation, request routing, and API key management.
- 🔄 Smart API Key Rotation - Automatic rotation on rate limits (429) with cooldown tracking
- 🚀 Strategic Routing - Routes large requests (configurable threshold, default 120k tokens) to alternative APIs (Synthetic/Z.ai)
- 🖼️ Vision Model Routing - Automatically routes image requests to Qwen vision model
- ⚡ Fallback on Cooldown - Routes to alternative APIs when all Cerebras keys are rate-limited
- 🔧 Smart Error Handling - Auto-retries with alternative APIs on 400/503 errors and embedded quota errors from Cerebras
- 🔐 Incoming API Key Management - SQLite-based authentication for client requests
- 🛠️ Auto Tool Call Validation - Fixes missing tool responses automatically
- 📝 Request/Response Logging - Optional filesystem logging for debugging
- 📊 Status Monitoring - Built-in `/_status` endpoint
- Clone and configure:

```bash
git clone git@github.com:janfeddersen-wq/glm_awsomify_proxy.git
cd glm_awsomify_proxy
cp .env.example .env
```

- Edit `.env` with your Cerebras API keys:

```bash
CEREBRAS_API_KEYS={"key1":"sk-xxx","key2":"sk-yyy"}
```

- Start the proxy:

```bash
docker-compose up -d
```

The proxy runs at http://localhost:18080.
```bash
pip install -r requirements.txt
export CEREBRAS_API_KEYS='{"key1":"sk-xxx","key2":"sk-yyy"}'
python proxy_server.py
```

Protect your proxy with client authentication using SQLite-based API keys.
Set in `.env`:

```bash
ENABLE_INCOMING_AUTH=true
```

```bash
# Add a new client API key
python manage_keys.py add "Client Name"
# Output: sk-abc123... (give this to your client)

# List all API keys with usage stats
python manage_keys.py list

# Revoke an API key (by API key, ID, or name)
python manage_keys.py revoke sk-abc123...   # by API key
python manage_keys.py revoke 5              # by ID from list output
python manage_keys.py revoke "Client Name"  # by name

# Re-enable a revoked API key (by API key, ID, or name)
python manage_keys.py enable 5              # by ID
python manage_keys.py enable "Client Name"  # by name

# View statistics
python manage_keys.py stats
```

With Docker:

```bash
# Add key
docker-compose exec cerebras-proxy python manage_keys.py add "Client Name"

# List keys
docker-compose exec cerebras-proxy python manage_keys.py list

# Revoke key (by API key, ID, or name)
docker-compose exec cerebras-proxy python manage_keys.py revoke 5
```

Clients must include the API key in requests:
```bash
curl -X POST http://localhost:18080/chat/completions \
  -H "Authorization: Bearer sk-abc123..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.3-70b","messages":[...]}'
```

Large requests are automatically routed to alternative APIs based on a configurable token threshold (default: 120k tokens):
Token Estimation: Uses the Content-Length header with an empirically determined ratio of 4.7 bytes/token, derived from 248 real API request samples. This is fast and reasonably accurate without parsing the request body.
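The estimation step can be sketched as follows (a minimal illustration; the function names and `BYTES_PER_TOKEN` constant are hypothetical, not identifiers from the proxy's source):

```python
# Sketch of Content-Length-based token estimation, assuming the
# empirically derived ratio of ~4.7 bytes per token described above.
BYTES_PER_TOKEN = 4.7
DEFAULT_TOKEN_THRESHOLD = 120_000  # matches the TOKEN_THRESHOLD default

def estimate_tokens(content_length: int) -> int:
    """Estimate token count from the request's Content-Length (bytes)."""
    return round(content_length / BYTES_PER_TOKEN)

def exceeds_threshold(content_length: int,
                      threshold: int = DEFAULT_TOKEN_THRESHOLD) -> bool:
    """True when the estimated size should route to an alternative API."""
    return estimate_tokens(content_length) > threshold

print(estimate_tokens(470))        # 470 bytes ≈ 100 tokens
print(exceeds_threshold(600_000))  # ~128k estimated tokens -> True
```

The point of the design is that a single header read replaces tokenizing (or even JSON-parsing) the body on the hot path.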
- Primary: Synthetic API (`api.synthetic.new`) - Model: `hf:zai-org/GLM-4.6`
- Fallback: Z.ai API (`api.z.ai`) - Model: `glm-4.6`
Set in `.env`:

```bash
SYNTHETIC_API_KEY=sk-your-synthetic-key
ZAI_API_KEY=sk-your-zai-key
```

Normal-sized requests continue using the Cerebras API.
Requests containing images are automatically detected and routed to a vision-capable model.
The proxy scans the messages array for OpenAI-style image content:

```json
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}
    ]
  }]
}
```

When detected, the request is routed to:

- API: Synthetic API (`api.synthetic.new`)
- Model: `hf:Qwen/Qwen3-VL-235B-A22B-Instruct`
Set in `.env`:

```bash
SYNTHETIC_API_KEY=sk-your-synthetic-key
```

```bash
curl -X POST http://localhost:18080/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'
```

The proxy will automatically use the Qwen vision model regardless of the requested model.
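The detection scan described above amounts to a few lines of logic. A rough sketch (the function name is illustrative, not taken from the proxy's source):

```python
def has_image_content(messages: list) -> bool:
    """Return True if any message contains an OpenAI-style image_url part."""
    for message in messages:
        content = message.get("content")
        # Multimodal content is a list of typed parts; a plain string
        # content field never carries images.
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    ],
}]
print(has_image_content(messages))                                 # True
print(has_image_content([{"role": "user", "content": "Hello!"}]))  # False
```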
When all Cerebras API keys are rate-limited, enable automatic fallback to alternative APIs instead of waiting for cooldown.
Set in `.env`:

```bash
FALLBACK_ON_COOLDOWN=true
SYNTHETIC_API_KEY=sk-your-synthetic-key
ZAI_API_KEY=sk-your-zai-key
```

Without Fallback (default):
- All Cerebras keys hit rate limit → Wait 60s for cooldown → Retry
With Fallback enabled:
- Key gets 429/500 → Marked as rate-limited
- All Cerebras keys now rate-limited? → Instantly route to Synthetic API → Falls back to Z.ai if needed → ⚡ No waiting!
Trigger Points:
- Before retry loop: If all keys already rate-limited
- Inside retry loop: After marking a key as rate-limited (429/500), checks if all keys are now exhausted
Use Case: During high-traffic periods when all Cerebras keys are exhausted, this provides faster response times by utilizing alternative APIs instead of waiting for cooldowns.
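The trigger logic above reduces to a simple decision. A sketch with hypothetical names (the real proxy tracks per-key cooldown timestamps internally):

```python
import time

def all_keys_rate_limited(rate_limited_until: dict, now: float = None) -> bool:
    """True when every Cerebras key is still inside its cooldown window."""
    now = time.time() if now is None else now
    return all(until > now for until in rate_limited_until.values())

def choose_backend(rate_limited_until: dict, fallback_on_cooldown: bool) -> str:
    if fallback_on_cooldown and all_keys_rate_limited(rate_limited_until):
        return "synthetic"  # then Z.ai if Synthetic also fails
    return "cerebras"       # rotate keys, or wait out the cooldown

# Both keys cooling down for the next 60s / 30s:
cooldowns = {"key1": time.time() + 60, "key2": time.time() + 30}
print(choose_backend(cooldowns, fallback_on_cooldown=True))   # synthetic
print(choose_backend(cooldowns, fallback_on_cooldown=False))  # cerebras
```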
The proxy automatically routes to alternative APIs when Cerebras encounters certain errors, providing seamless failover without manual intervention.
400 Context Length Exceeded:
- Cerebras returns: `{"code": "context_length_exceeded", "message": "...Current length is 132032 while limit is 131072"}`
- Action: Automatically route to Synthetic API → Falls back to Z.ai if needed
- Benefit: Seamlessly uses higher-capacity alternative APIs when requests exceed Cerebras's context window
503 Service Unavailable:
- Cerebras returns: 503 (service temporarily unavailable)
- Action: Automatically route to Synthetic API → Falls back to Z.ai if needed
- Benefit: Maintains availability during Cerebras downtime or maintenance
Embedded Token Quota Error:
- Cerebras returns: 200 OK with an embedded error in the response body: `{"choices": [{"message": {"content": "API Error: 403 {\"error\":{\"type\":\"new_api_error\",\"message\":\"token quota is not enough, token remain quota: ¥0.155328, need quota: ¥0.162586...\"}}"}}]}`
- Detection: Proxy checks for the "token quota is not enough" pattern in `choices[0].message.content`
- Action: Automatically route to Synthetic API → Falls back to Z.ai if needed
- Benefit: Handles quota exhaustion errors from underlying API providers that Cerebras wraps
Requirements: `SYNTHETIC_API_KEY` and/or `ZAI_API_KEY` must be configured for this error handling to work.
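The embedded-quota check described above can be sketched like this (function name is illustrative; the proxy's actual implementation may differ):

```python
def has_embedded_quota_error(body: dict) -> bool:
    """Detect a 'token quota is not enough' error hidden in a 200 OK body."""
    try:
        content = body["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        return False  # malformed or non-chat body: treat as no quota error
    return isinstance(content, str) and "token quota is not enough" in content

ok = {"choices": [{"message": {"content": "Hello!"}}]}
quota = {"choices": [{"message": {"content":
    'API Error: 403 {"error":{"type":"new_api_error",'
    '"message":"token quota is not enough"}}'}}]}
print(has_embedded_quota_error(ok))     # False
print(has_embedded_quota_error(quota))  # True
```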
| Variable | Default | Description |
|---|---|---|
| `CEREBRAS_API_KEYS` | *required* | JSON object with Cerebras API keys |
| `CEREBRAS_COOLDOWN` | `60` | Cooldown seconds after rate limiting |
| `TOKEN_THRESHOLD` | `120000` | Token threshold for routing to alternative APIs |
| `ENABLE_INCOMING_AUTH` | `false` | Enable client API key authentication |
| `INCOMING_KEY_DB` | `./data/incoming_keys.db` | SQLite database path |
| `SYNTHETIC_API_KEY` | - | API key for Synthetic API |
| `ZAI_API_KEY` | - | API key for Z.ai API |
| `FALLBACK_ON_COOLDOWN` | `false` | Route to alternative APIs when all Cerebras keys are rate-limited |
| `LOG_REQUESTS` | `true` | Enable request/response logging |
| `LOG_DIR` | `./logs` | Directory for log files |
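A minimal sketch of how these variables might be parsed at startup (the `load_config` helper is hypothetical; consult the proxy's own config code for authoritative behavior):

```python
import json

def load_config(env: dict) -> dict:
    """Parse proxy settings from environment-style key/value pairs."""
    keys = json.loads(env.get("CEREBRAS_API_KEYS", "{}"))
    if not keys:
        raise ValueError("CEREBRAS_API_KEYS is required")
    truthy = lambda name, default: env.get(name, default).lower() == "true"
    return {
        "keys": keys,
        "cooldown": int(env.get("CEREBRAS_COOLDOWN", "60")),
        "token_threshold": int(env.get("TOKEN_THRESHOLD", "120000")),
        "enable_incoming_auth": truthy("ENABLE_INCOMING_AUTH", "false"),
        "fallback_on_cooldown": truthy("FALLBACK_ON_COOLDOWN", "false"),
        "log_requests": truthy("LOG_REQUESTS", "true"),
        "log_dir": env.get("LOG_DIR", "./logs"),
    }

cfg = load_config({"CEREBRAS_API_KEYS": '{"key1":"sk-xxx"}'})
print(cfg["cooldown"])      # 60
print(cfg["log_requests"])  # True
```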
Docker volumes automatically persist data:
- `./logs/` - Request/response logs
- `./data/` - SQLite database for API keys
- Sticks with one Cerebras API key until rate limited (429) or error (500)
- Automatically switches to next available key
- Tracks cooldown periods (default 60s)
- Waits for available key instead of failing
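The sticky-rotation behavior above can be sketched as follows (class and method names are hypothetical, not the proxy's actual implementation):

```python
import time

class KeyRotator:
    """Sticky rotation: keep one key until it is rate-limited, then switch."""

    def __init__(self, keys: dict, cooldown: int = 60):
        self.keys = keys                 # name -> API key
        self.cooldown = cooldown         # seconds, per CEREBRAS_COOLDOWN
        self.rate_limited_until = {}     # name -> unix timestamp
        self.current = next(iter(keys))

    def mark_rate_limited(self, name: str) -> None:
        """Record a 429/500 so the key cools down before reuse."""
        self.rate_limited_until[name] = time.time() + self.cooldown

    def available_names(self) -> list:
        now = time.time()
        return [n for n in self.keys
                if self.rate_limited_until.get(n, 0.0) <= now]

    def get_key(self):
        """Return the sticky key, switching only when it is cooling down."""
        available = self.available_names()
        if not available:
            return None  # caller waits for a cooldown to expire, or falls back
        if self.current not in available:
            self.current = available[0]
        return self.keys[self.current]

rotator = KeyRotator({"key1": "sk-xxx", "key2": "sk-yyy"})
print(rotator.get_key())            # sk-xxx (sticks with key1)
rotator.mark_rate_limited("key1")
print(rotator.get_key())            # sk-yyy (switched to key2)
```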
```
Client Request
    ↓
[Verify Incoming API Key] (if ENABLE_INCOMING_AUTH=true)
    ↓
[Estimate Token Count from Content-Length]
    ↓
> TOKEN_THRESHOLD? → Route to Synthetic API → Fails? → Route to Z.ai API
    ↓
[Check for Image Content]
    ↓
Has images? → Route to Synthetic API with Qwen Vision Model
    ↓
< TOKEN_THRESHOLD? → [Check if all Cerebras keys rate-limited]
    ↓                     ↓
    ↓       Yes + FALLBACK_ON_COOLDOWN=true?
    ↓                     ↓
    ↓       Route to Synthetic/Z.ai API
    ↓
    ↓       No or disabled → Route to Cerebras API (with smart rotation/wait)
    ↓                     ↓
    ↓       Returns 400 context_length_exceeded or 503?
    ↓                     ↓
    ↓       Route to Synthetic/Z.ai API
    ↓
[Fix Tool Calls if needed]
    ↓
[Log Request/Response] (if LOG_REQUESTS=true)
    ↓
Return to Client
```
Check proxy status:

```bash
curl http://localhost:18080/_status
```

Response:

```json
{
  "keys": [
    {
      "name": "key1",
      "available": true,
      "rate_limited_for": 0,
      "error_count": 0
    }
  ],
  "current_key": "key1"
}
```

The SQLite database tracks:
- `api_key` - The client API key
- `name` - Descriptive name
- `created_at` - Creation timestamp
- `revoked` - Revoked status
- `last_used_at` - Last request timestamp
- `request_count` - Total requests made
```bash
curl -X POST http://localhost:18080/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

```bash
# 1. Create client API key
python manage_keys.py add "Production Client"
# Output: sk-abc123...

# 2. Client uses the key
curl -X POST http://localhost:18080/chat/completions \
  -H "Authorization: Bearer sk-abc123..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

To rebuild and restart the proxy:

```bash
docker-compose down
docker-compose build --no-cache
docker-compose up -d
```

The database is auto-created on first use of `manage_keys.py`. Ensure the `./data/` directory has write permissions.
Check that the `./logs/` directory exists and is writable, and verify `LOG_REQUESTS=true` in `.env`.
MIT License