An OpenAI-compatible inference server optimized for Apple Silicon. Hot-swap between MLX models, stream responses in real-time, process images and PDFs, and integrate with any OpenAI client.
Published MLX models: https://huggingface.co/henrybravo
- 🚀 FastAPI-powered - Modern async API with automatic documentation
- 🔄 Hot-swap between different models without server restart
- ⚡ MLX-optimized performance with GPU acceleration
- 🔒 Automatic memory management and cleanup
- 🎯 Full OpenAI API compatibility - Drop-in replacement for OpenAI endpoints
- ⏱️ Built-in timeout protection and error handling
- 📊 Comprehensive logging system with rotation
- 🏗️ Modular architecture - Clean separation of concerns
- 📖 Interactive API docs - Automatic Swagger/OpenAPI documentation
- 🔧 System service support - Install as macOS launchd service with automatic startup and crash recovery
- 🌊 Response Streaming - Real-time token delivery with Server-Sent Events (90%+ latency reduction)
- 🔧 Function Calling - OpenAI-compatible tool/function calling for agent frameworks
- 👁️ Vision Model Support - Process images and PDFs with multimodal models (mlx-vlm)
MLX Router supports any MLX model that is available locally. Models are automatically discovered from configured directories and can be loaded dynamically without server restart.
Models are loaded from local directories (default: $HOME/models) with automatic configuration detection:
- Automatic Discovery: Models placed in the configured directory are automatically detected
- Dynamic Configuration: Model parameters are extracted from local config.json files
- Hot-Swapping: Switch between models without restarting the server
- Fallback Support: Downloads any HuggingFace model to the custom directory if not found locally
- Cache Management: Existing models in ~/.cache/huggingface/hub can be symlinked to the custom directory
The checked-in config.json is intentionally minimal:
- it keeps the public model path anonymized as /Users/username/models
- it includes one sample model entry for henrybravo/Qwen3.5-35B-A3B-4bit
- it relies on local model discovery from model_directory
Download models into your local model directory with the helper tools, then restart the server and inspect /v1/models to see what is available.
$HOME/models/
├── mlx-community--Llama-3.3-70B-Instruct-4bit/
│   └── snapshots/
│       └── [commit-hash]/
│           ├── config.json
│           ├── tokenizer.json
│           └── model.safetensors
└── custom-models/
    └── my-model/
        ├── config.json
        ├── tokenizer.json
        └── model.safetensors
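The layout above can be scanned with a few lines of Python. This is a minimal sketch, not the router's own discovery code: the rule that turns a folder name into a model ID (strip an optional models-- prefix, replace -- with /) and both function names are assumptions made for illustration.

```python
from pathlib import Path


def model_id_for(folder: Path, model_dir: Path) -> str:
    """Derive a model ID from a folder path (assumed naming rule)."""
    parts = folder.relative_to(model_dir).parts
    if "snapshots" in parts and parts.index("snapshots") > 0:
        # HF-cache style: <org>--<model>/snapshots/<commit-hash>/
        name = parts[parts.index("snapshots") - 1]
        return name.removeprefix("models--").replace("--", "/")
    # Plain directory: use the relative path itself as the ID.
    return "/".join(parts)


def discover_models(model_dir: Path) -> dict[str, Path]:
    """Map model IDs to folders holding config.json plus safetensors weights."""
    found: dict[str, Path] = {}
    for config in sorted(model_dir.rglob("config.json")):
        folder = config.parent
        # Skip the root itself and folders without any weights file.
        if folder == model_dir or not any(folder.glob("*.safetensors")):
            continue
        found[model_id_for(folder, model_dir)] = folder
    return found
```

Calling discover_models(Path.home() / "models") returns a dictionary keyed by the derived IDs, which can be compared against what /v1/models reports.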
Use the --config argument to load additional model configurations from config.json when you want explicit per-model overrides.
You can use the helper tools in the helper_tools/ directory:
- mlx_downloader.py - Download MLX models to custom directories (supports MLX_MODEL_DIR)
- mlx_lmstudio_linker.py - Link models to LM Studio
- mlx_model_parameter_discovery_tool.py - Auto-discover optimal parameters
graph TD
A[User / Agent] -->|HTTP Request| B(FastAPI Server)
B --> C[MLX Router]
C --> D[Model 1]
C --> E[Model 2]
D --> F[Inference Result]
E --> F
F -->|HTTP Response| A
The project features a clean modular architecture:
mlx_router/
├── api                      ## FastAPI application
│   └── app.py               # API endpoint handlers
├── config                   ## Configuration management
│   └── model_config.py      # ModelConfig class
└── core                     ## Core MLX logic
    ├── content.py           # Content normalization utilities
    ├── manager.py           # MLXModelManager class
    ├── patterns.py          # Chat template patterns
    ├── resource_monitor.py  # Memory monitoring
    └── templates.py         # Chat template handling
- MLXModelManager: Central class managing model loading, unloading, and generation with thread-safe operations
- ModelConfig: Configuration management for model-specific parameters
- ResourceMonitor: Apple Silicon memory monitoring and optimization
- FastAPI App: Modern HTTP API with automatic documentation
- config.json: External configuration file for model parameters, defaults, and operational settings
- Model-specific parameters include: max_tokens, temperature, top_p, top_k, min_p, chat_template, required_memory_gb
- Memory pressure-aware token limits for different system states
The router supports multiple chat template formats:
- llama3: Llama 3.x models with <|begin_of_text|> format
- deepseek: DeepSeek models with instruction/response format
- qwen: Qwen models with <|im_start|> format
- phi4: Phi-4 models with <|user|>/<|assistant|> format
- generic: Fallback template
- chatml: OpenAI ChatML format
- gpt-oss: GPT-OSS models with custom format
- Python 3.11+ (tested with 3.11 and 3.13)
- MLX-compatible GPU
- uv package manager (recommended)
pip install mlx-router
Or with uv:
uv pip install mlx-router
Then start the server:
mlx-router --config /path/to/config.json
The checked-in config.json includes one sample model entry for henrybravo/Qwen3.5-35B-A3B-4bit and an anonymized model path (/Users/username/models). Before starting the server:
- download or copy an MLX model into your local model directory (for example models--henrybravo--Qwen3.5-35B-A3B-4bit)
- update model_directory in your config to your real local path
- pass that config file to mlx-router --config
You can use the helper tools in helper_tools/ to prepare local models.
Logs in Option A: when you run mlx-router directly from a shell, logs are written to stdout/stderr in that terminal. If you want persistent logs, redirect them yourself:
mlx-router --config /path/to/config.json 2>&1 | tee mlx-router.log
- Clone the repository:
git clone https://github.com/henrybravo/mlx-router.git
cd mlx-router
- Set up the environment using uv:
pip install uv
uv venv --python 3.11
source .venv/bin/activate
uv pip install .
requirements.txt is deprecated and kept only for temporary compatibility with older manual workflows.
- Start the server:
python main.py --config config.json
For production deployment as a macOS launchd user service that starts automatically at login:
Prerequisites:
- macOS (Darwin-based system)
- Python 3.11+ installed
- uv package manager (optional, but recommended for faster installation)
# Install as a user service (no sudo required)
./install-launchd.sh
# Check service status
launchctl list | grep mlx-router
# View logs
tail -f ~/mlx_router_app/logs/mlx_router.log
Installation Features:
- ✅ Automatic dependency checking: Validates Python version and required files
- ⚡ uv support: Uses uv for faster package installation if available, falls back to pip
- 🔄 Existing service handling: Safely handles reinstallation over existing services
- 🧪 Installation verification: Tests service startup and API availability
- 🔧 Self-healing dependencies: Automatically detects and fixes missing Python packages
- 🎯 Application testing: Verifies main.py imports work before service installation
- 🏠 User-space installation: Installs to ~/mlx_router_app without requiring elevated privileges
Configuration in Production Mode:
The installation script automatically:
- Copies config.json to ~/mlx_router_app/config.json
- Configures the service to use this user-level config file
- Runs the service with --config ~/mlx_router_app/config.json
- Enables automatic restart: the service restarts if the application crashes
- Enables login persistence: the service starts automatically when you log in
Modifying Production Configuration:
# Backup current config (recommended)
cp ~/mlx_router_app/config.json ~/mlx_router_app/config.json.backup
# Edit the production config file
nano ~/mlx_router_app/config.json
# Restart service to apply changes
launchctl unload ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist
launchctl load ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist
# Verify changes in logs
tail -f ~/mlx_router_app/logs/mlx_router.log
Service Management:
# Stop service
launchctl unload ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist
# Start service
launchctl load ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist
# Uninstall service
./uninstall-launchd.sh
Custom Installation Directory:
# Install to a custom directory
INSTALL_DIR=/path/to/custom/dir ./install-launchd.sh
# Or set the environment variable first
export INSTALL_DIR=/path/to/custom/dir
./install-launchd.sh
Optionally, view the help menu first:
python main.py -h
- Start the server:
python main.py --config config.json
Once installed as a user service, MLX Router runs automatically and can be accessed immediately:
# Service runs automatically after installation
# Access API directly
curl -s http://localhost:8800/health | jq
The server will start on http://0.0.0.0:8800 by default.
- Swagger UI: http://localhost:8800/docs
- ReDoc: http://localhost:8800/redoc
- GET /v1/models - List available models with memory requirements
- POST /v1/chat/completions - Generate chat completions (OpenAI compatible)
- GET /health - Server health check
- GET /v1/health - Detailed health metrics
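The chat completions endpoint listed above can also be called from Python without extra dependencies. The sketch below uses only the standard library and assumes the default address http://localhost:8800; the helper names build_chat_request and chat are illustrative and not part of MLX Router.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8800"  # default address from this README


def build_chat_request(model: str, prompt: str, max_tokens: int = 1024) -> urllib.request.Request:
    """Build a non-streaming POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Pass a model ID reported by /v1/models as the first argument to chat. Building the request separately from sending it keeps the payload easy to inspect before any network call is made.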
List Models:
curl -s http://localhost:8800/v1/models | jq
Chat Completion:
curl -s -X POST http://localhost:8800/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/gpt-oss-120b-MXFP4-Q8",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"temperature": 0.7,
"stream": false,
"max_tokens": 1024
}' | jq -r '.choices[0].message.content'
Streaming Response:
# assuming you have "stream": true in config.json
curl -s -X POST http://localhost:8800/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Llama-3.3-70B-Instruct-4bit",
"messages": [
{"role": "user", "content": "Write a short poem about technology"}
]
}'
# Or request streaming explicitly and print only the generated text:
curl -sN -X POST http://localhost:8800/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Llama-3.3-70B-Instruct-4bit",
"messages": [{"role": "user", "content": "Write a short poem about technology"}],
"stream": true
}' | sed -n 's/^data: //p' | grep -v '^\[DONE\]' | jq -rj '.choices[0].delta.content // empty' && echo
Function Calling:
curl -s -X POST http://localhost:8800/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Llama-3.3-70B-Instruct-4bit",
"messages": [
{"role": "user", "content": "What is the weather like in San Francisco?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather information for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"units": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature units"
}
},
"required": ["location"]
}
}
}
],
"max_tokens": 150
}' | jq
Health Check:
curl -s http://localhost:8800/v1/health | jq
MLX Router supports vision/multimodal models for processing images and PDFs using the OpenAI-compatible multimodal content format. This feature uses mlx-vlm for vision model inference.
- Images: PNG, JPEG, WebP, BMP (via base64 data URI)
- PDFs: Automatically converted to images (requires poppler)
For PDF support, install poppler:
# macOS (Homebrew)
brew install poppler
Add "supports_vision": true to your model configuration:
{
"models": {
"mlx-community/chandra-8bit": {
"max_tokens": 8192,
"temp": 0.7,
"chat_template": "generic",
"required_memory_gb": 4,
"supports_tools": false,
"supports_vision": true
}
}
}
# Encode image to base64 and send request
IMAGE_BASE64=$(base64 -i /path/to/image.png)
curl -s -X POST http://localhost:8800/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/chandra-8bit",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,'"$IMAGE_BASE64"'"
}
},
{
"type": "text",
"text": "OCR this image and extract all text"
}
]
}
],
"max_tokens": 2048,
"stream": false
}' | jq -r '.choices[0].message.content'
# Encode PDF to base64 and send request
PDF_BASE64=$(base64 -i /path/to/document.pdf)
curl -s -X POST http://localhost:8800/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/chandra-8bit",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "data:application/pdf;base64,'"$PDF_BASE64"'"
}
},
{
"type": "text",
"text": "OCR this document"
}
]
}
],
"max_tokens": 4096,
"stream": false
}' | jq -r '.choices[0].message.content'
- mlx-community/chandra-8bit - High-accuracy OCR model based on Qwen3-VL, excellent for document text extraction
Other mlx-vlm compatible vision models should also work (e.g., LLaVA, Qwen-VL variants).
Vision features have been tested with:
- curl - Direct API calls (examples above)
- tests/test_vision_model.py - Automated test script for vision model functionality
- OpenWebUI - Web interface with image upload support
- Python requests - Programmatic access via test scripts
For a complete test script, see tests/test_vision_model.py.
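For programmatic access, the multimodal message from the curl examples can be assembled in Python. This is a sketch using only the standard library; the helper names to_data_uri and vision_message are hypothetical, and the MIME type is guessed from the file extension.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw file bytes as a base64 data URI, e.g. data:image/png;base64,..."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f"cannot determine MIME type for {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def vision_message(data: bytes, filename: str, prompt: str) -> dict:
    """Build one user message with an image (or PDF) part followed by a text part."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_uri(data, filename)}},
            {"type": "text", "text": prompt},
        ],
    }
```

The returned dictionary goes into the messages array of a /v1/chat/completions request, with the file bytes read via open(path, "rb").read().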
The config.json file allows you to:
- Set memory pressure thresholds for different system states
- Configure default values and operational settings
- Add optional per-model overrides when auto-discovery is not enough
Development Mode:
- Use local config.json in project directory
- Specify with python main.py --config config.json
Production Mode (User Service):
- Configuration automatically copied to ~/mlx_router_app/config.json
- Service configured to use this user-level config
- Edit with: nano ~/mlx_router_app/config.json
- Restart service after changes to apply modifications
Global settings that apply to all models unless overridden:
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | int | 4096 | Maximum tokens for generation |
| timeout | int | 120 | Generation timeout in seconds |
| cache_size | int | 2 | Number of models to keep cached |
| memory_threshold_gb | float | 80.0 | Memory threshold for pressure calculations |
| safety_margin | float | 1.2 | Multiplier for model memory requirements (1.2 = 20% buffer) |
| swap_critical_percent | float | 90.0 | Swap usage % that triggers critical pressure |
| swap_high_percent | float | 75.0 | Swap usage % that triggers high pressure |
| stream | bool | false | Enable streaming responses by default |
| stream_chunk_size | int | 8 | Number of tokens per streaming chunk |
| streaming_format | string | "sse" | Format: "sse", "json_lines", or "json_array" |
| warmup_tokens | int | 5 | Tokens generated during model warmup |
| enable_function_calling | bool | true | Enable tool/function calling support |
| model | string | - | Optional default model to preload on startup |
| model_directory | string | "$HOME/models" | Path to local model storage |
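As a worked example of the safety_margin multiplier, the sketch below checks whether a model fits into the available memory. How the router applies the margin internally is not shown in this README, so treat the comparison and the function name fits_in_memory as assumptions.

```python
def fits_in_memory(required_memory_gb: float, available_gb: float,
                   safety_margin: float = 1.2) -> bool:
    """Return True if the model plus its safety buffer fits into available memory.

    Assumed rule: required_memory_gb * safety_margin must not exceed available_gb.
    With the default margin of 1.2, a 40 GB model needs 48 GB of headroom.
    """
    needed_gb = required_memory_gb * safety_margin
    return needed_gb <= available_gb
```

Under this reading, a margin below 1.0 (such as the 0.9 used in the example configuration) permits loading a model into slightly less memory than its stated requirement.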
Network and server configuration:
| Parameter | Type | Default | Description |
|---|---|---|---|
| ip | string | "0.0.0.0" | IP address to bind to |
| port | int | 8800 | Port number |
| debug | bool | false | Enable debug logging |
Per-model configuration. Each model entry uses the model ID as key. This section is optional and can be empty if you want pure local auto-discovery.
| Parameter | Type | Required | Description |
|---|---|---|---|
| max_tokens | int | No | Maximum tokens for this model |
| temp | float | No | Temperature (0.0-2.0, lower = more deterministic) |
| top_p | float | No | Nucleus sampling threshold (0.0-1.0) |
| top_k | int | No | Top-k sampling limit |
| min_p | float | No | Minimum probability threshold |
| chat_template | string | Yes | Template format: "llama3", "qwen", "deepseek", "phi4", "chatml", "gpt-oss", "generic" |
| reasoning_response | string | No | "enable" or "disable" - for models with reasoning output (e.g., GPT-OSS, Phi-4) |
| required_memory_gb | float | Yes | RAM required to load the model |
| supports_tools | bool | No | Whether model supports function calling |
| supports_vision | bool | No | Whether model supports image/PDF input (requires mlx-vlm) |
| memory_pressure_max_tokens | object | No | Token limits per pressure level (see below) |
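The constraints in the table above can be checked before starting the server. The following is an illustrative validator, not something shipped with MLX Router; the function name validate_model_entry and the message wording are made up for this sketch.

```python
REQUIRED_KEYS = ("chat_template", "required_memory_gb")
KNOWN_TEMPLATES = {"llama3", "qwen", "deepseek", "phi4", "chatml", "gpt-oss", "generic"}


def validate_model_entry(model_id: str, entry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the entry looks valid."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in entry:
            problems.append(f"{model_id}: missing required key '{key}'")
    template = entry.get("chat_template")
    if template is not None and template not in KNOWN_TEMPLATES:
        problems.append(f"{model_id}: unknown chat_template '{template}'")
    temp = entry.get("temp")
    if temp is not None and not 0.0 <= temp <= 2.0:
        problems.append(f"{model_id}: temp {temp} outside 0.0-2.0")
    top_p = entry.get("top_p")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        problems.append(f"{model_id}: top_p {top_p} outside 0.0-1.0")
    return problems
```

Running it over every entry under "models" in config.json gives a quick report of missing required keys and out-of-range sampling values.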
Dynamic token limits based on system memory pressure:
| Key | Description |
|---|---|
| normal | Token limit when memory is abundant |
| moderate | Token limit under moderate pressure |
| high | Token limit under high pressure |
| critical | Token limit under critical pressure |
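To make the table concrete, here is a small sketch of how a per-request token budget could be derived from these limits. The fallback to the model's max_tokens and the function name effective_max_tokens are assumptions; the router's actual logic may differ.

```python
from typing import Optional

PRESSURE_LEVELS = ("normal", "moderate", "high", "critical")


def effective_max_tokens(model_cfg: dict, pressure: str,
                         requested: Optional[int] = None) -> int:
    """Pick the token cap for the current pressure level, then honor a smaller request."""
    if pressure not in PRESSURE_LEVELS:
        raise ValueError(f"unknown pressure level: {pressure!r}")
    limits = model_cfg.get("memory_pressure_max_tokens", {})
    # Assumed fallback: use the model's max_tokens (or 4096) if no level-specific cap exists.
    cap = limits.get(pressure, model_cfg.get("max_tokens", 4096))
    return cap if requested is None else min(requested, cap)
```

A request that asks for fewer tokens than the cap is left unchanged; a larger request is clipped to the cap for the current pressure level.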
{
"defaults": {
"max_tokens": 8192,
"timeout": 120,
"cache_size": 2,
"memory_threshold_gb": 2.0,
"safety_margin": 0.9,
"swap_critical_percent": 99.0,
"swap_high_percent": 90.0,
"stream": true,
"stream_chunk_size": 32,
"streaming_format": "sse",
"warmup_tokens": 5,
"enable_function_calling": true,
"model_directory": "/Users/username/models"
},
"server": {
"ip": "10.3.2.1",
"port": 8800,
"debug": false
},
"models": {
"henrybravo/Qwen3.5-35B-A3B-4bit": {
"max_tokens": 16384,
"temp": 0.7,
"top_p": 0.9,
"top_k": 40,
"min_p": 0.05,
"chat_template": "qwen",
"required_memory_gb": 20,
"supports_tools": true,
"memory_pressure_max_tokens": {
"normal": 16384,
"moderate": 16384,
"high": 16384,
"critical": 16384
}
}
}
}
{
"models": {
"mlx-community/Qwen3.5-35B-A3B-8bit": {
"max_tokens": 16384,
"temp": 0.7,
"top_p": 0.9,
"top_k": 40,
"min_p": 0.05,
"chat_template": "qwen",
"required_memory_gb": 40,
"supports_tools": true,
"memory_pressure_max_tokens": {
"normal": 12288,
"moderate": 8192,
"high": 8192,
"critical": 4096
}
}
}
}
MLX Router supports three streaming formats for maximum client compatibility:
| Format | Configuration | Best For | Response Type |
|---|---|---|---|
| SSE (default) | "streaming_format": "sse" | curl, Python requests, OpenWebUI, Goose | data: {json}\n\n... |
| JSON Lines | "streaming_format": "json_lines" | Advanced clients | {json}\n{json}\n... |
| JSON Array | "streaming_format": "json_array" | Goose, OpenWebUI | {"object": "chat.completion", "choices": [...]} |
SSE is the OpenAI standard. Most clients (OpenWebUI, the Python openai library, etc.) expect it by default, so I recommend using it unless your client or LLM requires a different format.
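For clients that consume the SSE format directly, the data: lines can be decoded with a short generator. This sketch assumes chunks follow the OpenAI streaming shape (choices[0].delta.content) with a final data: [DONE] line, as the curl pipeline earlier in this README also assumes; iter_sse_content is an illustrative name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by an SSE chat completion stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

The generator accepts any iterable of decoded lines, so it works with an HTTP response read line by line as well as with a captured log of the stream.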
- model_directory: Path to local model storage (default: "$HOME/models")
- Environment Variable: MLX_MODEL_DIR can override the config setting
- Automatic Discovery: Models placed in this directory are automatically detected
- HuggingFace Cache Format: Supports both direct directories and HF cache naming (models--org--model)
- No hardcoded public models required: a minimal config works as long as your local model directory contains valid MLX model folders
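The precedence described above (environment variable first, then config, then the default) can be sketched as follows. The lookup under the "defaults" key matches the example configuration in this README; the function name resolve_model_directory is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_model_directory(config: dict,
                            env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the model directory: MLX_MODEL_DIR, then config, then $HOME/models."""
    env = os.environ if env is None else env
    raw = (
        env.get("MLX_MODEL_DIR")
        or config.get("defaults", {}).get("model_directory")
        or "$HOME/models"
    )
    # Expand $HOME-style variables and a leading ~ before building the path.
    return Path(os.path.expandvars(raw)).expanduser()
```

Passing env explicitly makes the precedence easy to test without touching the real process environment.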
Memory pressure management is a critical feature in MLX Router that enables efficient operation on Apple Silicon systems with varying amounts of RAM. This system prevents system instability by dynamically adjusting model behavior based on current memory usage.
Read more about memory pressure management in the MEMORY_PRESSURE.md document.
Logs are written to both the console and the logs/mlx_router.log file.
When installed as a user service, logs are written to application log files:
- Standard Output: ~/mlx_router_app/logs/mlx_router.log
- Error Output: ~/mlx_router_app/logs/mlx_router.error.log
Log Information:
- Model loading/unloading with timing metrics
- API requests with unique request IDs
- Generation statistics and performance metrics
- Memory pressure monitoring and adjustments
- Comprehensive error tracking with stack traces
View Logs:
# Real-time monitoring
tail -f ~/mlx_router_app/logs/mlx_router.log
# View recent entries
tail -50 ~/mlx_router_app/logs/mlx_router.log
# Check for errors
tail -50 ~/mlx_router_app/logs/mlx_router.error.log
The launchd service is configured with KeepAlive=true, providing robust crash recovery:
- Automatic Restart: If the Python process crashes (malloc errors, segfaults, etc.), launchd automatically restarts the service
- Login Persistence: Service starts automatically when you log in
- Process Monitoring: launchd continuously monitors the process health
- Crash Logging: All crashes and restarts are logged to ~/mlx_router_app/logs/mlx_router.error.log
Monitor Crash Recovery:
# Watch for service restarts in real-time
tail -f ~/mlx_router_app/logs/mlx_router.log | grep -E "(Starting|Stopping|Error)"
# Check crash history
grep -i "crash\|error\|restart" ~/mlx_router_app/logs/mlx_router.error.log
# View service status and restart count
launchctl print gui/$(id -u)/com.henrybravo.mlx-router
Example Crash Recovery Flow:
- Application crashes due to memory error or segfault
- launchd detects process termination
- Service automatically restarts within seconds
- API becomes available again without manual intervention
- Crash details logged for debugging
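A client that talks to the service can ride out such a restart by polling the health endpoint until it answers again. The sketch below assumes the default /health URL on port 8800 and that a healthy server answers with HTTP 200; the names probe_health and wait_until_healthy, and the retry counts, are illustrative.

```python
import time
import urllib.request
from typing import Callable


def probe_health(url: str = "http://localhost:8800/health", timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, connection refused, timeouts
        return False


def wait_until_healthy(probe: Callable[[], bool] = probe_health,
                       attempts: int = 30, delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the probe until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no pause after the final failed attempt
    return False
```

The probe and sleep functions are injectable so the retry loop can be exercised without a running server.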
A comprehensive testing guide is available in INSTALL_TEST.md. For quick verification:
# Test system service installation
./install-launchd.sh
# Verify endpoints are responding
curl -s http://localhost:8800/health | jq
curl -s http://localhost:8800/v1/models | jq
# Test chat completion (replace with a model from /v1/models)
curl -s -X POST http://localhost:8800/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-discovered-model-id",
"messages": [{"role": "user", "content": "What existed first: the chicken or the egg?"}],
"stream": false,
"max_tokens": 1024
}' | jq -r '.choices[0].message.content'
# Check service status
launchctl print gui/$(id -u)/com.henrybravo.mlx-router
# Manual start for debugging
~/mlx_router_app/venv/bin/python \
~/mlx_router_app/main.py \
--config ~/mlx_router_app/config.json
# Restart service
launchctl unload ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist
launchctl load ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist
Installation Issues:
- Python version too old: Install Python 3.11+ before running the install script
- Missing files: Ensure all required files are present in the project directory
- Permission denied: The install script now runs in user space and doesn't require sudo
- uv not found warning: Install uv globally with pip install uv for faster installation (optional)
- Missing dependencies: The install script automatically detects and fixes missing dependencies
- Installation failure: Re-run ./install-launchd.sh - it includes self-healing dependency resolution
Runtime Issues:
- Service won't start: Check error logs and verify Python dependencies
- API not responding: Ensure port 8800 is available and not blocked
- Model loading fails: Verify models exist and have sufficient memory
- Config changes not applied: Restart the service after modifying ~/mlx_router_app/config.json
- Invalid config format: Validate JSON syntax with python -m json.tool ~/mlx_router_app/config.json
- Memory-related crashes: MLX memory errors automatically trigger service restart (production mode)
- Frequent crashes: Check error logs for patterns and ensure sufficient system memory for models
- High memory pressure: Adjust model parameters or reduce max_tokens in config
- stream_chunk_size too large: Reduce stream_chunk_size in config for lower latency
- stream_chunk_size too small: Increase stream_chunk_size for better throughput - Goose does not work well with very small chunks
- LLM behavior issues: Adjust temperature, top_p, and top_k settings in model config or test with different models
- safety_margin: can be 0.2 to 1.5 - higher values use more memory but reduce OOM risk
- memory_threshold_gb: set according to system RAM - lower values trigger pressure adjustments sooner
- swap_critical_percent and swap_high_percent: adjust based on system swap usage patterns, higher values delay pressure triggers
- streaming_format: ensure client compatibility with selected format (sse, json_lines, json_array) but sse is recommended for most clients
- function calling issues: ensure enable_function_calling is true and model supports it
- vision model issues: ensure supports_vision is true and mlx-vlm is installed for image/PDF processing
- poppler not found: install poppler for PDF support (e.g., brew install poppler on macOS)
- model not found: ensure model is downloaded to the configured model directory
- insufficient memory: reduce model size or max_tokens, or increase system RAM/swap
- log files not found: verify installation directory and log paths
- crash recovery not working: ensure service is installed with launchd and check logs for errors
- service restart loops: check error logs for root cause, may indicate insufficient memory or config issues
- warmup tokens causing delays: reduce warmup_tokens in config for faster startup at the cost of initial latency
MLX Router's OpenAI-compatible API enables seamless integration with popular agent frameworks and AI applications. For comprehensive setup guides and examples, see AGENTS_INTEGRATION.md.
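On the agent side, a function-calling response has to be mapped back to local code. The sketch below assumes the assistant message carries tool_calls in the OpenAI format (a function name plus JSON-encoded arguments); the exact shape returned by MLX Router is not shown in this README, and get_weather is a hypothetical stub matching the earlier curl example.

```python
import json
from typing import Any, Callable


def get_weather(location: str, units: str = "celsius") -> dict:
    """Hypothetical tool implementation; a real one would query a weather service."""
    return {"location": location, "units": units, "forecast": "unknown"}


# Registry mapping tool names (as declared in the request) to local handlers.
TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(message: dict) -> list[dict]:
    """Execute each requested tool and return role=tool messages for the follow-up request."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            output = {"error": f"unknown tool {fn['name']}"}
        else:
            args = json.loads(fn.get("arguments") or "{}")
            output = handler(**args)
        results.append({
            "role": "tool",
            "tool_call_id": call.get("id"),
            "content": json.dumps(output),
        })
    return results
```

The returned messages are appended to the conversation, after the assistant message that requested the tools, before the next /v1/chat/completions call.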
Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.
MIT License

