MLX Router

An OpenAI-compatible inference server optimized for Apple Silicon. Hot-swap between MLX models, stream responses in real-time, process images and PDFs, and integrate with any OpenAI client.

Published MLX models: https://huggingface.co/henrybravo

Features

🚀 FastAPI-powered - Modern async API with automatic documentation
🔄 Hot-swap between different models without server restart
⚡ MLX-optimized performance with GPU acceleration
🔒 Automatic memory management and cleanup
🎯 Full OpenAI API compatibility - Drop-in replacement for OpenAI endpoints
⏱️ Built-in timeout protection and error handling
📊 Comprehensive logging system with rotation
🏗️ Modular architecture - Clean separation of concerns
📖 Interactive API docs - Automatic Swagger/OpenAPI documentation
🔧 System service support - Install as macOS launchd service with automatic startup and crash recovery
🌊 Response Streaming - Real-time token delivery with Server-Sent Events (90%+ latency reduction)
🔧 Function Calling - OpenAI-compatible tool/function calling for agent frameworks
👁️ Vision Model Support - Process images and PDFs with multimodal models (mlx-vlm)

Supported Models

MLX Router supports any MLX model that is available locally. Models are automatically discovered from configured directories and can be loaded dynamically without server restart.

Local Model Directory Support

Models are loaded from local directories (default: $HOME/models) with automatic configuration detection:

Automatic Discovery: Models placed in the configured directory are automatically detected
Dynamic Configuration: Model parameters are extracted from local config.json files
Hot-Swapping: Switch between models without restarting the server
Fallback Support: Downloads any HuggingFace model to the custom directory if not found locally
Cache Management: Existing models in ~/.cache/huggingface/hub can be symlinked to the custom directory

Minimal Public Config

The checked-in config.json is intentionally minimal:

it keeps the public model path anonymized as /Users/username/models
it includes one sample model entry for henrybravo/Qwen3.5-35B-A3B-4bit
it relies on local model discovery from model_directory

Download models into your local model directory with the helper tools, then restart the server and inspect /v1/models to see what is available.

Model Directory Structure

$HOME/models/
├── mlx-community--Llama-3.3-70B-Instruct-4bit/
│   └── snapshots/
│       └── [commit-hash]/
│           ├── config.json
│           ├── tokenizer.json
│           └── model.safetensors
└── custom-models/
    └── my-model/
        ├── config.json
        ├── tokenizer.json
        └── model.safetensors

Use the --config argument to load additional model configurations from config.json when you want explicit per-model overrides.

You can use the helper tools in helper_tools/ directory:

mlx_downloader.py - Download MLX models to custom directories (supports MLX_MODEL_DIR)
mlx_lmstudio_linker.py - Link models to LM Studio
mlx_model_parameter_discovery_tool.py - Auto-discover optimal parameters

Architecture

graph TD
    A[User / Agent] -->|HTTP Request| B(FastAPI Server)
    B --> C[MLX Router]
    C --> D[Model 1]
    C --> E[Model 2]
    D --> F[Inference Result]
    E --> F
    F -->|HTTP Response| A

The project features a clean modular architecture:

Package Structure

mlx_router/
├── api                    ## FastAPI application
│   └── app.py              # API endpoint handlers
├── config                 ## Configuration management 
│   └── model_config.py     # ModelConfig class
└── core                   ## Core MLX logic     
    ├── content.py          # Content normalization utilities
    ├── manager.py          # MLXModelManager class
    ├── patterns.py         # Chat template patterns
    ├── resource_monitor.py # Memory monitoring
    └── templates.py        # Chat template handling

Core Components

MLXModelManager: Central class managing model loading, unloading, and generation with thread-safe operations
ModelConfig: Configuration management for model-specific parameters
ResourceMonitor: Apple Silicon memory monitoring and optimization
FastAPI App: Modern HTTP API with automatic documentation

Configuration System

config.json: External configuration file for model parameters, defaults, and operational settings
Model-specific parameters include: max_tokens, temperature, top_p, top_k, min_p, chat_template, required_memory_gb
Memory pressure-aware token limits for different system states

Chat Template System

The router supports multiple chat template formats:

llama3: Llama 3.x models with <|begin_of_text|> format
deepseek: DeepSeek models with instruction/response format
qwen: Qwen models with <|im_start|> format
phi4: Phi-4 models with <|user|>/<|assistant|> format
generic: Fallback template
chatml: OpenAI ChatML format
gpt-oss: GPT-OSS models with custom format

Prerequisites

Python 3.11+ (tested with 3.11 and 3.13)
MLX-compatible GPU
uv package manager (recommended)

Installation

Option A: Install from PyPI (recommended)

pip install mlx-router

Or with uv:

uv pip install mlx-router

Then start the server:

mlx-router --config /path/to/config.json

The checked-in config.json includes one sample model entry for henrybravo/Qwen3.5-35B-A3B-4bit and an anonymized model path (/Users/username/models). Before starting the server:

download or copy an MLX model into your local model directory (for example models--henrybravo--Qwen3.5-35B-A3B-4bit)
update model_directory in your config to your real local path
pass that config file to mlx-router --config

You can use the helper tools in helper_tools/ to prepare local models.

Logs in Option A: when you run mlx-router directly from a shell, logs are written to stdout/stderr in that terminal. If you want persistent logs, redirect them yourself:

mlx-router --config /path/to/config.json 2>&1 | tee mlx-router.log

Option B: Development Installation

Clone the repository:

git clone https://github.com/henrybravo/mlx-router.git
cd mlx-router

Set up the environment using uv:

pip install uv
uv venv --python 3.11
source .venv/bin/activate
uv pip install .

requirements.txt is deprecated and kept only for temporary compatibility with older manual workflows.

Start the server:

python main.py --config config.json

Option C: System Service Installation (macOS)

For production deployment as a macOS launchd user service that starts automatically at login:

Prerequisites:

macOS (Darwin-based system)
Python 3.11+ installed
uv package manager (optional, but recommended for faster installation)

# Install as a user service (no sudo required)
./install-launchd.sh

# Check service status
launchctl list | grep mlx-router

# View logs
tail -f ~/mlx_router_app/logs/mlx_router.log

Installation Features:

✅ Automatic dependency checking: Validates Python version and required files
⚡ uv support: Uses uv for faster package installation if available, falls back to pip
🔄 Existing service handling: Safely handles reinstallation over existing services
🧪 Installation verification: Tests service startup and API availability
🔧 Self-healing dependencies: Automatically detects and fixes missing Python packages
🎯 Application testing: Verifies main.py imports work before service installation
🏠 User-space installation: Installs to ~/mlx_router_app without requiring elevated privileges

Configuration in Production Mode:

The installation script automatically:

Copies config.json to ~/mlx_router_app/config.json
Configures the service to use this user-level config file
The service runs with --config ~/mlx_router_app/config.json
Enables automatic restart: Service automatically restarts if the application crashes
Login persistence: Service starts automatically when you log in

Modifying Production Configuration:

# Backup current config (recommended)
cp ~/mlx_router_app/config.json ~/mlx_router_app/config.json.backup

# Edit the production config file
nano ~/mlx_router_app/config.json

# Restart service to apply changes
launchctl unload ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist
launchctl load ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist

# Verify changes in logs
tail -f ~/mlx_router_app/logs/mlx-router.log

Service Management:

# Stop service
launchctl unload ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist

# Start service
launchctl load ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist

# Uninstall service
./uninstall-launchd.sh

Custom Installation Directory:

# Install to a custom directory
INSTALL_DIR=/path/to/custom/dir ./install-launchd.sh

# Or set the environment variable first
export INSTALL_DIR=/path/to/custom/dir
./install-launchd.sh

Usage

Development Mode

Optionally: view help menu first:

python main.py -h

Start the server:

python main.py --config config.json

Production Mode (User Service)

Once installed as a user service, MLX Router runs automatically and can be accessed immediately:

# Service runs automatically after installation
# Access API directly
curl -s http://localhost:8800/health | jq

Access Points

The server will start on http://0.0.0.0:8800 by default.

Swagger UI: http://localhost:8800/docs
ReDoc: http://localhost:8800/redoc

API Endpoints

GET /v1/models - List available models with memory requirements
POST /v1/chat/completions - Generate chat completions (OpenAI compatible)
GET /health - Server health check
GET /v1/health - Detailed health metrics

Example API Usage

List Models:

curl -s http://localhost:8800/v1/models | jq

Chat Completion:

curl -s -X POST http://localhost:8800/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-120b-MXFP4-Q8",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "temperature": 0.7,
    "stream": false,
    "max_tokens": 1024
  }' | jq -r '.choices[0].message.content'

Streaming Response:

# asuming you have "stream": true in config.json
curl -s -X POST http://localhost:8800/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.3-70B-Instruct-4bit",
    "messages": [
      {"role": "user", "content": "Write a short poem about technology"}
    ]
  }'

curl -sN -X POST http://localhost:8800/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.3-70B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Write a short poem about technology"}],
    "stream": true
  }' | sed -n 's/^data: //p' | grep -v '^\[DONE\]' | jq -rj '.choices[0].delta.content // empty' && echo

Function Calling:

curl -s -X POST http://localhost:8800/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.3-70B-Instruct-4bit",
    "messages": [
      {"role": "user", "content": "What is the weather like in San Francisco?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather information for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature units"
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "max_tokens": 150
  }' | jq

Health Check:

curl -s http://localhost:8800/v1/health | jq

Vision Model Support

MLX Router supports vision/multimodal models for processing images and PDFs using the OpenAI-compatible multimodal content format. This feature uses mlx-vlm for vision model inference.

Supported Input Types

Images: PNG, JPEG, WebP, BMP (via base64 data URI)
PDFs: Automatically converted to images (requires poppler)

Prerequisites

For PDF support, install poppler:

# macOS (Homebrew)
brew install poppler

Configuration

Add "supports_vision": true to your model configuration:

{
  "models": {
    "mlx-community/chandra-8bit": {
      "max_tokens": 8192,
      "temp": 0.7,
      "chat_template": "generic",
      "required_memory_gb": 4,
      "supports_tools": false,
      "supports_vision": true
    }
  }
}

Example: Image OCR

# Encode image to base64 and send request
IMAGE_BASE64=$(base64 -i /path/to/image.png)
curl -s -X POST http://localhost:8800/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/chandra-8bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/png;base64,'"$IMAGE_BASE64"'"
            }
          },
          {
            "type": "text",
            "text": "OCR this image and extract all text"
          }
        ]
      }
    ],
    "max_tokens": 2048,
    "stream": false
  }' | jq -r '.choices[0].message.content'

Example: PDF OCR

# Encode PDF to base64 and send request
PDF_BASE64=$(base64 -i /path/to/document.pdf)
curl -s -X POST http://localhost:8800/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/chandra-8bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "data:application/pdf;base64,'"$PDF_BASE64"'"
            }
          },
          {
            "type": "text",
            "text": "OCR this document"
          }
        ]
      }
    ],
    "max_tokens": 4096,
    "stream": false
  }' | jq -r '.choices[0].message.content'

Tested Models

mlx-community/chandra-8bit - High-accuracy OCR model based on Qwen3-VL, excellent for document text extraction

Other mlx-vlm compatible vision models should also work (e.g., LLaVA, Qwen-VL variants).

Client Integration

Vision features have been tested with:

curl - Direct API calls (examples above)
tests/test_vision_model.py - Automated test script for vision model functionality
OpenWebUI - Web interface with image upload support
Python requests - Programmatic access via test scripts

For a complete test script, see tests/test_vision_model.py.

Configuration

The config.json file allows you to:

Set memory pressure thresholds for different system states
Configure default values and operational settings
Add optional per-model overrides when auto-discovery is not enough

Configuration Locations

Development Mode:

Use local config.json in project directory
Specify with python main.py --config config.json

Production Mode (User Service):

Configuration automatically copied to ~/mlx_router_app/config.json
Service configured to use this user-level config
Edit with: nano ~/mlx_router_app/config.json
Restart service after changes to apply modifications

Configuration Reference

`defaults` Section

Global settings that apply to all models unless overridden:

Parameter	Type	Default	Description
`max_tokens`	int	4096	Maximum tokens for generation
`timeout`	int	120	Generation timeout in seconds
`cache_size`	int	2	Number of models to keep cached
`memory_threshold_gb`	float	80.0	Memory threshold for pressure calculations
`safety_margin`	float	1.2	Multiplier for model memory requirements (1.2 = 20% buffer)
`swap_critical_percent`	float	90.0	Swap usage % that triggers critical pressure
`swap_high_percent`	float	75.0	Swap usage % that triggers high pressure
`stream`	bool	false	Enable streaming responses by default
`stream_chunk_size`	int	8	Number of tokens per streaming chunk
`streaming_format`	string	"sse"	Format: "sse", "json_lines", or "json_array"
`warmup_tokens`	int	5	Tokens generated during model warmup
`enable_function_calling`	bool	true	Enable tool/function calling support
`model`	string	-	Optional default model to preload on startup
`model_directory`	string	"$HOME/models"	Path to local model storage

`server` Section

Network and server configuration:

Parameter	Type	Default	Description
`ip`	string	"0.0.0.0"	IP address to bind to
`port`	int	8800	Port number
`debug`	bool	false	Enable debug logging

`models` Section

Per-model configuration. Each model entry uses the model ID as key. This section is optional and can be empty if you want pure local auto-discovery.

Parameter	Type	Required	Description
`max_tokens`	int	No	Maximum tokens for this model
`temp`	float	No	Temperature (0.0-2.0, lower = more deterministic)
`top_p`	float	No	Nucleus sampling threshold (0.0-1.0)
`top_k`	int	No	Top-k sampling limit
`min_p`	float	No	Minimum probability threshold
`chat_template`	string	Yes	Template format: "llama3", "qwen", "deepseek", "phi4", "chatml", "gpt-oss", "generic"
`reasoning_response`	string	No	"enable" or "disable" - for models with reasoning output (e.g., GPT-OSS, Phi-4)
`required_memory_gb`	float	Yes	RAM required to load the model
`supports_tools`	bool	No	Whether model supports function calling
`supports_vision`	bool	No	Whether model supports image/PDF input (requires mlx-vlm)
`memory_pressure_max_tokens`	object	No	Token limits per pressure level (see below)

`memory_pressure_max_tokens` Object

Dynamic token limits based on system memory pressure:

Key	Description
`normal`	Token limit when memory is abundant
`moderate`	Token limit under moderate pressure
`high`	Token limit under high pressure
`critical`	Token limit under critical pressure

Minimal Example Configuration

{
  "defaults": {
    "max_tokens": 8192,
    "timeout": 120,
    "cache_size": 2,
    "memory_threshold_gb": 2.0,
    "safety_margin": 0.9,
    "swap_critical_percent": 99.0,
    "swap_high_percent": 90.0,
    "stream": true,
    "stream_chunk_size": 32,
    "streaming_format": "sse",
    "warmup_tokens": 5,
    "enable_function_calling": true,
    "model_directory": "/Users/username/models"
  },
  "server": {
      "ip": "10.3.2.1",
      "port": 8800,
      "debug": false
  },
  "models": {
      "henrybravo/Qwen3.5-35B-A3B-4bit": {
          "max_tokens": 16384,
          "temp": 0.7,
          "top_p": 0.9,
          "top_k": 40,
          "min_p": 0.05,
          "chat_template": "qwen",
          "required_memory_gb": 20,
          "supports_tools": true,
          "memory_pressure_max_tokens": {
              "normal": 16384,
              "moderate": 16384,
              "high": 16384,
              "critical": 16384
          }
      }
  }
}

Optional Per-Model Override Example

{
  "models": {
    "mlx-community/Qwen3.5-35B-A3B-8bit": {
      "max_tokens": 16384,
      "temp": 0.7,
      "top_p": 0.9,
      "top_k": 40,
      "min_p": 0.05,
      "chat_template": "qwen",
      "required_memory_gb": 40,
      "supports_tools": true,
      "memory_pressure_max_tokens": {
        "normal": 12288,
        "moderate": 8192,
        "high": 8192,
        "critical": 4096
      }
    }
  }
}

Streaming Format Configuration

MLX Router supports three streaming formats for maximum client compatibility:

Format	Configuration	Best For	Response Type
SSE (default)	`"streaming_format": "sse"`	curl, Python requests, OpenWebUI, Goose	`data: {json}\n\n...`
JSON Lines	`"streaming_format": "json_lines"`	Advanced clients	`{json}\n{json}\n...`
JSON Array	`"streaming_format": "json_array"`	Goose, OpenWebUI	`{"object": "chat.completion", "choices": [...]}`

SSE is the OpenAI standard - most clients (OpenWebUI, Python openai lib, etc.) expect it by default and I recommend using it unless your client and LLM require a different format.

Model Directory Configuration

model_directory: Path to local model storage (default: "$HOME/models")
Environment Variable: MLX_MODEL_DIR can override the config setting
Automatic Discovery: Models placed in this directory are automatically detected
HuggingFace Cache Format: Supports both direct directories and HF cache naming (models--org--model)
No hardcoded public models required: a minimal config works as long as your local model directory contains valid MLX model folders

Memory Pressure Management

Memory pressure management is a critical feature in MLX Router that enables efficient operation on Apple Silicon systems with varying amounts of RAM. This system prevents system instability by dynamically adjusting model behavior based on current memory usage.

Read more about memory pressure management in the MEMORY_PRESSURE.md document.

Logging

Development Mode

Logs are written to both console and logs/mlx_router.log file.

Production Mode (User Service)

Logs are written to application log files:

Standard Output: ~/mlx_router_app/logs/mlx_router.log
Error Output: ~/mlx_router_app/logs/mlx_router.error.log

Log Information:

Model loading/unloading with timing metrics
API requests with unique request IDs
Generation statistics and performance metrics
Memory pressure monitoring and adjustments
Comprehensive error tracking with stack traces

View Logs:

# Real-time monitoring
tail -f ~/mlx_router_app/logs/mlx_router.log

# View recent entries
tail -50 ~/mlx_router_app/logs/mlx_router.log

# Check for errors
tail -50 ~/mlx_router_app/logs/mlx_router.error.log

Crash Recovery & Reliability

Automatic Restart (Production Mode)

The launchd service is configured with KeepAlive=true, providing robust crash recovery:

Automatic Restart: If the Python process crashes (malloc errors, segfaults, etc.), launchd automatically restarts the service
Login Persistence: Service starts automatically when you log in
Process Monitoring: launchd continuously monitors the process health
Crash Logging: All crashes and restarts are logged to ~/mlx_router_app/logs/mlx_router.error.log

Monitor Crash Recovery:

# Watch for service restarts in real-time
tail -f ~/mlx_router_app/logs/mlx_router.log | grep -E "(Starting|Stopping|Error)"

# Check crash history
grep -i "crash\|error\|restart" ~/mlx_router_app/logs/mlx_router.error.log

# View service status and restart count
launchctl print gui/$(id -u)/com.henrybravo.mlx-router

Example Crash Recovery Flow:

Application crashes due to memory error or segfault
launchd detects process termination
Service automatically restarts within seconds
API becomes available again without manual intervention
Crash details logged for debugging

Installation Testing

A comprehensive testing guide is available in INSTALL_TEST.md. For quick verification:

# Test system service installation
./install-launchd.sh

# Verify endpoints are responding
curl -s http://localhost:8800/health | jq
curl -s http://localhost:8800/v1/models | jq

# Test chat completion (replace with a model from /v1/models)
curl -s -X POST http://localhost:8800/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-discovered-model-id",
    "messages": [{"role": "user", "content": "What existed first: the chicken or the egg?"}],
    "stream": false,
    "max_tokens": 1024
  }' | jq -r '.choices[0].message.content'

Troubleshooting

Service Issues

# Check service status
launchctl print gui/$(id -u)/com.henrybravo.mlx-router

# Manual start for debugging
~/mlx_router_app/venv/bin/python \
  ~/mlx_router_app/main.py \
  --config ~/mlx_router_app/config.json

# Restart service
launchctl unload ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist
launchctl load ~/Library/LaunchAgents/com.henrybravo.mlx-router.plist

Common Issues

Installation Issues:

Python version too old: Install Python 3.11+ before running the install script
Missing files: Ensure all required files are present in the project directory
Permission denied: The install script now runs in user space and doesn't require sudo
uv not found warning: Install uv globally with pip install uv for faster installation (optional)
Missing dependencies: The install script automatically detects and fixes missing dependencies
Installation failure: Re-run ./install-launchd.sh - it includes self-healing dependency resolution

Runtime Issues:

Service won't start: Check error logs and verify Python dependencies
API not responding: Ensure port 8800 is available and not blocked
Model loading fails: Verify models exist and have sufficient memory
Config changes not applied: Restart the service after modifying ~/mlx_router_app/config.json
Invalid config format: Validate JSON syntax with python -m json.tool ~/mlx_router_app/config.json
Memory-related crashes: MLX memory errors automatically trigger service restart (production mode)
Frequent crashes: Check error logs for patterns and ensure sufficient system memory for models
High memory pressure: Adjust model parameters or reduce max_tokens in config
stream chunk_size too large: Reduce stream_chunk_size in config for lower latency
stream chunk_size too small: Increase stream_chunk_size for better throughput - Goose does not work well with very small chunks
LLM behavior issues: Adjust temperature, top_p, and top_k settings in model config or test with different models
safety_margin: can be 0.2 to 1.5 - higher values use more memory but reduce OOM risk
memory_threshold_gb: set according to system RAM - lower values trigger pressure adjustments sooner
swap_critical_percent and swap_high_percent: adjust based on system swap usage patterns, higher values delay pressure triggers
streaming_format: ensure client compatibility with selected format (sse, json_lines, json_array) but sse is recommended for most clients
function calling issues: ensure enable_function_calling is true and model supports it
vision model issues: ensure supports_vision is true and mlx-vlm is installed for image/PDF processing
poppler not found: install poppler for PDF support (e.g., brew install poppler on macOS)
model not found: ensure model is downloaded to the configured model directory
insufficient memory: reduce model size or max_tokens, or increase system RAM/swap
log files not found: verify installation directory and log paths
crash recovery not working: ensure service is installed with launchd and check logs for errors
service restart loops: check error logs for root cause, may indicate insufficient memory or config issues
warmup tokens causing delays: reduce warmup_tokens in config for faster startup at the cost of initial latency

Agent Framework Integration

MLX Router's OpenAI-compatible API enables seamless integration with popular agent frameworks and AI applications: for comprehensive setup guides and examples, see AGENTS_INTEGRATION.md

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
.github/workflows		.github/workflows
agents		agents
docs		docs
helper_tools		helper_tools
images		images
logs		logs
mlx_router		mlx_router
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
com.henrybravo.mlx-router.plist		com.henrybravo.mlx-router.plist
config.json		config.json
install-launchd.sh		install-launchd.sh
main.py		main.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
uninstall-launchd.sh		uninstall-launchd.sh

Folders and files

Latest commit

History

Repository files navigation

MLX Router

Features

Supported Models

Local Model Directory Support

Minimal Public Config

Model Directory Structure

Architecture

Package Structure

Core Components

Configuration System

Chat Template System

Prerequisites

Installation

Option A: Install from PyPI (recommended)

Option B: Development Installation

Option C: System Service Installation (macOS)

Usage

Development Mode

Production Mode (User Service)

Access Points

API Endpoints

Example API Usage

Vision Model Support

Supported Input Types

Prerequisites

Configuration

Example: Image OCR

Example: PDF OCR

Tested Models

Client Integration

Configuration

Configuration Locations

Configuration Reference

defaults Section

server Section

models Section

memory_pressure_max_tokens Object

Minimal Example Configuration

Optional Per-Model Override Example

Streaming Format Configuration

Model Directory Configuration

Memory Pressure Management

Logging

Development Mode

Production Mode (User Service)

Crash Recovery & Reliability

Automatic Restart (Production Mode)

Installation Testing

Troubleshooting

Service Issues

Common Issues

Agent Framework Integration

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 12

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`defaults` Section

`server` Section

`models` Section

`memory_pressure_max_tokens` Object

Packages