Fine-tune small language models (SLMs) for enterprise customers and intelligently route traffic to them — instead of paying for GPT-4o on every request.
Enterprise customers have narrow, repeated use cases — legal contract review, financial report summarization, customer support triage. A fine-tuned 8B parameter model can match GPT-4o quality on those tasks at 10-50x lower cost.
Modality handles the full lifecycle:
- Onboard a customer and define their domain (e.g. "legal", "finance")
- Fine-tune an SLM on their data via OpenAI or Fireworks APIs
- Evaluate the model automatically before it goes live
- Route incoming requests to the best model — or fall back to GPT-4o when unsure
- Track usage and cost savings per customer
```
┌──────────────────────────────────────────────────────────────┐
│                        CUSTOMER'S APP                        │
│                                                              │
│  client = OpenAI(base_url="https://api.modality.dev/v1",     │
│                  api_key="mod_abc123...")                    │
│  response = client.chat.completions.create(                  │
│      model="auto",   # Modality picks the best model         │
│      messages=[...]                                          │
│  )                                                           │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                    DATA PLANE (port 8000)                    │
│                 internet-facing, autoscaled                  │
│                                                              │
│  1. Authenticate request via API key                         │
│  2. Embed the prompt                                         │
│  3. Compare against cached model domain embeddings           │
│  4. Route to best SLM (or fall back to GPT-4o)               │
│  5. Return response + log usage                              │
│                                                              │
│  POST /v1/chat/completions                                   │
│  GET  /health                                                │
└──────────────────────────────────────────────────────────────┘
                       │
        reads from     │     shared database
                       │
┌──────────────────────────────────────────────────────────────┐
│                  CONTROL PLANE (port 8001)                   │
│                  internal only, behind VPN                   │
│                                                              │
│  • Onboard customers        POST /customers                  │
│  • Issue API keys           POST /customers/:id/api-keys     │
│  • Upload data & fine-tune  POST /finetune                   │
│  • Manage models            POST /models/:id/promote         │
│  • View usage & savings     GET  /customers/:id/usage       │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
The customer-facing inference API. This is what your customers' apps call instead of OpenAI directly. It authenticates the request via API key, embeds the prompt, routes to the best fine-tuned SLM using cosine similarity, falls back to GPT-4o when confidence is low, and logs usage for billing. Internet-facing, behind a load balancer, autoscaled.
The internal management API. Used by your team and customer dashboards to onboard customers, issue API keys, upload training data, kick off fine-tuning jobs, manage models (promote/demote), and view usage and cost savings. Internal only, behind a VPN or private subnet.
Both planes share the same codebase and database — they're different entry points deployed as separate services.
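One way to picture "same codebase, two entry points" is a single startup flag selecting which route table to serve. This is only a minimal stdlib sketch of that idea (the `--plane` flag and `select_routes` helper are hypothetical, not Modality's actual CLI); the route tables come from the architecture diagram above.

```python
import argparse

# Route tables for each plane (from the architecture diagram).
DATA_PLANE_ROUTES = ["POST /v1/chat/completions", "GET /health"]
CONTROL_PLANE_ROUTES = [
    "POST /customers",
    "POST /customers/:id/api-keys",
    "POST /finetune",
    "POST /models/:id/promote",
    "GET /customers/:id/usage",
    "GET /health",
]

def select_routes(plane: str) -> list:
    """Pick the route table for the requested entry point."""
    tables = {"data": DATA_PLANE_ROUTES, "control": CONTROL_PLANE_ROUTES}
    return tables[plane]

def parse_args(argv):
    """Parse the deploy-time flag that chooses which plane this process serves."""
    parser = argparse.ArgumentParser(description="Start one Modality plane")
    parser.add_argument("--plane", choices=["data", "control"], required=True)
    return parser.parse_args(argv)
```

Because the split is decided at startup, both services can ship in one Docker image and share models, schemas, and database code.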
From the customer's perspective, Modality is a drop-in replacement for the OpenAI API. They change two lines:
```python
# Before — calling OpenAI directly ($$$)
client = OpenAI(api_key="sk-...")

# After — calling Modality (routes to their fine-tuned SLM)
client = OpenAI(
    base_url="https://api.modality.dev/v1",
    api_key="mod_abc123..."  # issued via control plane
)

# Same API, same code, lower cost
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this contract..."}]
)
```

The customer doesn't choose which model to use. Modality's router picks the best fine-tuned model for each request based on what it's about — or falls back to GPT-4o for anything outside the model's domain.
```bash
# 1. Configure API keys
cp .env.example .env
# Edit .env with your OpenAI / Fireworks keys

# 2. Start everything
docker compose up --build

# Data plane:    http://localhost:8000
# Control plane: http://localhost:8001
# Swagger docs:  http://localhost:8000/docs and http://localhost:8001/docs
```

Good for getting started. The `docker-compose.yml` runs both planes + Postgres.
Deploy each plane as a separate service from the same Docker image:
```bash
# Build the images
docker build --target data-plane -t modality-data-plane .
docker build --target control-plane -t modality-control-plane .
```

Data plane service:
- Internet-facing, behind a load balancer
- Autoscale on CPU/request count (start with 2 replicas, scale to 20+)
- Set `MODALITY_DATABASE_URL` to your managed Postgres (RDS, Cloud SQL, etc.)
- Health check: `GET /health`
Control plane service:
- Internal only — behind VPN, private subnet, or IP-allowlisted
- 1-2 replicas is enough
- Same database connection string
- Health check: `GET /health`
Database:
- Use managed Postgres (RDS, Cloud SQL, Neon, Supabase)
- Both planes connect to the same database
- The data plane only reads; the control plane reads and writes
| Variable | Description | Default |
|---|---|---|
| `MODALITY_DATABASE_URL` | Postgres connection string | `sqlite+aiosqlite:///./modality.db` |
| `MODALITY_OPENAI_API_KEY` | OpenAI API key for embeddings + fine-tuning | — |
| `MODALITY_FIREWORKS_API_KEY` | Fireworks API key (optional) | — |
| `MODALITY_FALLBACK_MODEL` | Large model used when no SLM matches | `gpt-4o` |
| `MODALITY_ROUTER_CONFIDENCE_THRESHOLD` | Minimum similarity score to route to an SLM | `0.7` |
| `MODALITY_EVAL_MIN_SCORE` | Minimum eval score to auto-promote a model | `0.8` |
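The variables above can be read with their documented defaults at startup. A minimal sketch (the `load_settings` helper is hypothetical, not Modality's actual config loader):

```python
import os

def load_settings(env=os.environ):
    """Read Modality's settings, falling back to the documented defaults."""
    return {
        "database_url": env.get(
            "MODALITY_DATABASE_URL", "sqlite+aiosqlite:///./modality.db"
        ),
        "openai_api_key": env.get("MODALITY_OPENAI_API_KEY"),
        "fireworks_api_key": env.get("MODALITY_FIREWORKS_API_KEY"),
        "fallback_model": env.get("MODALITY_FALLBACK_MODEL", "gpt-4o"),
        # Numeric thresholds arrive as strings from the environment.
        "router_confidence_threshold": float(
            env.get("MODALITY_ROUTER_CONFIDENCE_THRESHOLD", "0.7")
        ),
        "eval_min_score": float(env.get("MODALITY_EVAL_MIN_SCORE", "0.8")),
    }
```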
1. Customer sends a request to `/v1/chat/completions`
2. The router embeds the prompt using `text-embedding-3-small` (fast, cheap)
3. It compares the embedding against every active model's domain embedding using cosine similarity
4. If the best match scores above the confidence threshold (default 0.7), route to that SLM
5. If nothing matches well enough, fall back to GPT-4o
The domain embeddings are generated when a model is fine-tuned, based on the domain_description you provide (e.g. "Legal contract analysis, clause extraction, risk assessment for US corporate law"). They're cached in memory on the data plane so routing adds <5ms of latency.
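The routing decision itself is small. Here is a minimal sketch of the similarity check and threshold fallback, assuming embeddings are plain float lists and each model record carries a hypothetical `domain_embedding` field (the real system gets its vectors from `text-embedding-3-small`):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route(prompt_embedding, models, threshold=0.7, fallback="gpt-4o"):
    """Pick the active SLM whose domain embedding is closest to the prompt,
    or fall back when no model clears the confidence threshold."""
    best_model, best_score = None, -1.0
    for model in models:
        score = cosine(prompt_embedding, model["domain_embedding"])
        if score > best_score:
            best_model, best_score = model, score
    if best_model is not None and best_score >= threshold:
        return best_model["name"]
    return fallback
```

A prompt squarely inside a model's domain scores near 1.0 and routes to that SLM; an off-domain prompt scores low everywhere and takes the GPT-4o fallback.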
1. Prepare training data as JSONL in OpenAI chat format:

   ```json
   {"messages": [{"role": "system", "content": "You are a legal assistant."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
   ```

2. Call the control plane:

   ```bash
   curl -X POST http://localhost:8001/finetune \
     -H "Content-Type: application/json" \
     -d '{
       "customer_name": "Acme Legal",
       "domain": "legal",
       "domain_description": "Legal contract analysis, clause extraction, and risk assessment for US corporate law",
       "training_file_path": "/data/acme/train.jsonl",
       "base_model": "gpt-4o-mini-2024-07-18",
       "provider": "openai"
     }'
   ```

3. Modality uploads the data to the provider, starts the fine-tuning job, polls for completion, runs an automated evaluation (LLM-as-judge), and promotes the model into the routing table if it scores above the threshold.
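The final promotion step reduces to a threshold check against `MODALITY_EVAL_MIN_SCORE`. A sketch of that gate, assuming a hypothetical shape where the LLM-as-judge produces one score per eval example:

```python
def should_promote(eval_scores, min_score=0.8):
    """Promote a fine-tuned model into the routing table only if its
    average LLM-as-judge score clears the configured minimum."""
    if not eval_scores:
        return False  # no eval results: never auto-promote
    return sum(eval_scores) / len(eval_scores) >= min_score
```

Models that fail the gate stay out of the routing table, so a bad fine-tune can never silently start serving customer traffic.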