Fine-tune small language models (SLMs) for enterprise customers and intelligently route traffic to them — instead of paying for GPT-4o on every request.
Enterprise customers have narrow, repeated use cases — legal contract review, financial report summarization, customer support triage. A fine-tuned 8B parameter model can match GPT-4o quality on those tasks at 10-50x lower cost.
Modality handles the full lifecycle:
- Onboard a customer and define their domain (e.g. "legal", "finance")
- Fine-tune an SLM on their data via OpenAI or Fireworks APIs
- Evaluate the model automatically before it goes live
- Route incoming requests to the best model — or fall back to GPT-4o when unsure
- Track usage and cost savings per customer
```
┌──────────────────────────────────────────────────────────────┐
│                        CUSTOMER'S APP                        │
│                                                              │
│  client = OpenAI(base_url="https://api.modality.dev/v1",     │
│                  api_key="mod_abc123...")                    │
│  response = client.chat.completions.create(                  │
│      model="auto",   # Modality picks the best model         │
│      messages=[...]                                          │
│  )                                                           │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                    DATA PLANE (port 8000)                    │
│                 internet-facing, autoscaled                  │
│                                                              │
│  1. Authenticate request via API key                         │
│  2. Embed the prompt                                         │
│  3. Compare against cached model domain embeddings           │
│  4. Route to best SLM (or fall back to GPT-4o)               │
│  5. Return response + log usage                              │
│                                                              │
│  POST /v1/chat/completions                                   │
│  GET  /health                                                │
└──────────────────────────────────────────────────────────────┘
                       │
        reads from     │     shared database
                       │
┌──────────────────────────────────────────────────────────────┐
│                  CONTROL PLANE (port 8001)                   │
│                  internal only, behind VPN                   │
│                                                              │
│  • Onboard customers        POST /customers                  │
│  • Issue API keys           POST /customers/:id/api-keys     │
│  • Upload data & fine-tune  POST /finetune                   │
│  • Manage models            POST /models/:id/promote         │
│  • View usage & savings     GET  /customers/:id/usage       │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
The customer-facing inference API. This is what your customers' apps call instead of OpenAI directly. It authenticates the request via API key, embeds the prompt, routes to the best fine-tuned SLM using cosine similarity, falls back to GPT-4o when confidence is low, and logs usage for billing. Internet-facing, behind a load balancer, autoscaled.
The internal management API. Used by your team and customer dashboards to onboard customers, issue API keys, upload training data, kick off fine-tuning jobs, manage models (promote/demote), and view usage and cost savings. Internal only, behind a VPN or private subnet.
Both planes share the same codebase and database — they're different entry points deployed as separate services.
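One way to picture "same codebase, two entry points" is a single startup flag selecting which route table to serve. This is only a minimal stdlib sketch of that idea (the `--plane` flag and `select_routes` helper are hypothetical, not Modality's actual CLI); the route tables come from the architecture diagram above.

```python
import argparse

# Route tables for each plane (from the architecture diagram).
DATA_PLANE_ROUTES = ["POST /v1/chat/completions", "GET /health"]
CONTROL_PLANE_ROUTES = [
    "POST /customers",
    "POST /customers/:id/api-keys",
    "POST /finetune",
    "POST /models/:id/promote",
    "GET /customers/:id/usage",
    "GET /health",
]

def select_routes(plane: str) -> list:
    """Pick the route table for the requested entry point."""
    tables = {"data": DATA_PLANE_ROUTES, "control": CONTROL_PLANE_ROUTES}
    return tables[plane]

def parse_args(argv):
    """Parse the deploy-time flag that chooses which plane this process serves."""
    parser = argparse.ArgumentParser(description="Start one Modality plane")
    parser.add_argument("--plane", choices=["data", "control"], required=True)
    return parser.parse_args(argv)
```

Because the split is decided at startup, both services can ship in one Docker image and share models, schemas, and database code.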
From the customer's perspective, Modality is a drop-in replacement for the OpenAI API. They change two lines:
```python
# Before — calling OpenAI directly ($$$)
client = OpenAI(api_key="sk-...")

# After — calling Modality (routes to their fine-tuned SLM)
client = OpenAI(
    base_url="https://api.modality.dev/v1",
    api_key="mod_abc123..."  # issued via control plane
)

# Same API, same code, lower cost
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this contract..."}]
)
```

The customer doesn't choose which model to use. Modality's router picks the best fine-tuned model for each request based on what it's about — or falls back to GPT-4o for anything outside the model's domain.
```bash
# 1. Configure API keys
cp .env.example .env
# Edit .env with your OpenAI / Fireworks keys

# 2. Start everything
docker compose up --build

# Data plane:    http://localhost:8000
# Control plane: http://localhost:8001
# Swagger docs:  http://localhost:8000/docs and http://localhost:8001/docs
```

Good for getting started. The `docker-compose.yml` runs both planes + Postgres.
Deploy each plane as a separate service from the same Docker image:
```bash
# Build the images
docker build --target data-plane -t modality-data-plane .
docker build --target control-plane -t modality-control-plane .
```

Data plane service:
- Internet-facing, behind a load balancer
- Autoscale on CPU/request count (start with 2 replicas, scale to 20+)
- Set `MODALITY_DATABASE_URL` to your managed Postgres (RDS, Cloud SQL, etc.)
- Health check: `GET /health`
Control plane service:
- Internal only — behind VPN, private subnet, or IP-allowlisted
- 1-2 replicas is enough
- Same database connection string
- Health check: `GET /health`
Database:
- Use managed Postgres (RDS, Cloud SQL, Neon, Supabase)
- Both planes connect to the same database
- The data plane only reads; the control plane reads and writes
| Variable | Description | Default |
|---|---|---|
| `MODALITY_DATABASE_URL` | Postgres connection string | `sqlite+aiosqlite:///./modality.db` |
| `MODALITY_OPENAI_API_KEY` | OpenAI API key for embeddings + fine-tuning | — |
| `MODALITY_FIREWORKS_API_KEY` | Fireworks API key (optional) | — |
| `MODALITY_FALLBACK_MODEL` | Large model used when no SLM matches | `gpt-4o` |
| `MODALITY_ROUTER_CONFIDENCE_THRESHOLD` | Minimum similarity score to route to an SLM | `0.7` |
| `MODALITY_EVAL_MIN_SCORE` | Minimum eval score to auto-promote a model | `0.8` |
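The variables above can be read with their documented defaults at startup. A minimal sketch (the `load_settings` helper is hypothetical, not Modality's actual config loader):

```python
import os

def load_settings(env=os.environ):
    """Read Modality's settings, falling back to the documented defaults."""
    return {
        "database_url": env.get(
            "MODALITY_DATABASE_URL", "sqlite+aiosqlite:///./modality.db"
        ),
        "openai_api_key": env.get("MODALITY_OPENAI_API_KEY"),
        "fireworks_api_key": env.get("MODALITY_FIREWORKS_API_KEY"),
        "fallback_model": env.get("MODALITY_FALLBACK_MODEL", "gpt-4o"),
        # Numeric thresholds arrive as strings from the environment.
        "router_confidence_threshold": float(
            env.get("MODALITY_ROUTER_CONFIDENCE_THRESHOLD", "0.7")
        ),
        "eval_min_score": float(env.get("MODALITY_EVAL_MIN_SCORE", "0.8")),
    }
```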
1. Customer sends a request to `/v1/chat/completions`
2. The router embeds the prompt using `text-embedding-3-small` (fast, cheap)
3. It compares the embedding against every active model's domain embedding using cosine similarity
4. If the best match scores above the confidence threshold (default 0.7), route to that SLM
5. If nothing matches well enough, fall back to GPT-4o
The domain embeddings are generated when a model is fine-tuned, based on the domain_description you provide (e.g. "Legal contract analysis, clause extraction, risk assessment for US corporate law"). They're cached in memory on the data plane so routing adds <5ms of latency.
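The routing decision itself is small. Here is a minimal sketch of the similarity check and threshold fallback, assuming embeddings are plain float lists and each model record carries a hypothetical `domain_embedding` field (the real system gets its vectors from `text-embedding-3-small`):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route(prompt_embedding, models, threshold=0.7, fallback="gpt-4o"):
    """Pick the active SLM whose domain embedding is closest to the prompt,
    or fall back when no model clears the confidence threshold."""
    best_model, best_score = None, -1.0
    for model in models:
        score = cosine(prompt_embedding, model["domain_embedding"])
        if score > best_score:
            best_model, best_score = model, score
    if best_model is not None and best_score >= threshold:
        return best_model["name"]
    return fallback
```

A prompt squarely inside a model's domain scores near 1.0 and routes to that SLM; an off-domain prompt scores low everywhere and takes the GPT-4o fallback.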
1. Prepare training data as JSONL in OpenAI chat format:

   ```json
   {"messages": [{"role": "system", "content": "You are a legal assistant."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
   ```

2. Call the control plane:

   ```bash
   curl -X POST http://localhost:8001/finetune \
     -H "Content-Type: application/json" \
     -d '{
       "customer_name": "Acme Legal",
       "domain": "legal",
       "domain_description": "Legal contract analysis, clause extraction, and risk assessment for US corporate law",
       "training_file_path": "/data/acme/train.jsonl",
       "base_model": "gpt-4o-mini-2024-07-18",
       "provider": "openai"
     }'
   ```

3. Modality uploads the data to the provider, starts the fine-tuning job, polls for completion, runs an automated evaluation (LLM-as-judge), and promotes the model into the routing table if it scores above the threshold.
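The final promotion step reduces to a threshold check against `MODALITY_EVAL_MIN_SCORE`. A sketch of that gate, assuming a hypothetical shape where the LLM-as-judge produces one score per eval example:

```python
def should_promote(eval_scores, min_score=0.8):
    """Promote a fine-tuned model into the routing table only if its
    average LLM-as-judge score clears the configured minimum."""
    if not eval_scores:
        return False  # no eval results: never auto-promote
    return sum(eval_scores) / len(eval_scores) >= min_score
```

Models that fail the gate stay out of the routing table, so a bad fine-tune can never silently start serving customer traffic.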