jessicasingh7/modality
Modality

Fine-tune small language models (SLMs) for enterprise customers and intelligently route traffic to them — instead of paying for GPT-4o on every request.

What this does

Enterprise customers have narrow, repeated use cases — legal contract review, financial report summarization, customer support triage. A fine-tuned 8B-parameter model can match GPT-4o quality on those tasks at 10-50x lower cost.

Modality handles the full lifecycle:

  1. Onboard a customer and define their domain (e.g. "legal", "finance")
  2. Fine-tune an SLM on their data via OpenAI or Fireworks APIs
  3. Evaluate the model automatically before it goes live
  4. Route incoming requests to the best model — or fall back to GPT-4o when unsure
  5. Track usage and cost savings per customer

Architecture

┌──────────────────────────────────────────────────────────────┐
│                       CUSTOMER'S APP                         │
│                                                              │
│  client = OpenAI(base_url="https://api.modality.dev/v1",     │
│                  api_key="mod_abc123...")                    │
│  response = client.chat.completions.create(                  │
│      model="auto",    # Modality picks the best model        │
│      messages=[...]                                          │
│  )                                                           │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                     DATA PLANE (port 8000)                   │
│               internet-facing, autoscaled                    │
│                                                              │
│  1. Authenticate request via API key                         │
│  2. Embed the prompt                                         │
│  3. Compare against cached model domain embeddings           │
│  4. Route to best SLM (or fall back to GPT-4o)               │
│  5. Return response + log usage                              │
│                                                              │
│  POST /v1/chat/completions                                   │
│  GET  /health                                                │
└──────────────────────────────────────────────────────────────┘
                       │
          reads from   │   shared database
                       │
┌──────────────────────────────────────────────────────────────┐
│                   CONTROL PLANE (port 8001)                  │
│             internal only, behind VPN                        │
│                                                              │
│  • Onboard customers          POST /customers                │
│  • Issue API keys             POST /customers/:id/api-keys   │
│  • Upload data & fine-tune    POST /finetune                 │
│  • Manage models              POST /models/:id/promote       │
│  • View usage & savings       GET  /customers/:id/usage      │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Data Plane (port 8000)

The customer-facing inference API. This is what your customers' apps call instead of OpenAI directly. It authenticates the request via API key, embeds the prompt, routes to the best fine-tuned SLM using cosine similarity, falls back to GPT-4o when confidence is low, and logs usage for billing. Internet-facing, behind a load balancer, autoscaled.

Control Plane (port 8001)

The internal management API. Used by your team and customer dashboards to onboard customers, issue API keys, upload training data, kick off fine-tuning jobs, manage models (promote/demote), and view usage and cost savings. Internal only, behind a VPN or private subnet.

Both planes share the same codebase and database — they're different entry points deployed as separate services.
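
A minimal sketch of the shared-codebase idea (module and variable names here are illustrative, not the repo's actual layout): one codebase defines both route sets, and a deploy-time setting decides which set a given service instance serves.

```python
# Hypothetical sketch: one codebase, two entry points. The route tables
# mirror the endpoints listed above; MODALITY_PLANE is a made-up setting
# standing in for however the real deployment selects an entry point.
import os

DATA_PLANE_ROUTES = {
    "POST /v1/chat/completions",  # customer-facing inference
    "GET /health",
}

CONTROL_PLANE_ROUTES = {
    "POST /customers",
    "POST /customers/:id/api-keys",
    "POST /finetune",
    "POST /models/:id/promote",
    "GET /customers/:id/usage",
    "GET /health",
}

def active_routes(plane: str) -> set[str]:
    """Return the route set for the plane this instance is deployed as."""
    if plane == "data":
        return DATA_PLANE_ROUTES
    if plane == "control":
        return CONTROL_PLANE_ROUTES
    raise ValueError(f"unknown plane: {plane!r}")

routes = active_routes(os.environ.get("MODALITY_PLANE", "data"))
```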

How customers use it

From the customer's perspective, Modality is a drop-in replacement for the OpenAI API. They change two lines:

# Before — calling OpenAI directly ($$$)
client = OpenAI(api_key="sk-...")

# After — calling Modality (routes to their fine-tuned SLM)
client = OpenAI(
    base_url="https://api.modality.dev/v1",
    api_key="mod_abc123..."  # issued via control plane
)

# Same API, same code, lower cost
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this contract..."}]
)

The customer doesn't choose which model to use. Modality's router picks the best fine-tuned model for each request based on the content of the prompt — or falls back to GPT-4o for anything outside every model's domain.

Running locally

# 1. Configure API keys
cp .env.example .env
# Edit .env with your OpenAI / Fireworks keys

# 2. Start everything
docker compose up --build

# Data plane:    http://localhost:8000
# Control plane: http://localhost:8001
# Swagger docs:  http://localhost:8000/docs and http://localhost:8001/docs

Deploying to production

Option A: Docker Compose (single server)

Good for getting started. The docker-compose.yml runs both planes + Postgres.

Option B: Kubernetes / ECS / Cloud Run (recommended)

Deploy each plane as a separate service from the same Docker image:

# Build both images from the same Dockerfile
docker build --target data-plane -t modality-data-plane .
docker build --target control-plane -t modality-control-plane .

Data plane service:

  • Internet-facing, behind a load balancer
  • Autoscale on CPU/request count (start with 2, scale to 20+)
  • Set MODALITY_DATABASE_URL to your managed Postgres (RDS, Cloud SQL, etc.)
  • Health check: GET /health

Control plane service:

  • Internal only — behind VPN, private subnet, or IP-allowlisted
  • 1-2 replicas is enough
  • Same database connection string
  • Health check: GET /health

Database:

  • Use managed Postgres (RDS, Cloud SQL, Neon, Supabase)
  • Both planes connect to the same database
  • The data plane reads routing data and writes usage logs; the control plane owns all other writes

Environment variables

Variable                              Description                                  Default
MODALITY_DATABASE_URL                 Postgres connection string                   sqlite+aiosqlite:///./modality.db
MODALITY_OPENAI_API_KEY               OpenAI API key for embeddings + fine-tuning  (none)
MODALITY_FIREWORKS_API_KEY            Fireworks API key (optional)                 (none)
MODALITY_FALLBACK_MODEL               Large model used when no SLM matches         gpt-4o
MODALITY_ROUTER_CONFIDENCE_THRESHOLD  Minimum similarity score to route to an SLM  0.7
MODALITY_EVAL_MIN_SCORE               Minimum eval score to auto-promote a model   0.8
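
The variables above can be loaded with plain stdlib code; a sketch of how (the Settings class is illustrative, not the project's actual config module):

```python
import os
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Settings:
    """Illustrative settings loader mirroring the table above."""
    database_url: str = field(default_factory=lambda: os.environ.get(
        "MODALITY_DATABASE_URL", "sqlite+aiosqlite:///./modality.db"))
    openai_api_key: str = field(default_factory=lambda: os.environ.get(
        "MODALITY_OPENAI_API_KEY", ""))
    fireworks_api_key: str = field(default_factory=lambda: os.environ.get(
        "MODALITY_FIREWORKS_API_KEY", ""))
    fallback_model: str = field(default_factory=lambda: os.environ.get(
        "MODALITY_FALLBACK_MODEL", "gpt-4o"))
    router_confidence_threshold: float = field(default_factory=lambda: float(
        os.environ.get("MODALITY_ROUTER_CONFIDENCE_THRESHOLD", "0.7")))
    eval_min_score: float = field(default_factory=lambda: float(
        os.environ.get("MODALITY_EVAL_MIN_SCORE", "0.8")))

settings = Settings()
```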

How routing works

  1. Customer sends a request to /v1/chat/completions
  2. The router embeds the prompt using text-embedding-3-small (fast, cheap)
  3. It compares the embedding against every active model's domain embedding using cosine similarity
  4. If the best match scores above the confidence threshold (default 0.7), route to that SLM
  5. If nothing matches well enough, fall back to GPT-4o

The domain embeddings are generated when a model is fine-tuned, based on the domain_description you provide (e.g. "Legal contract analysis, clause extraction, risk assessment for US corporate law"). They're cached in memory on the data plane so routing adds <5ms of latency.
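
The routing decision above can be sketched in a few lines of pure Python (the model names and 2-D vectors below are toy stand-ins; the real service compares text-embedding-3-small vectors):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(prompt_embedding, domain_embeddings, threshold=0.7, fallback="gpt-4o"):
    """Pick the best-matching SLM, or fall back when confidence is low."""
    best_model, best_score = fallback, -1.0
    for model, embedding in domain_embeddings.items():
        score = cosine_similarity(prompt_embedding, embedding)
        if score > best_score:
            best_model, best_score = model, score
    return best_model if best_score >= threshold else fallback

# Toy example: 2-D stand-ins for real embedding vectors.
domains = {"acme-legal-slm": [1.0, 0.0], "acme-finance-slm": [0.0, 1.0]}
print(route([0.9, 0.1], domains))   # close to the legal domain → acme-legal-slm
print(route([0.6, -0.8], domains))  # no domain scores above 0.7 → gpt-4o
```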

How fine-tuning works

  1. Prepare training data as JSONL in OpenAI chat format:

    {"messages": [{"role": "system", "content": "You are a legal assistant."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
  2. Call the control plane:

    curl -X POST http://localhost:8001/finetune \
      -H "Content-Type: application/json" \
      -d '{
        "customer_name": "Acme Legal",
        "domain": "legal",
        "domain_description": "Legal contract analysis, clause extraction, and risk assessment for US corporate law",
        "training_file_path": "/data/acme/train.jsonl",
        "base_model": "gpt-4o-mini-2024-07-18",
        "provider": "openai"
      }'
  3. Modality uploads the data to the provider, starts the fine-tuning job, polls for completion, runs an automated evaluation (LLM-as-judge), and promotes the model into the routing table if it scores above the threshold.
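
Before calling /finetune, it's worth sanity-checking that every JSONL line really matches the OpenAI chat format shown in step 1. A minimal validator sketch (not part of Modality itself):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_training_line(line: str) -> bool:
    """Check one JSONL line against the OpenAI chat fine-tuning shape."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for msg in messages:
        if not isinstance(msg, dict):
            return False
        if msg.get("role") not in VALID_ROLES:
            return False
        if not isinstance(msg.get("content"), str):
            return False
    # A useful training example needs at least one assistant turn to learn from.
    return any(m["role"] == "assistant" for m in messages)

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
bad = '{"messages": [{"role": "user", "content": "Hi"}]}'
print(validate_training_line(good), validate_training_line(bad))  # True False
```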
