`Lethus`

Stop paying your LLM to re-read the conversation.

Drop-in proxy · OpenAI-compatible · Up to 4x token reduction

The problem is simple.

Every API call sends the full conversation. Turn 40 means the model re-reads 39 turns it already processed. You pay for all of them.

Turn  1 ──────    50 tokens    ■
Turn  5 ──────   450 tokens    ■■■■■
Turn 10 ──────  1800 tokens    ■■■■■■■■■■■■■■
Turn 20 ────── 4,800 tokens    ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Turn 40 ────── 12000 tokens    ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
                                                                  ↑ you pay for this

Three approaches exist. All are broken:

Approach	What it does	What breaks
Full History	Send everything	Tokens explode. Cost explodes.
Summarization	Compress to a paragraph	Decisions vanish. Detail is lost.
Top-K RAG	Retrieve scattered chunks	Reasoning continuity is shattered.

Lethus is the fourth option.

It sits between your app and the LLM. It intercepts the full conversation, figures out what the model actually needs, and forwards a minimal context window -- preserving the reasoning chain the model depends on.

What the model sees at Turn 20:

Without Lethus (4,800 tokens):                 With Lethus (1,200 tokens):
┌──────────────────────────────────────┐        ┌──────────────────────────────────────┐
│  Turn  1: "Plan a trip to Japan"     │        │  [State Doc - 180 tokens]            │
│  Turn  2: "Sure! Here are ideas..."  │        │   Summary of trip: budget, dates,    │
│  Turn  3: "What about budget?"       │        │   decisions made so far...           │
│  Turn  4: "For budget travel..."     │        │                                      │
│  Turn  5: "How about hostels?"       │        │  Turn  7: "Budget is $3000 total"    │
│  Turn  6: "Great choice. Here..."    │        │  Turn  8: "Let's split it: $1200     │
│  Turn  7: "Budget is $3000 total"    │        │           flights, $1000 stay..."     │
│  Turn  8: "Let's split it..."        │        │  Turn  9: "Sounds good, booked."     │
│  Turn  9: "Sounds good, booked."     │        │                                      │
│  Turn 10: "What about food?"         │        │  Turn 18: "Book the Airbnb?"         │
│  Turn 11: "Here are options..."      │        │  Turn 19: "Yes, great choice."       │
│  Turn 12: "Let's go street food"     │        │  Turn 20: "What's the total budget?" │
│  Turn 13: "Perfect. Also..."         │        └──────────────────────────────────────┘
│  Turn 14: "What about transit?"      │           ↑ decisions preserved
│  Turn 15: "Get a JR Pass..."        │           ↑ reasoning chain intact
│  Turn 16: "Done. Now activities?"    │           ↑ 75% fewer tokens
│  Turn 17: "Here are top spots..."    │
│  Turn 18: "Book the Airbnb?"         │
│  Turn 19: "Yes, great choice."       │
│  Turn 20: "What's the total budget?" │
└──────────────────────────────────────┘

The model gets the state doc (what's been decided), the budget decision span (contiguous, not scattered), and the recent turns. Everything it needs. Nothing it doesn't.

Two lines to switch.

# Before -- direct to LLM
client = openai.OpenAI(api_key="sk-...")

# After -- through Lethus
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="sk-...")

Same endpoint. Same request format. Same response schema. Streaming works.

Every response comes back with metadata:

X-Lethus-Conversation-Id:   a1b2c3d4-...
X-Lethus-Reduction-Percent:  72.3
X-Lethus-Intent:             RECALL
X-Lethus-Processing-Ms:      23

What happens inside.

Message in
    │
    ├── 1. Intent ──────── RECALL | CONTINUATION | CLARIFICATION | NEW_TOPIC
    │
    ├── 2. Retrieve ────── Embed query → vector search → similar turns
    │
    ├── 3. Z-Score ─────── Normalize scores → noise drops below zero
    │
    ├── 4. Boost ───────── Upweight decisions, issues, resolutions
    │
    ├── 5. Kadane ──────── Find optimal contiguous span
    │
    └── 6. Budget ──────── Trim to token limit
    │
    ▼
LLM receives only what matters
    │
    ▼
Response out + async writeback

Intent routing means cheap queries stay cheap:

Intent	What it means	Strategy	Cost
`CONTINUATION`	Follow-up to last message	Last 3 turns	Near zero
`CLARIFICATION`	"What do you mean?"	Last assistant response	Minimal
`NEW_TOPIC`	Subject change	State doc + recent turns	Low
`RECALL`	"What did we decide about...?"	Full retrieval pipeline	Worth it

Why Kadane's algorithm matters.

Standard RAG retrieves the top-K most similar chunks. They're scattered across the conversation. The model sees fragments without the connective tissue between them.

Top-K RAG (scattered):
  Turn 2 ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  Turn 7 ░░░░░░██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  Turn 15 ░░░░░░░░░░░░░░██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  Turn 28 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░██░░░░░░░░░░░░░░░░░░
  Turn 33 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░██░░░░░░░░░░░░
         ↑ five fragments, no connecting context

Lethus + Kadane (contiguous):
  Turn 6 ░░░░░░████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  Turn 7          ↑ coherent reasoning thread with full context
  Turn 8
  Turn 9
  Turn 10 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Kadane's maximum subarray algorithm finds the contiguous span with the highest cumulative signal. Multi-span variant finds up to 3 non-overlapping windows. The model reads a coherent thread, not a collage.

The numbers.

	Without Lethus	With Lethus
Input tokens/month	640M	~430M	-33%
Monthly cost	$3,200	$2,150
Monthly savings		~$1,050
Annual savings		~$12,600

800M tokens/month, 80/20 input-output split, $5/M input pricing, Claude Opus 4.6.

Under the hood.

Z-Score Normalization -- making similarity scores meaningful

Raw cosine similarity scores cluster in a tight band (0.82 -- 0.89). The difference between relevant and irrelevant is buried at the third decimal. Threshold-based filtering can't see it.

Z-score normalization transforms raw scores into "standard deviations from the mean." Relevant turns jump to +1.5, +2.0. Noise drops below zero. Kadane's gain threshold now works consistently across any query, any conversation length.

Changelog Boosting -- semantic importance beyond embeddings

Some turns matter because of what was decided, not what was said. "Let's go with PostgreSQL" has low embedding similarity to a question about database performance -- but it's the most important turn in the thread.

Lethus tracks a structured changelog: DECISION, UPDATE, ISSUE, RESOLUTION, CONTEXT. Turns with changelog entries get boosted (+1.0). Their neighbors get a smaller boost (+0.3), because the reasoning leading to a decision matters too.

State Document -- living memory of the conversation

Every N turns (default: 3), an LLM regenerates a structured summary of the conversation. This state doc captures the current understanding -- goals, decisions, open questions, resolved issues -- without replaying the full history.

For NEW_TOPIC intents, the state doc provides grounding context without expensive vector retrieval. It's the difference between "I know nothing about this conversation" and "here's where we stand."

Cold Path -- async, non-blocking

After every response, Lethus fires an async writeback:

Store turns -- user + assistant messages --> PostgreSQL + Milvus embeddings
Generate changelog -- LLM extracts decisions, issues, resolutions from the exchange
Refresh state doc -- every N turns, regenerates the structured summary

The hot path stays fast. The cold path keeps the knowledge base current.

Architecture.

flowchart LR
    App["Your App"] -->|same API| L["Lethus"]
    L -->|minimal context| LLM["Any LLM"]
    L <--> PG[("Postgres")]
    L <--> MV[("Milvus")]
    L -->|embed| E["Embedding API"]

    style App fill:#3b82f6,color:#fff
    style L fill:#4f46e5,color:#fff
    style LLM fill:#059669,color:#fff
    style PG fill:#f59e0b,color:#000
    style MV fill:#f59e0b,color:#000
    style E fill:#8b5cf6,color:#fff

Everything runs locally via Docker Compose. PostgreSQL stores conversations + changelogs. Milvus stores embeddings. Any OpenAI-compatible embedding API works.

Stack.

	Technology
Backend	Node.js · Express 5 · TypeScript
Database	PostgreSQL 16 · Prisma ORM
Vector DB	Milvus 2.6 (etcd + MinIO)
Embeddings	Any OpenAI-compatible API (default: `text-embedding-3-small`)
Frontend	Next.js 16 · React 19 · Tailwind CSS 4
Infra	Docker Compose

Get started.

Prerequisites: Node.js >= 18, Docker + Docker Compose, any OpenAI-compatible API key

git clone https://github.com/0xteamCookie/Lethus.git && cd Lethus

# Install
cd backend && npm install && cd ../frontend && npm install

# Configure
cd ../backend && cp .env.example .env
# Edit .env -- set your API keys and upstream LLM URL

# Infrastructure
docker compose up -d

# Initialize
npm run db:generate && npm run db:migrate && npm run init:milvus

# Run
npm run dev                        # backend  :8000
cd ../frontend && npm run dev      # frontend :3000

Verify: http://localhost:8000/health · Chat: http://localhost:3000/chat · Presentation: http://localhost:3000/present

Configuration

All tuning via environment variables (see .env.example):

Variable	Default	What it does
`COLD_START_THRESHOLD`	`5`	Turns before retrieval pipeline activates
`RETRIEVAL_TOKEN_BUDGET`	`2000`	Max tokens for retrieved context
`RECENT_TURNS_COUNT`	`3`	Always-included recent turns
`KADANE_THETA`	`1.0`	Span selection sensitivity
`GAIN_SHIFT`	`0.6`	Baseline for gain scores
`CHANGELOG_BOOST`	`1.0`	Score boost for decision turns
`CHANGELOG_NEIGHBOR_BOOST`	`0.3`	Boost for adjacent turns
`STATE_DOC_UPDATE_INTERVAL`	`3`	Turns between state doc refresh

Observability API

GET  /conversations                     # list all
GET  /conversations/:id                 # conversation details
GET  /conversations/:id/turns           # full turn history
GET  /conversations/:id/changelog       # decision log
GET  /conversations/:id/state           # current state doc

Data Model

erDiagram
    Conversation ||--o{ Turn : "has many"
    Conversation ||--o{ ChangelogEntry : "tracks"
    Conversation ||--o| StateDoc : "summarizes"

Turn -- messages with token counts, embedded into Milvus
ChangelogEntry -- decisions, issues, resolutions (supersedable)
StateDoc -- living summary, regenerated every N turns

Scripts

npm run dev              # Start dev server
npm run db:migrate       # Run Prisma migrations
npm run db:generate      # Regenerate Prisma client
npm run init:milvus      # Create Milvus collection
npm run reset:milvus     # Drop and recreate collection
npm run verify           # Verify all connections
npm run test:services    # Run service tests
npm run demo             # Run a demo conversation

MIT License · teamCookie()

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
backend		backend
frontend		frontend
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.dev.yml		docker-compose.dev.yml
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

`Lethus`

The problem is simple.

Lethus is the fourth option.

Two lines to switch.

What happens inside.

Why Kadane's algorithm matters.

The numbers.

Under the hood.

Architecture.

Stack.

Get started.

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Lethus

The problem is simple.

Lethus is the fourth option.

Two lines to switch.

What happens inside.

Why Kadane's algorithm matters.

The numbers.

Under the hood.

Architecture.

Stack.

Get started.

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages

`Lethus`