Semantic KV cache reuse for LLM inference engines.
SemBlend extends exact-prefix KV caching (vLLM, LMCache, SGLang) with semantic donor discovery. When a prompt is semantically similar to a cached one but lexically different — different instruction phrasing, sentence order, or template fields — SemBlend finds and reuses the cached KV tensors, replacing a multi-second prefill with sub-second KV retrieval.
```
vLLM + LMCache alone:      semantically similar prompt → 0% hit      → full prefill
vLLM + LMCache + SemBlend: semantically similar prompt → 83–100% hit → reuse donor KV
```
Measured on A10G GPU, Qwen2.5-7B-AWQ, vLLM 0.14.1 + LMCache.
| Context | Cold TTFT | Hit TTFT | Speedup | Break-even P_hit |
|---|---|---|---|---|
| 4K | 1,859 ms | 801 ms | 2.3x | <1% |
| 8K | 3,193 ms | 817 ms | 3.9x | 4.9% |
| 16K | 5,852 ms | 871 ms | 6.7x | 4.1% |
| 32K | 15,418 ms | 1,288 ms | 12.0x | — |
Hit TTFT stays at ~800 ms regardless of context length: it is bounded by KV retrieval, not prefill. Miss overhead is 5–212 ms (negligible). SemBlend is therefore net-positive whenever the hit rate clears the break-even thresholds above, all under 5% for contexts ≥ 4K.
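The break-even column follows from a simple expected-latency comparison. A sketch (the per-row miss overhead is not listed above, so both endpoints of the quoted 5–212 ms range are evaluated; each row's actual break-even depends on the overhead measured for that context length):

```python
def break_even_hit_rate(cold_ms: float, hit_ms: float, miss_overhead_ms: float) -> float:
    """Smallest hit rate P at which SemBlend's expected TTFT beats plain prefill.

    Expected TTFT with SemBlend: P * hit_ms + (1 - P) * (cold_ms + miss_overhead_ms)
    Plain prefill TTFT:          cold_ms
    Setting the two equal and solving for P gives the break-even point.
    """
    saved = cold_ms - hit_ms  # time saved on a hit
    return miss_overhead_ms / (saved + miss_overhead_ms)

# 8K-context row from the table: 3,193 ms cold, 817 ms hit.
print(f"miss=5ms   -> {break_even_hit_rate(3193, 817, 5):.1%}")    # 0.2%
print(f"miss=212ms -> {break_even_hit_rate(3193, 817, 212):.1%}")  # 8.2%
```

Even at the worst-case miss overhead, a hit rate in the high single digits is enough for the 8K row to pay off.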
| Workload | Hit Rate | Hit-only Speedup |
|---|---|---|
| WildChat-1M conversations (≥4K) | 82.7% | 1.69x |
| Summarization (CNN/DM, SAMSum) | 50–88% | 2.3–2.4x |
| Multi-turn dialogue (turn 2+) | 99.5% | 5.1x |
| Cross-instruction RAG (8K) | 100% | 3.3x |
| Cross-instruction RAG (16K) | 100% | 5.3x |
| Code generation (dissimilar) | 0% | 0.96x |
Full-document segmented GPU embedding (v0.2.0) achieves 100% coverage of the prompt regardless of length, enabling 82.7% hit rate on real WildChat conversations (up from 29% with sparse sampling).
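The segmented-embedding idea can be sketched as follows. The window size matches the 256-token chunk boundary used elsewhere in this README; the overlap value and the toy `toy_embed` projection are illustrative stand-ins for the batched MiniLM GPU forward pass:

```python
import numpy as np

WINDOW = 256   # tokens per segment (matches the 256-token KV chunk size)
OVERLAP = 64   # illustrative; the real overlap is an implementation detail

def segment(tokens: list[int]) -> list[list[int]]:
    """Split a token sequence into overlapping fixed-size windows covering all of it."""
    step = WINDOW - OVERLAP
    return [tokens[i:i + WINDOW] for i in range(0, max(len(tokens) - OVERLAP, 1), step)]

def document_vector(tokens: list[int], embed) -> np.ndarray:
    """Embed every window (batched on GPU in practice) and mean-pool into one vector."""
    vecs = np.stack([embed(seg) for seg in segment(tokens)])
    pooled = vecs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)  # unit norm, so dot product = cosine similarity

# Toy embedder: random per-token projection, mean-pooled. Purely illustrative.
rng = np.random.default_rng(0)
proj = rng.standard_normal((32000, 384))
toy_embed = lambda seg: proj[seg].mean(axis=0)

v = document_vector(list(range(1000)), toy_embed)
print(v.shape)  # (384,)
```

Because every token falls inside at least one window, coverage is 100% at any prompt length, unlike sparse sampling, which embeds only a subset of the prompt.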
RoPE position correction keeps output quality near baseline:
| Dataset | PPL ratio (SemBlend / cold) |
|---|---|
| CNN/DailyMail | 1.006 |
| WikiHow | 1.012 |
| XSum | 1.025 |
See the paper for full benchmark details.
```bash
pip install semblend            # CPU-only core (numpy + rapidfuzz)
pip install semblend[vllm]      # + vLLM/LMCache integration
pip install semblend[sglang]    # + SGLang integration
pip install semblend[embedder]  # + sentence-transformers (MiniLM GPU)
```

Integrates via LMCache's `KVConnectorBase_V1`; no patching required.
```bash
pip install semblend[vllm] vllm lmcache

vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --kv-transfer-config '{
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both"
  }'
```

CacheBlend support: for selective layer recomputation (CacheBlend), vLLM must expose the loaded model to KV connectors via `initialize_worker_connector()`. This is available in vLLM builds that include PR #37339. Without it, SemBlend's semantic matching and KV injection still work; only CacheBlend's per-layer recomputation is unavailable.
```bash
pip install semblend[sglang] sglang

# CLI launcher — applies the RadixCache patch automatically
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000
```

Or programmatically, calling the patcher before SGLang initializes:

```python
from semblend.integration.sglang.radix_patcher import patch_radix_cache

patch_radix_cache()
# ... start SGLang server ...
```

A first-class `SemanticPrefixProvider` interface (no patching) is in progress upstream.
| Variable | Default | Description |
|---|---|---|
| `SEMBLEND_ENABLED` | `1` | Enable semantic donor search |
| `SEMBLEND_MIN_SIMILARITY` | `0.60` | Cosine similarity threshold |
| `SEMBLEND_EMBEDDER` | `minilm` | `minilm` (auto GPU) · `onnx_gpu` |
| `SEMBLEND_FUZZY_CHUNKS` | `0` | Fuzzy chunk matching for shifted prefixes |
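For example, to tighten donor matching and enable fuzzy chunk alignment for one process (a sketch; it assumes SemBlend reads these variables at startup, so they are set before the serving engine launches in the same environment):

```python
import os

# Tighten the donor similarity threshold and enable fuzzy chunk matching.
os.environ["SEMBLEND_MIN_SIMILARITY"] = "0.75"  # default 0.60
os.environ["SEMBLEND_FUZZY_CHUNKS"] = "1"       # default 0 (off)
os.environ["SEMBLEND_EMBEDDER"] = "minilm"      # default; auto-selects GPU

# ... then start the vLLM/SGLang server in this environment ...
```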
```
Request → Embed (2–15ms) → Search (1ms) → Align (1ms) → Inject KV
              ↓                 ↓              ↓
        MiniLM-L6-v2       cosine search   MD5 chunk hash
        GPU (ONNX RT)      donor store     256-token boundary
        segmented pool
```
- Embed — full-document segmented embedding on GPU via ONNX Runtime. Long prompts are split into overlapping 256-token windows, embedded in parallel, and mean-pooled into a single vector, giving 100% content coverage at any prompt length (~2ms short, ~10ms at 8K, ~15ms at 32K).
- Search — brute-force cosine similarity against the donor store (<1ms at 1K donors; CAGRA GPU ANN for larger pools)
- Align — MD5 chunk hashing finds reusable 256-token KV chunks; optional fuzzy matching handles shifted boundaries
- Inject — donor token IDs substituted into the request; LMCache/RadixCache retrieves cached KV; RoPE correction applied in-place on K tensors
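The align step above can be sketched as exact chunk-hash matching (a simplified sketch: the real lookup is against the donor store, and fuzzy matching plus the in-place RoPE correction on K tensors are omitted; the chunk size matches the 256-token boundary named above):

```python
import hashlib

CHUNK = 256  # tokens per KV chunk, matching the 256-token boundary above

def chunk_hashes(tokens: list[int]) -> list[str]:
    """MD5 over each full 256-token chunk; a partial tail chunk is not reusable."""
    out = []
    for i in range(0, len(tokens) - CHUNK + 1, CHUNK):
        blob = b",".join(str(t).encode() for t in tokens[i:i + CHUNK])
        out.append(hashlib.md5(blob).hexdigest())
    return out

def reusable_prefix_chunks(request: list[int], donor: list[int]) -> int:
    """Number of leading KV chunks the request can take verbatim from the donor."""
    n = 0
    for rh, dh in zip(chunk_hashes(request), chunk_hashes(donor)):
        if rh != dh:
            break
        n += 1
    return n

# Donor and request share a 512-token context, then diverge.
shared = list(range(512))
donor = shared + [1, 2, 3] * 100
request = shared + [9, 9, 9] * 100
print(reusable_prefix_chunks(request, donor))  # 2  (two full 256-token chunks reused)
```

Hashing token IDs rather than text makes the match tokenizer-exact; the fuzzy-matching option (`SEMBLEND_FUZZY_CHUNKS`) exists precisely because a one-token shift breaks every downstream exact chunk hash.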
Most effective when prompts share a large common context:
- Document Q&A / RAG — same retrieved passages, different questions
- Summarization — same article, different instruction phrasing
- Multi-turn dialogue — conversation history prefix reused across turns
- Code completion — shared repository context across requests
Dissimilar workloads (code generation from scratch, fully novel queries) see ~4% overhead with 0% hit — negligible in practice.
See CONTRIBUTING.md.
Built at WorldFlow AI. For enterprise support contact research@worldflowai.com.