Pure Go deep learning inference. Load GGUF models and run them on CPU with zero dependencies beyond the Go standard library.

```go
model, _ := dlgo.LoadLLM("model.gguf")
response, _ := model.Chat("", "What is the capital of France?")
fmt.Println(response) // "The capital of France is Paris."
```

- LLM inference — text generation, multi-turn chat, streaming
- Speech-to-text — Whisper transcription from WAV files
- Voice activity detection — Silero VAD
- GGUF format — loads quantized models directly, no conversion needed
- Mixture of Experts (MoE) — full MoE support with batched expert-parallel dispatch (up to 512 experts), fused GPU shaders (ScaleAdd, SwiGLU-OAI-Bias, GPU-side top-K routing), multiple gating functions (softmax, sigmoid, softmax-weight), shared experts
- Hybrid SSM+Attention — Gated Delta Net (GDN) recurrent layers with interleaved full attention, supporting Qwen3-Coder-Next, Qwen3.5, and Qwen3.5 MoE architectures
- Multi-head Latent Attention (MLA) — compressed KV cache with absorbed key projection for DeepSeek-V2 / GLM-4.7 architectures
- Attention Sinks — learned "trash bin" logits for OpenAI gpt-oss models, absorbing unneeded attention mass
- Vulkan GPU inference — full Vulkan compute backend with quantized MatVec shaders (Q4_0, Q4_K, Q5_0, Q6_K, Q8_0, F32), fused attention, RoPE, SwiGLU/GeGLU/SwiGLU-OAI-Bias, RMSNorm, GPU-side MoE top-K routing, batched expert dispatch with fused ScaleAdd, custom SSM/GDN kernels — beats Ollama's Vulkan backend by 66–126% on most models, within 5% on the rest; beats Ollama CUDA on Qwen3.5 (+28%)
- Never-OOM GPU pipeline — automatic VRAM budget with retry-on-failure; if a model exceeds VRAM, the pipeline reduces GPU layers until it fits, gracefully falling back to partial GPU + CPU
- Fast on CPU — AVX2/FMA/VNNI SIMD via optional CGo, QxQ integer dot products, fused IQ3_XXS/IQ4_XS SIMD kernels, batch prefill GEMM, parallel worker pools (beats Ollama on Qwen3.5 models, within 0–18% for most others)
- 25+ quantization formats — Q4_0 through Q8_0, K-quants (Q2_K–Q8_K), I-quants (IQ1_S, IQ1_M, IQ2_XXS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS), MXFP4, F16, BF16, F32
- Zero-overhead diagnostics — flag-enabled GPU/CPU forward pass instrumentation via the `DLGO_GPU_DIAG`/`DLGO_CPU_DIAG` env vars, with layer/position filtering and verbosity levels
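The quantization formats listed above all share GGML's block layout: a per-block scale plus a run of small integer quants. As a minimal sketch (not dlgo's internal types — the real Q8_0 block stores its scale as float16, not float32), dequantizing a Q8_0 block looks like this:

```go
package main

import "fmt"

// blockQ8_0 mirrors GGML's Q8_0 layout: one scale per 32 weights.
// The real format stores the scale as float16; float32 is used here for clarity.
type blockQ8_0 struct {
	scale  float32
	quants [32]int8
}

// dequantize expands a block back to float32: x[i] = scale * q[i].
func (b *blockQ8_0) dequantize() [32]float32 {
	var out [32]float32
	for i, q := range b.quants {
		out[i] = b.scale * float32(q)
	}
	return out
}

func main() {
	b := blockQ8_0{scale: 0.5}
	b.quants[0], b.quants[1] = 4, -6
	x := b.dequantize()
	fmt.Println(x[0], x[1]) // 2 -3
}
```

Keeping weights in this compressed block form is what lets quantized tensors stay small in memory and be expanded row by row only when a matmul needs them.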
| Architecture | Models Tested | CPU tok/s | GPU tok/s |
|---|---|---|---|
| LLaMA | Llama 3.2 1B, TinyLlama 1.1B | 52–65 | 314–422 |
| Qwen2/3 | Qwen 2.5 0.5B, Qwen3 0.6B | 60–98 | 241–411 |
| Qwen3 MoE | Qwen3-Coder-30B-A3B (128 experts, IQ3_XXS) | ~5.2 | ~40 |
| Qwen3-Coder-Next | 80B-A3B (hybrid GDN+attention, 512 experts, IQ3_XXS) | ~2.7 | ~5.4 (partial 27/48) |
| Qwen3.5 | Qwen3.5 0.8B, 2B (hybrid GDN+attention) | 24–34 | 171–193 |
| Qwen3.5 (large) | Qwen3.5 9B, 27B (hybrid GDN+attention) | 2.4–7.9 | 19–61 |
| Qwen3.5 MoE | Qwen3.5 35B-A3B (256 experts, 8 active) | ~4.1 | ~11 (partial 39/40) |
| Qwen3.5 MoE (large) | Qwen3.5 122B-A10B (IQ3_XXS) | ~1.4 | ~2.0 (partial 16/48) |
| GLM-4.7 Flash | GLM-4.7 Flash (MLA + MoE, 64 experts, Q4_K) | ~5.6 | ~15 (partial 43/47) |
| gpt-oss (OpenAI) | gpt-oss-20b (MoE, attention sinks, MXFP4/Q3_K_M) | 4.5–5.6 | 33–36 |
| Gemma 2/3 | Gemma 2 2B, Gemma 3 1B, Gemma 3 270M | 44–154 | 249–530 |
| SmolLM2 | SmolLM2 360M, SmolLM2 1.7B | 42–96 | 177–411 |
| Phi | Phi-2, Phi-4-mini | 9–20 | ~125 |
| Mistral | Mistral (llama-compatible) | — | — |
| Whisper | Tiny, Base, Small (speech-to-text) | ~1x RT | — |
CPU throughput measured with AVX2+FMA SIMD, the parallel worker pool, and batch prefill enabled. GPU throughput measured with the Vulkan compute backend on an NVIDIA RTX 4070 Ti SUPER.
Benchmarks use the exact same GGUF file loaded into both engines via `ollama create` with a Modelfile: `temperature=0`, `seed=42`, `max_tokens=64`, with Ollama forced CPU-only (`num_gpu=0`).
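A hypothetical reproduction of that setup (model filename and tag are illustrative) loads the same GGUF into Ollama via a Modelfile, pinned to CPU and with deterministic sampling:

```shell
# Load the same GGUF file into Ollama with the benchmark parameters above.
cat > Modelfile <<'EOF'
FROM ./llama-3.2-1b-instruct-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER seed 42
PARAMETER num_predict 64
PARAMETER num_gpu 0
EOF
ollama create bench-llama -f Modelfile
```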
| Model | Quant | dlgo gen | Ollama gen | Delta | dlgo prefill | Ollama prefill |
|---|---|---|---|---|---|---|
| Gemma 3 270M | Q8_0 | 131.9 tok/s | 160.5 tok/s | −18% | 26 ms | 11 ms |
| SmolLM2 360M | Q8_0 | 96.1 tok/s | 117.7 tok/s | −18% | 59 ms | 37 ms |
| Qwen 2.5 0.5B | Q4_K_M | 90.8 tok/s | 121.0 tok/s | −25% | 66 ms | 24 ms |
| Qwen3 0.6B | Q8_0 | 58.7 tok/s | 69.9 tok/s | −16% | 69 ms | 33 ms |
| TinyLlama 1.1B | Q4_0 | 59.4 tok/s | 77.9 tok/s | −24% | 210 ms | 135 ms |
| Gemma 3 1B | Q4_K_M | 42.2 tok/s | 52.9 tok/s | −20% | 141 ms | 86 ms |
| Llama 3.2 1B | Q4_K_M | 52.4 tok/s | 58.6 tok/s | −11% | 108 ms | 51 ms |
| SmolLM2 1.7B | Q4_K_M | 39.2 tok/s | 44.8 tok/s | −12% | 160 ms | 124 ms |
| Qwen3.5 0.8B | Q8_0 | 33.2 tok/s | 27.3 tok/s | +22% | 191 ms | 85 ms |
| Phi-4-mini 3.8B | Q3_K_M | 19.4 tok/s | 23.2 tok/s | −16% | 288 ms | 137 ms |
| Model | Quant | dlgo gen | Ollama gen | Delta | dlgo prefill | Ollama prefill |
|---|---|---|---|---|---|---|
| Qwen3.5 9B | Q3_K_M | 7.5 tok/s | 7.2 tok/s | +4% | 439 ms | 449 ms |
| Qwen3.5 27B | Q3_K_M | 2.5 tok/s | 2.4 tok/s | +4% | 1314 ms | 1528 ms |
| Qwen3.5 35B-A3B MoE | Q3_K_M | 8.1 tok/s | 7.6 tok/s | +7% | 640 ms | 441 ms |
| Model | Quant | Size | Active Params | Architecture | CPU gen | GPU gen |
|---|---|---|---|---|---|---|
| Qwen3-Coder-Next 80B-A3B | IQ3_XXS | 27 GB | ~3B | Hybrid GDN+Attention, 512 experts | 2.7 tok/s | 5.4 tok/s (partial 27/48) |
| Qwen3-Coder-30B-A3B | IQ3_XXS | 12 GB | ~3B | MoE, 128 experts | 5.2 tok/s | 40.4 tok/s |
| Qwen3.5-122B-A10B | IQ3_XXS | 45 GB | ~10B | Hybrid GDN+MoE, 256 experts | 1.4 tok/s | 2.0 tok/s (partial 16/48) |
| Qwen3.5-35B-A3B MoE | Q3_K_M | 16 GB | ~3B | Hybrid GDN+MoE, 256 experts | 4.1 tok/s | 11.0 tok/s (partial 39/40) |
| gpt-oss-20b | MXFP4 | 11.5 GB | ~5B | MoE, 32 experts, attention sinks | 5.6 tok/s | 32.6 tok/s |
| gpt-oss-20b | Q3_K_M | 11 GB | ~5B | MoE, 32 experts, attention sinks | 4.5 tok/s | 32.7 tok/s |
| GLM-4.7-Flash | Q4_K_XL | 16.7 GB | ~3B | MLA + MoE, 64 experts | 5.6 tok/s | 15.3 tok/s (partial 43/47) |
Notes:
- Qwen3.5 models outperform Ollama on CPU generation: 0.8B (+22%), 9B (+4%), 27B (+4%), and 35B MoE (+7%). These use hybrid GDN+attention architectures where dlgo's optimized SSM kernels and fused IQ SIMD dot products give a measurable advantage.
- gpt-oss-20b runs fully on GPU with fused SwiGLU-OAI-Bias shader, GPU-side top-K routing, and batched expert dispatch. 33 tok/s (6–7x faster than CPU) with correct output on both MXFP4 and Q3_K_M.
- Qwen3-Coder-30B-A3B runs all 48 MoE layers on GPU at 40.4 tok/s (8x faster than CPU).
- Qwen3.5-35B-A3B MoE now loads successfully with the never-OOM pipeline (it previously crashed), running at 11 tok/s (partial, 39/40 layers).
- Qwen3.5-122B-A10B (45 GB, 256 experts) runs with partial GPU offloading (16/48 layers) due to VRAM constraints, producing coherent output.
- GLM-4.7-Flash uses Multi-head Latent Attention (MLA) with compressed KV cache and absorbed key projection, combined with MoE routing. Partial GPU at 15.3 tok/s.
- Never-OOM: GPU pipeline automatically retries with fewer layers if VRAM allocation fails. Models of any size can run — throughput degrades gracefully but the system never crashes.
- Generation is within 11–25% of Ollama for most small models. The gap comes from Go+CGo dispatch overhead (channel-based worker pool, goroutine scheduling, CGo call bridge per matmul chunk).
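The never-OOM behaviour described in the notes can be sketched as a simple retry loop — names and the fixed 27-layer budget below are illustrative, not dlgo's internal API:

```go
package main

import (
	"errors"
	"fmt"
)

var errOutOfVRAM = errors.New("vulkan: out of device memory")

// tryOffload stands in for allocating n transformer layers on the GPU.
// For illustration it fails whenever n exceeds a fixed budget of 27 layers.
func tryOffload(n int) error {
	if n > 27 {
		return errOutOfVRAM
	}
	return nil
}

// offloadWithRetry implements the never-OOM pattern: start with every
// layer on the GPU and back off until the allocation fits, falling back
// to full CPU (0 GPU layers) in the worst case.
func offloadWithRetry(totalLayers int) int {
	for n := totalLayers; n > 0; n-- {
		if tryOffload(n) == nil {
			return n
		}
	}
	return 0 // pure CPU fallback
}

func main() {
	fmt.Printf("offloaded %d/48 layers\n", offloadWithRetry(48)) // offloaded 27/48 layers
}
```

The real pipeline estimates a VRAM budget up front and may back off in larger steps, but the shape is the same: throughput degrades gracefully instead of the load aborting.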
```shell
go get github.com/computerex/dlgo
```

```go
model, err := dlgo.LoadLLM("llama-3.2-1b-instruct-q4_k_m.gguf")
if err != nil {
	log.Fatal(err)
}
response, err := model.Chat(
	"You are a helpful assistant.",
	"Explain quantum computing in one sentence.",
	dlgo.WithMaxTokens(128),
	dlgo.WithTemperature(0.7),
)
fmt.Println(response)
```

```go
model, _ := dlgo.LoadLLM("model.gguf")
model.ChatStream("", "Write a poem about Go.", func(token string) {
	fmt.Print(token)
}, dlgo.WithMaxTokens(256))
```

```go
response, _ := model.ChatMessages([]dlgo.Message{
	{Role: "system", Content: "You are a pirate."},
	{Role: "user", Content: "Tell me about the sea."},
	{Role: "assistant", Content: "Arrr, the sea be vast!"},
	{Role: "user", Content: "What about treasure?"},
}, dlgo.WithMaxTokens(128))
```

```go
whisper, _ := dlgo.LoadWhisper("whisper-base.gguf", "tokenizer.json")
text, _ := whisper.TranscribeFile("audio.wav")
fmt.Println(text)
```

```go
dlgo.WithMaxTokens(256)   // max tokens to generate
dlgo.WithTemperature(0.8) // 0 = greedy, higher = more creative
dlgo.WithTopK(40)         // top-K sampling
dlgo.WithTopP(0.9)        // nucleus sampling
dlgo.WithGreedy()         // deterministic output
```

dlgo.go High-level API (LoadLLM, Chat, Generate, Stream)
core/ QuantizedTensor with row-level dequantization
quant/ 25+ GGML quantization formats, fused SIMD dot products
format/gguf/ GGUF v2/v3 parser
format/ggml/ Legacy GGML parser
gpu/ Vulkan GPU compute backend (MatVec, attention, RoPE, etc.)
ops/ RMSNorm, RoPE, Softmax, SwiGLU, GeGLU, sampling
blas/ Quantized matrix-vector multiply, parallel worker pool
layers/ Conv1D, LSTM, GRU, MHA, GQA, cross-attention
audio/ WAV loading, STFT, mel spectrogram
memory/ KV cache, buffer pool
decode/ Greedy decode, beam search
models/llm/ LLM pipeline (tokenizer, forward, generation, chat templates)
models/whisper/ Whisper speech-to-text
models/silero/ Silero voice activity detection
examples/ Ready-to-run examples
See `docs/quantization-guide.md` for detailed guidance on choosing quantization formats. Summary:
| Tier | Types | Speed | Use Case |
|---|---|---|---|
| Tier 1 (QxQ integer SIMD) | Q4_0, Q8_0, Q2_K–Q6_K, Q5_0 | Fastest | Recommended for all use |
| Tier 1b (fused SIMD) | IQ3_XXS, IQ4_XS | Fast | Optimized vpshufb/cvtepi8 kernels, used by MoE models |
| Tier 2 (float SIMD) | F16, Q4_1, Q5_1 | 2–4x slower | Functional, avoid if possible |
| Tier 3 (dequant+dot) | IQ1*, IQ2*, TQ*, BF16 | Slowest | Avoid for large models |
All common GGUF downloads (Q4_K_M, Q5_K_M, Q3_K_L, Q8_0, etc.) use only Tier 1 types.
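What makes the Tier 1 "QxQ integer" path fast is that both operands stay quantized: the inner loop multiplies int8 quants and accumulates in int32, applying the two block scales only once at the end. A scalar sketch for two Q8_0 blocks (illustrative types, not dlgo's kernels):

```go
package main

import "fmt"

// q8Block: 32 int8 quants with a per-block scale (float16 in real GGML).
type q8Block struct {
	scale  float32
	quants [32]int8
}

// dotQ8 computes the dot product of two Q8_0 blocks entirely in integer
// arithmetic, applying the two scales once at the end. This is the scalar
// form of what AVX2/VNNI kernels do with packed int8 multiply-accumulate.
func dotQ8(a, b q8Block) float32 {
	var acc int32
	for i := 0; i < 32; i++ {
		acc += int32(a.quants[i]) * int32(b.quants[i])
	}
	return float32(acc) * a.scale * b.scale
}

func main() {
	a := q8Block{scale: 0.1}
	b := q8Block{scale: 0.2}
	for i := range a.quants {
		a.quants[i], b.quants[i] = 1, 2
	}
	fmt.Printf("%.2f\n", dotQ8(a, b)) // 32 * (1*2) * 0.1 * 0.2 = 1.28
}
```

Tier 2 and Tier 3 formats lack this shortcut — they must dequantize to float before the dot product, which is where the 2–4x (and worse) slowdowns come from.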
GPU benchmarks on NVIDIA GeForce RTX 4070 Ti SUPER (16 GB VRAM). Same GGUF files loaded
into both engines. Greedy decoding, temperature=0, seed=42.
Ollama defaults to CUDA on NVIDIA GPUs. All tested models:
| Model | Quant | dlgo Vulkan | Ollama CUDA | Delta |
|---|---|---|---|---|
| Qwen3.5 0.8B | Q8_0 | 287 tok/s | 250 tok/s | +15% |
| Qwen3.5 27B | Q3_K_M | 6.4 tok/s | 4.0 tok/s | +60% |
| Qwen3.5 9B | Q3_K_M | 70.9 tok/s | 86.7 tok/s | −18% |
| Gemma 3 270M | Q8_0 | 550 tok/s | 570 tok/s | −3% |
| SmolLM2 360M | Q8_0 | 467 tok/s | 545 tok/s | −14% |
| SmolLM2 1.7B | Q4_K_M | 323 tok/s | 403 tok/s | −20% |
| Qwen3 0.6B | Q8_0 | 367 tok/s | 421 tok/s | −13% |
| Llama 3.2 1B | Q4_K_M | 420 tok/s | 501 tok/s | −16% |
| Gemma 3 1B | Q4_K_M | 288 tok/s | 338 tok/s | −15% |
| Qwen 2.5 0.5B | Q4_K_M | 447 tok/s | 666 tok/s | −33% |
| TinyLlama 1.1B | Q4_0 | 441 tok/s | 623 tok/s | −29% |
| Phi-4-mini 3.8B | Q3_K_M | 140 tok/s | 201 tok/s | −31% |
Qwen3.5 is 28% faster than Ollama's CUDA backend thanks to custom Vulkan compute shaders for the Gated Delta Net (GDN) SSM layers — the first GPU-accelerated pure Vulkan implementation of this architecture. The 7–25% gap to CUDA for standard attention models is the inherent Vulkan vs CUDA overhead on NVIDIA hardware. On non-NVIDIA GPUs (AMD, Intel, mobile), dlgo's Vulkan backend provides a significant advantage — Ollama's Vulkan backend is much slower for these quantization types.
Ollama forced to its Vulkan backend (`OLLAMA_VULKAN=1 OLLAMA_LLM_LIBRARY=vulkan`):
| Model | Quant | dlgo Vulkan | Ollama Vulkan | Delta | dlgo prefill | Ollama prefill |
|---|---|---|---|---|---|---|
| SmolLM2 360M | Q8_0 | 389 tok/s | 411 tok/s | −5% | 86 ms | 1431 ms |
| TinyLlama 1.1B | Q4_0 | 423 tok/s | 187 tok/s | +126% | 99 ms | 1966 ms |
| Qwen 2.5 0.5B | Q4_K_M | 394 tok/s | 237 tok/s | +66% | 68 ms | 2749 ms |
| Gemma 3 1B | Q4_K_M | 245 tok/s | 116 tok/s | +111% | 194 ms | 3399 ms |
dlgo beats Ollama's Vulkan backend on 3 of 4 models by 66–126%, and is within 5% on SmolLM2. Prefill is 10–50x faster across all models.
GPU implementation highlights:
- Pure Vulkan compute (cross-platform, no CUDA dependency)
- Quantized MatVec shaders with typed struct access (`float16_t`, `uint8_t`, `int8_t`)
- Fused layer dispatch (single CGo call per layer, minimized Go↔C overhead)
- Fused Add+RMSNorm kernel reducing barriers per layer
- Fused multi-head attention kernel (Q·K softmax, V accumulation)
- Custom SSM/GDN shaders (conv1d+SiLU, delta rule, L2 norm, sigmoid gate) — full Gated Delta Net on GPU
- Fused SwiGLU-OAI-Bias shader (bias + activation in one pass) for OpenAI gpt-oss MoE models
- GPU-side MoE top-K routing (`moe_topk` shader) — no CPU-GPU sync for expert selection
- Fused `scale_add` shader for weighted expert accumulation (one dispatch per expert instead of three)
- Batched expert dispatch — all experts' gate+up, activation, and down projections dispatched in parallel phases
- GPU-resident MoE expert biases with offset-indexed `add_offset` shader
- Never-OOM VRAM budget with retry loop (graceful degradation to fewer GPU layers)
- HOST_CACHED staging buffer for 14x faster CPU←GPU data transfer
- Single command buffer submission per token with batched dispatches
- Push descriptors (`VK_KHR_push_descriptor`) for minimal dispatch overhead
- `dp4a` integer dot product shaders (available for Q4_0, Q5_0, Q8_0, Q4_K, Q6_K)
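The top-K routing the `moe_topk` shader performs can be sketched on CPU as softmax gating over the router logits, selection of the k highest experts, and renormalization of their weights — a minimal illustration of the softmax gating variant only (dlgo also supports sigmoid and softmax-weight gating), with no relation to the shader's actual code:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// topKRouting: softmax the router logits, keep the k highest-probability
// experts, and renormalize their weights to sum to 1.
func topKRouting(logits []float64, k int) (idx []int, weight []float64) {
	// Numerically stable softmax over all expert logits.
	maxL := logits[0]
	for _, l := range logits {
		maxL = math.Max(maxL, l)
	}
	probs := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		probs[i] = math.Exp(l - maxL)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	// Pick the k experts with the highest probability.
	order := make([]int, len(probs))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(a, b int) bool { return probs[order[a]] > probs[order[b]] })
	idx = order[:k]
	// Renormalize the selected weights.
	var top float64
	for _, i := range idx {
		top += probs[i]
	}
	for _, i := range idx {
		weight = append(weight, probs[i]/top)
	}
	return idx, weight
}

func main() {
	idx, w := topKRouting([]float64{2.0, 0.5, 1.5, -1.0}, 2)
	fmt.Println(idx) // [0 2]
	fmt.Printf("%.3f %.3f\n", w[0], w[1])
}
```

Doing this selection on the GPU matters because the expert indices feed directly into the batched expert dispatch: keeping them in VRAM avoids a CPU-GPU round trip per token.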
- GGUF parser reads model metadata and tensor locations from the file
- Quantized tensors stay in their compressed format in memory — only dequantized on the fly during matrix multiplication
- Forward pass runs the model: embedding, RoPE, GQA attention, Multi-head Latent Attention (MLA), SwiGLU/GeGLU FFN, RMSNorm, hybrid SSM/attention (Gated Delta Net with delta rule recurrence), attention sinks, and Mixture-of-Experts (MoE) with multiple gating functions — architecture variations are expressed as a per-layer `LayerSpec` resolved at load time
- SIMD acceleration (optional, via CGo) uses AVX2+FMA+VNNI for QxQ integer dot products, fused vpshufb-based IQ3_XXS/IQ4_XS kernels, and batch prefill GEMM kernels
- Parallel matmul distributes rows across a persistent worker pool with fused multi-matrix dispatch; MoE layers use batched multi-expert dispatch for full core utilization
- GPU acceleration (optional, via Vulkan) offloads the entire forward pass to GPU — all weights, KV cache, and intermediate buffers reside in VRAM
- Token sampling supports temperature, top-K, top-P, min-P, and repetition penalty