Beta — 2.5 months of solo development. Correctness-validated on TinyLlama 1.1B (CPU + Vulkan). Actively developing multi-model support.
A from-scratch implementation of transformer inference with cryptographic integrity built into every layer:
- Pure Rust GGUF parser — no `unsafe` in parsing code, no C++ string-handling CVEs
- 11 hand-written Vulkan compute shaders — raw GLSL 450 compiled to SPIR-V, fused dequant + dot-product directly from quantized weight blocks. Includes a Q4_K shader that unpacks 144-byte super-blocks with 6-bit scale extraction from a sharded 12-byte metadata array. No cuBLAS, no GGML backend, no abstraction layers
- Merkle verification — blake3 hash tree over every tensor in the model file. If a single weight is modified, the root hash changes
- Kernel-level process sandbox — Windows Job Objects, Linux seccomp-BPF + Landlock, macOS Seatbelt. No child processes, no filesystem access, no escape. The model process cannot touch the host
- Signed audit trail — Ed25519-signed, hash-chained attestation records for every inference. Cryptographically tamper-evident
- 8.5× GPU speedup — hand-tuned Vulkan path vs scalar CPU on Q4_K_M. No NVIDIA lock-in
```sh
cargo install yule
```

For GPU acceleration:

```sh
cargo install yule --features vulkan
```

Requires Rust 1.85+ (2024 edition). The Vulkan feature requires the Vulkan SDK.
```sh
yule run ./model.gguf --prompt "Explain quantum computing in one sentence"
```

Options:

- `--max-tokens 512` — generation limit (default: 512)
- `--temperature 0.7` — sampling temperature (default: 0.7)
- `--backend auto|cpu|vulkan` — compute backend (default: auto)
- `--no-sandbox` — disable the process sandbox (not recommended)
```sh
yule pull TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
yule list
```

```sh
yule serve ./model.gguf
```

Prints a capability token to stderr. All requests require `Authorization: Bearer <token>`.

Options:

- `--bind 127.0.0.1:11434` — listen address
- `--token <token>` — use a specific token instead of generating one
- `--no-sandbox` — disable the process sandbox
Yule-native — every response includes an integrity proof (merkle root, sandbox status, attestation ID):

```sh
curl -H "Authorization: Bearer $TOKEN" http://localhost:11434/yule/health
curl -H "Authorization: Bearer $TOKEN" http://localhost:11434/yule/model
curl -H "Authorization: Bearer $TOKEN" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}' \
  http://localhost:11434/yule/chat
```

OpenAI-compatible — drop-in replacement for tool integration:

```sh
curl -H "Authorization: Bearer $TOKEN" \
  -d '{"model":"m","messages":[{"role":"user","content":"Hello"}]}' \
  http://localhost:11434/v1/chat/completions
```

```sh
yule verify ./model.gguf    # merkle tree + per-tensor hashes
yule inspect ./model.gguf   # raw GGUF metadata
yule audit --last 10        # recent attestation records
yule audit --verify-chain   # validate the entire hash chain
```

Architectures: Llama, Mistral, Phi-3, Qwen2, Gemma2.
Quantizations: Q4_0, Q4_K, Q5_K, Q6_K, Q8_0, Q8_K, Q2_K, Q3_K, F16, F32, BF16.
AVX2 SIMD: Q4_0 (3.9×), Q8_0 (5.0×), Q4_K (1.9×), Q6_K — with runtime CPU detection and automatic dispatch.
Vulkan GPU: Q4_0, Q4_K, Q8_0 — hand-written compute shaders, no framework dependency.
10-crate Cargo workspace. 107 tests; no `unsafe` in application code — the only `unsafe` blocks live in SIMD intrinsics and mmap.
| Crate | What it does |
|---|---|
| `yule-core` | GGUF parser, 11 dequant kernels (scalar + AVX2), BPE tokenizer |
| `yule-infer` | Forward pass (CPU + GPU), GQA attention, KV cache, constant-time sampler |
| `yule-gpu` | `ComputeBackend` trait, Vulkan backend (ash + gpu-allocator), 11 SPIR-V shaders |
| `yule-verify` | blake3 merkle trees, Ed25519 signatures, model manifests |
| `yule-attest` | Hash-chained attestation records, device key management |
| `yule-sandbox` | Windows Job Objects, Linux seccomp-BPF + Landlock, macOS Seatbelt |
| `yule-api` | Axum server, capability-token auth, SSE streaming, OpenAI compatibility |
| `yule-registry` | HuggingFace model pull, local cache, quant detection |
| `yule-cli` | CLI entry point |
| `yule-bench` | Criterion benchmarks |
The Vulkan path uploads quantized weights to VRAM at init, then runs the full forward pass as a single command buffer submission. Each quantized matmul shader reads raw quantized bytes from SSBOs and does fused dequant+dot — no VRAM bloat from dequantized weight copies. The Q4_K shader handles 144-byte super-blocks with a non-trivial 6-bit scale/min extraction that matches the CPU path exactly.
Embedding lookup and sampling stay on CPU. Auto-detects discrete GPUs at startup, falls back to CPU if Vulkan is unavailable.
HTTP request → Axum → mpsc channel → Inference Thread → tokens → channel → SSE/JSON response
Model inference runs on a dedicated `std::thread` (mmap refs aren't `Send`); the async HTTP side communicates with it over channels.
Every inference session produces a signed attestation record: model merkle root, sandbox status, blake3(prompt), blake3(output), token count, sampling parameters. Records are hash-chained — each includes blake3 of the previous record. The log at ~/.yule/audit.jsonl is append-only and tamper-evident.
```sh
cargo build                    # CPU only
cargo build --features vulkan  # with Vulkan GPU
cargo test -j 1                # all tests
cargo bench -p yule-bench      # dequant benchmarks
```

Apache-2.0