🦙 llama-cpp-rs

Safe Rust bindings to llama.cpp, tracking upstream closely.

Crate	Description	crates.io
`llama-cpp-4`	Safe high-level API
`llama-cpp-sys-4`	Raw bindgen bindings

llama.cpp version: b8249 (March 2026)

Examples

Package name	Directory	Description
`simple`	`examples/simple/`	Single-turn text completion from CLI or Hugging Face
`chat`	`examples/chat/`	Interactive multi-turn chat REPL
`embeddings`	`examples/embeddings/`	Batch embedding with cosine similarity
`split-model-example`	`examples/split_model/`	Load sharded / split GGUF files
`openai-server`	`examples/server/`	OpenAI-compatible HTTP server with streaming and tool calling
`rpc-example`	`examples/rpc/`	Remote procedure call backend
`mtmd`	`examples/mtmd/`	Multimodal (vision / audio) inference (requires `--features mtmd`)

Quick start

git clone --recursive https://github.com/eugenehp/llama-cpp-rs
cd llama-cpp-rs

Interactive chat

cargo run -p chat -- \
    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf

OpenAI-compatible server

# Starts on http://127.0.0.1:8080
cargo run -p openai-server -- \
    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Chat completion
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}], "max_tokens":128}'

# Streaming
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Count to 5"}], "stream":true}'

# Embeddings
curl http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["Hello world", "Bonjour le monde"]}'

Text generation (library)

use llama_cpp_4::{
    llama_backend::LlamaBackend,
    llama_batch::LlamaBatch,
    model::{params::LlamaModelParams, AddBos, LlamaModel, Special},
    context::params::LlamaContextParams,
    sampling::LlamaSampler,
};
use std::num::NonZeroU32;

let backend = LlamaBackend::init()?;
let model = LlamaModel::load_from_file(&backend, "model.gguf", &LlamaModelParams::default())?;
let mut ctx = model.new_context(&backend, LlamaContextParams::default())?;

let tokens = model.str_to_token("Hello, world!", AddBos::Always)?;
let mut batch = LlamaBatch::new(512, 1);
for (i, &tok) in tokens.iter().enumerate() {
    batch.add(tok, i as i32, &[0], i == tokens.len() - 1)?;
}
ctx.decode(&mut batch)?;

let sampler = LlamaSampler::chain_simple([LlamaSampler::greedy()]);
// ... decode loop

GPU acceleration

Feature	Hardware	Flag
`cuda`	NVIDIA (CUDA)	`--features cuda`
`metal`	Apple Silicon	`--features metal`
`vulkan`	AMD / Intel / cross-platform	`--features vulkan`
`native`	CPU with AVX2/NEON auto-detect	`--features native`
`openmp`	Multi-core CPU (default on)	`--features openmp`
`rpc`	Remote compute backend	`--features rpc`

# Metal (macOS)
cargo run -p openai-server --features metal -- --n-gpu-layers 99 \
    local model.gguf

# CUDA (Linux/Windows)
cargo run -p openai-server --features cuda -- --n-gpu-layers 99 \
    local model.gguf

# Vulkan (cross-platform)
cargo run -p openai-server --features vulkan -- --n-gpu-layers 99 \
    hf-model bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf

Hugging Face model download

All examples and the server accept a hf-model <repo> [quant] subcommand that downloads models from the Hub (cached in ~/.cache/huggingface/).

# Interactive quant picker for repos with many options
cargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF

# Select by quant name (downloads all shards automatically)
cargo run -p openai-server -- hf-model unsloth/Qwen3.5-397B-A17B-GGUF Q4_K_M

# Exact filename
cargo run -p openai-server -- \
    hf-model TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf

Set HUGGING_FACE_HUB_TOKEN for gated models.

Development

# Clone with submodules (llama.cpp is a submodule of llama-cpp-sys-4)
git clone --recursive https://github.com/eugenehp/llama-cpp-rs

# Or after cloning without --recursive
git submodule update --init --recursive

# Build everything
cargo build

# Run all unit tests (no model required)
cargo test

# Run server unit tests specifically
cargo test -p openai-server

Updating llama.cpp

cd llama-cpp-sys-4/llama.cpp
git fetch --tags
git checkout b8249   # or latest tag
cd ../..
cargo build          # build.rs regenerates bindings automatically

Quick test

cargo run -p openai-server -- hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0

Send request via CURL:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"What is 2+2?"}],"stream":false,"max_tokens":500}'

Multimodal Images

Via the OpenAI-compatible server

cargo run -p openai-server --features mtmd --release -- \
    hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0

Or with an explicit mmproj path:

cargo run -p openai-server --features mtmd -- \
    --mmproj mmproj-BF16.gguf \
    hf-model unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_0

Standalone multimodal example

cargo run --features mtmd -p mtmd -- \
    --model /path/to/model.gguf \
    --mmproj /path/to/mmproj.gguf \
    --image /path/to/image.jpg \
    --prompt "Describe this image."

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this picture?"},
{"type": "image_url", "image_url": {"url":
"https://upload.wikimedia.org/wikipedia/commons/6/68/Orange_tabby_cat_sitting_on_fallen_leaves-Hisashi-01A.jpg"}}
]
}],
"max_tokens": 256
}'

Working end to end:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this picture?"},
{"type": "image_url", "image_url": {"url":
"https://upload.wikimedia.org/wikipedia/commons/6/68/Orange_tabby_cat_sitting_on_fallen_leaves-Hisashi-01A.jpg"}}
]
}],
"max_tokens": 256
}'
{"choices":[{"finish_reason":"length","index":0,"message":{"content":"<think>\n\n</think>\n\nThis picture feature a **ginger and white cat** sitting upright outdoor on a bed of dry, brown leaves.\n\nHere’s a detaile breakdo:\n\n- **Subject**: The main focus is the cat, which has a classic “orange tabby and white” coat pattern.\n  - Its fur is mostly ginger (orange) with distinc darker orange stripes — typical of a tabby.\n  - It has a white face mask around its nose and mouth, a white chest and belly, and white paws (“socks”).\n  - Its eyes are strikin greenish-yellow, and it’s looking directl at the camera with an alert, calm express.\n  - The ears are perked up, and whiskers are clearly visible.\n\n- **Setting**: The cat is seated on the ground covered in fallen, dried leaves — suggest an autumn or early winter scene, possibl in a garden or wooded area.\n  - The backgro is softly blurred (shallow depth of field), drawing attenti to the cat while hinting at more foliage or grass behind it.\n\n- **Photography Style**: The image is well-composed, with natural lightin that highlig the texture of the cat’s fur and the rich colors. The bokeh effect","role":"assistant"}}],"created":1773097066,"id":"chatcmpl-1773097066","model":"Qwen3.5-27B-Q4_0","object":"chat.completion","usage":{"completion_tokens":256,"prompt_tokens":0,"total_tokens":256}}%

Credits

Originally derived from llama-cpp-2 — thanks to those contributors.
See also bitnet-cpp-rs for highly-quantized BitNet model support.

Citation

If you use gpu-fft in academic work, please cite it as:

BibTeX

@software{hauptmann2025llamacpprs,
  author    = {Hauptmann, Eugene},
  title     = {{llama-cpp-4}: llama-cpp {Rust} wrapper},
  year      = {2025},
  version   = {0.2.9},
  url       = {https://github.com/eugenehp/llama-cpp-rs},
}

Plain text (APA)

Hauptmann, E. (2025). llama-cpp-4: llama-cpp Rust wrapper (v0.2.9). https://github.com/eugenehp/llama-cpp-rs

License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 1,251 Commits
.github		.github
.vscode		.vscode
examples		examples
llama-cpp-4		llama-cpp-4
llama-cpp-sys-4		llama-cpp-sys-4
scripts		scripts
.dockerignore		.dockerignore
.gitignore		.gitignore
.gitmodules		.gitmodules
CHANGELOG.md		CHANGELOG.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
LINUX.md		LINUX.md
MAC.md		MAC.md
README.md		README.md
WINDOWS.md		WINDOWS.md
test-build.Dockerfile		test-build.Dockerfile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🦙 llama-cpp-rs

Examples

Quick start

Interactive chat

OpenAI-compatible server

Text generation (library)

GPU acceleration

Hugging Face model download

Development

Updating llama.cpp

Quick test

Multimodal Images

Via the OpenAI-compatible server

Standalone multimodal example

Credits

Citation

License

Copyright

About

Licenses found

Uh oh!

Releases 8

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🦙 llama-cpp-rs

Examples

Quick start

Interactive chat

OpenAI-compatible server

Text generation (library)

GPU acceleration

Hugging Face model download

Development

Updating llama.cpp

Quick test

Multimodal Images

Via the OpenAI-compatible server

Standalone multimodal example

Credits

Citation

License

Copyright

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases 8

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages