gguf-runner is a small command-line tool that lets you run AI models on your own machine.
The idea behind this project is simple: local AI should feel like a normal Unix-style tool.
You point it to a .gguf model file, ask a question, and stream the answer in your terminal.
No cloud API, no GPU setup maze, and no heavy platform around it.
It is built for people who want to:
- run models fully offline
- keep data local
- script prompts in shell workflows
- experiment with different model sizes on regular hardware
Under the hood, gguf-runner uses memory mapping (mmap) and CPU-only inference.
This means execution is not constrained by GPU availability or fixed GPU memory (VRAM) limits.
In theory, the upper bound shifts toward storage capacity, with the tradeoff that larger working sets become slower.
In practice, performance is often as good as your filesystem caching behavior allows, so warm-cache runs can feel much faster than cold starts.
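One way to exploit this yourself is to prime the page cache before the first run. This is a sketch, not a gguf-runner feature: the model path is illustrative, and any sequential read of the file has the same effect.

```sh
# Prime the OS page cache; a later mmap of the same file then finds
# most pages already resident, so the cold-start stall shrinks.
MODEL=./Qwen3.5-0.8B-Q4_K_M.gguf   # illustrative path
if [ -f "$MODEL" ]; then
  cat "$MODEL" > /dev/null
fi
# A subsequent run starts warm:
# gguf-runner --model "$MODEL" --prompt "hello"
```

The operating system does this implicitly after any first run; the explicit read just lets you pay the I/O cost at a time of your choosing.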
If you are new to the project, start with the quick steps below and you should get your first response in a few minutes.
- Install gguf-runner (choose one):

  Option A: prebuilt binary from GitHub Releases

  ```sh
  tar -xzf gguf-runner-<tag>-linux-amd64.tar.gz
  ```

  Option B: install from source with Cargo

  ```sh
  # default (portable)
  cargo install --git https://github.com/apimeister/gguf-runner

  # optimized for this machine (recommended)
  RUSTFLAGS="-C target-cpu=native" cargo install --git https://github.com/apimeister/gguf-runner
  ```

  On an AMD Ryzen 7 PRO 8700GE, target-cpu=native improved:

  - tok/s: 5.668 -> 6.848 (+20.8%)
  - runtime: 215.522 s -> 178.041 s (-17.4%)

  Note: target-cpu=native binaries are tuned for the build machine and are less portable across different CPUs.
- Verify installation and CPU feature detection:

  ```sh
  gguf-runner --show-features
  ```

  If you used a release archive and did not move the binary into your PATH, run:

  ```sh
  ./gguf-runner --show-features
  ```

- Download Qwen3.5-0.8B:

  ```sh
  wget https://huggingface.co/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-Q4_K_M.gguf
  ```

- Run a first text prompt:

  ```sh
  gguf-runner \
    --model ./Qwen3.5-0.8B-Q4_K_M.gguf \
    --prompt "hello"
  ```

- (Optional) Run a vision prompt with Qwen3.5:

  ```sh
  wget https://huggingface.co/unsloth/Qwen3.5-2B-GGUF/resolve/main/Qwen3.5-2B-Q4_K_M.gguf
  wget -O mmproj-Qwen3.5-2B-F16.gguf https://huggingface.co/unsloth/Qwen3.5-2B-GGUF/resolve/main/mmproj-F16.gguf

  gguf-runner \
    --model ./Qwen3.5-2B-Q4_K_M.gguf \
    --image sample-image.jpg \
    --prompt "Describe that image."
  ```

More model download examples: docs/downloading-models.md
Known-good status from docs/performance.md (text benchmarks) and local model/mmproj availability:

| Model | Text | Vision |
|---|---|---|
| gemma-3-4b-it-Q4_K_M.gguf | ✅ | ✅ |
| Meta-Llama-3-8B-Instruct-Q4_K_M.gguf | ✅ | ❌ |
| Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf | ✅ | ❌ |
| Qwen3-0.6B-Q4_K_M.gguf | ✅ | ❌ |
| Qwen3-4B-Instruct-2507-Q4_K_M.gguf | ✅ | ❌ |
| Qwen3-30B-A3B-Instruct-2507-Q4_K_S.gguf | ✅ | ❌ |
| Qwen3-Coder-Next-Q4_K_M.gguf | ✅ | ❌ |
| Qwen3-VL-2B-Instruct-Q4_K_M.gguf | ⚪ | ✅ |
| Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf | ⚪ | ✅ |
| Qwen3.5-0.8B-Q4_K_M.gguf | ✅ | ✅ |
| Qwen3.5-2B-Q4_K_M.gguf | ✅ | ✅ |
| Qwen3.5-35B-A3B-UD-Q4_K_M.gguf | ✅ | ✅ |
- A local .gguf model file.
- Enough RAM for the model you choose.
- Rust toolchain (only if you build from source).
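To gauge the RAM requirement, you can compare the model's on-disk size with currently available memory. Thanks to mmap, a model larger than RAM still runs, it just pages from disk and slows down. A Linux-only sketch (the model path is illustrative):

```sh
# Compare model size against MemAvailable (Linux only).
MODEL=./Qwen3.5-0.8B-Q4_K_M.gguf   # illustrative path
if [ -f "$MODEL" ]; then
  model_kb=$(du -k "$MODEL" | cut -f1)
  avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
  if [ "$avail_kb" -gt "$model_kb" ]; then
    echo "model should fit in RAM"
  else
    echo "model will page from disk"
  fi
fi
```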
gguf-runner has two distinct operation modes selected with --mode:
Oneshot mode runs a single prompt and exits. The model loads, generates a response, prints it to stdout, then terminates. This is the default when --mode is not specified.
```sh
gguf-runner \
  --model ./Qwen3.5-0.8B-Q4_K_M.gguf \
  --prompt "What is the capital of France?"
```

When to use oneshot:
- Scripting and automation — pipe the output to other tools
- One-off questions where you do not need a follow-up
- CI pipelines, cron jobs, or any non-interactive context
Key characteristics:

- --prompt is required
- Tools are disabled by default (pass --allowed-tools all to enable)
- No persistent chat history — each invocation starts fresh
- Output goes to stdout, making it easy to capture or pipe
Scripting examples:

```sh
# Capture output to a variable
SUMMARY=$(gguf-runner --model model.gguf --prompt "Summarize: $(cat notes.txt)")

# Pipe into another command
gguf-runner --model model.gguf --prompt "List five ideas for a project name" | fzf

# Use in a shell script
for file in *.md; do
  gguf-runner --model model.gguf --prompt "Summarize this: $(cat "$file")" > "${file%.md}.summary"
done
```

REPL mode starts an interactive terminal session. The model loads once and stays in memory. You type prompts and get responses in a continuous loop, with full chat history carried across turns.
```sh
gguf-runner \
  --model ./Qwen3.5-0.8B-Q4_K_M.gguf \
  --mode repl
```

When to use REPL:
- Multi-turn conversations where context matters
- Exploratory sessions — ask follow-up questions
- Agentic work with file and shell tools
- Any time you want the model to remember what you said earlier in the session
Key characteristics:

- Tools are enabled by default (pass --allowed-tools none to disable)
- Chat history accumulates within the session
- The model loads once — subsequent prompts pay no load cost
- A status bar shows token count, speed, and context usage
Slash commands (type /help inside the REPL for the current list):
| Command | Effect |
|---|---|
| /help | Show available commands |
| /model | Print the active model path |
| /image <path> | Attach an image to the next prompt (vision models) |
| /images | List currently attached images |
| /clear-images | Remove all image attachments |
| /clear | Reset chat history and image attachments |
| /exit or /quit | Exit the REPL |
Tab completion works for slash commands: type /e and press Tab to expand to /exit.
Use Ctrl+C or Esc to exit at any time.
Starting with an initial prompt:
You can pass --prompt alongside --mode repl to send a first message automatically once the model is ready:
```sh
gguf-runner \
  --model ./Qwen3.5-0.8B-Q4_K_M.gguf \
  --mode repl \
  --prompt "Hello, let's start by you telling me your name."
```

Both modes support an agent layer that lets the model read files, list directories, write files, and run approved shell commands. Tools are off by default in oneshot and on by default in REPL.
```sh
# Enable all tools (oneshot — disabled by default)
gguf-runner --model model.gguf --prompt "..." --allowed-tools all

# Enable only specific tools
gguf-runner --model model.gguf --mode repl --allowed-tools read_file,list_dir

# Disable all tools (REPL — enabled by default)
gguf-runner --model model.gguf --mode repl --allowed-tools none
```

Available tool names: read_file, list_dir, write_file, mkdir, rmdir, shell_list_allowed, shell_exec, shell_request_allowed
Use --tool-root to confine file operations to a specific directory. Without it, the current working directory is used as the root.
```sh
gguf-runner \
  --model model.gguf \
  --mode repl \
  --tool-root ./my-project
```

Shell execution is sandboxed: the model can only run commands you explicitly allowlist.
```sh
# Allow specific commands on the command line
gguf-runner --model model.gguf --mode repl \
  --allow-shell-command cargo \
  --allow-shell-command git

# Or via environment variable (comma-separated)
GGUF_ALLOW_SHELL_COMMANDS=cargo,git gguf-runner --model model.gguf --mode repl
```

Persistent tool and shell settings can be stored in a TOML config file. gguf-runner checks two locations in order, with the project-local file taking precedence:
- ~/.gguf-runner/config.toml (user-wide defaults)
- ./.gguf-runner/config.toml (per-project overrides)
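You can check which of the two locations exist on your machine with plain shell; this is just a convenience sketch, since gguf-runner performs the same lookup itself:

```sh
# List the config files gguf-runner would consider, in lookup order.
for f in "$HOME/.gguf-runner/config.toml" "./.gguf-runner/config.toml"; do
  if [ -f "$f" ]; then
    echo "found: $f"
  fi
done
```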
```toml
# .gguf-runner/config.toml

[tools]
# Disable individual tools if needed
write_file = false
rmdir = false

[shell]
# Allowlist shell commands with optional descriptions
# Descriptions help the model understand when to use each command

[shell.md]
cargo = "Rust build and test tool"
git = "Version control"
rg = "Fast grep (ripgrep)"
```

In REPL mode with tools enabled, gguf-runner inspects each prompt before sending it to the model. Plain conversational questions go directly to a fast chat path. Prompts that clearly need file or shell access (mentioning files, directories, cargo, git, etc.) go through the full agent loop. This means you can mix plain chat and tool-assisted requests in the same session without configuration changes.
```sh
gguf-runner \
  --model ./your-model.gguf \
  --prompt "Your question"
```

Most common options (and what they do):

- --mode oneshot|repl: oneshot runs one request and exits. repl keeps an interactive prompt loop until you type /exit or /quit.
- --allowed-tools <list>: Comma-separated tool allowlist, or all/none (none disables all tools). Defaults: oneshot => none, repl => all.
- --max-tokens 256: Maximum number of generated output tokens. Use lower values for short answers and faster test runs.
- --context-size 4096: Sets how much conversation/history the model can keep in context.
- --temperature 0.7: Controls randomness. Lower is more deterministic, higher is more creative/variable.
- --threads 8: Number of CPU threads to use. Usually set this near your available CPU cores.
- --think yes|no|hidden: Controls thinking output for reasoning models (Qwen3, Qwen3.5, etc.).
  - yes — show the model's thinking steps (default for oneshot)
  - hidden — suppress thinking, show only the final answer (default for REPL)
  - no — skip the thinking phase entirely for faster, shorter responses
- --show-features: Prints detected CPU features (compiled vs runtime) and exits.
- --show-tokens: Streams token-level output/diagnostics while generating.
- --show-timings: Prints timing breakdowns so you can inspect performance bottlenecks.
- --profiling: Enables deeper profiling output for performance analysis.
- --debug: Enables additional debug logging/details during execution.
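Putting several of these together, a quick smoke-test invocation might look like the following. This is a sketch: the flag values are illustrative, and the run is guarded so the snippet is a no-op when gguf-runner is not on your PATH.

```sh
# Short, mostly deterministic run: few tokens, low temperature, no thinking
# phase, with timing output for a quick performance check.
if command -v gguf-runner >/dev/null 2>&1; then
  gguf-runner \
    --model ./Qwen3.5-0.8B-Q4_K_M.gguf \
    --prompt "Name three prime numbers." \
    --max-tokens 64 \
    --temperature 0.2 \
    --threads 4 \
    --think no \
    --show-timings
fi
```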
For vision-capable models (for example Qwen3-VL / Qwen3.5 multimodal variants):
```sh
gguf-runner \
  --model ./Qwen3-VL-2B-Instruct-Q4_K_M.gguf \
  --image ./regression/IMG_0138.jpg \
  --prompt "Describe this image."
```

In REPL mode, attach images with the /image command before your prompt:

```
[you] > /image ./screenshot.png
[you] > What does this error message say?
```
If required multimodal tensors/components are missing, the runner fails fast with a clear error.
- CPU inference only
- GGUF model files only
- Focus on clear, readable implementation
- Feature coverage: docs/features.md
- Performance history: docs/performance.md
- Performance ideas/tuning notes: docs/performance-improvement-suggestions.md
- Module layout: docs/module-structure.md
```sh
cargo run --example gguf_dump -- --model ./model.gguf --dump-kv --dump-tensors
```

Suggestions and PRs are welcome.