gguf-runner

gguf-runner is a small command-line tool that lets you run AI models on your own machine.

The idea behind this project is simple: local AI should feel like a normal Unix-style tool. You point it to a .gguf model file, ask a question, and stream the answer in your terminal. No cloud API, no GPU setup maze, and no heavy platform around it.

It is built for people who want to:

run models fully offline
keep data local
script prompts in shell workflows
experiment with different model sizes on regular hardware

Under the hood, gguf-runner uses memory mapping (mmap) and CPU-only inference, so execution is not constrained by GPU availability or by fixed GPU memory (VRAM) limits. In theory, the upper bound on model size shifts toward storage capacity, with the tradeoff that working sets larger than RAM run slower because pages have to be re-read from disk. In practice, speed depends heavily on the operating system's file cache: once the model's pages are cached in memory, warm runs can feel much faster than cold starts.

If you are new to the project, start with the quick steps below and you should get your first response in a few minutes.

Getting Started

Install gguf-runner (choose one):

Option A: prebuilt binary from GitHub Releases

tar -xzf gguf-runner-<tag>-linux-amd64.tar.gz

Option B: install from source with Cargo

# default (portable)
cargo install --git https://github.com/apimeister/gguf-runner

# optimized for this machine (recommended)
RUSTFLAGS="-C target-cpu=native" cargo install --git https://github.com/apimeister/gguf-runner

On an AMD Ryzen 7 PRO 8700GE, building with target-cpu=native improved throughput and runtime:

tok/s: 5.668 -> 6.848 (+20.8%)
runtime: 215.522s -> 178.041s (-17.4%)
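The percentages can be reproduced from the raw figures. The snippet below is only a sanity check and is not part of gguf-runner; `pct_change` is a hypothetical helper name, and it assumes an `awk` on your PATH.

```shell
# Relative change from old ($1) to new ($2), as a signed percentage
# rounded to one decimal place.
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%+.1f%%\n", (new / old - 1) * 100 }'
}

pct_change 5.668 6.848      # tokens per second, prints +20.8%
pct_change 215.522 178.041  # runtime in seconds, prints -17.4%
```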

Note: binaries built with target-cpu=native are tuned for the build machine's CPU. They may fail to run (typically with an illegal-instruction error) on CPUs that lack the same instruction-set extensions.

Verify installation and CPU feature detection:

gguf-runner --show-features

If you used a release archive and did not move the binary into a directory on your PATH, run it from the extraction directory:

./gguf-runner --show-features

Download Qwen3.5-0.8B:

wget https://huggingface.co/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-Q4_K_M.gguf

Run a first text prompt:

gguf-runner \
  --model ./Qwen3.5-0.8B-Q4_K_M.gguf \
  --prompt "hello"

(Optional) Run a vision prompt with Qwen3.5:

wget https://huggingface.co/unsloth/Qwen3.5-2B-GGUF/resolve/main/Qwen3.5-2B-Q4_K_M.gguf
wget -O mmproj-Qwen3.5-2B-F16.gguf https://huggingface.co/unsloth/Qwen3.5-2B-GGUF/resolve/main/mmproj-F16.gguf
gguf-runner \
  --model ./Qwen3.5-2B-Q4_K_M.gguf \
  --image sample-image.jpg \
  --prompt "Describe this image."

More model download examples:

docs/downloading-models.md

Working Models

Known-good status, based on the text benchmarks in docs/performance.md and on local model/mmproj availability.

Model	Text	Vision
`gemma-3-4b-it-Q4_K_M.gguf`	✅	✅
`Meta-Llama-3-8B-Instruct-Q4_K_M.gguf`	✅	❌
`Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf`	✅	❌
`Qwen3-0.6B-Q4_K_M.gguf`	✅	❌
`Qwen3-4B-Instruct-2507-Q4_K_M.gguf`	✅	❌
`Qwen3-30B-A3B-Instruct-2507-Q4_K_S.gguf`	✅	❌
`Qwen3-Coder-Next-Q4_K_M.gguf`	✅	❌
`Qwen3-VL-2B-Instruct-Q4_K_M.gguf`	⚪	✅
`Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf`	⚪	✅
`Qwen3.5-0.8B-Q4_K_M.gguf`	✅	✅
`Qwen3.5-2B-Q4_K_M.gguf`	✅	✅
`Qwen3.5-35B-A3B-UD-Q4_K_M.gguf`	✅	✅

What You Need

A local .gguf model file.
Enough RAM for the model you choose.
Rust toolchain (only if you build from source).

Operation Modes

gguf-runner has two distinct operation modes selected with --mode:

Oneshot mode (default)

Oneshot mode runs a single prompt and exits. The model loads, generates a response, prints it to stdout, then terminates. This is the default when --mode is not specified.

gguf-runner \
  --model ./Qwen3.5-0.8B-Q4_K_M.gguf \
  --prompt "What is the capital of France?"

When to use oneshot:

Scripting and automation — pipe the output to other tools
One-off questions where you do not need a follow-up
CI pipelines, cron jobs, or any non-interactive context

Key characteristics:

--prompt is required
Tools are disabled by default (pass --allowed-tools all to enable)
No persistent chat history — each invocation starts fresh
Output goes to stdout, making it easy to capture or pipe

Scripting examples:

# Capture output to a variable
SUMMARY=$(gguf-runner --model model.gguf --prompt "Summarize: $(cat notes.txt)")

# Pipe into another command
gguf-runner --model model.gguf --prompt "List five ideas for a project name" | fzf

# Use in a shell script
for file in *.md; do
  gguf-runner --model model.gguf --prompt "Summarize this: $(cat "$file")" > "${file%.md}.summary"
done

REPL mode

REPL mode starts an interactive terminal session. The model loads once and stays in memory. You type prompts and get responses in a continuous loop, with full chat history carried across turns.

gguf-runner \
  --model ./Qwen3.5-0.8B-Q4_K_M.gguf \
  --mode repl

When to use REPL:

Multi-turn conversations where context matters
Exploratory sessions — ask follow-up questions
Agentic work with file and shell tools
Any time you want the model to remember what you said earlier in the session

Key characteristics:

Tools are enabled by default (pass --allowed-tools none to disable)
Chat history accumulates within the session
The model loads once — subsequent prompts pay no load cost
A status bar shows token count, speed, and context usage

Slash commands (type /help inside the REPL for the current list):

Command	Effect
`/help`	Show available commands
`/model`	Print the active model path
`/image <path>`	Attach an image to the next prompt (vision models)
`/images`	List currently attached images
`/clear-images`	Remove all image attachments
`/clear`	Reset chat history and image attachments
`/exit` or `/quit`	Exit the REPL

Tab completion works for slash commands: type /e and press Tab to expand to /exit.

Use Ctrl+C or Esc to exit at any time.

Starting with an initial prompt:

You can pass --prompt alongside --mode repl to send a first message automatically once the model is ready:

gguf-runner \
  --model ./Qwen3.5-0.8B-Q4_K_M.gguf \
  --mode repl \
  --prompt "Hello, let's start by you telling me your name."

Tools and Agent Capabilities

Both modes support an agent layer that lets the model read files, list directories, write files, and run approved shell commands. Tools are off by default in oneshot and on by default in REPL.

Enabling and restricting tools

# Enable all tools (oneshot — disabled by default)
gguf-runner --model model.gguf --prompt "..." --allowed-tools all

# Enable only specific tools
gguf-runner --model model.gguf --mode repl --allowed-tools read_file,list_dir

# Disable all tools (REPL — enabled by default)
gguf-runner --model model.gguf --mode repl --allowed-tools none

Available tool names: read_file, list_dir, write_file, mkdir, rmdir, shell_list_allowed, shell_exec, shell_request_allowed

Restricting file access

Use --tool-root to confine file operations to a specific directory. Without it, the current working directory is used as the root.

gguf-runner \
  --model model.gguf \
  --mode repl \
  --tool-root ./my-project
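To make the idea of a tool root concrete, here is a minimal sketch of a path-confinement check. It is not gguf-runner's implementation, which is not described here; `inside_root` is a hypothetical helper, and the sketch assumes GNU `realpath` (for the `-m` option).

```shell
# Hypothetical helper, not gguf-runner's code: succeed only if path $2,
# once "." and ".." components and symlinks are resolved, lies inside root $1.
# Assumes GNU realpath; -m also resolves paths that do not exist yet.
inside_root() {
  root=$(realpath -m -- "$1") || return 1
  target=$(realpath -m -- "$2") || return 1
  case "$target" in
    "$root" | "$root"/*) return 0 ;;  # the root itself or anything below it
    *) return 1 ;;                     # resolved outside the root, e.g. via ".."
  esac
}

inside_root ./my-project ./my-project/src/main.rs && echo "allowed"
inside_root ./my-project ./my-project/../secrets.txt || echo "rejected"
```

Resolving the path before comparing is what defeats `..` tricks, and requiring a `/` after the root prevents a sibling directory such as `my-project-old` from passing as a match.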

Allowing shell commands

Shell execution is sandboxed: the model can only run commands you explicitly allowlist.

# Allow specific commands on the command line
gguf-runner --model model.gguf --mode repl \
  --allow-shell-command cargo \
  --allow-shell-command git

# Or via environment variable (comma-separated)
GGUF_ALLOW_SHELL_COMMANDS=cargo,git gguf-runner --model model.gguf --mode repl
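Conceptually, an allowlist check on a comma-separated list like the one above can be sketched as follows. This illustrates the general technique only; `is_allowed` is a hypothetical helper, and how gguf-runner parses and matches the list internally is an assumption.

```shell
# Hypothetical helper, not gguf-runner's code: succeed if command name $1
# appears as a whole entry in the comma-separated list $2.
is_allowed() {
  case ",$2," in
    *",$1,"*) return 0 ;;  # surrounding commas force a whole-entry match
    *) return 1 ;;
  esac
}

allowlist="cargo,git"
is_allowed git "$allowlist" && echo "git: allowed"
is_allowed rm "$allowlist" || echo "rm: blocked"
```

Wrapping both the list and the candidate in commas means a partial name such as `gi` does not match the `git` entry.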

Config file

Persistent tool and shell settings can be stored in a TOML config file. gguf-runner checks two locations in order, with the project-local file taking precedence:

~/.gguf-runner/config.toml (user-wide defaults)
./.gguf-runner/config.toml (per-project overrides)

# .gguf-runner/config.toml

[tools]
# Disable individual tools if needed
write_file = false
rmdir = false

[shell]
# Allowlist shell commands with optional descriptions
# Descriptions help the model understand when to use each command
[shell.md]
cargo = "Rust build and test tool"
git = "Version control"
rg = "Fast grep (ripgrep)"

How tool routing works in REPL

In REPL mode with tools enabled, gguf-runner inspects each prompt before sending it to the model. Plain conversational questions go directly to a fast chat path. Prompts that clearly need file or shell access (mentioning files, directories, cargo, git, etc.) go through the full agent loop. This means you can mix plain chat and tool-assisted requests in the same session without configuration changes.
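The routing decision can be pictured as a simple keyword check. The sketch below is a deliberately naive illustration and not gguf-runner's actual heuristic; `route_prompt` and its keyword list are assumptions taken from the examples in the paragraph above.

```shell
# Hypothetical router, not gguf-runner's code: print "agent" when the prompt
# mentions a tool-related keyword, otherwise print "chat".
route_prompt() {
  prompt=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$prompt" in
    *file* | *director* | *cargo* | *git*) echo agent ;;
    *) echo chat ;;
  esac
}

route_prompt "What is the capital of France?"   # prints "chat"
route_prompt "Run cargo test and summarize"      # prints "agent"
```

Plain substring matching like this produces false positives (for example, "digit" contains "git"), so a real router would need something more careful.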

Basic Command Pattern

gguf-runner \
  --model ./your-model.gguf \
  --prompt "Your question"

Most common options (and what they do):

--mode oneshot|repl: oneshot runs one request and exits. repl keeps an interactive prompt loop until you type /exit or /quit.
--allowed-tools <list>: Comma-separated tool allowlist, or all / none (none disables all tools).
- defaults: oneshot => none, repl => all
--max-tokens 256: Maximum number of generated output tokens. Use lower values for short answers and faster test runs.
--context-size 4096: Context window size in tokens. Sets how much prompt and conversation history the model can keep in view.
--temperature 0.7: Controls randomness. Lower values are more deterministic, higher values are more creative and variable.
--threads 8: Number of CPU threads to use. A value close to the number of available CPU cores is usually a good starting point.
--think yes|no|hidden: Controls thinking output for reasoning models (Qwen3, Qwen3.5, etc.).
- yes — show the model's thinking steps (default for oneshot)
- hidden — suppress thinking, show only the final answer (default for REPL)
- no — skip the thinking phase entirely for faster, shorter responses
--show-features: Prints detected CPU features (compiled vs runtime) and exits.
--show-tokens: Streams token-level output/diagnostics while generating.
--show-timings: Prints timing breakdowns so you can inspect performance bottlenecks.
--profiling: Enables deeper profiling output for performance analysis.
--debug: Enables additional debug logging/details during execution.
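For --threads, one way to pick a starting value is to ask the system how many CPUs are online. The helper below is a small sketch; `cpu_count` is a hypothetical name, and it assumes either `getconf` or `nproc` is available.

```shell
# Print the number of online CPUs, falling back to 1 if it cannot be
# determined. getconf _NPROCESSORS_ONLN works on Linux and macOS; nproc is
# the GNU coreutils equivalent.
cpu_count() {
  getconf _NPROCESSORS_ONLN 2>/dev/null || nproc 2>/dev/null || echo 1
}

echo "threads: $(cpu_count)"
```

You could then pass `--threads "$(cpu_count)"` on the command line. The reported number counts logical CPUs, so on machines with SMT/hyper-threading a lower value may perform better; treat it as a starting point to measure against.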

Vision Example (Image Input)

For vision-capable models (for example Qwen3-VL / Qwen3.5 multimodal variants):

gguf-runner \
  --model ./Qwen3-VL-2B-Instruct-Q4_K_M.gguf \
  --image ./regression/IMG_0138.jpg \
  --prompt "Describe this image."

In REPL mode, attach images with the /image command before your prompt:

[you] > /image ./screenshot.png
[you] > What does this error message say?

If required multimodal tensors/components are missing, the runner fails fast with a clear error.

Project Scope

CPU inference only
GGUF model files only
Focus on clear, readable implementation

Useful Docs

Feature coverage: docs/features.md
Performance history: docs/performance.md
Performance ideas/tuning notes: docs/performance-improvement-suggestions.md
Module layout: docs/module-structure.md

GGUF Metadata Dump (No Inference)

cargo run --example gguf_dump -- --model ./model.gguf --dump-kv --dump-tensors

Suggestions and PRs are welcome.
