
Improve installation DX: prebuilt wheels for 3.13/3.14/3.14t + declarative backend selection #2136

@clemlesne

Description


Problem

Installing llama-cpp-python with a GPU backend requires setting CMAKE_ARGS as an environment variable at build time:

```shell
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

This creates pain across the ecosystem:

  1. Not declarable in pyproject.toml — Every downstream project needs custom Makefiles or install scripts with GPU auto-detection logic (macOS → Metal, nvidia-smi → CUDA, rocminfo → ROCm, fallback → OpenBLAS). This is duplicated across hundreds of projects.

  2. Cache invalidation is broken — `pip` and `uv` cache wheels by package version, not by `CMAKE_ARGS`. A cached OpenBLAS wheel is silently reused when Metal or CUDA is requested. The workaround, `--no-cache`, defeats caching entirely.

  3. GPU prebuilt wheels stop at Python 3.12 — The Metal wheel CI (build-wheels-metal.yaml) is hardcoded to CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-*". The CUDA wheel CI (build-wheels-cuda.yaml) has its matrix pinned to Python 3.9-3.12. CPU-only wheels include 3.13 (via default cibuildwheel config in build-and-release.yaml), but the arm64 job there also pins to cp38-cp312. No workflow produces 3.14 or free-threaded (3.13t/3.14t) wheels. Python 3.13 has been stable since Oct 2024, 3.14 since Oct 2025. Free-threaded builds are increasingly important — vLLM, llguidance, and the broader no-GIL ecosystem depend on them.
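The auto-detection logic from point 1 can be sketched in a few lines of shell. This is a hypothetical sketch of what each downstream project currently re-implements; the `-DGGML_*` flag names are assumptions based on ggml's CMake options and may differ by version:

```shell
#!/bin/sh
# Hypothetical sketch of the per-project backend auto-detection described above.
# Flag names (GGML_METAL, GGML_CUDA, GGML_HIPBLAS, GGML_BLAS) are assumptions
# based on ggml's CMake options.
if [ "$(uname -s)" = "Darwin" ]; then
  CMAKE_ARGS="-DGGML_METAL=on"                              # macOS -> Metal
elif command -v nvidia-smi >/dev/null 2>&1; then
  CMAKE_ARGS="-DGGML_CUDA=on"                               # NVIDIA -> CUDA
elif command -v rocminfo >/dev/null 2>&1; then
  CMAKE_ARGS="-DGGML_HIPBLAS=on"                            # AMD -> ROCm
else
  CMAKE_ARGS="-DGGML_BLAS=on -DGGML_BLAS_VENDOR=OpenBLAS"   # fallback -> OpenBLAS
fi
echo "CMAKE_ARGS=$CMAKE_ARGS"
# The install step would then be (note --no-cache-dir, per point 2 above):
#   CMAKE_ARGS="$CMAKE_ARGS" pip install --no-cache-dir llama-cpp-python
```

Every line of this script is boilerplate that prebuilt wheels plus declarative index selection would eliminate.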

Current state of published wheel indexes:

| Index | cp313 | cp314 | Free-threaded |
| --- | --- | --- | --- |
| CPU (`/whl/cpu/`) | partial (x86_64 only) | ✗ | ✗ |
| Metal (`/whl/metal/`) | ✗ | ✗ | ✗ |
| CUDA (`/whl/cu1xx/`) | ✗ | ✗ | ✗ |

Proposed changes

1. Expand prebuilt wheel matrix (highest impact, smallest change)

Update CIBW_BUILD in Metal/CUDA workflows and add free-threaded support. This is the single highest-impact change — it eliminates source builds for most users.

build-wheels-metal.yaml:

Upgrade cibuildwheel from v2.22.0 to v3.x (3.0 added cp314/cp314t support). In cibuildwheel 3.0, cp314t is built by default (free-threading is no longer experimental in 3.14), and cp313t requires CIBW_ENABLE: cpython-freethreading.

```diff
-        uses: pypa/cibuildwheel@v2.22.0
+        uses: pypa/cibuildwheel@v3.0.0
         env:
-          CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-*"
+          CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-* cp313-* cp314-*"
+          CIBW_ENABLE: cpython-freethreading
```

build-and-release.yaml — same cibuildwheel upgrade, and update the build_wheels_arm64 job:

```diff
-          CIBW_BUILD: "cp38-* cp39-* cp310-* cp311-* cp312-*"
+          CIBW_BUILD: "cp38-* cp39-* cp310-* cp311-* cp312-* cp313-* cp314-*"
+          CIBW_ENABLE: cpython-freethreading
```

build-wheels-cuda.yaml — uses a different build system (python -m build --wheel with a PowerShell matrix). The pyver matrix would need "3.13", "3.14" added.
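A sketch of that matrix change, assuming the Python versions are declared as a plain list in the workflow YAML — the actual key names in build-wheels-cuda.yaml may differ:

```yaml
# build-wheels-cuda.yaml — hypothetical sketch; key names are assumptions,
# not copied from the real workflow
strategy:
  matrix:
    pyver: ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]
```

Free-threaded CUDA wheels would additionally need the PowerShell build step taught about the `t` ABI suffix, which is a larger change than the cibuildwheel-based workflows require.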

With prebuilt wheels, any downstream project can use uv's declarative index support:

```toml
# pyproject.toml — zero Makefile, zero CMAKE_ARGS
[project]
dependencies = ["llama-cpp-python~=0.3"]

[tool.uv.sources]
llama-cpp-python = [
  { index = "llama-metal", marker = "sys_platform == 'darwin'" },
  { index = "llama-cpu",   marker = "sys_platform == 'linux'" },
]

[[tool.uv.index]]
name = "llama-metal"
url = "https://abetlen.github.io/llama-cpp-python/whl/metal"
explicit = true

[[tool.uv.index]]
name = "llama-cpu"
url = "https://abetlen.github.io/llama-cpp-python/whl/cpu"
explicit = true
```

2. Document --config-settings as the source-build path

Since the build backend is scikit-build-core, cmake args can be passed via the standard PEP 517 config-settings interface:

```shell
pip install llama-cpp-python -C cmake.args="-DGGML_METAL=on"
# or with uv:
uv pip install llama-cpp-python -C cmake.args="-DGGML_METAL=on"
```

This is cleaner than the CMAKE_ARGS env var — it's the standard PEP 517 mechanism, more explicit, and discoverable. It's already supported via scikit-build-core but not documented in the README or install docs.
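This also composes with requirements files: pip accepts `--config-settings` as a per-requirement option, so the backend choice can live next to the pin instead of in a wrapper script. A sketch, assuming pip ≥ 23.1 (the per-requirement syntax below is my understanding of pip's requirements-file format, worth verifying against the pip docs):

```
# requirements.txt — backend choice declared next to the pin, no env var
llama-cpp-python~=0.3 --config-settings cmake.args="-DGGML_METAL=on"
```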

3. (Future) Adopt PEP 817 Wheel Variants

PEP 817 (draft, Dec 2025) introduces a standard mechanism for GPU/accelerator wheel variants. PyTorch 2.9 already ships experimental variant-enabled wheels. Once PEP 817 is accepted and tool support lands, llama-cpp-python could publish variant wheels that are auto-selected by the installer:

```shell
# Future: just works, installer picks Metal/CUDA/CPU automatically
pip install llama-cpp-python
```

This is mentioned for context only — the actionable items are (1) and (2) above.

Ecosystem context

  • Quansight offered funded engineering help for free-threaded support in Pre-built wheels for Python 3.14 and 3.14 free-threaded #2103 (via vLLM ecosystem work) — awaiting maintainer signal
  • ~470K monthly PyPI downloads (pypistats) — every project using this beyond toy scripts hits this install wall
  • How others solved it: PyTorch uses per-backend index URLs + PEP 817 variants; ONNX Runtime publishes separate PyPI packages per backend (onnxruntime-gpu, onnxruntime-silicon)

Related

Wheel matrix gaps (same root cause):

Wheel variants / long-term packaging:

Downstream impact of missing wheels:

Happy to submit a PR for (1) and (2).
