Optimum Grilly

HuggingFace Optimum backend for Grilly — Vulkan GPU inference on any GPU

Alpha software. APIs may change. We welcome early adopters and feedback.

optimum-grilly bridges HuggingFace Transformers to Grilly's Vulkan compute backend. Load any supported model with from_pretrained, run inference on AMD, NVIDIA, or Intel GPUs — no CUDA required.

Features

Any GPU: AMD, NVIDIA, Intel — anything with Vulkan drivers
HuggingFace compatible: Same from_pretrained / generate API you already know
Zero PyTorch runtime: Export once, run forever without PyTorch installed
Automatic CPU fallback: Works without a GPU (slower, but functional)
Supported architectures: LLaMA, Mistral, BERT, GPT-2 (T5 planned)

Installation

# Core package (CPU fallback only)
pip install optimum-grilly

# With Vulkan GPU acceleration
pip install optimum-grilly[gpu]

# With export support (requires PyTorch)
pip install optimum-grilly[export]

# Everything
pip install optimum-grilly[all]

Requirements

Python >= 3.10
grilly >= 0.4.5 (for GPU acceleration)
Vulkan drivers installed on your system
For export: PyTorch >= 2.0

Quick Start

1. Export a HuggingFace model

Convert a HuggingFace model to .grilly format (safetensors + config):

from optimum.grilly import export_to_grilly

# Export a causal LM
export_to_grilly(
    "meta-llama/Llama-3.2-1B",
    output_dir="./llama-1b-grilly",
)

# Export a BERT model for feature extraction
export_to_grilly(
    "bert-base-uncased",
    output_dir="./bert-grilly",
    task="feature-extraction",
)

Or from the command line:

optimum-grilly-export --model meta-llama/Llama-3.2-1B --output ./llama-1b-grilly
optimum-grilly-export --model bert-base-uncased --output ./bert-grilly --task feature-extraction

2. Run inference

from optimum.grilly import GrillyModelForCausalLM
from transformers import AutoTokenizer

# Load model and tokenizer
model = GrillyModelForCausalLM.from_pretrained("./llama-1b-grilly")
tokenizer = AutoTokenizer.from_pretrained("./llama-1b-grilly")

# Generate text
input_ids = tokenizer("The meaning of life is", return_tensors="np")["input_ids"]
output_ids = model.generate(input_ids, max_new_tokens=50, temperature=0.8, top_k=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

3. Feature extraction (embeddings)

from optimum.grilly import GrillyModelForFeatureExtraction
from optimum.grilly.pipelines import grilly_feature_extraction_pipeline
from transformers import AutoTokenizer

model = GrillyModelForFeatureExtraction.from_pretrained("./bert-grilly")
tokenizer = AutoTokenizer.from_pretrained("./bert-grilly")

# Get sentence embeddings
embedding = grilly_feature_extraction_pipeline(
    model, tokenizer, "Hello world", pooling="mean"
)
print(embedding.shape)  # (1, 768)

API Reference

Configuration

from optimum.grilly import GrillyConfig

# From a HuggingFace config dict
config = GrillyConfig.from_hf_config(hf_config_dict)

# Save / load
config.save("./model-dir")
config = GrillyConfig.load("./model-dir")

# Inspect
print(config)  # GrillyConfig(model_type='llama', hidden_size=4096, ...)
print(config.get_layer_map())  # Layer descriptors for weight loading

Models

Class	Description
`GrillyModel`	Base class — embed + transformer blocks + final norm
`GrillyModelForCausalLM`	+ LM head + `generate()` for text generation
`GrillyModelForFeatureExtraction`	Returns `last_hidden_state` for embeddings
`GrillyModelForSequenceClassification`	+ classifier head for classification tasks

All models support:

from_pretrained(path) — Load from a .grilly directory
save_pretrained(path) — Save config + weights
forward(input_ids, attention_mask=None) — Run inference

Export

from optimum.grilly import export_to_grilly

export_to_grilly(
    model_name_or_path="meta-llama/Llama-3.2-1B",
    output_dir="./output",
    task="causal-lm",         # "causal-lm", "feature-extraction",
                               # "sequence-classification", "auto"
    dtype="float32",
    include_tokenizer=True,
)

Pipelines

from optimum.grilly.pipelines import (
    grilly_text_generation_pipeline,
    grilly_feature_extraction_pipeline,
)

# Text generation
text = grilly_text_generation_pipeline(model, tokenizer, "Once upon a time")

# Feature extraction with pooling
embedding = grilly_feature_extraction_pipeline(
    model, tokenizer, "Hello", pooling="mean"  # "mean", "cls", "last"
)

Architecture

optimum-grilly
├── optimum/grilly/
│   ├── __init__.py          # Lazy imports
│   ├── configuration.py     # GrillyConfig (HF config mapping)
│   ├── modeling.py           # GrillyModel + task subclasses
│   ├── export.py             # HF PyTorch → .grilly converter
│   ├── pipelines.py          # Pipeline helpers
│   ├── utils.py              # safetensors I/O
│   └── version.py
├── tests/
│   ├── test_configuration.py
│   ├── test_modeling.py
│   ├── test_export.py
│   ├── test_pipelines.py
│   └── test_utils.py
└── pyproject.toml

How it works

Export (export.py): Downloads a HuggingFace PyTorch model, extracts all named_parameters() and named_buffers() as float32 numpy arrays, saves them as safetensors alongside a grilly_config.json that maps the HF architecture to grilly ops.
Load (modeling.py): Reads the safetensors weights and config, builds a graph of _TransformerBlock objects that hold numpy weight arrays. Each block dispatches linear/norm/attention/FFN operations to grilly_core (the C++ Vulkan extension) with automatic CPU numpy fallbacks.
Inference: All computation happens in float32. The Vulkan backend handles GPU upload/download transparently. When grilly_core is not available, all ops fall back to numpy — slower but correct.

Supported architectures

Architecture	Status	Notes
LLaMA / LLaMA 2 / LLaMA 3	Supported	Pre-norm, SwiGLU, RoPE, GQA
Mistral	Supported	Same as LLaMA (sliding window not yet implemented)
BERT	Supported	Post-norm, standard FFN
GPT-2	Supported	Pre-norm, fused QKV, Conv1D weight handling
T5	Planned	Encoder-decoder not yet implemented

Environment Variables

Variable	Description
`VK_GPU_INDEX`	Select GPU by index (default: 0)
`GRILLY_DEBUG`	Set to `1` for debug logging
`ALLOW_CPU_VULKAN`	Set to `1` to allow llvmpipe CPU fallback

Known Limitations

No KV-cache: generate() recomputes the full forward pass per token (O(n²)). KV-cache support is planned.
Float32 only: No fp16/bf16/int8 quantization yet.
No beam search: Only greedy and top-k sampling.
No streaming: generate() returns the full sequence.
T5 not supported: Encoder-decoder architectures are not yet implemented.

Development

git clone https://github.com/grillcheese-ai/optimum-grilly.git
cd optimum-grilly
pip install -e ".[dev]"
pytest tests/ -v

License

Apache 2.0 — see LICENSE for details.

Links

Grilly — The GPU framework
HuggingFace Optimum — HF's optimization toolkit
GrillCheese AI — Research lab

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.github/workflows		.github/workflows
optimum/grilly		optimum/grilly
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Optimum Grilly

Features

Installation

Requirements

Quick Start

1. Export a HuggingFace model

2. Run inference

3. Feature extraction (embeddings)

API Reference

Configuration

Models

Export

Pipelines

Architecture

How it works

Supported architectures

Environment Variables

Known Limitations

Development

License

Links

About

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Optimum Grilly

Features

Installation

Requirements

Quick Start

1. Export a HuggingFace model

2. Run inference

3. Feature extraction (embeddings)

API Reference

Configuration

Models

Export

Pipelines

Architecture

How it works

Supported architectures

Environment Variables

Known Limitations

Development

License

Links

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages