Deep learning, well done.
GPU-accelerated neural network framework using Vulkan compute shaders. PyTorch-like API that runs on any GPU -- AMD, NVIDIA, Intel -- no CUDA dependency. 231 GLSL compute shaders compiled to SPIR-V, dispatched through a native C++ layer with automatic CPU fallback.
The current `main` branch is substantially rewritten since the last tagged release (v0.6.1). The changes touch the C++ Vulkan dispatch layer, the VMA buffer allocation strategy, the autograd graph, the `nn` module forward signatures, the `torch_api` facade, and several shaders. Existing user code that depends on v0.6.1 semantics may need updates.

Highlights of what changed:

- 40x faster `nn.Linear` on AMD/Windows via a new staging-buffer pattern (DEVICE_LOCAL VRAM compute + WC stage-in + HOST_CACHED readback). The same pattern was applied across `linear`/`linear_backward`/`layernorm`/`embedding`/activations/optimizer/loss/dropout.
- Cooperative-matrix GEMM (`gemm-coopmat-shared.glsl`) -- fp16 GEMM via `VK_KHR_cooperative_matrix`. Hits hardware tensor cores on RDNA3 / NVIDIA RTX, runs through the driver emulation path on RDNA2. A new `LinearParams.elemSize` field plus generic `py::array` bindings let `nn.Linear` accept fp32 or fp16 input transparently.
- Causal Linear-RNN prefix scan -- new `prefix-scan-causal`/`prefix-scan-causal-backward` shaders, a C++ dispatcher, and a Python autograd wrapper. `grilly.nn.prefix_scan.CausalSequenceMixer` is a drop-in subgroup-parallel replacement for sequence pooling that strictly respects autoregressive causality.
- Real autograd through `nn.Linear`, `nn.LayerNorm`, `nn.Embedding` -- their `forward` methods now wrap numpy outputs in `Variable` with a `GradFn` that calls the existing `backward()`, so `loss.backward()` actually updates layer parameters. Before this fix, optimizer steps silently no-op'd on every Module subclass that used the standard `self.weight = nn.Parameter(...)` idiom.
- `Module.__setattr__` auto-registration -- `Parameter` and child `Module` attribute assignments now populate `_parameters`/`_modules` automatically. Standard PyTorch idiom; was previously silently broken (every subclass returned 0 parameters).
- `.grl` checkpoint roundtrip -- `torch.save({'model': sd, 'step': N}, path); ck = torch.load(path)` now returns exactly what was saved (matches PyTorch semantics). Previously the loader force-wrapped content under a fixed `'model'` key.
- VMA fix -- `BufferPool::allocateBuffer` now uses `requiredFlags = DEVICE_LOCAL_BIT` instead of `preferredFlags`, so the allocator actually selects VRAM instead of silently falling back to slow host memory.
- `Variable.__array__` -- numpy interop. `np.matmul(tensor, w)`, `np.dot(tensor, w)`, and `np.asarray(tensor)` now work transparently.
- PEP 660 editable install fix -- `import grilly_core` now works under modern editable installs (the path hook didn't add the package dir to `sys.path`, so the Vulkan probe silently reported `VULKAN_AVAILABLE = False` even on machines with working Vulkan).

See the "What's new since 0.6.1" section below for the full list. The tag will land as v1.0.0-rc.1 once the remaining causal-RNN training validation lands.
- Any GPU: Vulkan runs on AMD, NVIDIA, Intel, and Apple (via MoltenVK). No CUDA lock-in.
- PyTorch-like API: `nn.Module`, `F.relu`, `AdamW` -- familiar patterns, new backend.
- Always works: Pure-Python numpy fallback if no GPU is available. Same code, same results.
- Research-ready: Spiking neural networks, Vector Symbolic Architectures, Mixture of Experts, cognitive controllers, temporal reasoning -- all GPU-accelerated.
- Lightweight: Core dependency is numpy only. Optional extras for torch, HuggingFace, ONNX.
pip install grilly

Works immediately with numpy. No GPU, no Vulkan SDK, no C++ compiler needed.
# Full build (~30 min — includes validation layers, all SDK tools)
curl -sSL https://raw.githubusercontent.com/Grillcheese-AI/grilly/main/scripts/install.sh | bash
# Fast build (~5 min — shaderc + loader only, recommended for Colab/CI)
curl -sSL https://raw.githubusercontent.com/Grillcheese-AI/grilly/main/scripts/install.sh | bash -s -- --fast

On Colab:
# Recommended: Colab mode (Vulkan 1.3 + fast build + NVIDIA ICD, ~5 min)
!wget -qO- https://raw.githubusercontent.com/Grillcheese-AI/grilly/main/scripts/install.sh | bash -s -- --colab

This installs system deps, downloads and builds Vulkan SDK 1.4, compiles the grilly C++ extension, and installs the Python package. The --fast flag builds only the components grilly needs (shaderc, loader, headers) and skips validation layers.
# 1. System dependencies (Ubuntu/Debian)
sudo apt-get install -y cmake g++ ninja-build pkg-config \
libxcb-dri3-0 libxcb-present0 libpciaccess0 libpng-dev \
libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev libwayland-dev \
libxrandr-dev libxcb-randr0-dev libx11-xcb-dev wayland-protocols
# 2. Vulkan SDK (download from https://vulkan.lunarg.com/sdk/home)
wget https://sdk.lunarg.com/sdk/download/1.4.341.1/linux/vulkansdk-linux-x86_64-1.4.341.1.tar.xz
tar xf vulkansdk-linux-x86_64-1.4.341.1.tar.xz
cd 1.4.341.1 && ./vulkansdk all -j $(nproc)
export VULKAN_SDK=$(pwd)/x86_64
export PATH=$VULKAN_SDK/bin:$PATH
export LD_LIBRARY_PATH=$VULKAN_SDK/lib:$LD_LIBRARY_PATH
# 3. Build grilly
git clone --recurse-submodules https://github.com/grillcheese-ai/grilly.git
cd grilly
pip install -e ".[dev]"
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc)
# 4. Install the compiled extension
cp build/grilly_core.*.so $(python -c "import grilly; print(grilly.__path__[0])")/

On Windows:

# 1. Install Vulkan SDK from https://vulkan.lunarg.com/sdk/home (Windows installer)
# 2. Install Visual Studio 2022 with C++ workload
git clone --recurse-submodules https://github.com/grillcheese-ai/grilly.git
cd grilly
pip install -e ".[dev]"
cmake -B build -DPYBIND11_FINDPYTHON=ON
cmake --build build --config Release
cp build\Release\grilly_core.*.pyd .

Pre-built binary (Windows x64, Python 3.12): Download grilly_core.cp312-win_amd64.pyd from the latest release and copy it into your grilly install directory.

On macOS:
# 1. Install Vulkan SDK from https://vulkan.lunarg.com/sdk/home#mac
brew install cmake ninja
# 2. Follow the Linux build steps above (uses MoltenVK)

Verify the install:

import grilly
print(f"grilly {grilly.__version__}")
# Check GPU backend
try:
from grilly.backend import _bridge
print(f"Vulkan: {'enabled' if _bridge.is_available() else 'not available'}")
except ImportError:
    print("Vulkan: not installed (numpy fallback active)")

| | Minimum | Recommended |
|---|---|---|
| Python | 3.12+ | 3.12 |
| GPU VRAM | 8 GB | 12 GB+ |
| System RAM | 32 GB | 64 GB |
| Vulkan | 1.1+ | 1.4 (latest SDK) |
Supported GPUs: AMD (RX 5000+), NVIDIA (GTX 1060+), Intel (Arc A-series), Apple (M1+ via MoltenVK).
import numpy as np
from grilly import nn
from grilly.optim import AdamW
# Build a model
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 10),
)
# Train
optimizer = AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x = np.random.randn(32, 784).astype(np.float32)
targets = np.random.randint(0, 10, (32,))
logits = model(x)
loss = loss_fn(logits, targets)
grad = loss_fn.backward(np.ones_like(loss), logits, targets)
model.zero_grad()
model.backward(grad)
optimizer.step()

Autograd with `Variable`:

from grilly.nn import Variable, tensor
x = Variable(tensor([1.0, 2.0, 3.0]), requires_grad=True)
y = (x * x).sum()
y.backward()
print(x.grad)  # [2.0, 4.0, 6.0]

Stateless functional ops:

import grilly.functional as F
out = F.linear(x, weight, bias)
out = F.relu(out)
out = F.softmax(out, dim=-1)
attn = F.flash_attention2(q, k, v)

See notebooks/01_getting_started.ipynb for a complete walkthrough.
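For reference, `F.softmax(x, dim=-1)` computes the usual numerically stable softmax. A numpy sketch of that math (illustrative, not grilly's shader code):

```python
import numpy as np

def softmax_ref(x, axis=-1):
    """Numerically stable softmax: subtract the row max before exponentiating."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

out = softmax_ref(np.array([[1.0, 2.0, 3.0]], dtype=np.float32))
print(out.sum())  # each row sums to 1
```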
| Category | Modules |
|---|---|
| Linear | Linear, Embedding, CapsuleEmbedding, Dropout |
| Convolution | Conv1d, Conv2d |
| Recurrent | LSTM, LSTMCell, GRU, GRUCell |
| Normalization | LayerNorm, RMSNorm, BatchNorm1d/2d |
| Activations | ReLU, GELU, SiLU, SwiGLU, GCU, RoSwish |
| Attention | FlashAttention2/3, HYLAAttention, MultiheadAttention, RoPE |
| LoRA | LoRALinear, LoRAAttention, LoRAModel |
| Pooling | MaxPool2d, AvgPool2d, AdaptiveMaxPool2d/AvgPool2d |
| Loss | MSELoss, CrossEntropyLoss, BCELoss |
| Containers | Sequential, Residual |
| Multimodal | PerceiverIO, ImageBindFusion, FlamingoFusion, VisionLanguageModel |
| Memory | MemoryRead, MemoryWrite, MemoryContextAggregate |
| Routing | DomainRouter, DomainPredictor, ExpertCombiner |
Full SNN framework with GPU-accelerated spike dynamics:
- Neurons: `IFNode`, `LIFNode`, `ParametricLIFNode`
- Surrogate gradients: `ATan`, `Sigmoid`, `FastSigmoid`
- Synapses: `STPSynapse`, `DualTimescaleSynapse`, `SynapseFilter`
- Temporal containers: `SeqToANNContainer`, `MultiStepContainer`
- Spiking attention: `SpikingSelfAttention`, `QKAttention`, `TemporalWiseAttention`
- ANN-to-SNN conversion: `Converter`, `VoltageScaler`
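As a point of reference, a minimal numpy sketch of the leaky integrate-and-fire update that `LIFNode`-style neurons perform (constants and naming here are illustrative; grilly runs this on the GPU with surrogate gradients for the spike threshold):

```python
import numpy as np

def lif_step(v, x, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """One LIF update: leak toward reset, integrate input, spike, and reset.

    v: membrane potential, x: input current (same shape).
    """
    v = v + (x - (v - v_reset)) / tau           # leaky integration
    spike = (v >= v_threshold).astype(v.dtype)  # hard threshold
    v = np.where(spike > 0, v_reset, v)         # reset where spiked
    return spike, v

v = np.zeros(3, dtype=np.float32)
spikes = []
for _ in range(4):
    s, v = lif_step(v, np.full(3, 1.4, dtype=np.float32))
    spikes.append(s)
print([int(s[0]) for s in spikes])  # [0, 1, 0, 1]
```

With a constant drive of 1.4 the potential crosses threshold every second step, so the neuron emits a regular spike train.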
| Optimizer | Description |
|---|---|
| `Adam` | Classic Adam |
| `AdamW` | Adam with decoupled weight decay |
| `SGD` | Stochastic gradient descent |
| `NLMS` | Normalized Least Mean Squares |
| `NaturalGradient` | Fisher-preconditioned updates |
| `HypergradientAdamW` | OSGM-style auto learning rate |
| `AutoHypergradientAdamW` | Fully automatic hypergradient |
| `AffectAdam` | Emotion-weighted updates |
Schedulers: StepLR, CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR.
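For reference, AdamW's update rule with decoupled weight decay can be written as a single numpy step (a sketch of the math, not grilly's GPU implementation; hyperparameter names follow the usual convention):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update: Adam moment estimates plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * g            # first moment (EMA of gradients)
    v = b2 * v + (1 - b2) * g * g        # second moment (EMA of squared grads)
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    # weight decay is applied to p directly, NOT folded into the gradient
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

p = np.ones(2, dtype=np.float32)
g = np.array([0.1, -0.1], dtype=np.float32)
m = np.zeros_like(p)
v = np.zeros_like(p)
p, m, v = adamw_step(p, g, m, v, t=1)
```

The decoupling (`wd * p` outside the adaptive term) is what distinguishes `AdamW` from `Adam` with L2 regularization.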
| Module | Description |
|---|---|
| `experimental.vsa` | Vector Symbolic Architectures (binary, holographic, block-codes, resonator networks) |
| `experimental.moe` | Mixture of Experts (relational encoder, resonator routing) |
| `experimental.temporal` | Temporal reasoning (causal chains, counterfactuals, world models) |
| `experimental.cognitive` | Cognitive controller (working memory, simulation, understand-think-speak) |
| `experimental.language` | Language processing (encoding, generation, parsing) |
Python API C++ Bridge GPU Shaders
───────────── ────────── ───────────
nn.Module layers pybind11 bindings 231 SPIR-V kernels
F.* stateless ops → dual-validity tensors → AMD / NVIDIA / Intel
optim.* optimizers zero CPU↔GPU ping-pong No CUDA dependency
autograd engine buffer pool management Vulkan 1.1+ compute
Every operation has automatic fallback:
- grilly C++ / Vulkan -- native compute shaders (fastest)
- PyTorch CUDA -- if torch is available (fast)
- NumPy CPU -- always available (correct)
Same API, same results, different speed. Your code never changes.
| Category | Count | Examples |
|---|---|---|
| Linear algebra | 20+ | GEMM, FFT, SVD, matmul |
| Attention | 15+ | flash attention, multi-head, spiking |
| Convolution | 10+ | conv2d forward/backward, im2col |
| Learning | 20+ | Adam, STDP, Hebbian, EWC, NLMS |
| VSA | 10+ | bind, bundle, similarity, resonator |
| SNN | 15+ | LIF/IF neuron, synapse, spike generation |
| Normalization | 10+ | layer norm, batch norm, RMS norm |
| Activation | 15+ | ReLU, GELU, SiLU, softmax (all with backward) |
| Memory/FAISS | 10+ | similarity search, place/time cells |
| Package | Description |
|---|---|
| optimum-grilly | HuggingFace Optimum backend -- from_pretrained on Vulkan |
| CubeMind | Neuro-vector-symbolic cognitive architecture powered by grilly |
| Notebook | Description |
|---|---|
| `notebooks/01_getting_started.ipynb` | Installation verification, first model, GPU check |
| `notebooks/02_training_loop.ipynb` | Full training loop: data loading, loss, optimization, checkpointing |
| `notebooks/03_spiking_neural_networks.ipynb` | SNN neurons, STDP learning, ANN-to-SNN conversion |
| `notebooks/04_vector_symbolic_architectures.ipynb` | VSA ops: bind, bundle, similarity, resonator networks |
| `notebooks/05_attention_and_transformers.ipynb` | Flash attention, RoPE, PerceiverIO, multi-head attention |
See also tutorials/ for standalone Python scripts covering every feature.
# All tests
uv run pytest tests/ -v
# CPU-only (no GPU required)
uv run pytest tests/ -m "not gpu" -v
# With coverage
uv run pytest tests/ --cov=. --cov-report=term
# Single module
uv run pytest tests/test_linear.py -v

| Variable | Description | Default |
|---|---|---|
| `VK_GPU_INDEX` | Select GPU by index | `0` |
| `GRILLY_DEBUG` | Enable debug logging (`1` = on) | off |
| `ALLOW_CPU_VULKAN` | Allow Mesa llvmpipe software Vulkan | off |
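For example, to force the second GPU and enable debug logging for a single run (the `python3 -c` snippet is a placeholder; any grilly entry point works there):

```shell
VK_GPU_INDEX=1 GRILLY_DEBUG=1 python3 -c "import os; print(os.environ['GRILLY_DEBUG'])"
```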
A long debug session against a real workload (vsa_lm_v3c_grilly —
language modeling with multiplication-free FFN + causal Linear-RNN
mixer) surfaced and fixed a stack of bugs and perf cliffs that the
0.6.1 test suite never tripped. Each fix is small in isolation; the
pile is large enough to warrant a major version bump.
- `BufferPool::allocateBuffer` VMA fix. Changed `preferredFlags` to `requiredFlags = DEVICE_LOCAL_BIT`. The old code silently fell back to slow host-visible BAR memory on AMD/Windows when the allocator's auto-select picked the wrong heap; the fix forces `memoryType[2]` (DEVICE_LOCAL + HOST_VISIBLE + HOST_COHERENT) under Resizable BAR, or fails loudly when ReBAR is unavailable. (`cpp/src/buffer_pool.cpp`)
- 3-way bucket pool routing. `acquire`/`acquireDeviceLocal`/`acquireReadback` now have separate per-size pools; `release` routes by the buffer's `deviceLocal`/`readback` flag. Prevents a DEVICE_LOCAL buffer from being picked up by a host-visible `acquire` and crashing on `mappedPtr = null`.
- Staging pattern across all hot ops ("Thread A"). Each op acquires DEVICE_LOCAL VRAM compute buffers, WC sequential-write stage-in buffers, and HOST_CACHED random-read stage-out buffers, then batches a single command buffer: `copyBuffer × N → barrier → dispatch → barrier → copyBuffer × M → submit/wait`. Applied to:
  - `cpp/src/ops/linear.cpp` -- `linear`, `linearBackward`, `dropout`
  - `cpp/src/ops/activations.cpp` -- `activationForward`/`activationBackward` helpers (covers ReLU/GELU/SiLU/Tanh)
  - `cpp/src/ops/layernorm.cpp` -- `layernorm`, `layernormBackward`
  - `cpp/src/ops/embedding.cpp` -- `embeddingLookup`
  - `cpp/src/ops/optimizer.cpp` -- `adamUpdate`, `adamwUpdate`
  - `cpp/src/ops/loss.cpp` -- `crossEntropyLoss`, `crossEntropyBackward`
- Measured impact: forward `nn.Linear` on a 4096×384×1152 GEMM went from 763 ms to 19 ms on an AMD RX 6750 XT (~40x). The download phase alone collapsed from 749 ms to 2.7 ms once the output stage moved to HOST_CACHED memory (cached random reads instead of uncached WC reads).
- `transferComputeBarrier()` added to `CommandBatch` -- a bidirectional TRANSFER ↔ COMPUTE memory + execution barrier needed by the staging pattern (the existing `barrier()` is COMPUTE→COMPUTE only, kept unchanged for `linearBackward`'s 3-pass intra-shader barriers).
- `shaders/gemm-coopmat-shared.glsl` -- fp16 tiled GEMM via `VK_KHR_cooperative_matrix` with shared-memory staging. Subgroup scope, 16×64 (M×N) tile per workgroup, 256 threads (4 × Wave64 subgroups), fp32 accumulator. Dispatches to native WMMA on RDNA3 and NVIDIA RTX, falls through the driver emulation path on RDNA1/RDNA2.
- `shaders/gemm-bias-add.glsl` -- companion row-broadcast bias add (the coopmat store can't interleave bias inline).
- `LinearParams.elemSize` -- new field (4 = fp32, 2 = fp16). `linear()` selects `gemm-coopmat-shared` when `elemSize == 2`, cooperative matrix is supported, AND the shape is aligned (M % 16, K % 16, N % 64); otherwise it falls back to `fnn-linear.glsl`.
- Pybind: generic `py::array` -- `bindings_linear.cpp` now accepts fp32 or fp16 numpy input via `xBuf.itemsize`. Output is always fp32 (coopmat accumulator). Bias must be fp32 regardless of input dtype.
- `linearBackward` interface upgrade -- same `void*` + `elemSize` signature, so the fp16 path slots in cleanly when an fp16 backward shader lands. For now `elemSize != 4` raises with a clear message.
- `shaders/prefix-scan-causal.glsl` -- computes `h_t = a_t * h_{t-1} + x_t` in O(log S) parallel depth via `subgroupInclusiveAdd` on `log(a)` and the rescaled input (Blelloch's two-scan trick). Strictly causal; one workgroup per `(batch, hidden_dim)` pair.
- `shaders/prefix-scan-causal-backward.glsl` -- anti-causal scan for `grad_x` and `grad_a` via the identity `R[t] = total - F[t] + w[t]` (no `subgroupShuffle`, which is undefined on partial Wave64 subgroups). Hits fp32 epsilon vs the closed-form gradient (verified `max abs err ≈ 3.6e-6`).
- `grilly/cpp/src/ops/prefix_scan.cpp` -- C++ dispatcher with the same staging pattern as the rest of Thread A.
- `grilly/cpp/python/bindings_prefix_scan.cpp` -- pybind exposing `prefix_scan_causal` and `prefix_scan_causal_backward`.
- `grilly/nn/prefix_scan.py` -- Python autograd wrapper (`prefix_scan_causal()`) wired into grilly's `Variable`/`GradFn` system, plus a `CausalSequenceMixer` module that uses it as a drop-in causal sequence-pooling replacement.
- Constraint: `seq_len <= 32` (one thread per time step in a single subgroup). A hierarchical multi-subgroup version is on the roadmap.
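As a reference for what the forward shader computes, here is the recurrence in sequential numpy form (illustrative function name; the GPU version produces the same values in O(log S) parallel depth):

```python
import numpy as np

def prefix_scan_causal_ref(a, x):
    """Sequential reference for h_t = a_t * h_{t-1} + x_t, with h_{-1} = 0.

    a, x: 1-D arrays of length seq_len. Each output h[t] depends only on
    inputs at positions <= t, which is the causality the shader preserves.
    """
    h = np.zeros_like(x)
    prev = 0.0
    for t in range(len(x)):
        prev = a[t] * prev + x[t]
        h[t] = prev
    return h

a = np.array([0.5, 0.5, 0.5], dtype=np.float32)
x = np.array([1.0, 1.0, 1.0], dtype=np.float32)
print(prefix_scan_causal_ref(a, x))  # [1.0, 1.5, 1.75]
```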
- `Module.__setattr__` auto-registration. `self.weight = nn.Parameter(...)` and `self.lin = nn.Linear(...)` now populate `_parameters`/`_modules` automatically. Standard PyTorch idiom. Was previously silently broken -- every Module subclass returned 0 parameters from `parameters()`, and AdamW silently no-op'd.
- `nn.Linear.forward` autograd wiring. When the input is a `Variable`, the output is wrapped in a `Variable` with a `GradFn` whose backward closure calls the existing `Linear.backward()` (which already populates `weight.grad`/`bias.grad` via the GPU shader). The same template is applied to `nn.LayerNorm.forward` and `nn.Embedding.forward`.
- `Variable.__array__` -- numpy array protocol on `nn.autograd.Variable`. `np.matmul(tensor, w)`, `np.dot(tensor, w)`, and `np.asarray(tensor)` now operate on the backing ndarray transparently. Required to let grilly's existing numpy-native layer code keep working when called from torch_api Tensor inputs.
- `Module.__call__` Variable passthrough + output wrap. Inputs of type `Tensor`/`LongTensor`/`Variable` are passed through to `forward()` unchanged; raw ndarray outputs are re-wrapped in `Tensor` so chained calls preserve torch-style types all the way through user-defined Module subclasses.
- `Parameter` shape methods -- `unsqueeze`, `view`, `mean(dim=...)`, `detach` added to `nn.Parameter` so user `forward` code can do `self.weight.unsqueeze(0)`/`self.weight.view(...)`/`self.weight.mean(dim=-1)` without knowing that `Parameter` is an `np.ndarray` subclass.
- `nn.init.normal_`/`uniform_` -- added a `_writable_array(tensor)` helper that unwraps Tensor/Variable wrappers to their backing ndarray for in-place init. Previously raised `TypeError: 'Tensor' object does not support item assignment` for the standard `nn.init.normal_(self.weight, 0, 0.02)` idiom.
- `F.gelu` re-export in `grilly.nn.functional` (was importable via `grilly.nn.autograd.gelu` but missing from the public `functional` namespace).
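The auto-registration mechanism follows the standard PyTorch pattern. A minimal pure-Python sketch of the idea (illustration only, not grilly's actual implementation):

```python
import numpy as np

class Parameter(np.ndarray):
    """Minimal stand-in: an ndarray subclass that marks trainable state."""
    def __new__(cls, data):
        return np.asarray(data, dtype=np.float32).view(cls)

class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        # Route Parameter / child-Module assignments into the registries
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()

class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = Parameter(np.zeros((n_out, n_in)))  # auto-registered
        self.bias = Parameter(np.zeros(n_out))            # auto-registered

m = Module()
m.lin = Linear(4, 2)                 # child module auto-registered too
print(len(list(m.parameters())))     # 2
```

Without the `__setattr__` hook, `parameters()` would yield nothing and any optimizer constructed from it would silently do no work — exactly the pre-fix failure mode described above.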
- `.grl` save/load roundtrip fixed. `torch.save({'model': sd, 'step': N}, path)` followed by `ck = torch.load(path)` now returns exactly what was saved (matches `torch.save`/`torch.load` semantics). The previous `load_grl` force-wrapped content under a fixed `'model'` key, producing `ck['model']['model']['weight']` instead of `ck['model']['weight']` for any payload that already contained a `model` key.
- `grilly/__init__.py` `sys.path` fix. Added an `os.path.dirname` insert at the very top of the package init so `import grilly_core` works under PEP 660 editable installs. The path hook used by modern editable installs (`__editable__.grilly-X.Y.finder.__path_hook__`) doesn't add the package directory to `sys.path`, so the sibling `grilly_core.<plat>.pyd` was invisible to `import grilly_core`. The downstream effect: `backend/base.py:_probe_cpp_vulkan()` silently caught the `ModuleNotFoundError`, set `VULKAN_AVAILABLE = False`, and the entire `nn` stack thought it had no GPU (despite a perfectly working Vulkan device).
- `Module._get_backend()` graceful None. Catches the legacy `VulkanCompute` init exception and returns `None`, so layers that only used `_get_backend()` for one-time GPU Xavier init at construction time don't crash when the legacy Python `vulkan` ctypes package isn't installed (the new C++ `_bridge` path doesn't need it).
Three shaders had stale .spv files compiled against a more permissive glslang version. Recent glslang catches the errors:

- `fused-layernorm-linear.glsl` -- added the missing `#extension GL_EXT_shader_atomic_float : require` for the `atomicAdd(shared_sum, sg_sum)` accumulator.
- `lstm-cell-forward.glsl` -- renamed buffer field `input` → `input_data` (`input` is a reserved word in recent glslang). Also removed an incorrect `writeonly` qualifier on the gates buffer that the shader actually reads back.
- `vsa-explore.glsl` -- renamed buffer field `output` → `output_data`, plus the same `writeonly` mismatch fix.
- `rebuild.ps1` -- one-command Windows rebuild. Compiles all GLSL → SPIR-V (with `-S comp` to disambiguate the stage and `--target-env vulkan1.3` for the cooperative-matrix + subgroup extensions), runs `cmake --build build2 --config Release --target grilly_core`, and copies the freshly built `.pyd` to the package root. Skips up-to-date shaders by mtime comparison.
- `PipelineCache::getDevice()` accessor -- needed by `linear.cpp` to query `hasCooperativeMatrix()` before selecting the coopmat shader path.
- 75 ruff errors fixed across the codebase. A mix of unsorted imports (`I001`), unused imports (`F401`), missing f-string placeholders (`F541`), deprecated typing imports (`UP035`), non-PEP 585 annotations (`UP006`), and a `yield from` modernization in `nn.Module.named_buffers`.
- MindForge adapter hypernetwork integration (via CubeMind)
- Synaptic shaders: `synapsis-stdp-update.glsl`, `bridge-spike-to-continuous.glsl`
- JIT shader fusion with shaderc runtime compilation
- Perceiver IO with IndexCache K/V pre-projection
- MoQE Gumbel-Softmax router shader
- 215 compute shaders (up from 190)
- C++ Tensor with dual-validity tracking -- GPU-resident data, no CPU ping-pong
- Flash Attention 3 with subgroup acceleration
- HYLAAttention (softmax-free), FNetMixing, SympFormerBlock
- HDC packed ops -- 32x memory compression
- Sanger GHA for neurogenesis
- JIT compilation framework (`@grilly.jit`)
- Automatic Mixed Precision (`autocast` + `GradScaler`)
- Fork the repo and create a feature branch
- Add tests for new features
- Run `ruff check .` and `uv run pytest tests/ -v`
- Submit a pull request
MIT License -- see LICENSE for details.
