Deep learning, well done.
GPU-accelerated neural network framework using Vulkan compute shaders. PyTorch-like API that runs on any GPU -- AMD, NVIDIA, Intel -- no CUDA dependency. 231 GLSL compute shaders compiled to SPIR-V, dispatched through a native C++ layer with automatic CPU fallback.
The current `main` branch is substantially rewritten since the last tagged release (v0.6.1). The changes touch the C++ Vulkan dispatch layer, the VMA buffer allocation strategy, the autograd graph, the `nn` module forward signatures, the `torch_api` facade, and several shaders. Existing user code that depends on v0.6.1 semantics may need updates.

Highlights of what changed:

- 40x faster `nn.Linear` on AMD/Windows via a new staging-buffer pattern (DEVICE_LOCAL VRAM compute + WC stage-in + HOST_CACHED readback). The same pattern was applied across `linear`/`linear_backward`/`layernorm`/`embedding`/activations/optimizer/loss/dropout.
- Cooperative-matrix GEMM (`gemm-coopmat-shared.glsl`) -- fp16 GEMM via `VK_KHR_cooperative_matrix`. Hits hardware tensor cores on RDNA3 / NVIDIA RTX, runs through the driver emulation path on RDNA2. A new `LinearParams.elemSize` field plus generic `py::array` bindings let `nn.Linear` accept fp32 or fp16 input transparently.
- Causal Linear-RNN prefix scan -- new `prefix-scan-causal`/`prefix-scan-causal-backward` shaders, a C++ dispatcher, and a Python autograd wrapper. `grilly.nn.prefix_scan.CausalSequenceMixer` is a drop-in subgroup-parallel replacement for sequence pooling that strictly respects autoregressive causality.
- Real autograd through `nn.Linear`, `nn.LayerNorm`, `nn.Embedding` -- their `forward` methods now wrap numpy outputs in `Variable` with a `GradFn` that calls the existing `backward()`, so `loss.backward()` actually updates layer parameters. Before this fix, optimizer steps silently no-op'd on every Module subclass that used the standard `self.weight = nn.Parameter(...)` idiom.
- `Module.__setattr__` auto-registration -- `Parameter` and child `Module` attribute assignments now populate `_parameters`/`_modules` automatically. Standard PyTorch idiom; was previously silently broken (every subclass returned 0 parameters).
- `.grl` checkpoint roundtrip -- `torch.save({'model': sd, 'step': N}, path); ck = torch.load(path)` now returns exactly what was saved (matches PyTorch semantics). Previously the loader force-wrapped content under a fixed `'model'` key.
- VMA fix -- `BufferPool::allocateBuffer` now uses `requiredFlags = DEVICE_LOCAL_BIT` instead of `preferredFlags`, so the allocator actually selects VRAM instead of silently falling back to slow host memory.
- `Variable.__array__` -- numpy interop. `np.matmul(tensor, w)`, `np.dot(tensor, w)`, and `np.asarray(tensor)` now work transparently.
- PEP 660 editable install fix -- `import grilly_core` now works under modern editable installs (the path hook didn't add the package dir to `sys.path`, so the Vulkan probe silently reported `VULKAN_AVAILABLE = False` even on machines with working Vulkan).

See the "What's new since 0.6.1" section below for the full list. The tag will land as v1.0.0-rc.1 once the remaining causal-RNN training validation lands.
- Any GPU: Vulkan runs on AMD, NVIDIA, Intel, and Apple (via MoltenVK). No CUDA lock-in.
- PyTorch-like API: `nn.Module`, `F.relu`, `AdamW` -- familiar patterns, new backend.
- Always works: Pure-Python numpy fallback if no GPU is available. Same code, same results.
- Research-ready: Spiking neural networks, Vector Symbolic Architectures, Mixture of Experts, cognitive controllers, temporal reasoning -- all GPU-accelerated.
- Lightweight: Core dependency is numpy only. Optional extras for torch, HuggingFace, ONNX.
pip install grilly

Works immediately with numpy. No GPU, no Vulkan SDK, no C++ compiler needed.
# Full build (~30 min — includes validation layers, all SDK tools)
curl -sSL https://raw.githubusercontent.com/Grillcheese-AI/grilly/main/scripts/install.sh | bash
# Fast build (~5 min — shaderc + loader only, recommended for Colab/CI)
curl -sSL https://raw.githubusercontent.com/Grillcheese-AI/grilly/main/scripts/install.sh | bash -s -- --fast

On Colab:
# Recommended: Colab mode (Vulkan 1.3 + fast build + NVIDIA ICD, ~5 min)
!wget -qO- https://raw.githubusercontent.com/Grillcheese-AI/grilly/main/scripts/install.sh | bash -s -- --colab

This installs system deps, downloads and builds Vulkan SDK 1.4, compiles the grilly C++ extension, and installs the Python package. The --fast flag builds only the components grilly needs (shaderc, loader, headers) and skips validation layers.
# 1. System dependencies (Ubuntu/Debian)
sudo apt-get install -y cmake g++ ninja-build pkg-config \
libxcb-dri3-0 libxcb-present0 libpciaccess0 libpng-dev \
libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev libwayland-dev \
libxrandr-dev libxcb-randr0-dev libx11-xcb-dev wayland-protocols
# 2. Vulkan SDK (download from https://vulkan.lunarg.com/sdk/home)
wget https://sdk.lunarg.com/sdk/download/1.4.341.1/linux/vulkansdk-linux-x86_64-1.4.341.1.tar.xz
tar xf vulkansdk-linux-x86_64-1.4.341.1.tar.xz
cd 1.4.341.1 && ./vulkansdk all -j $(nproc)
export VULKAN_SDK=$(pwd)/x86_64
export PATH=$VULKAN_SDK/bin:$PATH
export LD_LIBRARY_PATH=$VULKAN_SDK/lib:$LD_LIBRARY_PATH
# 3. Build grilly
git clone --recurse-submodules https://github.com/grillcheese-ai/grilly.git
cd grilly
pip install -e ".[dev]"
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc)
# 4. Install the compiled extension
cp build/grilly_core.*.so $(python -c "import grilly; print(grilly.__path__[0])")/

On Windows:

# 1. Install Vulkan SDK from https://vulkan.lunarg.com/sdk/home (Windows installer)
# 2. Install Visual Studio 2022 with C++ workload
git clone --recurse-submodules https://github.com/grillcheese-ai/grilly.git
cd grilly
pip install -e ".[dev]"
cmake -B build -DPYBIND11_FINDPYTHON=ON
cmake --build build --config Release
cp build\Release\grilly_core.*.pyd .

Pre-built binary (Windows x64, Python 3.12): Download grilly_core.cp312-win_amd64.pyd from the latest release and copy it into your grilly install directory.

On macOS:
# 1. Install Vulkan SDK from https://vulkan.lunarg.com/sdk/home#mac
brew install cmake ninja
# 2. Follow the Linux build steps above (uses MoltenVK)

Verify the install:

import grilly
print(f"grilly {grilly.__version__}")
# Check GPU backend
try:
from grilly.backend import _bridge
print(f"Vulkan: {'enabled' if _bridge.is_available() else 'not available'}")
except ImportError:
    print("Vulkan: not installed (numpy fallback active)")

| | Minimum | Recommended |
|---|---|---|
| Python | 3.12+ | 3.12 |
| GPU VRAM | 8 GB | 12 GB+ |
| System RAM | 32 GB | 64 GB |
| Vulkan | 1.1+ | 1.4 (latest SDK) |
Supported GPUs: AMD (RX 5000+), NVIDIA (GTX 1060+), Intel (Arc A-series), Apple (M1+ via MoltenVK).
import numpy as np
from grilly import nn
from grilly.optim import AdamW
# Build a model
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 10),
)
# Train
optimizer = AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x = np.random.randn(32, 784).astype(np.float32)
targets = np.random.randint(0, 10, (32,))
logits = model(x)
loss = loss_fn(logits, targets)
grad = loss_fn.backward(np.ones_like(loss), logits, targets)
model.zero_grad()
model.backward(grad)
optimizer.step()

Autograd with `Variable`:

from grilly.nn import Variable, tensor
x = Variable(tensor([1.0, 2.0, 3.0]), requires_grad=True)
y = (x * x).sum()
y.backward()
print(x.grad)  # [2.0, 4.0, 6.0]

Stateless functional ops:

import grilly.functional as F
out = F.linear(x, weight, bias)
out = F.relu(out)
out = F.softmax(out, dim=-1)
attn = F.flash_attention2(q, k, v)

See notebooks/01_getting_started.ipynb for a complete walkthrough.
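For reference, `F.softmax(x, dim=-1)` computes the usual numerically stable softmax. A numpy sketch of that math (illustrative, not grilly's shader code):

```python
import numpy as np

def softmax_ref(x, axis=-1):
    """Numerically stable softmax: subtract the row max before exponentiating."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

out = softmax_ref(np.array([[1.0, 2.0, 3.0]], dtype=np.float32))
print(out.sum())  # each row sums to 1
```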
| Category | Modules |
|---|---|
| Linear | Linear, Embedding, CapsuleEmbedding, Dropout |
| Convolution | Conv1d, Conv2d |
| Recurrent | LSTM, LSTMCell, GRU, GRUCell |
| Normalization | LayerNorm, RMSNorm, BatchNorm1d/2d |
| Activations | ReLU, GELU, SiLU, SwiGLU, GCU, RoSwish |
| Attention | FlashAttention2/3, HYLAAttention, MultiheadAttention, RoPE |
| LoRA | LoRALinear, LoRAAttention, LoRAModel |
| Pooling | MaxPool2d, AvgPool2d, AdaptiveMaxPool2d/AvgPool2d |
| Loss | MSELoss, CrossEntropyLoss, BCELoss |
| Containers | Sequential, Residual |
| Multimodal | PerceiverIO, ImageBindFusion, FlamingoFusion, VisionLanguageModel |
| Memory | MemoryRead, MemoryWrite, MemoryContextAggregate |
| Routing | DomainRouter, DomainPredictor, ExpertCombiner |
Full SNN framework with GPU-accelerated spike dynamics:
- Neurons: `IFNode`, `LIFNode`, `ParametricLIFNode`
- Surrogate gradients: `ATan`, `Sigmoid`, `FastSigmoid`
- Synapses: `STPSynapse`, `DualTimescaleSynapse`, `SynapseFilter`
- Temporal containers: `SeqToANNContainer`, `MultiStepContainer`
- Spiking attention: `SpikingSelfAttention`, `QKAttention`, `TemporalWiseAttention`
- ANN-to-SNN conversion: `Converter`, `VoltageScaler`
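As a point of reference, a minimal numpy sketch of the leaky integrate-and-fire update that `LIFNode`-style neurons perform (constants and naming here are illustrative; grilly runs this on the GPU with surrogate gradients for the spike threshold):

```python
import numpy as np

def lif_step(v, x, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """One LIF update: leak toward reset, integrate input, spike, and reset.

    v: membrane potential, x: input current (same shape).
    """
    v = v + (x - (v - v_reset)) / tau           # leaky integration
    spike = (v >= v_threshold).astype(v.dtype)  # hard threshold
    v = np.where(spike > 0, v_reset, v)         # reset where spiked
    return spike, v

v = np.zeros(3, dtype=np.float32)
spikes = []
for _ in range(4):
    s, v = lif_step(v, np.full(3, 1.4, dtype=np.float32))
    spikes.append(s)
print([int(s[0]) for s in spikes])  # [0, 1, 0, 1]
```

With a constant drive of 1.4 the potential crosses threshold every second step, so the neuron emits a regular spike train.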
| Optimizer | Description |
|---|---|
| `Adam` | Classic Adam |
| `AdamW` | Adam with decoupled weight decay |
| `SGD` | Stochastic gradient descent |
| `NLMS` | Normalized Least Mean Squares |
| `NaturalGradient` | Fisher-preconditioned updates |
| `HypergradientAdamW` | OSGM-style auto learning rate |
| `AutoHypergradientAdamW` | Fully automatic hypergradient |
| `AffectAdam` | Emotion-weighted updates |
Schedulers: StepLR, CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR.
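For reference, AdamW's update rule with decoupled weight decay can be written as a single numpy step (a sketch of the math, not grilly's GPU implementation; hyperparameter names follow the usual convention):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update: Adam moment estimates plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * g            # first moment (EMA of gradients)
    v = b2 * v + (1 - b2) * g * g        # second moment (EMA of squared grads)
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    # weight decay is applied to p directly, NOT folded into the gradient
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

p = np.ones(2, dtype=np.float32)
g = np.array([0.1, -0.1], dtype=np.float32)
m = np.zeros_like(p)
v = np.zeros_like(p)
p, m, v = adamw_step(p, g, m, v, t=1)
```

The decoupling (`wd * p` outside the adaptive term) is what distinguishes `AdamW` from `Adam` with L2 regularization.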
| Module | Description |
|---|---|
| `experimental.vsa` | Vector Symbolic Architectures (binary, holographic, block-codes, resonator networks) |
| `experimental.moe` | Mixture of Experts (relational encoder, resonator routing) |
| `experimental.temporal` | Temporal reasoning (causal chains, counterfactuals, world models) |
| `experimental.cognitive` | Cognitive controller (working memory, simulation, understand-think-speak) |
| `experimental.language` | Language processing (encoding, generation, parsing) |
Python API C++ Bridge GPU Shaders
───────────── ────────── ───────────
nn.Module layers pybind11 bindings 231 SPIR-V kernels
F.* stateless ops → dual-validity tensors → AMD / NVIDIA / Intel
optim.* optimizers zero CPU↔GPU ping-pong No CUDA dependency
autograd engine buffer pool management Vulkan 1.1+ compute
Every operation has automatic fallback:
- grilly C++ / Vulkan -- native compute shaders (fastest)
- PyTorch CUDA -- if torch is available (fast)
- NumPy CPU -- always available (correct)
Same API, same results, different speed. Your code never changes.
| Category | Count | Examples |
|---|---|---|
| Linear algebra | 20+ | GEMM, FFT, SVD, matmul |
| Attention | 15+ | flash attention, multi-head, spiking |
| Convolution | 10+ | conv2d forward/backward, im2col |
| Learning | 20+ | Adam, STDP, Hebbian, EWC, NLMS |
| VSA | 10+ | bind, bundle, similarity, resonator |
| SNN | 15+ | LIF/IF neuron, synapse, spike generation |
| Normalization | 10+ | layer norm, batch norm, RMS norm |
| Activation | 15+ | ReLU, GELU, SiLU, softmax (all with backward) |
| Memory/FAISS | 10+ | similarity search, place/time cells |
| Package | Description |
|---|---|
| optimum-grilly | HuggingFace Optimum backend -- from_pretrained on Vulkan |
| CubeMind | Neuro-vector-symbolic cognitive architecture powered by grilly |
| Notebook | Description |
|---|---|
| `notebooks/01_getting_started.ipynb` | Installation verification, first model, GPU check |
| `notebooks/02_training_loop.ipynb` | Full training loop: data loading, loss, optimization, checkpointing |
| `notebooks/03_spiking_neural_networks.ipynb` | SNN neurons, STDP learning, ANN-to-SNN conversion |
| `notebooks/04_vector_symbolic_architectures.ipynb` | VSA ops: bind, bundle, similarity, resonator networks |
| `notebooks/05_attention_and_transformers.ipynb` | Flash attention, RoPE, PerceiverIO, multi-head attention |
See also tutorials/ for standalone Python scripts covering every feature.
# All tests
uv run pytest tests/ -v
# CPU-only (no GPU required)
uv run pytest tests/ -m "not gpu" -v
# With coverage
uv run pytest tests/ --cov=. --cov-report=term
# Single module
uv run pytest tests/test_linear.py -v

| Variable | Description | Default |
|---|---|---|
| `VK_GPU_INDEX` | Select GPU by index | `0` |
| `GRILLY_DEBUG` | Enable debug logging (`1` = on) | off |
| `ALLOW_CPU_VULKAN` | Allow Mesa llvmpipe software Vulkan | off |
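For example, to force the second GPU and enable debug logging for a single run (the `python3 -c` snippet is a placeholder; any grilly entry point works there):

```shell
VK_GPU_INDEX=1 GRILLY_DEBUG=1 python3 -c "import os; print(os.environ['GRILLY_DEBUG'])"
```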
A long debug session against a real workload (vsa_lm_v3c_grilly —
language modeling with multiplication-free FFN + causal Linear-RNN
mixer) surfaced and fixed a stack of bugs and perf cliffs that the
0.6.1 test suite never tripped. Each fix is small in isolation; the
pile is large enough to warrant a major version bump.
- `BufferPool::allocateBuffer` VMA fix. Changed `preferredFlags` to `requiredFlags = DEVICE_LOCAL_BIT`. The old code silently fell back to slow host-visible BAR memory on AMD/Windows when the allocator's auto-select picked the wrong heap; the fix forces `memoryType[2]` (DEVICE_LOCAL + HOST_VISIBLE + HOST_COHERENT) under Resizable BAR, or fails loudly when ReBAR is unavailable. (`cpp/src/buffer_pool.cpp`)
- 3-way bucket pool routing. `acquire`/`acquireDeviceLocal`/`acquireReadback` now have separate per-size pools; `release` routes by the buffer's `deviceLocal`/`readback` flag. Prevents a DEVICE_LOCAL buffer from being picked up by a host-visible `acquire` and crashing on `mappedPtr = null`.
- Staging pattern across all hot ops ("Thread A"). Each op acquires DEVICE_LOCAL VRAM compute buffers, WC sequential-write stage-in buffers, and HOST_CACHED random-read stage-out buffers, then batches a single command buffer: `copyBuffer × N → barrier → dispatch → barrier → copyBuffer × M → submit/wait`. Applied to:
  - `cpp/src/ops/linear.cpp` -- `linear`, `linearBackward`, `dropout`
  - `cpp/src/ops/activations.cpp` -- `activationForward`/`activationBackward` helpers (covers ReLU/GELU/SiLU/Tanh)
  - `cpp/src/ops/layernorm.cpp` -- `layernorm`, `layernormBackward`
  - `cpp/src/ops/embedding.cpp` -- `embeddingLookup`
  - `cpp/src/ops/optimizer.cpp` -- `adamUpdate`, `adamwUpdate`
  - `cpp/src/ops/loss.cpp` -- `crossEntropyLoss`, `crossEntropyBackward`
- Measured impact: forward `nn.Linear` on a 4096×384×1152 GEMM went from 763 ms to 19 ms on an AMD RX 6750 XT (~40x). The download phase alone collapsed from 749 ms to 2.7 ms once the output stage moved to HOST_CACHED memory (cached random reads instead of uncached WC reads).
- `transferComputeBarrier()` added to `CommandBatch` -- a bidirectional TRANSFER ↔ COMPUTE memory + execution barrier needed by the staging pattern (the existing `barrier()` is COMPUTE→COMPUTE only, kept unchanged for `linearBackward`'s 3-pass intra-shader barriers).
- `shaders/gemm-coopmat-shared.glsl` -- fp16 tiled GEMM via `VK_KHR_cooperative_matrix` with shared-memory staging. Subgroup scope, 16×64 (M×N) tile per workgroup, 256 threads (4 × Wave64 subgroups), fp32 accumulator. Dispatches to native WMMA on RDNA3 and NVIDIA RTX, falls through the driver emulation path on RDNA1/RDNA2.
- `shaders/gemm-bias-add.glsl` -- companion row-broadcast bias add (the coopmat store can't interleave bias inline).
- `LinearParams.elemSize` -- new field (4 = fp32, 2 = fp16). `linear()` selects `gemm-coopmat-shared` when `elemSize == 2`, cooperative matrix is supported, AND the shape is aligned (M % 16, K % 16, N % 64); otherwise it falls back to `fnn-linear.glsl`.
- Pybind: generic `py::array` -- `bindings_linear.cpp` now accepts fp32 or fp16 numpy input via `xBuf.itemsize`. Output is always fp32 (coopmat accumulator). Bias must be fp32 regardless of input dtype.
- `linearBackward` interface upgrade -- same `void*` + `elemSize` signature, so the fp16 path slots in cleanly when an fp16 backward shader lands. For now `elemSize != 4` raises with a clear message.
- `shaders/prefix-scan-causal.glsl` -- computes `h_t = a_t * h_{t-1} + x_t` in O(log S) parallel depth via `subgroupInclusiveAdd` on `log(a)` and the rescaled input (Blelloch's two-scan trick). Strictly causal; one workgroup per `(batch, hidden_dim)` pair.
- `shaders/prefix-scan-causal-backward.glsl` -- anti-causal scan for `grad_x` and `grad_a` via the identity `R[t] = total - F[t] + w[t]` (no `subgroupShuffle`, which is undefined on partial Wave64 subgroups). Hits fp32 epsilon vs the closed-form gradient (verified `max abs err ≈ 3.6e-6`).
- `grilly/cpp/src/ops/prefix_scan.cpp` -- C++ dispatcher with the same staging pattern as the rest of Thread A.
- `grilly/cpp/python/bindings_prefix_scan.cpp` -- pybind exposing `prefix_scan_causal` and `prefix_scan_causal_backward`.
- `grilly/nn/prefix_scan.py` -- Python autograd wrapper (`prefix_scan_causal()`) wired into grilly's `Variable`/`GradFn` system, plus a `CausalSequenceMixer` module that uses it as a drop-in causal sequence-pooling replacement.
- Constraint: `seq_len <= 32` (one thread per time step in a single subgroup). A hierarchical multi-subgroup version is on the roadmap.
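As a reference for what the forward shader computes, here is the recurrence in sequential numpy form (illustrative function name; the GPU version produces the same values in O(log S) parallel depth):

```python
import numpy as np

def prefix_scan_causal_ref(a, x):
    """Sequential reference for h_t = a_t * h_{t-1} + x_t, with h_{-1} = 0.

    a, x: 1-D arrays of length seq_len. Each output h[t] depends only on
    inputs at positions <= t, which is the causality the shader preserves.
    """
    h = np.zeros_like(x)
    prev = 0.0
    for t in range(len(x)):
        prev = a[t] * prev + x[t]
        h[t] = prev
    return h

a = np.array([0.5, 0.5, 0.5], dtype=np.float32)
x = np.array([1.0, 1.0, 1.0], dtype=np.float32)
print(prefix_scan_causal_ref(a, x))  # [1.0, 1.5, 1.75]
```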
- `Module.__setattr__` auto-registration. `self.weight = nn.Parameter(...)` and `self.lin = nn.Linear(...)` now populate `_parameters`/`_modules` automatically. Standard PyTorch idiom. Was previously silently broken -- every Module subclass returned 0 parameters from `parameters()`, and AdamW silently no-op'd.
- `nn.Linear.forward` autograd wiring. When the input is a `Variable`, the output is wrapped in a `Variable` with a `GradFn` whose backward closure calls the existing `Linear.backward()` (which already populates `weight.grad`/`bias.grad` via the GPU shader). The same template is applied to `nn.LayerNorm.forward` and `nn.Embedding.forward`.
- `Variable.__array__` -- numpy array protocol on `nn.autograd.Variable`. `np.matmul(tensor, w)`, `np.dot(tensor, w)`, and `np.asarray(tensor)` now operate on the backing ndarray transparently. Required to let grilly's existing numpy-native layer code keep working when called from torch_api Tensor inputs.
- `Module.__call__` Variable passthrough + output wrap. Inputs of type `Tensor`/`LongTensor`/`Variable` are passed through to `forward()` unchanged; raw ndarray outputs are re-wrapped in `Tensor` so chained calls preserve torch-style types all the way through user-defined Module subclasses.
- `Parameter` shape methods -- `unsqueeze`, `view`, `mean(dim=...)`, `detach` added to `nn.Parameter` so user `forward` code can do `self.weight.unsqueeze(0)`/`self.weight.view(...)`/`self.weight.mean(dim=-1)` without knowing that `Parameter` is an `np.ndarray` subclass.
- `nn.init.normal_`/`uniform_` -- added a `_writable_array(tensor)` helper that unwraps Tensor/Variable wrappers to their backing ndarray for in-place init. Previously raised `TypeError: 'Tensor' object does not support item assignment` for the standard `nn.init.normal_(self.weight, 0, 0.02)` idiom.
- `F.gelu` re-export in `grilly.nn.functional` (was importable via `grilly.nn.autograd.gelu` but missing from the public `functional` namespace).
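The auto-registration mechanism follows the standard PyTorch pattern. A minimal pure-Python sketch of the idea (illustration only, not grilly's actual implementation):

```python
import numpy as np

class Parameter(np.ndarray):
    """Minimal stand-in: an ndarray subclass that marks trainable state."""
    def __new__(cls, data):
        return np.asarray(data, dtype=np.float32).view(cls)

class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        # Route Parameter / child-Module assignments into the registries
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()

class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = Parameter(np.zeros((n_out, n_in)))  # auto-registered
        self.bias = Parameter(np.zeros(n_out))            # auto-registered

m = Module()
m.lin = Linear(4, 2)                 # child module auto-registered too
print(len(list(m.parameters())))     # 2
```

Without the `__setattr__` hook, `parameters()` would yield nothing and any optimizer constructed from it would silently do no work — exactly the pre-fix failure mode described above.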
- `.grl` save/load roundtrip fixed. `torch.save({'model': sd, 'step': N}, path)` followed by `ck = torch.load(path)` now returns exactly what was saved (matches `torch.save`/`torch.load` semantics). The previous `load_grl` force-wrapped content under a fixed `'model'` key, producing `ck['model']['model']['weight']` instead of `ck['model']['weight']` for any payload that already contained a `model` key.
- `grilly/__init__.py` `sys.path` fix. Added an `os.path.dirname` insert at the very top of the package init so `import grilly_core` works under PEP 660 editable installs. The path hook used by modern editable installs (`__editable__.grilly-X.Y.finder.__path_hook__`) doesn't add the package directory to `sys.path`, so the sibling `grilly_core.<plat>.pyd` was invisible to `import grilly_core`. The downstream effect: `backend/base.py:_probe_cpp_vulkan()` silently caught the `ModuleNotFoundError`, set `VULKAN_AVAILABLE = False`, and the entire `nn` stack thought it had no GPU (despite a perfectly working Vulkan device).
- `Module._get_backend()` graceful None. Catches the legacy `VulkanCompute` init exception and returns `None`, so layers that only used `_get_backend()` for one-time GPU Xavier init at construction time don't crash when the legacy Python `vulkan` ctypes package isn't installed (the new C++ `_bridge` path doesn't need it).
Three shaders had stale .spv files compiled against a more permissive glslang version. Recent glslang catches the errors:

- `fused-layernorm-linear.glsl` -- added the missing `#extension GL_EXT_shader_atomic_float : require` for the `atomicAdd(shared_sum, sg_sum)` accumulator.
- `lstm-cell-forward.glsl` -- renamed buffer field `input` → `input_data` (`input` is a reserved word in recent glslang). Also removed an incorrect `writeonly` qualifier on the gates buffer that the shader actually reads back.
- `vsa-explore.glsl` -- renamed buffer field `output` → `output_data`, plus the same `writeonly` mismatch fix.
- `rebuild.ps1` -- one-command Windows rebuild. Compiles all GLSL → SPIR-V (with `-S comp` to disambiguate the stage and `--target-env vulkan1.3` for the cooperative-matrix + subgroup extensions), runs `cmake --build build2 --config Release --target grilly_core`, and copies the freshly built `.pyd` to the package root. Skips up-to-date shaders by mtime comparison.
- `PipelineCache::getDevice()` accessor -- needed by `linear.cpp` to query `hasCooperativeMatrix()` before selecting the coopmat shader path.
- 75 ruff errors fixed across the codebase. A mix of unsorted imports (`I001`), unused imports (`F401`), missing f-string placeholders (`F541`), deprecated typing imports (`UP035`), non-PEP 585 annotations (`UP006`), and a `yield from` modernization in `nn.Module.named_buffers`.
- MindForge adapter hypernetwork integration (via CubeMind)
- Synaptic shaders: `synapsis-stdp-update.glsl`, `bridge-spike-to-continuous.glsl`
- JIT shader fusion with shaderc runtime compilation
- Perceiver IO with IndexCache K/V pre-projection
- MoQE Gumbel-Softmax router shader
- 215 compute shaders (up from 190)
- C++ Tensor with dual-validity tracking -- GPU-resident data, no CPU ping-pong
- Flash Attention 3 with subgroup acceleration
- HYLAAttention (softmax-free), FNetMixing, SympFormerBlock
- HDC packed ops -- 32x memory compression
- Sanger GHA for neurogenesis
- JIT compilation framework (`@grilly.jit`)
- Automatic Mixed Precision (`autocast` + `GradScaler`)
- Fork the repo and create a feature branch
- Add tests for new features
- Run `ruff check .` and `uv run pytest tests/ -v`
- Submit a pull request
MIT License -- see LICENSE for details.
