Grillcheese-AI/grilly

Grilly


Deep learning, well done.


GPU-accelerated neural network framework using Vulkan compute shaders. PyTorch-like API that runs on any GPU -- AMD, NVIDIA, Intel -- no CUDA dependency. 231 GLSL compute shaders compiled to SPIR-V, dispatched through a native C++ layer with automatic CPU fallback.

⚠️ Pre-v1.0 release — massive changes ahead

The current main branch is substantially rewritten since the last tagged release (v0.6.1). The changes touch the C++ Vulkan dispatch layer, the VMA buffer allocation strategy, the autograd graph, the nn module forward signatures, the torch_api facade, and several shaders. Existing user code that depends on v0.6.1 semantics may need updates.

Highlights of what changed:

  • 40x faster nn.Linear on AMD/Windows via a new staging-buffer pattern (DEVICE_LOCAL VRAM compute + WC stage-in + HOST_CACHED readback). The same pattern was applied across linear / linear_backward / layernorm / embedding / activations / optimizer / loss / dropout.
  • Cooperative-matrix GEMM (gemm-coopmat-shared.glsl) — fp16 GEMM via VK_KHR_cooperative_matrix. Hits hardware tensor cores on RDNA3 / NVIDIA RTX, runs through the driver emulation path on RDNA2. New LinearParams.elemSize field + generic py::array bindings let nn.Linear accept fp32 OR fp16 input transparently.
  • Causal Linear-RNN prefix scan — new prefix-scan-causal / prefix-scan-causal-backward shaders + C++ dispatcher + Python autograd wrapper. grilly.nn.prefix_scan.CausalSequenceMixer is a drop-in subgroup-parallel replacement for sequence pooling that strictly respects autoregressive causality.
  • Real autograd through nn.Linear, nn.LayerNorm, nn.Embedding — their forward methods now wrap numpy outputs in Variable with a GradFn that calls the existing backward() so loss.backward() actually updates layer parameters. Before this fix, optimizer steps silently no-op'd on every Module subclass that used the standard self.weight = nn.Parameter(...) idiom.
  • Module.__setattr__ auto-registration — Parameter and child Module attribute assignments now populate _parameters / _modules automatically. Standard PyTorch idiom; was previously silently broken (every subclass returned 0 parameters).
  • .grl checkpoint roundtrip — torch.save({'model': sd, 'step': N}, path); ck = torch.load(path) now returns exactly what was saved (matches PyTorch semantics). Previously the loader force-wrapped content under a fixed 'model' key.
  • VMA fix — BufferPool::allocateBuffer now uses requiredFlags = DEVICE_LOCAL_BIT instead of preferredFlags, so the allocator actually selects VRAM instead of silently falling back to slow host memory.
  • Variable.__array__ — numpy interop. np.matmul(tensor, w) / np.dot(tensor, w) / np.asarray(tensor) now work transparently.
  • PEP 660 editable install fix — import grilly_core now works under modern editable installs (the path hook didn't add the package dir to sys.path, so the Vulkan probe silently reported VULKAN_AVAILABLE = False even on machines with working Vulkan).

See the "What's New" section below for the full list. The tag will land as v1.0.0-rc.1 once the remaining causal-RNN training validation lands.


Why Grilly?

  • Any GPU: Vulkan runs on AMD, NVIDIA, Intel, and Apple (via MoltenVK). No CUDA lock-in.
  • PyTorch-like API: nn.Module, F.relu, AdamW -- familiar patterns, new backend.
  • Always works: Pure-Python numpy fallback if no GPU is available. Same code, same results.
  • Research-ready: Spiking neural networks, Vector Symbolic Architectures, Mixture of Experts, cognitive controllers, temporal reasoning -- all GPU-accelerated.
  • Lightweight: Core dependency is numpy only. Optional extras for torch, HuggingFace, ONNX.

Installation

Option 1: Python-only (no GPU acceleration)

pip install grilly

Works immediately with numpy. No GPU, no Vulkan SDK, no C++ compiler needed.

Option 2: With Vulkan GPU acceleration

Linux / Google Colab (one-liner)

# Full build (~30 min — includes validation layers, all SDK tools)
curl -sSL https://raw.githubusercontent.com/Grillcheese-AI/grilly/main/scripts/install.sh | bash

# Fast build (~5 min — shaderc + loader only, recommended for Colab/CI)
curl -sSL https://raw.githubusercontent.com/Grillcheese-AI/grilly/main/scripts/install.sh | bash -s -- --fast

On Colab:

# Recommended: Colab mode (Vulkan 1.3 + fast build + NVIDIA ICD, ~5 min)
!wget -qO- https://raw.githubusercontent.com/Grillcheese-AI/grilly/main/scripts/install.sh | bash -s -- --colab

This installs system deps, downloads and builds Vulkan SDK 1.4, compiles the grilly C++ extension, and installs the Python package. The --fast flag builds only the components grilly needs (shaderc, loader, headers) and skips validation layers.

Linux (manual step-by-step)

# 1. System dependencies (Ubuntu/Debian)
sudo apt-get install -y cmake g++ ninja-build pkg-config \
    libxcb-dri3-0 libxcb-present0 libpciaccess0 libpng-dev \
    libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev libwayland-dev \
    libxrandr-dev libxcb-randr0-dev libx11-xcb-dev wayland-protocols

# 2. Vulkan SDK (download from https://vulkan.lunarg.com/sdk/home)
wget https://sdk.lunarg.com/sdk/download/1.4.341.1/linux/vulkansdk-linux-x86_64-1.4.341.1.tar.xz
tar xf vulkansdk-linux-x86_64-1.4.341.1.tar.xz
cd 1.4.341.1 && ./vulkansdk all -j $(nproc)
export VULKAN_SDK=$(pwd)/x86_64
export PATH=$VULKAN_SDK/bin:$PATH
export LD_LIBRARY_PATH=$VULKAN_SDK/lib:$LD_LIBRARY_PATH

# 3. Build grilly
git clone --recurse-submodules https://github.com/grillcheese-ai/grilly.git
cd grilly
pip install -e ".[dev]"
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc)

# 4. Install the compiled extension
cp build/grilly_core.*.so $(python -c "import grilly; print(grilly.__path__[0])")/

Windows

# 1. Install Vulkan SDK from https://vulkan.lunarg.com/sdk/home (Windows installer)
# 2. Install Visual Studio 2022 with C++ workload

git clone --recurse-submodules https://github.com/grillcheese-ai/grilly.git
cd grilly
pip install -e ".[dev]"
cmake -B build -DPYBIND11_FINDPYTHON=ON
cmake --build build --config Release
cp build\Release\grilly_core.*.pyd .

Pre-built binary (Windows x64, Python 3.12): Download grilly_core.cp312-win_amd64.pyd from the latest release and copy it into your grilly install directory.

macOS

# 1. Install Vulkan SDK from https://vulkan.lunarg.com/sdk/home#mac
brew install cmake ninja
# 2. Follow the Linux build steps above (uses MoltenVK)

Verify installation

import grilly
print(f"grilly {grilly.__version__}")

# Check GPU backend
try:
    from grilly.backend import _bridge
    print(f"Vulkan: {'enabled' if _bridge.is_available() else 'not available'}")
except ImportError:
    print("Vulkan: not installed (numpy fallback active)")

Requirements

|            | Minimum | Recommended      |
| ---------- | ------- | ---------------- |
| Python     | 3.12+   | 3.12             |
| GPU VRAM   | 8 GB    | 12 GB+           |
| System RAM | 32 GB   | 64 GB            |
| Vulkan     | 1.1+    | 1.4 (latest SDK) |

Supported GPUs: AMD (RX 5000+), NVIDIA (GTX 1060+), Intel (Arc A-series), Apple (M1+ via MoltenVK).


Quick Start

import numpy as np
from grilly import nn
from grilly.optim import AdamW

# Build a model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Train
optimizer = AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = np.random.randn(32, 784).astype(np.float32)
targets = np.random.randint(0, 10, (32,))

logits = model(x)
loss = loss_fn(logits, targets)
grad = loss_fn.backward(np.ones_like(loss), logits, targets)

model.zero_grad()
model.backward(grad)
optimizer.step()

Autograd

from grilly.nn import Variable, tensor

x = Variable(tensor([1.0, 2.0, 3.0]), requires_grad=True)
y = (x * x).sum()
y.backward()
print(x.grad)  # [2.0, 4.0, 6.0]

Functional API

import grilly.functional as F

out = F.linear(x, weight, bias)
out = F.relu(out)
out = F.softmax(out, dim=-1)
attn = F.flash_attention2(q, k, v)

See notebooks/01_getting_started.ipynb for a complete walkthrough.


Features

Layers (100+)

| Category      | Modules                                                       |
| ------------- | ------------------------------------------------------------- |
| Linear        | Linear, Embedding, CapsuleEmbedding, Dropout                  |
| Convolution   | Conv1d, Conv2d                                                |
| Recurrent     | LSTM, LSTMCell, GRU, GRUCell                                  |
| Normalization | LayerNorm, RMSNorm, BatchNorm1d/2d                            |
| Activations   | ReLU, GELU, SiLU, SwiGLU, GCU, RoSwish                        |
| Attention     | FlashAttention2/3, HYLAAttention, MultiheadAttention, RoPE    |
| LoRA          | LoRALinear, LoRAAttention, LoRAModel                          |
| Pooling       | MaxPool2d, AvgPool2d, AdaptiveMaxPool2d/AvgPool2d             |
| Loss          | MSELoss, CrossEntropyLoss, BCELoss                            |
| Containers    | Sequential, Residual                                          |
| Multimodal    | PerceiverIO, ImageBindFusion, FlamingoFusion, VisionLanguageModel |
| Memory        | MemoryRead, MemoryWrite, MemoryContextAggregate               |
| Routing       | DomainRouter, DomainPredictor, ExpertCombiner                 |

Spiking Neural Networks

Full SNN framework with GPU-accelerated spike dynamics:

  • Neurons: IFNode, LIFNode, ParametricLIFNode
  • Surrogate gradients: ATan, Sigmoid, FastSigmoid
  • Synapses: STPSynapse, DualTimescaleSynapse, SynapseFilter
  • Temporal containers: SeqToANNContainer, MultiStepContainer
  • Spiking attention: SpikingSelfAttention, QKAttention, TemporalWiseAttention
  • ANN-to-SNN conversion: Converter, VoltageScaler

Optimizers

| Optimizer              | Description                      |
| ---------------------- | -------------------------------- |
| Adam                   | Classic Adam                     |
| AdamW                  | Adam with decoupled weight decay |
| SGD                    | Stochastic gradient descent      |
| NLMS                   | Normalized Least Mean Squares    |
| NaturalGradient        | Fisher-preconditioned            |
| HypergradientAdamW     | OSGM-style auto learning rate    |
| AutoHypergradientAdamW | Fully automatic hypergradient    |
| AffectAdam             | Emotion-weighted updates         |

Schedulers: StepLR, CosineAnnealingLR, ReduceLROnPlateau, OneCycleLR.

Experimental Modules

| Module                 | Description                                                                 |
| ---------------------- | --------------------------------------------------------------------------- |
| experimental.vsa       | Vector Symbolic Architectures (binary, holographic, block-codes, resonator networks) |
| experimental.moe       | Mixture of Experts (relational encoder, resonator routing)                  |
| experimental.temporal  | Temporal reasoning (causal chains, counterfactuals, world models)           |
| experimental.cognitive | Cognitive controller (working memory, simulation, understand-think-speak)   |
| experimental.language  | Language processing (encoding, generation, parsing)                         |

Architecture

Python API                    C++ Bridge                  GPU Shaders
─────────────                 ──────────                  ───────────
nn.Module layers              pybind11 bindings           231 SPIR-V kernels
F.* stateless ops      →      dual-validity tensors  →    AMD / NVIDIA / Intel
optim.* optimizers            zero CPU↔GPU ping-pong      No CUDA dependency
autograd engine               buffer pool management      Vulkan 1.1+ compute

3-Level GPU Fallback

Every operation has automatic fallback:

  1. grilly C++ / Vulkan -- native compute shaders (fastest)
  2. PyTorch CUDA -- if torch is available (fast)
  3. NumPy CPU -- always available (correct)

Same API, same results, different speed. Your code never changes.
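As a hedged sketch of how such a dispatch chain can be structured (the `_bridge.relu` op name and the torch path here are illustrative, not grilly's exact internals):

```python
import numpy as np

def relu(x):
    """Illustrative 3-level dispatch: Vulkan bridge -> torch CUDA -> numpy."""
    try:
        from grilly.backend import _bridge           # level 1: native Vulkan
        if _bridge.is_available():
            return _bridge.relu(x)                   # hypothetical op name
    except ImportError:
        pass
    try:
        import torch                                 # level 2: PyTorch CUDA
        if torch.cuda.is_available():
            t = torch.as_tensor(x, device="cuda")
            return torch.relu(t).cpu().numpy()
    except ImportError:
        pass
    return np.maximum(x, 0.0)                        # level 3: numpy, always available
```

Whichever level handles the call, the numerical result is the same; only the latency changes.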

GPU Kernels (231 operations)

| Category       | Count | Examples                                     |
| -------------- | ----- | -------------------------------------------- |
| Linear algebra | 20+   | GEMM, FFT, SVD, matmul                       |
| Attention      | 15+   | flash attention, multi-head, spiking         |
| Convolution    | 10+   | conv2d forward/backward, im2col              |
| Learning       | 20+   | Adam, STDP, Hebbian, EWC, NLMS               |
| VSA            | 10+   | bind, bundle, similarity, resonator          |
| SNN            | 15+   | LIF/IF neuron, synapse, spike generation     |
| Normalization  | 10+   | layer norm, batch norm, RMS norm             |
| Activation     | 15+   | ReLU, GELU, SiLU, softmax (all with backward) |
| Memory/FAISS   | 10+   | similarity search, place/time cells          |

Ecosystem

| Package        | Description                                                     |
| -------------- | --------------------------------------------------------------- |
| optimum-grilly | HuggingFace Optimum backend -- from_pretrained on Vulkan        |
| CubeMind       | Neuro-vector-symbolic cognitive architecture powered by grilly  |

Notebooks & Tutorials

| Notebook                                         | Description                                                        |
| ------------------------------------------------ | ------------------------------------------------------------------ |
| notebooks/01_getting_started.ipynb               | Installation verification, first model, GPU check                  |
| notebooks/02_training_loop.ipynb                 | Full training loop: data loading, loss, optimization, checkpointing |
| notebooks/03_spiking_neural_networks.ipynb       | SNN neurons, STDP learning, ANN-to-SNN conversion                  |
| notebooks/04_vector_symbolic_architectures.ipynb | VSA ops: bind, bundle, similarity, resonator networks              |
| notebooks/05_attention_and_transformers.ipynb    | Flash attention, RoPE, PerceiverIO, multi-head attention           |

See also tutorials/ for standalone Python scripts covering every feature.


Testing

# All tests
uv run pytest tests/ -v

# CPU-only (no GPU required)
uv run pytest tests/ -m "not gpu" -v

# With coverage
uv run pytest tests/ --cov=. --cov-report=term

# Single module
uv run pytest tests/test_linear.py -v

Environment Variables

| Variable         | Description                          | Default |
| ---------------- | ------------------------------------ | ------- |
| VK_GPU_INDEX     | Select GPU by index                  | 0       |
| GRILLY_DEBUG     | Enable debug logging (1 = on)        | off     |
| ALLOW_CPU_VULKAN | Allow Mesa llvmpipe software Vulkan  | off     |
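These are read when the backend initializes, so set them before importing grilly — for example, from Python (illustrative):

```python
import os

# Pick the second GPU and turn on debug logging before grilly is imported.
os.environ["VK_GPU_INDEX"] = "1"
os.environ["GRILLY_DEBUG"] = "1"
```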

What's New

Pre-v1.0 (current main, since 0.6.1)

A long debug session against a real workload (vsa_lm_v3c_grilly — language modeling with multiplication-free FFN + causal Linear-RNN mixer) surfaced and fixed a stack of bugs and perf cliffs that the 0.6.1 test suite never tripped. Each fix is small in isolation; the pile is large enough to warrant a major version bump.

Performance — bridge dispatch overhauled

  • BufferPool::allocateBuffer VMA fix. Changed preferredFlags → requiredFlags = DEVICE_LOCAL_BIT. The old code silently fell back to slow host-visible BAR memory on AMD/Windows when the allocator's auto-select picked the wrong heap; the fix forces memoryType[2] (DEVICE_LOCAL+HOST_VISIBLE+HOST_COHERENT) under Resizable BAR or fails loudly when ReBAR is unavailable. (cpp/src/buffer_pool.cpp)
  • 3-way bucket pool routing. acquire / acquireDeviceLocal / acquireReadback now have separate per-size pools; release routes by the buffer's deviceLocal / readback flag. Prevents a DL buffer from being picked up by a host-visible acquire and crashing on mappedPtr=null.
  • Staging pattern across all hot ops ("Thread A"). Each op acquires DEVICE_LOCAL VRAM compute buffers + WC sequential-write stage-in + HOST_CACHED random-read stage-out, batches a single command buffer with copyBuffer × N → barrier → dispatch → barrier → copyBuffer × M → submit/wait. Applied to:
    • cpp/src/ops/linear.cpp — linear, linearBackward, dropout
    • cpp/src/ops/activations.cpp — activationForward / activationBackward helpers (covers ReLU/GELU/SiLU/Tanh)
    • cpp/src/ops/layernorm.cpp — layernorm, layernormBackward
    • cpp/src/ops/embedding.cpp — embeddingLookup
    • cpp/src/ops/optimizer.cpp — adamUpdate, adamwUpdate
    • cpp/src/ops/loss.cpp — crossEntropyLoss, crossEntropyBackward
  • Measured impact: forward nn.Linear on a 4096×384×1152 GEMM went from 763 ms → 19 ms on an AMD RX 6750 XT (~40x). The download phase alone collapsed from 749 ms → 2.7 ms once the output stage moved to HOST_CACHED memory (random-read instead of uncached WC reads).
  • transferComputeBarrier() added to CommandBatch — bidirectional TRANSFER ↔ COMPUTE memory + execution barrier needed by the staging pattern (the existing barrier() is COMPUTE→COMPUTE only, kept unchanged for linearBackward's 3-pass intra-shader barriers).
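The readback delta is consistent with moving from uncached write-combined reads to cached host memory; a quick back-of-envelope check on the numbers quoted above:

```python
# Sanity-check the measured download times against the output size.
out_bytes = 4096 * 1152 * 4     # fp32 output of the 4096x384x1152 GEMM
out_mb = out_bytes / 1e6        # ~18.9 MB

bw_before = out_mb / 0.749      # 749 ms download  -> ~25 MB/s (uncached WC reads)
bw_after = out_mb / 0.0027      # 2.7 ms download -> ~7000 MB/s (HOST_CACHED reads)

print(f"{out_mb:.1f} MB: {bw_before:.0f} MB/s -> {bw_after:.0f} MB/s")
```

~25 MB/s is in the range typical of CPU reads from uncached write-combined memory, which is why the download phase dominated the old 763 ms total.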

fp16 + cooperative matrix GEMM

  • shaders/gemm-coopmat-shared.glsl — fp16 tiled GEMM via VK_KHR_cooperative_matrix with shared-memory staging. Subgroup scope, 16×64 (M×N) tile per workgroup, 256 threads (4×Wave64 subgroups), fp32 accumulator. Dispatches to native WMMA on RDNA3 and NVIDIA RTX, falls through the driver emulation path on RDNA1/RDNA2.
  • shaders/gemm-bias-add.glsl — companion row-broadcast bias add (the coopmat store can't interleave bias inline).
  • LinearParams.elemSize — new field (4 = fp32, 2 = fp16). linear() selects gemm-coopmat-shared when elemSize == 2, cooperative-matrix is supported, AND shape is aligned (M%16, K%16, N%64); otherwise falls back to fnn-linear.glsl.
  • Pybind: generic py::array — bindings_linear.cpp now accepts fp32 OR fp16 numpy input via xBuf.itemsize. Output is always fp32 (coopmat accumulator). Bias must be fp32 regardless of input dtype.
  • linearBackward interface upgrade — same void* + elemSize signature so the fp16 path slots in cleanly when an fp16 backward shader lands. For now elemSize != 4 raises with a clear message.
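The selection rule above reduces to a simple predicate; a sketch (the function name and flag names are illustrative, not grilly's internals):

```python
def use_coopmat_gemm(m: int, k: int, n: int, elem_size: int, has_coopmat: bool) -> bool:
    """Mirror of the dispatch rule: fp16 input, device support, aligned shape."""
    return (
        elem_size == 2       # fp16 input (LinearParams.elemSize)
        and has_coopmat      # VK_KHR_cooperative_matrix reported by the device
        and m % 16 == 0
        and k % 16 == 0
        and n % 64 == 0      # 16x64 (MxN) tile per workgroup
    )
```

Any shape that fails the alignment check silently takes the fnn-linear.glsl fallback, so fp16 inputs never error out on odd sizes.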

Causal Linear-RNN prefix scan (new feature)

  • shaders/prefix-scan-causal.glsl — computes h_t = a_t * h_{t-1} + x_t in O(log S) parallel depth via subgroupInclusiveAdd on log(a) and the rescaled input (Blelloch's two-scan trick). Strictly causal; one workgroup per (batch, hidden_dim) pair.
  • shaders/prefix-scan-causal-backward.glsl — anti-causal scan for grad_x and grad_a via the identity R[t] = total - F[t] + w[t] (no subgroupShuffle, which is undefined on partial Wave64 subgroups). Hits fp32 epsilon vs the closed-form gradient (verified max abs err ≈ 3.6e-6).
  • grilly/cpp/src/ops/prefix_scan.cpp — C++ dispatcher with the same staging pattern as the rest of Thread A.
  • grilly/cpp/python/bindings_prefix_scan.cpp — pybind exposing prefix_scan_causal and prefix_scan_causal_backward.
  • grilly/nn/prefix_scan.py — Python autograd wrapper (prefix_scan_causal()) wired into grilly's Variable / GradFn system, plus a CausalSequenceMixer module that uses it as a drop-in causal sequence-pooling replacement.
  • Constraint: seq_len <= 32 (one thread per time step in a single subgroup). A hierarchical multi-subgroup version is on the roadmap.
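As a reference for what the shader computes, the sequential recurrence and its scan formulation can be checked against each other in numpy (a CPU reference only, not the GPU path; positive decay values assumed, matching the shader's log-space trick):

```python
import numpy as np

def causal_rnn_sequential(a, x):
    """h_t = a_t * h_{t-1} + x_t with h_{-1} = 0; a, x of shape (S, D)."""
    h = np.zeros_like(x)
    prev = np.zeros(x.shape[1:], dtype=x.dtype)
    for t in range(x.shape[0]):
        prev = a[t] * prev + x[t]
        h[t] = prev
    return h

def causal_rnn_scan(a, x):
    """Scan form: h_t = A_t * sum_{s<=t} x_s / A_s with A_t = prod_{r<=t} a_r.
    The shader computes the same thing in log space via subgroupInclusiveAdd."""
    A = np.cumprod(a, axis=0)
    return A * np.cumsum(x / A, axis=0)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, (32, 8))    # seq_len <= 32, per the current constraint
x = rng.standard_normal((32, 8))
assert np.allclose(causal_rnn_sequential(a, x), causal_rnn_scan(a, x), atol=1e-6)
```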

Autograd — actually working now

  • Module.__setattr__ auto-registration. self.weight = nn.Parameter(...) and self.lin = nn.Linear(...) now populate _parameters / _modules automatically. Standard PyTorch idiom. Was previously silently broken — every Module subclass returned 0 parameters from parameters(), AdamW silently no-op'd.
  • nn.Linear.forward autograd wiring. When the input is a Variable, the output is wrapped in a Variable with a GradFn whose backward closure calls the existing Linear.backward() (which already populates weight.grad/bias.grad via the GPU shader). Same template applied to nn.LayerNorm.forward and nn.Embedding.forward.
  • Variable.__array__ — numpy array protocol on nn.autograd.Variable. np.matmul(tensor, w) / np.dot(tensor, w) / np.asarray(tensor) now operate on the backing ndarray transparently. Required to let grilly's existing numpy-native layer code keep working when called from torch_api Tensor inputs.
  • Module.__call__ Variable passthrough + output wrap. Inputs of type Tensor / LongTensor / Variable are passed through to forward() unchanged; raw ndarray outputs are re-wrapped in Tensor so chained calls preserve torch-style type all the way through user-defined Module subclasses.
  • Parameter shape methods — unsqueeze, view, mean(dim=...), detach added to nn.Parameter so user forward code can do self.weight.unsqueeze(0) / self.weight.view(...) / self.weight.mean(dim=-1) without knowing that Parameter is an np.ndarray subclass.
  • nn.init.normal_/uniform_ — added a _writable_array(tensor) helper that unwraps Tensor/Variable wrappers to their backing ndarray for in-place init. Previously raised TypeError: 'Tensor' object does not support item assignment for the standard nn.init.normal_(self.weight, 0, 0.02) idiom.
  • F.gelu re-export in grilly.nn.functional (was importable via grilly.nn.autograd.gelu but missing from the public functional namespace).
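A minimal sketch of the auto-registration mechanic (stand-in classes, not grilly's actual implementation):

```python
import numpy as np

class Parameter(np.ndarray):
    """Stand-in for nn.Parameter: an ndarray subclass."""

class Module:
    def __setattr__(self, name, value):
        # Route Parameter and child-Module assignments into the registries
        # that parameters() walks, so `self.weight = Parameter(...)` suffices.
        if isinstance(value, Parameter):
            self.__dict__.setdefault("_parameters", {})[name] = value
        elif isinstance(value, Module):
            self.__dict__.setdefault("_modules", {})[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        params = list(self.__dict__.get("_parameters", {}).values())
        for child in self.__dict__.get("_modules", {}).values():
            params.extend(child.parameters())
        return params

class Linear(Module):
    def __init__(self, n_in, n_out):
        self.weight = np.zeros((n_out, n_in)).view(Parameter)
        self.bias = np.zeros(n_out).view(Parameter)

class MLP(Module):
    def __init__(self):
        self.l1 = Linear(4, 8)
        self.l2 = Linear(8, 2)

assert len(MLP().parameters()) == 4   # without __setattr__ this was 0
```

Without the `__setattr__` hook, `parameters()` sees empty registries and every optimizer step is a no-op — exactly the failure mode described above.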

Checkpoint format

  • .grl save/load roundtrip fixed. torch.save({'model': sd, 'step': N}, path) followed by ck = torch.load(path) now returns exactly what was saved (matches torch.save/torch.load semantics). The previous load_grl force-wrapped content under a fixed 'model' key, producing ck['model']['model']['weight'] instead of ck['model']['weight'] for any payload that already contained a model key.
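The contract is plain identity on roundtrip. A property check (pickle is used here as a stand-in serializer — the .grl format itself is grilly's own):

```python
import os
import pickle
import tempfile

def roundtrip(obj):
    """Save then load; the fixed loader must return exactly what was saved,
    with no forced wrapping under a 'model' key."""
    with tempfile.NamedTemporaryFile(suffix=".grl", delete=False) as f:
        pickle.dump(obj, f)
        path = f.name
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        os.unlink(path)

ck = roundtrip({"model": {"weight": [1.0, 2.0]}, "step": 100})
assert ck["model"]["weight"] == [1.0, 2.0]   # not ck["model"]["model"]["weight"]
assert ck["step"] == 100
```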

Editable install / Vulkan probe

  • grilly/__init__.py sys.path fix. Added an os.path.dirname insert at the very top of the package init so import grilly_core works under PEP 660 editable installs. The path hook used by modern editable installs (__editable__.grilly-X.Y.finder.__path_hook__) doesn't add the package directory to sys.path, so the sibling grilly_core.<plat>.pyd was invisible to import grilly_core. The downstream effect: backend/base.py:_probe_cpp_vulkan() silently caught the ModuleNotFoundError, set VULKAN_AVAILABLE = False, and the entire nn stack thought it had no GPU (despite a perfectly working Vulkan device).
  • Module._get_backend() graceful None. Catches the legacy VulkanCompute init exception and returns None so layers that only used _get_backend() for one-time GPU Xavier init at construction time don't crash when the legacy Python vulkan ctypes package isn't installed (the new C++ _bridge path doesn't need it).

Pre-existing shader bugs surfaced by recompile

Three shaders shipped stale .spv files compiled by a more permissive glslang version; recompiling with a recent glslang surfaced the following:

  • fused-layernorm-linear.glsl — added missing #extension GL_EXT_shader_atomic_float : require for the atomicAdd(shared_sum, sg_sum) accumulator.
  • lstm-cell-forward.glsl — renamed buffer field input → input_data (input is a reserved word in recent glslang). Also removed an incorrect writeonly qualifier on the gates buffer that the shader actually reads back.
  • vsa-explore.glsl — renamed buffer field output → output_data. Same writeonly mismatch fix.

Tooling

  • rebuild.ps1 — one-command Windows rebuild. Compiles all GLSL → SPIR-V (with -S comp to disambiguate the stage, --target-env vulkan1.3 for cooperative matrix + subgroup extensions), runs cmake --build build2 --config Release --target grilly_core, copies the freshly built .pyd to the package root. Skips up-to-date shaders by mtime comparison.
  • PipelineCache::getDevice() accessor — needed by linear.cpp to query hasCooperativeMatrix() before selecting the coopmat shader path.

Lint cleanup

  • 75 ruff errors fixed across the codebase. Mix of unsorted imports (I001), unused imports (F401), missing f-string placeholders (F541), deprecated typing imports (UP035), non-PEP 585 annotations (UP006), and a yield from modernization in nn.Module.named_buffers.

0.6.x

  • MindForge adapter hypernetwork integration (via CubeMind)
  • Synaptic shaders: synapsis-stdp-update.glsl, bridge-spike-to-continuous.glsl
  • JIT shader fusion with shaderc runtime compilation
  • Perceiver IO with IndexCache K/V pre-projection
  • MoQE Gumbel-Softmax router shader
  • 215 compute shaders (up from 190)

0.5.x

  • C++ Tensor with dual-validity tracking -- GPU-resident data, no CPU ping-pong
  • Flash Attention 3 with subgroup acceleration
  • HYLAAttention (softmax-free), FNetMixing, SympFormerBlock
  • HDC packed ops -- 32x memory compression
  • Sanger GHA for neurogenesis
  • JIT compilation framework (@grilly.jit)
  • Automatic Mixed Precision (autocast + GradScaler)

Contributing

  1. Fork the repo and create a feature branch
  2. Add tests for new features
  3. Run ruff check . and uv run pytest tests/ -v
  4. Submit a pull request

License

MIT License -- see LICENSE for details.