Skip to content

AndreSlavescu/meTile

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

meTile

Tile-based eDSL and compiler for Apple GPUs. Write tile programs in Python, compile to Metal.

@metile.kernel (Python eDSL)
    - Tile IR (hardware-agnostic)
    - Metal IR (simdgroup mappings, threadgroup memory)
    - MSL codegen
    - xcrun metal -O2 (precompiled metallib)
    - dispatch via ctypes Metal bridge

Example: GEMM

@metile.kernel
def gemm(A, B, C, M, N, K,
         BLOCK_M: metile.constexpr, BLOCK_N: metile.constexpr,
         BLOCK_K: metile.constexpr):
    pid_m = metile.program_id(0)
    pid_n = metile.program_id(1)
    acc = metile.zeros((BLOCK_M, BLOCK_N), dtype="f32")
    for k in metile.tile_range(0, K, BLOCK_K):
        a = metile.tile_load(A, pid_m * BLOCK_M, k, K, (BLOCK_M, BLOCK_K))
        b = metile.tile_load(B, k, pid_n * BLOCK_N, N, (BLOCK_K, BLOCK_N))
        acc = metile.dot(a, b, acc)
    metile.tile_store(C, pid_m * BLOCK_M, pid_n * BLOCK_N, N, acc, (BLOCK_M, BLOCK_N))

Example: Softmax

@metile.kernel
def softmax(X, Out, N, BLOCK: metile.constexpr):
    row = metile.program_id(0)
    m = -1e38
    for i in metile.tile_range(0, N, BLOCK):
        cols = i + metile.arange(0, BLOCK)
        mask = cols < N
        x = metile.load(X + row * N + cols, mask=mask)
        m = metile.maximum(m, x)
    m = metile.max(m)

    s = 0.0
    for i in metile.tile_range(0, N, BLOCK):
        cols = i + metile.arange(0, BLOCK)
        mask = cols < N
        x = metile.load(X + row * N + cols, mask=mask)
        s = s + metile.exp(x - m)
    s = metile.sum(s)

    for i in metile.tile_range(0, N, BLOCK):
        cols = i + metile.arange(0, BLOCK)
        mask = cols < N
        x = metile.load(X + row * N + cols, mask=mask)
        metile.store(Out + row * N + cols, metile.exp(x - m) / s, mask=mask)

Features

eDSL & Frontend

  • Python-based eDSL: @metile.kernel, program_id, arange, load/store, dot, tile_load/tile_store
  • Autotuner (@metile.autotune) with config search over block sizes, SG counts, execution modes

Compiler

  • Multi-level IR pipeline: Tile IR (hardware-agnostic) → Metal IR (decomposed primitives) → MSL
  • CuTe-inspired layout algebra with hierarchical Shape:Stride, composition, complement, and logical divide. Supports arbitrary tile shapes.
  • Composable optimization passes that transform IR structure: shared memory padding / XOR swizzle, split-K, vectorized loads, serpentine MMA traversal, preloaded tiles, double-buffered K-loop, block swizzle for L2 locality
  • Fused epilogues (ReLU, exp, scale) on register-resident accumulators via thread_elements() with zero global memory traffic.

Codegen

  • Simdgroup matrix (8x8) MMA with decomposed load / MMA / store primitives
  • Metal 4 tensor_ops (matmul2d) for TF32 on M5, M5 pro, and M5 max (requires Xcode to be built as well) via preemptive and cooperative execution modes
  • AOT compilation via xcrun metal -O2 with JIT fallback (newLibraryWithSource) when Xcode is unavailable

Runtime

  • Zero-copy unified memory via metile.Buffer. CPU and GPU share the same physical memory.
  • Pure Python runtime. meTile has a ctypes Metal bridge with no PyObjC dependency.

Install

pip install -e ".[dev]"

Run Tests

# run individual test
python -m pytest tests/test_gemm.py -v

# run all like so
python -m pytest tests/ -x -q

# or with `make test`
make test

Run Benchmarks

# run individual benchmark
python benchmarks/gemm.py

# or run `make bench` for running all the benchmarks
make bench

Architecture

@metile.kernel (Python eDSL)
    - Tile IR (hardware-agnostic ops: program_id, tile_load, dot, ...)
    - Metal IR (decomposed primitives: simdgroup load/MMA/store, cooperative_tensor ops)
    - Optimization passes (serpentine reordering, preload, pad/swizzle, split-K, ...)
    - MSL codegen (op-by-op emission)
    - xcrun metal -O2 (precompiled metallib)
    - dispatch via ctypes Metal bridge
Layer File Role
Frontend frontend/kernel.py @kernel decorator, compilation pipeline, dispatch
Frontend frontend/tracing.py eDSL ops, constexpr folding, tensor descriptors
Tile IR ir/tile_ir.py Hardware-agnostic tile operations
Metal IR ir/metal_ir.py Decomposed Apple GPU primitives (simdgroup, tensor_ops, cooperative loads)
Layout ir/layout.py CuTe-inspired layout algebra (Shape:Stride, composition, logical divide)
Lowering compiler/lowering.py Tile IR → Metal IR (GEMM detection, simdgroup/tensor_ops paths)
Passes compiler/passes.py IR → IR transforms (serpentine, preload, pad, swizzle, split-K, vectorize)
Codegen codegen/msl_emitter.py Metal IR → MSL (uniform op walker, no per-kernel templates)
Runtime runtime/metal_device.py Metal API via ctypes (compile, dispatch, sync)
Runtime runtime/buffer.py Zero-copy unified memory buffers

Citations

metile's layout algebra is directly inspired by CuTe's hierarchical layout representation:

@misc{cecka2026cute,
    title={CuTe Layout Representation and Algebra},
    author={Cris Cecka},
    year={2026},
    eprint={2603.02298},
    archivePrefix={arXiv},
    primaryClass={cs.MS},
    url={https://arxiv.org/abs/2603.02298}
}

The metile/ir/layout.py module implements CuTe's core concepts, Layout(shape, stride) with hierarchical tuples, colexicographic coordinate mapping, and algebraic operations (coalesce, compose, complement, logical divide, logical product), adapted for Apple GPU tiling patterns (simdgroup 8x8, threadgroup memory banking, cooperative loads).

metile's eDSL design and tile-level programming model draws from Triton, which is a popular tile-based multi-program multi-data pythonic eDSL:

@inproceedings{tillet2019triton,
    title={Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations},
    author={Philippe Tillet and H. T. Kung and David Cox},
    booktitle={Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages},
    year={2019},
    doi={10.1145/3315508.3329973}
}

Links

License

MIT

About

python-based eDSL for efficient Metal Shading Language code generation

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors