12 changes: 12 additions & 0 deletions CLAUDE.md
@@ -21,12 +21,22 @@ The entire 209GB model streams from SSD through a custom Metal compute pipeline.

## Hardware

Primary development machine:

- **Machine**: MacBook Pro, Apple M3 Max
- **Chip**: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
- **Memory**: 48 GB unified (~400 GB/s bandwidth)
- **SSD**: 1TB Apple Fabric, **17.5 GB/s sequential read** (measured)
- **macOS**: 26.2 (Darwin 25.2.0)

Also verified on:

- **Machine**: MacBook Pro, Apple M4 Pro
- **Chip**: 14-core CPU (10P + 4E), 20-core GPU
- **Memory**: 24 GB unified (~273 GB/s bandwidth)
- **macOS**: 15.4 (Darwin 24.4.0)
- **Result**: **3.50 tok/s** at 4-bit — only ~20% slower despite half the RAM and half the GPU cores. The architecture scales down to 24 GB unified memory without modification.

## Architecture

The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.
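As a rough sketch of the expert routing described above (assuming a plain linear router with softmax over the selected logits — the repo's actual router details aren't shown here, and `route` is an illustrative name, not a function from the codebase):

```python
import numpy as np

HIDDEN = 4096      # hidden dimension (from the architecture notes)
NUM_EXPERTS = 512  # routed experts per MoE layer
TOP_K = 4          # experts activated per token (plus one shared expert)

rng = np.random.default_rng(0)
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)).astype(np.float32)

def route(hidden_state):
    """Pick the top-K experts for one token and softmax-normalize their weights."""
    logits = hidden_state @ router_w                # (NUM_EXPERTS,)
    top = np.argpartition(logits, -TOP_K)[-TOP_K:]  # indices of the K largest logits
    w = np.exp(logits[top] - logits[top].max())     # stable softmax over K logits
    return top, w / w.sum()

token = rng.standard_normal(HIDDEN).astype(np.float32)
experts, weights = route(token)
```

Only the K=4 selected experts' weights ever need to be read from SSD for a given token, which is what makes streaming a 397B-parameter model through a small page cache viable.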
@@ -95,6 +105,7 @@ metal_infer/
main.m # MoE-only benchmark
Makefile # Build system
extract_weights.py # Creates model_weights.bin from safetensors
export_tokenizer.py # Exports tokenizer.json → tokenizer.bin + vocab.bin (with GPT-2 byte decoding)
repack_experts_2bit.py # 4-bit → 2-bit expert requantization
train_predictor.py # Expert routing prediction analysis
model_weights.bin # Non-expert weights (5.5GB, mmap'd)
@@ -112,6 +123,7 @@ results.tsv # Experiment log (58 experiments)
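The `export_tokenizer.py` step above depends on GPT-2's reversible byte↔unicode mapping: vocab entries store bytes as printable-looking characters (space becomes `Ġ`, newline becomes `Ċ`), and these must be mapped back to raw bytes before UTF-8 decoding, or they leak into the output. A minimal sketch of the decode side, following the standard GPT-2 construction; `gpt2_byte_decoder` and `decode_piece` are illustrative names, not functions from the repo:

```python
def gpt2_byte_decoder():
    """Invert GPT-2's bytes->unicode mapping: visible bytes map to themselves,
    the remaining 68 bytes map to codepoints 256, 257, ... in order."""
    visible = (list(range(ord("!"), ord("~") + 1)) +
               list(range(ord("¡"), ord("¬") + 1)) +
               list(range(ord("®"), ord("ÿ") + 1)))
    char_to_byte = {chr(b): b for b in visible}
    n = 0
    for b in range(256):
        if b not in visible:
            char_to_byte[chr(256 + n)] = b  # e.g. byte 0x20 (space) -> 'Ġ'
            n += 1
    return char_to_byte

DECODER = gpt2_byte_decoder()

def decode_piece(piece: str) -> str:
    """Turn a raw vocab entry like 'Ġhello' back into real bytes, then UTF-8."""
    return bytes(DECODER[c] for c in piece).decode("utf-8", errors="replace")

print(decode_piece("Ġhello"))  # -> " hello"
```

Baking this mapping into `vocab.bin` at export time is what eliminates the `Ġ`/`Ċ` garbage noted in the table below, with zero inference-time cost.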
### Kept
| Approach | Result | Impact |
|----------|--------|--------|
| GPT-2 byte decoding in vocab.bin | Fixes `Ġ`/`Ċ` garbage in output | **Correctness** |
| FMA dequant kernel | GPU compute -12% | **+12% tok/s** |
| Trust OS page cache | Deleted Metal LRU → +38% | **Foundational** |
| GPU combine+norm in CMD3 | Eliminates CPU round-trip | **Pipeline** |
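The FMA dequant row can be illustrated with a numpy sketch of the underlying algebra, under the assumption of affine per-group quantization (the group size of 32 is a guess): folding the zero-point into a precomputed per-group bias turns the naive two-op dequant into a single multiply-add per weight — the shape a GPU `fma()` instruction executes directly:

```python
import numpy as np

GROUP = 32  # assumed quantization group size

rng = np.random.default_rng(1)
q = rng.integers(0, 16, size=GROUP).astype(np.float32)  # 4-bit codes, one group
scale, zero = np.float32(0.02), np.float32(7.5)         # per-group quant params

# Naive affine dequant: subtract the zero-point, then scale (two ops per weight)
w_naive = scale * (q - zero)

# FMA form: fold the zero-point into a per-group bias once outside the loop,
# so each weight needs only one fused multiply-add
bias = -scale * zero
w_fma = q * scale + bias

assert np.allclose(w_naive, w_fma)
```

The transformation is exact, so it trades nothing for the ~12% GPU-compute reduction logged above.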
1 change: 1 addition & 0 deletions metal_infer/results.tsv
@@ -22,3 +22,4 @@ HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 4.04 0 5.5 discard Fused GPU final-norm +
HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 4.35 0 9.8 keep Indexed expert cache makes a larger hotset viable: 2500-entry Metal cache averaged 3.75ms/layer and 4.35 tok/s vs 4.09ms/layer and 3.98 tok/s at 1500 entries on the fixed ocean prompt. 3500+ entries regressed despite higher hit rate, so 2500 is the new sweet spot.
HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 5.74 0 5.5 keep TRUST THE OS: killed Metal LRU cache + F_NOCACHE. OS page cache alone = 5.74 tok/s (was 4.11 with 9.8GB Metal cache). 38% faster. Near-zero memory pressure.
HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 5.37 2615 5.5 keep Batch prefill: TTFT 5587→2615ms (53% faster). discard_deferred_experts() saves 57ms/intermediate token. Generation 5.37 unchanged.
3601d41 Qwen3.5-397B-A17B-4bit 397.0 17.0 3.50 4613 5.1 keep M4 Pro 24GB (half RAM, half GPU cores vs M3 Max 48GB): 3.50 tok/s steady-state, TTFT 4613ms. Trust-OS page cache ~14GB. Only 20% slower despite ~32% lower memory bandwidth (~273 vs ~400 GB/s). Confirms architecture scales down to 24GB unified memory.
1 change: 1 addition & 0 deletions results.tsv
@@ -32,3 +32,4 @@ c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 4.10 0 5.5 discard GPU LUT dequant ker
c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 3.33 0 5.5 discard GPU private buffer compression: blit shared→private before matvec. Isolated -13.5% but 4×7MB blit cost exceeds savings in pipeline.
c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 3.91 0 5.5 discard dispatch_io: -70% slower than pread (dispatch_data overhead kills it).
c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 3.91 0 5.5 discard Expert file clustering: NVMe doesn't care about scatter distance at 7MB read granularity. 0%.
3601d41 Qwen3.5-397B-A17B-4bit 397.0 17.0 3.50 4613 5.1 keep M4 Pro 24GB (half RAM, half GPU cores vs M3 Max 48GB): 3.50 tok/s steady-state, TTFT 4613ms. Trust-OS page cache ~14GB. Only 20% slower despite ~32% lower memory bandwidth (~273 vs ~400 GB/s). Confirms architecture scales down to 24GB unified memory.