diff --git a/CLAUDE.md b/CLAUDE.md
index e3066a9..31fd060 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -21,12 +21,22 @@ The entire 209GB model streams from SSD through a custom Metal compute pipeline.
 
 ## Hardware
 
+Primary development machine:
+
 - **Machine**: MacBook Pro, Apple M3 Max
 - **Chip**: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
 - **Memory**: 48 GB unified (~400 GB/s bandwidth)
 - **SSD**: 1TB Apple Fabric, **17.5 GB/s sequential read** (measured)
 - **macOS**: 26.2 (Darwin 25.2.0)
 
+Also verified on:
+
+- **Machine**: MacBook Pro, Apple M4 Pro
+- **Chip**: 14-core CPU (10P + 4E), 20-core GPU
+- **Memory**: 24 GB unified (~273 GB/s bandwidth)
+- **macOS**: 15.4 (Darwin 24.4.0)
+- **Result**: **3.50 tok/s** at 4-bit — only ~20% slower despite half the RAM and half the GPU cores. The architecture scales down to 24 GB unified memory without modification.
+
 ## Architecture
 
 The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.
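For reviewers: the K=4-of-512 routing described in the Architecture section above can be sketched in a few lines of numpy. This is a hypothetical illustration of top-k MoE gating, not the repo's actual Metal/Objective-C code; the function name `route_tokens` is invented here.

```python
import numpy as np

def route_tokens(router_logits: np.ndarray, k: int = 4):
    """Top-k MoE routing sketch: pick the k highest-scoring of the 512
    experts per token, then softmax over just those k to get combine
    weights. (The always-on shared expert is added separately.)"""
    topk = np.argpartition(router_logits, -k, axis=-1)[..., -k:]  # expert ids, unordered
    gates = np.take_along_axis(router_logits, topk, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))     # numerically stable softmax over k
    gates /= gates.sum(axis=-1, keepdims=True)
    return topk, gates

# One token, 512 experts: 4 expert ids come back, weights sum to 1.
ids, w = route_tokens(np.random.randn(1, 512))
```

Because only the selected K=4 expert weight blocks are needed per token, this gating is what lets the 209GB model stream from SSD: everything else stays cold on disk.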
@@ -95,6 +105,7 @@ metal_infer/
   main.m                 # MoE-only benchmark
   Makefile               # Build system
   extract_weights.py     # Creates model_weights.bin from safetensors
+  export_tokenizer.py    # Exports tokenizer.json → tokenizer.bin + vocab.bin (with GPT-2 byte decoding)
   repack_experts_2bit.py # 4-bit → 2-bit expert requantization
   train_predictor.py     # Expert routing prediction analysis
   model_weights.bin      # Non-expert weights (5.5GB, mmap'd)
@@ -112,6 +123,7 @@ results.tsv              # Experiment log (58 experiments)
 ### Kept
 
 | Approach | Result | Impact |
 |----------|--------|--------|
+| GPT-2 byte decoding in vocab.bin | Fixes `Ġ`/`Ċ` garbage in output | **Correctness** |
 | FMA dequant kernel | GPU compute -12% | **+12% tok/s** |
 | Trust OS page cache | Deleted Metal LRU → +38% | **Foundational** |
 | GPU combine+norm in CMD3 | Eliminates CPU round-trip | **Pipeline** |
diff --git a/metal_infer/results.tsv b/metal_infer/results.tsv
index 86cc254..13f9a66 100644
--- a/metal_infer/results.tsv
+++ b/metal_infer/results.tsv
@@ -22,3 +22,4 @@ HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 4.04 0 5.5 discard Fused GPU final-norm +
 HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 4.35 0 9.8 keep Indexed expert cache makes a larger hotset viable: 2500-entry Metal cache averaged 3.75ms/layer and 4.35 tok/s vs 4.09ms/layer and 3.98 tok/s at 1500 entries on the fixed ocean prompt. 3500+ entries regressed despite higher hit rate, so 2500 is the new sweet spot.
 HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 5.74 0 5.5 keep TRUST THE OS: killed Metal LRU cache + F_NOCACHE. OS page cache alone = 5.74 tok/s (was 4.11 with 9.8GB Metal cache). 38% faster. Near-zero memory pressure.
 HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 5.37 2615 5.5 keep Batch prefill: TTFT 5587→2615ms (53% faster). discard_deferred_experts() saves 57ms/intermediate token. Generation 5.37 unchanged.
+3601d41 Qwen3.5-397B-A17B-4bit 397.0 17.0 3.50 4613 5.1 keep M4 Pro 24GB (half RAM, half GPU cores vs M3 Max 48GB): 3.50 tok/s steady-state, TTFT 4613ms. Trust-OS page cache ~14GB. Only 20% slower despite halved memory bandwidth (~273 vs ~400 GB/s). Confirms architecture scales down to 24GB unified memory.
diff --git a/results.tsv b/results.tsv
index 0eeed6c..42a546c 100644
--- a/results.tsv
+++ b/results.tsv
@@ -32,3 +32,4 @@ c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 4.10 0 5.5 discard GPU LUT dequant ker
 c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 3.33 0 5.5 discard GPU private buffer compression: blit shared→private before matvec. Isolated -13.5% but 4×7MB blit cost exceeds savings in pipeline.
 c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 3.91 0 5.5 discard dispatch_io: -70% slower than pread (dispatch_data overhead kills it).
 c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 3.91 0 5.5 discard Expert file clustering: NVMe doesn't care about scatter distance at 7MB read granularity. 0%.
+3601d41 Qwen3.5-397B-A17B-4bit 397.0 17.0 3.50 4613 5.1 keep M4 Pro 24GB (half RAM, half GPU cores vs M3 Max 48GB): 3.50 tok/s steady-state, TTFT 4613ms. Trust-OS page cache ~14GB. Only 20% slower despite halved memory bandwidth (~273 vs ~400 GB/s). Confirms architecture scales down to 24GB unified memory.
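For reviewers: the `Ġ`/`Ċ` fix that export_tokenizer.py bakes into vocab.bin follows the standard GPT-2 byte-level BPE convention, where every raw byte is mapped to a printable Unicode stand-in (space → `Ġ` U+0120, newline → `Ċ` U+010A) and decoding inverts that table before UTF-8 decoding. A minimal sketch of the well-known GPT-2 `bytes_to_unicode` table and its inverse — an illustration of the technique, not the repo's actual export code:

```python
def bytes_to_unicode() -> dict:
    """GPT-2's byte -> printable-char table: printable ASCII/Latin-1 bytes
    map to themselves; the rest (space, newline, control bytes, ...) are
    shifted up past U+0100 so every byte has a visible stand-in."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # e.g. 0x20 (space) -> U+0120 'Ġ', 0x0A (newline) -> U+010A 'Ċ'
            n += 1
    return dict(zip(bs, map(chr, cs)))

BYTE_DECODER = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(tok: str) -> str:
    """Map each stand-in char back to its raw byte, then UTF-8 decode."""
    return bytes(BYTE_DECODER[c] for c in tok).decode("utf-8", errors="replace")

# 'Ġhello' decodes to ' hello'; 'Ċ' decodes to a newline.
```

Skipping this inverse mapping is exactly what produces the `Ġ`/`Ċ` garbage noted in the "Kept" table: the stand-in characters are emitted literally instead of being folded back into spaces and newlines.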