12 changes: 12 additions & 0 deletions CLAUDE.md
@@ -21,12 +21,22 @@ The entire 209GB model streams from SSD through a custom Metal compute pipeline.

## Hardware

Primary development machine:

- **Machine**: MacBook Pro, Apple M3 Max
- **Chip**: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
- **Memory**: 48 GB unified (~400 GB/s bandwidth)
- **SSD**: 1TB Apple Fabric, **17.5 GB/s sequential read** (measured)
- **macOS**: 26.2 (Darwin 25.2.0)

Also verified on:

- **Machine**: MacBook Pro, Apple M4 Pro
- **Chip**: 14-core CPU (10P + 4E), 20-core GPU
- **Memory**: 24 GB unified (~273 GB/s bandwidth)
- **macOS**: 15.4 (Darwin 24.4.0)
- **Result**: **3.50 tok/s** at 4-bit — only ~20% slower despite half the RAM and half the GPU cores. The architecture scales down to 24 GB unified memory without modification.

## Architecture

The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.
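As a rough sketch of the expert routing described above (assuming a plain linear router with softmax over the selected logits — the repo's actual router details aren't shown here, and `route` is an illustrative name, not a function from the codebase):

```python
import numpy as np

HIDDEN = 4096      # hidden dimension (from the architecture notes)
NUM_EXPERTS = 512  # routed experts per MoE layer
TOP_K = 4          # experts activated per token (plus one shared expert)

rng = np.random.default_rng(0)
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)).astype(np.float32)

def route(hidden_state):
    """Pick the top-K experts for one token and softmax-normalize their weights."""
    logits = hidden_state @ router_w                # (NUM_EXPERTS,)
    top = np.argpartition(logits, -TOP_K)[-TOP_K:]  # indices of the K largest logits
    w = np.exp(logits[top] - logits[top].max())     # stable softmax over K logits
    return top, w / w.sum()

token = rng.standard_normal(HIDDEN).astype(np.float32)
experts, weights = route(token)
```

Only the K=4 selected experts' weights ever need to be read from SSD for a given token, which is what makes streaming a 397B-parameter model through a small page cache viable.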
@@ -95,6 +105,7 @@ metal_infer/
main.m # MoE-only benchmark
Makefile # Build system
extract_weights.py # Creates model_weights.bin from safetensors
export_tokenizer.py # Exports tokenizer.json → tokenizer.bin + vocab.bin (with GPT-2 byte decoding)
repack_experts_2bit.py # 4-bit → 2-bit expert requantization
train_predictor.py # Expert routing prediction analysis
model_weights.bin # Non-expert weights (5.5GB, mmap'd)
@@ -112,6 +123,7 @@ results.tsv # Experiment log (58 experiments)
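The `export_tokenizer.py` step above depends on GPT-2's reversible byte↔unicode mapping: vocab entries store bytes as printable-looking characters (space becomes `Ġ`, newline becomes `Ċ`), and these must be mapped back to raw bytes before UTF-8 decoding, or they leak into the output. A minimal sketch of the decode side, following the standard GPT-2 construction; `gpt2_byte_decoder` and `decode_piece` are illustrative names, not functions from the repo:

```python
def gpt2_byte_decoder():
    """Invert GPT-2's bytes->unicode mapping: visible bytes map to themselves,
    the remaining 68 bytes map to codepoints 256, 257, ... in order."""
    visible = (list(range(ord("!"), ord("~") + 1)) +
               list(range(ord("¡"), ord("¬") + 1)) +
               list(range(ord("®"), ord("ÿ") + 1)))
    char_to_byte = {chr(b): b for b in visible}
    n = 0
    for b in range(256):
        if b not in visible:
            char_to_byte[chr(256 + n)] = b  # e.g. byte 0x20 (space) -> 'Ġ'
            n += 1
    return char_to_byte

DECODER = gpt2_byte_decoder()

def decode_piece(piece: str) -> str:
    """Turn a raw vocab entry like 'Ġhello' back into real bytes, then UTF-8."""
    return bytes(DECODER[c] for c in piece).decode("utf-8", errors="replace")

print(decode_piece("Ġhello"))  # -> " hello"
```

Baking this mapping into `vocab.bin` at export time is what eliminates the `Ġ`/`Ċ` garbage noted in the table below, with zero inference-time cost.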
### Kept
| Approach | Result | Impact |
|----------|--------|--------|
| GPT-2 byte decoding in vocab.bin | Fixes `Ġ`/`Ċ` garbage in output | **Correctness** |
| FMA dequant kernel | GPU compute -12% | **+12% tok/s** |
| Trust OS page cache | Deleted Metal LRU → +38% | **Foundational** |
| GPU combine+norm in CMD3 | Eliminates CPU round-trip | **Pipeline** |
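The FMA dequant row can be illustrated with a numpy sketch of the underlying algebra, under the assumption of affine per-group quantization (the group size of 32 is a guess): folding the zero-point into a precomputed per-group bias turns the naive two-op dequant into a single multiply-add per weight — the shape a GPU `fma()` instruction executes directly:

```python
import numpy as np

GROUP = 32  # assumed quantization group size

rng = np.random.default_rng(1)
q = rng.integers(0, 16, size=GROUP).astype(np.float32)  # 4-bit codes, one group
scale, zero = np.float32(0.02), np.float32(7.5)         # per-group quant params

# Naive affine dequant: subtract the zero-point, then scale (two ops per weight)
w_naive = scale * (q - zero)

# FMA form: fold the zero-point into a per-group bias once outside the loop,
# so each weight needs only one fused multiply-add
bias = -scale * zero
w_fma = q * scale + bias

assert np.allclose(w_naive, w_fma)
```

The transformation is exact, so it trades nothing for the ~12% GPU-compute reduction logged above.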
1 change: 1 addition & 0 deletions metal_infer/results.tsv
@@ -22,3 +22,4 @@ HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 4.04 0 5.5 discard Fused GPU final-norm +
HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 4.35 0 9.8 keep Indexed expert cache makes a larger hotset viable: 2500-entry Metal cache averaged 3.75ms/layer and 4.35 tok/s vs 4.09ms/layer and 3.98 tok/s at 1500 entries on the fixed ocean prompt. 3500+ entries regressed despite higher hit rate, so 2500 is the new sweet spot.
HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 5.74 0 5.5 keep TRUST THE OS: killed Metal LRU cache + F_NOCACHE. OS page cache alone = 5.74 tok/s (was 4.11 with 9.8GB Metal cache). 38% faster. Near-zero memory pressure.
HEAD Qwen3.5-397B-A17B-2bit 397.0 17.0 5.37 2615 5.5 keep Batch prefill: TTFT 5587→2615ms (53% faster). discard_deferred_experts() saves 57ms/intermediate token. Generation 5.37 unchanged.
3601d41 Qwen3.5-397B-A17B-4bit 397.0 17.0 3.50 4613 5.1 keep M4 Pro 24GB (half RAM, half GPU cores vs M3 Max 48GB): 3.50 tok/s steady-state, TTFT 4613ms. Trust-OS page cache ~14GB. Only 20% slower despite ~32% lower memory bandwidth (~273 vs ~400 GB/s). Confirms architecture scales down to 24GB unified memory.
1 change: 1 addition & 0 deletions results.tsv
@@ -32,3 +32,4 @@ c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 4.10 0 5.5 discard GPU LUT dequant ker
c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 3.33 0 5.5 discard GPU private buffer compression: blit shared→private before matvec. Isolated -13.5% but 4×7MB blit cost exceeds savings in pipeline.
c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 3.91 0 5.5 discard dispatch_io: -70% slower than pread (dispatch_data overhead kills it).
c222a4f Qwen3.5-397B-A17B-4bit 397.0 17.0 3.91 0 5.5 discard Expert file clustering: NVMe doesn't care about scatter distance at 7MB read granularity. 0%.
3601d41 Qwen3.5-397B-A17B-4bit 397.0 17.0 3.50 4613 5.1 keep M4 Pro 24GB (half RAM, half GPU cores vs M3 Max 48GB): 3.50 tok/s steady-state, TTFT 4613ms. Trust-OS page cache ~14GB. Only 20% slower despite ~32% lower memory bandwidth (~273 vs ~400 GB/s). Confirms architecture scales down to 24GB unified memory.