Building behavioral auditing and alignment tools for LLMs. Current focus: surgical behavioral alignment -- using category-aware contrastive losses to fix sycophancy without damaging bias fairness.
Core finding: Standard SFT inverts toxicity discrimination in Qwen 7B (+0.145 baseline to -0.003 post-SFT, 5 seeds, p<0.001). Rho-guided SFT with a contrastive confidence loss repairs this (d=10.8 on toxicity, d=13.7 on bias, p<0.0001). At 7B scale, sycophancy SFT causes concentrated collateral damage on Age (-29%), Race (-14%), and Religion (-11%) -- motivating the Rho-Surgery approach with targeted protection losses.
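The repo's actual training objective isn't reproduced here, but the idea of a contrastive confidence loss can be sketched in a few lines: given teacher-forced log-probabilities for an aligned and a misaligned completion of the same probe, penalize the model whenever the aligned completion is not preferred by at least a margin. The function name and the `margin` hyperparameter below are illustrative, not rho-eval's API.

```python
def contrastive_confidence_loss(logp_pos, logp_neg, margin=1.0):
    """Hinge-style contrastive loss on teacher-forced log-probabilities.

    Pushes the model to assign higher confidence (log-prob) to the aligned
    completion than to the misaligned one, by at least `margin`.
    Illustrative sketch only -- the actual rho-eval loss may differ.
    """
    return max(0.0, margin - (logp_pos - logp_neg))

# Aligned completion already preferred by more than the margin: zero loss
print(contrastive_confidence_loss(-1.0, -3.0))  # 0.0
# Discrimination inverted (misaligned completion preferred): positive loss
print(contrastive_confidence_loss(-3.0, -1.0))  # 3.0
```

A loss of this shape is zero exactly when discrimination is healthy, which is why it can repair the post-SFT inversion without touching probes the model already handles correctly.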
rho-eval (v2.2.3) -- Drop-in behavioral audit for any LLM. 1,826 probes across 8 dimensions (factual, toxicity, sycophancy, bias, reasoning, refusal, deception, over-refusal), no internet required. Plugin architecture for custom behaviors. Apple Silicon MLX acceleration. Rho-guided SFT alignment. Rho-Surgery with gamma protection loss.
```
pip install rho-eval
```

CLI Tools -- works on any platform (Apple Silicon MLX, CUDA, CPU):

```bash
# Audit any model across 8 behavioral dimensions
rho-eval Qwen/Qwen2.5-7B-Instruct --behaviors all

# One-command behavioral repair: diagnose → compress → LoRA SFT → save
rho-surgery Qwen/Qwen2.5-7B-Instruct -o ./repaired-7b/

# Comprehensive benchmarking with before/after comparison
rho-benchmark ./repaired-7b/model/ --baseline Qwen/Qwen2.5-7B-Instruct

# Adversarial pressure test (Truth-Gap)
rho-bench Qwen/Qwen2.5-7B-Instruct
```

```python
# Python API -- auto-detects MLX or PyTorch
from rho_eval import audit, load_model

model, tokenizer, backend = load_model("Qwen/Qwen2.5-7B-Instruct")
report = audit(model=model, tokenizer=tokenizer, behaviors="all")
```

v2.2.3 -- Rho-Surgery CLI + Platform-Aware Pipeline:
- `rho-surgery` -- End-to-end behavioral repair: diagnose category-level risk, SVD compress, LoRA SFT with gamma protection, save repaired model. Works on any platform via `--device auto|mlx|cuda|cpu`
- `rho-benchmark` -- Comprehensive benchmarking (8-dim audit + TruthfulQA MC2) with optional baseline comparison
- Rho-Surgery -- Gamma protection loss prevents collateral bias damage during sycophancy SFT. CatSAE broke Religion's structural ceiling (+22.2% vs +11.1% uniform)
- Hybrid weight + activation control -- SVD compression + SAE steering + rho-guided SFT. SAE at layer 17 cuts bias collateral damage 39%
- Biology-grounded bias probes -- 37 new probes based on peer-reviewed science
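The gamma protection idea above can be sketched as a composite objective: the sycophancy-repair loss plus a gamma-weighted penalty on drift in protected bias categories (Age, Race, Religion are the categories the 7B sweep flagged). The function name, the drift representation, and the gamma value are all illustrative assumptions, not the repo's implementation.

```python
def surgery_loss(task_loss, protected_drift, gamma=0.5):
    """Composite objective for targeted behavioral surgery (sketch).

    task_loss: scalar loss driving the target repair (e.g. sycophancy SFT).
    protected_drift: per-category squared change in discrimination score
        relative to the pre-SFT baseline -- the quantity we want near zero.
    gamma: protection weight; higher values trade repair speed for safety.
    All names and values here are illustrative, not rho-eval's API.
    """
    protection = sum(protected_drift.values()) / len(protected_drift)
    return task_loss + gamma * protection

drift = {"Age": 0.04, "Race": 0.01, "Religion": 0.01}
print(surgery_loss(2.0, drift, gamma=0.5))  # 2.01
```

The point of the per-category average is that the penalty concentrates gradient pressure wherever collateral damage actually appears, rather than regularizing all weights uniformly.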
Earlier findings (v2.1.1 -- paper):
- SFT toxicity inversion -- Standard fine-tuning inverts toxicity discrimination (5 seeds, p<0.001); contrastive confidence loss repairs it (d=10.8 toxicity, d=13.7 bias, p<0.0001)
- Variance collapse -- Factual sigma drops 63% from SFT-only to rho-guided, making training more reliable across seeds
- Refusal buffer -- Contrastive-only training erodes refusal (d=-8.4, p=0.0005); the SFT component preserves it
- MLX-native training -- Full alignment pipeline runs on Apple Silicon. 7B model trains in ~90 min on M3 Ultra
| Repo | Install | What it does |
|---|---|---|
| knowledge-fidelity | `pip install rho-eval` | Behavioral auditing + alignment toolkit for LLMs. PyPI. DOI: 10.5281/zenodo.18743959. 1,826 probes, 8 dimensions, Rho-Surgery, MLX training. CLIs: rho-eval, rho-surgery, rho-benchmark, rho-bench, rho-steer, rho-align, rho-hybrid. Featured in Awesome-LLM-Compression. |
| confidence-cartography-toolkit | `pip install git+https://github.com/SolomonB14D3/confidence-cartography-toolkit.git` | Teacher-forced confidence as a false-belief sensor. CLI: cartography. Human false-belief correlation rho=0.652 across Pythia 160M-12B. |
| intelligent-svd | `pip install git+https://github.com/SolomonB14D3/intelligent-svd.git` | Knowledge-preserving SVD compression. CF90 method: TruthfulQA +5%, 75% fact retention. `pip install intelligent-svd[audit]` for rho-eval integration. |
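The SVD compression used by intelligent-svd and the rho-surgery pipeline rests on truncating a weight matrix to its top singular directions. The sketch below shows that core operation with NumPy; note that CF90's actual rank-selection criterion (and any knowledge-preservation weighting) is not reproduced here, so the fixed `rank` argument is an assumption for illustration.

```python
import numpy as np

def svd_compress(W, rank):
    """Truncated-SVD low-rank approximation of a weight matrix.

    Keeps the top-`rank` singular directions. Illustrative sketch:
    intelligent-svd's CF90 method chooses which directions to keep
    using its own knowledge-retention criterion, not a fixed rank.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Scale the kept left singular vectors by their singular values,
    # then project back through the kept right singular vectors.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
assert np.allclose(W, svd_compress(W, rank=8))        # full rank: exact
assert np.linalg.matrix_rank(svd_compress(W, rank=2)) == 2
```

Dropping small singular values removes low-energy directions of the weight matrix; the behavioral-subspace work above is essentially about choosing which of those directions carry knowledge versus unwanted behavior.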
- General-purpose behavioral diagnostic toolkit -- Done (rho-eval v2.0.0)
- Mechanistic interpretability of behavioral subspaces -- Done (SVD subspace analysis, Grassmann angles, Universal Kill Zone discovery)
- Rho-guided alignment / fine-tuning -- Done (v2.1.1 -- paper, SFT inversion discovery, contrastive repair, dose-response)
- Hybrid weight + activation control framework -- Done (v2.2.0 -- SVD + SAE + rho SFT unified pipeline, 7B sweep, SAE mitigates 39% bias collateral)
- Rho-Surgery: targeted category-aware alignment -- Done (v2.2.2 -- `rho-surgery` CLI, gamma protection loss, CatSAE, platform-aware pipeline)
- Open behavioral benchmark suite (Fidelity-Bench 2.0)