Building behavioral auditing and alignment tools for LLMs. Current focus: surgical behavioral alignment -- using category-aware contrastive losses to fix sycophancy without damaging bias fairness.
Core finding: Standard SFT inverts toxicity discrimination in Qwen 7B (+0.145 baseline to -0.003 post-SFT, 5 seeds, p<0.001). Rho-guided SFT with a contrastive confidence loss repairs this (d=10.8 on toxicity, d=13.7 on bias, p<0.0001). At 7B scale, sycophancy SFT causes concentrated collateral damage on Age (-29%), Race (-14%), and Religion (-11%) -- motivating the Rho-Surgery approach with targeted protection losses.
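The repo's actual training objective isn't reproduced here, but the idea of a contrastive confidence loss can be sketched in a few lines: given teacher-forced log-probabilities for an aligned and a misaligned completion of the same probe, penalize the model whenever the aligned completion is not preferred by at least a margin. The function name and the `margin` hyperparameter below are illustrative, not rho-eval's API.

```python
def contrastive_confidence_loss(logp_pos, logp_neg, margin=1.0):
    """Hinge-style contrastive loss on teacher-forced log-probabilities.

    Pushes the model to assign higher confidence (log-prob) to the aligned
    completion than to the misaligned one, by at least `margin`.
    Illustrative sketch only -- the actual rho-eval loss may differ.
    """
    return max(0.0, margin - (logp_pos - logp_neg))

# Aligned completion already preferred by more than the margin: zero loss
print(contrastive_confidence_loss(-1.0, -3.0))  # 0.0
# Discrimination inverted (misaligned completion preferred): positive loss
print(contrastive_confidence_loss(-3.0, -1.0))  # 3.0
```

A loss of this shape is zero exactly when discrimination is healthy, which is why it can repair the post-SFT inversion without touching probes the model already handles correctly.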
rho-eval (v2.2.3) -- Drop-in behavioral audit for any LLM. 1,826 probes across 8 dimensions (factual, toxicity, sycophancy, bias, reasoning, refusal, deception, over-refusal), no internet required. Plugin architecture for custom behaviors. Apple Silicon MLX acceleration. Rho-guided SFT alignment. Rho-Surgery with gamma protection loss.
```
pip install rho-eval
```

CLI Tools -- works on any platform (Apple Silicon MLX, CUDA, CPU):

```bash
# Audit any model across 8 behavioral dimensions
rho-eval Qwen/Qwen2.5-7B-Instruct --behaviors all

# One-command behavioral repair: diagnose → compress → LoRA SFT → save
rho-surgery Qwen/Qwen2.5-7B-Instruct -o ./repaired-7b/

# Comprehensive benchmarking with before/after comparison
rho-benchmark ./repaired-7b/model/ --baseline Qwen/Qwen2.5-7B-Instruct

# Adversarial pressure test (Truth-Gap)
rho-bench Qwen/Qwen2.5-7B-Instruct
```

```python
# Python API -- auto-detects MLX or PyTorch
from rho_eval import audit, load_model

model, tokenizer, backend = load_model("Qwen/Qwen2.5-7B-Instruct")
report = audit(model=model, tokenizer=tokenizer, behaviors="all")
```

v2.2.3 -- Rho-Surgery CLI + Platform-Aware Pipeline:
- `rho-surgery` -- End-to-end behavioral repair: diagnose category-level risk, SVD compress, LoRA SFT with gamma protection, save repaired model. Works on any platform via `--device auto|mlx|cuda|cpu`
- `rho-benchmark` -- Comprehensive benchmarking (8-dim audit + TruthfulQA MC2) with optional baseline comparison
- Rho-Surgery -- Gamma protection loss prevents collateral bias damage during sycophancy SFT. CatSAE broke Religion's structural ceiling (+22.2% vs +11.1% uniform)
- Hybrid weight + activation control -- SVD compression + SAE steering + rho-guided SFT. SAE at layer 17 cuts bias collateral damage 39%
- Biology-grounded bias probes -- 37 new probes based on peer-reviewed science
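The gamma protection idea above can be sketched as a composite objective: the sycophancy-repair loss plus a gamma-weighted penalty on drift in protected bias categories (Age, Race, Religion are the categories the 7B sweep flagged). The function name, the drift representation, and the gamma value are all illustrative assumptions, not the repo's implementation.

```python
def surgery_loss(task_loss, protected_drift, gamma=0.5):
    """Composite objective for targeted behavioral surgery (sketch).

    task_loss: scalar loss driving the target repair (e.g. sycophancy SFT).
    protected_drift: per-category squared change in discrimination score
        relative to the pre-SFT baseline -- the quantity we want near zero.
    gamma: protection weight; higher values trade repair speed for safety.
    All names and values here are illustrative, not rho-eval's API.
    """
    protection = sum(protected_drift.values()) / len(protected_drift)
    return task_loss + gamma * protection

drift = {"Age": 0.04, "Race": 0.01, "Religion": 0.01}
print(surgery_loss(2.0, drift, gamma=0.5))  # 2.01
```

The point of the per-category average is that the penalty concentrates gradient pressure wherever collateral damage actually appears, rather than regularizing all weights uniformly.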
Earlier findings (v2.1.1 -- paper):
- SFT toxicity inversion -- Standard fine-tuning inverts toxicity discrimination (5 seeds, p<0.001); contrastive confidence loss repairs it (d=10.8 toxicity, d=13.7 bias, p<0.0001)
- Variance collapse -- Factual sigma drops 63% from SFT-only to rho-guided, making training more reliable across seeds
- Refusal buffer -- Contrastive-only training erodes refusal (d=-8.4, p=0.0005); the SFT component preserves it
- MLX-native training -- Full alignment pipeline runs on Apple Silicon. 7B model trains in ~90 min on M3 Ultra
| Repo | Install | What it does |
|---|---|---|
| knowledge-fidelity | `pip install rho-eval` | Behavioral auditing + alignment toolkit for LLMs. PyPI. DOI: 10.5281/zenodo.18743959. 1,826 probes, 8 dimensions, Rho-Surgery, MLX training. CLIs: rho-eval, rho-surgery, rho-benchmark, rho-bench, rho-steer, rho-align, rho-hybrid. Featured in Awesome-LLM-Compression. |
| confidence-cartography-toolkit | `pip install git+https://github.com/SolomonB14D3/confidence-cartography-toolkit.git` | Teacher-forced confidence as a false-belief sensor. CLI: cartography. Human false-belief correlation rho=0.652 across Pythia 160M-12B. |
| intelligent-svd | `pip install git+https://github.com/SolomonB14D3/intelligent-svd.git` | Knowledge-preserving SVD compression. CF90 method: TruthfulQA +5%, 75% fact retention. `pip install intelligent-svd[audit]` for rho-eval integration. |
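The SVD compression used by intelligent-svd and the rho-surgery pipeline rests on truncating a weight matrix to its top singular directions. The sketch below shows that core operation with NumPy; note that CF90's actual rank-selection criterion (and any knowledge-preservation weighting) is not reproduced here, so the fixed `rank` argument is an assumption for illustration.

```python
import numpy as np

def svd_compress(W, rank):
    """Truncated-SVD low-rank approximation of a weight matrix.

    Keeps the top-`rank` singular directions. Illustrative sketch:
    intelligent-svd's CF90 method chooses which directions to keep
    using its own knowledge-retention criterion, not a fixed rank.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Scale the kept left singular vectors by their singular values,
    # then project back through the kept right singular vectors.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
assert np.allclose(W, svd_compress(W, rank=8))        # full rank: exact
assert np.linalg.matrix_rank(svd_compress(W, rank=2)) == 2
```

Dropping small singular values removes low-energy directions of the weight matrix; the behavioral-subspace work above is essentially about choosing which of those directions carry knowledge versus unwanted behavior.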
- General-purpose behavioral diagnostic toolkit -- Done (rho-eval v2.0.0)
- Mechanistic interpretability of behavioral subspaces -- Done (SVD subspace analysis, Grassmann angles, Universal Kill Zone discovery)
- Rho-guided alignment / fine-tuning -- Done (v2.1.1 -- paper, SFT inversion discovery, contrastive repair, dose-response)
- Hybrid weight + activation control framework -- Done (v2.2.0 -- SVD + SAE + rho SFT unified pipeline, 7B sweep, SAE mitigates 39% bias collateral)
- Rho-Surgery: targeted category-aware alignment -- Done (v2.2.2 -- `rho-surgery` CLI, gamma protection loss, CatSAE, platform-aware pipeline)
- Open behavioral benchmark suite (Fidelity-Bench 2.0)