
Bryan -- Independent ML Researcher


Building behavioral auditing and alignment tools for LLMs. Current focus: surgical behavioral alignment -- using category-aware contrastive losses to fix sycophancy without damaging bias fairness.

Core finding: Standard SFT inverts toxicity discrimination in Qwen 7B (+0.145 baseline to -0.003 post-SFT, 5 seeds, p<0.001). Rho-guided SFT with a contrastive confidence loss repairs this (d=10.8 on toxicity, d=13.7 on bias, p<0.0001). At 7B scale, sycophancy SFT causes concentrated collateral damage on Age (-29%), Race (-14%), and Religion (-11%) -- motivating the Rho-Surgery approach with targeted protection losses.
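The effect sizes quoted above are Cohen's d between seed-level score samples. As a refresher on what d=10.8 means, here is a minimal sketch of the computation with illustrative numbers (not the paper's actual data):

```python
import math

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Illustrative toxicity-discrimination scores across 5 seeds
# (made-up numbers, not the reported results)
repaired = [0.14, 0.15, 0.13, 0.16, 0.14]
sft_only = [-0.01, 0.00, -0.02, 0.01, -0.01]
print(round(cohens_d(repaired, sft_only), 2))
```

Effect sizes this large arise when the between-group gap dwarfs the seed-to-seed spread, which is why they can coexist with a small sample of 5 seeds.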


Current Work

rho-eval (v2.2.3) -- Drop-in behavioral audit for any LLM. 1,826 probes across 8 dimensions (factual, toxicity, sycophancy, bias, reasoning, refusal, deception, over-refusal), no internet required. Plugin architecture for custom behaviors. Apple Silicon MLX acceleration. Rho-guided SFT alignment. Rho-Surgery with gamma protection loss.

```shell
pip install rho-eval
```

CLI tools -- work on any platform (Apple Silicon MLX, CUDA, CPU):

```shell
# Audit any model across 8 behavioral dimensions
rho-eval Qwen/Qwen2.5-7B-Instruct --behaviors all

# One-command behavioral repair: diagnose → compress → LoRA SFT → save
rho-surgery Qwen/Qwen2.5-7B-Instruct -o ./repaired-7b/

# Comprehensive benchmarking with before/after comparison
rho-benchmark ./repaired-7b/model/ --baseline Qwen/Qwen2.5-7B-Instruct

# Adversarial pressure test (Truth-Gap)
rho-bench Qwen/Qwen2.5-7B-Instruct
```
```python
# Python API -- auto-detects MLX or PyTorch
from rho_eval import audit, load_model

model, tokenizer, backend = load_model("Qwen/Qwen2.5-7B-Instruct")
report = audit(model=model, tokenizer=tokenizer, behaviors="all")
```
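To make the audit idea concrete without depending on the package, here is a hypothetical, dependency-free illustration (not rho-eval's actual internals): each probe yields a signed confidence delta between a correct and an incorrect continuation, and a dimension's score is the mean delta, so positive means the model discriminates correctly.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical probe results: (dimension, confidence delta).
# Positive delta = model is more confident in the correct continuation.
probe_results = [
    ("toxicity", 0.21), ("toxicity", 0.09), ("toxicity", -0.03),
    ("factual", 0.35), ("factual", 0.28),
]

def dimension_scores(results):
    """Aggregate per-probe deltas into a per-dimension mean score."""
    by_dim = defaultdict(list)
    for dim, delta in results:
        by_dim[dim].append(delta)
    return {dim: mean(vals) for dim, vals in by_dim.items()}

print(dimension_scores(probe_results))
```

A score near zero (or negative, as in the post-SFT toxicity result above) indicates the model no longer distinguishes good from bad continuations on that dimension.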

v2.2.3 -- Rho-Surgery CLI + Platform-Aware Pipeline:

  • rho-surgery -- End-to-end behavioral repair: diagnose category-level risk, SVD compress, LoRA SFT with gamma protection, save repaired model. Works on any platform via --device auto|mlx|cuda|cpu
  • rho-benchmark -- Comprehensive benchmarking (8-dim audit + TruthfulQA MC2) with optional baseline comparison
  • Rho-Surgery -- gamma protection loss prevents collateral bias damage during sycophancy SFT; category-aware SAE (CatSAE) breaks the Religion category's structural ceiling (+22.2% vs +11.1% with uniform protection)
  • Hybrid weight + activation control -- SVD compression + SAE steering + rho-guided SFT. SAE at layer 17 cuts bias collateral damage 39%
  • Biology-grounded bias probes -- 37 new probes based on peer-reviewed science
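The SVD compression step in the pipeline above is, at its core, a rank-k truncated SVD of weight matrices. A minimal NumPy sketch of the basic operation (illustrative only, not the pipeline's knowledge-preserving variant):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # stand-in for a weight matrix

def svd_compress(W, k):
    """Rank-k truncated SVD approximation of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

W8 = svd_compress(W, 8)
W32 = svd_compress(W, 32)
# Keeping more singular values lowers reconstruction error
print(np.linalg.norm(W - W32) < np.linalg.norm(W - W8))
```

The alignment-relevant question is which singular directions carry behavioral knowledge, which is what the subspace analysis (Grassmann angles, category-level risk) is for.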

Earlier findings (v2.1.1 -- paper):

  • SFT toxicity inversion -- Standard fine-tuning inverts toxicity discrimination (5 seeds, p<0.001); contrastive confidence loss repairs it (d=10.8 toxicity, d=13.7 bias, p<0.0001)
  • Variance collapse -- Factual sigma drops 63% from SFT-only to rho-guided, making training more reliable across seeds
  • Refusal buffer -- Contrastive-only training erodes refusal (d=-8.4, p=0.0005); the SFT component preserves it
  • MLX-native training -- Full alignment pipeline runs on Apple Silicon. 7B model trains in ~90 min on M3 Ultra

Published Repos

| Repo | Install | What it does |
| --- | --- | --- |
| knowledge-fidelity | `pip install rho-eval` | Behavioral auditing + alignment toolkit for LLMs. PyPI. DOI: 10.5281/zenodo.18743959. 1,826 probes, 8 dimensions, Rho-Surgery, MLX training. CLIs: `rho-eval`, `rho-surgery`, `rho-benchmark`, `rho-bench`, `rho-steer`, `rho-align`, `rho-hybrid`. Featured in Awesome-LLM-Compression. |
| confidence-cartography-toolkit | `pip install git+https://github.com/SolomonB14D3/confidence-cartography-toolkit.git` | Teacher-forced confidence as a false-belief sensor. CLI: `cartography`. Human false-belief correlation rho=0.652 across Pythia 160M-12B. |
| intelligent-svd | `pip install git+https://github.com/SolomonB14D3/intelligent-svd.git` | Knowledge-preserving SVD compression. CF90 method: TruthfulQA +5%, 75% fact retention. `pip install intelligent-svd[audit]` for rho-eval integration. |
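The teacher-forced confidence signal behind confidence-cartography-toolkit is the model's per-token log-probability of a target sequence when the ground-truth prefix is fed at every step. A self-contained sketch with made-up logits (illustrative of the concept, not the toolkit's code):

```python
import math

def teacher_forced_logprob(logits_per_step, target_ids):
    """Mean log-probability of the target tokens under teacher forcing:
    the ground-truth prefix is fed at every decoding step."""
    total = 0.0
    for logits, tid in zip(logits_per_step, target_ids):
        z = max(logits)  # subtract max for numerical stability
        log_norm = z + math.log(sum(math.exp(l - z) for l in logits))
        total += logits[tid] - log_norm
    return total / len(target_ids)

# Toy 4-token vocabulary, 2 decoding steps (made-up logits)
logits = [[2.0, 0.5, -1.0, 0.0], [0.1, 3.0, 0.2, -0.5]]
print(teacher_forced_logprob(logits, [0, 1]))
```

A sharp drop in this quantity on statements the model "should" know is what makes it usable as a false-belief sensor.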

Research Roadmap

  1. General-purpose behavioral diagnostic toolkit -- Done (rho-eval v2.0.0)
  2. Mechanistic interpretability of behavioral subspaces -- Done (SVD subspace analysis, Grassmann angles, Universal Kill Zone discovery)
  3. Rho-guided alignment / fine-tuning -- Done (v2.1.1 -- paper, SFT inversion discovery, contrastive repair, dose-response)
  4. Hybrid weight + activation control framework -- Done (v2.2.0 -- SVD + SAE + rho SFT unified pipeline, 7B sweep, SAE mitigates 39% bias collateral)
  5. Rho-Surgery: targeted category-aware alignment -- Done (v2.2.2 -- rho-surgery CLI, gamma protection loss, CatSAE, platform-aware pipeline)
  6. Open behavioral benchmark suite (Fidelity-Bench 2.0)
