Autonomous overnight LLM evaluation pipeline for local GGUF models. Runs multi-turn agentic tasks with the Hermes tool-calling format, scores them with dimension-routed judge models, and produces comparison reports.
Built for dual-GPU rigs running llama.cpp + llama-swap, but the orchestrator is just an OpenAI-compatible HTTP client — any backend that speaks /v1/chat/completions will work.
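Since any `/v1/chat/completions` backend works, talking to a target model needs nothing beyond the standard library. A minimal sketch (the base URL and model name below are just the defaults from the tables in this README; the orchestrator's own client lives in `src/nite_eval/`):

```python
import json
import urllib.request


def build_request(base_url: str, model: str, messages: list[dict]) -> tuple[str, bytes]:
    """Assemble the URL and JSON body for an OpenAI-compatible chat call."""
    url = f"{base_url}/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, body


def chat_completion(base_url: str, model: str, messages: list[dict],
                    timeout: float = 120.0) -> dict:
    """POST to any backend that speaks /v1/chat/completions and parse the reply."""
    url, body = build_request(base_url, model, messages)
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


# Example (assumes llama-swap is serving targets on :9070):
# reply = chat_completion("http://localhost:9070", "qwen3.5-9b",
#                         [{"role": "user", "content": "hello"}])
# print(reply["choices"][0]["message"]["content"])
```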
Evaluates local models across 4 dimensions (15 tasks total):
- Research (3 tasks) — multi-step information gathering and synthesis
- Planning (3 tasks) — task decomposition, dependency ordering, risk assessment
- Coding (4 tasks) — code generation with iterative tool use
- Agentic (5 tasks) — multi-turn tool calling, error recovery, context maintenance
Each task runs as a multi-turn conversation with mock tools (deterministic responses defined in YAML). Scoring is a mix of deterministic methods (checklist, sequence match, subset match) and LLM judges (rubric-based ternary scoring). Results persist to SQLite with checkpoint/resume.
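The deterministic scoring methods are simple enough to sketch. These are illustrative stand-ins (the real implementations live in `scoring.py` and may differ in signatures and edge-case handling):

```python
def checklist_score(response: str, required: list[str]) -> float:
    """Checklist: fraction of required items mentioned, case-insensitive."""
    text = response.lower()
    return sum(item.lower() in text for item in required) / len(required)


def sequence_match(tool_calls: list[str], expected: list[str]) -> bool:
    """Sequence match: expected calls appear in order; extra calls are allowed."""
    remaining = iter(tool_calls)
    return all(step in remaining for step in expected)


def subset_match(tool_calls: list[str], expected: set[str]) -> bool:
    """Subset match: every expected call was made, order-insensitive."""
    return expected <= set(tool_calls)
```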
| GPU | Role | Port |
|---|---|---|
| RTX 3090 (24GB) | Target models via llama-swap | :9070 |
| Tesla P40 (24GB) | Judge models (both fit simultaneously) | :9091, :9092 |
You can use any GPU layout — just adjust ports and GPU indices in .env. CPU-only also works if you have patience.
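For example, a single-GPU layout might look like this (variable names come from the setup steps below and the environment-variable table; every path and index here is a placeholder, not a default):

```shell
# .env — illustrative values only; adjust paths and GPU indices to your machine
LLAMA_SERVER_BIN=/opt/llama.cpp/build/bin/llama-server
LLAMA_SWAP_BIN=/usr/local/bin/llama-swap
GGUF_DIR=/models/gguf
JUDGE_MODEL_DIR=/models/judges
NITE_TARGET_GPU=0   # single GPU: targets and judges share index 0
NITE_JUDGE_GPU=0
```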
# 1. Clone and install
git clone https://github.com/w00jay/nite-eval.git
cd nite-eval
uv sync
# 2. Configure paths
cp .env.example .env
$EDITOR .env # set LLAMA_SERVER_BIN, LLAMA_SWAP_BIN, GGUF_DIR, JUDGE_MODEL_DIR, GPUs
# 3. Configure target models for llama-swap
cp config/llama_swap_config.example.yaml config/llama_swap_config.yaml
$EDITOR config/llama_swap_config.yaml # absolute paths to your GGUFs

You'll need:
- llama.cpp built with CUDA (`llama-server` binary)
- llama-swap binary
- Target GGUFs of your choice (defaults reference Qwen 3.5 and Gemma 4)
- Judge GGUFs:
- RewardAnything-8B-v1 (Q6_K)
- Flow-Judge-v0.1 (Q6_K)
# All models — starts servers, evaluates, generates report, cleans up
./scripts/run_nightly.sh
# Single model
NITE_MODELS="qwen3.5-9b" ./scripts/run_nightly.sh
# Filter to one dimension
NITE_DIMENSION="agentic" ./scripts/run_nightly.sh
# Combine
NITE_MODELS="qwen3.5-27b qwen3.5-9b" NITE_DIMENSION="agentic" ./scripts/run_nightly.sh

The launcher starts target llama-swap and both judge servers, runs the orchestrator, and tears everything down on exit. Already-running servers are reused. Ctrl-C saves progress (resumable).
nohup ./scripts/run_nightly.sh > results/nightly.log 2>&1 &

uv run python -m nite_eval.orchestrator --resume run-20260405-232559

| Variable | Default | Description |
|---|---|---|
| `NITE_MODELS` | all from config | Space-separated model list |
| `NITE_DIMENSION` | all | Filter to one dimension |
| `NITE_CONFIG` | `config/eval_config.yaml` | Config path |
| `NITE_TARGET_GPU` | from `.env` or 1 | GPU index for target llama-swap |
| `NITE_JUDGE_GPU` | from `.env` or 2 | GPU index for judge servers |
Path/binary configuration lives in .env (see .env.example).
| Name | Model | Quant |
|---|---|---|
| `qwen3.5-27b` | Qwen 3.5 27B | Q4_K_M |
| `qwen3.5-9b` | Qwen 3.5 9B | Q4_K_M |
| `gemma4-26b-a4b` | Gemma 4 26B-A4B | Q4_K_M |
Add or replace models by editing config/llama_swap_config.yaml and the models: block in config/eval_config.yaml.
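As an illustration, registering a hypothetical extra target could look like the fragment below. The model name, file path, and flags are invented for the example, and the exact schema of both files is defined by llama-swap and `config/eval_config.yaml` respectively, so treat this as a sketch rather than a copy-paste recipe:

```yaml
# config/llama_swap_config.yaml — llama-swap entry (illustrative)
models:
  "mymodel-12b":
    cmd: /opt/llama.cpp/build/bin/llama-server --port ${PORT} -m /models/gguf/mymodel-12b-Q4_K_M.gguf -ngl 99
```

A matching entry under the `models:` block in `config/eval_config.yaml` (same name, `mymodel-12b`) makes the orchestrator pick it up.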
Two judges with complementary biases, routed by scoring dimension:
| Judge | Params | Dimensions | Bias |
|---|---|---|---|
| Flow-Judge (:9092) | 3.8B | reasoning_quality, practical_output | 5-bias (recognizes excellence) |
| RewardAnything (:9091) | 8B | everything else | 3-bias (conservative, accurate on average responses) |
Calibration: neither judge alone reaches Cohen's kappa > 0.6, but dimension routing exploits their complementary error profiles.
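The routing itself is a small lookup. A hypothetical sketch (the real logic lives in `judge.py`'s `RoutedJudgeClient`; ports and dimension names follow the table above, and `tool_use` in the test is just an example of an "everything else" dimension):

```python
# Judge endpoints from the table above.
FLOW_JUDGE_URL = "http://localhost:9092"       # 5-biased: recognizes excellence
REWARD_ANYTHING_URL = "http://localhost:9091"  # 3-biased: conservative

# Dimensions where Flow-Judge's error profile wins.
FLOW_DIMENSIONS = {"reasoning_quality", "practical_output"}


def judge_url_for(dimension: str) -> str:
    """Route a scoring dimension to the judge with the better error profile."""
    if dimension in FLOW_DIMENSIONS:
        return FLOW_JUDGE_URL
    return REWARD_ANYTHING_URL
```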
Results live in results/runs/:
- `eval_results.db` — SQLite with all results, scores, tool calls
- `run-YYYYMMDD-HHMMSS.md` — Markdown comparison report
uv run ruff check --fix . && uv run ruff format . # lint
uv run pyright # type check
uv run python -m pytest -v # tests

config/
eval_config.yaml # models, judge URLs, scoring weights
llama_swap_config.example.yaml # template — copy to *.yaml and edit
judge_swap_config.example.yaml # template (legacy; direct llama-server is the default)
tasks/
research/ planning/ coding/ agentic/
src/nite_eval/
orchestrator.py # main pipeline
conversation_runner.py # multi-turn agent loop with Hermes tool execution
judge.py # JudgeClient + RoutedJudgeClient
scoring.py # deterministic + judge-based scoring
task_loader.py results_db.py report.py
hermes_parser.py mock_tools.py rubrics.py
model_manager.py ast_comparator.py
scripts/
run_nightly.sh # unattended runner
smoke_test.py # quick pipeline check
validate_judge_pipeline.py # judge sanity check with synthetic responses
run_calibration.py # judge calibration against human scores