Autonomous overnight LLM evaluation pipeline for local GGUF models. Runs multi-turn agentic tasks with the Hermes tool-calling format, scores them with dimension-routed judge models, and produces comparison reports.
Built for dual-GPU rigs running llama.cpp + llama-swap, but the orchestrator is just an OpenAI-compatible HTTP client — any backend that speaks /v1/chat/completions will work.
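Since any `/v1/chat/completions` backend works, talking to a target model needs nothing beyond the standard library. A minimal sketch (the base URL and model name below are just the defaults from the tables in this README; the orchestrator's own client lives in `src/nite_eval/`):

```python
import json
import urllib.request


def build_request(base_url: str, model: str, messages: list[dict]) -> tuple[str, bytes]:
    """Assemble the URL and JSON body for an OpenAI-compatible chat call."""
    url = f"{base_url}/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, body


def chat_completion(base_url: str, model: str, messages: list[dict],
                    timeout: float = 120.0) -> dict:
    """POST to any backend that speaks /v1/chat/completions and parse the reply."""
    url, body = build_request(base_url, model, messages)
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


# Example (assumes llama-swap is serving targets on :9070):
# reply = chat_completion("http://localhost:9070", "qwen3.5-9b",
#                         [{"role": "user", "content": "hello"}])
# print(reply["choices"][0]["message"]["content"])
```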
Evaluates local models across 4 dimensions (15 tasks total):
- Research (3 tasks) — multi-step information gathering and synthesis
- Planning (3 tasks) — task decomposition, dependency ordering, risk assessment
- Coding (4 tasks) — code generation with iterative tool use
- Agentic (5 tasks) — multi-turn tool calling, error recovery, context maintenance
Each task runs as a multi-turn conversation with mock tools (deterministic responses defined in YAML). Scoring is a mix of deterministic methods (checklist, sequence match, subset match) and LLM judges (rubric-based ternary scoring). Results persist to SQLite with checkpoint/resume.
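The deterministic scoring methods are simple enough to sketch. These are illustrative stand-ins (the real implementations live in `scoring.py` and may differ in signatures and edge-case handling):

```python
def checklist_score(response: str, required: list[str]) -> float:
    """Checklist: fraction of required items mentioned, case-insensitive."""
    text = response.lower()
    return sum(item.lower() in text for item in required) / len(required)


def sequence_match(tool_calls: list[str], expected: list[str]) -> bool:
    """Sequence match: expected calls appear in order; extra calls are allowed."""
    remaining = iter(tool_calls)
    return all(step in remaining for step in expected)


def subset_match(tool_calls: list[str], expected: set[str]) -> bool:
    """Subset match: every expected call was made, order-insensitive."""
    return expected <= set(tool_calls)
```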
| GPU | Role | Port |
|---|---|---|
| RTX 3090 (24GB) | Target models via llama-swap | :9070 |
| Tesla P40 (24GB) | Judge models (both fit simultaneously) | :9091, :9092 |
You can use any GPU layout — just adjust ports and GPU indices in .env. CPU-only also works if you have patience.
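For example, a single-GPU layout might look like this (variable names come from the setup steps below and the environment-variable table; every path and index here is a placeholder, not a default):

```shell
# .env — illustrative values only; adjust paths and GPU indices to your machine
LLAMA_SERVER_BIN=/opt/llama.cpp/build/bin/llama-server
LLAMA_SWAP_BIN=/usr/local/bin/llama-swap
GGUF_DIR=/models/gguf
JUDGE_MODEL_DIR=/models/judges
NITE_TARGET_GPU=0   # single GPU: targets and judges share index 0
NITE_JUDGE_GPU=0
```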
# 1. Clone and install
git clone https://github.com/w00jay/nite-eval.git
cd nite-eval
uv sync
# 2. Configure paths
cp .env.example .env
$EDITOR .env # set LLAMA_SERVER_BIN, LLAMA_SWAP_BIN, GGUF_DIR, JUDGE_MODEL_DIR, GPUs
# 3. Configure target models for llama-swap
cp config/llama_swap_config.example.yaml config/llama_swap_config.yaml
$EDITOR config/llama_swap_config.yaml # absolute paths to your GGUFs

You'll need:
- llama.cpp built with CUDA (`llama-server` binary)
- llama-swap binary
- Target GGUFs of your choice (defaults reference Qwen 3.5 and Gemma 4)
- Judge GGUFs:
- RewardAnything-8B-v1 (Q6_K)
- Flow-Judge-v0.1 (Q6_K)
# All models — starts servers, evaluates, generates report, cleans up
./scripts/run_nightly.sh
# Single model
NITE_MODELS="qwen3.5-9b" ./scripts/run_nightly.sh
# Filter to one dimension
NITE_DIMENSION="agentic" ./scripts/run_nightly.sh
# Combine
NITE_MODELS="qwen3.5-27b qwen3.5-9b" NITE_DIMENSION="agentic" ./scripts/run_nightly.sh

The launcher starts target llama-swap and both judge servers, runs the orchestrator, and tears everything down on exit. Already-running servers are reused. Ctrl-C saves progress (resumable).
nohup ./scripts/run_nightly.sh > results/nightly.log 2>&1 &

uv run python -m nite_eval.orchestrator --resume run-20260405-232559

| Variable | Default | Description |
|---|---|---|
| `NITE_MODELS` | all from config | Space-separated model list |
| `NITE_DIMENSION` | all | Filter to one dimension |
| `NITE_CONFIG` | `config/eval_config.yaml` | Config path |
| `NITE_TARGET_GPU` | from `.env` or 1 | GPU index for target llama-swap |
| `NITE_JUDGE_GPU` | from `.env` or 2 | GPU index for judge servers |
Path/binary configuration lives in .env (see .env.example).
| Name | Model | Quant |
|---|---|---|
| `qwen3.5-27b` | Qwen 3.5 27B | Q4_K_M |
| `qwen3.5-9b` | Qwen 3.5 9B | Q4_K_M |
| `gemma4-26b-a4b` | Gemma 4 26B-A4B | Q4_K_M |
Add or replace models by editing config/llama_swap_config.yaml and the models: block in config/eval_config.yaml.
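As an illustration, registering a hypothetical extra target could look like the fragment below. The model name, file path, and flags are invented for the example, and the exact schema of both files is defined by llama-swap and `config/eval_config.yaml` respectively, so treat this as a sketch rather than a copy-paste recipe:

```yaml
# config/llama_swap_config.yaml — llama-swap entry (illustrative)
models:
  "mymodel-12b":
    cmd: /opt/llama.cpp/build/bin/llama-server --port ${PORT} -m /models/gguf/mymodel-12b-Q4_K_M.gguf -ngl 99
```

A matching entry under the `models:` block in `config/eval_config.yaml` (same name, `mymodel-12b`) makes the orchestrator pick it up.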
Two judges with complementary biases, routed by scoring dimension:
| Judge | Params | Dimensions | Bias |
|---|---|---|---|
| Flow-Judge (:9092) | 3.8B | reasoning_quality, practical_output | 5-bias (recognizes excellence) |
| RewardAnything (:9091) | 8B | everything else | 3-bias (conservative, accurate on average responses) |
Calibration: neither judge alone reaches Cohen's kappa > 0.6, but dimension routing exploits their complementary error profiles.
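The routing itself is a small lookup. A hypothetical sketch (the real logic lives in `judge.py`'s `RoutedJudgeClient`; ports and dimension names follow the table above, and `tool_use` in the test is just an example of an "everything else" dimension):

```python
# Judge endpoints from the table above.
FLOW_JUDGE_URL = "http://localhost:9092"       # 5-biased: recognizes excellence
REWARD_ANYTHING_URL = "http://localhost:9091"  # 3-biased: conservative

# Dimensions where Flow-Judge's error profile wins.
FLOW_DIMENSIONS = {"reasoning_quality", "practical_output"}


def judge_url_for(dimension: str) -> str:
    """Route a scoring dimension to the judge with the better error profile."""
    if dimension in FLOW_DIMENSIONS:
        return FLOW_JUDGE_URL
    return REWARD_ANYTHING_URL
```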
Results live in results/runs/:
- `eval_results.db` — SQLite with all results, scores, tool calls
- `run-YYYYMMDD-HHMMSS.md` — Markdown comparison report
uv run ruff check --fix . && uv run ruff format . # lint
uv run pyright # type check
uv run python -m pytest -v # tests

config/
eval_config.yaml # models, judge URLs, scoring weights
llama_swap_config.example.yaml # template — copy to *.yaml and edit
judge_swap_config.example.yaml # template (legacy; direct llama-server is the default)
tasks/
research/ planning/ coding/ agentic/
src/nite_eval/
orchestrator.py # main pipeline
conversation_runner.py # multi-turn agent loop with Hermes tool execution
judge.py # JudgeClient + RoutedJudgeClient
scoring.py # deterministic + judge-based scoring
task_loader.py results_db.py report.py
hermes_parser.py mock_tools.py rubrics.py
model_manager.py ast_comparator.py
scripts/
run_nightly.sh # unattended runner
smoke_test.py # quick pipeline check
validate_judge_pipeline.py # judge sanity check with synthetic responses
run_calibration.py # judge calibration against human scores