refactor: implement runtime benchmark architecture plan by beardedeagle · Pull Request #48 · beardedeagle/ollm

beardedeagle · 2026-04-03T05:13:26Z

Summary

Implement the runtime benchmark architecture plan end-to-end and close the follow-up configuration gap for benchmark history location.

This PR:

removes benchmark dependence on RuntimeExecutor private helpers by adding a narrow typed runtime execution seam
centralizes probe mode behavior in a typed registry instead of repeated string switches
splits the overloaded benchmark package entrypoint into dedicated report_builder, host, and fixtures modules
adds a latest-entry sidecar for constant-time benchmark-history lookup while preserving append-only history records
moves the default benchmark history root from .omx/logs/benchmark-history to .ollm/benchmark-history
adds first-class benchmark settings support so history location resolves through CLI flag, environment, config file, or direct programmatic override
updates docs and tests to keep the benchmark lane truthful and in parity with the code

Key Changes

Benchmark architecture

add src/ollm/runtime/execution_trace.py with RuntimeExecutionTrace and execute_request_with_trace()
refactor src/ollm/runtime/generation.py to expose shared runtime execution helpers instead of benchmark-only private coupling
update src/ollm/runtime/benchmark/probe_execution.py to use the typed runtime trace surface
add src/ollm/runtime/benchmark/probe_registry.py and route CLI/target/history probe behavior through the registry
split benchmark package responsibilities into:
- src/ollm/runtime/benchmark/report_builder.py
- src/ollm/runtime/benchmark/host.py
- src/ollm/runtime/benchmark/fixtures.py
reduce src/ollm/runtime/benchmark/__init__.py to a thin export surface

Benchmark history

change the default benchmark-history root to .ollm/benchmark-history
retain append-only records/ plus index.jsonl
add latest/ sidecar entries keyed by a stable comparison-key digest for steady-state constant-time lookup

Benchmark settings support

add BenchmarkSettings to the application settings schema
export benchmark settings from the settings compatibility surface
make benchmark history root resolve via:
1. --history-dir
2. OLLM_BENCHMARK__HISTORY_DIR
3. [benchmark].history_dir in ollm.toml
4. direct programmatic history_dir= override in the history API

Tests and docs

split benchmark probe execution tests into tests/test_benchmark_probe_execution.py so the standards checker stays at zero findings
add settings coverage in tests/test_benchmark_settings.py
expand benchmark reporting and history tests for the new precedence and history-sidecar behavior
update README.md and docs/benchmarking.md to match the new benchmark-history location and settings contract

Verification

uv run python scripts/check_python_standards.py
uv run ruff check src tests examples scripts
uv run ty check src tests scripts
uv run python -m compileall src tests scripts
uv run pytest -q -> 421 passed
uv build
uv run python -m pip_audit
uv run --group docs mkdocs build --strict
git diff --check

Notes

The benchmark feature remains behavior-preserving in terms of probe coverage and report intent.
The main user-visible default change is the benchmark-history root moving to .ollm/benchmark-history.
The follow-up benchmark settings work closes the previously missing env/config support for benchmark history location.

Implement the benchmark-runtime architecture work captured in BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md. This change removes the benchmark package's dependence on RuntimeExecutor private helpers, centralizes probe mode behavior in a typed registry, splits the overloaded benchmark package entrypoint into dedicated report/host/fixtures modules, and adds a latest-entry sidecar for constant-time benchmark-history lookup while preserving append-only records. Additional changes: - move the default benchmark-history root from .omx/logs/benchmark-history to .ollm/benchmark-history and ignore .ollm artifacts - add src/ollm/runtime/execution_trace.py with a narrow RuntimeExecutionTrace contract used by benchmark probes - refactor src/ollm/runtime/generation.py to share explicit runtime execution helpers instead of class-private benchmark coupling - split benchmark execution tests into tests/test_benchmark_probe_execution.py so the standards checker stays at zero findings - update README.md and docs/benchmarking.md to match the new benchmark artifact locations and architecture Verification: - uv run python scripts/check_python_standards.py - uv run ruff check src tests examples scripts - uv run ty check src tests scripts - uv run python -m compileall src tests scripts - uv run pytest -q - uv build - uv run python -m pip_audit - uv run --group docs mkdocs build --strict - git diff --check

Record the closure of ollm-cly in the tracked beads interaction log after the runtime benchmark architecture refactor was completed and pushed.

Add a first-class benchmark settings surface so benchmark history location follows the standard configuration contract. This introduces AppSettings.benchmark.history_dir, preserves direct programmatic history_dir overrides in the benchmark history API, and updates scripts/benchmark_runtime.py so history root resolution follows CLI flag > environment/config-loaded settings > built-in default. Additional changes: - export BenchmarkSettings from the settings compatibility surface - add a public resolve_benchmark_history_dir helper on the benchmark history module - add tests for benchmark settings file loading, environment loading, environment-over-config precedence, and CLI-over-settings resolution - update README.md and docs/benchmarking.md to document --history-dir, OLLM_BENCHMARK__HISTORY_DIR, and [benchmark].history_dir in ollm.toml Verification: - uv run python scripts/check_python_standards.py - uv run ruff check src tests examples scripts - uv run ty check src tests scripts - uv run python -m compileall src tests scripts - uv run pytest -q - uv build - uv run python -m pip_audit - uv run --group docs mkdocs build --strict - git diff --check

Record the closure of ollm-7zk in the tracked beads interaction log after benchmark history dir settings support was implemented and pushed.

Run ruff format across the repository and keep the resulting formatter-only changes required for the benchmark architecture branch so CI format checks match the verified local tree.

Copilot

Pull request overview

Implements an end-to-end refactor of the runtime benchmark subsystem to establish a typed runtime execution seam, centralize probe-mode behavior, split benchmark orchestration into dedicated modules, and improve benchmark-history lookup and configuration (including a new default history root under .ollm/benchmark-history).

Changes:

Introduces RuntimeExecutionTrace + execute_request_with_trace() and refactors generation helpers to remove benchmark dependence on RuntimeExecutor private methods.
Centralizes probe mode contracts/dispatch via a typed ProbeMode registry; updates CLI/targets/support utilities to use it.
Adds benchmark settings + history improvements (new default .ollm/benchmark-history, latest-entry sidecar for O(1) lookups), and updates docs/tests accordingly.

Reviewed changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/test_runtime_settings.py	Extends settings tests to cover `[benchmark].history_dir` and `OLLM_BENCHMARK__HISTORY_DIR`.
tests/test_benchmarks.py	Removes probe-execution unit scaffolding from this file and adds a guard test against `RuntimeExecutor` private helper usage.
tests/test_benchmark_settings.py	Adds focused tests for benchmark settings file/env parsing and precedence.
tests/test_benchmark_reporting.py	Adds coverage for resolving benchmark history dir precedence (CLI vs settings).
tests/test_benchmark_probe_execution.py	Adds dedicated probe-execution and runtime trace seam tests (moved from `test_benchmarks.py`).
tests/test_benchmark_history.py	Updates history root and adds tests for latest-sidecar write/read/fallback behavior.
src/ollm/runtime/settings.py	Re-exports `BenchmarkSettings` from the compatibility settings surface.
src/ollm/runtime/settings_schema.py	Adds `BenchmarkSettings` to the application settings schema.
src/ollm/runtime/settings_resolution.py	Ensures default settings include benchmark settings defaults.
src/ollm/runtime/generation.py	Extracts and reuses shared runtime execution helpers; keeps `RuntimeExecutor` using the shared helpers.
src/ollm/runtime/execution_trace.py	Adds typed runtime execution tracing API used by benchmarks.
src/ollm/runtime/benchmark/targets.py	Switches probe mode handling to typed `ProbeMode` and emits canonical CLI values.
src/ollm/runtime/benchmark/report_builder.py	New module for benchmark report orchestration.
src/ollm/runtime/benchmark/probe_registry.py	New typed registry defining probe modes, runners, renderers, and history extraction.
src/ollm/runtime/benchmark/probe_execution.py	Refactors probe execution to use `execute_request_with_trace()` instead of private executor helpers.
src/ollm/runtime/benchmark/host.py	New module for host/device summary helpers.
src/ollm/runtime/benchmark/history.py	Moves default history root to `.ollm/benchmark-history` and adds latest-sidecar indexing.
src/ollm/runtime/benchmark/fixtures.py	New module to isolate heavy fixture imports to call sites.
src/ollm/runtime/benchmark/init.py	Reduces package entrypoint to thin re-exports and moves heavy logic out.
scripts/benchmark_runtime.py	Updates benchmark CLI to use probe registry + settings-based history-dir resolution.
scripts/benchmark_runtime_support.py	Uses probe registry for history request extraction; types probe mode.
README.md	Updates benchmark history location and documents precedence for history-dir resolution.
docs/benchmarking.md	Updates benchmark history location and documents env/config overrides.
BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md	Adds architecture plan document describing the refactor and design goals.
.gitignore	Ignores `.ollm/` directory (new default benchmark-history root).
.beads/interactions.jsonl	Updates project interaction log entries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md

Replace machine-local absolute filesystem links in BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md with repo-relative links so the document renders correctly for other contributors and in PR review.

Replace the remaining hardcoded absolute temp-path literals in tests and docs with generated platform-correct paths or neutral non-machine-local examples. This keeps path-sensitive coverage intact while removing unnecessary '/tmp' and 'file:///tmp' literals from the repo. Verification: - uv run python scripts/check_python_standards.py - uv run ruff check src tests examples scripts - uv run ty check src tests scripts - uv run python -m compileall src tests scripts - uv run pytest -q - uv build - uv run python -m pip_audit - uv run --group docs mkdocs build --strict - git diff --check

Persist the formatter-applied change in tests/test_backend_selector.py and the tracked beads interaction log after rerunning the full pre-push quality gate.

Copilot

Pull request overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/ollm/runtime/benchmark/targets.py

src/ollm/runtime/generation.py

Remove the hardcoded reasoning_effort="minimal" argument from runtime chat-template calls so runtime prompt construction does not silently override tokenizer behavior. Update the tokenizer test doubles to match the intended call contract and add a regression test that fails if runtime starts forcing reasoning_effort again.

Copilot

Pull request overview

Copilot reviewed 32 out of 33 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/ollm/runtime/execution_trace.py

Capture benchmark generation_started_at before prepare_runtime_generate_inputs runs so TTFT and prompt-throughput semantics continue to include chunked prefill work, matching the original benchmark behavior. Add a regression test to lock that ordering in place.

beardedeagle added 4 commits April 2, 2026 23:07

chore: record benchmark architecture task closure

ed0890f

Record the closure of ollm-cly in the tracked beads interaction log after the runtime benchmark architecture refactor was completed and pushed.

chore: record benchmark settings task closure

08fe07f

Record the closure of ollm-7zk in the tracked beads interaction log after benchmark history dir settings support was implemented and pushed.

Copilot AI review requested due to automatic review settings April 3, 2026 05:13

Copilot started reviewing on behalf of beardedeagle April 3, 2026 05:14 View session

style: apply repo formatter for benchmark branch

544b817

Run ruff format across the repository and keep the resulting formatter-only changes required for the benchmark architecture branch so CI format checks match the verified local tree.

Copilot AI reviewed Apr 3, 2026

View reviewed changes

BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md Outdated Show resolved Hide resolved

BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md Show resolved Hide resolved

BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md Outdated Show resolved Hide resolved

beardedeagle added 3 commits April 3, 2026 01:09

docs: make benchmark architecture plan links portable

c77d831

Replace machine-local absolute filesystem links in BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md with repo-relative links so the document renders correctly for other contributors and in PR review.

style: format backend selector test and record bead updates

5f1bebd

Persist the formatter-applied change in tests/test_backend_selector.py and the tracked beads interaction log after rerunning the full pre-push quality gate.

beardedeagle requested a review from Copilot April 3, 2026 07:20

Copilot started reviewing on behalf of beardedeagle April 3, 2026 07:21 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

src/ollm/runtime/benchmark/targets.py Show resolved Hide resolved

src/ollm/runtime/generation.py Outdated Show resolved Hide resolved

beardedeagle requested a review from Copilot April 3, 2026 08:15

Copilot started reviewing on behalf of beardedeagle April 3, 2026 08:15 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

src/ollm/runtime/execution_trace.py Show resolved Hide resolved

beardedeagle merged commit b5db1d5 into main Apr 3, 2026
5 checks passed

beardedeagle deleted the feat-benchmark-runtime-architecture branch April 3, 2026 08:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: implement runtime benchmark architecture plan#48

refactor: implement runtime benchmark architecture plan#48
beardedeagle merged 10 commits intomainfrom
feat-benchmark-runtime-architecture

beardedeagle commented Apr 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

beardedeagle commented Apr 3, 2026

Summary

Key Changes

Benchmark architecture

Benchmark history

Benchmark settings support

Tests and docs

Verification

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants