Skip to content

refactor: implement runtime benchmark architecture plan#48

Merged
beardedeagle merged 10 commits intomainfrom
feat-benchmark-runtime-architecture
Apr 3, 2026
Merged

refactor: implement runtime benchmark architecture plan#48
beardedeagle merged 10 commits intomainfrom
feat-benchmark-runtime-architecture

Conversation

@beardedeagle
Copy link
Copy Markdown
Owner

Summary

Implement the runtime benchmark architecture plan end-to-end and close the follow-up configuration gap for benchmark history location.

This PR:

  • removes benchmark dependence on RuntimeExecutor private helpers by adding a narrow typed runtime execution seam
  • centralizes probe mode behavior in a typed registry instead of repeated string switches
  • splits the overloaded benchmark package entrypoint into dedicated report_builder, host, and fixtures modules
  • adds a latest-entry sidecar for constant-time benchmark-history lookup while preserving append-only history records
  • moves the default benchmark history root from .omx/logs/benchmark-history to .ollm/benchmark-history
  • adds first-class benchmark settings support so history location resolves through CLI flag, environment, config file, or direct programmatic override
  • updates docs and tests to keep the benchmark lane truthful and in parity with the code

Key Changes

Benchmark architecture

  • add src/ollm/runtime/execution_trace.py with RuntimeExecutionTrace and execute_request_with_trace()
  • refactor src/ollm/runtime/generation.py to expose shared runtime execution helpers instead of benchmark-only private coupling
  • update src/ollm/runtime/benchmark/probe_execution.py to use the typed runtime trace surface
  • add src/ollm/runtime/benchmark/probe_registry.py and route CLI/target/history probe behavior through the registry
  • split benchmark package responsibilities into:
    • src/ollm/runtime/benchmark/report_builder.py
    • src/ollm/runtime/benchmark/host.py
    • src/ollm/runtime/benchmark/fixtures.py
  • reduce src/ollm/runtime/benchmark/__init__.py to a thin export surface

Benchmark history

  • change the default benchmark-history root to .ollm/benchmark-history
  • retain append-only records/ plus index.jsonl
  • add latest/ sidecar entries keyed by a stable comparison-key digest for steady-state constant-time lookup

Benchmark settings support

  • add BenchmarkSettings to the application settings schema
  • export benchmark settings from the settings compatibility surface
  • make benchmark history root resolve via:
    1. --history-dir
    2. OLLM_BENCHMARK__HISTORY_DIR
    3. [benchmark].history_dir in ollm.toml
    4. direct programmatic history_dir= override in the history API

Tests and docs

  • split benchmark probe execution tests into tests/test_benchmark_probe_execution.py so the standards checker stays at zero findings
  • add settings coverage in tests/test_benchmark_settings.py
  • expand benchmark reporting and history tests for the new precedence and history-sidecar behavior
  • update README.md and docs/benchmarking.md to match the new benchmark-history location and settings contract

Verification

  • uv run python scripts/check_python_standards.py
  • uv run ruff check src tests examples scripts
  • uv run ty check src tests scripts
  • uv run python -m compileall src tests scripts
  • uv run pytest -q -> 421 passed
  • uv build
  • uv run python -m pip_audit
  • uv run --group docs mkdocs build --strict
  • git diff --check

Notes

  • The benchmark feature remains behavior-preserving in terms of probe coverage and report intent.
  • The main user-visible default change is the benchmark-history root moving to .ollm/benchmark-history.
  • The follow-up benchmark settings work closes the previously missing env/config support for benchmark history location.

Implement the benchmark-runtime architecture work captured in BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md. This change removes the benchmark package's dependence on RuntimeExecutor private helpers, centralizes probe mode behavior in a typed registry, splits the overloaded benchmark package entrypoint into dedicated report/host/fixtures modules, and adds a latest-entry sidecar for constant-time benchmark-history lookup while preserving append-only records.

Additional changes:
- move the default benchmark-history root from .omx/logs/benchmark-history to .ollm/benchmark-history and ignore .ollm artifacts
- add src/ollm/runtime/execution_trace.py with a narrow RuntimeExecutionTrace contract used by benchmark probes
- refactor src/ollm/runtime/generation.py to share explicit runtime execution helpers instead of class-private benchmark coupling
- split benchmark execution tests into tests/test_benchmark_probe_execution.py so the standards checker stays at zero findings
- update README.md and docs/benchmarking.md to match the new benchmark artifact locations and architecture

Verification:
- uv run python scripts/check_python_standards.py
- uv run ruff check src tests examples scripts
- uv run ty check src tests scripts
- uv run python -m compileall src tests scripts
- uv run pytest -q
- uv build
- uv run python -m pip_audit
- uv run --group docs mkdocs build --strict
- git diff --check
Record the closure of ollm-cly in the tracked beads interaction log after the runtime benchmark architecture refactor was completed and pushed.
Add a first-class benchmark settings surface so benchmark history location follows the standard configuration contract. This introduces AppSettings.benchmark.history_dir, preserves direct programmatic history_dir overrides in the benchmark history API, and updates scripts/benchmark_runtime.py so history root resolution follows CLI flag > environment/config-loaded settings > built-in default.

Additional changes:
- export BenchmarkSettings from the settings compatibility surface
- add a public resolve_benchmark_history_dir helper on the benchmark history module
- add tests for benchmark settings file loading, environment loading, environment-over-config precedence, and CLI-over-settings resolution
- update README.md and docs/benchmarking.md to document --history-dir, OLLM_BENCHMARK__HISTORY_DIR, and [benchmark].history_dir in ollm.toml

Verification:
- uv run python scripts/check_python_standards.py
- uv run ruff check src tests examples scripts
- uv run ty check src tests scripts
- uv run python -m compileall src tests scripts
- uv run pytest -q
- uv build
- uv run python -m pip_audit
- uv run --group docs mkdocs build --strict
- git diff --check
Record the closure of ollm-7zk in the tracked beads interaction log after benchmark history dir settings support was implemented and pushed.
Copilot AI review requested due to automatic review settings April 3, 2026 05:13
Run ruff format across the repository and keep the resulting formatter-only changes required for the benchmark architecture branch so CI format checks match the verified local tree.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements an end-to-end refactor of the runtime benchmark subsystem to establish a typed runtime execution seam, centralize probe-mode behavior, split benchmark orchestration into dedicated modules, and improve benchmark-history lookup and configuration (including a new default history root under .ollm/benchmark-history).

Changes:

  • Introduces RuntimeExecutionTrace + execute_request_with_trace() and refactors generation helpers to remove benchmark dependence on RuntimeExecutor private methods.
  • Centralizes probe mode contracts/dispatch via a typed ProbeMode registry; updates CLI/targets/support utilities to use it.
  • Adds benchmark settings + history improvements (new default .ollm/benchmark-history, latest-entry sidecar for O(1) lookups), and updates docs/tests accordingly.

Reviewed changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_runtime_settings.py Extends settings tests to cover [benchmark].history_dir and OLLM_BENCHMARK__HISTORY_DIR.
tests/test_benchmarks.py Removes probe-execution unit scaffolding from this file and adds a guard test against RuntimeExecutor private helper usage.
tests/test_benchmark_settings.py Adds focused tests for benchmark settings file/env parsing and precedence.
tests/test_benchmark_reporting.py Adds coverage for resolving benchmark history dir precedence (CLI vs settings).
tests/test_benchmark_probe_execution.py Adds dedicated probe-execution and runtime trace seam tests (moved from test_benchmarks.py).
tests/test_benchmark_history.py Updates history root and adds tests for latest-sidecar write/read/fallback behavior.
src/ollm/runtime/settings.py Re-exports BenchmarkSettings from the compatibility settings surface.
src/ollm/runtime/settings_schema.py Adds BenchmarkSettings to the application settings schema.
src/ollm/runtime/settings_resolution.py Ensures default settings include benchmark settings defaults.
src/ollm/runtime/generation.py Extracts and reuses shared runtime execution helpers; keeps RuntimeExecutor using the shared helpers.
src/ollm/runtime/execution_trace.py Adds typed runtime execution tracing API used by benchmarks.
src/ollm/runtime/benchmark/targets.py Switches probe mode handling to typed ProbeMode and emits canonical CLI values.
src/ollm/runtime/benchmark/report_builder.py New module for benchmark report orchestration.
src/ollm/runtime/benchmark/probe_registry.py New typed registry defining probe modes, runners, renderers, and history extraction.
src/ollm/runtime/benchmark/probe_execution.py Refactors probe execution to use execute_request_with_trace() instead of private executor helpers.
src/ollm/runtime/benchmark/host.py New module for host/device summary helpers.
src/ollm/runtime/benchmark/history.py Moves default history root to .ollm/benchmark-history and adds latest-sidecar indexing.
src/ollm/runtime/benchmark/fixtures.py New module to isolate heavy fixture imports to call sites.
src/ollm/runtime/benchmark/init.py Reduces package entrypoint to thin re-exports and moves heavy logic out.
scripts/benchmark_runtime.py Updates benchmark CLI to use probe registry + settings-based history-dir resolution.
scripts/benchmark_runtime_support.py Uses probe registry for history request extraction; types probe mode.
README.md Updates benchmark history location and documents precedence for history-dir resolution.
docs/benchmarking.md Updates benchmark history location and documents env/config overrides.
BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md Adds architecture plan document describing the refactor and design goals.
.gitignore Ignores .ollm/ directory (new default benchmark-history root).
.beads/interactions.jsonl Updates project interaction log entries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Replace machine-local absolute filesystem links in BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md with repo-relative links so the document renders correctly for other contributors and in PR review.
Replace the remaining hardcoded absolute temp-path literals in tests and docs with generated platform-correct paths or neutral non-machine-local examples. This keeps path-sensitive coverage intact while removing unnecessary '/tmp' and 'file:///tmp' literals from the repo.

Verification:
- uv run python scripts/check_python_standards.py
- uv run ruff check src tests examples scripts
- uv run ty check src tests scripts
- uv run python -m compileall src tests scripts
- uv run pytest -q
- uv build
- uv run python -m pip_audit
- uv run --group docs mkdocs build --strict
- git diff --check
Persist the formatter-applied change in tests/test_backend_selector.py and the tracked beads interaction log after rerunning the full pre-push quality gate.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Remove the hardcoded reasoning_effort="minimal" argument from runtime chat-template calls so runtime prompt construction does not silently override tokenizer behavior. Update the tokenizer test doubles to match the intended call contract and add a regression test that fails if runtime starts forcing reasoning_effort again.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 32 out of 33 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Capture benchmark generation_started_at before prepare_runtime_generate_inputs runs so TTFT and prompt-throughput semantics continue to include chunked prefill work, matching the original benchmark behavior. Add a regression test to lock that ordering in place.
@beardedeagle beardedeagle merged commit b5db1d5 into main Apr 3, 2026
5 checks passed
@beardedeagle beardedeagle deleted the feat-benchmark-runtime-architecture branch April 3, 2026 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants