refactor: implement runtime benchmark architecture plan#48
refactor: implement runtime benchmark architecture plan#48beardedeagle merged 10 commits intomainfrom
Conversation
Implement the benchmark-runtime architecture work captured in BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md. This change removes the benchmark package's dependence on RuntimeExecutor private helpers, centralizes probe mode behavior in a typed registry, splits the overloaded benchmark package entrypoint into dedicated report/host/fixtures modules, and adds a latest-entry sidecar for constant-time benchmark-history lookup while preserving append-only records. Additional changes: - move the default benchmark-history root from .omx/logs/benchmark-history to .ollm/benchmark-history and ignore .ollm artifacts - add src/ollm/runtime/execution_trace.py with a narrow RuntimeExecutionTrace contract used by benchmark probes - refactor src/ollm/runtime/generation.py to share explicit runtime execution helpers instead of class-private benchmark coupling - split benchmark execution tests into tests/test_benchmark_probe_execution.py so the standards checker stays at zero findings - update README.md and docs/benchmarking.md to match the new benchmark artifact locations and architecture Verification: - uv run python scripts/check_python_standards.py - uv run ruff check src tests examples scripts - uv run ty check src tests scripts - uv run python -m compileall src tests scripts - uv run pytest -q - uv build - uv run python -m pip_audit - uv run --group docs mkdocs build --strict - git diff --check
Record the closure of ollm-cly in the tracked beads interaction log after the runtime benchmark architecture refactor was completed and pushed.
Add a first-class benchmark settings surface so benchmark history location follows the standard configuration contract. This introduces AppSettings.benchmark.history_dir, preserves direct programmatic history_dir overrides in the benchmark history API, and updates scripts/benchmark_runtime.py so history root resolution follows CLI flag > environment/config-loaded settings > built-in default. Additional changes: - export BenchmarkSettings from the settings compatibility surface - add a public resolve_benchmark_history_dir helper on the benchmark history module - add tests for benchmark settings file loading, environment loading, environment-over-config precedence, and CLI-over-settings resolution - update README.md and docs/benchmarking.md to document --history-dir, OLLM_BENCHMARK__HISTORY_DIR, and [benchmark].history_dir in ollm.toml Verification: - uv run python scripts/check_python_standards.py - uv run ruff check src tests examples scripts - uv run ty check src tests scripts - uv run python -m compileall src tests scripts - uv run pytest -q - uv build - uv run python -m pip_audit - uv run --group docs mkdocs build --strict - git diff --check
Record the closure of ollm-7zk in the tracked beads interaction log after benchmark history dir settings support was implemented and pushed.
Run ruff format across the repository and keep the resulting formatter-only changes required for the benchmark architecture branch so CI format checks match the verified local tree.
There was a problem hiding this comment.
Pull request overview
Implements an end-to-end refactor of the runtime benchmark subsystem to establish a typed runtime execution seam, centralize probe-mode behavior, split benchmark orchestration into dedicated modules, and improve benchmark-history lookup and configuration (including a new default history root under .ollm/benchmark-history).
Changes:
- Introduces
RuntimeExecutionTrace+execute_request_with_trace()and refactors generation helpers to remove benchmark dependence onRuntimeExecutorprivate methods. - Centralizes probe mode contracts/dispatch via a typed
ProbeModeregistry; updates CLI/targets/support utilities to use it. - Adds benchmark settings + history improvements (new default
.ollm/benchmark-history, latest-entry sidecar for O(1) lookups), and updates docs/tests accordingly.
Reviewed changes
Copilot reviewed 25 out of 26 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_runtime_settings.py | Extends settings tests to cover [benchmark].history_dir and OLLM_BENCHMARK__HISTORY_DIR. |
| tests/test_benchmarks.py | Removes probe-execution unit scaffolding from this file and adds a guard test against RuntimeExecutor private helper usage. |
| tests/test_benchmark_settings.py | Adds focused tests for benchmark settings file/env parsing and precedence. |
| tests/test_benchmark_reporting.py | Adds coverage for resolving benchmark history dir precedence (CLI vs settings). |
| tests/test_benchmark_probe_execution.py | Adds dedicated probe-execution and runtime trace seam tests (moved from test_benchmarks.py). |
| tests/test_benchmark_history.py | Updates history root and adds tests for latest-sidecar write/read/fallback behavior. |
| src/ollm/runtime/settings.py | Re-exports BenchmarkSettings from the compatibility settings surface. |
| src/ollm/runtime/settings_schema.py | Adds BenchmarkSettings to the application settings schema. |
| src/ollm/runtime/settings_resolution.py | Ensures default settings include benchmark settings defaults. |
| src/ollm/runtime/generation.py | Extracts and reuses shared runtime execution helpers; keeps RuntimeExecutor using the shared helpers. |
| src/ollm/runtime/execution_trace.py | Adds typed runtime execution tracing API used by benchmarks. |
| src/ollm/runtime/benchmark/targets.py | Switches probe mode handling to typed ProbeMode and emits canonical CLI values. |
| src/ollm/runtime/benchmark/report_builder.py | New module for benchmark report orchestration. |
| src/ollm/runtime/benchmark/probe_registry.py | New typed registry defining probe modes, runners, renderers, and history extraction. |
| src/ollm/runtime/benchmark/probe_execution.py | Refactors probe execution to use execute_request_with_trace() instead of private executor helpers. |
| src/ollm/runtime/benchmark/host.py | New module for host/device summary helpers. |
| src/ollm/runtime/benchmark/history.py | Moves default history root to .ollm/benchmark-history and adds latest-sidecar indexing. |
| src/ollm/runtime/benchmark/fixtures.py | New module to isolate heavy fixture imports to call sites. |
| src/ollm/runtime/benchmark/init.py | Reduces package entrypoint to thin re-exports and moves heavy logic out. |
| scripts/benchmark_runtime.py | Updates benchmark CLI to use probe registry + settings-based history-dir resolution. |
| scripts/benchmark_runtime_support.py | Uses probe registry for history request extraction; types probe mode. |
| README.md | Updates benchmark history location and documents precedence for history-dir resolution. |
| docs/benchmarking.md | Updates benchmark history location and documents env/config overrides. |
| BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md | Adds architecture plan document describing the refactor and design goals. |
| .gitignore | Ignores .ollm/ directory (new default benchmark-history root). |
| .beads/interactions.jsonl | Updates project interaction log entries. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Replace machine-local absolute filesystem links in BENCHMARK_RUNTIME_ARCHITECTURE_IMPLEMENTATION_PLAN.md with repo-relative links so the document renders correctly for other contributors and in PR review.
Replace the remaining hardcoded absolute temp-path literals in tests and docs with generated platform-correct paths or neutral non-machine-local examples. This keeps path-sensitive coverage intact while removing unnecessary '/tmp' and 'file:///tmp' literals from the repo. Verification: - uv run python scripts/check_python_standards.py - uv run ruff check src tests examples scripts - uv run ty check src tests scripts - uv run python -m compileall src tests scripts - uv run pytest -q - uv build - uv run python -m pip_audit - uv run --group docs mkdocs build --strict - git diff --check
Persist the formatter-applied change in tests/test_backend_selector.py and the tracked beads interaction log after rerunning the full pre-push quality gate.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Remove the hardcoded reasoning_effort="minimal" argument from runtime chat-template calls so runtime prompt construction does not silently override tokenizer behavior. Update the tokenizer test doubles to match the intended call contract and add a regression test that fails if runtime starts forcing reasoning_effort again.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 33 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Capture benchmark generation_started_at before prepare_runtime_generate_inputs runs so TTFT and prompt-throughput semantics continue to include chunked prefill work, matching the original benchmark behavior. Add a regression test to lock that ordering in place.
Summary
Implement the runtime benchmark architecture plan end-to-end and close the follow-up configuration gap for benchmark history location.
This PR:
RuntimeExecutorprivate helpers by adding a narrow typed runtime execution seamreport_builder,host, andfixturesmodules.omx/logs/benchmark-historyto.ollm/benchmark-historyKey Changes
Benchmark architecture
src/ollm/runtime/execution_trace.pywithRuntimeExecutionTraceandexecute_request_with_trace()src/ollm/runtime/generation.pyto expose shared runtime execution helpers instead of benchmark-only private couplingsrc/ollm/runtime/benchmark/probe_execution.pyto use the typed runtime trace surfacesrc/ollm/runtime/benchmark/probe_registry.pyand route CLI/target/history probe behavior through the registrysrc/ollm/runtime/benchmark/report_builder.pysrc/ollm/runtime/benchmark/host.pysrc/ollm/runtime/benchmark/fixtures.pysrc/ollm/runtime/benchmark/__init__.pyto a thin export surfaceBenchmark history
.ollm/benchmark-historyrecords/plusindex.jsonllatest/sidecar entries keyed by a stable comparison-key digest for steady-state constant-time lookupBenchmark settings support
BenchmarkSettingsto the application settings schema--history-dirOLLM_BENCHMARK__HISTORY_DIR[benchmark].history_dirinollm.tomlhistory_dir=override in the history APITests and docs
tests/test_benchmark_probe_execution.pyso the standards checker stays at zero findingstests/test_benchmark_settings.pyREADME.mdanddocs/benchmarking.mdto match the new benchmark-history location and settings contractVerification
uv run python scripts/check_python_standards.pyuv run ruff check src tests examples scriptsuv run ty check src tests scriptsuv run python -m compileall src tests scriptsuv run pytest -q->421 passeduv builduv run python -m pip_audituv run --group docs mkdocs build --strictgit diff --checkNotes
.ollm/benchmark-history.