feat: implement streamed chunked ingestion across runtime strategies #49
beardedeagle merged 9 commits into main from
Conversation
Add a typed chunked-prefill scope surface so the remaining feature boundaries are encoded in runtime and benchmark code instead of being left as doc-only caveats. The new surface records runtime eligibility, whether bounded chunked prefill actually ran, the post-tokenization/full-prefix-mask execution boundary, and explicit reject decisions for the two previously open scope gaps. Thread the new scope facts through runtime execution tracing, benchmark request metrics, benchmark JSON parsing, and request metric summaries. Add focused regression coverage for the supported optimized-native causal path, the unsupported seq2seq path, benchmark trace/report serialization, and the new scope decisions. Update README and benchmarking/optimization docs so user-facing guidance matches the code-backed contract: chunked prefill begins after prompt construction, remains limited to optimized-native causal text runtimes, and publishes its non-goals through benchmark output.
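The typed scope surface described above could look like the following minimal sketch. Field and class names (`ChunkedPrefillScope`, `runtime_eligible`, `boundary`, `rejects`) follow the prose, not the actual ollm definitions:

```python
from dataclasses import dataclass


# Hypothetical shape of the typed chunked-prefill scope surface: it records
# runtime eligibility, whether bounded chunked prefill actually ran, the
# execution boundary, and explicit reject decisions for out-of-scope paths.
@dataclass(frozen=True)
class ChunkedPrefillScope:
    runtime_eligible: bool        # the selected runtime lane can chunk at all
    ran: bool                     # bounded chunked prefill actually executed
    boundary: str                 # e.g. "post-tokenization/full-prefix-mask"
    rejects: tuple[str, ...] = () # explicit scope-reject decisions

    def as_json(self) -> dict:
        """Serialize the scope facts for benchmark JSON output."""
        return {
            "eligible": self.runtime_eligible,
            "ran": self.ran,
            "boundary": self.boundary,
            "rejects": list(self.rejects),
        }
```

A frozen dataclass keeps the scope facts immutable once the runtime has recorded them, so tracing and benchmark serialization cannot drift apart.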
Replace the earlier scope-only chunked-prefill closure with real strategy-backed runtime support. The runtime now resolves four causal chunked-prefill strategies: optimized-native text, optimized-native processor-backed multimodal, transformers-generic text, and transformers-generic processor-backed multimodal. The shared chunked-prefill core handles sequence slicing, static multimodal inputs, cache handoff, and final generate input reduction without duplicating the causal prefill loop. Thread the selected strategy through runtime metadata, execution tracing, and benchmark JSON/reporting so request metrics identify which chunked-prefill lane ran. Update the regression suite to cover the new strategy lanes and add a tiny local T5 proof showing why seq2seq source prompts still need a separate encoder-side design instead of pretending they can reuse the causal-cache contract. Refresh README and benchmarking/optimization docs so the public contract matches the implementation: four supported causal strategy lanes now exist, while prompt-construction streaming/lazy-mask work and seq2seq source-ingestion chunking are tracked separately as follow-on beads ollm-qm9 and ollm-dnl.
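The shared chunked-prefill core described above can be sketched as follows. This is an illustrative reduction, not the ollm implementation: `forward` stands in for a model forward pass that consumes one chunk plus the accumulated KV cache and returns the updated cache.

```python
# Minimal sketch of a shared causal chunked-prefill loop: slice the prompt
# into bounded chunks, hand the cache between chunks, and leave only the
# final chunk for the generate call (earlier tokens already live in the cache).

def chunk_token_ids(token_ids, chunk_size):
    """Slice the prompt into fixed-size chunks; the last chunk may be short."""
    if chunk_size <= 0:
        raise ValueError("chunk budget must be positive")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def prefill_in_chunks(forward, token_ids, chunk_size):
    """Run causal prefill chunk by chunk with cache handoff.

    Returns the cache built from all but the last chunk, plus the tail chunk
    that the reduced generate input still needs to carry.
    """
    cache = None
    chunks = chunk_token_ids(token_ids, chunk_size)
    for chunk in chunks[:-1]:
        cache = forward(chunk, cache)
    return cache, chunks[-1]
```

The "final generate input reduction" mentioned above corresponds to returning only the tail chunk: the generate step does not re-feed tokens the cache already covers.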
Pull request overview
This PR expands chunked-prefill from an optimized-native-text-only path into a strategy-backed runtime feature across the causal runtime lanes, and propagates the selected strategy/scope through runtime metadata, execution tracing, and benchmark probe serialization.
Changes:
- Added a new `chunked_prefill` runtime module that resolves a `ChunkedPrefillStrategyId` and executes causal chunked-prefill while tracking scope/eligibility.
- Updated generation + execution tracing + benchmark probe metrics/JSON to carry `chunked_prefill` scope details end-to-end.
- Expanded tests and docs to cover the four causal strategy lanes and to keep seq2seq source prompts explicitly deferred.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/ollm/runtime/chunked_prefill.py | New strategy/scope surface + shared causal chunked-prefill implementation. |
| src/ollm/runtime/generation.py | Switches to strategy-backed chunked-prefill preparation and includes chunked-prefill metadata fields. |
| src/ollm/runtime/execution_trace.py | Adds chunked-prefill scope to execution traces for benchmarking. |
| src/ollm/runtime/benchmark/probe_types.py | Extends RequestProbeMetrics with mandatory chunked_prefill scope. |
| src/ollm/runtime/benchmark/probe_execution.py | Threads chunked-prefill scope from trace into request metrics. |
| src/ollm/runtime/benchmark/probe_serialization.py | Parses/serializes the new chunked_prefill request section in probe JSON. |
| src/ollm/runtime/benchmark/chunked_prefill_serialization.py | New parser helpers for chunked_prefill JSON payloads. |
| src/ollm/runtime/benchmark/details.py | Includes chunked-prefill scope in summarized request metrics. |
| tests/test_runtime_executor_prefill.py | Adds coverage for native/generic text + multimodal chunked-prefill behavior and metadata. |
| tests/test_chunked_prefill_scope.py | Validates surfaced scope and explicitly defers seq2seq with a tiny T5 contract test. |
| tests/test_benchmark_probe_execution.py | Asserts chunked-prefill scope appears in probe execution and traces; updates wrapper signature. |
| tests/test_benchmark_reporting.py | Ensures runtime probe JSON round-trips include chunked-prefill details. |
| tests/benchmark_support.py | Updates probe metrics fixtures to include chunked-prefill scope. |
| README.md | Documents supported chunked-prefill strategy lanes and current limitations. |
| docs/guides/optimization.md | Updates optimization guide with the new chunked-prefill lane matrix and caveats. |
| docs/benchmarking.md | Documents chunked-prefill reporting in benchmark request metrics. |
| .beads/interactions.jsonl | Updates tracker interaction history for ollm-6b7. |
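A parser for the new `chunked_prefill` request section in probe JSON might look like the sketch below. The field names are assumptions mirroring the scope facts described in this PR, not the exact ollm schema:

```python
# Illustrative parser for a chunked_prefill payload in benchmark probe JSON.
# Tolerates an absent section so older probe files still round-trip.

def parse_chunked_prefill(section):
    """Normalize a chunked_prefill JSON section into a plain dict."""
    if section is None:
        return {"eligible": False, "ran": False, "strategy": None, "rejects": []}
    return {
        "eligible": bool(section.get("eligible", False)),
        "ran": bool(section.get("ran", False)),
        "strategy": section.get("strategy"),
        "rejects": list(section.get("rejects", [])),
    }
```

Defaulting a missing section to "not eligible, did not run" keeps summarized request metrics well-defined even for payloads produced before this feature landed.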
Replace the pure strategy-id resolution path with explicit per-lane chunked-prefill handlers for optimized-native text, optimized-native multimodal, transformers-generic text, and transformers-generic multimodal. The handlers still share the causal chunk loop, but each lane now has its own matcher and runner so future divergence can happen on a concrete strategy boundary instead of implicit branching. The full required gate was rerun from the final formatted tree before this commit: ruff format/check, standards checker, ty, compileall, pytest (432 passed), build, pip_audit, mkdocs --strict, and git diff --check.
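The per-lane handler structure described in this commit can be sketched as a matcher/runner table. The `Request` shape and lane predicates below are illustrative assumptions; only the lane names come from the PR:

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical request shape used only to drive lane matching in this sketch.
@dataclass
class Request:
    backend: str          # "optimized-native" or "transformers-generic"
    multimodal: bool
    seq2seq: bool = False


@dataclass
class Lane:
    strategy_id: str
    matches: Callable[[Request], bool]


# One explicit matcher per lane, so future divergence happens on a concrete
# strategy boundary instead of implicit branching inside a shared resolver.
LANES = [
    Lane("optimized-native-text",
         lambda r: r.backend == "optimized-native" and not r.multimodal and not r.seq2seq),
    Lane("optimized-native-multimodal",
         lambda r: r.backend == "optimized-native" and r.multimodal and not r.seq2seq),
    Lane("transformers-generic-text",
         lambda r: r.backend == "transformers-generic" and not r.multimodal and not r.seq2seq),
    Lane("transformers-generic-multimodal",
         lambda r: r.backend == "transformers-generic" and r.multimodal and not r.seq2seq),
]


def resolve_strategy(request: Request) -> Optional[str]:
    """Return the first matching lane's strategy id, or None when rejected."""
    for lane in LANES:
        if lane.matches(request):
            return lane.strategy_id
    return None
```

Seq2seq requests fall through every causal matcher here, which is the explicit-reject behavior the scope surface records.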
Complete the remaining ollm-6b7 work by moving chunked strategy preparation ahead of eager model-input construction. The runtime now streams prompt tokenization from rendered prompt pieces, synthesizes prefix attention masks lazily for causal chunked lanes, and adds a dedicated transformers-generic seq2seq source-ingestion strategy instead of leaving seq2seq outside the feature. This update also splits chunked prompt support into a smaller support module to keep the repo standards checker green, refreshes benchmark/request metadata and docs to the final contract, and closes the absorbed follow-on beads ollm-qm9 and ollm-dnl. Verification rerun from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (432 passed), build, pip_audit, mkdocs --strict, and git diff --check.
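The streamed tokenization and lazy prefix-mask behavior described above can be reduced to two small helpers. Both names and shapes are illustrative; `tokenize_piece` stands in for the tokenizer:

```python
# Sketch of streaming prompt tokenization from rendered prompt pieces:
# token ids are produced piece by piece instead of materializing the whole
# prompt before chunking begins.

def stream_prompt_tokens(pieces, tokenize_piece):
    """Yield token ids lazily, one rendered prompt piece at a time."""
    for piece in pieces:
        yield from tokenize_piece(piece)


def lazy_prefix_mask(seen_tokens, chunk_len):
    """Synthesize the full-prefix attention mask for one causal chunk.

    Built only when the chunk executes: all ones over the prefix already in
    the cache plus the current chunk.
    """
    return [1] * (seen_tokens + chunk_len)
```

Because the mask for a causal chunk is always a run of ones over the visible prefix, synthesizing it per chunk avoids carrying a growing mask tensor through prompt construction.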
Guard forward signature inspection for uninspectable callables, compute the forward-input filter once per strategy execution instead of per chunk, and keep the streamed prompt-ingestion helpers split below the repo standards soft file-size limit. The branch was reverified with the full gate after these fixes: ruff format/check, standards checker, ty, compileall, pytest (434 passed), build, pip_audit, mkdocs --strict, and git diff --check.
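The guarded signature inspection and once-per-execution filter described in this fix might look like the following sketch (function name and fallback policy are assumptions):

```python
import inspect


def build_forward_input_filter(forward):
    """Build a kwargs filter for a model forward, once per strategy execution.

    Drops inputs the forward signature rejects. If the callable cannot be
    inspected (builtins, C extensions, non-callables), fall back to passing
    everything through rather than failing per chunk.
    """
    try:
        params = inspect.signature(forward).parameters
    except (TypeError, ValueError):
        return lambda inputs: inputs
    # A **kwargs forward accepts anything, so no filtering is needed.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return lambda inputs: inputs
    accepted = set(params)
    return lambda inputs: {k: v for k, v in inputs.items() if k in accepted}
```

Computing the filter once and reusing the returned closure for every chunk is what moves the `inspect.signature` cost out of the chunk loop.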
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Accept mapping-like processor outputs in move_input_mapping, guard processor signature inspection in call_processor_for_static_inputs, and rename the prefill bookkeeping variable for clarity. Reverified with the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (434 passed), build, pip_audit, mkdocs --strict, and git diff --check.
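Accepting mapping-like processor outputs matters because processor return types (for example, Hugging Face `BatchFeature`) behave like mappings without being plain dicts. A minimal sketch of the accepting behavior, with `move_input_mapping` and the `to_device` hook as illustrative names:

```python
from collections.abc import Mapping


def move_input_mapping(inputs, to_device):
    """Move every value of a mapping-like processor output, returning a dict.

    Checks against collections.abc.Mapping instead of dict so duck-typed
    processor outputs are accepted; anything else is rejected explicitly.
    """
    if not isinstance(inputs, Mapping):
        raise TypeError(f"expected a mapping-like processor output, got {type(inputs)!r}")
    return {key: to_device(value) for key, value in inputs.items()}
```

Normalizing to a plain dict on the way out also keeps downstream code independent of the processor's concrete return type.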
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
Guard non-positive chunk budgets, avoid duplicating the full prompt token list during streamed causal ingestion, prefer special-token-safe tokenizer encode fallbacks, and remove the now-dead prompt-token helper from execution tracing. Reverified with the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (436 passed), build, pip_audit, mkdocs --strict, and git diff --check.
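The special-token-safe encode fallback described here could be sketched as below. The duck-typed tokenizer interfaces are assumptions; the point is the preference order, trying `add_special_tokens=False` first so special tokens are not injected mid-prompt:

```python
def encode_piece(tokenizer, text):
    """Encode one prompt piece, preferring a special-token-free encode path.

    Falls back across tokenizer interfaces: encode with the keyword, encode
    without it, and finally a bare __call__ returning {"input_ids": [...]}.
    """
    encode = getattr(tokenizer, "encode", None)
    if callable(encode):
        try:
            return encode(text, add_special_tokens=False)
        except TypeError:
            return encode(text)  # tokenizer does not take the keyword
    return tokenizer(text)["input_ids"]
```

Encoding each rendered piece separately like this is also what lets the streamed ingestion path avoid duplicating the full prompt token list.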
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
Preserve backend-provided chunked-prefill metadata, blank boundary/mask metadata when no strategy is active, guard processor prompt rendering and static-input signatures, and restore the non-callable forward fallback. Also add regressions for the latest PR review findings and keep the repo standards checker green by moving the backend-metadata test into its own file. Reverified with the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (441 passed), build, pip_audit, mkdocs --strict, and git diff --check.
Add the non-callable forward fallback regression and rerun the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (442 passed), build, pip_audit, mkdocs --strict, and git diff --check.
Summary
This PR completes ollm-6b7 by landing the full chunked-ingestion scope that was originally requested, rather than stopping at a partial causal-only subset.

Implemented strategy lanes:

- optimized-native-text
- optimized-native-multimodal
- transformers-generic-text
- transformers-generic-multimodal
- transformers-generic-seq2seq-source

The runtime now streams prompt tokenization from rendered prompt pieces inside the strategy path, synthesizes prefix attention masks lazily for causal chunked lanes, and gives seq2seq source prompts their own explicit streamed source-ingestion strategy instead of pretending they share the causal-cache contract.
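The seq2seq distinction above can be illustrated with a short sketch: causal lanes hand a KV cache from chunk to chunk, but a seq2seq encoder attends bidirectionally over the whole source, so a source-ingestion strategy can only stream the tokenization, not the encoder pass itself. All names here are illustrative:

```python
def ingest_seq2seq_source(pieces, tokenize_piece, encode_source):
    """Stream source tokenization piece by piece, then encode once.

    Unlike the causal lanes, no per-chunk cache handoff is possible: the
    encoder needs the complete source sequence before producing states.
    """
    source_ids = []
    for piece in pieces:
        source_ids.extend(tokenize_piece(piece))
    return encode_source(source_ids)
```

This is why the causal-cache contract cannot simply be reused for seq2seq source prompts, as the T5 contract test in this PR demonstrates.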
Code changes
- src/ollm/runtime/chunked_prefill.py
- src/ollm/runtime/chunked_prefill_support.py to keep standards checks green and isolate the lower-level tokenization/filtering helpers
- src/ollm/runtime/generation.py
- src/ollm/runtime/execution_trace.py
- src/ollm/runtime/benchmark/probe_execution.py
- src/ollm/runtime/benchmark/probe_types.py
- src/ollm/runtime/benchmark/probe_serialization.py
- src/ollm/runtime/benchmark/details.py
- src/ollm/runtime/benchmark/chunked_prefill_serialization.py
- README.md, docs/benchmarking.md, and docs/guides/optimization.md

Review follow-up fixes
The current PR head also includes the later review-thread hardening work:
Behavioral notes
Verification
Ran the full required gate from the final formatted tree on the current PR head:
- uv run ruff format src tests examples scripts
- uv run python scripts/check_python_standards.py
- uv run ruff check src tests examples scripts
- uv run ty check src tests scripts
- uv run python -m compileall src tests scripts
- uv run pytest -q -> 434 passed
- uv build
- uv run python -m pip_audit -> No known vulnerabilities found
- uv run --group docs mkdocs build --strict
- git diff --check

Tracker
- ollm-6b7 closed
- ollm-qm9
- ollm-dnl
- ollm-o7a