feat: implement streamed chunked ingestion across runtime strategies #49
beardedeagle merged 9 commits into main from
Conversation
Add a typed chunked-prefill scope surface so the remaining feature boundaries are encoded in runtime and benchmark code instead of being left as doc-only caveats. The new surface records runtime eligibility, whether bounded chunked prefill actually ran, the post-tokenization/full-prefix-mask execution boundary, and explicit reject decisions for the two previously open scope gaps. Thread the new scope facts through runtime execution tracing, benchmark request metrics, benchmark JSON parsing, and request metric summaries. Add focused regression coverage for the supported optimized-native causal path, the unsupported seq2seq path, benchmark trace/report serialization, and the new scope decisions. Update README and benchmarking/optimization docs so user-facing guidance matches the code-backed contract: chunked prefill begins after prompt construction, remains limited to optimized-native causal text runtimes, and publishes its non-goals through benchmark output.
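The typed scope surface described above could look like the following minimal sketch. Field and class names (`ChunkedPrefillScope`, `runtime_eligible`, `boundary`, `rejects`) follow the prose, not the actual ollm definitions:

```python
from dataclasses import dataclass


# Hypothetical shape of the typed chunked-prefill scope surface: it records
# runtime eligibility, whether bounded chunked prefill actually ran, the
# execution boundary, and explicit reject decisions for out-of-scope paths.
@dataclass(frozen=True)
class ChunkedPrefillScope:
    runtime_eligible: bool        # the selected runtime lane can chunk at all
    ran: bool                     # bounded chunked prefill actually executed
    boundary: str                 # e.g. "post-tokenization/full-prefix-mask"
    rejects: tuple[str, ...] = () # explicit scope-reject decisions

    def as_json(self) -> dict:
        """Serialize the scope facts for benchmark JSON output."""
        return {
            "eligible": self.runtime_eligible,
            "ran": self.ran,
            "boundary": self.boundary,
            "rejects": list(self.rejects),
        }
```

A frozen dataclass keeps the scope facts immutable once the runtime has recorded them, so tracing and benchmark serialization cannot drift apart.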
Replace the earlier scope-only chunked-prefill closure with real strategy-backed runtime support. The runtime now resolves four causal chunked-prefill strategies: optimized-native text, optimized-native processor-backed multimodal, transformers-generic text, and transformers-generic processor-backed multimodal. The shared chunked-prefill core handles sequence slicing, static multimodal inputs, cache handoff, and final generate input reduction without duplicating the causal prefill loop. Thread the selected strategy through runtime metadata, execution tracing, and benchmark JSON/reporting so request metrics identify which chunked-prefill lane ran. Update the regression suite to cover the new strategy lanes and add a tiny local T5 proof showing why seq2seq source prompts still need a separate encoder-side design instead of pretending they can reuse the causal-cache contract. Refresh README and benchmarking/optimization docs so the public contract matches the implementation: four supported causal strategy lanes now exist, while prompt-construction streaming/lazy-mask work and seq2seq source-ingestion chunking are tracked separately as follow-on beads ollm-qm9 and ollm-dnl.
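The shared chunked-prefill core described above can be sketched as follows. This is an illustrative reduction, not the ollm implementation: `forward` stands in for a model forward pass that consumes one chunk plus the accumulated KV cache and returns the updated cache.

```python
# Minimal sketch of a shared causal chunked-prefill loop: slice the prompt
# into bounded chunks, hand the cache between chunks, and leave only the
# final chunk for the generate call (earlier tokens already live in the cache).

def chunk_token_ids(token_ids, chunk_size):
    """Slice the prompt into fixed-size chunks; the last chunk may be short."""
    if chunk_size <= 0:
        raise ValueError("chunk budget must be positive")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def prefill_in_chunks(forward, token_ids, chunk_size):
    """Run causal prefill chunk by chunk with cache handoff.

    Returns the cache built from all but the last chunk, plus the tail chunk
    that the reduced generate input still needs to carry.
    """
    cache = None
    chunks = chunk_token_ids(token_ids, chunk_size)
    for chunk in chunks[:-1]:
        cache = forward(chunk, cache)
    return cache, chunks[-1]
```

The "final generate input reduction" mentioned above corresponds to returning only the tail chunk: the generate step does not re-feed tokens the cache already covers.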
Pull request overview
This PR expands chunked-prefill from an optimized-native-text-only path into a strategy-backed runtime feature across the causal runtime lanes, and propagates the selected strategy/scope through runtime metadata, execution tracing, and benchmark probe serialization.
Changes:
- Added a new `chunked_prefill` runtime module that resolves a `ChunkedPrefillStrategyId` and executes causal chunked-prefill while tracking scope/eligibility.
- Updated generation + execution tracing + benchmark probe metrics/JSON to carry `chunked_prefill` scope details end-to-end.
- Expanded tests and docs to cover the four causal strategy lanes and to keep seq2seq source prompts explicitly deferred.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/ollm/runtime/chunked_prefill.py | New strategy/scope surface + shared causal chunked-prefill implementation. |
| src/ollm/runtime/generation.py | Switches to strategy-backed chunked-prefill preparation and includes chunked-prefill metadata fields. |
| src/ollm/runtime/execution_trace.py | Adds chunked-prefill scope to execution traces for benchmarking. |
| src/ollm/runtime/benchmark/probe_types.py | Extends RequestProbeMetrics with mandatory chunked_prefill scope. |
| src/ollm/runtime/benchmark/probe_execution.py | Threads chunked-prefill scope from trace into request metrics. |
| src/ollm/runtime/benchmark/probe_serialization.py | Parses/serializes the new chunked_prefill request section in probe JSON. |
| src/ollm/runtime/benchmark/chunked_prefill_serialization.py | New parser helpers for chunked_prefill JSON payloads. |
| src/ollm/runtime/benchmark/details.py | Includes chunked-prefill scope in summarized request metrics. |
| tests/test_runtime_executor_prefill.py | Adds coverage for native/generic text + multimodal chunked-prefill behavior and metadata. |
| tests/test_chunked_prefill_scope.py | Validates surfaced scope and explicitly defers seq2seq with a tiny T5 contract test. |
| tests/test_benchmark_probe_execution.py | Asserts chunked-prefill scope appears in probe execution and traces; updates wrapper signature. |
| tests/test_benchmark_reporting.py | Ensures runtime probe JSON round-trips include chunked-prefill details. |
| tests/benchmark_support.py | Updates probe metrics fixtures to include chunked-prefill scope. |
| README.md | Documents supported chunked-prefill strategy lanes and current limitations. |
| docs/guides/optimization.md | Updates optimization guide with the new chunked-prefill lane matrix and caveats. |
| docs/benchmarking.md | Documents chunked-prefill reporting in benchmark request metrics. |
| .beads/interactions.jsonl | Updates tracker interaction history for ollm-6b7. |
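A parser for the new `chunked_prefill` request section in probe JSON might look like the sketch below. The field names are assumptions mirroring the scope facts described in this PR, not the exact ollm schema:

```python
# Illustrative parser for a chunked_prefill payload in benchmark probe JSON.
# Tolerates an absent section so older probe files still round-trip.

def parse_chunked_prefill(section):
    """Normalize a chunked_prefill JSON section into a plain dict."""
    if section is None:
        return {"eligible": False, "ran": False, "strategy": None, "rejects": []}
    return {
        "eligible": bool(section.get("eligible", False)),
        "ran": bool(section.get("ran", False)),
        "strategy": section.get("strategy"),
        "rejects": list(section.get("rejects", [])),
    }
```

Defaulting a missing section to "not eligible, did not run" keeps summarized request metrics well-defined even for payloads produced before this feature landed.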
Replace the pure strategy-id resolution path with explicit per-lane chunked-prefill handlers for optimized-native text, optimized-native multimodal, transformers-generic text, and transformers-generic multimodal. The handlers still share the causal chunk loop, but each lane now has its own matcher and runner so future divergence can happen on a concrete strategy boundary instead of implicit branching. The full required gate was rerun from the final formatted tree before this commit: ruff format/check, standards checker, ty, compileall, pytest (432 passed), build, pip_audit, mkdocs --strict, and git diff --check.
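The per-lane handler structure described in this commit can be sketched as a matcher/runner table. The `Request` shape and lane predicates below are illustrative assumptions; only the lane names come from the PR:

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical request shape used only to drive lane matching in this sketch.
@dataclass
class Request:
    backend: str          # "optimized-native" or "transformers-generic"
    multimodal: bool
    seq2seq: bool = False


@dataclass
class Lane:
    strategy_id: str
    matches: Callable[[Request], bool]


# One explicit matcher per lane, so future divergence happens on a concrete
# strategy boundary instead of implicit branching inside a shared resolver.
LANES = [
    Lane("optimized-native-text",
         lambda r: r.backend == "optimized-native" and not r.multimodal and not r.seq2seq),
    Lane("optimized-native-multimodal",
         lambda r: r.backend == "optimized-native" and r.multimodal and not r.seq2seq),
    Lane("transformers-generic-text",
         lambda r: r.backend == "transformers-generic" and not r.multimodal and not r.seq2seq),
    Lane("transformers-generic-multimodal",
         lambda r: r.backend == "transformers-generic" and r.multimodal and not r.seq2seq),
]


def resolve_strategy(request: Request) -> Optional[str]:
    """Return the first matching lane's strategy id, or None when rejected."""
    for lane in LANES:
        if lane.matches(request):
            return lane.strategy_id
    return None
```

Seq2seq requests fall through every causal matcher here, which is the explicit-reject behavior the scope surface records.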
Complete the remaining ollm-6b7 work by moving chunked strategy preparation ahead of eager model-input construction. The runtime now streams prompt tokenization from rendered prompt pieces, synthesizes prefix attention masks lazily for causal chunked lanes, and adds a dedicated transformers-generic seq2seq source-ingestion strategy instead of leaving seq2seq outside the feature. This update also splits chunked prompt support into a smaller support module to keep the repo standards checker green, refreshes benchmark/request metadata and docs to the final contract, and closes the absorbed follow-on beads ollm-qm9 and ollm-dnl. Verification rerun from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (432 passed), build, pip_audit, mkdocs --strict, and git diff --check.
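The streamed tokenization and lazy prefix-mask behavior described above can be reduced to two small helpers. Both names and shapes are illustrative; `tokenize_piece` stands in for the tokenizer:

```python
# Sketch of streaming prompt tokenization from rendered prompt pieces:
# token ids are produced piece by piece instead of materializing the whole
# prompt before chunking begins.

def stream_prompt_tokens(pieces, tokenize_piece):
    """Yield token ids lazily, one rendered prompt piece at a time."""
    for piece in pieces:
        yield from tokenize_piece(piece)


def lazy_prefix_mask(seen_tokens, chunk_len):
    """Synthesize the full-prefix attention mask for one causal chunk.

    Built only when the chunk executes: all ones over the prefix already in
    the cache plus the current chunk.
    """
    return [1] * (seen_tokens + chunk_len)
```

Because the mask for a causal chunk is always a run of ones over the visible prefix, synthesizing it per chunk avoids carrying a growing mask tensor through prompt construction.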
Guard forward signature inspection for uninspectable callables, compute the forward-input filter once per strategy execution instead of per chunk, and keep the streamed prompt-ingestion helpers split below the repo standards soft file-size limit. The branch was reverified with the full gate after these fixes: ruff format/check, standards checker, ty, compileall, pytest (434 passed), build, pip_audit, mkdocs --strict, and git diff --check.
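The guarded signature inspection and once-per-execution filter described in this fix might look like the following sketch (function name and fallback policy are assumptions):

```python
import inspect


def build_forward_input_filter(forward):
    """Build a kwargs filter for a model forward, once per strategy execution.

    Drops inputs the forward signature rejects. If the callable cannot be
    inspected (builtins, C extensions, non-callables), fall back to passing
    everything through rather than failing per chunk.
    """
    try:
        params = inspect.signature(forward).parameters
    except (TypeError, ValueError):
        return lambda inputs: inputs
    # A **kwargs forward accepts anything, so no filtering is needed.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return lambda inputs: inputs
    accepted = set(params)
    return lambda inputs: {k: v for k, v in inputs.items() if k in accepted}
```

Computing the filter once and reusing the returned closure for every chunk is what moves the `inspect.signature` cost out of the chunk loop.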
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Accept mapping-like processor outputs in move_input_mapping, guard processor signature inspection in call_processor_for_static_inputs, and rename the prefill bookkeeping variable for clarity. Reverified with the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (434 passed), build, pip_audit, mkdocs --strict, and git diff --check.
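Accepting mapping-like processor outputs matters because processor return types (for example, Hugging Face `BatchFeature`) behave like mappings without being plain dicts. A minimal sketch of the accepting behavior, with `move_input_mapping` and the `to_device` hook as illustrative names:

```python
from collections.abc import Mapping


def move_input_mapping(inputs, to_device):
    """Move every value of a mapping-like processor output, returning a dict.

    Checks against collections.abc.Mapping instead of dict so duck-typed
    processor outputs are accepted; anything else is rejected explicitly.
    """
    if not isinstance(inputs, Mapping):
        raise TypeError(f"expected a mapping-like processor output, got {type(inputs)!r}")
    return {key: to_device(value) for key, value in inputs.items()}
```

Normalizing to a plain dict on the way out also keeps downstream code independent of the processor's concrete return type.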
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
Guard non-positive chunk budgets, avoid duplicating the full prompt token list during streamed causal ingestion, prefer special-token-safe tokenizer encode fallbacks, and remove the now-dead prompt-token helper from execution tracing. Reverified with the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (436 passed), build, pip_audit, mkdocs --strict, and git diff --check.
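The special-token-safe encode fallback described here could be sketched as below. The duck-typed tokenizer interfaces are assumptions; the point is the preference order, trying `add_special_tokens=False` first so special tokens are not injected mid-prompt:

```python
def encode_piece(tokenizer, text):
    """Encode one prompt piece, preferring a special-token-free encode path.

    Falls back across tokenizer interfaces: encode with the keyword, encode
    without it, and finally a bare __call__ returning {"input_ids": [...]}.
    """
    encode = getattr(tokenizer, "encode", None)
    if callable(encode):
        try:
            return encode(text, add_special_tokens=False)
        except TypeError:
            return encode(text)  # tokenizer does not take the keyword
    return tokenizer(text)["input_ids"]
```

Encoding each rendered piece separately like this is also what lets the streamed ingestion path avoid duplicating the full prompt token list.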
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
Preserve backend-provided chunked-prefill metadata, blank boundary/mask metadata when no strategy is active, guard processor prompt rendering and static-input signatures, and restore the non-callable forward fallback. Also add regressions for the latest PR review findings and keep the repo standards checker green by moving the backend-metadata test into its own file. Reverified with the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (441 passed), build, pip_audit, mkdocs --strict, and git diff --check.
Add the non-callable forward fallback regression and rerun the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (442 passed), build, pip_audit, mkdocs --strict, and git diff --check.
Summary
This PR completes ollm-6b7 by landing the full chunked-ingestion scope that was originally requested, rather than stopping at a partial causal-only subset.

Implemented strategy lanes:

- optimized-native-text
- optimized-native-multimodal
- transformers-generic-text
- transformers-generic-multimodal
- transformers-generic-seq2seq-source

The runtime now streams prompt tokenization from rendered prompt pieces inside the strategy path, synthesizes prefix attention masks lazily for causal chunked lanes, and gives seq2seq source prompts their own explicit streamed source-ingestion strategy instead of pretending they share the causal-cache contract.
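The seq2seq distinction above can be illustrated with a short sketch: causal lanes hand a KV cache from chunk to chunk, but a seq2seq encoder attends bidirectionally over the whole source, so a source-ingestion strategy can only stream the tokenization, not the encoder pass itself. All names here are illustrative:

```python
def ingest_seq2seq_source(pieces, tokenize_piece, encode_source):
    """Stream source tokenization piece by piece, then encode once.

    Unlike the causal lanes, no per-chunk cache handoff is possible: the
    encoder needs the complete source sequence before producing states.
    """
    source_ids = []
    for piece in pieces:
        source_ids.extend(tokenize_piece(piece))
    return encode_source(source_ids)
```

This is why the causal-cache contract cannot simply be reused for seq2seq source prompts, as the T5 contract test in this PR demonstrates.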
Code changes
- src/ollm/runtime/chunked_prefill.py
- src/ollm/runtime/chunked_prefill_support.py to keep standards checks green and isolate the lower-level tokenization/filtering helpers
- src/ollm/runtime/generation.py
- src/ollm/runtime/execution_trace.py
- src/ollm/runtime/benchmark/probe_execution.py
- src/ollm/runtime/benchmark/probe_types.py
- src/ollm/runtime/benchmark/probe_serialization.py
- src/ollm/runtime/benchmark/details.py
- src/ollm/runtime/benchmark/chunked_prefill_serialization.py
- README.md, docs/benchmarking.md, and docs/guides/optimization.md

Review follow-up fixes
The current PR head also includes the later review-thread hardening work:
Behavioral notes
Verification
Ran the full required gate from the final formatted tree on the current PR head:
- uv run ruff format src tests examples scripts
- uv run python scripts/check_python_standards.py
- uv run ruff check src tests examples scripts
- uv run ty check src tests scripts
- uv run python -m compileall src tests scripts
- uv run pytest -q -> 434 passed
- uv build
- uv run python -m pip_audit -> No known vulnerabilities found
- uv run --group docs mkdocs build --strict
- git diff --check

Tracker
- ollm-6b7 closed
- ollm-qm9
- ollm-dnl
- ollm-o7a