
feat: implement streamed chunked ingestion across runtime strategies#49

Merged
beardedeagle merged 9 commits into main from feat-chunked-prefill-scope-closure
Apr 3, 2026
Conversation

@beardedeagle
Owner

@beardedeagle commented Apr 3, 2026

Summary

This PR completes ollm-6b7 by landing the full chunked-ingestion scope that was originally requested, rather than stopping at a partial causal-only subset.

Implemented strategy lanes:

  • optimized-native-text
  • optimized-native-multimodal
  • transformers-generic-text
  • transformers-generic-multimodal
  • transformers-generic-seq2seq-source

The runtime now streams prompt tokenization from rendered prompt pieces inside the strategy path, synthesizes prefix attention masks lazily for causal chunked lanes, and gives seq2seq source prompts their own explicit streamed source-ingestion strategy instead of pretending they share the causal-cache contract.
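The streamed tokenization and lazy prefix-mask behavior described above can be sketched roughly like this; all names are illustrative stand-ins (not the actual ollm API), and tensors are simplified to plain Python lists:

```python
from typing import Callable, Iterator, List


def stream_prompt_tokens(
    pieces: List[str],
    encode: Callable[[str], List[int]],
    chunk_budget: int,
) -> Iterator[List[int]]:
    """Yield token chunks of at most ``chunk_budget`` tokens without
    materializing the full prompt token list up front."""
    if chunk_budget <= 0:
        raise ValueError("chunk_budget must be positive")
    buffer: List[int] = []
    for piece in pieces:
        buffer.extend(encode(piece))
        while len(buffer) >= chunk_budget:
            yield buffer[:chunk_budget]
            buffer = buffer[chunk_budget:]
    if buffer:
        yield buffer


def prefix_attention_mask(prefix_len: int, chunk_len: int) -> List[int]:
    """Synthesize a causal prefix mask for one chunk on demand: with a
    KV cache holding ``prefix_len`` already-ingested tokens, each chunk
    position attends to the whole prefix plus itself, so a flat ones
    mask over the running length suffices."""
    return [1] * (prefix_len + chunk_len)
```

The key property is that neither the full token list nor the full attention mask exists before the chunk loop starts; each chunk's mask is derived from the running prefix length at the moment the chunk is fed to the model.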

Code changes

  • Added streamed chunked-ingestion strategy handling in src/ollm/runtime/chunked_prefill.py
  • Split support helpers into src/ollm/runtime/chunked_prefill_support.py to keep standards checks green and isolate the lower-level tokenization/filtering helpers
  • Wired the final strategy contract through:
    • src/ollm/runtime/generation.py
    • src/ollm/runtime/execution_trace.py
    • src/ollm/runtime/benchmark/probe_execution.py
    • src/ollm/runtime/benchmark/probe_types.py
    • src/ollm/runtime/benchmark/probe_serialization.py
    • src/ollm/runtime/benchmark/details.py
    • src/ollm/runtime/benchmark/chunked_prefill_serialization.py
  • Expanded regression coverage for:
    • all four causal strategy lanes
    • streamed seq2seq source-ingestion behavior
    • benchmark/request metadata round-trip
    • local proof for seq2seq encoder behavior
    • review-driven regressions around uninspectable callables and one-time forward-filter construction
  • Updated docs in README.md, docs/benchmarking.md, and docs/guides/optimization.md

Review follow-up fixes

The current PR head also includes the later review-thread hardening work:

  • guard forward-signature inspection for uninspectable callables
  • build the forward-input filter once per strategy execution instead of once per chunk
  • keep the chunked prompt-ingestion implementation split below the repo standards soft file-size limit
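The first two hardening fixes can be illustrated with a small sketch; the helper names here are hypothetical, not the repository's actual API:

```python
import inspect
from typing import Any, Callable, Dict, Optional, Set


def forward_param_names(forward: Callable[..., Any]) -> Optional[Set[str]]:
    """Return the keyword names ``forward`` accepts, or ``None`` when
    the callable is uninspectable (some C extensions and builtins raise
    here), in which case the caller passes inputs through unfiltered."""
    try:
        signature = inspect.signature(forward)
    except (TypeError, ValueError):
        return None
    params = signature.parameters.values()
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return None  # **kwargs accepts everything; no filtering needed
    return {p.name for p in params}


def make_input_filter(
    forward: Callable[..., Any],
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build the forward-input filter once per strategy execution; the
    returned closure is then cheaply applied to every chunk."""
    accepted = forward_param_names(forward)
    if accepted is None:
        return lambda inputs: dict(inputs)
    return lambda inputs: {k: v for k, v in inputs.items() if k in accepted}
```

Hoisting the `inspect.signature` call out of the chunk loop matters because signature inspection is comparatively expensive and its result cannot change between chunks of the same execution.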

Behavioral notes

  • Chunked ingestion is no longer limited to optimized-native decoder-only text
  • Prompt tokenization and causal prefix masks are no longer fully materialized before the chunked strategy begins
  • Seq2seq source prompts are implemented through a dedicated streamed source-ingestion lane
  • The causal strategies exist as explicit strategy handlers rather than only as strategy IDs over one undifferentiated entrypoint

Verification

Ran the full required gate from the final formatted tree on the current PR head:

  • uv run ruff format src tests examples scripts
  • uv run python scripts/check_python_standards.py
  • uv run ruff check src tests examples scripts
  • uv run ty check src tests scripts
  • uv run python -m compileall src tests scripts
  • uv run pytest -q -> 434 passed
  • uv build
  • uv run python -m pip_audit -> No known vulnerabilities found
  • uv run --group docs mkdocs build --strict
  • git diff --check

Tracker

  • ollm-6b7 closed
  • absorbed follow-on beads closed:
    • ollm-qm9
    • ollm-dnl
  • remaining optional follow-on retained:
    • ollm-o7a

Add a typed chunked-prefill scope surface so the remaining feature boundaries are encoded in runtime and benchmark code instead of being left as doc-only caveats. The new surface records runtime eligibility, whether bounded chunked prefill actually ran, the post-tokenization/full-prefix-mask execution boundary, and explicit reject decisions for the two previously open scope gaps.

Thread the new scope facts through runtime execution tracing, benchmark request metrics, benchmark JSON parsing, and request metric summaries. Add focused regression coverage for the supported optimized-native causal path, the unsupported seq2seq path, benchmark trace/report serialization, and the new scope decisions.

Update README and benchmarking/optimization docs so user-facing guidance matches the code-backed contract: chunked prefill begins after prompt construction, remains limited to optimized-native causal text runtimes, and publishes its non-goals through benchmark output.
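A minimal sketch of what such a typed scope surface could look like; the class and field names are assumptions for illustration, not the actual ollm definitions:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ChunkedPrefillScope:
    """Code-backed record of what chunked prefill did, and deliberately
    did not do, for one request (hypothetical field names)."""

    eligible: bool            # runtime eligibility for chunked prefill
    ran: bool                 # whether bounded chunked prefill actually ran
    boundary: str             # e.g. "post-tokenization/full-prefix-mask"
    rejects: Dict[str, str] = field(default_factory=dict)  # scope gap -> reason

    def to_json(self) -> dict:
        """Serialize for benchmark request metrics / probe JSON."""
        return {
            "eligible": self.eligible,
            "ran": self.ran,
            "boundary": self.boundary,
            "rejects": dict(self.rejects),
        }
```

Carrying this record through execution traces and benchmark JSON is what turns the doc-only caveats into a contract that tests can assert against.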
Replace the earlier scope-only chunked-prefill closure with real strategy-backed runtime support. The runtime now resolves four causal chunked-prefill strategies: optimized-native text, optimized-native processor-backed multimodal, transformers-generic text, and transformers-generic processor-backed multimodal. The shared chunked-prefill core handles sequence slicing, static multimodal inputs, cache handoff, and final generate input reduction without duplicating the causal prefill loop.

Thread the selected strategy through runtime metadata, execution tracing, and benchmark JSON/reporting so request metrics identify which chunked-prefill lane ran. Update the regression suite to cover the new strategy lanes and add a tiny local T5 proof showing why seq2seq source prompts still need a separate encoder-side design instead of pretending they can reuse the causal-cache contract.

Refresh README and benchmarking/optimization docs so the public contract matches the implementation: four supported causal strategy lanes now exist, while prompt-construction streaming/lazy-mask work and seq2seq source-ingestion chunking are tracked separately as follow-on beads ollm-qm9 and ollm-dnl.
Copilot AI review requested due to automatic review settings April 3, 2026 12:12

Copilot AI left a comment


Pull request overview

This PR expands chunked-prefill from an optimized-native-text-only path into a strategy-backed runtime feature across the causal runtime lanes, and propagates the selected strategy/scope through runtime metadata, execution tracing, and benchmark probe serialization.

Changes:

  • Added a new chunked_prefill runtime module that resolves a ChunkedPrefillStrategyId and executes causal chunked-prefill while tracking scope/eligibility.
  • Updated generation + execution tracing + benchmark probe metrics/JSON to carry chunked_prefill scope details end-to-end.
  • Expanded tests and docs to cover the four causal strategy lanes and to keep seq2seq source prompts explicitly deferred.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/ollm/runtime/chunked_prefill.py: new strategy/scope surface plus the shared causal chunked-prefill implementation.
  • src/ollm/runtime/generation.py: switches to strategy-backed chunked-prefill preparation and includes chunked-prefill metadata fields.
  • src/ollm/runtime/execution_trace.py: adds chunked-prefill scope to execution traces for benchmarking.
  • src/ollm/runtime/benchmark/probe_types.py: extends RequestProbeMetrics with a mandatory chunked_prefill scope.
  • src/ollm/runtime/benchmark/probe_execution.py: threads chunked-prefill scope from trace into request metrics.
  • src/ollm/runtime/benchmark/probe_serialization.py: parses/serializes the new chunked_prefill request section in probe JSON.
  • src/ollm/runtime/benchmark/chunked_prefill_serialization.py: new parser helpers for chunked_prefill JSON payloads.
  • src/ollm/runtime/benchmark/details.py: includes chunked-prefill scope in summarized request metrics.
  • tests/test_runtime_executor_prefill.py: adds coverage for native/generic text and multimodal chunked-prefill behavior and metadata.
  • tests/test_chunked_prefill_scope.py: validates surfaced scope and explicitly defers seq2seq with a tiny T5 contract test.
  • tests/test_benchmark_probe_execution.py: asserts chunked-prefill scope appears in probe execution and traces; updates the wrapper signature.
  • tests/test_benchmark_reporting.py: ensures runtime probe JSON round-trips include chunked-prefill details.
  • tests/benchmark_support.py: updates probe metrics fixtures to include chunked-prefill scope.
  • README.md: documents supported chunked-prefill strategy lanes and current limitations.
  • docs/guides/optimization.md: updates the optimization guide with the new chunked-prefill lane matrix and caveats.
  • docs/benchmarking.md: documents chunked-prefill reporting in benchmark request metrics.
  • .beads/interactions.jsonl: updates tracker interaction history for ollm-6b7.


Replace the pure strategy-id resolution path with explicit per-lane chunked-prefill handlers for optimized-native text, optimized-native multimodal, transformers-generic text, and transformers-generic multimodal. The handlers still share the causal chunk loop, but each lane now has its own matcher and runner so future divergence can happen on a concrete strategy boundary instead of implicit branching.

The full required gate was rerun from the final formatted tree before this commit: ruff format/check, standards checker, ty, compileall, pytest (432 passed), build, pip_audit, mkdocs --strict, and git diff --check.
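The per-lane matcher/runner split described in this commit can be sketched roughly as follows; the class, function, and strategy-id strings are illustrative, not the repository's actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class StrategyLane:
    """One chunked-prefill lane: a matcher deciding whether the lane
    applies to a request, and a runner executing its chunk loop."""

    strategy_id: str
    matches: Callable[[Dict[str, Any]], bool]
    run: Callable[[Dict[str, Any]], Any]


def resolve_strategy(
    lanes: List[StrategyLane], request: Dict[str, Any]
) -> StrategyLane:
    """First match wins: explicit per-lane dispatch instead of implicit
    branching inside a single undifferentiated entrypoint."""
    for lane in lanes:
        if lane.matches(request):
            return lane
    raise LookupError("no chunked-prefill strategy lane matches this request")
```

Because each lane owns its own matcher and runner, a future behavioral divergence (say, multimodal-specific cache handling) lands in exactly one lane instead of adding another condition to a shared branch.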
Complete the remaining ollm-6b7 work by moving chunked strategy preparation ahead of eager model-input construction. The runtime now streams prompt tokenization from rendered prompt pieces, synthesizes prefix attention masks lazily for causal chunked lanes, and adds a dedicated transformers-generic seq2seq source-ingestion strategy instead of leaving seq2seq outside the feature.

This update also splits chunked prompt support into a smaller support module to keep the repo standards checker green, refreshes benchmark/request metadata and docs to the final contract, and closes the absorbed follow-on beads ollm-qm9 and ollm-dnl.

Verification rerun from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (432 passed), build, pip_audit, mkdocs --strict, and git diff --check.
@beardedeagle changed the title from "feat: expand chunked prefill across causal runtime lanes" to "feat: implement streamed chunked ingestion across runtime strategies" on Apr 3, 2026
Guard forward signature inspection for uninspectable callables, compute the forward-input filter once per strategy execution instead of per chunk, and keep the streamed prompt-ingestion helpers split below the repo standards soft file-size limit. The branch was reverified with the full gate after these fixes: ruff format/check, standards checker, ty, compileall, pytest (434 passed), build, pip_audit, mkdocs --strict, and git diff --check.

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.



Accept mapping-like processor outputs in move_input_mapping, guard processor signature inspection in call_processor_for_static_inputs, and rename the prefill bookkeeping variable for clarity. Reverified with the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (434 passed), build, pip_audit, mkdocs --strict, and git diff --check.

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.



Guard non-positive chunk budgets, avoid duplicating the full prompt token list during streamed causal ingestion, prefer special-token-safe tokenizer encode fallbacks, and remove the now-dead prompt-token helper from execution tracing. Reverified with the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (436 passed), build, pip_audit, mkdocs --strict, and git diff --check.
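The special-token-safe encode fallback mentioned here can be sketched as follows; the helper name and the stub tokenizers are hypothetical stand-ins for real tokenizer objects:

```python
from typing import List


def encode_chunk_safe(tokenizer, text: str) -> List[int]:
    """Encode a mid-prompt chunk without injecting special tokens.

    Prefer the keyword form so tokenizers that add BOS/EOS by default do
    not insert them at every chunk boundary; fall back to a plain encode
    for minimal tokenizer stubs that lack the keyword."""
    try:
        return tokenizer.encode(text, add_special_tokens=False)
    except TypeError:
        return tokenizer.encode(text)


class SpecialTokenTokenizer:
    """Hypothetical tokenizer that adds BOS(0)/EOS(1) by default."""

    def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
        ids = [ord(c) for c in text]
        return [0] + ids + [1] if add_special_tokens else ids


class PlainTokenizer:
    """Hypothetical minimal tokenizer without the keyword."""

    def encode(self, text: str) -> List[int]:
        return [ord(c) for c in text]
```

Without this guard, each chunk of a streamed prompt would pick up its own BOS/EOS markers, corrupting the reassembled token sequence.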

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.



Preserve backend-provided chunked-prefill metadata, blank boundary/mask metadata when no strategy is active, guard processor prompt rendering and static-input signatures, and restore the non-callable forward fallback. Also add regressions for the latest PR review findings and keep the repo standards checker green by moving the backend-metadata test into its own file. Reverified with the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (441 passed), build, pip_audit, mkdocs --strict, and git diff --check.
Add the non-callable forward fallback regression and rerun the full required gate from the final formatted tree: ruff format/check, standards checker, ty, compileall, pytest (442 passed), build, pip_audit, mkdocs --strict, and git diff --check.
@beardedeagle merged commit 4c61af5 into main on Apr 3, 2026
5 checks passed
@beardedeagle deleted the feat-chunked-prefill-scope-closure branch on April 3, 2026 at 15:37