Skip to content

Skip multipart preamble bytes before the first boundary#260

Open
Kludex wants to merge 2 commits intomasterfrom
skip-preamble
Open

Skip multipart preamble bytes before the first boundary#260
Kludex wants to merge 2 commits intomasterfrom
skip-preamble

Conversation

@Kludex
Copy link
Copy Markdown
Owner

@Kludex Kludex commented Apr 10, 2026

Summary

  • Add a PREAMBLE parser state that ignores bytes before the first boundary delimiter, per RFC 2046 section 5.1.1. Closes Skip header area? #59.
  • Raise MultipartParseError in finalize() if no boundary was ever found, preserving the existing behavior for garbage input.
  • The first non-CRLF byte determines the branch: a leading - still routes through START_BOUNDARY (unchanged), anything else enters PREAMBLE and scans forward for the boundary. This keeps all existing error cases (bad_initial_boundary, empty_message_with_bad_end) intact.

Test plan

  • Two new HTTP fixtures (preamble, preamble_mime_headers) covering the scenario from Skip header area? #59 (machine-generated payload with MIME-style headers before the first boundary).
  • Added both fixtures to single_byte_tests to exercise the byte-by-byte path where the boundary straddles chunk boundaries.
  • test_preamble_is_ignored, test_preamble_split_across_writes, test_finalize_raises_when_no_boundary_found.
  • All 141 tests pass, 100% line coverage, ruff + mypy clean.

RFC 2046 section 5.1.1 says parsers must ignore any bytes before the
first boundary delimiter. Previously the parser errored on such input,
breaking machine-generated payloads (e.g. email.mime output piped into
an HTTP body). Add a PREAMBLE state that scans forward for the boundary
while preserving the existing error path when no boundary is found.

Closes #59
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6bfb1ae05e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1184 to +1188
i0 = data.find(boundary, i, length)
if i0 >= 0:
# Found the full boundary. Hand off to START_BOUNDARY
# positioned to validate the trailing bytes (either
# CRLF for a new part, or "--" for an empty message).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Continue scanning preamble after boundary-prefix false matches

When PREAMBLE uses data.find(boundary, ...) it treats any raw \r\n--<boundary> occurrence as a real delimiter and immediately switches to START_BOUNDARY; if the next byte is not CR/- (e.g. preamble line --boundaryX), START_BOUNDARY raises instead of continuing to scan. RFC 2046 allows arbitrary preamble text, so boundary-like prefixes inside preamble should be ignored unless the full delimiter syntax matches, otherwise valid multipart bodies with such preamble content are rejected.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - confirmed with a test case. Fixed in 8cfd52a: PREAMBLE now verifies the delimiter trailer (CRLF or --) inline and treats mismatches as preamble text to keep scanning.

A preamble line like `--boundaryX` is valid preamble text - the byte
after the boundary value isn't a delimiter continuation, so the parser
must keep scanning instead of handing off to START_BOUNDARY and
erroring. Track the trailer verification inline in PREAMBLE with
expanded `index` sub-phases for CR+LF and HYPHEN+HYPHEN.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Skip header area?

1 participant