Skip multipart preamble bytes before the first boundary#260
Skip multipart preamble bytes before the first boundary#260
Conversation
RFC 2046 section 5.1.1 says parsers must ignore any bytes before the first boundary delimiter. Previously the parser errored on such input, breaking machine-generated payloads (e.g. email.mime output piped into an HTTP body). Add a PREAMBLE state that scans forward for the boundary while preserving the existing error path when no boundary is found. Closes #59
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6bfb1ae05e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
python_multipart/multipart.py
Outdated
| i0 = data.find(boundary, i, length) | ||
| if i0 >= 0: | ||
| # Found the full boundary. Hand off to START_BOUNDARY | ||
| # positioned to validate the trailing bytes (either | ||
| # CRLF for a new part, or "--" for an empty message). |
There was a problem hiding this comment.
Continue scanning preamble after boundary-prefix false matches
When PREAMBLE uses data.find(boundary, ...) it treats any raw \r\n--<boundary> occurrence as a real delimiter and immediately switches to START_BOUNDARY; if the next byte is not CR/- (e.g. preamble line --boundaryX), START_BOUNDARY raises instead of continuing to scan. RFC 2046 allows arbitrary preamble text, so boundary-like prefixes inside preamble should be ignored unless the full delimiter syntax matches, otherwise valid multipart bodies with such preamble content are rejected.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch - confirmed with a test case. Fixed in 8cfd52a: PREAMBLE now verifies the delimiter trailer (CRLF or --) inline and treats mismatches as preamble text to keep scanning.
A preamble line like `--boundaryX` is valid preamble text - the byte after the boundary value isn't a delimiter continuation, so the parser must keep scanning instead of handing off to START_BOUNDARY and erroring. Track the trailer verification inline in PREAMBLE with expanded `index` sub-phases for CR+LF and HYPHEN+HYPHEN.
Summary
PREAMBLEparser state that ignores bytes before the first boundary delimiter, per RFC 2046 section 5.1.1. Closes Skip header area? #59.MultipartParseErrorinfinalize()if no boundary was ever found, preserving the existing behavior for garbage input.-still routes throughSTART_BOUNDARY(unchanged), anything else entersPREAMBLEand scans forward for the boundary. This keeps all existing error cases (bad_initial_boundary,empty_message_with_bad_end) intact.Test plan
preamble,preamble_mime_headers) covering the scenario from Skip header area? #59 (machine-generated payload with MIME-style headers before the first boundary).single_byte_teststo exercise the byte-by-byte path where the boundary straddles chunk boundaries.test_preamble_is_ignored,test_preamble_split_across_writes,test_finalize_raises_when_no_boundary_found.