
fix(robots-txt): improve RFC 9309 compliance for robots.txt parsing #153

Merged
marevol merged 4 commits into master from claude/robots-txt-parser-features-fEcja on Mar 29, 2026

Conversation

Contributor

@marevol marevol commented Mar 28, 2026

Summary

Improve robots.txt parsing and path matching to better comply with RFC 9309 and RFC 3986:

  • Fix user-agent regex escaping: Special regex characters (., +, (), [], etc.) in user-agent strings are now properly escaped instead of being interpreted as regex operators
  • Fix Sitemap directive breaking user-agent groups: Sitemap and unknown directives between User-agent lines no longer incorrectly terminate the current group, per RFC 9309 which defines them as non-group-member records
  • Add percent-encoded URL path matching: Decode only unreserved percent-encoded characters (RFC 3986) for matching while preserving reserved characters (%2F, %3F, %23, %2A, %24) in their encoded form
  • Add percent-encoding case normalization: Normalize hex digits in percent-encoding to uppercase (e.g., %2f -> %2F) before matching
  • Fix + sign handling: + in URI paths is treated as a literal character, not as a space (RFC 3986 compliance; + as space is only valid in form-encoding)
  • Add missing @Test annotations: All 15 test methods in RobotsTxtHelperTest were missing JUnit 5 @Test annotations, causing them to be skipped by Maven Surefire
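To illustrate the first bullet, here is a minimal sketch of literal user-agent token matching (the class and method names are hypothetical, not the project's actual API): `Pattern.quote` escapes the token so `.` and `+` in a product name cannot act as regex operators.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: match a robots.txt user-agent token literally.
public class UserAgentMatch {
    // Quote the token so regex metacharacters in it are treated as
    // literal text, then look for it anywhere in the UA string.
    public static boolean matches(String robotsToken, String userAgent) {
        Pattern p = Pattern.compile(Pattern.quote(robotsToken.toLowerCase()));
        return p.matcher(userAgent.toLowerCase()).find();
    }

    public static void main(String[] args) {
        // Unescaped, the regex "bot.1+" would also match "botx1"
        // ('.' matches 'x', '1+' matches '1'); quoted, it does not.
        System.out.println(matches("Bot.1+", "Mozilla/5.0 (compatible; Bot.1+)"));
        System.out.println(matches("Bot.1+", "Mozilla/5.0 (compatible; BotX1)"));
    }
}
```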

claude and others added 4 commits March 28, 2026 23:05
Fix user-agent regex escaping to handle special characters (., +, etc.)
correctly, fix Sitemap directive incorrectly breaking user-agent groups,
add HTTP 401/403 status code handling to restrict all access per RFC 9309,
and add percent-encoded URL path matching support.

https://claude.ai/code/session_01Mfo8VuKbsbCrohEgDjArBM
…nt-decoding

- Remove incorrect 401/403 "disallow all" handling from Hc5HttpClient and
  Hc4HttpClient. Both RFC 9309 Section 2.3.1.3 and Google spec agree that
  HTTP 4xx for robots.txt means "no restrictions" (full allow).
- Replace URLDecoder.decode() with RFC 3986-compliant percent-decoder that
  does NOT convert '+' to space ('+' is only special in form-encoding, not
  in URI percent-encoding).
- Add test for '+' literal handling in path pattern matching.

https://claude.ai/code/session_01Mfo8VuKbsbCrohEgDjArBM
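A rough sketch of the kind of RFC 3986 percent-decoder this commit describes (hypothetical names, not the actual patch): unlike `java.net.URLDecoder`, it leaves `+` untouched, since `+`-as-space is a form-encoding convention, not part of URI percent-encoding.

```java
// Hypothetical sketch of an RFC 3986 percent-decoder: decodes %XX
// sequences but never converts '+' to a space.
public class PercentDecode {
    public static String decode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()
                    && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2))) {
                out.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits
            } else {
                out.append(c); // '+' falls through here unchanged
            }
        }
        return out.toString();
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
                || (c >= 'A' && c <= 'F');
    }

    public static void main(String[] args) {
        // URLDecoder would produce "/a b c" here; RFC 3986 keeps the '+'.
        System.out.println(decode("/a+b%20c"));
    }
}
```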
…unnecessary comments

- Fix PathPattern to only decode unreserved percent-encoded characters per RFC 9309
  (reserved characters like %2F, %3F, %23 stay encoded; %2A/%24 are not
  reinterpreted as robots metacharacters)
- Add percent-encoding case normalization (%2f -> %2F)
- Remove comment-only changes from Hc4HttpClient and Hc5HttpClient
- Add tests for reserved character handling and encoded metacharacters
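A sketch of the unreserved-only decoding plus case normalization described above (hypothetical helper; assumes the RFC 3986 unreserved set ALPHA / DIGIT / `-` `.` `_` `~`): unreserved octets like `%7E` are decoded, while reserved ones like `%2f` stay encoded but have their hex digits uppercased.

```java
// Hypothetical sketch: normalize a URI path for robots.txt matching by
// decoding only unreserved percent-encoded octets (RFC 3986) and
// uppercasing the hex digits of everything left encoded.
public class Rfc3986Normalize {
    public static String normalize(String path) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()
                    && isHex(path.charAt(i + 1)) && isHex(path.charAt(i + 2))) {
                int v = Integer.parseInt(path.substring(i + 1, i + 3), 16);
                if (isUnreserved((char) v)) {
                    out.append((char) v); // e.g. %7E -> '~'
                } else {
                    out.append('%')       // e.g. %2f -> %2F, still encoded
                       .append(Character.toUpperCase(path.charAt(i + 1)))
                       .append(Character.toUpperCase(path.charAt(i + 2)));
                }
                i += 2;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    private static boolean isUnreserved(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
                || (c >= 'A' && c <= 'F');
    }

    public static void main(String[] args) {
        // '~' is unreserved (decoded); '/' (%2f) is reserved (kept, uppercased).
        System.out.println(normalize("/a%7Eb%2fc"));
    }
}
```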
All 15 test methods in RobotsTxtHelperTest were missing @Test annotations,
causing them to not be detected by Maven Surefire with JUnit Platform.
Also fix percent-encoding test resource to use %20 instead of literal space
(the parser's DISALLOW_RECORD regex does not capture whitespace).
@marevol marevol changed the title from feat(robots-txt): improve RFC 9309 compliance for robots.txt parsing to fix(robots-txt): improve RFC 9309 compliance for robots.txt parsing on Mar 29, 2026
@marevol marevol merged commit 8d8da85 into master Mar 29, 2026
1 check passed
