fix(robots-txt): improve RFC 9309 compliance for robots.txt parsing #153
Merged
Fix user-agent regex escaping to handle special characters (., +, etc.) correctly, fix Sitemap directive incorrectly breaking user-agent groups, add HTTP 401/403 status code handling to restrict all access per RFC 9309, and add percent-encoded URL path matching support. https://claude.ai/code/session_01Mfo8VuKbsbCrohEgDjArBM
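As a rough illustration of the user-agent escaping fix, the sketch below quotes the agent token before compiling it into a regex. The class and method names (`UserAgentMatcher`, `literalAgentPattern`, `matches`) are hypothetical and are not taken from this PR; the only library call assumed is the standard `java.util.regex.Pattern.quote`.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the PR's actual code.
final class UserAgentMatcher {

    // Compile the agent token as a literal, so '.', '+', '(' and friends
    // are matched as plain characters rather than regex operators.
    static Pattern literalAgentPattern(String agentToken) {
        return Pattern.compile(Pattern.quote(agentToken), Pattern.CASE_INSENSITIVE);
    }

    // True if the literal token occurs anywhere in the User-Agent header value.
    static boolean matches(String agentToken, String userAgentHeader) {
        return literalAgentPattern(agentToken).matcher(userAgentHeader).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("my.bot", "my.bot/1.0"));          // literal dot present
        System.out.println(matches("my.bot", "myXbot/1.0"));          // '.' must not act as a wildcard
        System.out.println(matches("c++crawler", "C++Crawler/2.1"));  // '+' taken literally
    }
}
```

Quoting the whole token is the simplest option when no part of the token is meant to be a pattern; a parser that supports wildcards in agent names would need to escape selectively instead.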
…nt-decoding
- Remove incorrect 401/403 "disallow all" handling from Hc5HttpClient and
Hc4HttpClient. Both RFC 9309 Section 2.3.1.3 and Google spec agree that
HTTP 4xx for robots.txt means "no restrictions" (full allow).
- Replace URLDecoder.decode() with RFC 3986-compliant percent-decoder that
does NOT convert '+' to space ('+' is only special in form-encoding, not
in URI percent-encoding).
- Add test for '+' literal handling in path pattern matching.
https://claude.ai/code/session_01Mfo8VuKbsbCrohEgDjArBM
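To make the `+` point concrete, here is a minimal sketch of a percent-decoder in the RFC 3986 sense, assuming UTF-8 for the decoded bytes. `UriPercentDecoder` is a hypothetical name, not the decoder added in this commit. Unlike `URLDecoder.decode()`, which implements form decoding and turns `+` into a space, this version only touches well-formed `%XX` sequences.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the PR's actual decoder.
final class UriPercentDecoder {

    // Returns 0-15 for an ASCII hex digit, or -1 for anything else.
    private static int hexValue(int b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        return -1;
    }

    // Decodes %XX sequences to bytes and interprets the result as UTF-8.
    // '+' is copied through unchanged; malformed escapes are left as-is.
    static String decode(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = hexValue(in[i + 1]);
                int lo = hexValue(in[i + 2]);
                if (hi >= 0 && lo >= 0) {
                    out.write(hi * 16 + lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.write(in[i]);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("/a+b"));    // '+' stays a plus sign
        System.out.println(decode("/a%20b"));  // %20 becomes a space
    }
}
```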
…unnecessary comments
- Fix PathPattern to only decode unreserved percent-encoded characters per
RFC 9309 (reserved characters like %2F, %3F, %23 stay encoded; %2A/%24 are
not reinterpreted as robots metacharacters)
- Add percent-encoding case normalization (%2f -> %2F)
- Remove comment-only changes from Hc4HttpClient and Hc5HttpClient
- Add tests for reserved character handling and encoded metacharacters
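The normalization described in this commit could look roughly like the sketch below. It is an illustration only: `PathNormalizer` and its methods are hypothetical names, and the real `PathPattern` may be structured differently. The unreserved set is the one defined in RFC 3986 (letters, digits, `-`, `.`, `_`, `~`); every other escape is kept encoded with its hex digits uppercased.

```java
// Hypothetical sketch, not the PR's PathPattern implementation.
final class PathNormalizer {

    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    // RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~".
    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Decodes escapes of unreserved characters; keeps all other escapes
    // encoded but uppercases their hex digits (%2f -> %2F).
    static String normalize(String path) {
        StringBuilder sb = new StringBuilder(path.length());
        int i = 0;
        while (i < path.length()) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()) {
                int hi = hexValue(path.charAt(i + 1));
                int lo = hexValue(path.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int value = hi * 16 + lo;
                    if (isUnreserved(value)) {
                        sb.append((char) value);
                    } else {
                        sb.append('%')
                          .append(Character.toUpperCase(path.charAt(i + 1)))
                          .append(Character.toUpperCase(path.charAt(i + 2)));
                    }
                    i += 3;
                    continue;
                }
            }
            sb.append(c); // ordinary character or malformed escape
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("/a%2fb"));   // reserved '/' stays encoded, case normalized
        System.out.println(normalize("/%7Euser")); // unreserved '~' is decoded
        System.out.println(normalize("/x%2a"));    // encoded '*' is not turned into a wildcard
    }
}
```

Applying the same normalization to both the rule pattern and the request path before comparing them means `%2f` and `%2F` match each other, while an encoded `*` or `$` never gains metacharacter meaning.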
All 15 test methods in RobotsTxtHelperTest were missing @Test annotations, so Maven Surefire with JUnit Platform did not detect them. Also fix the percent-encoding test resource to use %20 instead of a literal space (the parser's DISALLOW_RECORD regex does not capture whitespace).
Summary
Improve robots.txt parsing and path matching to better comply with RFC 9309 and RFC 3986:
- Special characters (`.`, `+`, `()`, `[]`, etc.) in user-agent strings are now properly escaped instead of being interpreted as regex operators
- `Sitemap` and unknown directives between `User-agent` lines no longer incorrectly terminate the current group, per RFC 9309, which defines them as non-group-member records
- Path matching keeps reserved characters (`%2F`, `%3F`, `%23`, `%2A`, `%24`) in their encoded form
- Percent-encoding case is normalized (`%2f` → `%2F`) before matching
- `+` sign handling: `+` in URI paths is treated as a literal character, not as a space (RFC 3986 compliance; `+` as space is only valid in form-encoding)
- `@Test` annotations: All 15 test methods in `RobotsTxtHelperTest` were missing JUnit 5 `@Test` annotations, causing them to be skipped by Maven Surefire
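The group-handling change can be pictured with the simplified sketch below. It assumes a toy parser (`GroupParser`, a hypothetical name unrelated to the PR's classes) that only tracks `User-agent`, `Allow`, and `Disallow`. The point it illustrates is that a new group starts only when a `User-agent` line follows a rule line; `Sitemap` and unknown directives are skipped without touching the group state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical toy parser, not the PR's implementation.
final class GroupParser {

    static final class Group {
        final List<String> agents = new ArrayList<>();
        final List<String> rules = new ArrayList<>();
    }

    static List<Group> parse(String robotsTxt) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        boolean seenRule = false; // true once the current group has a rule line
        for (String raw : robotsTxt.split("\\R")) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue; // blank or malformed line
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            switch (key) {
                case "user-agent":
                    // Start a new group only after rules have been seen;
                    // consecutive User-agent lines share one group.
                    if (current == null || seenRule) {
                        current = new Group();
                        groups.add(current);
                        seenRule = false;
                    }
                    current.agents.add(value);
                    break;
                case "allow":
                case "disallow":
                    if (current != null) {
                        current.rules.add(key + ":" + value);
                        seenRule = true;
                    }
                    break;
                default:
                    // Sitemap and unknown directives: ignored here,
                    // and they do not end the current group.
                    break;
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String txt = "User-agent: a\n"
                + "Sitemap: https://example.com/sitemap.xml\n"
                + "User-agent: b\n"
                + "Disallow: /private\n";
        List<Group> groups = parse(txt);
        System.out.println(groups.size() + " group(s), agents=" + groups.get(0).agents);
    }
}
```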