
fix(robots-txt): improve RFC 9309 compliance for robots.txt parsing #153

Merged
marevol merged 4 commits into master from claude/robots-txt-parser-features-fEcja on Mar 29, 2026

Conversation

Contributor

@marevol marevol commented Mar 28, 2026

Summary

Improve robots.txt parsing and path matching to better comply with RFC 9309 and RFC 3986:

  • Fix user-agent regex escaping: Special regex characters (., +, (), [], etc.) in user-agent strings are now properly escaped instead of being interpreted as regex operators
  • Fix Sitemap directive breaking user-agent groups: Sitemap and unknown directives between User-agent lines no longer incorrectly terminate the current group, per RFC 9309 which defines them as non-group-member records
  • Add percent-encoded URL path matching: Decode only unreserved percent-encoded characters (RFC 3986) for matching while preserving reserved characters (%2F, %3F, %23, %2A, %24) in their encoded form
  • Add percent-encoding case normalization: Normalize hex digits in percent-encoding to uppercase (e.g., %2f -> %2F) before matching
  • Fix + sign handling: + in URI paths is treated as a literal character, not as a space (RFC 3986 compliance; + as space is only valid in form-encoding)
  • Add missing @Test annotations: All 15 test methods in RobotsTxtHelperTest were missing JUnit 5 @Test annotations, causing them to be skipped by Maven Surefire
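To illustrate the first bullet, here is a minimal sketch of literal user-agent token matching (the class and method names are hypothetical, not the project's actual API): `Pattern.quote` escapes the token so `.` and `+` in a product name cannot act as regex operators.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: match a robots.txt user-agent token literally.
public class UserAgentMatch {
    // Quote the token so regex metacharacters in it are treated as
    // literal text, then look for it anywhere in the UA string.
    public static boolean matches(String robotsToken, String userAgent) {
        Pattern p = Pattern.compile(Pattern.quote(robotsToken.toLowerCase()));
        return p.matcher(userAgent.toLowerCase()).find();
    }

    public static void main(String[] args) {
        // Unescaped, the regex "bot.1+" would also match "botx1"
        // ('.' matches 'x', '1+' matches '1'); quoted, it does not.
        System.out.println(matches("Bot.1+", "Mozilla/5.0 (compatible; Bot.1+)"));
        System.out.println(matches("Bot.1+", "Mozilla/5.0 (compatible; BotX1)"));
    }
}
```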

claude and others added 4 commits March 28, 2026 23:05
Fix user-agent regex escaping to handle special characters (., +, etc.)
correctly, fix Sitemap directive incorrectly breaking user-agent groups,
add HTTP 401/403 status code handling to restrict all access per RFC 9309,
and add percent-encoded URL path matching support.

https://claude.ai/code/session_01Mfo8VuKbsbCrohEgDjArBM
…nt-decoding

- Remove incorrect 401/403 "disallow all" handling from Hc5HttpClient and
  Hc4HttpClient. Both RFC 9309 Section 2.3.1.3 and Google spec agree that
  HTTP 4xx for robots.txt means "no restrictions" (full allow).
- Replace URLDecoder.decode() with RFC 3986-compliant percent-decoder that
  does NOT convert '+' to space ('+' is only special in form-encoding, not
  in URI percent-encoding).
- Add test for '+' literal handling in path pattern matching.

https://claude.ai/code/session_01Mfo8VuKbsbCrohEgDjArBM
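A rough sketch of the kind of RFC 3986 percent-decoder this commit describes (hypothetical names, not the actual patch): unlike `java.net.URLDecoder`, it leaves `+` untouched, since `+`-as-space is a form-encoding convention, not part of URI percent-encoding.

```java
// Hypothetical sketch of an RFC 3986 percent-decoder: decodes %XX
// sequences but never converts '+' to a space.
public class PercentDecode {
    public static String decode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()
                    && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2))) {
                out.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits
            } else {
                out.append(c); // '+' falls through here unchanged
            }
        }
        return out.toString();
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
                || (c >= 'A' && c <= 'F');
    }

    public static void main(String[] args) {
        // URLDecoder would produce "/a b c" here; RFC 3986 keeps the '+'.
        System.out.println(decode("/a+b%20c"));
    }
}
```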
…unnecessary comments

- Fix PathPattern to only decode unreserved percent-encoded characters per RFC 9309
  (reserved characters like %2F, %3F, %23 stay encoded; %2A/%24 are not
  reinterpreted as robots metacharacters)
- Add percent-encoding case normalization (%2f -> %2F)
- Remove comment-only changes from Hc4HttpClient and Hc5HttpClient
- Add tests for reserved character handling and encoded metacharacters
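A sketch of the unreserved-only decoding plus case normalization described above (hypothetical helper; assumes the RFC 3986 unreserved set ALPHA / DIGIT / `-` `.` `_` `~`): unreserved octets like `%7E` are decoded, while reserved ones like `%2f` stay encoded but have their hex digits uppercased.

```java
// Hypothetical sketch: normalize a URI path for robots.txt matching by
// decoding only unreserved percent-encoded octets (RFC 3986) and
// uppercasing the hex digits of everything left encoded.
public class Rfc3986Normalize {
    public static String normalize(String path) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()
                    && isHex(path.charAt(i + 1)) && isHex(path.charAt(i + 2))) {
                int v = Integer.parseInt(path.substring(i + 1, i + 3), 16);
                if (isUnreserved((char) v)) {
                    out.append((char) v); // e.g. %7E -> '~'
                } else {
                    out.append('%')       // e.g. %2f -> %2F, still encoded
                       .append(Character.toUpperCase(path.charAt(i + 1)))
                       .append(Character.toUpperCase(path.charAt(i + 2)));
                }
                i += 2;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    private static boolean isUnreserved(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
                || (c >= 'A' && c <= 'F');
    }

    public static void main(String[] args) {
        // '~' is unreserved (decoded); '/' (%2f) is reserved (kept, uppercased).
        System.out.println(normalize("/a%7Eb%2fc"));
    }
}
```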
All 15 test methods in RobotsTxtHelperTest were missing @Test annotations,
causing them to not be detected by Maven Surefire with JUnit Platform.
Also fix percent-encoding test resource to use %20 instead of literal space
(the parser's DISALLOW_RECORD regex does not capture whitespace).
@marevol marevol changed the title from feat(robots-txt): improve RFC 9309 compliance for robots.txt parsing to fix(robots-txt): improve RFC 9309 compliance for robots.txt parsing on Mar 29, 2026
@marevol marevol merged commit 8d8da85 into master Mar 29, 2026
1 check passed
