Crawl and validate websites for broken links, redirects, parameter handling, soft failures, URL patterns, and impact.
- Features
- Live demo
- Tech stack
- How it works
- Architecture diagram
- Functionality deep-dive
- Installation and setup
- API reference
- Configuration and environment variables
- Deploy to Google Cloud Run
- Performance and roadmap
- Bookmarklet (Cat Crawler)
- GitHub Pages (Landing)
- Crawl internal pages from a homepage URL
- Sitemap-first discovery when sitemap.xml is available
- robots.txt respected before fetching any URL
- Same-host crawling with optional scope limited to the start path
- Navigation audit across both anchor links and form actions
- Exclude paths using relative paths, one per line
- Language-agnostic crawl limits by path (for example `/job` also matches `/en/job` and `/fr/job`)
- Ignore job pages by default to prevent job-heavy sections from flooding results
- Redirect resolution with full redirect chains, per-step status codes, and final URL stored
- Optional broken link quick check with HTTP status recording
- Parameter audit for `?test=1`, `?page=2`, and `?filter=value`
- Soft-failure detection for successful pages with missing content, error text, or failed API/XHR endpoints
- URL pattern audit for duplicate structures, legacy/current paths, and inconsistent naming
- Impact analysis for broken and redirected URLs based on repetition, referrers, and core-flow heuristics
- Consolidated validation report with broken URLs, redirect issues, parameter issues, soft failures, and impact analysis
- Duplicate content candidates detection, including querystring and language variants
- Client presets saved in localStorage, export and import presets as JSON
- Bookmarklet opens in a draggable, resizable in-page panel with 4-corner resize handles
- Glass UI with progress ring and animated orb during crawl
This project is deployed on a personal Cloud Run host:
https://site-crawler-989268314020.europe-west2.run.app/
For production use, deploy to your own Cloud Run service and update `APP_ORIGIN` in `docs/bookmarklet.js`.
- React 18
- Vite 5
- Vanilla CSS
- Browser APIs: localStorage and sessionStorage
- Node.js 22+
- Express
- Cheerio for HTML parsing and link extraction
- robots-parser for robots.txt enforcement
- Built-in fetch and AbortController for HTTP requests and timeouts
- Docker
- Google Cloud Run
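The HTTP layer above (built-in `fetch` plus `AbortController`/`AbortSignal` timeouts) can be sketched as a redirect-chain resolver like the one below. This is an illustrative helper, not the repo's actual code: the function and option names are assumptions, and `fetchImpl` is injectable purely so the sketch can be exercised without a network.

```javascript
// Resolve a redirect chain hop by hop, recording each step's status and the
// final URL. Illustrative sketch — names are not taken from the repo.
async function resolveRedirectChain(url, { maxHops = 5, timeoutMs = 10000, fetchImpl = fetch } = {}) {
  const chain = [];
  let current = url;
  for (let hop = 0; hop <= maxHops; hop++) {
    const res = await fetchImpl(current, {
      redirect: "manual",                     // inspect each hop ourselves
      signal: AbortSignal.timeout(timeoutMs), // abort slow responses (Node 18+)
    });
    chain.push({ url: current, status: res.status });
    const location = res.headers.get("location");
    if (res.status < 300 || res.status >= 400 || !location) break;
    current = new URL(location, current).href; // resolve relative Location headers
  }
  return { finalUrl: current, chain };
}
```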
- Enter a homepage URL.
- Add optional exclude paths such as `/jobs`, `/careers`, `/admin`.
- Define crawl limits by path if required.
- Configure max pages and concurrency.
- Choose options such as ignoring job pages, running a broken link check, or enabling parameter audit.
- Run the crawl.
- Review the validation report, audit sections, and export TXT or CSV if needed.
- Paste the site homepage in Homepage URL (e.g. `https://example.com`).
- Add Exclude paths (one per line). Only lines starting with `/` are used.
- Add Crawl limits by path to cap noisy sections (e.g. `/job` max 5).
- Set Max pages and Concurrency based on how deep you want to go.
- Toggle Ignore job pages, Broken link quick check, or Parameter audit if needed.
- Click Run crawl, then download TXT or CSV from Results.
Tip: enable Broken link quick check to classify live HTTP errors, and enable Parameter audit when you need route-level querystring validation.
See a marketing-style overview at `docs/landing.html` (matches the in-app color scheme).
```mermaid
flowchart TD
    Bookmarklet["Bookmarklet (optional)"] --> UI["Browser UI (React + Vite SPA)"]
    UI --> API["Express API (Node.js backend)"]
    API --> Crawler["Crawl engine"]
    Crawler --> Robots["robots.txt"]
    Crawler --> Sitemap["sitemap.xml"]
    Crawler --> Site["Target website"]
    Robots --> Crawler
    Sitemap --> Crawler
    Site --> Crawler
    Crawler --> API
    API --> UI
```
- The backend validates and normalises the start URL.
- robots.txt is fetched and enforced before crawling any URL.
- sitemap.xml is fetched and used as the initial discovery source when available.
- If no sitemap exists, crawling starts from the provided URL.
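The robots.txt enforcement step can be sketched as below. The project itself uses the robots-parser package; this deliberately simplified version only honours `Disallow` prefix rules for the wildcard agent, and every name in it is illustrative.

```javascript
// Minimal robots.txt check for the wildcard agent ("User-agent: *").
// Simplified for illustration — only Disallow path prefixes are honoured.
function isAllowedByRobots(robotsTxt, pathname) {
  let inStarGroup = false;
  const disallows = [];
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const [rawKey, ...rest] = raw.trim().split(":");
    const key = rawKey.toLowerCase();
    const value = rest.join(":").trim();
    if (key === "user-agent") inStarGroup = value === "*";
    else if (inStarGroup && key === "disallow" && value) disallows.push(value);
  }
  return !disallows.some((prefix) => pathname.startsWith(prefix));
}
```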
Each discovered URL is filtered using:
- Same-host enforcement
- Optional restriction to the start path (for example starting at `/en` only crawls `/en/...`)
- Excluded file extensions (images, fonts, media, PDFs, JS, CSS)
- User-defined exclude paths
- Optional job page detection and exclusion
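The filter rules above can be combined into a single predicate like the sketch below. The function name, option names, and extension list are illustrative assumptions, not the repo's actual code.

```javascript
// File extensions that should never be queued (assets, not pages).
const SKIPPED_EXTENSIONS = /\.(png|jpe?g|gif|svg|webp|woff2?|ttf|mp4|pdf|js|css)$/i;

// Decide whether a discovered URL should be queued — illustrative sketch.
function shouldCrawl(url, { startUrl, restrictToStartPath = false, excludePaths = [] }) {
  const target = new URL(url);
  const start = new URL(startUrl);
  if (target.host !== start.host) return false;                // same-host only
  if (restrictToStartPath && !target.pathname.startsWith(start.pathname)) return false;
  if (SKIPPED_EXTENSIONS.test(target.pathname)) return false;  // skip assets
  if (excludePaths.some((p) => target.pathname.startsWith(p))) return false;
  return true;
}
```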
- Path limits are normalised by stripping a leading language segment.
- A rule like `/job` matches `/job`, `/en/job`, `/fr/job`, and so on.
- Each rule tracks how many URLs have been crawled under that path and stops once the limit is reached.
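The language-segment normalisation can be sketched as below. The two-letter-prefix heuristic and the segment-boundary matching are assumptions made for illustration; the repo's actual rules may differ.

```javascript
// Strip a leading two-letter language segment so "/en/job/123" and
// "/fr/job/123" both count against a "/job" limit. Illustrative sketch.
function stripLanguagePrefix(pathname) {
  return pathname.replace(/^\/[a-z]{2}(?=\/|$)/i, "") || "/";
}

// Match on whole path segments so a "/job" rule does not swallow "/jobs".
function matchesPathLimit(pathname, rulePath) {
  const normalized = stripLanguagePrefix(pathname);
  return normalized === rulePath || normalized.startsWith(rulePath + "/");
}
```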
- URLs are processed in batches with a configurable concurrency limit.
- The UI displays progress using a time-based progress indicator while crawling.
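The batched, concurrency-limited processing can be modelled as below. This is a simplified sketch (a strict batch-at-a-time loop; real crawlers often refill a worker pool instead), and the names are illustrative.

```javascript
// Process a queue of URLs in fixed-size batches, awaiting each batch before
// starting the next — a simple way to cap concurrent requests. Illustrative.
async function crawlInBatches(urls, worker, concurrency = 6) {
  const results = [];
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}
```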
- Returned URLs include original URL, final URL after redirects, HTTP status, source type, and referrer page.
- Audit entries are classified as `valid`, `broken`, `redirect_issue`, or `soft_failure`.
- Redirect audit highlights loops, multi-hop redirects, dropped params, and irrelevant destinations.
- Soft-failure audit flags successful pages that still fail functionally.
- Impact audit prioritises broken and redirect issues by repetition and core-flow importance.
- Pattern audit groups URLs by structure and highlights inconsistencies.
- Duplicate candidates are grouped by base URL and flagged when query or language variants exist.
- Results can be exported as TXT or CSV.
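The duplicate-candidate grouping described above can be sketched as follows. The normalisation (dropping the query string and a leading language segment) is an assumption for illustration, as is every name in the snippet.

```javascript
// Group crawled URLs by a normalised base (host + path with the query string
// and any leading two-letter language segment removed); groups with more than
// one variant are duplicate-content candidates. Illustrative sketch.
function duplicateCandidates(urls) {
  const groups = new Map();
  for (const raw of urls) {
    const u = new URL(raw);
    const basePath = u.pathname.replace(/^\/[a-z]{2}(?=\/|$)/i, "") || "/";
    const key = u.host + basePath;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(raw);
  }
  return [...groups.values()].filter((group) => group.length > 1);
}
```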
Request body:

```json
{
  "url": "https://example.com",
  "options": {
    "excludePaths": ["/jobs", "/careers"],
    "pathLimits": [{ "path": "/job", "maxPages": 5 }],
    "maxPages": 300,
    "concurrency": 6,
    "includeQuery": true,
    "ignoreJobPages": true,
    "brokenLinkCheck": false,
    "parameterAudit": true,
    "patternMatchFilter": "/jobs"
  }
}
```

Key response sections:

- `urls`: crawled page records
- `audit`: validated navigation entries with referrer pages and classifications
- `issueReport`: broken URLs, redirect issues, parameter issues, soft failures, and impact analysis
- `impactAudit`: prioritised broken/redirect issues
- `redirectAudit`: redirect-chain QA
- `softFailureAudit`: successful-but-broken pages
- `patternAudit`: structural URL grouping and inconsistency detection
- `parameterAudit`: query-parameter handling checks
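A request against this endpoint can be sketched as below. The README does not state the route path, so the helper only builds the `fetch` options; the commented usage line assumes a hypothetical path — check the Express routes for the real one.

```javascript
// Build the fetch options for a crawl request from the fields documented
// above. Illustrative helper — the endpoint path is not shown in this README.
function buildCrawlRequest(url, options = {}) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, options }),
  };
}

// Usage (endpoint path is an assumption):
// const res = await fetch("http://localhost:8080/crawl",
//   buildCrawlRequest("https://example.com", { maxPages: 300, concurrency: 6 }));
```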
Use the crawler on the page you are currently visiting.
- Open the GitHub Pages landing page in `docs/index.html`.
- Drag the Cat Crawler bookmarklet button to your bookmarks bar.
- Click the bookmark on any site to open Cat Crawler. It auto-fills the current page URL.
- Drag the panel by the top bar and resize it from any of the four corners.
Tip: the landing page button "Drag Cat Crawler 😼" is now a loader bookmarklet. Reinstall it once, and future bookmarklet UI updates will come from `bookmarklet.js` without another reinstall.
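The shape of such a loader bookmarklet can be sketched as below: the bookmark itself only injects a `<script>` tag, so updates to the hosted script ship without reinstalling. The helper name and the test URL are illustrative; in this project the script URL would be derived from `APP_ORIGIN`.

```javascript
// Build the href for a "loader" bookmarklet that injects a <script> tag
// pointing at a hosted bookmarklet script. Illustrative sketch.
function buildLoaderBookmarklet(loaderUrl) {
  const code =
    `(()=>{var s=document.createElement('script');` +
    `s.src=${JSON.stringify(loaderUrl)}+'?t='+Date.now();` + // cache-bust each click
    `document.body.appendChild(s)})()`;
  return "javascript:" + encodeURIComponent(code);
}
```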
The landing page lives in `docs/index.html`.
Enable Pages:
- Go to Settings → Pages.
- Source: Deploy from a branch.
- Branch: `main`.
- Folder: `/docs`.
- Save.
Then visit:
https://<org-or-user>.github.io/site-crawler/
