Intelligent web scraping with automatic strategy selection and TypeScript-first Apify Actor development.
This skill provides:
- Adaptive reconnaissance - Phases 0-5 with quality gates that skip unnecessary work (curl first, browser only if needed)
- Framework-aware detection - Identifies site framework before searching, skips irrelevant patterns
- Validated findings - Every claimed selector/path/API is tested before reporting
- Self-critiquing reports - Intelligence reports include gap analysis and staleness warnings
- Iterative implementation - Starts simple, adds complexity only if needed
- Production-ready guidance - TypeScript-first Apify Actor development
Add this skill to Claude Code by placing this directory in the skills folder.
User: "Scrape https://example.com"
Claude will automatically:
1. Phase 0: curl raw HTML — detect framework, search for data points, check sitemaps
2. QUALITY GATE: All data in HTML? → Skip browser, go to validation
3. Phase 1: Launch stealth browser (only if needed) — capture traffic, rendered DOM
4. Phase 2: Deep scan (only for missing data) — test interactions, sniff APIs
5. Phase 3: Validate every finding — test selectors, replay APIs, confirm paths
6. Phase 4: Protection testing (only if signals detected or user requested)
7. Phase 5: Generate intelligence report with self-critique
8. Implement recommended approach iteratively
9. Test with small batch, then scale
User: "Make this an Apify Actor"
Claude will:
1. Recommend TypeScript (strongly)
2. Guide through `apify create` command
3. Help choose appropriate template (Cheerio vs Playwright)
4. Port scraping logic to Actor format
5. Configure input schema
6. Test and deploy
web-scraping/
├── SKILL.md # Main entry point (proactive workflow)
├── workflows/ # Implementation patterns
│ ├── reconnaissance.md # Phase 1 interactive reconnaissance (CRITICAL)
│ ├── implementation.md # Phase 4 iterative implementation
│ └── productionization.md # Phase 5 Actor creation
├── strategies/ # Deep-dive guides
│ ├── framework-signatures.md # Framework detection lookup tables
│ ├── cheerio-vs-browser-test.md # Cheerio vs Browser decision + early exit
│ ├── proxy-escalation.md # Protection testing skip/run conditions
│ ├── traffic-interception.md # MITM proxy traffic capture
│ ├── sitemap-discovery.md # 60x faster URL discovery
│ ├── api-discovery.md # 10-100x faster than scraping
│ ├── dom-scraping.md # DevTools bridge + humanizer
│ ├── cheerio-scraping.md # HTTP-only (5x faster)
│ ├── hybrid-approaches.md # Combining strategies
│ ├── anti-blocking.md # Multi-layer anti-detection
│ └── session-workflows.md # Session recording, HAR, replay
├── examples/ # Runnable code
│ ├── traffic-interception-basic.js
│ ├── sitemap-basic.js
│ ├── api-scraper.js
│ ├── hybrid-sitemap-api.js
│ └── iterative-fallback.js
├── reference/ # Quick lookup
│ ├── report-schema.md # Intelligence report format (Sections 1-7)
│ ├── proxy-tool-reference.md # Proxy-MCP tools (80+)
│ ├── regex-patterns.md
│ ├── fingerprint-patterns.md
│ └── anti-patterns.md
├── apify/ # Production deployment
│ ├── typescript-first.md # Why TypeScript
│ ├── cli-workflow.md # apify create (CRITICAL)
│ ├── templates/ # TypeScript boilerplate
│ └── examples/ # Working actors
└── README.md # This file
This skill follows Anthropic's official best practices for skill development:
Pattern: Three-level loading system to manage context efficiently
- Level 1: YAML frontmatter (~85 tokens) - Always loaded
- Level 2: Main SKILL.md (~356 lines) - Loaded when skill invoked
- Level 3: Subdirectories - Loaded on-demand as needed
Result: 70-80% token reduction vs monolithic documentation
Source: skill-creator/SKILL.md
Pattern: Write instructions using verb-first commands, not second-person language
Examples:
- ✅ "Load this workflow when user requests"
- ✅ "Check for sitemaps automatically"
- ❌ "You should load this workflow"
- ❌ "You need to check for sitemaps"
Exception: Second-person is acceptable in user-facing prompts, code comments, and tutorial examples
Source: skill-creator/SKILL.md
Pattern: Concise, specific name and description that determine when Claude invokes the skill
Applied:
name: web-scraping- Clear, hyphen-case identifierdescription:- Specific about activation triggers and capabilities (189 chars, optimized from 244)
Source: agent_skills_spec.md
Pattern: Keep only essential procedural instructions in SKILL.md; move detailed information to subdirectories
Applied:
- SKILL.md: Core 4-phase workflow (~356 lines)
workflows/: Detailed implementation patternsstrategies/: Deep-dive guidesexamples/: Runnable codereference/: Quick lookup patternsapify/: Production deployment guides
Source: skill-creator/SKILL.md
Pattern: Separate executable code, documentation, and output resources
Applied:
examples/- Executable JavaScript learning examples (like scripts/)workflows/,strategies/,reference/,apify/- Documentation loaded as needed (like references/)apify/templates/,apify/examples/- Boilerplate code and templates (like assets/)
Source: skill-creator/SKILL.md
Pattern: Create focused skills for specific purposes rather than one skill that does everything
Applied: This skill focuses specifically on web scraping and Apify Actor development, not general web development
Source: Anthropic Skills Best Practices
Pattern: Use clear, technical language focused on "what" and "how" rather than persuasive or promotional tone
Applied: Direct technical guidance throughout ("Check for sitemaps", "Implement iteratively") vs. marketing language
Source: skill-creator/SKILL.md
Quality-gated workflow that skips unnecessary phases:
- Phase 0: curl-based assessment — detect framework, search for data, check protections
- Phase 1: Browser only if needed — stealth Chrome, traffic capture, rendered DOM
- Phase 2: Deep scan only for missing data — targeted interactions, framework-aware API sniffing
- Phase 3: Validate every finding — test selectors, replay APIs, confirm JSON paths
- Phase 4: Protection testing only if signals warrant — conditional escalation
- Phase 5: Self-critiquing report — gaps, assumptions, staleness warnings
Uses strategies/framework-signatures.md lookup tables:
- Response headers → framework identification
- HTML signatures → data location mapping
- Known major sites → direct strategy (e.g., Amazon: custom SSR, no JSON-LD)
- Detect first, then search only relevant patterns
Reports follow reference/report-schema.md with:
Validated?column for every extraction strategy (YES / PARTIAL / NO)- Self-Critique section: gaps, skipped steps, assumptions, staleness risk
- Targeted re-investigation for fixable gaps
- Start with simplest approach
- Test small batch (5-10 items)
- Scale or fallback based on results
- Add robustness last
For production actors:
- Strongly recommend TypeScript
- Always use
apify createcommand - Choose template based on site type (Cheerio for static, Playwright for JS-heavy)
- Type-safe input/output
1. User: "Scrape example.com"
2. Phase 0: curl raw HTML → detect Next.js (__NEXT_DATA__), find product data in JSON
3. GATE A: All data in __NEXT_DATA__? → YES → Skip browser
4. Phase 3: Validate JSON paths resolve to expected values
5. Phase 5: Generate report with self-critique
6. Result: No browser needed, Cheerio + JSON parsing sufficient
1. User: "Scrape protected-shop.com"
2. Phase 0: curl returns 403 → protection detected, no data in HTML
3. GATE A: NO → Continue to Phase 1
4. Phase 1: Stealth browser loads page, traffic reveals API endpoint
5. GATE B: All data covered via API → Skip Phase 2
6. Phase 3: Replay API request, validate response structure
7. Phase 4: Protection testing (403 was detected) → stealth browser + proxy needed
8. Phase 5: Report + self-critique
9. Implements with discovered API + upstream proxies
10. Tests with 10 items, scales to full dataset
1. User: "Make this an Apify Actor"
2. Claude loads apify/ module
3. Recommends TypeScript? (Yes)
4. Guides through: apify create
5. Analyzes site: Static HTML → Selects Cheerio template
6. Ports scraping logic to TypeScript
7. Adds input schema
8. Tests: apify run
9. Deploys: apify push
10. Result: Production-ready actor
| Approach | Time (1000 pages) | vs Crawling |
|---|---|---|
| Sitemap + API | 5 minutes | 60x faster |
| Sitemap + Playwright | 20 minutes | 15x faster |
| API only | 8 minutes | 40x faster |
| Playwright crawl | 45 minutes | Baseline |
- Start with curl (Phase 0) before launching browser
- Detect framework first, then search relevant patterns only
- Quality gates skip phases when data is sufficient
- Validate every selector/path/API before reporting
- Self-critique: check for gaps, assumptions, staleness
- Protection testing only when signals warrant it
- Start simple (traffic interception → sitemap → API → DOM scraping)
- Test small batch first
- Handle errors gracefully
- Respect rate limits
- Use TypeScript for Apify Actors
- Always use
apify createcommand - Choose template based on Phase 1 findings (Cheerio vs Playwright)
- Test locally with
apify run - Deploy with
apify push
→ See strategies/sitemap-discovery.md troubleshooting section
→ See strategies/api-discovery.md authentication section
→ See strategies/dom-scraping.md and consider API discovered via traffic capture
→ See apify/cli-workflow.md common issues section
- Main skill: Read
SKILL.mdfor complete workflow - Workflows: Implementation patterns in
workflows/ - Strategies: Browse
strategies/for detailed guides - Examples: Run code in
examples/directory - Reference: Quick lookups in
reference/ - Apify: Production deployment in
apify/
Intelligence first, implementation second!
This skill prioritizes:
- Reconnaissance - Understand before coding (APIs > Sitemaps > Scraping)
- Speed - Fastest approach that works (API 10-100x faster than HTML)
- Reliability - Structured data > HTML parsing
- Maintainability - TypeScript, proper tooling
- Best practices - Industry standards
5.0.0 - Traffic-interception-first scraping:
- NEW: Proxy-MCP integration (MITM traffic interception + stealth browser + humanizer)
- NEW: Automatic API discovery via traffic capture
- NEW: Multi-layer anti-detection (stealth mode, humanizer, upstream proxies, TLS spoofing)
- NEW: Session recording and HAR export/replay
- Progressive disclosure architecture
- Proactive strategy discovery
- TypeScript-first Apify guidance
- Comprehensive examples
- Modular organization
All best practices sourced from official Anthropic documentation:
- Anthropic Skills Repository
- Agent Skills Specification
- skill-creator/SKILL.md
- Anthropic Skills Announcement
Start here: Read SKILL.md for the complete proactive workflow.