Skip to content

0xDarkMatter/flarecrawl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

77 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🔥 Flarecrawl CLI

GitHub Python License Cloudflare

Cloudflare Browser Rendering CLI — Firecrawl-compatible, cost-efficient at scale.

CLI that wraps Cloudflare's Browser Rendering REST API with the same command structure as Firecrawl. Supports scraping, crawling, URL discovery, screenshots, PDFs, and AI-powered data extraction — all running on Cloudflare's headless Chromium infrastructure. Cost-efficient alternative for high-volume use cases (free 10 min/day, then $0.09/hr).

Recent Updates

Version Date Changes
v0.11.0 2026-04-03 search command (Jina Search), --proxy flag, --clean for HTML, per-site YAML rulesets (flarecrawl rules), FLARECRAWL_PROXY env var, 378 tests
v0.10.0 2026-04-02 Enhanced content extraction (--paywall), stealth mode (--stealth), automatic content cleanup — multi-strategy cascade with per-site optimisations, browser TLS fingerprint impersonation via curl_cffi, archive fallbacks, ad/cruft removal on all markdown output, works without auth, batch mode support, 343 tests
v0.9.0 2026-03-26 Markdown content negotiation (Accept: text/markdown) — auto-detects sites serving markdown natively, skips browser rendering for faster/cheaper/higher-quality extraction. Domain capability cache, --no-negotiate, source metadata on all results, flarecrawl negotiate status/clear, batch session reuse, 278 tests
v0.8.0 2026-03-20 --scroll, --query, --precision/--recall, --deduplicate, --session, flarecrawl batch, --format accessibility, 215 tests
v0.7.0 2026-03-20 --archived (Wayback fallback), --language, --magic (cookie banner removal), filename collision fixes, 197 tests
v0.6.1 2026-03-19 --backup-dir for raw HTML archival, discover edge case fixes, 187 tests
v0.6.0 2026-03-19 --selector, --js-eval, --wait-for-selector, --stdin, --har, flarecrawl discover command, 185 tests
v0.5.4 2026-03-19 --user-agent on all commands for custom crawler identity or paywall bypass
v0.5.3 2026-03-19 Guided auth login with browser auto-open for token setup
v0.5.2 2026-03-19 Content filtering on crawl/download, --webhook on crawl, summary+main-content combo, 169 unit tests
v0.5.1 2026-03-19 Feature test corpus (80 live tests across 8 sites), 158 unit tests, all green
v0.5.0 2026-03-19 --only-main-content, --include-tags/--exclude-tags, --mobile, --headers, --diff, formats: images/summary/schema, new schema command
v0.4.0 2026-03-19 --auth user:pass flag on all commands for HTTP Basic Auth protected sites
v0.3.0 2026-03-19 Batch mode, response caching, connection pooling, HTTP/2, env-var config, 100 tests

Why Flarecrawl?

Firecrawl Flarecrawl
Pricing model Per-page credits Time-based (free 10 min/day, then $0.09/hr)
JS rendering Yes Yes (headless Chromium on edge)
PDF generation No Yes
AI extraction Spark models Workers AI (included)
Favicon extraction Via branding format Dedicated command
Self-hosted option Yes Cloudflare infrastructure
Web search Yes No
Branding extraction Yes Not yet

Different pricing models suit different use cases. Flarecrawl's time-based pricing is particularly cost-efficient for high-volume crawls.

Install

git clone https://github.com/0xDarkMatter/flarecrawl.git
cd flarecrawl
uv venv && uv pip install -e .

Setup

1. Create an API token

  1. Go to https://dash.cloudflare.com/profile/api-tokens
  2. Click Create Token
  3. Select Create Custom Token
  4. Configure:
    • Token name: Flarecrawl (or anything)
    • Permissions: Account → Browser Rendering → Edit
    • Account Resources: Include → your account
  5. Click Continue to summaryCreate Token
  6. Copy the token (shown only once)

2. Find your Account ID

  1. Go to https://dash.cloudflare.com
  2. Click any domain (or the account overview)
  3. Look in the right sidebar under Account ID
  4. Copy the 32-character hex string

3. Authenticate

# Interactive — opens browser to Cloudflare dashboard for guided setup
flarecrawl auth login

# Non-interactive (CI/CD)
flarecrawl auth login --account-id YOUR_ACCOUNT_ID --token YOUR_TOKEN

4. Verify

flarecrawl auth status
flarecrawl --status   # Shows auth + pricing info

Environment Variables (CI/CD)

export FLARECRAWL_ACCOUNT_ID="your-account-id"
export FLARECRAWL_API_TOKEN="your-api-token"

Commands

scrape — Fetch page content

# Default: markdown output to stdout
flarecrawl scrape https://example.com

# HTML output
flarecrawl scrape https://example.com --format html

# JSON envelope (for piping)
flarecrawl scrape https://example.com --json

# Multiple URLs (scraped concurrently)
flarecrawl scrape https://a.com https://b.com https://c.com --json

# Batch mode: file input with NDJSON output and configurable workers
flarecrawl scrape --batch urls.txt --workers 5

# From a file of URLs (backward-compatible alias for --batch)
flarecrawl scrape --input urls.txt --json

# With timing info
flarecrawl scrape https://example.com --timing

# Filter JSON fields
flarecrawl scrape https://example.com --json --fields url,content

# Extract links only
flarecrawl scrape https://example.com --format links --json

# Take screenshot via scrape
flarecrawl scrape https://example.com --screenshot -o page.png

# Wait for JS rendering (SPAs, Swagger UIs)
flarecrawl scrape https://example.com --js

# Bypass response cache
flarecrawl scrape https://example.com --no-cache

# Custom page load strategy
flarecrawl scrape https://example.com --wait-until networkidle2

Formats: markdown (default), html, links, screenshot, json (AI extraction)

HTTP Basic Auth

All commands support --auth user:password for sites protected by HTTP Basic Auth:

flarecrawl scrape https://intranet.example.com --auth admin:secret
flarecrawl crawl https://intranet.example.com --wait --limit 50 --auth admin:secret
flarecrawl download https://intranet.example.com --limit 20 --auth user:pass
flarecrawl screenshot https://intranet.example.com --auth user:pass -o page.png

Content filtering

# Strip nav/header/footer, keep main article content
flarecrawl scrape https://example.com --only-main-content

# Keep only specific CSS selectors
flarecrawl scrape https://example.com --include-tags "article,.post"

# Remove specific elements
flarecrawl scrape https://example.com --exclude-tags "nav,footer,.sidebar"

Custom headers & mobile

# Custom HTTP headers
flarecrawl scrape https://example.com --headers "Accept-Language: fr"
flarecrawl scrape https://example.com --headers '{"X-Api-Key": "abc123"}'

# Custom User-Agent (identify your crawler, or try bypassing paywalls)
flarecrawl scrape https://example.com --user-agent "MyBot/1.0 (contact@example.com)"
flarecrawl scrape https://paywalled.example.com --user-agent "Googlebot/2.1"

# Mobile device emulation (iPhone 14 Pro viewport)
flarecrawl scrape https://example.com --mobile
flarecrawl screenshot https://example.com --mobile -o mobile.png

Images, summaries & structured data

# Extract all image URLs from a page
flarecrawl scrape https://example.com --format images --json

# AI-powered content summary
flarecrawl scrape https://example.com --format summary --json

# Extract LD+JSON, OpenGraph, Twitter Cards
flarecrawl scrape https://example.com --format schema --json

# Dedicated schema command with type filtering
flarecrawl schema https://example.com --json
flarecrawl schema https://example.com --type ld-json --json
flarecrawl schema https://example.com --type opengraph --json

Webhooks

# POST crawl results to a URL when complete
flarecrawl crawl https://example.com --wait --limit 10 --webhook https://hooks.example.com/crawl

# With custom headers (e.g. auth token)
flarecrawl crawl https://example.com --wait --limit 10 \
  --webhook https://hooks.example.com/crawl \
  --webhook-headers "Authorization: Bearer token123"

CSS selector extraction & JS execution

# Extract content from specific CSS selector
flarecrawl scrape https://example.com --selector "main" --json

# Wait for a CSS element before capturing (SPAs, lazy-load)
flarecrawl scrape https://example.com --wait-for-selector ".loaded" --json

# Run JavaScript and return the result
flarecrawl scrape https://example.com --js-eval "document.title" --json
flarecrawl scrape https://example.com --js-eval "document.querySelectorAll('a').length" --json

Stdin piping & HAR capture

# Process local HTML without API call
cat page.html | flarecrawl scrape --stdin --only-main-content
curl https://example.com | flarecrawl scrape --stdin --format schema --json

# Save request metadata to HAR file
flarecrawl scrape https://example.com --har requests.har --json

# Save raw HTML alongside output (for archival/reprocessing)
flarecrawl scrape https://example.com --backup-dir ./html-backup
flarecrawl download https://example.com --limit 20 --backup-dir ./html-backup

URL discovery

# Discover all URLs via sitemaps, RSS feeds, and page links
flarecrawl discover https://example.com --json

# Sitemaps only
flarecrawl discover https://example.com --no-feed --no-links --json

# With limit
flarecrawl discover https://example.com --limit 100 --json

Cookie banner removal, language, archive fallback

# Remove cookie banners, GDPR modals, newsletter popups
flarecrawl scrape https://eu-site.example.com --magic

# Request content in a specific language
flarecrawl scrape https://example.com --language de

# Fallback to Internet Archive if page returns 404
flarecrawl scrape https://dead-link.example.com --archived

Enhanced content extraction

Multi-strategy content extraction that applies per-site optimisations before falling back to standard browser rendering. Useful for sites that serve content in SSR HTML but hide it with client-side JavaScript, or sites behind bot detection that block default request patterns.

# Enhanced extraction with multi-strategy cascade
flarecrawl scrape https://example.com/article --paywall

# JSON output shows which extraction strategy worked
flarecrawl scrape https://example.com/article --paywall --json
# metadata.source indicates the strategy used

# Batch mode
flarecrawl scrape --batch urls.txt --paywall --workers 5

# No Cloudflare auth required (direct HTTP strategies)
# With auth: falls through to browser rendering if direct strategies fail

Strategies are tried in order of speed and cost. Direct HTTP strategies consume zero browser time. When Cloudflare auth is configured, per-site header optimisations are also applied to browser rendering requests as a fallback.

Optional dependency for improved compatibility: pip install curl_cffi

Stealth mode

Opt-in browser TLS fingerprint impersonation for direct HTTP requests. Uses curl_cffi to send requests with a real Safari/Chrome TLS handshake, avoiding bot detection systems that fingerprint JA3/JA4 hashes.

# Stealth mode for content negotiation and direct fetches
flarecrawl scrape https://example.com --stealth

# Combine with paywall extraction
flarecrawl scrape https://example.com --paywall --stealth --json

# Batch mode
flarecrawl scrape --batch urls.txt --stealth --workers 5

Requires: pip install curl_cffi

Without --stealth, requests use Python's default TLS stack (httpx/OpenSSL) which is identifiable by bot detection systems. With --stealth, the TLS Client Hello is indistinguishable from a real Safari browser.

Content cleanup

Markdown output is automatically cleaned of common ad placeholders, share buttons, newsletter prompts, copyright lines, and navigation chrome. No flag needed - applied to all markdown output by default.

For HTML output, use --clean to strip ad/promo DOM elements:

# Strip ad containers, social share widgets, cookie banners from HTML
flarecrawl scrape https://example.com --format html --clean --json

Web search

Search the web and optionally scrape each result. Uses Jina Search API.

# Search and get results as JSON
flarecrawl search "python web scraping" --json

# Search and scrape each result through the normal pipeline
flarecrawl search "topic" --scrape --limit 5 --paywall --json

# Pipeline: search -> extract URLs -> batch scrape
flarecrawl search "query" --json | jq -r '.data[].url' > urls.txt
flarecrawl scrape --batch urls.txt --paywall --stealth --workers 5

Requires JINA_API_KEY env var (free at https://jina.ai/api-key). With --scrape, all scrape flags apply (--paywall, --stealth, --only-main-content, --clean).

Proxy support

Route all HTTP requests through a proxy. Affects CLI-to-Cloudflare API connections and direct HTTP (stealth, negotiate, paywall). Does NOT affect the Cloudflare browser-to-target connection (CF's browser uses its own IP).

# HTTP proxy
flarecrawl scrape https://example.com --proxy http://proxy:8080

# SOCKS5 proxy
flarecrawl scrape https://example.com --proxy socks5://localhost:9050

# Set default via env var
export FLARECRAWL_PROXY=socks5://localhost:9050
flarecrawl scrape https://example.com   # uses env var proxy

Supported on all commands: scrape, crawl, download, extract, search.

Per-site rules

Customisable per-site header rules for enhanced extraction. Default rules ship with the package; user overrides are loaded from ~/.config/flarecrawl/rules.yaml.

# List all loaded rules
flarecrawl rules list --json

# Show rules for a domain
flarecrawl rules show www.nytimes.com

# Add a custom rule
flarecrawl rules add example.com --referer https://www.google.com/ --cookie "auth=1"

# Show file paths
flarecrawl rules path

Rules use Ladder-compatible YAML format:

- domain: www.example.com
  headers:
    Referer: "https://www.google.com/"
    Cookie: "session=abc"
- domains:
    - a.com
    - b.com
  headers:
    Referer: "https://t.co/x?amp=1"

Markdown content negotiation

Sites on Cloudflare (Pro+) can serve markdown directly via Accept: text/markdown content negotiation. Flarecrawl auto-detects this on every scrape — when a site supports it, content is fetched via a simple HTTP GET instead of headless Chromium.

# Auto-detect (default) — tries content negotiation first
flarecrawl scrape https://blog.cloudflare.com/some-post

# Force browser rendering (skip negotiation)
flarecrawl scrape https://blog.cloudflare.com/some-post --no-negotiate

# JSON output shows the source
flarecrawl scrape https://blog.cloudflare.com/some-post --json
# metadata.source: "content-negotiation" (no browser) or "browser-rendering"
# metadata.markdownTokens: 1234 (from x-markdown-tokens header)
# metadata.contentSignal: {"ai-train": "yes", ...}

Benefits when negotiation succeeds:

  • Faster — ~100-200ms vs 2-3s for browser rendering
  • Cheaper — zero browser time consumed
  • Higher quality — server-side conversion by the site owner
  • Domain cached — one probe per domain, batch-friendly

Change tracking

# Compare current content against cached version
flarecrawl scrape https://example.com --diff --json

crawl — Crawl a website

# Start crawl and wait for results
flarecrawl crawl https://example.com --wait --limit 50

# With progress indicator
flarecrawl crawl https://example.com --wait --progress --limit 100

# Fire and forget (returns job ID)
flarecrawl crawl https://example.com --limit 50

# Check status of running crawl
flarecrawl crawl JOB_ID --status

# Get results from completed crawl
flarecrawl crawl JOB_ID

# Filter paths
flarecrawl crawl https://docs.example.com --wait --limit 200 \
  --include-paths "/docs,/api" --exclude-paths "/zh,/ja"

# Stream results as NDJSON (one record per line)
flarecrawl crawl JOB_ID --ndjson --fields url,markdown

# Skip JS rendering for faster crawl
flarecrawl crawl https://example.com --wait --limit 100 --no-render

# Follow subdomains
flarecrawl crawl https://example.com --wait --allow-subdomains

# Save to file
flarecrawl crawl https://example.com --wait --limit 50 -o results.json

map — Discover URLs

# List all links on a page
flarecrawl map https://example.com

# JSON output
flarecrawl map https://example.com --json

# Include subdomains
flarecrawl map https://example.com --include-subdomains

# Limit results
flarecrawl map https://example.com --limit 20 --json

download — Save site to disk

# Download as markdown files to .flarecrawl/
flarecrawl download https://docs.example.com --limit 50

# Download as HTML
flarecrawl download https://example.com --limit 20 --format html

# Filter paths
flarecrawl download https://docs.example.com --limit 100 \
  --include-paths "/docs" --exclude-paths "/changelog"

# Skip confirmation prompt
flarecrawl download https://example.com --limit 10 -y

Files are saved to .flarecrawl/<domain>/ with sanitized filenames.

extract — AI-powered data extraction

# Extract structured data with a natural language prompt
flarecrawl extract "Get all product names and prices" \
  --urls https://shop.example.com --json

# With JSON schema for structured output
flarecrawl extract "Extract article metadata" \
  --urls https://blog.example.com \
  --schema '{"type":"json_schema","schema":{"type":"object","properties":{"title":{"type":"string"},"date":{"type":"string"}}}}'

# Schema from file
flarecrawl extract "Extract data" --urls https://example.com --schema-file schema.json

# Multiple URLs
flarecrawl extract "Get page title" --urls https://a.com,https://b.com --json

# Batch mode: parallel extraction with NDJSON output
flarecrawl extract "Get page title" --batch urls.txt --workers 5

Uses Cloudflare Workers AI for extraction (no additional cost).

screenshot — Capture web pages

# Default: saves to screenshot.png
flarecrawl screenshot https://example.com

# Custom output path
flarecrawl screenshot https://example.com -o hero.png

# Full page
flarecrawl screenshot https://example.com -o full.png --full-page

# Specific element
flarecrawl screenshot https://example.com --selector "main" -o main.png

# Custom viewport
flarecrawl screenshot https://example.com --width 1440 --height 900 -o wide.png

# JPEG format
flarecrawl screenshot https://example.com --format jpeg -o page.jpg

# JSON output (base64 encoded)
flarecrawl screenshot https://example.com --json

pdf — Render pages as PDF

# Default: saves to page.pdf
flarecrawl pdf https://example.com

# Custom output
flarecrawl pdf https://example.com -o report.pdf

# Landscape A4
flarecrawl pdf https://example.com -o report.pdf --landscape --format a4

# JSON output (base64 encoded)
flarecrawl pdf https://example.com --json

favicon — Extract favicon URL

# Get the best (largest) favicon
flarecrawl favicon https://example.com

# Show all found icons
flarecrawl favicon https://example.com --all

# JSON output
flarecrawl favicon https://example.com --json
flarecrawl favicon https://example.com --all --json

Renders the page, parses <link rel="icon">, <link rel="apple-touch-icon">, and related tags. Returns the largest icon found. Falls back to /favicon.ico if no <link> tags found.

usage — Track browser time

# Show today's usage and history
flarecrawl usage

# JSON output
flarecrawl usage --json

Tracks the X-Browser-Ms-Used header from each API response locally. Free tier is 600,000ms (10 minutes) per day.

auth — Authentication

flarecrawl auth login                    # Interactive
flarecrawl auth login --account-id ID --token TOKEN  # Non-interactive
flarecrawl auth status                   # Human-readable
flarecrawl auth status --json            # Machine-readable
flarecrawl auth logout                   # Clear credentials

cache — Response cache management

flarecrawl cache status                  # Show entries, size, path
flarecrawl cache status --json           # Machine-readable
flarecrawl cache clear                   # Remove all cached responses

Responses are cached for 1 hour by default. Use --no-cache on any command to bypass.

negotiate — Domain capability cache

flarecrawl negotiate status              # Show domains that support text/markdown
flarecrawl negotiate status --json       # Machine-readable
flarecrawl negotiate clear               # Reset domain cache

Tracks which domains respond to Accept: text/markdown. Positive results cached 7 days, negative 24 hours.

Performance features

  • Response caching — 1-hour TTL, saves redundant browser renders
  • Connection pooling — persistent httpx session with HTTP/2 support
  • Resource rejection — skips images/fonts/media/stylesheets for text extraction
  • JS rendering — opt-in via --js flag (waits for networkidle0)

Environment variables

Variable Default Description
FLARECRAWL_ACCOUNT_ID Cloudflare account ID
FLARECRAWL_API_TOKEN Cloudflare API token
FLARECRAWL_CACHE_TTL 3600 Cache TTL in seconds
FLARECRAWL_MAX_RETRIES 3 Max retry attempts
FLARECRAWL_MAX_WORKERS 10 Max parallel workers
FLARECRAWL_TIMEOUT 120 Request timeout in seconds
FLARECRAWL_PROXY - Default proxy URL (http/https/socks5)
JINA_API_KEY - Jina API key for search command

Firecrawl Compatibility

Flarecrawl follows the same command structure as the firecrawl CLI:

firecrawl command flarecrawl equivalent Notes
firecrawl scrape URL flarecrawl scrape URL Same flags
firecrawl scrape URL1 URL2 flarecrawl scrape URL1 URL2 Concurrent
firecrawl crawl URL --wait flarecrawl crawl URL --wait Same flags
firecrawl map URL flarecrawl map URL Same flags
firecrawl download URL flarecrawl download URL Saves to .flarecrawl/
firecrawl agent PROMPT flarecrawl extract PROMPT Uses Workers AI
firecrawl credit-usage flarecrawl usage Local tracking
firecrawl search QUERY flarecrawl search QUERY Via Jina Search API
firecrawl --status flarecrawl --status Same

What's different

  • search command — uses Jina Search API (requires JINA_API_KEY), with --scrape to scrape results
  • extract instead of agent — same concept, different name to avoid confusion
  • favicon command — bonus: extract favicon/apple-touch-icon URLs from pages
  • schema command — bonus: extract LD+JSON, OpenGraph, Twitter Cards
  • PDF command — bonus: Cloudflare supports PDF rendering, Firecrawl doesn't
  • Output directory.flarecrawl/ instead of .firecrawl/
  • --only-main-content — client-side via BeautifulSoup (Firecrawl uses server-side extraction)

Output Format

All --json output follows a consistent envelope:

{
  "data": { ... },
  "meta": { "format": "markdown", "count": 1 }
}

Errors:

{
  "error": { "code": "AUTH_REQUIRED", "message": "Not authenticated..." }
}

Exit Codes

Code Meaning Action
0 Success Continue
1 Error Check stderr for details
2 Auth required Run flarecrawl auth login
3 Not found Check job ID
4 Validation Fix arguments
5 Forbidden Check token permissions
7 Rate limited Wait and retry

Batch & Parallel

Commands that operate on multiple URLs support batch mode with configurable parallelism.

Batch input (--batch)

# Plain text file (one URL per line, # comments supported)
flarecrawl scrape --batch urls.txt --workers 5

# JSON array
flarecrawl scrape --batch urls.json

# NDJSON (one JSON object per line)
flarecrawl extract "Get title" --batch urls.ndjson --workers 3

Input format is auto-detected: starts with [ → JSON array, starts with { → NDJSON, otherwise plain text.

Batch output

Batch mode outputs NDJSON (one JSON object per line) with index correlation:

{"index": 0, "status": "ok", "data": {"url": "https://a.com", "content": "...", "elapsed": 1.2}}
{"index": 1, "status": "error", "error": {"code": "TIMEOUT", "message": "Request timed out..."}}
{"index": 2, "status": "ok", "data": {"url": "https://c.com", "content": "...", "elapsed": 0.8}}

Results are sorted by index. Failed URLs don't stop processing — errors are reported inline.

Workers

Flag Default Max Notes
--workers / -w 3 10 Matches CF paid tier concurrency limit
# Conservative (free tier: 3 concurrent browsers)
flarecrawl scrape --batch urls.txt --workers 3

# Aggressive (paid tier: up to 10)
flarecrawl scrape --batch urls.txt --workers 10

Supported commands

Command --batch --workers Notes
scrape Yes Yes Also supports --input (alias)
extract Yes Yes Supplements --urls
crawl No No Has its own async job system
screenshot No No Single URL
pdf No No Single URL

Advanced Usage

Raw JSON body passthrough

Every command supports --body to send a raw JSON payload directly to the CF API, bypassing all flag processing:

flarecrawl scrape --body '{
  "url": "https://example.com",
  "gotoOptions": {"waitUntil": "networkidle0", "timeout": 60000},
  "rejectResourceTypes": ["image", "media"]
}' --json

Piping and chaining

# Map URLs then batch scrape them
flarecrawl map https://docs.example.com --json | \
  jq -r '.data[]' | head -10 > urls.txt
flarecrawl scrape --batch urls.txt --workers 5

# Crawl and extract just the markdown
flarecrawl crawl https://example.com --wait --limit 10 --json | \
  jq -r '.data.records[] | select(.status=="completed") | .markdown'

# Stream crawl results through jq
flarecrawl crawl JOB_ID --ndjson --fields url,markdown | \
  jq -r '.url + "\t" + (.markdown | length | tostring)'

Retry behavior

Requests automatically retry up to 3 times on HTTP 429 (rate limited), 502, and 503 errors with exponential backoff. Timeouts also trigger retries.

Pricing Details

Tier Browser Time Concurrent Browsers
Free 10 min/day 3 max
Paid 10 hr/month included, then $0.09/hr 10 (averaged monthly), +$2/extra

Browser time is shared between REST API calls and Workers bindings. Track your usage with flarecrawl usage.

Project Structure

flarecrawl/
├── pyproject.toml              # Package config
├── AGENTS.md                   # AI agent context
├── README.md                   # This file
├── src/flarecrawl/
│   ├── __init__.py             # Version
│   ├── batch.py                # Batch processing (parse + parallel workers)
│   ├── cache.py                # File-based response cache
│   ├── cli.py                  # Typer CLI (all commands)
│   ├── client.py               # CF Browser Rendering API client (httpx pooling, HTTP/2)
│   ├── config.py               # Credentials, usage tracking, env-var config
│   ├── extract.py              # HTML extraction (main content, images, schema, tags)
│   ├── negotiate.py            # Markdown content negotiation (Accept: text/markdown)
│   ├── paywall.py              # Paywall bypass cascade (SSR, Referer, Wayback, Jina)
│   ├── rules.py                # Per-site YAML rulesets (load, merge, cache)
│   ├── search.py               # Web search via Jina Search API
│   ├── stealth.py              # Stealth HTTP (curl_cffi TLS impersonation)
│   └── default_rules.yaml      # Shipped per-site header rules
└── tests/
    ├── conftest.py             # Test fixtures
    ├── corpus.py               # Feature test corpus (80 live tests x 8 sites)
    ├── test_batch.py           # Batch module tests
    ├── test_cache.py           # Cache module tests
    ├── test_cli.py             # CLI tests
    ├── test_client.py          # Client tests
    ├── test_extract.py         # Extract module tests
    ├── test_paywall.py         # Paywall bypass tests
    ├── test_rules.py           # Per-site rules tests
    └── test_search.py          # Search module tests

Development

# Install dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff check src/

# Reinstall after changes
uv pip install -e .

License

MIT

About

Flarecrawl CLI — Cloudflare Browser Rendering alternative to Firecrawl, cost-efficient at scale

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages