# NeuralScraper v2.0

Web scraping, analysis & content extraction for AI agents.

Scrape pages, crawl sites, extract UI/brand/SEO data. MCP server + CLI + HTTP API. Local-first, self-hosted.



Part of the Neural* ecosystem. NeuralScraper handles web scraping & analysis — but it doesn't work alone. It pairs with NeuralVaultCore (persistent memory), NeuralVaultSkill (session automation), and NeuralVaultFlow (dev workflow orchestration). Each component has its own repository and documentation. See the Neural* Ecosystem section at the bottom.


## What It Does

NeuralScraper gives AI agents (and humans) a clean, structured way to extract data from the web — no fluff, no cloud dependency.

| Capability | Description |
| --- | --- |
| Scrape | Single-page scrape (web + PDF) |
| Screenshot | Full-page PNG capture |
| Crawl | Multi-page scraping with depth and limit control |
| Map | Fast internal URL discovery |
| UI Analysis | Layout structure, components, spacing, typography |
| Brand Extraction | Dominant colors, fonts, logos |
| SEO Audit | Meta tags, headings, OG, schema markup, scoring |
| Analyze | Scrape + screenshot + UI + brand + SEO in one command |
| Search | Web search via SearXNG + scraping of results |
| Extract | Structured data extraction with a local LLM (Ollama) and a custom schema |
| Interact | Browser actions (click, type, wait) + scrape |
| Batch | Process a list of URLs from a file |

## Installation

### Option 1 — Local (recommended)

```shell
git clone https://github.com/getobyte/NeuralScraper.git
cd NeuralScraper
npm install
npx playwright install chromium
npm run build
```

Make the CLI globally available:

```shell
npm link
# Now you can run: ns scrape https://example.com
```

Start the MCP server:

```shell
node dist/mcp-server.js
```
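To sanity-check that the server speaks MCP over stdio, you can pipe a JSON-RPC `initialize` request into it. The request shape follows the MCP specification; the exact response fields depend on the SDK version, so treat this as a probe, not a contract:

```shell
# MCP handshake probe: send an `initialize` request over stdio.
# The request shape comes from the MCP spec; run this from the repo root
# after `npm run build`. Prints a fallback message if the build is missing.
req='{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"probe","version":"0.0.0"}}}'
printf '%s\n' "$req" | node dist/mcp-server.js || echo "dist/mcp-server.js not found; build the project first"
```

A valid server replies with a single JSON-RPC response line containing its name and declared capabilities.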

### Option 2 — Docker (homelab)

```shell
git clone https://github.com/getobyte/NeuralScraper.git
cd NeuralScraper
cp .env.example .env
docker compose up -d
```

The MCP server starts on port 9996 inside the `NeuralScraper` container.

Verify:

```shell
docker ps | grep NeuralScraper
docker logs NeuralScraper
```
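If the HTTP API is published on the same port (9996 is an assumption based on the compose setup above), a quick liveness check from the host looks like this:

```shell
# Liveness probe against the published port; prints a fallback message
# when the container is not up. Port 9996 is assumed from the compose setup.
curl -s http://localhost:9996/health || echo "NeuralScraper container is not reachable"
```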

## Connecting to Claude Code

Add to `~/.claude.json` or `.claude/settings.json` in your project:

```json
{
  "mcpServers": {
    "neuralscraper": {
      "command": "node",
      "args": ["D:/path/to/NeuralScraper/dist/mcp-server.js"]
    }
  }
}
```

Restart Claude Code. The following 12 tools will be available:

`ns_scrape` · `ns_screenshot` · `ns_crawl` · `ns_map` · `ns_ui` · `ns_brand` · `ns_seo` · `ns_analyze` · `ns_search` · `ns_extract` · `ns_interact` · `ns_batch`


## HTTP API

NeuralScraper exposes a REST API when running as a server.

| Method | Endpoint |
| --- | --- |
| GET | `/health` |
| POST | `/scrape` |
| POST | `/screenshot` |
| POST | `/crawl` |
| POST | `/map` |
| POST | `/ui` |
| POST | `/brand` |
| POST | `/seo` |
| POST | `/analyze` |
| POST | `/search` |
| POST | `/extract` |
| POST | `/interact` |
| POST | `/batch` |
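A hedged example of calling the scrape endpoint with curl. The JSON body field (`url`) and the port are assumptions inferred from the CLI and Docker sections, not a documented contract; check the server code for the exact request schema:

```shell
# POST a scrape job. The `url` body field and port 9996 are assumptions;
# the command prints a fallback message when no server is listening.
curl -s -X POST http://localhost:9996/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' \
  || echo "server not running"
```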

## Using with Ollama (Local LLM)

NeuralScraper's `ns extract` command uses Ollama to run a local LLM for structured data extraction — no cloud, no API keys.

### Step 1 — Install Ollama

Windows / macOS: Download the installer from ollama.com/download and run it.

Linux:

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Verify:

```shell
ollama --version
```

### Step 2 — Pull the recommended model

```shell
ollama pull qwen3:14b
```

`qwen3:14b` — 9.3 GB, 40K context, native tool-use support. Recommended for `ns extract` flows.

### Step 3 — Run

```shell
ollama run qwen3:14b
```

Ollama runs as a local API server on `http://localhost:11434`. No internet required after the initial pull.
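Since Ollama exposes a standard REST API, you can verify the model responds outside of NeuralScraper. `/api/generate` is part of Ollama's documented API; the model name matches the pull step above:

```shell
# One-shot generation via Ollama's REST API. Requires `ollama serve`
# (or the desktop app) to be running; prints a fallback message otherwise.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen3:14b", "prompt": "Reply with OK", "stream": false}' \
  || echo "Ollama is not running"
```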


## CLI Usage

```shell
# Scrape a page (web or PDF)
ns scrape https://example.com

# Full-page screenshot
ns screenshot https://example.com

# Crawl a site
ns crawl https://example.com --depth 2 --limit 20

# Discover URLs
ns map https://example.com

# UI analysis
ns ui https://example.com

# Brand extraction
ns brand https://example.com

# SEO audit
ns seo https://example.com

# Full analysis (scrape + screenshot + UI + brand + SEO)
ns analyze https://example.com

# Web search via SearXNG + scrape results
ns search "best react libs" --limit 5

# Structured extraction with LLM (Ollama)
ns extract https://example.com --schema '{"price":"string"}'

# Browser automation (click, type, wait) + scrape
ns interact https://example.com --actions '[{"click":".btn"}]'

# Batch processing from a file
ns batch urls.txt
```
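For `ns batch`, the input is a plain text file of URLs. The one-URL-per-line format is an assumption worth verifying against the batch tool; a minimal setup:

```shell
# Create a batch input file: one URL per line (assumed format).
cat > urls.txt <<'EOF'
https://example.com
https://example.org
EOF

# Then run the batch job (requires `ns` on PATH via `npm link`);
# prints a fallback message when the CLI is not installed.
ns batch urls.txt || echo "ns is not installed"
```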

## CLI Options

| Option | Commands | Default |
| --- | --- | --- |
| `-o, --output <dir>` | all | `./ns-output` |
| `-d, --depth <n>` | crawl | `2` |
| `-l, --limit <n>` | crawl, search | `20` / `5` |
| `--no-screenshot` | scrape, crawl, batch | |
| `-s, --schema <json>` | extract | |
| `-p, --prompt <text>` | extract | |
| `-a, --actions <json>` | interact | `[]` |
| `--no-scrape` | search | |
| `--no-scrape-after` | interact | |
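Options combine on a single command. For example, a deeper crawl with a custom output directory and screenshots disabled (flag behavior as listed above; short and long forms are interchangeable):

```shell
# Crawl three levels deep, cap at 50 pages, skip screenshots,
# and write results under ./site-dump instead of ./ns-output.
# Prints a fallback message when the CLI is not installed.
ns crawl https://example.com -d 3 -l 50 -o ./site-dump --no-screenshot \
  || echo "ns is not installed"
```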

## Output Structure

Single-page scrape:

```
ns-output/
  example.com/
    2026-03-28T14-30-00/
      page.md
      page.html
      metadata.json
      links.json
      screenshot.png
      ui-analysis.json
      brand.json
      seo-audit.json
      manifest.json
```

Crawl job:

```
ns-output/
  example.com/
    crawl-2026-03-28T14-30-00/
      manifest.json
      pages.json
      pages/
        001-home/
        002-about/
        ...
```

## Architecture

```
src/
  browser/
    playwright.ts        # Browser pool management
    screenshot.ts        # Full-page screenshot
  extractors/
    markdown.ts          # HTML → Markdown (readability + turndown)
    metadata.ts          # Meta tags, OG, Twitter cards
    links.ts             # Link extraction & classification
    ui-analyzer.ts       # Layout, components, spacing, fonts
    brand.ts             # Colors, fonts, logos
    seo.ts               # SEO audit with scoring
  storage/
    writer.ts            # File output & manifest generation
  tools/
    scrape.ts
    screenshot.ts
    crawl.ts
    map.ts
    ui.ts
    brand.ts
    seo.ts
    analyze.ts
    search.ts
    extract.ts
    interact.ts
    batch.ts
  cli.ts                 # CLI entry point (commander)
  mcp-server.ts          # MCP server entry point (stdio)
  index.ts               # Library exports
```

## Stack

| Layer | Technology |
| --- | --- |
| Runtime | Node.js 20+ |
| Language | TypeScript 5.8 |
| Browser | Playwright (Chromium) |
| HTML → MD | @mozilla/readability + turndown |
| HTML parsing | cheerio |
| MCP | @modelcontextprotocol/sdk |
| CLI | commander |
| Build | tsup |

## Neural* Ecosystem

NeuralScraper is a standalone tool — but it's designed to work alongside the rest of the Neural* family. Each component lives in its own repo with its own docs.

| Component | Role | Repo |
| --- | --- | --- |
| NeuralScraper (you are here) | Web scraping & analysis | |
| NeuralVaultCore | Persistent memory for AI agents | → GitHub |
| NeuralVaultSkill | Session memory automation | → GitHub |
| NeuralVaultFlow | Dev workflow orchestration | → GitHub |

NeuralScraper v2.0 — Cyber-Draco Legacy. Built by getobyte.
