Web Crawler

A web crawler and RSS aggregator built with Go. Crawl any site from a seed URL, visualize its link graph, and subscribe to discovered RSS feeds — all from a single interface.

Features

BFS web crawler — concurrent worker pool, configurable depth and page limits, rate limiting, live progress tracking
Graph analysis — PageRank scoring, orphaned pages, broken links, redirect chains, link authority rankings
RSS aggregator — follow feeds, background scraping, post timeline
Feed discovery — automatically detects RSS/Atom feeds during crawl; subscribe in one click
D3.js visualization — force-directed graph with nodes sized by PageRank, colored by depth, zoom/pan, click-to-highlight
REST API — API key auth, 15+ endpoints, all responses typed via sqlc-generated queries

Stack

Go — crawler engine, REST API (Chi router)
PostgreSQL — persistence (Goose migrations, sqlc)
D3.js — graph visualization
Neon — serverless Postgres hosting

Setup

Prerequisites: Go 1.21+, PostgreSQL, Goose, sqlc

git clone https://github.com/omarraf/web-scraper.git
cd web-scraper
go mod vendor

Create a .env file:

PORT=8000
DATABASE_URL=your_postgres_connection_string

Run migrations and start:

goose -dir sql/schema postgres $DATABASE_URL up
go run .

Open http://localhost:8000 for the UI.

API

All routes under /v1. Authenticated routes require Authorization: ApiKey <key>.

POST   /v1/users                          Create user (returns API key)
GET    /v1/users                          Get current user

POST   /v1/feeds                          Add RSS feed
GET    /v1/feeds                          List all feeds
POST   /v1/feed_follows                   Follow a feed
DELETE /v1/feed_follows/{id}              Unfollow a feed
GET    /v1/posts                          Get posts from followed feeds

POST   /v1/crawl_jobs                     Start a crawl
GET    /v1/crawl_jobs                     List crawl jobs
GET    /v1/crawl_jobs/{id}                Job status + progress
DELETE /v1/crawl_jobs/{id}               Cancel running crawl
GET    /v1/crawl_jobs/{id}/graph          Nodes + edges for visualization
GET    /v1/crawl_jobs/{id}/analysis       Orphans, broken links, redirects, discovered feeds
GET    /v1/crawl_jobs/{id}/pagerank       Top pages by PageRank

Quick start

# Create a user
curl -X POST http://localhost:8000/v1/users -d '{"name":"alice"}'
# → copy the api_key from the response

# Start a crawl
curl -X POST http://localhost:8000/v1/crawl_jobs \
  -H "Authorization: ApiKey <key>" \
  -d '{"seed_url":"https://example.com","max_depth":2,"max_pages":100}'

# Or just open http://localhost:8000 and use the UI

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
crawler		crawler
internal/database		internal/database
sql		sql
vendor		vendor
web		web
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
database.go		database.go
go.mod		go.mod
go.sum		go.sum
handlers_crawler.go		handlers_crawler.go
handlers_feeds.go		handlers_feeds.go
handlers_users.go		handlers_users.go
json.go		json.go
main.go		main.go
middleware_auth.go		middleware_auth.go
models.go		models.go
rss.go		rss.go
scraper.go		scraper.go
sqlc.yaml		sqlc.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Web Crawler

Features

Stack

Setup

API

Quick start

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Web Crawler

Features

Stack

Setup

API

Quick start

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages