Web Crawler

A CLI-based concurrent web crawler with configurable concurrency limits, a maximum page count, and CSV export of the extracted page data.

Features

Concurrent crawling with configurable concurrency limits using p-limit
Max page control to limit the number of pages crawled
URL normalization for consistent tracking and deduplication
Data extraction: Extracts H1 tags, first paragraph, outgoing links, and image URLs
CSV export: Automatically generates a CSV report of all crawled pages
Domain restriction: Only crawls pages within the same domain
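
As a rough illustration of the domain restriction, the sketch below compares hostnames using the standard `URL` class. The helper name `isSameDomain` is hypothetical and not taken from this repository, and treating subdomains as a different domain is an assumption.

```typescript
// Hypothetical helper: true when `candidate` resolves to the same hostname as `base`.
// Assumption: subdomains (e.g. blog.example.com) count as a different domain.
function isSameDomain(base: string, candidate: string): boolean {
  try {
    const baseURL = new URL(base);
    // Relative links such as "/about" are resolved against the base URL.
    const candidateURL = new URL(candidate, baseURL);
    return candidateURL.hostname === baseURL.hostname;
  } catch {
    // Unparseable URLs are treated as out of scope.
    return false;
  }
}
```

With this check, `isSameDomain("https://example.com", "/about")` is true, while a link to `https://other.com/x` is skipped.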

Key Concepts & Libraries

`p-limit`

Controls concurrency by limiting the number of promises that can run simultaneously, which prevents the crawler from overwhelming the target server.
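
To show the idea without pulling in the dependency, here is a minimal hand-rolled limiter. It is a stand-in, not p-limit's implementation; `createLimiter`, `fakeFetch`, and `measurePeak` are hypothetical names used only for this sketch. With the real library, `pLimit(n)` returns a function that is used the same way as `limit` below.

```typescript
// Hypothetical stand-in for p-limit: at most `max` tasks run at the same time.
function createLimiter(max: number) {
  let active = 0; // number of occupied slots
  const waiting: Array<() => void> = [];

  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) {
      // All slots are busy: wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) {
        next(); // pass the slot directly to the next queued task
      } else {
        active--; // nobody is waiting, so free the slot
      }
    }
  };
}

// Runs five fake requests through a limiter of 2 and reports the peak concurrency.
async function measurePeak(): Promise<number> {
  const limit = createLimiter(2);
  let running = 0;
  let peak = 0;
  const fakeFetch = async (url: string): Promise<string> => {
    running++;
    peak = Math.max(peak, running);
    await new Promise((resolve) => setTimeout(resolve, 10)); // simulated latency
    running--;
    return url;
  };
  const urls = ["/a", "/b", "/c", "/d", "/e"];
  await Promise.all(urls.map((url) => limit(() => fakeFetch(url))));
  return peak;
}
```

All five tasks are created immediately, but only two of them are ever in flight together, which corresponds to what `maxConcurrency` controls in the crawler.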

`AbortController`

Standard web API, also available as a global in Node.js, used to cancel in-flight HTTP requests when the crawler reaches its maximum page limit.
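
The sketch below illustrates the cancellation pattern with a simulated request so that it runs without network access. `fakeRequest` and `crawlUntilLimit` are hypothetical names, and the way the crawler itself wires up the signal is not shown in this README. With a real request, the same signal would be passed as `fetch(url, { signal: controller.signal })`.

```typescript
// Hypothetical fake request: resolves with `url` after `ms` unless the signal aborts first.
function fakeRequest(url: string, ms: number, signal: AbortSignal): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted: " + url));
      return;
    }
    const timer = setTimeout(() => resolve(url), ms);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(new Error("aborted: " + url));
      },
      { once: true },
    );
  });
}

// Starts every request at once and aborts the rest as soon as `maxPages` pages are collected.
async function crawlUntilLimit(urls: string[], maxPages: number): Promise<string[]> {
  const controller = new AbortController();
  const crawled: string[] = [];
  const tasks = urls.map((url, i) =>
    fakeRequest(url, 20 * (i + 1), controller.signal)
      .then((page) => {
        crawled.push(page);
        // Limit reached: cancel everything still in flight.
        if (crawled.length >= maxPages) controller.abort();
      })
      .catch(() => {
        // Aborted requests are simply dropped.
      }),
  );
  await Promise.all(tasks);
  return crawled;
}
```

One controller is shared by all requests, so a single `abort()` call stops every pending one.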

`Normalization`

Converts URLs to a consistent format (e.g., www.example.com/path) for accurate duplicate detection across different URL representations.
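
A minimal sketch of such a normalization, assuming the target format is host plus path with the protocol, query string, fragment, and trailing slash dropped. The function name `normalizeURL` and these exact rules are assumptions; the project's actual rules may differ.

```typescript
// Hypothetical normalizer: "https://www.example.com/path/" -> "www.example.com/path".
function normalizeURL(input: string): string {
  const url = new URL(input); // throws on invalid input
  let path = url.pathname;
  // Drop a single trailing slash so "/path/" and "/path" are treated as the same page.
  if (path.endsWith("/")) {
    path = path.slice(0, -1);
  }
  // `host` is already lowercased by the URL parser; protocol, query, and hash are ignored.
  return url.host + path;
}
```

Both `https://www.example.com/path/` and `http://WWW.Example.com/path` then map to `www.example.com/path`, so the second visit is detected as a duplicate.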

Examples

Crawl a blog with moderate concurrency

npm start https://blog.example.com 3 20

Quick scan with high concurrency

npm start https://example.com 10 5

Deep crawl with conservative rate limiting

npm start https://docs.example.com 2 100

Installation

# Clone the repository
git clone https://github.com/khalatevarun/webcrawler.git
cd webcrawler

# Install dependencies
npm install

Usage

Basic Usage

npm start <URL> <maxConcurrency> <maxPages>

Parameters

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| `URL` | The starting URL to crawl | Yes | - |
| `maxConcurrency` | Maximum number of concurrent HTTP requests | Yes | - |
| `maxPages` | Maximum number of pages to crawl | Yes | - |
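
Since all three parameters are required, argument validation might look roughly like the following. `CrawlOptions` and `parseArgs` are hypothetical names for this sketch, and the error messages are illustrative rather than the tool's actual output.

```typescript
// Hypothetical shape of the validated command-line options.
interface CrawlOptions {
  baseURL: string;
  maxConcurrency: number;
  maxPages: number;
}

// Validates the three positional arguments and throws a descriptive error otherwise.
function parseArgs(args: string[]): CrawlOptions {
  if (args.length !== 3) {
    throw new Error("usage: npm start <URL> <maxConcurrency> <maxPages>");
  }
  const [rawURL = "", rawConcurrency = "", rawPages = ""] = args;
  const baseURL = new URL(rawURL).href; // throws if the URL is invalid
  const maxConcurrency = Number(rawConcurrency);
  const maxPages = Number(rawPages);
  if (!Number.isInteger(maxConcurrency) || maxConcurrency < 1) {
    throw new Error("maxConcurrency must be a positive integer");
  }
  if (!Number.isInteger(maxPages) || maxPages < 1) {
    throw new Error("maxPages must be a positive integer");
  }
  return { baseURL, maxConcurrency, maxPages };
}
```

In a Node.js entry point this would typically be called as `parseArgs(process.argv.slice(2))`.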

Output

The crawler generates a report.csv file in the project root with the following columns:

page_url: The URL of the crawled page
h1: The H1 heading text
first_paragraph: The first paragraph text (prioritizes <main> content)
outgoing_link_urls: Semicolon-separated list of outgoing links
image_urls: Semicolon-separated list of image URLs
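
The columns above could be serialized along these lines. This is a sketch under assumptions: `PageData`, `csvEscape`, and `toCSV` are hypothetical names, and the quoting rule (wrap fields containing commas, quotes, or line breaks, and double embedded quotes) follows common CSV practice rather than anything stated in this README.

```typescript
// Hypothetical record type mirroring the report columns.
interface PageData {
  page_url: string;
  h1: string;
  first_paragraph: string;
  outgoing_link_urls: string[];
  image_urls: string[];
}

// Quotes a field when it contains a comma, quote, or line break; embedded quotes are doubled.
function csvEscape(value: string): string {
  return /[",\n\r]/.test(value) ? '"' + value.replace(/"/g, '""') + '"' : value;
}

// Builds the full CSV text: one header line followed by one line per page.
function toCSV(pages: PageData[]): string {
  const header = ["page_url", "h1", "first_paragraph", "outgoing_link_urls", "image_urls"].join(",");
  const rows = pages.map((page) =>
    [
      page.page_url,
      page.h1,
      page.first_paragraph,
      page.outgoing_link_urls.join(";"), // list columns are semicolon-separated
      page.image_urls.join(";"),
    ]
      .map(csvEscape)
      .join(","),
  );
  return [header, ...rows].join("\n");
}
```

The resulting string can then be written to `report.csv`, for example with Node's `fs.writeFileSync`.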

Sample CSV Output

page_url,h1,first_paragraph,outgoing_link_urls,image_urls
https://example.com,Welcome to Example,This is the first paragraph.,https://example.com/about;https://example.com/contact,https://example.com/logo.png
https://example.com/about,About Us,Learn more about our company.,https://example.com/;https://example.com/team,https://example.com/team-photo.jpg
