khalatevarun/webcrawler

Web Crawler

A CLI-based concurrent web crawler with configurable concurrency limits, a maximum page count, and CSV export of extracted page data.

Features

  • Concurrent crawling with configurable concurrency limits using p-limit
  • Max page control to limit the number of pages crawled
  • URL normalization for consistent tracking and deduplication
  • Data extraction: Extracts H1 tags, first paragraph, outgoing links, and image URLs
  • CSV export: Automatically generates a CSV report of all crawled pages
  • Domain restriction: Only crawls pages within the same domain
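The domain restriction can be illustrated with a small check. This is a minimal sketch, not the project's actual code: the helper name `isSameDomain` is hypothetical, and it assumes "same domain" means an exact hostname match, so subdomains count as different sites.

```typescript
// Hypothetical helper (assumption): true when `candidate` is on the same host as `base`.
function isSameDomain(base: string, candidate: string): boolean {
  try {
    // Resolve relative links such as "/about" against the base URL.
    const resolved = new URL(candidate, base);
    // Assumption: strict hostname equality, so blog.example.com != example.com.
    return resolved.hostname === new URL(base).hostname;
  } catch {
    // Malformed URLs are treated as out of scope.
    return false;
  }
}
```

Under this assumption, a link to `/about` found on `https://example.com` stays in scope, while `https://other.com/x` is skipped.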

Key Concepts & Libraries

p-limit

Controls concurrency by limiting the number of promises that can run simultaneously, which avoids overwhelming the target server.
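To show the idea behind a concurrency limit, here is a hand-rolled, dependency-free sketch. It is not p-limit's implementation, and the name `createLimiter` is an assumption made for illustration. It only demonstrates the technique: tasks beyond the limit wait in a queue until a running task hands over its slot.

```typescript
// Hand-rolled stand-in for a concurrency limiter (assumption: not p-limit's code).
function createLimiter(maxConcurrency: number) {
  let active = 0; // number of slots currently in use
  const waiting: Array<() => void> = []; // resolvers for queued tasks

  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= maxConcurrency) {
      // All slots are taken: wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) {
        next(); // pass the slot directly to the next waiter; `active` is unchanged
      } else {
        active--; // nobody is waiting, so free the slot
      }
    }
  };
}
```

With p-limit itself, `pLimit(maxConcurrency)` returns a function that is used the same way: `limit(() => someAsyncTask())`.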

AbortController

A standard web API, also available as a global in Node.js, used to cancel in-flight HTTP requests when the crawler reaches its maximum page limit.
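A minimal sketch of how a page limit could be tied to an AbortController. The `PageBudget` class and `fetchHtml` function are hypothetical names, not taken from the project; the sketch assumes a runtime with global `fetch` and `AbortController` (for example Node.js 18 or later) and a `maxPages` of at least 1.

```typescript
// Hypothetical page budget (assumption): aborts a shared signal once the limit is hit.
class PageBudget {
  private readonly controller = new AbortController();
  private readonly maxPages: number;
  private count = 0;

  constructor(maxPages: number) {
    this.maxPages = maxPages;
  }

  // The signal to hand to every request, e.g. fetch(url, { signal }).
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Records one crawled page; returns false if the budget was already spent.
  record(): boolean {
    if (this.controller.signal.aborted) return false;
    this.count++;
    if (this.count >= this.maxPages) {
      this.controller.abort(); // cancels every request that uses this signal
    }
    return true;
  }
}

// Sketch of a request that honours the shared signal.
async function fetchHtml(url: string, signal: AbortSignal): Promise<string | null> {
  try {
    const response = await fetch(url, { signal });
    return await response.text();
  } catch (err) {
    // An aborted fetch rejects with an error whose name is "AbortError".
    if ((err as { name?: string } | null)?.name === "AbortError") return null;
    throw err;
  }
}
```

Sharing one signal across all requests means a single `abort()` call stops every pending fetch at once, rather than letting them finish after the limit is reached.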

Normalization

Converts URLs to a consistent format (e.g., www.example.com/path) so that different representations of the same page are tracked as one entry and detected as duplicates.
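One way such a normalizer might look is sketched below. The function name `normalizeURL` and the exact rules are assumptions: this version keeps only host and path, drops the scheme, query string, and fragment, and trims trailing slashes. The project's real rules may differ.

```typescript
// Hypothetical normalizer (assumption): keeps host + path, drops everything else.
function normalizeURL(input: string): string {
  const url = new URL(input); // throws a TypeError on malformed input
  // The URL parser already lowercases the host; the path keeps its case.
  const path = url.pathname.replace(/\/+$/, ""); // trim trailing slashes
  return `${url.host}${path}`;
}

// Both spellings map to the same key, so the page is only tracked once.
const seen = new Set<string>();
seen.add(normalizeURL("https://www.example.com/path/"));
seen.add(normalizeURL("http://WWW.Example.com/path"));
```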

Examples

Crawl a blog with moderate concurrency

npm start https://blog.example.com 3 20

Quick scan with high concurrency

npm start https://example.com 10 5

Deep crawl with conservative concurrency

npm start https://docs.example.com 2 100

Installation

# Clone the repository
git clone https://github.com/khalatevarun/webcrawler.git
cd webcrawler

# Install dependencies
npm install

Usage

Basic Usage

npm start <URL> <maxConcurrency> <maxPages>

Parameters

Parameter        Description                              Required   Default
URL              The starting URL to crawl                Yes        -
maxConcurrency   Max number of concurrent HTTP requests   Yes        -
maxPages         Maximum number of pages to crawl         Yes        -
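Since all three parameters are required, the entry point needs to validate them before crawling. The following is a sketch under assumptions: `parseArgs` and `CrawlOptions` are hypothetical names, and requiring positive integers is an assumed rule rather than documented behavior.

```typescript
// Hypothetical shape of the parsed command-line options (assumption).
interface CrawlOptions {
  baseURL: string;
  maxConcurrency: number;
  maxPages: number;
}

// Validates the three positional arguments: <URL> <maxConcurrency> <maxPages>.
function parseArgs(args: string[]): CrawlOptions {
  if (args.length !== 3) {
    throw new Error("usage: npm start <URL> <maxConcurrency> <maxPages>");
  }
  const [baseURL, rawConcurrency, rawPages] = args;
  try {
    new URL(baseURL); // the URL constructor throws on malformed input
  } catch {
    throw new Error(`invalid URL: ${baseURL}`);
  }
  const maxConcurrency = Number(rawConcurrency);
  const maxPages = Number(rawPages);
  // Assumption: both limits must be positive integers.
  if (!Number.isInteger(maxConcurrency) || maxConcurrency < 1) {
    throw new Error("maxConcurrency must be a positive integer");
  }
  if (!Number.isInteger(maxPages) || maxPages < 1) {
    throw new Error("maxPages must be a positive integer");
  }
  return { baseURL, maxConcurrency, maxPages };
}
```

In a Node.js entry point this would typically be called as `parseArgs(process.argv.slice(2))`.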

Output

The crawler generates a report.csv file in the project root with the following columns:

  • page_url: The URL of the crawled page
  • h1: The H1 heading text
  • first_paragraph: The first paragraph text (prioritizes <main> content)
  • outgoing_link_urls: Semicolon-separated list of outgoing links
  • image_urls: Semicolon-separated list of image URLs

Sample CSV Output

page_url,h1,first_paragraph,outgoing_link_urls,image_urls
https://example.com,Welcome to Example,This is the first paragraph.,https://example.com/about;https://example.com/contact,https://example.com/logo.png
https://example.com/about,About Us,Learn more about our company.,https://example.com/;https://example.com/team,https://example.com/team-photo.jpg
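Rows like the ones above could be produced as sketched here. The names `PageData`, `escapeCsvField`, and `toCsv` are hypothetical, and the quoting rule (wrap fields that contain commas, quotes, or line breaks, and double any embedded quotes) follows common CSV practice rather than anything confirmed about this project.

```typescript
// Hypothetical record for one crawled page, mirroring the CSV columns (assumption).
interface PageData {
  page_url: string;
  h1: string;
  first_paragraph: string;
  outgoing_link_urls: string[];
  image_urls: string[];
}

// Quotes a field when it contains a comma, quote, or line break; doubles embedded quotes.
function escapeCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Builds the full CSV text: a header line followed by one line per page.
function toCsv(pages: PageData[]): string {
  const header = ["page_url", "h1", "first_paragraph", "outgoing_link_urls", "image_urls"];
  const rows = pages.map((page) =>
    [
      page.page_url,
      page.h1,
      page.first_paragraph,
      page.outgoing_link_urls.join(";"), // list columns are semicolon-separated
      page.image_urls.join(";"),
    ]
      .map(escapeCsvField)
      .join(",")
  );
  return [header.join(","), ...rows].join("\n");
}
```

Escaping matters mostly for `h1` and `first_paragraph`, since page text often contains commas that would otherwise shift the columns.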
