
# headlines-crawler

A Scrapy project that crawls articles from major Brazilian newspapers and persists them to MongoDB.

## Supported sources

| Spider name | Newspaper | Domain |
| --- | --- | --- |
| `g1` | G1 (Globo) | g1.globo.com |
| `folha` | Folha de S.Paulo | folha.uol.com.br |
| `uol` | UOL Notícias | noticias.uol.com.br |

## Requirements

- Python 3.10+
- MongoDB 5+

## Installation

```bash
git clone https://github.com/andersonledo/headlines-crawler.git
cd headlines-crawler
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

## Configuration

All settings live in `headlines_crawler/settings.py`. The most important ones:

| Setting | Default | Description |
| --- | --- | --- |
| `MONGODB_URI` | `mongodb://localhost:27017` | MongoDB connection URI |
| `MONGODB_DATABASE` | `headlines_crawler` | Target database |
| `MONGODB_COLLECTION` | `articles` | Target collection |
| `DOWNLOAD_DELAY` | `2` | Seconds between requests (per domain) |
| `ROBOTSTXT_OBEY` | `True` | Respect robots.txt |

Override any setting on the command line with `-s KEY=VALUE`.
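In Scrapy, these defaults are plain module-level assignments. A minimal sketch of the relevant excerpt of `settings.py`, using only the values from the table above (everything else omitted):

```python
# headlines_crawler/settings.py (excerpt) — defaults from the table above

MONGODB_URI = "mongodb://localhost:27017"   # MongoDB connection URI
MONGODB_DATABASE = "headlines_crawler"      # target database
MONGODB_COLLECTION = "articles"             # target collection

DOWNLOAD_DELAY = 2        # seconds between requests, per domain
ROBOTSTXT_OBEY = True     # respect robots.txt
```

A `-s KEY=VALUE` flag on the command line takes precedence over any value set here.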

## Running a spider

```bash
# Crawl G1
scrapy crawl g1

# Crawl Folha de S.Paulo, stop after 100 articles
scrapy crawl folha -s CLOSESPIDER_ITEMCOUNT=100

# Crawl UOL using a custom MongoDB URI
scrapy crawl uol -s MONGODB_URI="mongodb://user:pass@host:27017"

# Run all spiders sequentially (bash)
for spider in g1 folha uol; do scrapy crawl "$spider"; done
```

## Article schema

Each document saved to MongoDB has the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `newspaper` | string | Source name (e.g. `"G1"`) |
| `url` | string | Canonical article URL (unique index) |
| `title` | string | Article headline |
| `subtitle` | string | Subtitle or summary |
| `author` | string | Author name(s) |
| `body` | string | Full article text |
| `tags` | list[string] | Categories / tags |
| `published_at` | ISO 8601 | Original publication datetime |
| `updated_at` | ISO 8601 | Last update datetime |
| `scraped_at` | ISO 8601 | Crawl timestamp |
| `image_url` | string | URL of the hero image |
| `image_caption` | string | Caption for the hero image |

A unique index on `url` prevents duplicate articles. Re-running a spider updates existing documents via upsert.
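The upsert semantics can be illustrated with a stdlib-only sketch. Here a plain dict stands in for the MongoDB collection, and the function name `upsert_article` is hypothetical; the real pipeline would instead issue something like `update_one({"url": ...}, {"$set": ...}, upsert=True)` against the unique `url` index:

```python
from datetime import datetime, timezone

def upsert_article(collection: dict, article: dict) -> None:
    """Insert or update an article keyed by its canonical URL.

    `collection` is a plain dict standing in for the MongoDB collection:
    writing to the same key models an upsert on the unique `url` index.
    """
    article = {**article, "scraped_at": datetime.now(timezone.utc).isoformat()}
    collection[article["url"]] = article  # same URL -> overwrite, never duplicate

articles = {}
doc = {
    "newspaper": "G1",
    "url": "https://g1.globo.com/exemplo.html",
    "title": "Original headline",
}
upsert_article(articles, doc)
upsert_article(articles, {**doc, "title": "Updated headline"})  # re-crawl
```

After the second call the collection still holds a single document for that URL, with the updated title and a fresh `scraped_at` timestamp.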

## Project structure

```text
headlines-crawler/
├── scrapy.cfg
├── requirements.txt
└── headlines_crawler/
    ├── settings.py       # All Scrapy / MongoDB settings
    ├── items.py          # ArticleItem definition
    ├── pipelines.py      # DuplicateFilterPipeline + MongoDBPipeline
    ├── middlewares.py    # Rotating User-Agent + 429-retry middleware
    └── spiders/
        ├── base.py       # Shared helpers
        ├── g1.py         # G1 spider
        ├── folha.py      # Folha de S.Paulo spider
        └── uol.py        # UOL Notícias spider
```

## Adding a new spider

1. Create `headlines_crawler/spiders/<name>.py` inheriting from `BaseNewsSpider`.
2. Set `name`, `allowed_domains`, and `start_urls`.
3. Implement `parse` (link-following) and `parse_article` (data extraction).
4. Yield `ArticleItem` instances — the pipeline handles MongoDB persistence automatically.

## License

MIT
