A Scrapy web-crawling project that scrapes articles from major Brazilian newspapers and persists them to MongoDB.
| Spider name | Newspaper | Domain |
|---|---|---|
| `g1` | G1 (Globo) | g1.globo.com |
| `folha` | Folha de S.Paulo | folha.uol.com.br |
| `uol` | UOL Notícias | noticias.uol.com.br |
- Python 3.10+
- MongoDB 5+
```shell
git clone https://github.com/andersonledo/headlines-crawler.git
cd headlines-crawler
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

All settings live in `headlines_crawler/settings.py`. The most important ones:
| Setting | Default | Description |
|---|---|---|
| `MONGODB_URI` | `mongodb://localhost:27017` | MongoDB connection URI |
| `MONGODB_DATABASE` | `headlines_crawler` | Target database |
| `MONGODB_COLLECTION` | `articles` | Target collection |
| `DOWNLOAD_DELAY` | `2` | Seconds between requests (per domain) |
| `ROBOTSTXT_OBEY` | `True` | Respect robots.txt |
Override any setting on the command line with `-s KEY=VALUE`.
```shell
# Crawl G1
scrapy crawl g1

# Crawl Folha de S.Paulo, stop after 100 articles
scrapy crawl folha -s CLOSESPIDER_ITEMCOUNT=100

# Crawl UOL using a custom MongoDB URI
scrapy crawl uol -s MONGODB_URI="mongodb://user:pass@host:27017"

# Run all spiders sequentially (bash)
for spider in g1 folha uol; do scrapy crawl "$spider"; done
```

Each document saved to MongoDB has the following fields:
| Field | Type | Description |
|---|---|---|
| `newspaper` | string | Source name (e.g. `"G1"`) |
| `url` | string | Canonical article URL (unique index) |
| `title` | string | Article headline |
| `subtitle` | string | Subtitle or summary |
| `author` | string | Author name(s) |
| `body` | string | Full article text |
| `tags` | list[string] | Categories / tags |
| `published_at` | ISO 8601 string | Original publication datetime |
| `updated_at` | ISO 8601 string | Last update datetime |
| `scraped_at` | ISO 8601 string | Crawl timestamp |
| `image_url` | string | URL of the hero image |
| `image_caption` | string | Caption for the hero image |
A unique index on `url` prevents duplicate articles. Re-running a spider updates existing documents via upsert.
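The upsert step can be sketched with `pymongo` as follows. `build_upsert` is a hypothetical helper, not code from the repo; it only illustrates the shape of the filter/update pair implied by the unique `url` index:

```python
def build_upsert(article: dict) -> tuple[dict, dict]:
    """Build the (filter, update) pair for update_one(..., upsert=True).

    The filter matches on the unique `url` field; `$set` overwrites every
    other field so re-crawls refresh existing documents in place.
    """
    flt = {"url": article["url"]}
    update = {"$set": {k: v for k, v in article.items() if k != "url"}}
    return flt, update


# Usage with pymongo (assumes a running MongoDB instance):
# from pymongo import MongoClient, ASCENDING
# coll = MongoClient("mongodb://localhost:27017")["headlines_crawler"]["articles"]
# coll.create_index([("url", ASCENDING)], unique=True)
# flt, update = build_upsert(item)
# coll.update_one(flt, update, upsert=True)
```

With `upsert=True`, a new article inserts a document and a re-crawled one updates it, so the unique index never raises a duplicate-key error in normal operation.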
```
headlines-crawler/
├── scrapy.cfg
├── requirements.txt
└── headlines_crawler/
    ├── settings.py       # All Scrapy / MongoDB settings
    ├── items.py          # ArticleItem definition
    ├── pipelines.py      # DuplicateFilterPipeline + MongoDBPipeline
    ├── middlewares.py    # Rotating User-Agent + 429-retry middleware
    └── spiders/
        ├── base.py       # Shared helpers
        ├── g1.py         # G1 spider
        ├── folha.py      # Folha de S.Paulo spider
        └── uol.py        # UOL Notícias spider
```
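The rotating User-Agent middleware in `middlewares.py` could follow a shape like this minimal sketch; the class name and user-agent strings below are illustrative assumptions, not copied from the repo:

```python
import random

# Illustrative pool; the real project keeps its own list in middlewares.py.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]


class RotatingUserAgentMiddleware:
    """Downloader middleware: assign a random User-Agent to each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue processing
```

A middleware like this is activated through the `DOWNLOADER_MIDDLEWARES` dict in `settings.py`.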
- Create `headlines_crawler/spiders/<name>.py` inheriting from `BaseNewsSpider`.
- Set `name`, `allowed_domains`, and `start_urls`.
- Implement `parse` (link-following) and `parse_article` (data extraction).
- Yield `ArticleItem` instances; the pipeline handles MongoDB persistence automatically.
MIT