
# headlines-crawler

A Scrapy project that crawls articles from major Brazilian newspapers and persists them to MongoDB.

## Supported sources

| Spider name | Newspaper | Domain |
| --- | --- | --- |
| `g1` | G1 (Globo) | g1.globo.com |
| `folha` | Folha de S.Paulo | folha.uol.com.br |
| `uol` | UOL Notícias | noticias.uol.com.br |

## Requirements

- Python 3.10+
- MongoDB 5+

## Installation

```bash
git clone https://github.com/andersonledo/headlines-crawler.git
cd headlines-crawler
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

## Configuration

All settings live in `headlines_crawler/settings.py`. The most important ones:

| Setting | Default | Description |
| --- | --- | --- |
| `MONGODB_URI` | `mongodb://localhost:27017` | MongoDB connection URI |
| `MONGODB_DATABASE` | `headlines_crawler` | Target database |
| `MONGODB_COLLECTION` | `articles` | Target collection |
| `DOWNLOAD_DELAY` | `2` | Seconds between requests (per domain) |
| `ROBOTSTXT_OBEY` | `True` | Respect robots.txt |

Override any setting on the command line with `-s KEY=VALUE`.
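In Scrapy, these defaults are plain module-level assignments. A minimal sketch of the relevant excerpt of `settings.py`, using only the values from the table above (everything else omitted):

```python
# headlines_crawler/settings.py (excerpt) — defaults from the table above

MONGODB_URI = "mongodb://localhost:27017"   # MongoDB connection URI
MONGODB_DATABASE = "headlines_crawler"      # target database
MONGODB_COLLECTION = "articles"             # target collection

DOWNLOAD_DELAY = 2        # seconds between requests, per domain
ROBOTSTXT_OBEY = True     # respect robots.txt
```

A `-s KEY=VALUE` flag on the command line takes precedence over any value set here.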

## Running a spider

```bash
# Crawl G1
scrapy crawl g1

# Crawl Folha de S.Paulo, stop after 100 articles
scrapy crawl folha -s CLOSESPIDER_ITEMCOUNT=100

# Crawl UOL using a custom MongoDB URI
scrapy crawl uol -s MONGODB_URI="mongodb://user:pass@host:27017"

# Run all spiders sequentially (bash)
for spider in g1 folha uol; do scrapy crawl "$spider"; done
```

## Article schema

Each document saved to MongoDB has the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `newspaper` | string | Source name (e.g. `"G1"`) |
| `url` | string | Canonical article URL (unique index) |
| `title` | string | Article headline |
| `subtitle` | string | Subtitle or summary |
| `author` | string | Author name(s) |
| `body` | string | Full article text |
| `tags` | list[string] | Categories / tags |
| `published_at` | ISO 8601 | Original publication datetime |
| `updated_at` | ISO 8601 | Last update datetime |
| `scraped_at` | ISO 8601 | Crawl timestamp |
| `image_url` | string | URL of the hero image |
| `image_caption` | string | Caption for the hero image |

A unique index on `url` prevents duplicate articles. Re-running a spider updates existing documents via upsert.
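The upsert semantics can be illustrated with a stdlib-only sketch. Here a plain dict stands in for the MongoDB collection, and the function name `upsert_article` is hypothetical; the real pipeline would instead issue something like `update_one({"url": ...}, {"$set": ...}, upsert=True)` against the unique `url` index:

```python
from datetime import datetime, timezone

def upsert_article(collection: dict, article: dict) -> None:
    """Insert or update an article keyed by its canonical URL.

    `collection` is a plain dict standing in for the MongoDB collection:
    writing to the same key models an upsert on the unique `url` index.
    """
    article = {**article, "scraped_at": datetime.now(timezone.utc).isoformat()}
    collection[article["url"]] = article  # same URL -> overwrite, never duplicate

articles = {}
doc = {
    "newspaper": "G1",
    "url": "https://g1.globo.com/exemplo.html",
    "title": "Original headline",
}
upsert_article(articles, doc)
upsert_article(articles, {**doc, "title": "Updated headline"})  # re-crawl
```

After the second call the collection still holds a single document for that URL, with the updated title and a fresh `scraped_at` timestamp.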

## Project structure

```text
headlines-crawler/
├── scrapy.cfg
├── requirements.txt
└── headlines_crawler/
    ├── settings.py       # All Scrapy / MongoDB settings
    ├── items.py          # ArticleItem definition
    ├── pipelines.py      # DuplicateFilterPipeline + MongoDBPipeline
    ├── middlewares.py    # Rotating User-Agent + 429-retry middleware
    └── spiders/
        ├── base.py       # Shared helpers
        ├── g1.py         # G1 spider
        ├── folha.py      # Folha de S.Paulo spider
        └── uol.py        # UOL Notícias spider
```

## Adding a new spider

1. Create `headlines_crawler/spiders/<name>.py` inheriting from `BaseNewsSpider`.
2. Set `name`, `allowed_domains`, and `start_urls`.
3. Implement `parse` (link-following) and `parse_article` (data extraction).
4. Yield `ArticleItem` instances — the pipeline handles MongoDB persistence automatically.

## License

MIT
