Oxford English Dictionary ETL Pipeline

A high-throughput Python scraper and data transformation engine for large-scale linguistic datasets.

OED Word Parser is an end-to-end ETL pipeline designed to extract, normalize, and store the vast lexicon of the Oxford English Dictionary. Beyond simple scraping, this project implements robust error-handling, automated retries for network instability, and a flexible data conversion layer to output web-ready JSON and relational SQL data.

🏗️ The ETL Pipeline

The system is architected as a modular three-stage pipeline:

Extraction (oed-parser.py): A multi-threaded-capable scraper that navigates the OED search index. It manages state to allow resuming from specific page indices.
Transformation (convert-data.py): A normalization engine that cleans "messy" HTML snippets, removes duplicates via dictionary mapping, and formats data into JSON, CSV, or TXT.
Loading (load-data-mysql.py): A bulk loader that uses MySQL's LOAD DATA LOCAL INFILE to import the CSV in a single statement. This avoids per-row statement overhead and is typically much faster than row-by-row INSERT statements for 400k+ rows.
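
The "removes duplicates via dictionary mapping" step of the Transformation stage can be sketched roughly as follows. This is a minimal illustration and not the code in convert-data.py: the function name dedupe_entries, the (word, part of speech) row shape, and the case-insensitive key are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str]  # hypothetical row shape: (word, part_of_speech)


def dedupe_entries(rows: Iterable[Row]) -> List[Row]:
    """Keep the first occurrence of each (word, part_of_speech) pair.

    A dict keyed on the normalized pair drops repeats in one pass and
    preserves first-seen order (dicts keep insertion order in Python 3.7+).
    """
    seen: Dict[Row, Row] = {}
    for word, pos in rows:
        word, pos = word.strip(), pos.strip()
        key = (word.lower(), pos.lower())  # assumption: compare case-insensitively
        seen.setdefault(key, (word, pos))  # later duplicates are ignored
    return list(seen.values())


if __name__ == "__main__":
    sample = [("run", "verb"), ("Run ", "verb"), ("run", "noun")]
    for entry in dedupe_entries(sample):
        print(entry)
```

Keying on both the word and its part of speech keeps homographs such as the noun and verb senses of the same spelling as separate rows; keying on the word alone would collapse them.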

🧠 Resilience & Error Handling

Scraping 400,000+ entries takes time, making the script vulnerable to network timeouts and 502 Bad Gateway errors.

Backoff Strategy: Implements a customizable --error-delay (default 60s) and --max-retries to wait out temporary server bans or IP rate-limits.
Dynamic Exit Logic: The parser stops when a page returns an "Empty Result" state, which prevents infinite polling.
Quote Sanitization: Implements a custom delimiter strategy and double-quote escaping to ensure snippets containing complex punctuation don't break the CSV structure.
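
The retry and exit behavior described above can be sketched as below. This is a simplified illustration under stated assumptions, not the logic in oed-parser.py: the names fetch_with_retries and scrape, the injected fetch callable, the choice to catch OSError, and the default of 3 retries are hypothetical. Only the option names (--error-delay, --max-retries, --starting-page, --max-pages) and the 60s default delay come from this README.

```python
import time
from typing import Callable, List, Optional


def fetch_with_retries(
    fetch: Callable[[int], List[str]],
    page: int,
    max_retries: int = 3,          # assumption: the real default is not documented here
    error_delay: float = 60.0,     # mirrors the documented --error-delay default
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(page); on a network error, wait error_delay seconds and retry."""
    attempt = 0
    while True:
        try:
            return fetch(page)
        except OSError:  # ConnectionError and TimeoutError are subclasses of OSError
            attempt += 1
            if attempt > max_retries:
                raise  # retries exhausted: surface the original error
            sleep(error_delay)


def scrape(
    fetch: Callable[[int], List[str]],
    starting_page: int = 1,
    max_pages: Optional[int] = None,
    **retry_options,
) -> List[str]:
    """Collect entries page by page until max_pages is reached or a page is empty."""
    words: List[str] = []
    page = starting_page
    while max_pages is None or page < starting_page + max_pages:
        entries = fetch_with_retries(fetch, page, **retry_options)
        if not entries:  # empty result: stop instead of polling forever
            break
        words.extend(entries)
        page += 1
    return words
```

Passing sleep as a parameter keeps the sketch testable without real waiting. Note that this waits a fixed delay between attempts; an exponential variant would multiply error_delay after each failure.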

📊 Data Outputs

The repository includes the result of a full 439,362-word scrape:

english-words.json: Pretty-printed JSON for easy inspection.
english-words.min.json: Minified version optimized for production web use.
english-words.csv: Database-ready format including Part of Speech and Page indexing.
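
The difference between the pretty-printed and minified JSON files can be reproduced with the standard json module, as in the sketch below. The record fields ("word", "pos") and the helper names are assumptions for illustration; the actual schema of english-words.json is not described in this README.

```python
import json
from typing import Any


def to_pretty(data: Any) -> str:
    """Indented output that is easy to read and diff."""
    return json.dumps(data, indent=2, ensure_ascii=False)


def to_minified(data: Any) -> str:
    """Compact output: no indentation and no spaces after separators."""
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical records; the real files may use different keys.
    words = [
        {"word": "aardvark", "pos": "noun"},
        {"word": "abandon", "pos": "verb"},
    ]
    print(to_pretty(words))
    print(to_minified(words))
```

Both forms parse back to the same data; the minified one only drops whitespace, which reduces transfer size for web use.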

🚦 Getting Started

Prerequisites

Python 3.8+
MySQL (Optional, for database integration)

Installation

git clone https://github.com/ronbodnar/oxford-english-dictionary-parser.git

cd oxford-english-dictionary-parser

pip install -r requirements.txt

Execution Example

Extract 100 pages with a safety delay to prevent IP flagging:

python src/oed-parser.py --request-delay 2 --max-pages 100 --output-file data/output.txt

Continue extracting from page 100:

python src/oed-parser.py --request-delay 2 --starting-page 100 --output-file data/output.txt

Convert the results to JSON:

python src/convert-data.py -i data/output.txt -o data/output.json -f json

📫 Connect

Created by Ron Bodnar

LinkedIn: linkedin.com/in/ronbodnar
Portfolio: ronbodnar.com

⚖️ License

Distributed under the MIT License. See LICENSE for more information.
