A high-throughput Python scraper and data transformation engine for large-scale linguistic datasets.
OED Word Parser is an end-to-end ETL pipeline designed to extract, normalize, and store the vast lexicon of the Oxford English Dictionary. Beyond simple scraping, this project implements robust error-handling, automated retries for network instability, and a flexible data conversion layer to output web-ready JSON and relational SQL data.
The system is architected as a modular three-stage pipeline:
- Extraction (oed-parser.py): A multi-threaded-capable scraper that navigates the OED search index. It manages state to allow resuming from specific page indices.
- Transformation (convert-data.py): A normalization engine that cleans "messy" HTML snippets, removes duplicates via dictionary mapping, and formats data into JSON, CSV, or TXT.
- Loading (load-data-mysql.py): A high-speed database injector utilizing LOAD DATA LOCAL INFILE to bulk-load the CSV in a single statement, significantly faster than issuing standard INSERT statements row by row for 400k+ rows.
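As a rough illustration of the loading stage, the sketch below builds a LOAD DATA LOCAL INFILE statement for a quoted CSV file. It is not taken from load-data-mysql.py: the helper name `build_load_statement`, the table name `words`, the column names, and the assumption that the CSV has a header row are all hypothetical.

```python
from typing import Sequence


def build_load_statement(csv_path: str, table: str, columns: Sequence[str]) -> str:
    """Return a MySQL LOAD DATA LOCAL INFILE statement for a quoted CSV file."""
    column_list = ", ".join(f"`{name}`" for name in columns)
    # Escape backslashes and single quotes so the path is a valid SQL string literal.
    safe_path = csv_path.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"LOAD DATA LOCAL INFILE '{safe_path}'\n"
        f"INTO TABLE `{table}`\n"
        # Fields may be wrapped in double quotes; a doubled quote inside a
        # quoted field is read back as a single quote character.
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n"
        "LINES TERMINATED BY '\\n'\n"
        # Assumption: the first line of the CSV is a header row.
        "IGNORE 1 LINES\n"
        f"({column_list})"
    )


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    print(build_load_statement(
        "data/english-words.csv", "words", ["word", "part_of_speech", "page"]
    ))
```

Running the statement requires local infile support to be enabled on both the MySQL server (the `local_infile` system variable) and the client connection; otherwise the server rejects it.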
Scraping 400,000+ entries takes time, making the script vulnerable to network timeouts and 502 Bad Gateway errors.
- Backoff Strategy: Implements a customizable --error-delay (default 60s) and --max-retries to wait out temporary server bans or IP rate-limits.
- Dynamic Exit Logic: The parser intelligently detects "Empty Result" states to prevent infinite polling.
- Quote Sanitization: Implements a custom delimiter strategy and double-quote escaping to ensure snippets containing complex punctuation don't break the CSV structure.
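The retry and exit behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in oed-parser.py: the function names, the fixed (non-exponential) delay, and the `fetch_page` callable are assumptions, with parameters named after the --error-delay, --max-retries, --starting-page, and --max-pages options.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_retries: int = 3,
    error_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on failure wait error_delay seconds and retry, up to max_retries times."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception as exc:  # a real scraper would catch only network/HTTP errors
            last_error = exc
            if attempt < max_retries:
                sleep(error_delay)
    raise RuntimeError(f"giving up after {max_retries} retries") from last_error


def scrape_pages(
    fetch_page: Callable[[int], List[str]],
    starting_page: int = 1,
    max_pages: Optional[int] = None,
    **retry_options,
) -> List[str]:
    """Collect entries page by page, stopping at the first empty result."""
    words: List[str] = []
    page = starting_page
    while max_pages is None or page < starting_page + max_pages:
        entries = fetch_with_retries(lambda: fetch_page(page), **retry_options)
        if not entries:
            # An empty page means the index is exhausted, so stop polling.
            break
        words.extend(entries)
        page += 1
    return words
```

Passing `sleep` as a parameter keeps the delay testable; in normal use the default `time.sleep` applies.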
The repository includes the result of a full 439,362-word scrape:
- english-words.json: Pretty-printed JSON for easy inspection.
- english-words.min.json: Minified version optimized for production web use.
- english-words.csv: Database-ready format including Part of Speech and Page indexing.
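To show how the three output formats relate, here is a small sketch that removes duplicates with a dictionary and serializes the result as pretty-printed JSON, minified JSON, and fully quoted CSV. The field names (`word`, `pos`, `page`), the deduplication key, and the function names are assumptions made for illustration; the actual schema in convert-data.py may differ.

```python
import csv
import io
import json
from typing import Dict, List

FIELDS = ["word", "pos", "page"]  # hypothetical column names


def dedupe(entries: List[Dict]) -> List[Dict]:
    """Keep the first entry seen for each (word, part-of-speech) pair."""
    unique: Dict[tuple, Dict] = {}
    for entry in entries:
        unique.setdefault((entry["word"], entry["pos"]), entry)
    return list(unique.values())  # dicts preserve insertion order


def to_pretty_json(entries: List[Dict]) -> str:
    return json.dumps(entries, indent=2, ensure_ascii=False)


def to_min_json(entries: List[Dict]) -> str:
    # Compact separators drop the whitespace json.dumps adds by default.
    return json.dumps(entries, separators=(",", ":"), ensure_ascii=False)


def to_csv(entries: List[Dict]) -> str:
    buffer = io.StringIO(newline="")
    # QUOTE_ALL wraps every field; embedded double quotes are doubled ("").
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {"word": "set", "pos": "n.", "page": 1},
        {"word": "set", "pos": "n.", "page": 2},  # duplicate key, dropped
        {"word": 'say "hi", then', "pos": "v.", "page": 3},
    ]
    unique = dedupe(sample)
    print(to_pretty_json(unique))
    print(to_min_json(unique))
    print(to_csv(unique))
```

Quoting every field is one simple way to keep commas and quotation marks inside snippets from breaking the CSV structure; a custom delimiter is another option.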
- Python 3.8+
- MySQL (Optional, for database integration)
git clone https://github.com/ronbodnar/oxford-english-dictionary-parser.git
cd oxford-english-dictionary-parser
pip install -r requirements.txt
Extract 100 pages with a safety delay to prevent IP flagging:
python src/oed-parser.py --request-delay 2 --max-pages 100 --output-file data/output.txt
Continue extracting from page 100:
python src/oed-parser.py --request-delay 2 --starting-page 100 --output-file data/output.txt
Convert the results to JSON:
python src/convert-data.py -i data/output.txt -o data/output.json -f json
Created by Ron Bodnar
- LinkedIn: linkedin.com/in/ronbodnar
- Portfolio: ronbodnar.com
Distributed under the MIT License. See LICENSE for more information.