A Large-Scale Bilingual Dataset for Idiomatic Expression Understanding
Author: Ayman Ali Sharara
Affiliation:
MSc Data Science & Machine Learning (SPOC S21)
DSTI School of Engineering
https://dsti.school/
Project Context:
Deep Learning with Python
Supervised by Prof. Hanna Abi Akl
Contact:
- Academic: ayman.sharara@edu.dsti.institute
- Personal: aymanshar@gmail.com
IdiomX is a research-driven project for building a large-scale bilingual idiom dataset and benchmark for English–Arabic idiom understanding, retrieval, normalization, and generation. The project combines multi-source idiom collection, large language model (LLM)-based enrichment, quality-controlled validation, and downstream deep learning experiments to support idiomatic language research in both monolingual and cross-lingual settings.
The core motivation behind IdiomX is that idioms remain a challenging phenomenon for natural language processing systems because their meanings are often non-literal, context-dependent, and culturally grounded. Existing resources are often small, monolingual, weakly contextualized, or not designed for modern transformer-based learning. IdiomX addresses these limitations by providing a reproducible pipeline that transforms a raw idiom collection into a high-quality bilingual benchmark with contextual examples, semantic annotations, surface-form variation, and evaluation-ready splits.
The objective of IdiomX is to construct a reproducible research pipeline for:
- collecting idioms from multiple linguistic resources
- enriching idiom entries using structured LLM generation
- validating and correcting generated annotations
- building benchmark-ready datasets for multiple NLP tasks
- training and evaluating deep learning models for idiom-related understanding and generation
The project is designed not only as a dataset release but as a complete research framework that supports experimentation, reproducibility, and extension.
IdiomX is designed around the following contributions:
- A large bilingual idiom dataset centered on English idioms with Arabic semantic annotations.
- Contextual expansion of idioms into multiple natural example sentences.
- Canonical idiom normalization and surface-form modeling.
- Cross-lingual semantic annotations supporting Arabic-to-English idiom tasks.
- A structured LLM-based enrichment and verification pipeline.
- A benchmark design covering multiple downstream idiom understanding tasks.
- A reproducible repository structure for collection, enrichment, and deep learning.
The current LLM enrichment stage has been completed successfully.
- Raw idioms: 16,107
- Generated examples per idiom: 4
- Final enriched rows: 64,428
- Automatically valid rows: 63,286
- Verified rows: 658
- Corrected rows: 476
Final dataset file:
data/enriched/idiomx_enriched_final.csv
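The row count reported above follows directly from the number of raw idioms and the number of examples generated per idiom. The minimal sketch below re-derives it from the reported figures only; the constant and function names are illustrative and are not part of the repository.

```python
# Figures copied from the enrichment summary above.
RAW_IDIOMS = 16_107
EXAMPLES_PER_IDIOM = 4
FINAL_ROWS = 64_428
AUTO_VALID_ROWS = 63_286


def expected_rows(raw_idioms: int, examples_per_idiom: int) -> int:
    """Rows expected if every idiom received the same number of examples."""
    return raw_idioms * examples_per_idiom


def share(part: int, total: int) -> float:
    """Fraction of `total` accounted for by `part`."""
    return part / total


# The enriched row count should equal idioms x examples per idiom.
assert expected_rows(RAW_IDIOMS, EXAMPLES_PER_IDIOM) == FINAL_ROWS

# Proportion of rows that passed automatic validation without review.
print(f"automatically valid share: {share(AUTO_VALID_ROWS, FINAL_ROWS):.1%}")
```

The same check could be run against the CSV itself once its column names are known; they are not assumed here.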
## Repository Structure
IdiomX/
│
├── deep_learning/
│ ├── datasets/
│ ├── models/
│ ├── training/
│ ├── evaluation/
│ └── README.md
│
├── data/
│ ├── raw/
│ ├── enriched/
│ └── splits/
│
├── figures/
├── paper/
└── README.md
## Project Workflow
The deep learning module (deep_learning/) prepares task-specific datasets and trains models for multiple benchmark tasks, including:
- idiom detection
- idiom meaning retrieval
- context-to-idiom prediction
- cross-lingual idiom retrieval
- idiom surface-form normalization
A critical design principle is that all train/validation/test splits are created by idiom, not by row, to prevent leakage and memorization.
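A minimal sketch of an idiom-level split is shown below, using only the Python standard library. It is an illustration of the principle, not the repository's actual splitting code: the `idiom` key, the 80/10/10 ratios, and the seed are all assumptions.

```python
import random
from collections import defaultdict


def split_by_idiom(rows, key="idiom", ratios=(0.8, 0.1, 0.1), seed=42):
    """Assign whole idioms (not individual rows) to train/validation/test."""
    # Group rows so that all examples of one idiom stay together.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Sort before shuffling so the result depends only on the seed.
    idioms = sorted(groups)
    random.Random(seed).shuffle(idioms)

    n_train = int(len(idioms) * ratios[0])
    n_val = int(len(idioms) * ratios[1])
    buckets = {
        "train": idioms[:n_train],
        "validation": idioms[n_train:n_train + n_val],
        "test": idioms[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: [row for idiom in members for row in groups[idiom]]
        for name, members in buckets.items()
    }


# Toy data: 10 hypothetical idioms with 4 example sentences each.
rows = [
    {"idiom": f"idiom_{i}", "sentence": f"example {j} for idiom_{i}"}
    for i in range(10)
    for j in range(4)
]
splits = split_by_idiom(rows)

# No idiom may appear in more than one split.
idiom_sets = {name: {r["idiom"] for r in part} for name, part in splits.items()}
assert not idiom_sets["train"] & idiom_sets["validation"]
assert not idiom_sets["train"] & idiom_sets["test"]
assert not idiom_sets["validation"] & idiom_sets["test"]
```

Splitting by row instead would place different example sentences of the same idiom in both train and test, so a model could score well simply by memorizing idioms it has already seen.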
IdiomX supports multiple research tasks:
- Idiom detection: determine whether a phrase is used idiomatically or literally in context.
- Idiom meaning retrieval: predict the semantic meaning of an idiom in English and Arabic.
- Context-to-idiom prediction: predict the most appropriate idiom from a contextual sentence.
- Cross-lingual idiom retrieval: retrieve the correct English idiom from an Arabic sentence or semantic context.
- Idiom surface-form normalization: map contextual idiom surface forms to canonical idiom entries.
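To make the task definitions above concrete, the sketch below derives one (input, target) pair per task from a single enriched record. Everything in it is hypothetical: the field names, the `[MASK]` convention for context-to-idiom prediction, and the example record (including its Arabic gloss) are illustrative and are not taken from the dataset or its schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnrichedRow:
    """Hypothetical record layout; the real CSV columns may differ."""
    canonical: str       # canonical idiom entry
    surface_form: str    # idiom as it appears in the sentence
    sentence: str        # contextual example
    meaning_en: str      # English semantic annotation
    meaning_ar: str      # Arabic semantic annotation
    is_idiomatic: bool   # idiomatic vs. literal usage in this sentence


def task_instances(row: EnrichedRow) -> dict:
    """Return one (input, target) pair per benchmark task."""
    masked = row.sentence.replace(row.surface_form, "[MASK]")
    return {
        "idiom_detection": ((row.sentence, row.surface_form), row.is_idiomatic),
        "meaning_retrieval": (row.canonical, (row.meaning_en, row.meaning_ar)),
        "context_to_idiom": (masked, row.canonical),
        "cross_lingual_retrieval": (row.meaning_ar, row.canonical),
        "surface_form_normalization": (row.surface_form, row.canonical),
    }


# Illustrative record, written for this example only.
row = EnrichedRow(
    canonical="break the ice",
    surface_form="broke the ice",
    sentence="He broke the ice with a joke about the weather.",
    meaning_en="to ease tension at the start of a social interaction",
    meaning_ar="كسر الجمود",
    is_idiomatic=True,
)
pairs = task_instances(row)
for task, (task_input, target) in pairs.items():
    print(task, "|", task_input, "->", target)
```

Keeping all five tasks derivable from one record means a single idiom-level split can be reused across tasks without re-partitioning the data.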
This repository is structured for full reproducibility. Each major module has its own dedicated README.md with:
- overview
- inputs and outputs
- step-by-step execution instructions
- required files
- code entry points
- recommended run order
For environment setup and detailed execution instructions, refer to the module-specific README files.
IdiomX is intended as a benchmark-oriented research artifact rather than only a static dataset. The project is designed to support publication in NLP venues focused on lexical semantics, figurative language, multilingual NLP, and low-resource semantic transfer.
The LLM enrichment stage demonstrates that structured generation and targeted verification can be used to construct a high-quality bilingual idiom resource at scale. The downstream modeling stage is intended to evaluate how well neural architectures can generalize to unseen idioms and cross-lingual contexts.
Current project checkpoint:
- Data collection: completed
- LLM enrichment: completed
- Dataset verification: completed; available at https://github.com/aymanshar/idiomx-dataset
- Deep learning benchmark preparation: in progress
If you use IdiomX in academic work, please cite the associated paper once available.