IdiomX: A Large Bilingual Benchmark for Idiom Understanding, Retrieval, and Generation

IdiomX

A Large-Scale Bilingual Dataset for Idiomatic Expression Understanding

Author: Ayman Ali Sharara

Affiliation:
MSc Data Science & Machine Learning (SPOC S21)
DSTI School of Engineering
https://dsti.school/

Project Context:
Deep Learning with Python
Supervised by Prof. Hanna Abi Akl

Contact:

Academic: ayman.sharara@edu.dsti.institute
Personal: aymanshar@gmail.com

## Introduction

IdiomX is a research-driven project for building a large-scale bilingual idiom dataset and benchmark for English–Arabic idiom understanding, retrieval, normalization, and generation. The project combines multi-source idiom collection, large language model (LLM)-based enrichment, quality-controlled validation, and downstream deep learning experiments to support idiomatic language research in both monolingual and cross-lingual settings.

Idioms remain difficult for natural language processing systems because their meanings are often non-literal, context-dependent, and culturally grounded. Existing resources are often small, monolingual, weakly contextualized, or not designed for modern transformer-based learning. IdiomX addresses these limitations with a reproducible pipeline that transforms a raw idiom collection into a high-quality bilingual benchmark with contextual examples, semantic annotations, surface-form variation, and evaluation-ready splits.

## Research Objective

The objective of IdiomX is to construct a reproducible research pipeline for:

- collecting idioms from multiple linguistic resources
- enriching idiom entries using structured LLM generation
- validating and correcting generated annotations
- building benchmark-ready datasets for multiple NLP tasks
- training and evaluating deep learning models for idiom-related understanding and generation

The project is designed not only as a dataset release but as a complete research framework that supports experimentation, reproducibility, and extension.

## Main Contributions

IdiomX is designed around the following contributions:

- A large bilingual idiom dataset centered on English idioms with Arabic semantic annotations.
- Contextual expansion of idioms into multiple natural example sentences.
- Canonical idiom normalization and surface-form modeling.
- Cross-lingual semantic annotations supporting Arabic-to-English idiom tasks.
- A structured LLM-based enrichment and verification pipeline.
- A benchmark design covering multiple downstream idiom understanding tasks.
- A reproducible repository structure for collection, enrichment, and deep learning.

## Final Dataset Snapshot

The LLM enrichment stage is complete. The current figures are:

- Raw idioms: 16,107
- Generated examples per idiom: 4
- Final enriched rows: 64,428
- Automatically valid rows: 63,286
- Verified rows: 658
- Corrected rows: 476
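
As a quick sanity check on the figures above, the sketch below recomputes the row total from the idiom count and the number of examples per idiom, then derives the share of rows that passed automatic validation. The variable names are illustrative, and the snippet makes no assumption about how the verified and corrected counts relate to the remaining rows.

```python
# Figures copied from the snapshot above; variable names are illustrative.
raw_idioms = 16_107
examples_per_idiom = 4
enriched_rows = 64_428
auto_valid_rows = 63_286

# Each idiom is expanded into a fixed number of example sentences.
expected_rows = raw_idioms * examples_per_idiom
assert expected_rows == enriched_rows

# Share of rows that passed automatic validation, and the remainder.
auto_valid_share = auto_valid_rows / enriched_rows
not_auto_valid = enriched_rows - auto_valid_rows

print(f"expected rows: {expected_rows:,}")
print(f"automatically valid: {auto_valid_share:.1%}")
print(f"rows not automatically valid: {not_auto_valid:,}")
```

The row total matches 16,107 × 4, and roughly 98.2% of rows passed automatic validation, leaving 1,142 rows outside that category.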

Final dataset file:

data/enriched/idiomx_enriched_final.csv
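
A minimal sketch for inspecting this file is shown below. It assumes a standard comma-separated file with a header row; the column names are not documented here, so the helper (a hypothetical `summarize_csv`) only reports whatever header it finds. The in-memory sample uses made-up columns so the snippet runs without the real data.

```python
import csv
import io


def summarize_csv(handle):
    """Return (column names, number of data rows) for an open CSV file."""
    reader = csv.DictReader(handle)
    row_count = sum(1 for _ in reader)
    # fieldnames stays None when the file is completely empty.
    return list(reader.fieldnames or []), row_count


# Tiny in-memory stand-in with made-up columns (not the real schema).
sample = io.StringIO("col_a,col_b\n1,x\n2,y\n")
columns, n_rows = summarize_csv(sample)
print(columns, n_rows)

# On the real file (UTF-8 encoding is an assumption):
# with open("data/enriched/idiomx_enriched_final.csv", newline="", encoding="utf-8") as f:
#     columns, n_rows = summarize_csv(f)
```

On the real file, the data row count would be expected to match the 64,428 enriched rows reported above.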

## Repository Structure

IdiomX/
│
├── deep_learning/
│   ├── datasets/
│   ├── models/
│   ├── training/
│   ├── evaluation/
│   └── README.md
│
├── data/
│   ├── raw/
│   ├── enriched/
│   └── splits/
│
├── figures/
├── paper/
└── README.md


## Project Workflow

### Deep Learning

This module prepares task-specific datasets and trains models for multiple benchmark tasks, including:

- idiom detection
- idiom meaning retrieval
- context-to-idiom prediction
- cross-lingual idiom retrieval
- idiom surface-form normalization

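A critical design principle is that all train/validation/test splits are created by idiom, not by row. Because each idiom contributes several example rows, a row-level split would place sentences for the same idiom in both training and test data, which would leak information and reward memorization rather than generalization to unseen idioms.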
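
The idea of splitting by idiom can be sketched as follows. This is an illustration rather than the project's actual splitting code: the `idiom` column name, the 80/10/10 ratios, the seed, and the function name `split_by_idiom` are all assumptions.

```python
import random
from collections import defaultdict


def split_by_idiom(rows, key="idiom", ratios=(0.8, 0.1, 0.1), seed=42):
    """Assign whole idioms, not individual rows, to train/validation/test."""
    # Group rows by idiom so all examples of one idiom stay together.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Shuffle the idiom keys deterministically.
    idioms = sorted(groups)
    random.Random(seed).shuffle(idioms)

    # Cut the shuffled idiom list; the test split takes the remainder.
    n_train = int(len(idioms) * ratios[0])
    n_val = int(len(idioms) * ratios[1])
    buckets = {
        "train": idioms[:n_train],
        "validation": idioms[n_train:n_train + n_val],
        "test": idioms[n_train + n_val:],
    }
    return {
        name: [row for idiom in keys for row in groups[idiom]]
        for name, keys in buckets.items()
    }


# Toy data: 10 made-up idioms with 4 example rows each.
rows = [
    {"idiom": f"idiom_{i}", "example": f"sentence {j}"}
    for i in range(10)
    for j in range(4)
]
splits = split_by_idiom(rows)

# No idiom may appear in more than one split.
seen = {name: {r["idiom"] for r in part} for name, part in splits.items()}
assert not (seen["train"] & seen["validation"])
assert not (seen["train"] & seen["test"])
assert not (seen["validation"] & seen["test"])
```

For pipelines built on scikit-learn, `GroupShuffleSplit` provides a comparable group-aware split.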

## Benchmark Tasks

IdiomX supports multiple research tasks:

### Idiom Detection

Determine whether a phrase is used idiomatically or literally in context.

### Idiom Meaning Retrieval

Predict the semantic meaning of an idiom in English and Arabic.

### Context-to-Idiom Prediction

Predict the most appropriate idiom from a contextual sentence.

### Cross-Lingual Idiom Retrieval

Retrieve the correct English idiom from an Arabic sentence or semantic context.
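
Retrieval tasks such as this one are commonly scored with ranking metrics. The sketch below shows recall@k and mean reciprocal rank (MRR) on made-up rankings. The choice of metrics is an assumption, since this README does not state which ones the benchmark uses, and the function names are illustrative.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of queries whose gold idiom appears in the top-k candidates."""
    hits = sum(1 for cands, g in zip(ranked, gold) if g in cands[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked, gold):
    """Average of 1/rank of the gold idiom; a missing gold idiom adds 0."""
    total = 0.0
    for cands, g in zip(ranked, gold):
        if g in cands:
            total += 1.0 / (cands.index(g) + 1)
    return total / len(gold)


# Made-up system rankings for three queries (best candidate first).
ranked = [
    ["break the ice", "spill the beans", "hit the sack"],
    ["hit the sack", "break the ice", "spill the beans"],
    ["break the ice", "hit the sack", "spill the beans"],
]
# The correct idiom for each query.
gold = ["break the ice", "break the ice", "spill the beans"]

print(recall_at_k(ranked, gold, k=1))
print(recall_at_k(ranked, gold, k=2))
print(mean_reciprocal_rank(ranked, gold))
```

In this toy example the gold idiom sits at ranks 1, 2, and 3, so recall@1 is 1/3, recall@2 is 2/3, and MRR is 11/18.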

### Idiom Surface Normalization

Map contextual idiom surface forms to canonical idiom entries.
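
To make the task concrete, here is a naive string-similarity baseline built on Python's standard `difflib` module. It is not the project's method, which targets learned models; the function name, the similarity cutoff, and the lowercase canonical list are assumptions made for illustration.

```python
import difflib


def normalize_surface_form(surface, canonical_idioms, cutoff=0.6):
    """Return the closest canonical idiom by string similarity, or None."""
    matches = difflib.get_close_matches(
        surface.lower(), canonical_idioms, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Made-up canonical entries, assumed to be stored in lowercase.
canonical = ["kick the bucket", "spill the beans", "break the ice"]

print(normalize_surface_form("kicked the bucket", canonical))
print(normalize_surface_form("Spilling the beans", canonical))
```

A baseline like this handles simple inflection but would likely fail on inserted words, reordering, or pronoun substitution, which is where a trained normalization model becomes useful.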

## Reproducibility

This repository is structured for full reproducibility. Each major module has its own dedicated README.md with:

- overview
- inputs and outputs
- step-by-step execution instructions
- required files
- code entry points
- recommended run order

For environment setup and detailed execution instructions, refer to the module-specific README files.

## Research Perspective

IdiomX is intended as a benchmark-oriented research artifact rather than only a static dataset. The project is designed to support publication in NLP venues focused on lexical semantics, figurative language, multilingual NLP, and low-resource semantic transfer.

The LLM enrichment stage demonstrates that structured generation and targeted verification can be used to construct a high-quality bilingual idiom resource at scale. The downstream modeling stage is intended to evaluate how well neural architectures can generalize to unseen idioms and cross-lingual contexts.

## Status

Current project checkpoint:

- Data collection: completed
- LLM enrichment: completed
- Dataset verification: completed (see https://github.com/aymanshar/idiomx-dataset)
- Deep learning benchmark preparation: in progress

## Citation

If you use IdiomX in academic work, please cite the associated paper once available.
