A machine learning project that identifies objects in images based on natural language descriptions (e.g., "the lady with the blue shirt").
This project implements a "Words as Classifiers" approach to solve referring expression comprehension. Given an image and a text description, the model identifies which object in the image the description refers to.
Reference ID: 35254 (Annotation ID: 275551)
Referring Expressions:
- "white brown sheep right"
- "black sheep on right"
- "sheep on the right"
All three expressions refer to the same object (highlighted with a green bounding box).
Referring Expression Comprehension is the task of:
- Taking an image containing multiple objects
- Receiving a natural language referring expression
- Identifying the correct object that matches the expression
- Feature Extraction: Visual features (VGG19 or CLIP) + positional features (bounding box, area, etc.)
- Training: Train one binary classifier per word using positive/negative examples
- Inference: Multiply the per-word probabilities and select the object with the highest combined score
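As a concrete illustration of the positional features mentioned above, here is a minimal sketch (the function name and the exact feature set are assumptions; the scripts may use a different layout):

```python
import numpy as np

def positional_features(bbox, image_w, image_h):
    """Positional features for one candidate object.
    Illustrative sketch only; the actual scripts may compute a
    different set. bbox is (x, y, w, h) in pixels, as in MSCOCO."""
    x, y, w, h = bbox
    return np.array([
        x / image_w,                    # normalized left edge
        y / image_h,                    # normalized top edge
        (x + w) / image_w,              # normalized right edge
        (y + h) / image_h,              # normalized bottom edge
        (w * h) / (image_w * image_h),  # relative area of the box
    ])
```

The resulting vector is concatenated with the CNN (VGG19 or CLIP) features to form the object's full feature vector.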
- Python 3.7+
- CUDA-capable GPU (recommended)
pip install numpy tensorflow scikit-learn scikit-image pillow matplotlib tqdm
# OR for CLIP version:
pip install torch torchvision openai-clip scikit-learn scikit-image pillow tqdm
Download the RefCOCO dataset (~12GB):
bash download_coco.sh
Or manually download from:
- MSCOCO images: http://images.cocodataset.org/zips/
- RefCOCO annotations: http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip
Organize as: coco/images/mscoco/images/ and coco/refcoco/
VGG19 version:
python grounded_semantics.py
CLIP version:
python grounded_semantics_clip.py
The scripts train the word classifiers and automatically evaluate on the validation and test sets.
Feature Extraction:
- For each training reference, extract visual features (CNN) and positional features
- Combine into a single feature vector
Collect Positive Examples:
- For each word in each sentence, add the object's feature vector as a positive example
- Example: Object described as "blue shirt" → adds features to both "blue" and "shirt" classifiers
Generate Negative Examples:
- For each word, sample objects associated with other words as negatives
- Ratio: 3 negative examples per positive example
Train Classifiers:
- Train one Logistic Regression classifier per word
- Each classifier learns to distinguish positive from negative examples
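The training steps above can be sketched roughly as follows (all names, the feature format, and the exact sampling details here are illustrative assumptions, not the scripts' actual code):

```python
import random
from collections import defaultdict
from sklearn.linear_model import LogisticRegression

NEG_RATIO = 3  # 3 negative examples per positive, as described above

def train_word_classifiers(references, seed=0):
    """references: list of (expression, feature_vector) pairs, where the
    feature vector describes the referred object. Illustrative sketch."""
    rng = random.Random(seed)

    # Collect positive examples: every word in an expression gets the
    # referred object's features as a positive example.
    positives = defaultdict(list)
    for expression, features in references:
        for word in expression.lower().split():
            positives[word].append(features)

    all_features = [f for _, f in references]
    classifiers = {}
    for word, pos in positives.items():
        # Sample negatives from objects the word's expressions did not
        # refer to (identity check is enough for this toy setup).
        pool = [f for f in all_features if not any(f is p for p in pos)]
        if not pool:
            continue
        neg = [rng.choice(pool) for _ in range(NEG_RATIO * len(pos))]
        X = pos + neg
        y = [1] * len(pos) + [0] * len(neg)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, y)  # one binary classifier per word
        classifiers[word] = clf
    return classifiers
```

Each trained classifier can then score how well an arbitrary object's features match its word.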
Inference:
- Extract features for all objects in the image
- For each referring expression:
  - Compute the probability P(word | object) for each word
  - Multiply the word probabilities: P(blue) × P(shirt) × P(lady)
  - Select the object with the highest combined probability
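The inference steps can be sketched as follows (illustrative; `resolve_expression` and its argument layout are assumptions, not taken from the scripts):

```python
import numpy as np

def resolve_expression(expression, object_features, classifiers):
    """Score every candidate object against the expression and return
    the index of the best match. Words without a trained classifier
    are skipped. Illustrative sketch of the product-of-probabilities
    scoring described above."""
    scores = np.ones(len(object_features))
    for word in expression.lower().split():
        clf = classifiers.get(word)
        if clf is None:
            continue  # unseen word: contributes no evidence
        # probability of the positive class for each candidate object
        probs = clf.predict_proba(object_features)[:, 1]
        scores *= probs
    return int(np.argmax(scores))
```

Because probabilities are multiplied, a single word with a near-zero score can veto a candidate; longer expressions therefore tend to be more discriminative.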
grounded-semantics/
├── refer.py # REFER API for dataset access
├── grounded_semantics.py # VGG19 implementation
├── grounded_semantics_clip.py # CLIP implementation
└── download_coco.sh # Dataset download script
- Feature Extraction: VGG19 (TensorFlow) or CLIP (PyTorch)
- Classifiers: Logistic Regression (scikit-learn)
- Datasets: MSCOCO + RefCOCO
- REFER Dataset: https://github.com/lichengunc/refer
- MSCOCO Dataset: https://cocodataset.org/
