This study investigates whether comparative learning can serve as an effective alternative to regression-based learning for image-caption rating.
Instead of training directly on numeric human ratings, the comparative model learns from pairwise judgments indicating which of two image-caption pairs is the better match.
Each image-caption pair is encoded using a pretrained ViLBERT model, producing an image embedding and a caption embedding that are concatenated into a unified multimodal vector. Two models are trained on this representation:
- Regression: MSE loss on normalized human ratings
- Comparative: hinge loss on pairwise preference labels `O_ij ∈ {+1, -1}`
Both models share the same dual-encoder multimodal architecture.
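The comparative objective can be sketched as a per-pair hinge loss. This is a minimal illustration only; the function name and the margin of 1.0 are assumptions, and the repo's `compare.py` defines the actual training code.

```python
def pairwise_hinge_loss(score_i, score_j, o_ij, margin=1.0):
    """Hinge loss for one preference judgment.

    o_ij = +1 means pair i should score higher than pair j,
    o_ij = -1 the reverse. (margin=1.0 is an assumption here.)
    """
    return max(0.0, margin - o_ij * (score_i - score_j))

# A preference satisfied by more than the margin contributes zero loss;
# a violated preference is penalized linearly.
print(pairwise_hinge_loss(2.0, 0.0, +1))   # 0.0  (margin met)
print(pairwise_hinge_loss(0.75, 0.25, +1))  # 0.5  (margin not yet met)
print(pairwise_hinge_loss(0.25, 0.75, +1))  # 1.5  (preference violated)
```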
Experiments are conducted on the Validated Image Caption Rating (VICR) dataset:
- 15,646 image-caption pairs
- 68,217 human ratings
- Ratings from 1 to 5
Ratings are normalized to [0, 1] for training.
An 80/20 train-test split is applied.
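The preprocessing described above can be sketched as follows. The helper names and the fixed seed are illustrative assumptions, not the repo's actual code.

```python
import random

def normalize_rating(r, lo=1.0, hi=5.0):
    """Map a raw 1-5 rating onto [0, 1]."""
    return (r - lo) / (hi - lo)

def train_test_split(items, test_frac=0.2, seed=0):
    """Shuffle and split into train/test (80/20 by default)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_frac))
    return items[:cut], items[cut:]

print([normalize_rating(r) for r in [1, 3, 5]])  # [0.0, 0.5, 1.0]
```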
Sample data is included in the data/ directory.
```
comparative_image_caption/
├── src/                  # RQ1 and RQ2
│   ├── embeddings/
│   ├── embeddings_serialize.py
│   ├── regression.py
│   ├── compare.py
│   └── compare_same_image.py
├── code/                 # RQ3 human evaluation analysis
│   ├── checkpoint/
│   ├── comparative_acc.py
│   ├── comparative_agreement.py
│   ├── comparative_interrater_agreement.py
│   ├── compute_averages.py
│   └── task1_agreement_metrics.py
├── results/
│   ├── human_subject/
│   ├── image-caption_GT_VS_Human_Rating.xlsx
│   ├── Qualtrics_HumanEval_Results_Timing_vFinal.csv
│   ├── baseline.csv
│   ├── compare.csv
│   └── compare_same_image.csv
└── README.md
```
- `embeddings_serialize.py` — generate and serialize multimodal embeddings
- `regression.py` — train/evaluate the regression baseline
- `compare.py` — train the comparative model with hinge loss (RQ1)
- `compare_same_image.py` — same-image caption preference (RQ2)
```
python src/embeddings_serialize.py
```

Generates and serializes ViLBERT multimodal embeddings for all image-caption pairs. Output is saved to `src/embeddings/`.
```
python src/regression.py
python src/compare.py
```

Results (Pearson rho, Spearman r_s, Kendall tau_c) are saved to `results/baseline.csv` and `results/compare.csv`.
```
python src/compare_same_image.py
```

Evaluates which of two captions better describes a given image. Results are saved to `results/compare_same_image.csv`.
Raw annotation data from Qualtrics is in `results/Qualtrics_HumanEval_Results_Timing_vFinal.csv` and `results/image-caption_GT_VS_Human_Rating.xlsx`. To compute agreement metrics:
```
python code/task1_agreement_metrics.py
python code/comparative_interrater_agreement.py
python code/compute_averages.py
```

| Model | rho | r_s | tau_c |
|---|---|---|---|
| Narins et al. (2024) | -- | -- | 0.758 +/- 0.03 |
| Regression (ours) | 0.908 +/- 0.001 | 0.887 +/- 0.001 | 0.811 +/- 0.001 |
| Comparative (ours) | 0.874 +/- 0.008 | 0.880 +/- 0.002 | 0.800 +/- 0.002 |
| Model | Accuracy |
|---|---|
| Regression | 0.857 +/- 0.004 |
| Comparative | 0.846 +/- 0.009 |
| Same Image | 0.848 +/- 0.004 |
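Pairwise accuracy as reported above can be computed by checking whether the model's score ordering reproduces each human preference label. A minimal sketch with a hypothetical helper name and toy data:

```python
def pairwise_accuracy(scores_i, scores_j, labels):
    """Fraction of pairs where the score ordering matches the
    human preference label O_ij in {+1, -1}."""
    correct = sum(
        1 for si, sj, o in zip(scores_i, scores_j, labels)
        if (1 if si > sj else -1) == o
    )
    return correct / len(labels)

# 2 of 3 preferences recovered by the score ordering.
print(pairwise_accuracy([0.9, 0.3, 0.6], [0.4, 0.7, 0.5], [+1, -1, -1]))
```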
| Task | p_o | kappa |
|---|---|---|
| Direct Rating | 0.85 | 0.69 |
| Pairwise (different images) | 0.95 | 0.85 |
| Same-Image Comparison | 0.90 | 0.78 |
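The table reports observed agreement `p_o` alongside Cohen's kappa, which corrects `p_o` for chance agreement. A minimal sketch of the relationship; the expected-agreement value `p_e = 0.5` below is a hypothetical (e.g. two raters picking uniformly between two options), not a number from this study.

```python
def cohens_kappa(p_o, p_e):
    """Cohen's kappa: observed agreement p_o corrected for the
    agreement p_e expected by chance."""
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical: p_o = 0.95 with chance agreement p_e = 0.5.
print(round(cohens_kappa(0.95, 0.5), 2))  # 0.9
```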
Supported by NSF Grant No. 2245796.

