This repository contains our solution to the Kaggle competition of the Advanced Machine Learning course in the Master's degree in Data Science.
| STUDENT | ID |
|---|---|
| Luca De Ruggiero | 2174783 |
| Elena Di Grigoli | 2011814 |
| Fabrizio Ferrara | 2207087 |
| Flavio Mangione | 2201201 |
Our challenge was to solve an image–text retrieval task, where the goal is to generate caption embeddings that maximize the Mean Reciprocal Rank (MRR) when matched against ground-truth image embeddings, while also keeping the model as lightweight and efficient as possible.
For this challenge we work with two specific datasets:
`train.npz` and `test_clean.npz`.
The `train.npz` file consists of 125k captions associated with 25k unique images, while `test_clean.npz` contains 1,500 captions used for inference and for scoring in the Kaggle competition.
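A quick way to see what each archive holds is to list its arrays and their shapes. The key names inside the real `.npz` files are not documented here, so the sketch below builds a tiny stand-in archive (with assumed keys `captions` and `images`) just to show the inspection pattern:

```python
import io
import numpy as np

# Stand-in archive; the real key names in train.npz may differ, so always
# check archive.files before indexing.
buf = io.BytesIO()
np.savez(buf,
         captions=np.zeros((5, 1024), dtype=np.float32),
         images=np.zeros((1, 1536), dtype=np.float32))
buf.seek(0)

archive = np.load(buf)
print(archive.files)                 # names of the stored arrays
for key in archive.files:
    print(key, archive[key].shape)   # per-array shapes
```

For the actual data, replace the in-memory buffer with `np.load("Data/train.npz")`.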
Model performance is measured using Mean Reciprocal Rank (MRR). Each test caption's predicted embedding is compared against the gallery image embeddings in batches of 100, and the rank of the correct image within its batch is used to compute the reciprocal rank.
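The batched MRR computation can be sketched as follows. This is a minimal numpy version, assuming cosine similarity as the matching score and that `true_idx[i]` gives the position of caption *i*'s correct image within its own batch of 100 candidates:

```python
import numpy as np

def mean_reciprocal_rank(caption_emb, image_emb, true_idx, batch_size=100):
    # L2-normalise so the dot product equals cosine similarity
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    g = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    reciprocal_ranks = []
    for start in range(0, len(c), batch_size):
        cb = c[start:start + batch_size]   # captions in this batch
        gb = g[start:start + batch_size]   # their candidate gallery images
        sims = cb @ gb.T                   # (b, b) similarity matrix
        correct = sims[np.arange(len(cb)), true_idx[start:start + batch_size]]
        # rank = 1 + number of candidates scored strictly higher
        ranks = 1 + (sims > correct[:, None]).sum(axis=1)
        reciprocal_ranks.extend(1.0 / ranks)
    return float(np.mean(reciprocal_ranks))
```

As a sanity check, scoring the gallery against itself gives a perfect MRR of 1.0, since every caption's "prediction" is exactly its target embedding.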
```
├── Data/                       # Dataset folder
│   ├── test_clean.npz
│   └── train.npz
├── Utils and Functions/
│   ├── metrics.py              # Metric definitions
│   └── eval_2.py               # Evaluation function for validation
├── Challenge_Notebook.ipynb    # Notebook for the submission
└── README.md
```
The model is a translator that maps 1024-dimensional text embeddings into the 1536-dimensional DINOv2 image-embedding space. It uses a two-block encoder with LayerNorm, GELU, and dropout for regularization, followed by a decoder that projects back to the target embedding dimension. A learnable temperature parameter (`logit_scale`) is included for contrastive alignment. The total number of trainable parameters of the full architecture is reported below.
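The forward pass of such a translator can be sketched in plain numpy (inference mode, so dropout is the identity). The hidden width of 2048, the weight initialization, and the CLIP-style temperature init are assumptions for illustration, not the repository's actual hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def block(x, w, b):
    # one encoder block: Linear -> LayerNorm -> GELU (dropout skipped at inference)
    return gelu(layer_norm(x @ w + b))

D_TEXT, D_HID, D_IMG = 1024, 2048, 1536      # hidden width is an assumption
w1, b1 = rng.normal(0, 0.02, (D_TEXT, D_HID)), np.zeros(D_HID)
w2, b2 = rng.normal(0, 0.02, (D_HID, D_HID)), np.zeros(D_HID)
w_out, b_out = rng.normal(0, 0.02, (D_HID, D_IMG)), np.zeros(D_IMG)
logit_scale = np.log(1 / 0.07)               # learnable temperature, CLIP-style init

x = rng.normal(size=(4, D_TEXT))             # a batch of 4 caption embeddings
h = block(block(x, w1, b1), w2, b2)          # two-block encoder
out = h @ w_out + b_out                      # decoder into the DINOv2 space
print(out.shape)                             # (4, 1536)
```

During training, `exp(logit_scale)` would multiply the caption-image cosine similarities before the contrastive loss, sharpening the softmax over candidates.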