Our code builds on the original VL-T5/VL-BART codebase.
# Create python environment (optional)
conda create -n MQR python=3.7
source activate MQR
# Install python dependencies
pip install -r requirements.txt
# Download language evaluation tools
https://github.com/bckim92/language-evaluation
# Download T5/BART backbone checkpoint
python download_backbones.py
# Code structure
./VL-T5/
src/
    modeling_t5.py, modeling_bart.py                    <= VL-T5/VL-BART model classes
    pretrain.py, pretrain_data.py, pretrain_model.py    <= pretraining
    vqa.py, vqa_data.py, vqa_model.py ...               <= fine-tuning on downstream tasks (e.g., VQA, GQA, NLVR2)
    multitask.py, multitask_data.py, multitask_model.py <= multitask learning on 7 downstream tasks
    param.py                                            <= (argparse) configuration
    tokenization.py                                     <= custom tokenizer
    utils.py, dist_utils.py                             <= utility functions
snap/                                                   <= stored weight checkpoints
scripts/                                                <= bash scripts for pretraining and fine-tuning

The image files (anno_images) can be found in link.
The textual files (McQR_data) can be found in link.
Image feature extraction code can be found in ./feature_extraction. All the extracted image features can also be downloaded via link.
The original dataset file with image annotations can be found in link.
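As a quick sanity check after downloading, the textual files can be inspected with standard JSON tooling. The snippet below is only a sketch with made-up field names ("image_id", "query", "rewrite" are illustrative assumptions, not the actual McQR schema):

```python
# Hedged sketch: write and re-read a tiny JSON file shaped like a query
# rewriting dataset. Field names here are illustrative assumptions only.
import json
from pathlib import Path

sample = [
    {"image_id": "anno_0001",
     "query": "what color is it",
     "rewrite": "what color is the dog in the image"},
]
path = Path("mcqr_sample.json")
path.write_text(json.dumps(sample, indent=2))

records = json.loads(path.read_text())
print(len(records), records[0]["query"])  # → 1 what color is it
```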
We host model checkpoints and features via Google Drive. We recommend using gdrive to download them.
- Download snap/ from Google Drive:

gdrive download 1_SBj4sZ0gUqfBon1gFBiNRAmfHv5w_ph --recursive

First, replace generation_utils.py in the Hugging Face transformers package installed on your machine:
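If you are unsure where the installed transformers package lives, its parent site-packages directory can be located with the standard library (path discovery only; the destination name assumes a standard pip/conda install):

```python
# Locate the site-packages directory that contains the installed
# `transformers` package, to use as the destination for generation_utils.py.
import sysconfig

site_packages = sysconfig.get_paths()["purelib"]
print(f"{site_packages}/transformers")
```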
mv generation_utils.py [your path]/transformers/

Then start fine-tuning.
# Finetuning with 4 gpus
cd VL-T5/
bash scripts/QueryRewrite_VLT5.sh 4
bash scripts/QueryRewrite_VLBart.sh 4

Please cite our paper if you use the dataset and model in your work: