This project implements a vision-language model that combines a vision encoder trained on CIFAR-10 with a text decoder trained on the ELI5 dataset. The final vision-language model is fine-tuned on a visual instruction tuning dataset to generate answers conditioned on images and questions.
The goal of the project is to design a model that surpasses the baseline perplexity of 22.1 on the visual instruction tuning test set provided in the AI504 course assignment. This work was developed for the course AI504 – Programming for AI.
Vision-language models jointly process images and text to generate text conditioned on visual inputs. In this project, the model is trained in three steps:
1. Vision Encoder – trained on the CIFAR-10 dataset to extract image features.
2. Text Decoder – trained on the ELI5 dataset to learn text generation.
3. Vision-Language Model – combines the vision encoder with the text decoder and fine-tunes the combined model on the visual instruction tuning dataset.
CIFAR-10
- Used to train the vision encoder.
- 60,000 32×32 images in 10 classes.
- The standard training/validation split is used with the provided data loaders.
ELI5
- Used to train the text decoder.
- Only the answer text is used.
- Tokenized with the GPT-2 tokenizer at a fixed sequence length of 128.
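The fixed-length preprocessing above amounts to truncating long answers and padding short ones to 128 tokens. A minimal pure-Python sketch of that behavior (the actual pipeline uses the Hugging Face GPT-2 tokenizer with `truncation=True` and `padding="max_length"`; `pad_id` here is an assumption, since GPT-2 has no pad token by default and one, often the EOS token, must be assigned):

```python
def to_fixed_length(token_ids, max_length=128, pad_id=50256):
    """Truncate or pad a list of token ids to exactly max_length.

    pad_id=50256 (GPT-2's EOS id) is an assumed choice of pad token.
    """
    ids = token_ids[:max_length]                    # truncate long sequences
    ids = ids + [pad_id] * (max_length - len(ids))  # pad short sequences
    return ids

short = to_fixed_length([1, 2, 3], max_length=5)        # padded to length 5
long = to_fixed_length(list(range(10)), max_length=5)   # truncated to length 5
```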
Visual Instruction Tuning Dataset
- Used to fine-tune the combined vision-language model.
- Contains 1,020 samples of images with corresponding question-answer pairs.
- Preprocessing and dataloader code is provided in studentID.py.
- A special token is added to the tokenizer for image embeddings.
- Important: test sets are never used during training.
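Adding a special image token means the vocabulary grows by one, so the decoder's token-embedding matrix must grow to match. A sketch of that resizing, assuming the GPT-2 vocabulary size of 50257 (in Hugging Face transformers this is what `tokenizer.add_special_tokens` followed by `model.resize_token_embeddings` does):

```python
import torch
import torch.nn as nn

vocab_size, hidden = 50257, 768          # GPT-2 vocabulary and hidden size
emb = nn.Embedding(vocab_size, hidden)   # stands in for GPT-2's token embeddings

# After adding one special token (e.g. "<image>"), rebuild the embedding
# table one row larger, keeping the original rows intact.
new_emb = nn.Embedding(vocab_size + 1, hidden)
new_emb.weight.data[:vocab_size] = emb.weight.data  # preserve existing rows
image_token_id = vocab_size                          # id of the new token
```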
Vision Encoder
- Based on ResNet18.
- Output dimension: 768 (hidden size).
- Trained with cross-entropy loss for CIFAR-10 classification.
Text Decoder
- GPT-2 language model.
- Trained with cross-entropy loss for next-token prediction.
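Next-token prediction means the logits at position t are scored against the token at position t+1, i.e. the targets are the inputs shifted left by one. A sketch of that shifted cross-entropy (this is the standard GPT-2-style objective; the tensors here are random stand-ins for model outputs):

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 6, 50257
logits = torch.randn(batch, seq_len, vocab)              # decoder outputs
input_ids = torch.randint(0, vocab, (batch, seq_len))    # token ids

shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
shift_labels = input_ids[:, 1:]    # targets: the next token at each position
loss = F.cross_entropy(shift_logits.reshape(-1, vocab),
                       shift_labels.reshape(-1))
```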
Vision-Language Model
- Maps vision encoder outputs to token embeddings via a linear layer (img_to_tokens).
- Concatenates the image token embeddings with the text embeddings to form the GPT-2 input.
- Fine-tuned on the visual instruction tuning dataset with cross-entropy loss.
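The fusion step above can be sketched as follows: image features pass through `img_to_tokens` into the decoder's embedding space and are prepended to the text embeddings. The number of image tokens (one here) and exact layer shapes are assumptions based on the 768 hidden size; the course code may differ in detail:

```python
import torch
import torch.nn as nn

hidden, n_img_tokens, seq_len, batch = 768, 1, 128, 4

# Linear map from vision-encoder features to decoder token embeddings.
img_to_tokens = nn.Linear(768, n_img_tokens * hidden)

img_features = torch.randn(batch, 768)            # vision encoder output
img_embeds = img_to_tokens(img_features).view(batch, n_img_tokens, hidden)

text_embeds = torch.randn(batch, seq_len, hidden)  # GPT-2 token embeddings
inputs_embeds = torch.cat([img_embeds, text_embeds], dim=1)
# inputs_embeds has shape (batch, n_img_tokens + seq_len, hidden) and would
# be fed to GPT-2 through its inputs_embeds argument.
```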
- Vision Encoder: Adam optimizer, learning rate 1e-3, batch size 128, 2 epochs.
- Text Decoder: Adam optimizer, learning rate 5e-5, batch size 32, 1 epoch.
- Vision-Language Model: Adam optimizer, learning rate 1e-4, batch size 16, 2 epochs.
- Seed fixed to 0 for reproducibility.
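One common way to fix seed 0 across the libraries involved (the exact set of calls used by the project is an assumption; the cuDNN flags are an extra step often needed for full determinism on GPU):

```python
import random
import numpy as np
import torch

def set_seed(seed=0):
    """Seed Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)          # no-op without a GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(0)
```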
The trained model achieves:
- Perplexity: 5
- Test set size: 20 samples
- Logits shape: (20, 32, 50257)
- Logits are saved as Vlm.npy (float16).
- The model surpasses the baseline perplexity of 22.1 provided in the course assignment.
- Raw logits are the unnormalized prediction scores for each token in the vocabulary.
- Shape: (num_samples, sequence_length, vocab_size) → (20, 32, 50257).
- Stored as float16 for efficient upload and use with the provided evaluation script.
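A sketch of how perplexity can be recovered from saved float16 logits of shape (num_samples, sequence_length, vocab_size); the label alignment and masking details here are assumptions, and the course's test_vlm.py defines the official procedure:

```python
import numpy as np

def perplexity(logits, labels):
    """Mean perplexity from raw logits (N, T, V) and target ids (N, T)."""
    logits = logits.astype(np.float32)                 # upcast for stability
    logits -= logits.max(axis=-1, keepdims=True)       # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, labels[..., None], axis=-1).mean()
    return float(np.exp(nll))

# Random stand-ins with the shapes used in this project.
logits = np.random.randn(20, 32, 50257).astype(np.float16)
labels = np.random.randint(0, 50257, size=(20, 32))
np.save("Vlm.npy", logits)   # saved as float16, as required
```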
.
├── Vlm.py # Main script: preprocess, train models, generate logits
├── Vlm.ipynb # Google Colab notebook for running the script
├── test_vlm.py # Evaluation script provided by the course
└── README.md
File description:
Vlm.py
Main script that:
- loads and preprocesses the datasets
- trains the vision encoder, text decoder, and vision-language model
- generates logits for the test set
- saves the logits as a .npy file
The project requires the following versions:
Python 3.12.12
PyTorch 2.9.0+cu126
datasets 3.1.0
transformers 4.46.2
These versions ensure reproducibility within the Google Colab Free environment.
- Transformer-based text generation with GPT-2
- Vision encoding with ResNet18
- Vision-language fusion via learned token mapping
- Autoregressive text generation conditioned on images
- Perplexity evaluation for model assessment
- PyTorch training pipelines with fixed seed for reproducibility
Bushra Monika Hossain
Graduate School of AI, KAIST