This project implements a vision-language model that combines a vision encoder trained on CIFAR-10 with a text decoder trained on the ELI5 dataset. The final vision-language model is fine-tuned on a visual instruction tuning dataset to generate answers conditioned on images and questions.
The goal of the project is to design a model that surpasses the baseline perplexity of 22.1 on the visual instruction tuning test set provided in the AI504 course assignment. This work was developed for the course AI504 – Programming for AI.
Vision-language models jointly process images and text to generate text conditioned on visual inputs. In this project, the model is trained in three steps:
1. Vision Encoder – trained on the CIFAR-10 dataset to extract image features.
2. Text Decoder – trained on the ELI5 dataset to learn text generation.
3. Vision-Language Model – combines the vision encoder with the text decoder and fine-tunes the combined model on the visual instruction tuning dataset.
CIFAR-10
- Used to train the vision encoder.
- 60,000 32×32 images in 10 classes.
- The standard training/validation split is used with the provided data loaders.
ELI5
- Used to train the text decoder.
- Only the answer text is used.
- Tokenized with the GPT-2 tokenizer at a fixed sequence length of 128.
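The fixed-length preprocessing above amounts to truncating long answers and padding short ones to 128 tokens. A minimal pure-Python sketch of that behavior (the actual pipeline uses the Hugging Face GPT-2 tokenizer with `truncation=True` and `padding="max_length"`; `pad_id` here is an assumption, since GPT-2 has no pad token by default and one, often the EOS token, must be assigned):

```python
def to_fixed_length(token_ids, max_length=128, pad_id=50256):
    """Truncate or pad a list of token ids to exactly max_length.

    pad_id=50256 (GPT-2's EOS id) is an assumed choice of pad token.
    """
    ids = token_ids[:max_length]                    # truncate long sequences
    ids = ids + [pad_id] * (max_length - len(ids))  # pad short sequences
    return ids

short = to_fixed_length([1, 2, 3], max_length=5)        # padded to length 5
long = to_fixed_length(list(range(10)), max_length=5)   # truncated to length 5
```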
Visual Instruction Tuning Dataset
- Used to fine-tune the combined vision-language model.
- Contains 1,020 samples of images with corresponding question-answer pairs.
- Preprocessing and dataloader code is provided in studentID.py.
- A special token is added to the tokenizer for image embeddings.
- Important: test sets are never used during training.
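Adding a special image token means the vocabulary grows by one, so the decoder's token-embedding matrix must grow to match. A sketch of that resizing, assuming the GPT-2 vocabulary size of 50257 (in Hugging Face transformers this is what `tokenizer.add_special_tokens` followed by `model.resize_token_embeddings` does):

```python
import torch
import torch.nn as nn

vocab_size, hidden = 50257, 768          # GPT-2 vocabulary and hidden size
emb = nn.Embedding(vocab_size, hidden)   # stands in for GPT-2's token embeddings

# After adding one special token (e.g. "<image>"), rebuild the embedding
# table one row larger, keeping the original rows intact.
new_emb = nn.Embedding(vocab_size + 1, hidden)
new_emb.weight.data[:vocab_size] = emb.weight.data  # preserve existing rows
image_token_id = vocab_size                          # id of the new token
```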
Vision Encoder
- Based on ResNet18.
- Output dimension: 768 (hidden size).
- Trained with cross-entropy loss for CIFAR-10 classification.
Text Decoder
- GPT-2 language model.
- Trained with cross-entropy loss for next-token prediction.
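Next-token prediction means the logits at position t are scored against the token at position t+1, i.e. the targets are the inputs shifted left by one. A sketch of that shifted cross-entropy (this is the standard GPT-2-style objective; the tensors here are random stand-ins for model outputs):

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 6, 50257
logits = torch.randn(batch, seq_len, vocab)              # decoder outputs
input_ids = torch.randint(0, vocab, (batch, seq_len))    # token ids

shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
shift_labels = input_ids[:, 1:]    # targets: the next token at each position
loss = F.cross_entropy(shift_logits.reshape(-1, vocab),
                       shift_labels.reshape(-1))
```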
Vision-Language Model
- Maps vision encoder outputs to token embeddings via a linear layer (img_to_tokens).
- Concatenates the image token embeddings with the text embeddings to form the GPT-2 input.
- Fine-tuned on the visual instruction tuning dataset with cross-entropy loss.
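The fusion step above can be sketched as follows: image features pass through `img_to_tokens` into the decoder's embedding space and are prepended to the text embeddings. The number of image tokens (one here) and exact layer shapes are assumptions based on the 768 hidden size; the course code may differ in detail:

```python
import torch
import torch.nn as nn

hidden, n_img_tokens, seq_len, batch = 768, 1, 128, 4

# Linear map from vision-encoder features to decoder token embeddings.
img_to_tokens = nn.Linear(768, n_img_tokens * hidden)

img_features = torch.randn(batch, 768)            # vision encoder output
img_embeds = img_to_tokens(img_features).view(batch, n_img_tokens, hidden)

text_embeds = torch.randn(batch, seq_len, hidden)  # GPT-2 token embeddings
inputs_embeds = torch.cat([img_embeds, text_embeds], dim=1)
# inputs_embeds has shape (batch, n_img_tokens + seq_len, hidden) and would
# be fed to GPT-2 through its inputs_embeds argument.
```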
- Vision Encoder: Adam optimizer, learning rate 1e-3, batch size 128, 2 epochs.
- Text Decoder: Adam optimizer, learning rate 5e-5, batch size 32, 1 epoch.
- Vision-Language Model: Adam optimizer, learning rate 1e-4, batch size 16, 2 epochs.
- Seed fixed to 0 for reproducibility.
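One common way to fix seed 0 across the libraries involved (the exact set of calls used by the project is an assumption; the cuDNN flags are an extra step often needed for full determinism on GPU):

```python
import random
import numpy as np
import torch

def set_seed(seed=0):
    """Seed Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)          # no-op without a GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(0)
```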
The trained model achieves:
- Perplexity: 5
- Test set size: 20 samples
- Logits shape: (20, 32, 50257)
- Logits are saved as Vlm.npy (float16).
- The model surpasses the baseline perplexity of 22.1 provided in the course assignment.
- Raw logits are the unnormalized prediction scores for each token in the vocabulary.
- Shape: (num_samples, sequence_length, vocab_size) → (20, 32, 50257).
- Stored as float16 for efficient upload and use with the provided evaluation script.
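A sketch of how perplexity can be recovered from saved float16 logits of shape (num_samples, sequence_length, vocab_size); the label alignment and masking details here are assumptions, and the course's test_vlm.py defines the official procedure:

```python
import numpy as np

def perplexity(logits, labels):
    """Mean perplexity from raw logits (N, T, V) and target ids (N, T)."""
    logits = logits.astype(np.float32)                 # upcast for stability
    logits -= logits.max(axis=-1, keepdims=True)       # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, labels[..., None], axis=-1).mean()
    return float(np.exp(nll))

# Random stand-ins with the shapes used in this project.
logits = np.random.randn(20, 32, 50257).astype(np.float16)
labels = np.random.randint(0, 50257, size=(20, 32))
np.save("Vlm.npy", logits)   # saved as float16, as required
```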
.
├── Vlm.py # Main script: preprocess, train models, generate logits
├── Vlm.ipynb # Google Colab notebook for running the script
├── test_vlm.py # Evaluation script provided by the course
└── README.md
File description:
Vlm.py
Main script that:
- loads and preprocesses the datasets
- trains the vision encoder, text decoder, and vision-language model
- generates logits for the test set
- saves the logits as a .npy file
The project requires the following versions:
Python 3.12.12
PyTorch 2.9.0+cu126
datasets 3.1.0
transformers 4.46.2
These versions ensure reproducibility within the Google Colab Free environment.
- Transformer-based text generation with GPT-2
- Vision encoding with ResNet18
- Vision-language fusion via learned token mapping
- Autoregressive text generation conditioned on images
- Perplexity evaluation for model assessment
- PyTorch training pipelines with fixed seed for reproducibility
Bushra Monika Hossain
Graduate School of AI, KAIST