
Vision-Language Model (VLM) Project

This project implements a vision-language model that combines a vision encoder trained on CIFAR-10 with a text decoder trained on the ELI5 dataset. The final vision-language model is fine-tuned on a visual instruction tuning dataset to generate answers conditioned on images and questions.

The goal of the project is to surpass the baseline perplexity of 22.1 on the visual instruction tuning test set, as specified in the AI504 course assignment. This work was developed for the course AI504 – Programming for AI.


Project Overview

Vision-language models jointly process images and text to generate text conditioned on visual inputs. In this project, the model is trained in three steps:

1. Vision Encoder – trained on the CIFAR-10 dataset to extract image features.

2. Text Decoder – trained on the ELI5 dataset to learn text generation.

3. Vision-Language Model – combines the vision encoder with the text decoder and fine-tunes the result on the visual instruction tuning dataset.

The model is evaluated by calculating perplexity on the visual instruction test set.
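Perplexity is the exponential of the mean token-level cross-entropy. A minimal PyTorch sketch (function name and toy shapes are illustrative, not from the repository):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean token-level cross-entropy).

    logits:  (num_samples, seq_len, vocab_size)
    targets: (num_samples, seq_len) integer token ids
    """
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (N*T, V)
        targets.reshape(-1),
    )
    return torch.exp(loss).item()

# Sanity check: uniform logits over V tokens give perplexity == V.
V = 8
logits = torch.zeros(4, 5, V)            # uniform distribution over 8 tokens
targets = torch.randint(0, V, (4, 5))
print(round(perplexity(logits, targets)))  # → 8
```

A uniform distribution yields cross-entropy log V, so exp(log V) = V, which is why the toy check prints the vocabulary size.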

Dataset

CIFAR-10

Used to train the vision encoder.

60,000 images of 32×32 pixels in 10 classes.

The standard training/validation split is used with the provided data loaders.

ELI5

Used to train the text decoder.

Only the answer text is used.

Tokenized using GPT-2 tokenizer with a fixed sequence length of 128.
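Fixing every sequence to 128 tokens means truncating long answers and padding short ones. A generic sketch of that step (GPT-2 has no pad token, so its end-of-text id 50256 is reused here, a common convention and an assumption about this repo):

```python
import torch

def to_fixed_length(ids, max_len=128, pad_id=50256):
    """Truncate or right-pad a token-id list to exactly max_len.

    pad_id 50256 is GPT-2's <|endoftext|> id, commonly reused as the pad
    token since GPT-2 ships without one (an assumption in this sketch).
    """
    ids = ids[:max_len]
    attention_mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))
    return torch.tensor(ids), torch.tensor(attention_mask)

ids, mask = to_fixed_length([10, 20, 30], max_len=8)
print(ids.tolist())   # → [10, 20, 30, 50256, 50256, 50256, 50256, 50256]
print(mask.tolist())  # → [1, 1, 1, 0, 0, 0, 0, 0]
```

In practice the GPT-2 tokenizer's `padding="max_length"` / `truncation=True` options do the same thing; the sketch just makes the mechanics explicit.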

Visual Instruction Tuning Dataset

Used to fine-tune the combined vision-language model.

Contains 1,020 samples with images and corresponding question-answer pairs.

Preprocessing and dataloader code is provided in studentID.py.

A special token is added to the tokenizer to mark image embeddings.

  • Important: Test sets are never used during training.

Model Architecture

Vision Encoder

  • Based on ResNet18.

  • Output dimension: 768 (hidden size).

  • Trained with cross-entropy loss for CIFAR-10 classification.

Text Decoder

  • GPT-2 language model.

  • Token embeddings are resized to accommodate the added special image token.

  • Trained using cross-entropy loss for next-token prediction.
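Adding one special token grows the vocabulary by one, so the embedding matrix must be resized to match. A sketch with a tiny randomly initialized GPT-2 (the project uses the pretrained "gpt2" checkpoint; the small config here only keeps the example light):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random-weight GPT-2 for illustration; layer/head/width values are
# assumptions, not the project's configuration.
config = GPT2Config(vocab_size=50257, n_layer=2, n_head=4, n_embd=64)
decoder = GPT2LMHeadModel(config)

# One special token for image embeddings -> vocabulary of 50257 + 1.
decoder.resize_token_embeddings(50257 + 1)

ids = torch.randint(0, 50258, (2, 16))
out = decoder(input_ids=ids, labels=ids)  # computes shifted next-token cross-entropy
print(out.logits.shape)                   # → torch.Size([2, 16, 50258])
```

Passing `labels=input_ids` makes the model shift the targets internally, which is exactly the next-token cross-entropy objective described above.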

Vision-Language Model

  • Maps vision encoder outputs to token embeddings via a linear layer (img_to_tokens).

  • Concatenates image token embeddings with text embeddings for GPT-2 input.

  • Fine-tuned on the visual instruction dataset using cross-entropy loss.
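The fusion step can be sketched as a small module: a linear layer maps the image feature to one or more pseudo-token embeddings, which are prepended to the text embeddings before being fed to GPT-2 via `inputs_embeds`. The number of image tokens (4 here) and the class name are assumptions; only the `img_to_tokens` name comes from the README:

```python
import torch
import torch.nn as nn

class VisionLanguageFusion(nn.Module):
    """Sketch of the fusion step: project the image feature to pseudo-token
    embeddings and prepend them to the text embeddings."""

    def __init__(self, img_dim=768, hidden=768, n_img_tokens=4, vocab=50258):
        super().__init__()
        self.n_img_tokens = n_img_tokens
        self.img_to_tokens = nn.Linear(img_dim, n_img_tokens * hidden)
        self.text_embed = nn.Embedding(vocab, hidden)

    def forward(self, img_feat, input_ids):
        B = img_feat.size(0)
        img_tok = self.img_to_tokens(img_feat).view(B, self.n_img_tokens, -1)
        txt_tok = self.text_embed(input_ids)
        # The concatenated sequence would go to GPT-2 via inputs_embeds=...
        return torch.cat([img_tok, txt_tok], dim=1)

fusion = VisionLanguageFusion()
embeds = fusion(torch.randn(2, 768), torch.randint(0, 50258, (2, 28)))
print(embeds.shape)  # → torch.Size([2, 32, 768])
```

During fine-tuning, the loss is computed only over the text positions; the image-token positions carry no next-token targets.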


Training Setup

  • Vision Encoder: Adam optimizer, learning rate 1e-3, batch size 128, 2 epochs.

  • Text Decoder: Adam optimizer, learning rate 5e-5, batch size 32, 1 epoch.

  • Vision-Language Model: Adam optimizer, learning rate 1e-4, batch size 16, 2 epochs.

  • Seed fixed to 0 for reproducibility.
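Fixing the seed means seeding every random number generator the pipeline touches. A typical helper (the function name is illustrative; only seed 0 comes from the setup above):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Seed Python, NumPy, and PyTorch (CPU + all CUDA devices)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when CUDA is unavailable

set_seed(0)
a = torch.rand(3)
set_seed(0)
b = torch.rand(3)
print(torch.equal(a, b))  # → True
```

Fully deterministic GPU runs may additionally need `torch.use_deterministic_algorithms(True)`, at some speed cost.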


Results

The trained model achieves:

Perplexity: 5

  • Test set size: 20 samples

  • Logits shape: (20, 32, 50257)

  • Logits are saved as Vlm.npy (float16).

  • The model surpasses the baseline perplexity of 22.1 provided in the course assignment.


Logits Generation

  • Raw logits represent the unnormalized prediction scores for each token in the vocabulary.

  • Shape: (num_samples, sequence_length, vocab_size) → (20, 32, 50257).

  • Stored as float16 for efficient upload and usage with the provided evaluation script.
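Saving and reloading the logits array is a one-liner each way with NumPy; casting to float16 halves the file size relative to float32. A sketch with random stand-in values of the stated shape (a temporary path is used here instead of the repo's Vlm.npy):

```python
import os
import tempfile
import numpy as np

# Stand-in logits with the shapes stated above: 20 samples, 32 positions,
# 50257-token vocabulary.
logits = np.random.randn(20, 32, 50257).astype(np.float16)

path = os.path.join(tempfile.gettempdir(), "Vlm.npy")
np.save(path, logits)               # float16 halves file size vs float32
loaded = np.load(path)
print(loaded.shape, loaded.dtype)   # → (20, 32, 50257) float16
```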

Repository Structure

.
├── Vlm.py       # Main script: preprocess, train models, generate logits
├── Vlm.ipynb    # Google Colab notebook for running the script
├── test_vlm.py  # Evaluation script provided by the course
└── README.md

File description:

Vlm.py

Main script that:

  • loads and preprocesses the datasets
  • trains the vision encoder, the text decoder, and the combined vision-language model
  • generates logits for the test set
  • saves the logits as a .npy file

Environment Requirements

The project requires the following versions:

Python 3.12.12

PyTorch 2.9.0+cu126

datasets 3.1.0

transformers 4.46.2

These versions ensure reproducibility within the Google Colab Free environment.


Key Concepts Demonstrated

  • Transformer-based text generation with GPT-2

  • Vision encoding with ResNet18

  • Vision-language fusion via learned token mapping

  • Autoregressive text generation conditioned on images

  • Perplexity evaluation for model assessment

  • PyTorch training pipelines with fixed seed for reproducibility


Author

Bushra Monika Hossain

Graduate School of AI KAIST

About

Vision-language model combining a ResNet18 vision encoder with a GPT-2 decoder, fine-tuned for visual instruction tuning and image-conditioned text generation.
