Image Caption Generator

An image captioning model with support for different styles, inspired by Florence-2. It uses facebook/dinov3 as the image encoder and leverages intfloat/e5-base-v2 for the tokenizer and embeddings.

Its decoder follows a modern, Qwen 3-inspired architecture to generate high-quality captions. Built in pure PyTorch for learning purposes, the model uses a self-attention-only mechanism (no cross-attention) and supports KV caching for efficient inference. Training follows various Hugging Face recipes, and the project includes a FastAPI backend for inference and a Streamlit frontend for interactive use.
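KV caching stores the key/value projections of already-generated tokens so each new decoding step only computes attention for the newest token instead of re-encoding the whole prefix. A minimal single-head sketch of the idea (in NumPy for brevity; the actual implementation in transformer.py will differ):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_step(q, k_new, v_new, cache=None):
    """One decoding step: append the new key/value pair to the cache,
    then let the single new query attend over the full history."""
    if cache is None:
        k, v = k_new, v_new
    else:
        k = np.concatenate([cache[0], k_new], axis=0)  # (seq, dim)
        v = np.concatenate([cache[1], v_new], axis=0)
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (1, seq)
    out = softmax(scores) @ v                # (1, dim)
    return out, (k, v)
```

With the cache, generating each token costs attention over the current sequence length rather than re-running the decoder on the entire prefix.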

The model was trained on custom MS COCO captions generated by a local LLM; these are included in the data directory.
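Given the directory layout (one folder per caption style, one CSV per split), the data can be loaded with a small helper. `load_split` below is illustrative rather than part of the repo, and the column names inside the CSVs are not specified here, so inspect them after loading:

```python
from pathlib import Path

import pandas as pd  # third-party: pip install pandas

def load_split(data_dir, style, split):
    """Load one captions CSV, e.g. data/concise/coco_train.csv.
    style: 'concise' | 'narrative' | 'descriptive'
    split: 'train' | 'valid' | 'test'"""
    path = Path(data_dir) / style / f"coco_{split}.csv"
    return pd.read_csv(path)
```

For example, `load_split("data", "concise", "train")` returns the concise-style training captions as a DataFrame.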

🧮 Requirements

  • Python 3.8+
  • A GPU is strongly recommended for inference at reasonable speed
  • (Optional) Virtual environment to isolate dependencies

🔧 Installation

  1. Clone the repository:

```shell
git clone https://github.com/17xr/ImageCaptionGenerator.git
cd ImageCaptionGenerator
```

  2. Install the Python dependencies:

```shell
pip install --no-cache-dir -r requirements.txt
```

  3. (Optional) If you have a CUDA-capable GPU, ensure the PyTorch build matching your CUDA version is installed.

▶️ Running the Application

1. Start the Backend

In the project root, run:

```shell
cd backend
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

This starts the HTTP API server (using Uvicorn) on port 8000.

2. Start the Frontend

In a separate terminal window, run:

```shell
cd frontend
streamlit run src/main.py
```

This launches the Streamlit UI in your browser.

3. Use the Application

  • In the Streamlit UI, upload an image (or select a test image) and click "Generate Caption".
  • The frontend sends the image to the backend, which runs the model and returns several captions.
  • Captions are displayed in the UI by style.
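The backend can also be called programmatically, following the same flow as the UI. The sketch below assumes a POST endpoint such as `/caption` accepting a multipart image upload and returning JSON; check `backend/app/main.py` for the actual route name and response schema:

```python
import requests  # third-party: pip install requests

def caption_image(path, url="http://localhost:8000/caption"):
    """Send an image file to the backend and return the parsed JSON
    response, assumed to map caption styles to generated captions."""
    with open(path, "rb") as f:
        resp = requests.post(url, files={"file": f})
    resp.raise_for_status()
    return resp.json()
```

Run it with the backend started first, e.g. `caption_image("test.jpg")`.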

📁 Project Structure

```
.
├── backend/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── dependencies.py
│   │   └── config.py
│   ├── architecture/
│   │   ├── __init__.py
│   │   └── transformer.py
│   └── utils/
│       ├── __init__.py
│       └── utils.py
├── data/
│   ├── concise/
│   │   ├── coco_train.csv
│   │   ├── coco_valid.csv
│   │   └── coco_test.csv
│   ├── narrative/
│   │   ├── coco_train.csv
│   │   ├── coco_valid.csv
│   │   └── coco_test.csv
│   └── descriptive/
│       ├── coco_train.csv
│       ├── coco_valid.csv
│       └── coco_test.csv
├── frontend/
│   └── src/
│       └── main.py
├── models/
│   └── weights.pt
├── notebooks/
│   └── training.ipynb
├── LICENSE
├── README.md
└── requirements.txt
```

📄 License

This project is released under the MIT License. See the LICENSE file for details.
