Image Caption Generator

An image captioning model with support for different styles, inspired by Florence-2. It uses facebook/dinov3 as the image encoder and leverages intfloat/e5-base-v2 for the tokenizer and embeddings.

Its decoder follows a modern, Qwen 3-inspired architecture to generate high-quality captions. Built in pure PyTorch for learning purposes, the model uses a self-attention-only mechanism (no cross-attention) and supports KV caching for efficient inference. Training follows various Hugging Face recipes, and the project includes a FastAPI backend for inference and a Streamlit frontend for interactive use.
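KV caching stores the key/value projections of already-generated tokens so each new decoding step only computes attention for the newest token instead of re-encoding the whole prefix. A minimal single-head sketch of the idea (in NumPy for brevity; the actual implementation in transformer.py will differ):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_step(q, k_new, v_new, cache=None):
    """One decoding step: append the new key/value pair to the cache,
    then let the single new query attend over the full history."""
    if cache is None:
        k, v = k_new, v_new
    else:
        k = np.concatenate([cache[0], k_new], axis=0)  # (seq, dim)
        v = np.concatenate([cache[1], v_new], axis=0)
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (1, seq)
    out = softmax(scores) @ v                # (1, dim)
    return out, (k, v)
```

With the cache, generating each token costs attention over the current sequence length rather than re-running the decoder on the entire prefix.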

The model was trained on custom MS COCO captions generated by a local LLM; these are included in the data directory.
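Given the directory layout (one folder per caption style, one CSV per split), the data can be loaded with a small helper. `load_split` below is illustrative rather than part of the repo, and the column names inside the CSVs are not specified here, so inspect them after loading:

```python
from pathlib import Path

import pandas as pd  # third-party: pip install pandas

def load_split(data_dir, style, split):
    """Load one captions CSV, e.g. data/concise/coco_train.csv.
    style: 'concise' | 'narrative' | 'descriptive'
    split: 'train' | 'valid' | 'test'"""
    path = Path(data_dir) / style / f"coco_{split}.csv"
    return pd.read_csv(path)
```

For example, `load_split("data", "concise", "train")` returns the concise-style training captions as a DataFrame.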

🧮 Requirements

  • Python 3.8+
  • A GPU is strongly recommended for inference at reasonable speed
  • (Optional) Virtual environment to isolate dependencies

🔧 Installation

  1. Clone the repository:

```shell
git clone https://github.com/17xr/ImageCaptionGenerator.git
cd ImageCaptionGenerator
```

  2. Install the Python dependencies:

```shell
pip install --no-cache-dir -r requirements.txt
```

  3. (Optional) If you have a CUDA-capable GPU, ensure the PyTorch build matching your CUDA version is installed.

▶️ Running the Application

1. Start the Backend

In the project root, run:

```shell
cd backend
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

This starts the HTTP API server (using Uvicorn) on port 8000.

2. Start the Frontend

In a separate terminal window, run:

```shell
cd frontend
streamlit run src/main.py
```

This launches the Streamlit UI in your browser.

3. Use the Application

  • In the Streamlit UI, upload an image (or select a test image) and click "Generate Caption".
  • The frontend sends the image to the backend, which runs the model and returns several captions.
  • Captions are displayed in the UI by style.
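The backend can also be called programmatically, following the same flow as the UI. The sketch below assumes a POST endpoint such as `/caption` accepting a multipart image upload and returning JSON; check `backend/app/main.py` for the actual route name and response schema:

```python
import requests  # third-party: pip install requests

def caption_image(path, url="http://localhost:8000/caption"):
    """Send an image file to the backend and return the parsed JSON
    response, assumed to map caption styles to generated captions."""
    with open(path, "rb") as f:
        resp = requests.post(url, files={"file": f})
    resp.raise_for_status()
    return resp.json()
```

Run it with the backend started first, e.g. `caption_image("test.jpg")`.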

📁 Project Structure

```
.
├── backend/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── dependencies.py
│   │   └── config.py
│   ├── architecture/
│   │   ├── __init__.py
│   │   └── transformer.py
│   └── utils/
│       ├── __init__.py
│       └── utils.py
├── data/
│   ├── concise/
│   │   ├── coco_train.csv
│   │   ├── coco_valid.csv
│   │   └── coco_test.csv
│   ├── narrative/
│   │   ├── coco_train.csv
│   │   ├── coco_valid.csv
│   │   └── coco_test.csv
│   └── descriptive/
│       ├── coco_train.csv
│       ├── coco_valid.csv
│       └── coco_test.csv
├── frontend/
│   └── src/
│       └── main.py
├── models/
│   └── weights.pt
├── notebooks/
│   └── training.ipynb
├── LICENSE
├── README.md
└── requirements.txt
```

📄 License

This project is released under the MIT License. See the LICENSE file for details.
