An image captioning model that supports multiple caption styles, inspired by Florence-2. It uses facebook/dinov3 as the image encoder and intfloat/e5-base-v2 for the tokenizer and embeddings.
The decoder follows a modern, Qwen 3-inspired architecture. Written in pure PyTorch for learning purposes, the model uses a self-attention-only mechanism and supports KV caching for efficient inference. Training followed various Hugging Face training recipes. The project also includes a FastAPI backend for inference and a Streamlit frontend for interactive use.
The model was trained on custom MS COCO captions generated by a local LLM; these captions are included in the data directory.
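To illustrate the KV caching mentioned in the description, here is a minimal sketch in plain Python rather than PyTorch. It is not the project's implementation: it uses a single attention head, skips the learned query/key/value projections, and the names `KVCache`, `attend`, and `causal_attention_full` are made up for this example. It only shows that attending over cached keys and values gives the same result as recomputing causal attention from scratch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


class KVCache:
    """Hypothetical cache holding the keys and values of all tokens seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Append the new token's key/value, then attend over the whole cache."""
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def causal_attention_full(queries, keys, values):
    """Reference path: rebuild the key/value prefix for every position."""
    return [
        attend(queries[t], keys[: t + 1], values[: t + 1])
        for t in range(len(queries))
    ]


# Toy sequence of three tokens with 2-dimensional vectors.
# Assumption for brevity: queries, keys and values are the same vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
incremental = [cache.step(tok, tok, tok) for tok in tokens]
full = causal_attention_full(tokens, tokens, tokens)

# The cached path reuses stored keys/values instead of rebuilding the prefix.
matches = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for row_a, row_b in zip(incremental, full)
    for a, b in zip(row_a, row_b)
)
print("cached output matches full recomputation:", matches)
```

In a real decoder the keys and values come from learned projections of the hidden states, so caching them avoids re-running those projections for every previously generated token.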
- Python 3.8+
- A GPU is strongly recommended for inference at reasonable speed
- (Optional) Virtual environment to isolate dependencies
- Clone the repository:
git clone https://github.com/17xr/ImageCaptionGenerator.git
cd ImageCaptionGenerator
- Install the Python dependencies:
pip install --no-cache-dir -r requirements.txt
- (Optional) If you have a CUDA-capable GPU, make sure the installed PyTorch build matches your CUDA version.
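One quick way to check the optional GPU step is to ask PyTorch which device it can see. The helper name `pick_device` is made up for this sketch, and it falls back to the CPU when PyTorch is missing; how the project itself selects a device is not shown here.

```python
def pick_device():
    """Return "cuda" when PyTorch is installed and sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Inference device: {pick_device()}")
```

If this reports `cpu` on a machine with a GPU, the installed PyTorch build is probably a CPU-only one or targets a different CUDA version.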
In the project root, run:
cd backend
uvicorn app.main:app --host 0.0.0.0 --port 8000
This starts the HTTP API server (using Uvicorn) on port 8000.
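To confirm the server is reachable before starting the frontend, a small standard-library probe can help. This sketch assumes the backend keeps FastAPI's default interactive docs page at `/docs`; if the project disables it, point the check at any route the backend actually exposes. The function name `backend_is_up` is hypothetical.

```python
import urllib.error
import urllib.request


def backend_is_up(base_url="http://localhost:8000", timeout=2.0):
    """Return True if an HTTP server answers 200 at base_url + "/docs"."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("backend reachable:", backend_is_up())
```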
In a separate terminal window, run:
cd frontend
streamlit run src/main.py
This launches the Streamlit UI in your browser.
- In the Streamlit UI, upload an image (or select a test image) and click "Generate Caption".
- The frontend sends the image to the backend, which runs the model and returns several captions.
- Captions are displayed in the UI by style.
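The same request the frontend makes can be sketched without Streamlit, using only the standard library. Everything specific to the API here is an assumption: the `/caption` route, the `file` form field, and a JSON response body are placeholders, so check `backend/app/main.py` for the real route and schema before using it.

```python
import json
import urllib.request
import uuid


def build_multipart(field, filename, payload, content_type="image/jpeg"):
    """Encode one file as multipart/form-data; returns (body, Content-Type value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def request_captions(image_path, url="http://localhost:8000/caption"):
    """POST an image to the (hypothetical) caption route and parse the JSON reply."""
    with open(image_path, "rb") as fh:
        # "file" is an assumed form field name, not taken from the project.
        body, content_type = build_multipart("file", "image.jpg", fh.read())
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the backend running, `request_captions("photo.jpg")` would return whatever JSON the server sends back; the exact shape of the captions (for example, one entry per style) depends on the backend.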
.
├── backend/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── dependencies.py
│   │   └── config.py
│   ├── architecture/
│   │   ├── __init__.py
│   │   └── transformer.py
│   └── utils/
│       ├── __init__.py
│       └── utils.py
├── data/
│   ├── concise/
│   │   ├── coco_train.csv
│   │   ├── coco_valid.csv
│   │   └── coco_test.csv
│   ├── narrative/
│   │   ├── coco_train.csv
│   │   ├── coco_valid.csv
│   │   └── coco_test.csv
│   └── descriptive/
│       ├── coco_train.csv
│       ├── coco_valid.csv
│       └── coco_test.csv
├── frontend/
│   └── src/
│       └── main.py
├── models/
│   └── weights.pt
├── notebooks/
│   └── training.ipynb
├── LICENSE
├── README.md
└── requirements.txt
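The data layout in the tree above follows a regular `data/<style>/coco_<split>.csv` pattern, which a small helper can encode. The helper name `caption_csv` is made up for this sketch, and it only builds paths; the column names inside the CSV files are not documented here, so reading them is left out.

```python
from pathlib import Path

# Style and split names as they appear in the project tree.
STYLES = ("concise", "narrative", "descriptive")
SPLITS = ("train", "valid", "test")


def caption_csv(style, split, root="data"):
    """Return the path <root>/<style>/coco_<split>.csv for a known style and split."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / style / f"coco_{split}.csv"


for style in STYLES:
    print(caption_csv(style, "train").as_posix())
```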
This project is released under the MIT License. See the LICENSE file for details.