A PyTorch-based image captioning application that generates descriptive captions for uploaded images using a hybrid attention-based model. Built with FastAPI for the backend and a clean web interface for easy image uploads.
- 🖼️ Web Interface: Simple, user-friendly UI for uploading images and viewing generated captions
- 🤖 Hybrid Attention Model: Pairs a CNN encoder (ResNet-50) with an attention-based decoder for accurate image descriptions
- ⚡ FastAPI Backend: RESTful API for image caption generation
- 🐳 Docker Support: Containerized setup for easy deployment
- 📦 PyTorch & Transformers: Leverages state-of-the-art deep learning frameworks
- CNN Encoder (CNNEncoderAttention): ResNet-50 backbone that extracts spatial feature maps from images
- Attention-Based Decoder: Generates captions word-by-word using attention mechanisms over image features
- Hybrid Model: Combines the encoder and decoder into a unified `HybridModelAttention` class
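The attention step at the heart of the decoder can be illustrated with a minimal, framework-free sketch: at each decoding step, the decoder scores every spatial location of the encoder's feature map and collapses the map into a single weighted context vector. This is a sketch only — the real model works on PyTorch tensors, and `softmax`/`attend` here are illustrative helpers, not the repository's API:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Collapse spatial feature vectors into one context vector,
    weighted by one attention score per spatial location."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return context, weights

# Toy example: 3 spatial locations, 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend(features, scores=[0.1, 0.1, 2.0])
```

The decoder repeats this at every word: new scores are computed from its hidden state, so the context vector shifts to different image regions as the caption unfolds.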
The pre-trained model included in this repository was trained on the Flickr8k dataset, which contains 8,000 images with 5 human-written captions per image. This dataset is widely used for benchmarking image captioning models and provides diverse, natural-language descriptions.
- Backend: FastAPI, Python 3.13
- ML Framework: PyTorch, Transformers
- Image Processing: OpenCV, Pillow
- Deployment: Docker & Docker Compose
- Frontend: HTML, CSS, JavaScript (vanilla)
```
ptorch/
├── src/
│   ├── fastapi/
│   │   └── app.py           # FastAPI application
│   └── imagecaptioning/
│       ├── model.py         # Model definitions
│       ├── inference.py     # Caption generation logic
│       ├── train.py         # Training script
│       ├── vocab.py         # Vocabulary utilities
│       └── token_utils.py   # Tokenization utilities
├── static/
│   └── index.html           # Web interface
├── Dockerfile               # Container configuration
├── docker-compose.yml       # Docker Compose setup
├── pyproject.toml           # Python dependencies
├── model_weights.pt         # Pre-trained model weights
├── vocab.json               # Vocabulary mappings
└── README.md                # This file
```
- Python 3.13+
- Docker & Docker Compose (optional)
- CUDA (optional, for GPU acceleration)
- Clone the repository:

```bash
git clone <repository-url>
cd ptorch
```

- Install dependencies:

```bash
pip install -e .
```

- Ensure you have the pre-trained model weights and vocabulary:

```bash
# model_weights.pt and vocab.json should be in the root directory
```

- Build and start the application:

```bash
docker compose up --build
```

- Open your browser and navigate to:

```
http://localhost:8000/app
```
- Start the FastAPI server:

```bash
uvicorn src.fastapi.app:app --host 0.0.0.0 --port 8000 --reload
```

- Open your browser and navigate to:

```
http://localhost:8000/app
```
Upload an image and generate a caption.
Request:

```bash
curl -X POST "http://localhost:8000/upload" \
  -F "file=@image.jpg"
```

Response:

```json
{
  "caption": "a dog sitting on a grass field"
}
```

- `GET /docs`: Interactive API documentation (Swagger UI)
- `GET /redoc`: Alternative API documentation (ReDoc)
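The same upload can be made from Python using only the standard library. This is a sketch: `caption_image` and the manual multipart encoder are illustrative helpers, not part of this repository, and the request shape simply mirrors the curl example above:

```python
import json
import urllib.request
import uuid

def build_multipart(field, filename, data, content_type="image/jpeg"):
    """Encode a single file as a multipart/form-data request body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, boundary

def caption_image(path, url="http://localhost:8000/upload"):
    """POST an image to the /upload endpoint and return the caption."""
    with open(path, "rb") as fh:
        body, boundary = build_multipart("file", path, fh.read())
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["caption"]
```

With the server running locally, `caption_image("image.jpg")` would return the caption string from the JSON response.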
The web interface provides:
- File Upload: Browse and select images
- Image Preview: See your image before generating a caption
- Caption Display: View the generated caption
- Responsive Design: Works on desktop and mobile devices
Key packages are listed in `pyproject.toml`:

- `fastapi>=0.129.0` - Web framework
- `torch>=2.10.0` - PyTorch
- `torchvision>=0.25.0` - Vision utilities
- `transformers>=5.1.0` - Pre-trained models
- `opencv-python>=4.13.0.92` - Image processing
- `uvicorn>=0.41.0` - ASGI server
- `python-multipart>=0.0.22` - File upload handling
The model uses the following configuration (in `src/fastapi/app.py`):

- Device: CPU by default (change to CUDA in `app.py` for GPU support)
- Model Type: `HybridModelAttention`
- Vocabulary: Loaded from `vocab.json`
```
PYTHONUNBUFFERED=1
PYTHONDONTWRITEBYTECODE=1
```

- Inference Speed: ~1-3 seconds per image (CPU)
- Model Size: ~100MB (ResNet-50 backbone + decoder weights)
- Supported Image Formats: JPG, PNG, GIF, BMP, TIFF
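A client can pre-filter files against the supported formats before uploading. A minimal sketch, assuming extension-based detection (`is_supported` is an illustrative helper, not part of the app):

```python
from pathlib import Path

# Extensions matching the supported formats listed above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif", ".tiff"}

def is_supported(path: str) -> bool:
    """Return True if the file extension is a supported image format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported("dog.JPG")` is `True`, while `is_supported("clip.mp4")` is `False`.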
Docker health check monitors the application:

```bash
curl -f http://localhost:8000/docs
```

To train your own caption generation model:

```bash
python -m src.imagecaptioning.train --config your_config.yaml
```

Note: The current pre-trained model was trained on the Flickr8k dataset. To replicate or improve upon this model, you can fine-tune it on the same dataset or use your own custom image-caption pairs.
For batch caption generation:

```bash
python -m src.imagecaptioning.inference --image_path /path/to/image.jpg
```

- GPU acceleration support
- Batch image processing
- Multi-language caption support
- Fine-tuning capabilities
- Advanced image preprocessing
- Confidence scores for captions
```bash
docker compose build --no-cache
```

Ensure `model_weights.pt` and `vocab.json` are in the project root directory.
The app defaults to CPU. To enable GPU, modify `device` in `src/fastapi/app.py`:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

MIT
Arsalanjdev
Contributions are welcome! Please feel free to submit a Pull Request.
For issues, questions, or feature requests, please open an issue on GitHub.
