A two-stage Human-Object Interaction (HOI) detection framework built on DETR and CLIP, supporting HICO-DET, V-COCO, and CAD120 benchmarks.
- DETR-based object detection backbone with deformable attention
- CLIP integration for vision-language alignment and zero-shot HOI detection
- Multi-GPU distributed training via PyTorch DDP (NCCL)
- Support for HICO-DET (600 HOI categories) and V-COCO (29 actions) benchmarks
- CAD120 video HOI recognition (F1 evaluation)
- Attention visualization for images and videos
- Focal loss and Hungarian set matching for training, plus optional FPN support
- Docker-ready deployment
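The feature list above mentions focal loss for training the interaction head. As a minimal sketch of the idea (this is an illustrative NumPy reimplementation of sigmoid focal loss, not the code in `ops.py`):

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary (sigmoid) focal loss: down-weights easy examples so training
    focuses on hard positives/negatives. Illustrative sketch only."""
    p = 1.0 / (1.0 + np.exp(-logits))                             # sigmoid probs
    ce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))   # per-element BCE
    p_t = targets * p + (1 - targets) * (1 - p)                   # prob of true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)       # class balancing
    return alpha_t * (1 - p_t) ** gamma * ce                      # modulated loss

logits = np.array([2.0, -1.0, 0.5])
targets = np.array([1.0, 0.0, 1.0])
print(sigmoid_focal_loss(logits, targets))
```

With `gamma=0` and `alpha=0.5` this reduces to half the plain binary cross-entropy, which is a handy sanity check.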
# Build the image
docker compose build
# Train on HICO-DET (4 GPUs)
docker compose up vthoi-train
# Evaluate
docker compose up vthoi-eval
# Single image inference
docker compose up vthoi-inference
# TensorBoard monitoring
docker compose up tensorboard
# Open http://localhost:6006

# Clone the repository
git clone https://github.com/<your-username>/vthoi.git
cd vthoi
# Create conda environment
conda create -n vthoi python=3.8 -y
conda activate vthoi
# Install dependencies
pip install -r requirements.txt
# Build deformable attention CUDA operators (requires GPU)
cd detr/models/necks/module/ops
python setup.py build install
cd ../../../../..

See docs/dataset-guide.md for detailed instructions.
cd hicodet
bash download.sh

Expected structure:
hicodet/
├── hico_20160224_det/images/{train2015,test2015}/
├── instances_train2015.json
└── instances_test2015.json
cd vcoco
bash download.sh

Download pretrained DETR checkpoints and place them under checkpoints/detr/:
| Backbone | Dataset | Path |
|---|---|---|
| ResNet-50 | HICO-DET | checkpoints/detr/hico_det/detr-r50-hicodet.pth |
| ResNet-101-DC5 | HICO-DET | checkpoints/detr/hico_det/detr-r101-dc5-hicodet.pth |
| ResNet-101 | V-COCO | checkpoints/detr/vcoco/detr-r101-vcoco.pth |
# HICO-DET with 4 GPUs
python main.py \
--world-size 4 \
--dataset hicodet \
--data-root hicodet/ \
--backbone resnet50 \
--batch-size 8 \
--pretrained checkpoints/detr/hico_det/detr-r50-hicodet.pth \
--output-dir logs/hicodet/exp1 \
--clip_model RN50 \
--epochs 20
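A note on batch size under DDP: each rank processes its own `--batch-size` mini-batch, so the global batch is the product of the two flags. A one-line sanity check (the helper name is illustrative):

```python
def effective_batch_size(world_size: int, per_gpu_batch: int) -> int:
    # Under PyTorch DDP each GPU (rank) draws its own mini-batch,
    # so the effective global batch is world_size * per_gpu_batch.
    return world_size * per_gpu_batch

# The HICO-DET command above: 4 GPUs x batch size 8
print(effective_batch_size(4, 8))  # → 32
```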
# V-COCO with 4 GPUs
python main.py \
--world-size 4 \
--dataset vcoco \
--data-root vcoco/ \
--partitions trainval test \
--backbone resnet101 \
--pretrained checkpoints/detr/vcoco/detr-r101-vcoco.pth \
--output-dir logs/vcoco/exp1 \
--clip_model RN50

# HICO-DET evaluation
python main.py \
--eval \
--dataset hicodet \
--data-root hicodet/ \
--resume logs/hicodet/exp1/best_model.pt \
--world-size 1
# V-COCO evaluation
python main.py \
--eval \
--dataset vcoco \
--data-root vcoco/ \
--resume logs/vcoco/exp1/best_model.pt \
--world-size 1

# Single image from dataset
python inference.py \
--resume checkpoints/best_model.pt \
--dataset hicodet \
--data-root hicodet/ \
--index 0 \
--draw_pic
# Custom image
python inference.py \
--resume checkpoints/best_model.pt \
--image-path /path/to/image.jpg
# Video inference
python inference.py \
--resume checkpoints/best_model.pt \
--video \
--video_dir /path/to/video

vthoi/
├── main.py # Training and evaluation entry point
├── inference.py # Inference and visualization
├── vcoco_test.py # V-COCO evaluation script
├── vthoi.py # GFIN model builder (detector + interaction head)
├── interaction_head.py # InteractionHead with CLIP encoder
├── utils.py # DataFactory, CustomisedDLE, transforms
├── ops.py # Spatial encodings, focal loss, Hungarian matching
├── Dockerfile # Docker build configuration
├── docker-compose.yml # Multi-service Docker setup
├── requirements.txt # Python dependencies
│
├── detr/ # DETR backbone (transformer-based detector)
├── CLIP/ # OpenAI CLIP (local fork)
├── pocket/ # Data loading and training utilities
├── vthoi_module/ # Base interaction head, PID decoder
├── gfin_module/ # Position encoding modules
├── add_on/ # Inference utils, attention hooks, NMS, visualization
│
├── hicodet/ # HICO-DET dataset loader and scripts
├── vcoco/ # V-COCO dataset loader and evaluation
├── cad120/ # CAD120 video HOI dataset
├── datasets/ # Priors, word embeddings, class metadata
├── experiments/ # Training shell scripts
└── docs/ # Documentation
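`ops.py` above includes Hungarian matching, which pairs predicted HOI triplets with ground-truth ones at minimum total cost. The toy sketch below makes the objective concrete with an exhaustive search (a real implementation would use something like `scipy.optimize.linear_sum_assignment`; the cost values are made up):

```python
from itertools import permutations

def min_cost_match(cost):
    """Exhaustive min-cost one-to-one matching over a square cost matrix,
    where cost[i][j] is the cost of assigning prediction i to target j.
    Illustrates the objective of Hungarian matching; O(n!) — toy sizes only."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# 3 predicted human-object pairs vs 3 ground-truth pairs (illustrative costs)
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.7, 0.3]]
print(min_cost_match(cost))
```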
| Argument | Default | Description |
|---|---|---|
| --dataset | hicodet | Dataset: hicodet or vcoco |
| --backbone | resnet101 | Backbone: resnet50, resnet101 |
| --clip_model | RN50 | CLIP model: RN50, RN101, ViT-B/32, etc. |
| --world-size | 4 | Number of GPUs for distributed training |
| --batch-size | 4 | Per-GPU batch size |
| --epochs | 20 | Number of training epochs |
| --lr-head | 1e-4 | Learning rate for interaction head |
| --zero_shot_type | default | Zero-shot split: rare_first, unseen_verb, etc. |
| --text_compare | false | Use cosine similarity instead of MLP classifier |
| --vis_encoder | false | Use CLIP ViT as visual encoder |
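To make the --text_compare option concrete: instead of an MLP classifier, scores come from cosine similarity between the visual feature and CLIP text embeddings of the HOI prompts. A NumPy sketch with random stand-in embeddings (real features would come from the CLIP encoders):

```python
import numpy as np

def cosine_scores(visual_feat, text_feats):
    # L2-normalize both sides; the dot product is then cosine similarity,
    # as in CLIP-style zero-shot classification.
    v = visual_feat / np.linalg.norm(visual_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return t @ v

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(3, 512))                    # stand-ins for 3 HOI prompt embeddings
visual_feat = text_feats[1] + 0.1 * rng.normal(size=512)  # a feature close to prompt 1
scores = cosine_scores(visual_feat, text_feats)
print(int(scores.argmax()))  # → 1
```

Because the classifier is just a similarity against text embeddings, new HOI categories can be scored at test time without retraining, which is what enables the zero-shot splits listed above.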
docker compose build

Mount your datasets and checkpoints via docker-compose volumes:
volumes:
- ./data/hicodet:/workspace/vthoi/hicodet # HICO-DET dataset
- ./data/vcoco:/workspace/vthoi/vcoco_data # V-COCO dataset
- ./checkpoints:/workspace/vthoi/checkpoints # Model weights
- ./logs:/workspace/vthoi/logs                # Training logs

docker compose run --rm vthoi-train python main.py \
--world-size 2 \
--dataset hicodet \
--backbone resnet50 \
--epochs 30 \
--clip_model ViT-B/32

This project is licensed under the BSD 3-Clause License. See LICENSE for details.