VTHOI: Vision-Transformer Human-Object Interaction Detection


A two-stage Human-Object Interaction (HOI) detection framework built on DETR and CLIP, supporting HICO-DET, V-COCO, and CAD120 benchmarks.

Features

  • DETR-based object detection backbone with deformable attention
  • CLIP integration for vision-language alignment and zero-shot HOI detection
  • Multi-GPU distributed training via PyTorch DDP (NCCL)
  • Support for HICO-DET (600 HOI categories) and V-COCO (29 actions) benchmarks
  • CAD120 video HOI recognition (F1 evaluation)
  • Attention visualization for images and videos
  • Multiple loss and matching variants (focal loss, Hungarian set matching) and FPN support
  • Docker-ready deployment

Quick Start

Docker (Recommended)

# Build the image
docker compose build

# Train on HICO-DET (4 GPUs)
docker compose up vthoi-train

# Evaluate
docker compose up vthoi-eval

# Single image inference
docker compose up vthoi-inference

# TensorBoard monitoring
docker compose up tensorboard
# Open http://localhost:6006

Local Installation

# Clone the repository
git clone https://github.com/<your-username>/vthoi.git
cd vthoi

# Create conda environment
conda create -n vthoi python=3.8 -y
conda activate vthoi

# Install dependencies
pip install -r requirements.txt

# Build deformable attention CUDA operators (requires GPU)
cd detr/models/necks/module/ops
python setup.py build install
cd ../../../../..
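
To confirm the build, you can try importing the compiled extension. The module name below follows the upstream Deformable DETR convention (MultiScaleDeformableAttention); this fork may register it under a different name, so treat this as a sketch:

# Smoke test for the compiled CUDA operator (requires a CUDA-enabled PyTorch build).
# Note: the module name is an assumption based on upstream Deformable DETR.
python -c "import MultiScaleDeformableAttention; print('deformable attention ops OK')"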

Dataset Preparation

See docs/dataset-guide.md for detailed instructions.

HICO-DET

cd hicodet
bash download.sh

Expected structure:

hicodet/
├── hico_20160224_det/images/{train2015,test2015}/
├── instances_train2015.json
└── instances_test2015.json
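
As a quick sanity check, the commands below (run from the repository root) should list sample images and both annotation files, assuming the layout shown above:

# Verify the HICO-DET layout before training.
ls hicodet/hico_20160224_det/images/train2015 | head -n 3
ls hicodet/hico_20160224_det/images/test2015 | head -n 3
ls hicodet/instances_train2015.json hicodet/instances_test2015.json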

V-COCO

cd vcoco
bash download.sh

Pretrained DETR Weights

Download pretrained DETR checkpoints and place them under checkpoints/detr/:

Backbone         Dataset    Path
ResNet-50        HICO-DET   checkpoints/detr/hico_det/detr-r50-hicodet.pth
ResNet-101-DC5   HICO-DET   checkpoints/detr/hico_det/detr-r101-dc5-hicodet.pth
ResNet-101       V-COCO     checkpoints/detr/vcoco/detr-r101-vcoco.pth
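
Download links vary by mirror, so they are not hard-coded here. A minimal sketch for creating the expected layout (paths taken from the table above; adjust the source path to wherever you downloaded the weights):

# Create the checkpoint directories expected by the training commands below.
mkdir -p checkpoints/detr/hico_det checkpoints/detr/vcoco

# Example: move a downloaded checkpoint into place (source path is hypothetical).
mv ~/Downloads/detr-r50-hicodet.pth checkpoints/detr/hico_det/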

Usage

Training

# HICO-DET with 4 GPUs
python main.py \
  --world-size 4 \
  --dataset hicodet \
  --data-root hicodet/ \
  --backbone resnet50 \
  --batch-size 8 \
  --pretrained checkpoints/detr/hico_det/detr-r50-hicodet.pth \
  --output-dir logs/hicodet/exp1 \
  --clip_model RN50 \
  --epochs 20

# V-COCO with 4 GPUs
python main.py \
  --world-size 4 \
  --dataset vcoco \
  --data-root vcoco/ \
  --partitions trainval test \
  --backbone resnet101 \
  --pretrained checkpoints/detr/vcoco/detr-r101-vcoco.pth \
  --output-dir logs/vcoco/exp1 \
  --clip_model RN50

Evaluation

# HICO-DET evaluation
python main.py \
  --eval \
  --dataset hicodet \
  --data-root hicodet/ \
  --resume logs/hicodet/exp1/best_model.pt \
  --world-size 1

# V-COCO evaluation
python main.py \
  --eval \
  --dataset vcoco \
  --data-root vcoco/ \
  --resume logs/vcoco/exp1/best_model.pt \
  --world-size 1

Inference

# Single image from dataset
python inference.py \
  --resume checkpoints/best_model.pt \
  --dataset hicodet \
  --data-root hicodet/ \
  --index 0 \
  --draw_pic

# Custom image
python inference.py \
  --resume checkpoints/best_model.pt \
  --image-path /path/to/image.jpg

# Video inference
python inference.py \
  --resume checkpoints/best_model.pt \
  --video \
  --video_dir /path/to/video

Project Structure

vthoi/
├── main.py                 # Training and evaluation entry point
├── inference.py            # Inference and visualization
├── vcoco_test.py           # V-COCO evaluation script
├── vthoi.py                # GFIN model builder (detector + interaction head)
├── interaction_head.py     # InteractionHead with CLIP encoder
├── utils.py                # DataFactory, CustomisedDLE, transforms
├── ops.py                  # Spatial encodings, focal loss, Hungarian matching
├── Dockerfile              # Docker build configuration
├── docker-compose.yml      # Multi-service Docker setup
├── requirements.txt        # Python dependencies
│
├── detr/                   # DETR backbone (transformer-based detector)
├── CLIP/                   # OpenAI CLIP (local fork)
├── pocket/                 # Data loading and training utilities
├── vthoi_module/           # Base interaction head, PID decoder
├── gfin_module/            # Position encoding modules
├── add_on/                 # Inference utils, attention hooks, NMS, visualization
│
├── hicodet/                # HICO-DET dataset loader and scripts
├── vcoco/                  # V-COCO dataset loader and evaluation
├── cad120/                 # CAD120 video HOI dataset
├── datasets/               # Priors, word embeddings, class metadata
├── experiments/            # Training shell scripts
└── docs/                   # Documentation

Key Arguments

Argument          Default    Description
--dataset         hicodet    Dataset: hicodet or vcoco
--backbone        resnet101  Backbone: resnet50 or resnet101
--clip_model      RN50       CLIP model: RN50, RN101, ViT-B/32, etc.
--world-size      4          Number of GPUs for distributed training
--batch-size      4          Per-GPU batch size
--epochs          20         Number of training epochs
--lr-head         1e-4       Learning rate for the interaction head
--zero_shot_type  default    Zero-shot split: rare_first, unseen_verb, etc.
--text_compare    false      Use cosine similarity instead of an MLP classifier
--vis_encoder     false      Use CLIP ViT as the visual encoder
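
For example, these arguments compose into a hypothetical zero-shot training run (flag names and values are taken from the table above; the exact split names accepted by --zero_shot_type are presumably defined in main.py):

# Hypothetical zero-shot run on HICO-DET with the rare-first split and ViT-B/32.
python main.py \
  --world-size 2 \
  --dataset hicodet \
  --data-root hicodet/ \
  --backbone resnet50 \
  --pretrained checkpoints/detr/hico_det/detr-r50-hicodet.pth \
  --output-dir logs/hicodet/zs_rare_first \
  --clip_model ViT-B/32 \
  --zero_shot_type rare_first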

Docker Usage

Build

docker compose build

Data Volumes

Mount your datasets and checkpoints via docker-compose volumes:

volumes:
  - ./data/hicodet:/workspace/vthoi/hicodet        # HICO-DET dataset
  - ./data/vcoco:/workspace/vthoi/vcoco_data        # V-COCO dataset
  - ./checkpoints:/workspace/vthoi/checkpoints      # Model weights
  - ./logs:/workspace/vthoi/logs                    # Training logs

Custom Training

docker compose run --rm vthoi-train python main.py \
  --world-size 2 \
  --dataset hicodet \
  --backbone resnet50 \
  --epochs 30 \
  --clip_model ViT-B/32

License

This project is licensed under the BSD 3-Clause License. See LICENSE for details.

Acknowledgements

  • DETR - End-to-End Object Detection with Transformers
  • CLIP - Contrastive Language-Image Pre-Training
  • HICO-DET - A Benchmark for Recognizing Human-Object Interactions in Images
  • V-COCO - Visual Semantic Role Labeling
