Jiefeng Li · Jinkun Cao · Haotian Zhang · Davis Rempe · Jan Kautz · Umar Iqbal · Ye Yuan
GEM is a unified generative framework for human motion estimation and generation. GEM accepts multiple conditioning modalities — video, 2D keypoints, text, and audio — and handles multiple tasks without task-specific heads.
For full-body motion estimation (hands + face), see GEM-X.
- [March 2026] 📢 GEM-SMPL is released with a multi-modal demo script.
- [December 2025] 📢 GENMO has been renamed to GEM.
- [October 2025] 📢 The GEM codebase is released!
```bash
pip install uv && uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
bash scripts/install_env.sh
```

```bash
python scripts/demo/demo_smpl.py --input_list path/to/video.mp4 "text:a person walks forward" --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

For full installation instructions (body model, checkpoints), see docs/INSTALL.md.
| Model | Body Model | Description | Download |
|---|---|---|---|
| GEM-SMPL | SMPL | Regression + generation (text/audio/music/video) | HuggingFace |
Place checkpoints under inputs/pretrained/ or pass the path directly via --ckpt_path. The demo scripts automatically download the checkpoint from HuggingFace if --ckpt_path is not provided.
The main demo supports mixed video and text conditioning — the core contribution of GEM.
Video + inline text:
```bash
python scripts/demo/demo_smpl.py \
    --input_list video1.mp4 "text:a person acting like a monkey" video2.mp4 \
    --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

Video + text file:
```bash
python scripts/demo/demo_smpl.py \
    --input_list video1.mp4 prompt.txt video2.mp4 \
    --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

Multiple videos + multiple text prompts:
```bash
python scripts/demo/demo_smpl.py \
    --input_list video1.mp4 "text:a person acting like a monkey" video2.mp4 "text:a person dances" \
    --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

| Argument | Default | Description |
|---|---|---|
| --input_list | — | Input list (required): .mp4/.avi/.mov files, .txt files, or text:prompt strings |
| --ckpt_path | null | Pretrained checkpoint path |
| --text_length | 300 | Number of frames for each text segment (300 = 10 s at 30 fps) |
| --hmr2_ckpt | inputs/checkpoints/hmr2/epoch=10-step=25000.ckpt | HMR2 checkpoint for image features |
| -s / --static_cam | off | Assume static camera |
| --output_root | outputs | Output directory |
| --no_render | off | Skip visualization, only save SMPL parameters |
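The three input kinds in --input_list are distinguishable purely by form. A minimal sketch of that dispatch (a hypothetical helper for illustration, not the actual parser in scripts/demo/demo_smpl.py):

```python
from pathlib import Path

# Video extensions accepted by --input_list, per the argument table above.
VIDEO_EXTS = {".mp4", ".avi", ".mov"}

def classify_input(item: str) -> str:
    """Classify one --input_list entry as 'video', 'text', or 'text_file'."""
    if item.startswith("text:"):
        return "text"          # inline prompt, e.g. "text:a person dances"
    suffix = Path(item).suffix.lower()
    if suffix in VIDEO_EXTS:
        return "video"         # video file to condition on
    if suffix == ".txt":
        return "text_file"     # prompt stored in a text file
    raise ValueError(f"Unrecognized input: {item}")

print([classify_input(x) for x in
       ["video1.mp4", "text:a person dances", "prompt.txt"]])
# → ['video', 'text', 'text_file']
```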
Results are saved to outputs/<first_video_name>_mix/:
| File | Description |
|---|---|
| 1_incam.mp4 | In-camera mesh overlay |
| 2_global.mp4 | Global-coordinate render |
| 3_incam_global_horiz.mp4 | Side-by-side comparison |
| smpl_params.pt | SMPL parameters (body_params_global, body_params_incam, K_fullimg, segment_info) |
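The saved smpl_params.pt can be inspected with torch.load. A sketch, assuming the file is a dictionary keyed by the four names in the table above; the dummy values below are illustrative placeholders, not the real tensor shapes:

```python
import torch

# Build an illustrative dummy file with the documented keys
# (in practice you would load outputs/<first_video_name>_mix/smpl_params.pt).
dummy = {
    "body_params_global": {"global_orient": torch.zeros(300, 3)},
    "body_params_incam": {"global_orient": torch.zeros(300, 3)},
    "K_fullimg": torch.eye(3),   # camera intrinsics
    "segment_info": [],          # per-segment metadata
}
torch.save(dummy, "smpl_params.pt")

params = torch.load("smpl_params.pt")
print(sorted(params.keys()))
# → ['K_fullimg', 'body_params_global', 'body_params_incam', 'segment_info']
```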
For simple pose estimation without text conditioning, use demo_smpl_hpe.py:
```bash
python scripts/demo/demo_smpl_hpe.py \
    --video path/to/video.mp4 \
    --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

See Dataset Preparation for download links and directory structure.
Regression model (video → SMPL):

```bash
python scripts/train.py exp=gem_smpl_regression
```

Full model (regression + text/audio generation):

```bash
python scripts/train.py exp=gem_smpl
```

Multi-GPU (DDP):

```bash
python scripts/train.py exp=gem_smpl_regression pl_trainer.devices=4
```

SLURM:

```bash
python scripts/train_slurm.py exp=gem_smpl_regression
```

From configs/exp/gem_smpl_regression.yaml:
- Body model: SMPLx
- Optimizer: AdamW (lr=2e-4)
- Precision: 16-mixed
- Max steps: 500K
- Gradient clipping: 0.5
- Validation every 3000 steps
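Collected as a config fragment, these settings would look roughly like the following sketch. The key names are guesses in PyTorch Lightning style — consult the actual configs/exp/gem_smpl_regression.yaml:

```yaml
# Hypothetical sketch of configs/exp/gem_smpl_regression.yaml
optimizer:
  type: AdamW
  lr: 2.0e-4
pl_trainer:
  precision: 16-mixed
  max_steps: 500000
  gradient_clip_val: 0.5
  val_check_interval: 3000
use_wandb: true
```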
Logging uses W&B by default. To disable:
```bash
python scripts/train.py exp=gem_smpl_regression use_wandb=false
```

See FAQ for common issues.
GEM is part of a larger effort to enable humanoid motion data for robotics, physical AI, and other applications.
Check out these related works:
```bibtex
@inproceedings{genmo2025,
  title     = {GENMO: A GENeralist Model for Human MOtion},
  author    = {Li, Jiefeng and Cao, Jinkun and Zhang, Haotian and Rempe, Davis and Kautz, Jan and Iqbal, Umar and Yuan, Ye},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
```

This project is released under the Apache 2.0 License — see LICENSE for details. Third-party components are subject to their own licenses; see ATTRIBUTIONS.md for specifics.
