Jiefeng Li · Jinkun Cao · Haotian Zhang · Davis Rempe · Jan Kautz · Umar Iqbal · Ye Yuan
GEM is a unified generative framework for human motion estimation and generation. GEM accepts multiple conditioning modalities — video, 2D keypoints, text, and audio — and handles multiple tasks without task-specific heads.
For full-body motion estimation (hands + face), see GEM-X.
- [March 2026] 📢 GEM-SMPL is released with a multi-modal demo script.
- [December 2025] 📢 GENMO has been renamed to GEM.
- [October 2025] 📢 The GEM codebase is released!
```bash
pip install uv && uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
bash scripts/install_env.sh
```

```bash
python scripts/demo/demo_smpl.py --input_list path/to/video.mp4 "text:a person walks forward" --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

For full installation instructions (body model, checkpoints), see docs/INSTALL.md.
| Model | Body Model | Description | Download |
|---|---|---|---|
| GEM-SMPL | SMPL | Regression + generation (text/audio/music/video) | HuggingFace |
Place checkpoints under inputs/pretrained/ or pass the path directly via --ckpt_path. The demo scripts automatically download the checkpoint from HuggingFace if --ckpt_path is not provided.
The main demo supports mixed video and text conditioning — the core contribution of GEM.
Video + inline text:
```bash
python scripts/demo/demo_smpl.py \
    --input_list video1.mp4 "text:a person acting like a monkey" video2.mp4 \
    --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

Video + text file:
```bash
python scripts/demo/demo_smpl.py \
    --input_list video1.mp4 prompt.txt video2.mp4 \
    --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

Multiple videos + multiple text prompts:
```bash
python scripts/demo/demo_smpl.py \
    --input_list video1.mp4 "text:a person acting like a monkey" video2.mp4 "text:a person dances" \
    --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

| Argument | Default | Description |
|---|---|---|
| --input_list | — | Input list (required): .mp4/.avi/.mov files, .txt files, or text:prompt strings |
| --ckpt_path | null | Pretrained checkpoint path |
| --text_length | 300 | Number of frames for each text segment (300 = 10 s at 30 fps) |
| --hmr2_ckpt | inputs/checkpoints/hmr2/epoch=10-step=25000.ckpt | HMR2 checkpoint for image features |
| -s / --static_cam | off | Assume static camera |
| --output_root | outputs | Output directory |
| --no_render | off | Skip visualization, only save SMPL parameters |
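The three input kinds in --input_list are distinguishable purely by form. A minimal sketch of that dispatch (a hypothetical helper for illustration, not the actual parser in scripts/demo/demo_smpl.py):

```python
from pathlib import Path

# Video extensions accepted by --input_list, per the argument table above.
VIDEO_EXTS = {".mp4", ".avi", ".mov"}

def classify_input(item: str) -> str:
    """Classify one --input_list entry as 'video', 'text', or 'text_file'."""
    if item.startswith("text:"):
        return "text"          # inline prompt, e.g. "text:a person dances"
    suffix = Path(item).suffix.lower()
    if suffix in VIDEO_EXTS:
        return "video"         # video file to condition on
    if suffix == ".txt":
        return "text_file"     # prompt stored in a text file
    raise ValueError(f"Unrecognized input: {item}")

print([classify_input(x) for x in
       ["video1.mp4", "text:a person dances", "prompt.txt"]])
# → ['video', 'text', 'text_file']
```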
Results are saved to outputs/<first_video_name>_mix/:
| File | Description |
|---|---|
| 1_incam.mp4 | In-camera mesh overlay |
| 2_global.mp4 | Global-coordinate render |
| 3_incam_global_horiz.mp4 | Side-by-side comparison |
| smpl_params.pt | SMPL parameters (body_params_global, body_params_incam, K_fullimg, segment_info) |
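The saved smpl_params.pt can be inspected with torch.load. A sketch, assuming the file is a dictionary keyed by the four names in the table above; the dummy values below are illustrative placeholders, not the real tensor shapes:

```python
import torch

# Build an illustrative dummy file with the documented keys
# (in practice you would load outputs/<first_video_name>_mix/smpl_params.pt).
dummy = {
    "body_params_global": {"global_orient": torch.zeros(300, 3)},
    "body_params_incam": {"global_orient": torch.zeros(300, 3)},
    "K_fullimg": torch.eye(3),   # camera intrinsics
    "segment_info": [],          # per-segment metadata
}
torch.save(dummy, "smpl_params.pt")

params = torch.load("smpl_params.pt")
print(sorted(params.keys()))
# → ['K_fullimg', 'body_params_global', 'body_params_incam', 'segment_info']
```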
For simple pose estimation without text conditioning, use demo_smpl_hpe.py:
```bash
python scripts/demo/demo_smpl_hpe.py \
    --video path/to/video.mp4 \
    --ckpt_path inputs/pretrained/gem_smpl.ckpt
```

See Dataset Preparation for download links and directory structure.
Regression model (video → SMPL):

```bash
python scripts/train.py exp=gem_smpl_regression
```

Full model (regression + text/audio generation):

```bash
python scripts/train.py exp=gem_smpl
```

Multi-GPU (DDP):

```bash
python scripts/train.py exp=gem_smpl_regression pl_trainer.devices=4
```

SLURM:

```bash
python scripts/train_slurm.py exp=gem_smpl_regression
```

From configs/exp/gem_smpl_regression.yaml:
- Body model: SMPLx
- Optimizer: AdamW (lr=2e-4)
- Precision: 16-mixed
- Max steps: 500K
- Gradient clipping: 0.5
- Validation every 3000 steps
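Collected as a config fragment, these settings would look roughly like the following sketch. The key names are guesses in PyTorch Lightning style — consult the actual configs/exp/gem_smpl_regression.yaml:

```yaml
# Hypothetical sketch of configs/exp/gem_smpl_regression.yaml
optimizer:
  type: AdamW
  lr: 2.0e-4
pl_trainer:
  precision: 16-mixed
  max_steps: 500000
  gradient_clip_val: 0.5
  val_check_interval: 3000
use_wandb: true
```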
Logging uses W&B by default. To disable:
```bash
python scripts/train.py exp=gem_smpl_regression use_wandb=false
```

See FAQ for common issues.
GEM is part of a larger effort to enable humanoid motion data for robotics, physical AI, and other applications.
Check out these related works:
```bibtex
@inproceedings{genmo2025,
  title     = {GENMO: A GENeralist Model for Human MOtion},
  author    = {Li, Jiefeng and Cao, Jinkun and Zhang, Haotian and Rempe, Davis and Kautz, Jan and Iqbal, Umar and Yuan, Ye},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
```

This project is released under the Apache 2.0 License — see LICENSE for details. Third-party components are subject to their own licenses; see ATTRIBUTIONS.md for specifics.
