Tianshuo Xu1, Zhifei Chen1, Leyi Wu1, Hao Lu1, Ying-cong Chen1,2*
1HKUST (GZ) 2HKUST * corresponding author
Motion Forcing decouples physical reasoning from visual synthesis via a hierarchical Point → Shape → Appearance paradigm, enabling precise and physically consistent video generation from a single image and user-drawn trajectories. Given sparse motion anchors (Point), the model first generates dynamic depth (Shape), then renders high-fidelity RGB frames (Appearance), bridging the gap between control signals and complex scene dynamics.
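The Point stage consumes sparse motion anchors derived from a user-drawn trajectory. As a rough illustration of how a freehand trajectory could be reduced to evenly spaced anchors (a minimal sketch; `resample_trajectory`, the arc-length spacing, and the anchor count are assumptions, not the released pipeline):

```python
import math

def resample_trajectory(points, num_anchors):
    """Resample a user-drawn 2D trajectory into evenly spaced motion anchors.

    `points` is a list of (x, y) clicks; anchors are placed at equal
    arc-length intervals along the piecewise-linear path. Hypothetical
    helper for illustration only.
    """
    # Cumulative arc length at each input point.
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]

    anchors = []
    for i in range(num_anchors):
        target = total * i / (num_anchors - 1)
        # Find the polyline segment containing this arc-length position.
        j = next(k for k in range(len(dists) - 1) if dists[k + 1] >= target)
        seg = dists[j + 1] - dists[j]
        t = 0.0 if seg == 0 else (target - dists[j]) / seg
        (x0, y0), (x1, y1) = points[j], points[j + 1]
        anchors.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return anchors
```

A denser click sequence simply yields the same number of anchors spread over a longer path, which keeps the conditioning signal sparse regardless of how carefully the user draws.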
| Turn Left | Turn Right | Speed Up | Slow Down |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| Dangerous Cut-In | Double Cut-In | Right Cut-In | Left Cut-In & Brake |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
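Every scenario above is driven solely by the drawn trajectory; no scenario-specific conditioning is mentioned. As a loose sketch of how such a trajectory could also be synthesized programmatically, for example a left turn as a quarter-circle arc (the function, its heading convention, and its parameters are assumptions for illustration, not part of the release):

```python
import math

def turn_left_trajectory(start, radius, num_points):
    """Hypothetical helper: sample points along a quarter-circle left turn.

    The path starts at `start` heading +y and curves left toward -x.
    Illustrative of the kind of trajectory a user might draw in the demo.
    """
    x0, y0 = start
    cx, cy = x0 - radius, y0  # circle center sits to the left of the start
    step = (math.pi / 2) / (num_points - 1)
    return [
        (cx + radius * math.cos(i * step), cy + radius * math.sin(i * step))
        for i in range(num_points)
    ]
```

Analogous primitives (an S-curve for a cut-in, shortened spacing for braking) could script the other scenarios in the table.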
- Inference code
- Gradio demo
- Pretrained checkpoint
- Data processing pipeline (coming soon)
- Training code (coming soon)
```bash
git clone --recurse-submodules https://github.com/Tianshuo-Xu/Motion-Forcing.git
cd Motion-Forcing
pip install -r requirements.txt
```

Build VGGT:
```bash
git clone git@github.com:facebookresearch/vggt.git
cd vggt
pip install -e .
```

Download depth estimation weights:
```bash
cd Video-Depth-Anything
bash get_weights.sh
```
Download YOLO segmentation weights into weights/yolo11l-seg.pt (used for interactive object selection in the demo).
The CogVideoX base model and the fine-tuned transformer (TSXu/MotionForcing_driving) are downloaded automatically from Hugging Face on the first run.
```bash
python gradio_demo.py
```

Open http://localhost:7860. Upload an image, click objects to draw trajectories, then generate.
We thank the authors of CogVideoX, Video-Depth-Anything, VGGT, and Ultralytics YOLO for their outstanding open-source contributions.
```bibtex
@misc{xu2026motion,
  title={Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics},
  author={Tianshuo Xu and Zhifei Chen and Leyi Wu and Hao Lu and Ying-cong Chen},
  year={2026},
  eprint={2603.10408},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.10408},
}
```






