Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
[📄 Paper] · [⚡ GitHub] · [🎬 Project Page]
LRCM is a multimodal-guided diffusion framework for dance motion generation that simultaneously leverages audio rhythm 🎵 and hierarchical text descriptions 📝 (global style + local movements) for high-quality, controllable dance synthesis.
Current dance motion generation methods suffer from coarse semantic control and poor coherence in long sequences. LRCM addresses these through:
- 🧩 Decoupled Multimodal Dance Dataset Paradigm — Fine-grained semantic decoupling of motion, audio, and text
- 🎛️ Heterogeneous Multimodal-Guided Diffusion Architecture — Audio-latent Conformer + Text-latent Cross-Conformer
- 🔄 Motion Temporal Mamba Module (MTMM) — State space model-based autoregressive extension for long-sequence generation
- 🎵 Dual-modality conditioning: Audio rhythm + Text descriptions (global + local)
- ⏩ Autoregressive generation: Efficient long-sequence synthesis via Mamba SSM
- 💃 7 dance genres: Hip-hop, Jazz, Krump, Popping, Locking, Charleston, Tap
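
The autoregressive bullet above can be pictured as generating a long sequence one segment at a time, with some state carried from each segment to the next. The sketch below is a toy illustration only. `generate_segment` and its integer "memory" are hypothetical stand-ins, not the MTMM or any function from this repository, and the 300-frame segment length is borrowed from the `--segment-frames` default listed in the arguments table.

```python
from typing import Iterator, List, Optional, Tuple


def segment_bounds(total_frames: int, segment_frames: int = 300) -> Iterator[Tuple[int, int]]:
    """Yield [start, end) frame ranges that together cover the whole sequence."""
    if segment_frames <= 0:
        raise ValueError("segment_frames must be positive")
    for start in range(0, total_frames, segment_frames):
        yield start, min(start + segment_frames, total_frames)


def generate_segment(start: int, end: int, memory: Optional[int]) -> Tuple[List[int], int]:
    """Hypothetical stand-in for a model call; returns frames and updated state."""
    frames = list(range(start, end))  # placeholder "motion": just the frame indices
    new_memory = frames[-1]           # placeholder state: the last generated frame
    return frames, new_memory


def generate_long_sequence(total_frames: int, segment_frames: int = 300) -> List[int]:
    """Generate segment by segment, passing the state forward each time."""
    motion: List[int] = []
    memory: Optional[int] = None      # no history exists before the first segment
    for start, end in segment_bounds(total_frames, segment_frames):
        frames, memory = generate_segment(start, end, memory)
        motion.extend(frames)
    return motion
```

In a real model the carried state would be a learned hidden representation rather than a frame index; the point here is only the control flow.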
git clone https://github.com/OranDuanStudy/LRCM.git
cd LRCM
pip install -r requirements.txt

Requirements: Python 3.10+, CUDA 12.x, PyTorch 2.4+, 4× RTX 4090 (24GB)
.
├── models/
│ ├── LightningModel.py # Main Lightning model
│ ├── BaseModel.py
│ ├── nn.py # Neural network building blocks
│ ├── mamba/mambamotion.py # Motion Temporal Mamba Module
│ ├── lgtm/ # Text encoders & diffusion components
│ │ ├── conformer.py
│ │ ├── text_encoder.py
│ │ ├── motion_diffusion.py
│ │ └── utils/
│ └── transformer/tisa_transformer.py
├── utils/
│ ├── motion_dataset.py # Dataset loaders
│ └── hparams.py # Hyperparameter management
├── pymo/ # Motion preprocessing (BVH, rotations)
├── hparams/
│ ├── LRCM_stage1.yaml # Phase 1: Global text + Audio
│ ├── LRCM_stage2.yaml # Phase 2: Add Local text
│ └── LRCM_stage3.yaml # Phase 3: Enable MTMM
├── train.py # Training script
├── synthesize.py # Inference script
└── requirements.txt
Note: The text annotations dataset below is text-only (global + local text descriptions). You must also download the Motorica Dance dataset to obtain the raw motion capture data and audio files for training.
Enhanced text annotations with hierarchical global and local descriptions for 7 dance genres. Place the downloaded files under data/Multimodal_Text_dataset_updating/.
Two model versions are provided:
| Checkpoint | Description | Usage |
|---|---|---|
| NAR version | Non-autoregressive model (Phase 2) | `--checkpoints NAR/dance_LRCM_stage2.ckpt` |
| AR version | Autoregressive model (Phase 3) | `--checkpoints AR/dance_LRCM_stage3.ckpt` |
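
If you switch between the two versions from a script, a small lookup keeps the paths in one place. This is a minimal sketch: the helper name `checkpoint_for` is hypothetical, and the paths are copied from the Usage column above, so adjust them to wherever you actually store the files.

```python
from pathlib import Path

# Paths copied from the Usage column of the checkpoint table.
CHECKPOINTS = {
    "NAR": Path("NAR/dance_LRCM_stage2.ckpt"),  # non-autoregressive, Phase 2
    "AR": Path("AR/dance_LRCM_stage3.ckpt"),    # autoregressive, Phase 3
}


def checkpoint_for(mode: str) -> Path:
    """Return the checkpoint path for 'NAR' or 'AR' (case-insensitive)."""
    key = mode.upper()
    if key not in CHECKPOINTS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(CHECKPOINTS)}")
    return CHECKPOINTS[key]
```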
python synthesize.py \
--checkpoints ckpt/dance_LRCM_stage3.ckpt \
--data_dirs data/Multimodal_Text_dataset_updating/ \
--input_files sample_input.pkl \
--input_text "dynamic hip-hop dance with arm waves and body rolls" \
--dest_dir results/

Batch generation:
bash experiments/LRCM_manbadance_duainput_memory.sh
bash experiments/LRCM_duainput_memory_json.sh

Arguments:
| Argument | Description | Default |
|---|---|---|
| `-c, --checkpoints` | Path to model checkpoint | Required |
| `-d, --data_dirs` | Path to data directory | Required |
| `-f, --input_files` | Input motion file | Required |
| `-t, --input_text` | Text description (global style) | Required |
| `-r, --seed` | Random seed | 42 |
| `--dest_dir` | Output directory | "results" |
| `-m, --segment-frames` | Segment frame length | 300 |
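
For reference, the argument table maps onto a standard `argparse` parser roughly as follows. This is a sketch reconstructed from the table alone, not the parser in `synthesize.py`; the `int` types for the seed and segment length are assumptions, and the help strings are copied from the Description column.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the synthesis CLI")
    parser.add_argument("-c", "--checkpoints", required=True, help="Path to model checkpoint")
    parser.add_argument("-d", "--data_dirs", required=True, help="Path to data directory")
    parser.add_argument("-f", "--input_files", required=True, help="Input motion file")
    parser.add_argument("-t", "--input_text", required=True, help="Text description (global style)")
    parser.add_argument("-r", "--seed", type=int, default=42, help="Random seed")
    parser.add_argument("--dest_dir", default="results", help="Output directory")
    # argparse exposes "--segment-frames" as the attribute "segment_frames".
    parser.add_argument("-m", "--segment-frames", type=int, default=300, help="Segment frame length")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Omitting any of the four required flags makes `argparse` exit with a usage message, which matches the "Required" entries in the table.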
Phase 1: Foundation (Global text + Audio)
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python train.py \
--dataset_root data/Multimodal_Text_dataset_updating \
--hparams_file ./hparams/LRCM_stage1.yaml \
--ckpt_file None

Phase 2: Fine-tuning (Add Local text)
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python train.py \
--dataset_root data/Multimodal_Text_dataset_updating \
--hparams_file ./hparams/LRCM_stage2.yaml \
--ckpt_file ./pretrained_models/dance_LRCM_stage1.ckpt

Phase 3: Autoregressive (Enable MTMM)
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python train.py \
--dataset_root data/Multimodal_Text_dataset_updating \
--hparams_file ./hparams/LRCM_stage3.yaml \
--ckpt_file ./pretrained_models/dance_LRCM_stage2.ckpt

Training details: Adam optimizer (weight decay: 1.0e-4), 200 DDPM steps, 20 residual blocks, ~316M parameters.
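
The three phases form a chain: Phase 1 starts from scratch (`--ckpt_file None`), and each later phase resumes from the previous phase's checkpoint. The sketch below only assembles the command lines shown above so the pattern is explicit. The helper `train_command` is hypothetical, and it assumes each phase's checkpoint has been saved under `./pretrained_models/` with the names used in the commands.

```python
from typing import List

DATASET_ROOT = "data/Multimodal_Text_dataset_updating"


def train_command(stage: int) -> List[str]:
    """Return the train.py argument list for training phase 1, 2, or 3."""
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")
    # Phase 1 has no predecessor; later phases resume from the previous one.
    if stage == 1:
        ckpt = "None"
    else:
        ckpt = f"./pretrained_models/dance_LRCM_stage{stage - 1}.ckpt"
    return [
        "python", "train.py",
        "--dataset_root", DATASET_ROOT,
        "--hparams_file", f"./hparams/LRCM_stage{stage}.yaml",
        "--ckpt_file", ckpt,
    ]


if __name__ == "__main__":
    for stage in (1, 2, 3):
        print(" ".join(train_command(stage)))
```

Set `CUDA_VISIBLE_DEVICES` in the environment before launching, as in the commands above.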
@misc{lrcm2026,
title = {Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset},
author = {Oran Duan and Yinghua Shen and Yingzhu Lv and Luyang Jie and Yaxin Liu and Qiong Wu},
year = {2026},
eprint = {2601.03323},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}

This project is licensed under the MIT License - see the LICENSE file for details.

