This is the official repository for the paper:
SignControl: Multi-Granular Control for Sign Language Video Generation
Xuehan Hou*, Zeyu Zhang*†, Ziye Song*, Zheng Zhu#, et al.
*Equal contribution. †Project lead. #Corresponding author.
demo.mp4
SignControl extends the Wan 2.1 T2V 1.3B family with a hierarchical ControlNeXt-based conditioning pipeline that is tailored to the fine-grained needs of sign language video generation. Built on the DiffSynth implementation, this repository reproduces the three-stage training and inference workflow described in the SignControl paper: LoRA domain calibration, multi-modal ControlNeXt integration, and Control Decay for robustness under incomplete control.
Sample results (RGB clip / depth control pairs):

| Video | Depth |
| --- | --- |
| rescaled_1.mp4 | rescaled_1_depth.mp4 |
| rescaled_2.mp4 | rescaled_2_depth.mp4 |
| rescaled_3.mp4 | rescaled_3_depth.mp4 |
| rescaled_4.mp4 | rescaled_4_depth.mp4 |
| rescaled_5.mp4 | rescaled_5_depth.mp4 |

Additional samples and gloss-annotated Phoenix-2014.v3 outputs:

- mmexport1750059028933.mp4
- mmexport1750059851605.mp4
- mmexport1750059854541.mp4
- mmexport1750059856873.mp4
- NEUNZEHN.GRAD.NORD.SECHS.ZWANZIG.THUERINGEN.SACHSEN.TAG.KLAR.__OFF__.mp4
- __ON__.TEMPERATUR.WIE-AUSSEHEN.__PU__.loc-SUED.__EMOTION__.ZWANZIG.BIS.DREI.ZWANZIG.GRAD.mp4
- __ON__.WIND.WEHEN.WEHEN.__OFF__.mp4
- REGION.SUEDOST.HAUPTSAECHLICH.BEWOELKT.TROCKEN.SONNE.WOLKE.__OFF__.mp4
- __ON__.BAYERN.OST.loc-REGION.__EMOTION__.THUERINGEN.__OFF__.mp4
- __ON__.NACHT.SIEBEN.SUEDWESTRAUM.BAYERN.NULL.ROT.ROT.HAAR.BERG.__OFF_4.mp4
- `SignControl_paper.pdf` – Source paper describing the architecture, experiments, and evaluation on Phoenix-2014.v3.
- `environment.yml` – Conda recipe for `signcontrol-env` (Python 3.9, PyTorch 2.5.1 + CUDA 12.1, DiffSynth, Transformers, etc.).
- `SignControl/` – Main project directory containing the DiffSynth-based implementation with ControlNeXt extensions.
  - `diffsynth/` – Core DiffSynth library with model implementations, pipelines, and utilities.
    - `models/` – Model implementations including Wan Video DiT, VAE, text encoders, and ControlNet modules.
    - `pipelines/` – Inference pipelines including `wan_video.py` for sign language video generation.
  - `examples/` – Training and inference scripts.
    - `wanvideo/train_wan_t2v.py` – Multi-purpose training script for LoRA alignment and ControlNeXt tuning.
    - `wanvideo/` – Wan Video specific examples and inference scripts.
  - `Wan2.1-T2V-1.3B/` – Model configuration and assets for the 1.3B-parameter model.
  - `Wan2.1-T2V-14B/` – Model configuration for the 14B-parameter model.
  - `requirements.txt` / `setup.py` – Python dependencies for the SignControl implementation.
- Hierarchical Multi-Modality Control – pose (DWPose) for joint-level motion, optical flow (OnlyFlow + PGMM) for pixel gradients, and depth (Depth Anything V2 + Depth LoRA) for coarse spatial layout are injected at progressively deeper DiT blocks (pose → block 1, flow → block 6, depth → block 11). Cross normalization aligns each modality before fusion.
- Control Decay Strategy – after LoRA convergence, modality weights are linearly decayed (with a floor of 0.1) so the model learns to rely on semantics and partial control signals, improving inference robustness when controls are missing.
- ControlNeXt on DiT – the architecture adapts ControlNeXt (originally designed for UNet) to the DiT backbone, letting LoRA adapters fine-tune only attention and FFN layers while ControlNeXt modules provide multi-granular conditioning.
- Phoenix-2014.v3 Evaluation – experiments on ~1,000 weather-forecast sentences yield BLEU 21.8 and ROUGE-L 47.3, along with state-of-the-art SSIM/FVD scores compared to SignGen, validating the multi-modal control approach.
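The cross-normalization idea can be sketched in a few lines. This is a minimal illustration under assumed shapes, not the repository's implementation: the control features are rescaled to match the latent's mean/std before the two are summed inside a transformer block.

```python
import numpy as np

def cross_norm(control: np.ndarray, latent: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Rescale control features so their statistics match the DiT latent's
    before the two streams are fused by addition."""
    c_mean, c_std = control.mean(), control.std()
    l_mean, l_std = latent.mean(), latent.std()
    return (control - c_mean) / (c_std + eps) * l_std + l_mean
```

After this step the control branch cannot overwhelm (or vanish against) the latent stream regardless of the scale at which each modality encoder emits its features.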
- **Prerequisites**
- CUDA 12.x driver + compatible GPU (4× H20 or equivalent recommended for training).
- Conda / Miniconda installed.
- **Create the environment**
  ```bash
  conda env create -f environment.yml
  conda activate signcontrol-env
  ```
The environment pins Python 3.9.23, PyTorch 2.5.1+cu121, transformers 4.56.1, diffsynth 1.1.2, and NVIDIA CUDA libraries (cuBLAS, cuDNN, cuSPARSE, etc.).
- **Install SignControl dependencies**
  ```bash
  cd SignControl
  pip install -e .
  ```
This ensures the DiffSynth integrations, ControlNeXt extensions, and WanVideo training scripts are discoverable.
- **Phoenix-2014.v3 dataset** – use the standard download (weather-forecast sentences + German annotations). Each video is resized to 480×832 and trimmed to 81 frames, then projected into a latent space (`[21, 16, 60, 104]`) via the Wan2.1 VAE.
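The latent shape follows from the Wan2.1 VAE's compression factors (4× temporal with the first frame kept, 8× spatial, 16 latent channels); a quick sanity check:

```python
# Wan2.1 VAE compression: 4x temporal (first frame kept), 8x spatial, 16 channels.
num_frames, height, width = 81, 480, 832
latent_shape = [(num_frames - 1) // 4 + 1, 16, height // 8, width // 8]
print(latent_shape)  # [21, 16, 60, 104]
```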
- **Control Modalities**
- Pose via DWPose (body + face + hand keypoints).
- Optical Flow via OnlyFlow with PGMM to capture pixel velocities.
- Depth via Depth Anything V2 plus Depth LoRA for coarse spatial layout.
- **Directory layout**
  ```
  data/phoenix2014/
  ├── metadata.csv                      # file_name,text,control_name
  └── train/
      ├── phoenix_00001.mp4
      ├── phoenix_00001_c.mp4           # control bundle (pose/flow/depth)
      └── phoenix_00001.mp4.tensors.pth
  ```

  Each row in `metadata.csv` should specify the text prompt and the matching control filename (the script will look for all modalities under that control prefix).
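A `metadata.csv` in that layout can be produced with the standard library; the paths and gloss text below are placeholders for illustration, not actual dataset entries:

```python
import csv
import io

def build_metadata(rows):
    """Render (file_name, text, control_name) rows as metadata.csv content."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["file_name", "text", "control_name"])
    writer.writerows(rows)
    return buf.getvalue()

# Illustrative entry only; write the result to data/phoenix2014/metadata.csv.
csv_text = build_metadata([
    ("train/phoenix_00001.mp4", "REGEN MORGEN NORD", "train/phoenix_00001_c.mp4"),
])
```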
- **Preprocessing** – run the data-process step to convert videos into `.tensors.pth` and align with LoRA:

  ```bash
  CUDA_VISIBLE_DEVICES=0 python SignControl/examples/wanvideo/train_wan_t2v.py \
    --task data_process \
    --dataset_path data/phoenix2014 \
    --output_path ./preprocessed \
    --text_encoder_path path/to/models_t5_umt5-xxl-enc-bf16.pth \
    --vae_path path/to/Wan2.1_VAE.pth \
    --tiled --num_frames 81 --height 480 --width 832
  ```
SignControl follows the three-stage strategy outlined in the paper.
- **Stage 1 – LoRA Coarse Alignment**
- Train only with text prompts to adapt LoRA adapters (q/k/v/o + FFN layers) to sign language semantics.
- Freeze the ControlNeXt modules; only text embeddings guide the DiT latent updates.
- Continue until validation loss stabilizes (e.g., 10–20 epochs on Phoenix-2014). This stage provides the baseline for downstream conditioned learning.
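In practice, Stage 1 means marking only the attention projections and FFN weights as trainable. A hedged sketch of such a name filter follows; the parameter-name substrings are assumptions about the DiT naming scheme, not verified against the codebase:

```python
# Assumed substrings identifying q/k/v/o attention projections and FFN layers.
TRAINABLE_KEYS = ("attn.q", "attn.k", "attn.v", "attn.o", "ffn")

def is_stage1_trainable(param_name: str) -> bool:
    """Stage 1 trains LoRA on attention/FFN weights only; everything else,
    including the ControlNeXt modules, stays frozen."""
    if "controlnext" in param_name:
        return False
    return any(key in param_name for key in TRAINABLE_KEYS)
```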
- **Stage 2 – Multi-Modal Control Learning**
- Load the converged LoRA weights, enable ControlNeXt for each modality, and inject pose/flow/depth features into blocks 1/6/11 with learnable scaling factors.
- Use CrossNorm to align the modality activation statistics with DiT latents before summing into the transformer blocks.
- Run the training command to jointly optimize LoRA + ControlNeXt:
  ```bash
  python SignControl/examples/wanvideo/train_wan_t2v.py \
    --task train \
    --train_architecture full \
    --dataset_path data/phoenix2014 \
    --output_path ./checkpoints/stage2 \
    --dit_path path/to/diffusion_pytorch_model.safetensors \
    --steps_per_epoch 500 \
    --max_epochs 1000 \
    --learning_rate 4e-5 \
    --accumulate_grad_batches 1 \
    --use_gradient_checkpointing \
    --dataloader_num_workers 8 \
    --control_layers 15
  ```
- The `--control_layers` flag determines how many transformer blocks receive ControlNeXt features; the default (15) keeps most weights frozen, lowering memory use (~26 GB on a single GPU).
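The block-wise injection can be pictured with a simplified sketch (assumed shapes and names): each modality's already cross-normalized features are added into its designated DiT block with a learnable scale.

```python
import numpy as np

# Block indices per modality, as described in the paper: pose -> 1, flow -> 6, depth -> 11.
INJECT_AT = {"pose": 1, "flow": 6, "depth": 11}

def inject_controls(block_idx, hidden, controls, scales):
    """Add each available modality's features into its target block only."""
    out = hidden.copy()
    for name, idx in INJECT_AT.items():
        if idx == block_idx and name in controls:
            out = out + scales[name] * controls[name]
    return out
```

A modality absent from `controls` is simply skipped, which is the same code path Control Decay exploits at inference time.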
- **Stage 3 – Control Decay (Robustness Fine-tuning)**
- Freeze ControlNeXt and continue fine-tuning LoRA while randomly dropping modalities according to a linear decay schedule: `p_m(e) = max(0.1, 1.0 − α · max(0, e − e_stable))`.
- This stage teaches the model to generate plausible sign videos when only a subset of controls (or only text) is available at inference time.
- Implement the decay by masking control inputs inside the training script or by toggling modality flags sampled from the decay schedule.
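The schedule above translates directly into code; `alpha` and `e_stable` below are hypothetical values for illustration, not the paper's hyperparameters:

```python
import random

def keep_prob(epoch, alpha=0.05, e_stable=20, floor=0.1):
    """p_m(e) = max(floor, 1.0 - alpha * max(0, e - e_stable))"""
    return max(floor, 1.0 - alpha * max(0, epoch - e_stable))

def sample_modality_mask(epoch, modalities=("pose", "flow", "depth")):
    """Sample which control modalities to keep for this training step."""
    p = keep_prob(epoch)
    return {m: random.random() < p for m in modalities}
```

Before `e_stable` every modality is always kept; afterwards the keep probability decays linearly until it hits the 0.1 floor, so late-stage training sees mostly text-only or partially controlled batches.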
After training, use the DiffSynth pipeline to generate videos with hierarchical control:
- **Text-only sampling (LoRA-guided)**
  ```bash
  python SignControl/examples/wanvideo/wan_1.3b_text_to_video.py
  ```
Customize the prompt, negative prompt, and sampling parameters inside the script (or parameterize with CLI flags) to produce single-sentence weather forecasts, news reports, or signer variations.
- **Controlled sampling**
- Prepare control artifacts (pose + flow + depth) for a reference video.
- Provide the control tensors/frames to the pipeline (the ControlNeXt modules expect them at 81 frames and 3 channels per modality).
- Optionally simulate inference uncertainty by masking one or more modalities (e.g., drop depth) to trigger the Control Decay behavior learned during training.
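Before handing the bundle to the pipeline, the control tensors can be shape-checked. A small sketch under an assumed `(frames, channels, H, W)` layout, where omitting a modality (here depth) exercises the Control Decay behavior:

```python
import numpy as np

def validate_controls(controls, num_frames=81, channels=3):
    """Check every provided modality against the expected frame/channel counts.
    Missing modalities are allowed; Control Decay handles their absence."""
    for name, arr in controls.items():
        if arr.shape[0] != num_frames or arr.shape[1] != channels:
            raise ValueError(f"{name}: unexpected shape {arr.shape}")

# Example bundle with depth deliberately dropped to simulate partial control.
controls = {
    "pose": np.zeros((81, 3, 60, 104), dtype=np.float32),
    "flow": np.zeros((81, 3, 60, 104), dtype=np.float32),
}
validate_controls(controls)
```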
The inference pipeline uses DiffSynth's WanVideoPipeline and can be extended to evaluate metrics such as BLEU, ROUGE-L, SSIM, LPIPS, and FVD against Phoenix-2014 ground truth, matching the paper's evaluation table.
- **`Wan2.1-T2V-1.3B/`** – Configuration files and assets for the 1.3B-parameter model.
  - Recommended for development and testing with lower memory requirements.
  - Suitable for single-GPU training and inference.
- **`Wan2.1-T2V-14B/`** – Configuration files for the 14B-parameter model.
  - Higher-quality results, as reported in the paper.
  - Requires a multi-GPU setup for training and inference.
- Keep the `signcontrol-env` environment active when running scripts; the training commands expect PyTorch with GPU-enabled CUDA libraries.
- The training script `SignControl/examples/wanvideo/train_wan_t2v.py` supports multiple tasks, including data preprocessing, LoRA training, and full model training.
- Model checkpoints and preprocessed data should be stored outside the repository to avoid large-file commits.
- The full SignControl paper is available in `SignControl_paper.pdf`. Consult Section III and the appendix for dataset details, evaluation metrics, and ablation studies on multi-granular control and decay schedules.

