
SignControl: Multi-Granular Control for Sign Language Video Generation

This is the official repository for the paper:

SignControl: Multi-Granular Control for Sign Language Video Generation

Xuehan Hou*, Zeyu Zhang*, Ziye Song*, and Zheng Zhu# et al.

*Equal contribution. Project lead. #Corresponding author.

demo.mp4

Introduction and Visualization

SignControl extends the Wan 2.1 T2V 1.3B family with a hierarchical ControlNeXt-based conditioning pipeline that is tailored to the fine-grained needs of sign language video generation. Built on the DiffSynth implementation, this repository reproduces the three-stage training and inference workflow described in the SignControl paper: LoRA domain calibration, multi-modal ControlNeXt integration, and Control Decay for robustness under incomplete control.


Visualization

Depth

Paired source and depth-map clips:

  • rescaled_1.mp4 / rescaled_1_depth.mp4
  • rescaled_2.mp4 / rescaled_2_depth.mp4
  • rescaled_3.mp4 / rescaled_3_depth.mp4
  • rescaled_4.mp4 / rescaled_4_depth.mp4
  • rescaled_5.mp4 / rescaled_5_depth.mp4

Pose

  • mmexport1750059028933.mp4
  • mmexport1750059851605.mp4
  • mmexport1750059854541.mp4
  • mmexport1750059856873.mp4

without_pose_baseline

  • NEUNZEHN.GRAD.NORD.SECHS.ZWANZIG.THUERINGEN.SACHSEN.TAG.KLAR.__OFF__.mp4
  • __ON__.TEMPERATUR.WIE-AUSSEHEN.__PU__.loc-SUED.__EMOTION__.ZWANZIG.BIS.DREI.ZWANZIG.GRAD.mp4
  • __ON__.WIND.WEHEN.WEHEN.__OFF__.mp4
  • REGION.SUEDOST.HAUPTSAECHLICH.BEWOELKT.TROCKEN.SONNE.WOLKE.__OFF__.mp4

with_pose_overlap

  • __ON__.BAYERN.OST.loc-REGION.__EMOTION__.THUERINGEN.__OFF__.mp4
  • __ON__.NACHT.SIEBEN.SUEDWESTRAUM.BAYERN.NULL.ROT.ROT.HAAR.BERG.__OFF_4.mp4

Repository Layout

  • SignControl_paper.pdf – Source paper describing the architecture, experiments, and evaluation on Phoenix-2014.v3.
  • environment.yml – Conda recipe for signcontrol-env (Python 3.9, PyTorch 2.5.1 + CUDA 12.1, DiffSynth, Transformers, etc.).
  • SignControl/ – Main project directory containing the DiffSynth-based implementation with ControlNeXt extensions.
    • diffsynth/ – Core DiffSynth library with model implementations, pipelines, and utilities.
      • models/ – Model implementations including Wan Video DiT, VAE, text encoders, and ControlNet modules.
      • pipelines/ – Inference pipelines including wan_video.py for sign language video generation.
    • examples/ – Training and inference scripts.
      • wanvideo/train_wan_t2v.py – Multi-purpose training script for LoRA alignment and ControlNeXt tuning.
      • wanvideo/ – Wan Video specific examples and inference scripts.
    • Wan2.1-T2V-1.3B/ – Model configuration and assets for the 1.3B parameter model.
    • Wan2.1-T2V-14B/ – Model configuration for the 14B parameter model.
    • requirements.txt / setup.py – Python dependencies for the SignControl implementation.

Highlights from the Paper

  1. Hierarchical Multi-Modality Control – pose (DWPose) for joint-level motion, optical flow (OnlyFlow + PGMM) for pixel gradients, and depth (Depth Anything V2 + Depth LoRA) for coarse spatial layout are injected at progressively deeper DiT blocks (pose → block 1, flow → block 6, depth → block 11). Cross normalization aligns each modality before fusion.
  2. Control Decay Strategy – after LoRA convergence, modality weights are linearly decayed (with a floor of 0.1) so the model learns to rely on semantics and partial control signals, improving inference robustness when controls are missing.
  3. ControlNeXt on DiT – the architecture adapts ControlNeXt (originally designed for UNet) to the DiT backbone, letting LoRA adapters fine-tune only attention and FFN layers while ControlNeXt modules provide multi-granular conditioning.
  4. Phoenix-2014.v3 Evaluation – experiments on ~1,000 weather-forecast sentences yield a BLEU score of 21.8, a ROUGE-L score of 47.3, and state-of-the-art SSIM and FVD scores relative to SignGen, validating the multi-modal control approach.
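The block-injection mapping and cross normalization in Highlight 1 can be sketched as follows. This is an illustrative reimplementation, not the repository's code: the function name `cross_normalize` and the statistics-matching formula (shift and scale the control features to the latent's mean and standard deviation) are assumptions about how the alignment works.

```python
import numpy as np

# Injection depths per the paper: pose -> block 1, flow -> block 6, depth -> block 11.
INJECTION_BLOCKS = {"pose": 1, "flow": 6, "depth": 11}

def cross_normalize(control, latent, eps=1e-6):
    """Align control-feature statistics with the DiT latent statistics before
    the two are summed (a sketch of cross normalization; the real module may
    normalize per channel rather than globally as done here)."""
    c_mean, c_std = control.mean(), control.std()
    x_mean, x_std = latent.mean(), latent.std()
    return (control - c_mean) / (c_std + eps) * x_std + x_mean
```

After this transform, the control features share the latent's first- and second-order statistics, so summing them into a transformer block does not swamp (or vanish against) the latent activations.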

Environment Setup

  1. Prerequisites

    • CUDA 12.x driver + compatible GPU (4× H20 or equivalent recommended for training).
    • Conda / Miniconda installed.
  2. Create the environment

    conda env create -f environment.yml
    conda activate signcontrol-env

    The environment pins Python 3.9.23, PyTorch 2.5.1+cu121, transformers 4.56.1, diffsynth 1.1.2, and NVIDIA CUDA libraries (cuBLAS, cuDNN, cuSPARSE, etc.).

  3. Install SignControl dependencies

    cd SignControl
    pip install -e .

    This ensures the DiffSynth integrations, ControlNeXt extensions, and WanVideo training scripts are discoverable.

Data Preparation

  1. Phoenix-2014.v3 dataset – use the standard download (weather-forecast sentences + German annotations). Each video is resized to 480×832 and trimmed to 81 frames, then projected into a latent space ([21, 16, 60, 104]) via the Wan2.1 VAE.

  2. Control Modalities

    • Pose via DWPose (body + face + hand keypoints).
    • Optical Flow via OnlyFlow with PGMM to capture pixel velocities.
    • Depth via Depth Anything V2 plus Depth LoRA for coarse spatial layout.
  3. Directory layout:

    data/phoenix2014/
    ├── metadata.csv  # file_name,text,control_name
    └── train/
        ├── phoenix_00001.mp4
        ├── phoenix_00001_c.mp4  # control bundle (pose/flow/depth)
        └── phoenix_00001.mp4.tensors.pth
    

    Each row in metadata.csv should specify the text prompt and the matching control filename (the script will look for all modalities under that control prefix).

  4. Preprocessing – run the data-process step to convert videos into .tensors.pth files and align them with the LoRA pipeline:

    CUDA_VISIBLE_DEVICES=0 python SignControl/examples/wanvideo/train_wan_t2v.py \
      --task data_process \
      --dataset_path data/phoenix2014 \
      --output_path ./preprocessed \
      --text_encoder_path path/to/models_t5_umt5-xxl-enc-bf16.pth \
      --vae_path path/to/Wan2.1_VAE.pth \
      --tiled --num_frames 81 --height 480 --width 832
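Before launching preprocessing, it can help to verify the layout above. The sketch below is hypothetical (the `check_dataset` helper is not part of the repository); it only assumes the metadata.csv columns and `train/` structure shown in step 3.

```python
import csv
from pathlib import Path

def check_dataset(root):
    """Return a list of files referenced by metadata.csv that are missing
    on disk. Assumes the layout shown above: a metadata.csv with columns
    file_name,text,control_name, and videos under train/."""
    root = Path(root)
    missing = []
    with open(root / "metadata.csv", newline="") as f:
        for row in csv.DictReader(f):
            video = root / "train" / row["file_name"]
            control = root / "train" / row["control_name"]
            for path in (video, control):
                if not path.exists():
                    missing.append(str(path))
    return missing
```

Running this before the data-process step catches broken metadata rows early, instead of partway through a long preprocessing job.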

Training Pipeline

SignControl follows the three-stage strategy outlined in the paper.

  1. Stage 1 – LoRA Coarse Alignment

    • Train only with text prompts to adapt LoRA adapters (q/k/v/o + FFN layers) to sign language semantics.
    • Freeze the ControlNeXt modules; only text embeddings guide the DiT latent updates.
    • Continue until validation loss stabilizes (e.g., 10–20 epochs on Phoenix-2014). This stage provides the baseline for downstream conditioned learning.
  2. Stage 2 – Multi-Modal Control Learning

    • Load the converged LoRA weights, enable ControlNeXt for each modality, and inject pose/flow/depth features into blocks 1/6/11 with learnable scaling factors.
    • Use CrossNorm to align the modality activation statistics with DiT latents before summing into the transformer blocks.
    • Run the training command to jointly optimize LoRA + ControlNeXt:
      python SignControl/examples/wanvideo/train_wan_t2v.py \
        --task train \
        --train_architecture full \
        --dataset_path data/phoenix2014 \
        --output_path ./checkpoints/stage2 \
        --dit_path path/to/diffusion_pytorch_model.safetensors \
        --steps_per_epoch 500 \
        --max_epochs 1000 \
        --learning_rate 4e-5 \
        --accumulate_grad_batches 1 \
        --use_gradient_checkpointing \
        --dataloader_num_workers 8 \
        --control_layers 15
    • The --control_layers flag determines how many transformer blocks receive ControlNeXt features; the default (15) keeps most weights frozen, lowering memory (~26 GB on a single GPU).
  3. Stage 3 – Control Decay (Robustness Fine-tuning)

    • Freeze ControlNeXt, continue fine-tuning LoRA while randomly dropping modalities according to a linear decay schedule (p_m(e)=max(0.1, 1.0−α·max(0,e−e_stable))).
    • This stage teaches the model to generate plausible sign videos when only a subset of controls (or only text) is available at inference time.
    • Implement the decay by masking control inputs inside the training script or by toggling modality flags sampled from the decay schedule.
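A minimal sketch of the Stage 3 decay schedule and modality masking, following the formula above. The specific values of `alpha` and `e_stable` below are illustrative placeholders, not the paper's settings, and the function names are hypothetical:

```python
import random

def keep_probability(epoch, alpha=0.05, e_stable=10, floor=0.1):
    """Linearly decayed probability that a modality's control signal is kept
    at a given epoch: p_m(e) = max(floor, 1.0 - alpha * max(0, e - e_stable))."""
    return max(floor, 1.0 - alpha * max(0, epoch - e_stable))

def sample_modality_mask(epoch, modalities=("pose", "flow", "depth"), rng=random):
    """Independently decide, per modality, whether its control input is kept
    this training step; dropped modalities would be zero-masked before being
    fed to the ControlNeXt branches."""
    p = keep_probability(epoch)
    return {m: rng.random() < p for m in modalities}
```

Before `e_stable` every modality is always present; afterwards the keep probability decays linearly to the 0.1 floor, so the model still occasionally sees each control while learning to cope without it.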

Inference

After training, use the DiffSynth pipeline to generate videos with hierarchical control:

  • Text-only sampling (LoRA-guided)

    python SignControl/examples/wanvideo/wan_1.3b_text_to_video.py

    Customize the prompt, negative prompt, and sampling parameters inside the script (or parameterize with CLI flags) to produce single-sentence weather forecasts, news reports, or signer variations.

  • Controlled sampling

    1. Prepare control artifacts (pose + flow + depth) for a reference video.
    2. Provide the control tensors/frames to the pipeline (the ControlNeXt modules expect them at 81 frames and 3 channels per modality).
    3. Optionally simulate inference uncertainty by masking one or more modalities (e.g., drop depth) to trigger the Control Decay behavior learned during training.

The inference pipeline uses DiffSynth's WanVideoPipeline and can be extended to evaluate metrics such as BLEU, ROUGE-L, SSIM, LPIPS, and FVD against Phoenix-2014 ground truth, matching the paper's evaluation table.
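Step 3 of controlled sampling (masking a modality to exercise the Control Decay behavior) can be sketched as below. This is a hypothetical helper, not the pipeline's API; the dict-of-arrays input format and the `(frames, channels, H, W)` shape are assumptions based on the 81-frame, 3-channel description above.

```python
import numpy as np

def mask_modalities(controls, drop=("depth",)):
    """Zero out selected control streams before handing them to the pipeline,
    simulating missing controls at inference time. `controls` maps modality
    name -> array, e.g. of shape (81 frames, 3 channels, H, W)."""
    return {
        name: np.zeros_like(arr) if name in drop else arr
        for name, arr in controls.items()
    }
```

For example, `mask_modalities(controls, drop=("depth", "flow"))` leaves only pose guidance active, which is the regime Stage 3 trains the model to handle gracefully.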

Model Configurations

  • Wan2.1-T2V-1.3B/ – Configuration files and assets for the 1.3B parameter model

    • Recommended for development and testing with lower memory requirements
    • Suitable for single GPU training and inference
  • Wan2.1-T2V-14B/ – Configuration files for the 14B parameter model

    • Higher quality results as reported in the paper
    • Requires multi-GPU setup for training and inference

Notes

  • Keep the signcontrol-env environment active when running scripts; the training commands expect PyTorch + GPU-enabled CUDA libraries.
  • The training script SignControl/examples/wanvideo/train_wan_t2v.py supports multiple tasks including data preprocessing, LoRA training, and full model training.
  • Model checkpoints and preprocessed data should be stored outside the repository to avoid large file commits.
  • The full SignControl paper is available in SignControl_paper.pdf. Consult section III and the appendix for dataset details, evaluation metrics, and ablation studies on multi-granular control and decay schedules.
