---
title: GestureLSM Demo
emoji: 🕺
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 4.42.0
app_file: hf_space/app.py
pinned: false
---
# GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [ICCV 2025]
## Release Plans
- Inference Code
- Pretrained Models
- A web demo
- Training Code
- Clean Code to make it look nicer
- Support for MeanFlow
- Unified training and testing pipeline
- MeanFlow Training Code (Coming Soon)
- Merge with Intentional-Gesture
## Code Updates
**Latest Update:** The codebase has been cleaned and restructured. For legacy or historical information, please check out the `old` branch.
New Features:

- Added MeanFlow model support
- Unified training and testing pipeline using `train.py`
- New configuration files in the `configs_new/` directory
- Updated checkpoint files with improved performance
## Installation

### Build Environment
```bash
conda create -n gesturelsm python=3.12
conda activate gesturelsm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
bash demo/install_mfa.sh
```
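
To verify the environment before moving on, a quick sanity check (a minimal snippet, not part of the repo):

```python
import torch
import torchaudio
import torchvision

# Confirm the pinned versions installed correctly and CUDA is visible.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```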
## Code Structure
Understanding the codebase structure will help you navigate and customize the project effectively.
```
GestureLSM/
├── 📁 configs_new/              # New unified configuration files
│   ├── diffusion_rvqvae_128.yaml    # Diffusion model config
│   ├── shortcut_rvqvae_128.yaml     # Shortcut model config
│   └── meanflow_rvqvae_128.yaml     # MeanFlow model config
├── 📁 configs/                  # Legacy configuration files (deprecated)
├── 📁 ckpt/                     # Pretrained model checkpoints
│   ├── new_540_diffusion.bin        # Diffusion model weights
│   ├── shortcut_reflow.bin          # Shortcut model weights
│   ├── meanflow.pth                 # MeanFlow model weights
│   └── net_300000_*.pth             # RVQ-VAE model weights
├── 📁 models/                   # Model implementations
│   ├── Diffusion.py                 # Diffusion model
│   ├── LSM.py                       # Latent Shortcut Model
│   ├── MeanFlow.py                  # MeanFlow model
│   ├── 📁 layers/                   # Neural network layers
│   ├── 📁 vq/                       # Vector quantization modules
│   └── 📁 utils/                    # Model utilities
├── 📁 dataloaders/              # Data loading and preprocessing
│   ├── beat_sep_lower.py            # Main dataset loader
│   ├── 📁 pymo/                     # Motion processing library
│   └── 📁 utils/                    # Data utilities
├── 📁 trainer/                  # Training framework
│   ├── base_trainer.py              # Base trainer class
│   └── generative_trainer.py        # Generative model trainer
├── 📁 utils/                    # General utilities
│   ├── config.py                    # Configuration management
│   ├── metric.py                    # Evaluation metrics
│   └── rotation_conversions.py      # Rotation utilities
├── 📁 demo/                     # Demo and visualization
│   ├── examples/                    # Sample audio files
│   └── install_mfa.sh               # MFA installation script
├── 📁 datasets/                 # Dataset storage
│   ├── BEAT_SMPL/                   # Original BEAT dataset
│   ├── beat_cache/                  # Preprocessed cache
│   └── hub/                         # SMPL models and pretrained weights
├── 📁 outputs/                  # Training outputs and logs
│   └── weights/                     # Saved model weights
├── train.py                     # Unified training/testing script
├── demo.py                      # Web demo script
├── rvq_beatx_train.py           # RVQ-VAE training script
└── requirements.txt             # Python dependencies
```
## 🔧 Key Components

### Model Architecture

- `models/Diffusion.py`: Denoising diffusion model for high-quality generation
- `models/LSM.py`: Latent Shortcut Model for fast inference
- `models/MeanFlow.py`: Flow-based model for single-step generation
- `models/vq/`: Vector quantization modules for latent space compression
### Configuration System

- `configs_new/`: New unified configuration files for all models
- `configs/`: Legacy configuration files (deprecated)
- Each config file contains model parameters, training settings, and data paths
### Data Pipeline

- `dataloaders/beat_sep_lower.py`: Main dataset loader for the BEAT dataset
- `dataloaders/pymo/`: Motion processing library for gesture data
- `datasets/beat_cache/`: Preprocessed data cache for faster loading
### Training Framework

- `train.py`: Unified script for training and testing all models
- `trainer/`: Training framework with base and generative trainers
- `optimizers/`: Optimizer and scheduler implementations
### Utilities

- `utils/config.py`: Configuration management and validation
- `utils/metric.py`: Evaluation metrics (FGD, etc.)
- `utils/rotation_conversions.py`: 3D rotation utilities
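
For intuition on the FGD metric mentioned above: FGD (Fréchet Gesture Distance) follows the same Fréchet formulation as FID, computed over gesture feature embeddings. A minimal sketch of that formulation, assuming two arrays of extracted features (the repo's own implementation lives in `utils/metric.py` and may differ in details):

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID-style Frechet distance between two sets of gesture features.

    Generic illustration only; feature extraction and numerical details
    in utils/metric.py may differ.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```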
## Getting Started with the Code

- For Training: Use `train.py` with configs from `configs_new/`
- For Inference: Use `demo.py` for the web interface, or `train.py --mode test`
- For Customization: Modify config files in the `configs_new/` directory (see the sketch below)
- For New Models: Add the model implementation in the `models/` directory
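
As referenced in the customization step above, a quick way to inspect a config before editing it (a minimal sketch; the actual key names depend on the file contents):

```python
import yaml

# List the top-level sections of a training config before customizing it.
with open("configs_new/shortcut_rvqvae_128.yaml") as f:
    cfg = yaml.safe_load(f)
print(sorted(cfg.keys()))
```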
## Results

This table compares the 1-speaker and all-speaker settings. RAG-Gesture refers to "Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis", accepted at CVPR 2025. The 1-speaker statistics use speaker ID 2 ('scott') to stay consistent with previous SOTA methods. The RAG-Gesture numbers are copied directly from its repo and differ from the stats in the current paper.
## Important Notes

### Model Performance

- The statistics reported in the paper are based on the 1-speaker setting with speaker ID 2 ('scott'), to be consistent with previous SOTA methods.
- The pretrained models (RVQ-VAEs, Diffusion, Shortcut, MeanFlow) are trained on 1-speaker.
- If you want the all-speaker setting, modify the config files to include all speaker IDs.
- April 16, 2025: updated the pretrained models (RVQ-VAEs, Shortcut) to include all speakers.
- No hyperparameter tuning was done for all-speaker; the same settings as 1-speaker are used.

### Model Design Choices

- No speaker embedding is included, so the model can generate gestures for novel speakers.
- No gesture type information is used in the current version. This is intentional: gesture types are typically unknown for novel speakers and settings, which makes this approach more realistic for real-world applications.
- If you want better FGD scores, you can try adding gesture type information.
### Code Structure

- Current Version: Clean, unified codebase with MeanFlow support
- Legacy Code: Available in the `old` branch for historical reference
- Accepted to ICCV 2025 - Thanks to all co-authors!
## Download Models

### Pretrained Models (Updated)
```bash
# Option 1: From Google Drive
# Download the pretrained models (Diffusion + Shortcut + MeanFlow + RVQ-VAEs)
gdown https://drive.google.com/drive/folders/1OfYWWJbaXal6q7LttQlYKWAy0KTwkPRw?usp=drive_link -O ./ckpt --folder

# Option 2: From Hugging Face Hub
huggingface-cli download pliu23/GestureLSM --local-dir ./ckpt

# Download the SMPL model
gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./datasets/hub --folder
```
### Available Checkpoints

- Diffusion Model: `ckpt/new_540_diffusion.bin`
- Shortcut Model: `ckpt/shortcut_reflow.bin`
- MeanFlow Model: `ckpt/meanflow.pth`
- RVQ-VAE Models: `ckpt/net_300000_upper.pth`, `ckpt/net_300000_hands.pth`, `ckpt/net_300000_lower.pth`
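
A generic way to sanity-check the downloaded weights is to load them on CPU (a minimal snippet; whether a file is a raw state dict or a wrapped dictionary depends on the checkpoint):

```python
import torch

# Loading on CPU confirms the file is a readable PyTorch checkpoint.
ckpt = torch.load("ckpt/meanflow.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(f"{len(ckpt)} entries, e.g. {list(ckpt)[:5]}")
```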
## Download Dataset

Required for evaluation and training; not necessary for running the web demo or inference.

### Download BEAT2 Dataset from Hugging Face

The original dataset download method is no longer available. Please use the Hugging Face dataset:

```bash
# Download BEAT2 dataset from Hugging Face
huggingface-cli download H-Liu1997/BEAT2 --repo-type dataset --local-dir ./datasets/BEAT2
```
Dataset Information:
- Source: H-Liu1997/BEAT2 on Hugging Face
- Size: ~4.1K samples
- Format: CSV with train/test splits
- License: Apache 2.0
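
If the `huggingface-cli` binary is unavailable, the same download can be done from Python via `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Mirrors the CLI command above: fetch the dataset repo to the expected path.
snapshot_download(
    repo_id="H-Liu1997/BEAT2",
    repo_type="dataset",
    local_dir="./datasets/BEAT2",
)
```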
### Legacy Download (Deprecated)

The original download method no longer works:

```bash
# This command is deprecated and no longer works
# bash preprocess/bash_raw_cospeech_download.sh
```
## Testing/Evaluation

Note: Requires dataset download for evaluation. For inference only, see the Demo section below.

### Unified Testing Pipeline

The codebase now uses a unified `train.py` script for both training and testing. Use the `--mode test` flag for evaluation:
```bash
# Test Diffusion Model (20 steps)
python train.py --config configs_new/diffusion_rvqvae_128.yaml --ckpt ckpt/new_540_diffusion.bin --mode test

# Test Shortcut Model (2-step reflow)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test

# Test MeanFlow Model (1-step flow-based)
python train.py --config configs_new/meanflow_rvqvae_128.yaml --ckpt ckpt/meanflow.pth --mode test
```
### Model Comparison
| Model | Steps | Description | Key Features | Use Case |
|---|---|---|---|---|
| Diffusion | 20 | Denoising diffusion model | High quality, slower inference | High-quality generation |
| Shortcut | 2-4 | Latent shortcut with reflow | Fast inference, good quality | Recommended for most users |
| MeanFlow | 1 | Flow-based generation | Fastest inference, single step | Real-time applications |
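
For intuition on the step counts above: few-step flow and shortcut samplers integrate a learned velocity field with a coarse Euler schedule, so fewer steps mean proportionally fewer network calls. A conceptual sketch, not the repo's sampler (which also conditions on audio and, for the shortcut model, on step size):

```python
import torch

@torch.no_grad()
def euler_sample(velocity_net, noise: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Integrate a learned velocity field from noise (t=0) toward data (t=1).

    Conceptual few-step flow sampling: num_steps=1 mimics single-step
    generation, while num_steps=20 is the many-step regime.
    """
    x, dt = noise, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t)
    return x
```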
### Performance Comparison

| Model | Steps | FGD ↓ | Beat Constancy ↑ | L1Div ↑ | Inference Speed |
|---|---|---|---|---|---|
| MeanFlow | 1 | 0.4031 | 0.7489 | 12.4631 | Fastest |
| Diffusion | 20 | 0.4100 | 0.7384 | 12.5752 | Slowest |
| Shortcut | 20 | 0.4040 | 0.7144 | 13.4874 | Fast |
| Shortcut-ReFlow | 2 | 0.4104 | 0.7182 | 13.678 | Fast |
Legend:

- FGD (↓): Lower is better; measures gesture quality
- Beat Constancy (↑): Higher is better; measures audio-gesture synchronization
- L1Div (↑): Higher is better; measures diversity of generated gestures

Recommendation: MeanFlow offers the best FGD and Beat Constancy scores with the fastest (single-step) inference; Shortcut-ReFlow trades a slightly higher FGD for the best diversity (L1Div).
### Legacy Testing (Deprecated)

For reference only; use the unified pipeline above instead:

```bash
# Old testing commands (deprecated)
python test.py -c configs/shortcut_rvqvae_128.yaml
python test.py -c configs/shortcut_reflow_test.yaml
python test.py -c configs/diffuser_rvqvae_128.yaml
```
## Train RVQ-VAEs (1-speaker)

Requires dataset download.

```bash
bash train_rvq.sh
```
## Training

Note: Requires dataset download for training.

### Unified Training Pipeline

The codebase now uses a unified `train.py` script for training all models. Use the new configuration files in `configs_new/`:
```bash
# Train Diffusion Model
python train.py --config configs_new/diffusion_rvqvae_128.yaml

# Train Shortcut Model
python train.py --config configs_new/shortcut_rvqvae_128.yaml

# Train MeanFlow Model
python train.py --config configs_new/meanflow_rvqvae_128.yaml
```
### Training Configuration

- Config Directory: Use `configs_new/` for the latest configurations
- Output Directory: Models are saved to `./outputs/weights/`
- Logging: Supports Weights & Biases integration (configure in the config files)
- GPU Support: Configure GPU usage in the config files
### Legacy Training (Deprecated)

For reference only; use the unified pipeline above instead:

```bash
# Old training commands (deprecated)
python train.py -c configs/shortcut_rvqvae_128.yaml
python train.py -c configs/diffuser_rvqvae_128.yaml
```
## Quick Start

### Demo/Inference (No Dataset Required)

```bash
# Run the web demo with the Shortcut model
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```
### Testing with Your Own Data

```bash
# Test with your own audio and text (requires pretrained models)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test
```
## Demo

The demo provides a web interface for gesture generation. It uses the Shortcut model by default for fast inference.

```bash
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```
Features:
- Web-based interface for easy interaction
- Real-time gesture generation
- Support for custom audio and text input
- Visualization of generated gestures
## Acknowledgments

Thanks to SynTalker, EMAGE, and DiffuseStyleGesture; our code partially borrows from them. Please check out these useful repos.
## Citation

If you find our code or paper helpful, please consider citing:

```bibtex
@inproceedings{liu2025gesturelsmlatentshortcutbased,
  title={{GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling}},
  author={Pinxin Liu and Luchuan Song and Junhua Huang and Chenliang Xu},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2025},
}
```
