---
title: GestureLSM Demo
emoji: 🕺
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 4.42.0
app_file: hf_space/app.py
pinned: false
---
# GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [ICCV 2025]
## Release Plans
- Inference Code
- Pretrained Models
- A web demo
- Training Code
- Clean Code to make it look nicer
- Support for MeanFlow
- Unified training and testing pipeline
- MeanFlow Training Code (Coming Soon)
- Merge with Intentional-Gesture
## Code Updates
**Latest Update:** The codebase has been cleaned and restructured. For legacy or historical information, please check out the `old` branch.
New Features:

- Added MeanFlow model support
- Unified training and testing pipeline using `train.py`
- New configuration files in the `configs_new/` directory
- Updated checkpoint files with improved performance
## Installation

### Build Environment
```bash
conda create -n gesturelsm python=3.12
conda activate gesturelsm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
bash demo/install_mfa.sh
```
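
To verify the environment before moving on, a quick sanity check (a minimal snippet, not part of the repo):

```python
import torch
import torchaudio
import torchvision

# Confirm the pinned versions installed correctly and CUDA is visible.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```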
## Code Structure
Understanding the codebase structure will help you navigate and customize the project effectively.
```
GestureLSM/
├── 📁 configs_new/              # New unified configuration files
│   ├── diffusion_rvqvae_128.yaml    # Diffusion model config
│   ├── shortcut_rvqvae_128.yaml     # Shortcut model config
│   └── meanflow_rvqvae_128.yaml     # MeanFlow model config
├── 📁 configs/                  # Legacy configuration files (deprecated)
├── 📁 ckpt/                     # Pretrained model checkpoints
│   ├── new_540_diffusion.bin        # Diffusion model weights
│   ├── shortcut_reflow.bin          # Shortcut model weights
│   ├── meanflow.pth                 # MeanFlow model weights
│   └── net_300000_*.pth             # RVQ-VAE model weights
├── 📁 models/                   # Model implementations
│   ├── Diffusion.py                 # Diffusion model
│   ├── LSM.py                       # Latent Shortcut Model
│   ├── MeanFlow.py                  # MeanFlow model
│   ├── 📁 layers/                   # Neural network layers
│   ├── 📁 vq/                       # Vector quantization modules
│   └── 📁 utils/                    # Model utilities
├── 📁 dataloaders/              # Data loading and preprocessing
│   ├── beat_sep_lower.py            # Main dataset loader
│   ├── 📁 pymo/                     # Motion processing library
│   └── 📁 utils/                    # Data utilities
├── 📁 trainer/                  # Training framework
│   ├── base_trainer.py              # Base trainer class
│   └── generative_trainer.py        # Generative model trainer
├── 📁 utils/                    # General utilities
│   ├── config.py                    # Configuration management
│   ├── metric.py                    # Evaluation metrics
│   └── rotation_conversions.py      # Rotation utilities
├── 📁 demo/                     # Demo and visualization
│   ├── examples/                    # Sample audio files
│   └── install_mfa.sh               # MFA installation script
├── 📁 datasets/                 # Dataset storage
│   ├── BEAT_SMPL/                   # Original BEAT dataset
│   ├── beat_cache/                  # Preprocessed cache
│   └── hub/                         # SMPL models and pretrained weights
├── 📁 outputs/                  # Training outputs and logs
│   └── weights/                     # Saved model weights
├── train.py                     # Unified training/testing script
├── demo.py                      # Web demo script
├── rvq_beatx_train.py           # RVQ-VAE training script
└── requirements.txt             # Python dependencies
```
## 🔧 Key Components

### Model Architecture

- `models/Diffusion.py`: Denoising diffusion model for high-quality generation
- `models/LSM.py`: Latent Shortcut Model for fast inference
- `models/MeanFlow.py`: Flow-based model for single-step generation
- `models/vq/`: Vector quantization modules for latent space compression
### Configuration System

- `configs_new/`: New unified configuration files for all models
- `configs/`: Legacy configuration files (deprecated)
- Each config file contains model parameters, training settings, and data paths
### Data Pipeline

- `dataloaders/beat_sep_lower.py`: Main dataset loader for the BEAT dataset
- `dataloaders/pymo/`: Motion processing library for gesture data
- `datasets/beat_cache/`: Preprocessed data cache for faster loading
### Training Framework

- `train.py`: Unified script for training and testing all models
- `trainer/`: Training framework with base and generative trainers
- `optimizers/`: Optimizer and scheduler implementations
### Utilities

- `utils/config.py`: Configuration management and validation
- `utils/metric.py`: Evaluation metrics (FGD, etc.)
- `utils/rotation_conversions.py`: 3D rotation utilities
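
For intuition on the FGD metric mentioned above: FGD (Fréchet Gesture Distance) follows the same Fréchet formulation as FID, computed over gesture feature embeddings. A minimal sketch of that formulation, assuming two arrays of extracted features (the repo's own implementation lives in `utils/metric.py` and may differ in details):

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID-style Frechet distance between two sets of gesture features.

    Generic illustration only; feature extraction and numerical details
    in utils/metric.py may differ.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```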
## Getting Started with the Code

- For Training: Use `train.py` with configs from `configs_new/`
- For Inference: Use `demo.py` for the web interface, or `train.py --mode test`
- For Customization: Modify config files in the `configs_new/` directory (see the sketch below)
- For New Models: Add the model implementation in the `models/` directory
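
As referenced in the customization step above, a quick way to inspect a config before editing it (a minimal sketch; the actual key names depend on the file contents):

```python
import yaml

# List the top-level sections of a training config before customizing it.
with open("configs_new/shortcut_rvqvae_128.yaml") as f:
    cfg = yaml.safe_load(f)
print(sorted(cfg.keys()))
```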
## Results

This table compares the 1-speaker and all-speaker settings. RAG-Gesture refers to "Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis", accepted at CVPR 2025. The 1-speaker statistics use speaker ID 2 ('scott') to stay consistent with previous SOTA methods. The RAG-Gesture numbers are copied directly from its repo and differ from the stats in the current paper.
## Important Notes

### Model Performance

- The statistics reported in the paper are based on the 1-speaker setting with speaker ID 2 ('scott'), to be consistent with previous SOTA methods.
- The pretrained models (RVQ-VAEs, Diffusion, Shortcut, MeanFlow) are trained on 1-speaker.
- If you want the all-speaker setting, modify the config files to include all speaker IDs.
- April 16, 2025: updated the pretrained models (RVQ-VAEs, Shortcut) to include all speakers.
- No hyperparameter tuning was done for all-speaker; the same settings as 1-speaker are used.

### Model Design Choices

- No speaker embedding is included, so the model can generate gestures for novel speakers.
- No gesture type information is used in the current version. This is intentional: gesture types are typically unknown for novel speakers and settings, which makes this approach more realistic for real-world applications.
- If you want better FGD scores, you can try adding gesture type information.
### Code Structure

- Current Version: Clean, unified codebase with MeanFlow support
- Legacy Code: Available in the `old` branch for historical reference
- Accepted to ICCV 2025 - Thanks to all co-authors!
## Download Models

### Pretrained Models (Updated)
```bash
# Option 1: From Google Drive
# Download the pretrained models (Diffusion + Shortcut + MeanFlow + RVQ-VAEs)
gdown https://drive.google.com/drive/folders/1OfYWWJbaXal6q7LttQlYKWAy0KTwkPRw?usp=drive_link -O ./ckpt --folder

# Option 2: From Hugging Face Hub
huggingface-cli download pliu23/GestureLSM --local-dir ./ckpt

# Download the SMPL model
gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./datasets/hub --folder
```
### Available Checkpoints

- Diffusion Model: `ckpt/new_540_diffusion.bin`
- Shortcut Model: `ckpt/shortcut_reflow.bin`
- MeanFlow Model: `ckpt/meanflow.pth`
- RVQ-VAE Models: `ckpt/net_300000_upper.pth`, `ckpt/net_300000_hands.pth`, `ckpt/net_300000_lower.pth`
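
A generic way to sanity-check the downloaded weights is to load them on CPU (a minimal snippet; whether a file is a raw state dict or a wrapped dictionary depends on the checkpoint):

```python
import torch

# Loading on CPU confirms the file is a readable PyTorch checkpoint.
ckpt = torch.load("ckpt/meanflow.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(f"{len(ckpt)} entries, e.g. {list(ckpt)[:5]}")
```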
## Download Dataset

Required for evaluation and training; not necessary for running the web demo or inference.

### Download BEAT2 Dataset from Hugging Face

The original dataset download method is no longer available. Please use the Hugging Face dataset:

```bash
# Download BEAT2 dataset from Hugging Face
huggingface-cli download H-Liu1997/BEAT2 --repo-type dataset --local-dir ./datasets/BEAT2
```
Dataset Information:
- Source: H-Liu1997/BEAT2 on Hugging Face
- Size: ~4.1K samples
- Format: CSV with train/test splits
- License: Apache 2.0
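
If the `huggingface-cli` binary is unavailable, the same download can be done from Python via `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Mirrors the CLI command above: fetch the dataset repo to the expected path.
snapshot_download(
    repo_id="H-Liu1997/BEAT2",
    repo_type="dataset",
    local_dir="./datasets/BEAT2",
)
```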
### Legacy Download (Deprecated)

The original download method no longer works:

```bash
# This command is deprecated and no longer works
# bash preprocess/bash_raw_cospeech_download.sh
```
## Testing/Evaluation

Note: Requires dataset download for evaluation. For inference only, see the Demo section below.

### Unified Testing Pipeline

The codebase now uses a unified `train.py` script for both training and testing. Use the `--mode test` flag for evaluation:
```bash
# Test Diffusion Model (20 steps)
python train.py --config configs_new/diffusion_rvqvae_128.yaml --ckpt ckpt/new_540_diffusion.bin --mode test

# Test Shortcut Model (2-step reflow)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test

# Test MeanFlow Model (1-step flow-based)
python train.py --config configs_new/meanflow_rvqvae_128.yaml --ckpt ckpt/meanflow.pth --mode test
```
### Model Comparison
| Model | Steps | Description | Key Features | Use Case |
|---|---|---|---|---|
| Diffusion | 20 | Denoising diffusion model | High quality, slower inference | High-quality generation |
| Shortcut | 2-4 | Latent shortcut with reflow | Fast inference, good quality | Recommended for most users |
| MeanFlow | 1 | Flow-based generation | Fastest inference, single step | Real-time applications |
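
For intuition on the step counts above: few-step flow and shortcut samplers integrate a learned velocity field with a coarse Euler schedule, so fewer steps mean proportionally fewer network calls. A conceptual sketch, not the repo's sampler (which also conditions on audio and, for the shortcut model, on step size):

```python
import torch

@torch.no_grad()
def euler_sample(velocity_net, noise: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Integrate a learned velocity field from noise (t=0) toward data (t=1).

    Conceptual few-step flow sampling: num_steps=1 mimics single-step
    generation, while num_steps=20 is the many-step regime.
    """
    x, dt = noise, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t)
    return x
```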
### Performance Comparison

| Model | Steps | FGD ↓ | Beat Constancy ↑ | L1Div ↑ | Inference Speed |
|---|---|---|---|---|---|
| MeanFlow | 1 | 0.4031 | 0.7489 | 12.4631 | Fastest |
| Diffusion | 20 | 0.4100 | 0.7384 | 12.5752 | Slowest |
| Shortcut | 20 | 0.4040 | 0.7144 | 13.4874 | Fast |
| Shortcut-ReFlow | 2 | 0.4104 | 0.7182 | 13.678 | Fast |
Legend:

- FGD (↓): Lower is better; measures gesture quality
- Beat Constancy (↑): Higher is better; measures audio-gesture synchronization
- L1Div (↑): Higher is better; measures diversity of generated gestures

Recommendation: MeanFlow offers the best FGD and Beat Constancy scores with the fastest (single-step) inference; Shortcut-ReFlow trades a slightly higher FGD for the best diversity (L1Div).
### Legacy Testing (Deprecated)

For reference only; use the unified pipeline above instead:

```bash
# Old testing commands (deprecated)
python test.py -c configs/shortcut_rvqvae_128.yaml
python test.py -c configs/shortcut_reflow_test.yaml
python test.py -c configs/diffuser_rvqvae_128.yaml
```
## Train RVQ-VAEs (1-speaker)

Requires dataset download.

```bash
bash train_rvq.sh
```
## Training

Note: Requires dataset download for training.

### Unified Training Pipeline

The codebase now uses a unified `train.py` script for training all models. Use the new configuration files in `configs_new/`:
```bash
# Train Diffusion Model
python train.py --config configs_new/diffusion_rvqvae_128.yaml

# Train Shortcut Model
python train.py --config configs_new/shortcut_rvqvae_128.yaml

# Train MeanFlow Model
python train.py --config configs_new/meanflow_rvqvae_128.yaml
```
### Training Configuration

- Config Directory: Use `configs_new/` for the latest configurations
- Output Directory: Models are saved to `./outputs/weights/`
- Logging: Supports Weights & Biases integration (configure in the config files)
- GPU Support: Configure GPU usage in the config files
### Legacy Training (Deprecated)

For reference only; use the unified pipeline above instead:

```bash
# Old training commands (deprecated)
python train.py -c configs/shortcut_rvqvae_128.yaml
python train.py -c configs/diffuser_rvqvae_128.yaml
```
## Quick Start

### Demo/Inference (No Dataset Required)

```bash
# Run the web demo with the Shortcut model
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```
### Testing with Your Own Data

```bash
# Test with your own audio and text (requires pretrained models)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test
```
## Demo

The demo provides a web interface for gesture generation. It uses the Shortcut model by default for fast inference.

```bash
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```
Features:
- Web-based interface for easy interaction
- Real-time gesture generation
- Support for custom audio and text input
- Visualization of generated gestures
## Acknowledgments

Thanks to SynTalker, EMAGE, and DiffuseStyleGesture; our code partially borrows from them. Please check out these useful repos.
## Citation

If you find our code or paper helpful, please consider citing:

```bibtex
@inproceedings{liu2025gesturelsmlatentshortcutbased,
  title={{GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling}},
  author={Pinxin Liu and Luchuan Song and Junhua Huang and Chenliang Xu},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2025},
}
```
