legacymiles committed on
Commit bfb67e4 · 1 Parent(s): 528371e

add minimal gradio app

Files changed (4)
  1. .gitattributes +13 -0
  2. README.md +12 -255
  3. app.py +15 -0
  4. requirements.txt +1 -17
.gitattributes CHANGED
@@ -1,3 +1,16 @@
+ *.mp4 filter=lfs diff=lfs merge=lfs -text
+ *.mov filter=lfs diff=lfs merge=lfs -text
+ *.webm filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.wav filter=lfs diff=lfs merge=lfs -text
  /diffsynth/tokenizer_configs/hunyuan_video/tokenizer_2/tokenizer.json filter=lfs diff=lfs merge=lfs -text
  /teaser.png filter=lfs diff=lfs merge=lfs -text
  *.model filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,255 +1,12 @@
- # HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
-
-
- **[**[**📄 Paper**](https://arxiv.org/abs/2510.20822)**]**
- **[**[**🌐 Project Page**](https://holo-cine.github.io/)**]**
- **[**[**🤗 Model Weights**](https://huggingface.co/hlwang06/HoloCine/tree/main)**]**
-
-
-
- https://github.com/user-attachments/assets/c4dee993-7c6c-4604-a93d-a8eb09cfd69b
-
-
-
- _**[Yihao Meng<sup>1,2</sup>](https://yihao-meng.github.io/), [Hao Ouyang<sup>2</sup>](https://ken-ouyang.github.io/), [Yue Yu<sup>1,2</sup>](https://bruceyy.com/), [Qiuyu Wang<sup>2</sup>](https://github.com/qiuyu96), [Wen Wang<sup>2,3</sup>](https://github.com/encounter1997), [Ka Leong Cheng<sup>2</sup>](https://felixcheng97.github.io/), <br>[Hanlin Wang<sup>1,2</sup>](https://scholar.google.com/citations?user=0uO4fzkAAAAJ&hl=zh-CN), [Yixuan Li<sup>2,4</sup>](https://yixuanli98.github.io/), [Cheng Chen<sup>2,5</sup>](https://scholar.google.com/citations?user=nNQU71kAAAAJ&hl=zh-CN), [Yanhong Zeng<sup>2</sup>](https://zengyh1900.github.io/), [Yujun Shen<sup>2</sup>](https://shenyujun.github.io/), [Huamin Qu<sup>1</sup>](http://huamin.org/)**_
- <br>
- <sup>1</sup>HKUST, <sup>2</sup>Ant Group, <sup>3</sup>ZJU, <sup>4</sup>CUHK, <sup>5</sup>NTU
-
- # TLDR
- * **What it is:** A text-to-video model that generates full scenes, not just isolated clips.
- * **Key Feature:** It maintains consistency of characters, objects, and style across all shots in a scene.
- * **How it works:** You provide shot-by-shot text prompts, giving you directorial control over the final video.
-
- **We strongly recommend seeing our [demo page](https://holo-cine.github.io/).**
-
- If you enjoyed the videos we created, please consider giving us a star 🌟.
-
- ## 🚀 Open-Source Plan
-
- ### ✅ Released
- * Full inference code
- * `HoloCine-14B-full`
- * `HoloCine-14B-sparse`
-
- ### ⏰ To Be Released
- * `HoloCine-14B-full-l` (For videos longer than 1 minute)
- * `HoloCine-14B-sparse-l` (For videos longer than 1 minute)
- * `HoloCine-5B-full` (For limited-memory users)
- * `HoloCine-5B-sparse` (For limited-memory users)
-
- ### 🗺️ In Planning
- * Support first-frame and key-frame input
- * `HoloCine-audio`
-
- # Setup
- ```shell
- git clone https://github.com/yihao-meng/HoloCine.git
- cd HoloCine
- ```
- # Environment
- We use an environment similar to diffsynth. If you have a diffsynth environment, you can probably reuse it.
- ```shell
- conda create -n HoloCine python=3.10
- pip install -e .
- ```
-
- We use FlashAttention-3 to implement the sparse inter-shot attention. We highly recommend using FlashAttention-3 for its speed, and we provide simple instructions on how to install it.
-
- ```shell
- git clone https://github.com/Dao-AILab/flash-attention.git
- cd flash-attention
- cd hopper
- python setup.py install
- ```
- If you encounter environment problems when installing FlashAttention-3, refer to the official GitHub page: https://github.com/Dao-AILab/flash-attention.
-
- If you cannot install FlashAttention-3, you can use FlashAttention-2 as an alternative; our code will automatically detect the FlashAttention version. It is slower than FlashAttention-3, but it also produces correct results.
-
- If you want to install FlashAttention-2, you can use the following command:
- ```shell
- pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
- ```
-
- # Checkpoint
-
-
- ### Step 1: Download Wan 2.2 VAE and T5
- If you have already downloaded Wan 2.2 14B T2V before, skip this section.
-
- If not, you need the T5 text encoder and the VAE from the original Wan 2.2 repository:
- [https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B)
-
-
- Based on the repository's file structure, you **only** need to download `models_t5_umt5-xxl-enc-bf16.pth` and `Wan2.1_VAE.pth`.
-
- You do **not** need to download the `google`, `high_noise_model`, or `low_noise_model` folders, nor any other files.
-
- #### Recommended Download (CLI)
-
- We recommend using `huggingface-cli` to download only the necessary files. Make sure you have `huggingface_hub` installed (`pip install huggingface_hub`).
-
- This command will download *only* the required T5 and VAE models into the correct directory:
-
- ```bash
- huggingface-cli download Wan-AI/Wan2.2-T2V-A14B \
-   --local-dir checkpoints/Wan2.2-T2V-A14B \
-   --include "models_t5_*.pth" "Wan2.1_VAE.pth"
- ```
-
- #### Manual Download
-
- Alternatively, go to the "Files" tab on the Hugging Face repo and manually download the following two files:
-
- * `models_t5_umt5-xxl-enc-bf16.pth`
- * `Wan2.1_VAE.pth`
-
- Place both files inside a new folder named `checkpoints/Wan2.2-T2V-A14B/`.
-
- ### Step 2: Download HoloCine Model (HoloCine\_dit)
-
- Download our fine-tuned high-noise and low-noise DiT checkpoints from the following link:
-
- **[➡️ Download HoloCine\_dit Model Checkpoints [Here](https://huggingface.co/hlwang06/HoloCine)]**
-
- This download contains four fine-tuned model files: two for the full attention version (`full_high_noise.safetensors`, `full_low_noise.safetensors`) and two for the sparse inter-shot attention version (`sparse_high_noise.safetensors`, `sparse_low_noise.safetensors`). The sparse version is still uploading.
-
- You can choose one version to download, or try both versions if you want.
-
- The full attention version has better performance, so we suggest starting with it. The sparse inter-shot attention version is slightly less stable (but still great in most cases) and faster than the full attention version.
-
- For the full attention version:
- Create a new folder named `checkpoints/HoloCine_dit/full/` and place both the high- and low-noise files inside.
-
- For the sparse attention version:
- Create a new folder named `checkpoints/HoloCine_dit/sparse/` and place both the high- and low-noise files inside.
- ### Step 3: Final Directory Structure
-
- If you downloaded the `full` model, your `checkpoints` directory should look like this:
-
- ```
- checkpoints/
- ├── Wan2.2-T2V-A14B/
- │   ├── models_t5_umt5-xxl-enc-bf16.pth
- │   └── Wan2.1_VAE.pth
- └── HoloCine_dit/
-     └── full/
-         ├── full_high_noise.safetensors
-         └── full_low_noise.safetensors
- ```
- (If you downloaded the `sparse` model, replace `full` with `sparse`.)
-
-
- # Inference
- We release two versions of the model: one uses full attention to model the multi-shot sequence (our default), and the other uses sparse inter-shot attention.
-
- To use the full attention version:
-
- ```shell
- python HoloCine_inference_full_attention.py
- ```
-
- To use the sparse inter-shot attention version:
-
- ```shell
- python HoloCine_inference_sparse_attention.py
- ```
-
-
-
-
- ## Prompt Format
-
- To achieve precise control over the content of each shot, our prompt follows a fixed format. Our inference script is designed to be flexible, and we support two ways to input the text prompt.
-
- ### Choice 1: Structured Input (Recommended if you want to test your own samples)
-
- This is the easiest way to create new multi-shot prompts. You provide the components as separate arguments inside the script, and our helper function will format them correctly.
-
- * `global_caption`: A string describing the entire scene, characters, and setting.
- * `shot_captions`: A *list* of strings, where each string describes one shot in sequential order.
- * `num_frames`: The total number of frames for the video (the default is `241`, as we train on this sequence length).
- * `shot_cut_frames`: (Optional) A list of frame numbers where you want cuts to happen. By default, the script will automatically calculate evenly spaced cuts. If you want to customize it, make sure the cuts indicated by `shot_cut_frames` align with `shot_captions`.
-
- **Example (inside `HoloCine_inference_full_attention.py`):**
-
- ```python
- run_inference(
-     pipe=pipe,
-     negative_prompt=scene_negative_prompt,
-     output_path="test_structured_output.mp4",
-
-     # Choice 1 inputs
-     global_caption="The scene is set in a lavish, 1920s Art Deco ballroom during a masquerade party. [character1] is a mysterious woman with a sleek bob, wearing a sequined silver dress and an ornate feather mask. [character2] is a dapper gentleman in a black tuxedo, his face half-hidden by a simple black domino mask. The environment is filled with champagne fountains, a live jazz band, and dancing couples in extravagant costumes. This scene contains 5 shots.",
-     shot_captions=[
-         "Medium shot of [character1] standing by a pillar, observing the crowd, a champagne flute in her hand.",
-         "Close-up of [character2] watching her from across the room, a look of intrigue on his visible features.",
-         "Medium shot as [character2] navigates the crowd and approaches [character1], offering a polite bow.",
-         "Close-up on [character1]'s eyes through her mask, as they crinkle in a subtle, amused smile.",
-         "A stylish medium two-shot of them standing together, the swirling party out of focus behind them, as they begin to converse."
-
-     ],
-     num_frames=241
- )
- ```
-
- https://github.com/user-attachments/assets/10dba757-27dc-4f65-8fc3-b396cf466063
-
- ### Choice 2: Raw String Input
-
- This mode allows you to provide the full, concatenated prompt string, just like in our original script. This is useful if you want to reuse our provided prompts.
-
- The format must be exact:
- `[global caption] ... [per shot caption] ... [shot cut] ... [shot cut] ...`
-
- **Example (inside `HoloCine_inference_full_attention.py`):**
-
- ```python
- run_inference(
-     pipe=pipe,
-     negative_prompt=scene_negative_prompt,
-     output_path="test_raw_string_output.mp4",
-
-     # Choice 2 inputs
-     prompt="[global caption] The scene features a young painter, [character1], with paint-smudged cheeks and intense, focused eyes. Her hair is tied up messily. The setting is a bright, sun-drenched art studio with large windows, canvases, and the smell of oil paint. This scene contains 6 shots. [per shot caption] Medium shot of [character1] standing back from a large canvas, brush in hand, critically observing her work. [shot cut] Close-up of her hand holding the brush, dabbing it thoughtfully onto a palette of vibrant colors. [shot cut] Extreme close-up of her eyes, narrowed in concentration as she studies the canvas. [shot cut] Close-up on the canvas, showing a detailed, textured brushstroke being slowly applied. [shot cut] Medium close-up of [character1]'s face, a small, satisfied smile appears as she finds the right color. [shot cut] Over-the-shoulder shot showing her add a final, delicate highlight to the painting.",
-
-
-     num_frames=241,
-     shot_cut_frames=[37, 73, 113, 169, 205]
-
- )
- ```
- https://github.com/user-attachments/assets/fdc12ff1-cf1b-4250-b7c9-a32e4d65731f
-
- ## Examples
-
- We provide several commented-out examples directly within the `HoloCine_inference_full_attention.py` and `HoloCine_inference_sparse_attention.py` scripts. You can uncomment any of these examples to try them out immediately.
-
- If you want to quickly test the model's stability on your own text prompt and don't want to design it yourself, you can use an LLM such as Gemini 2.5 Pro to generate a text prompt in our format. Based on our tests, the model is quite stable across diverse genres of text prompts.
-
- # Community Support
-
- ## ComfyUI
- Thanks to [Dango233](https://github.com/Dango233) for implementing a ComfyUI node for HoloCine (kijai/ComfyUI-WanVideoWrapper#1566). This part is still under testing, so feel free to leave an issue if you encounter any problems.
-
-
- # Citation
-
- If you find this work useful, please consider citing our paper:
-
- ```bibtex
- @article{meng2025holocine,
-   title={HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives},
-   author={Meng, Yihao and Ouyang, Hao and Yu, Yue and Wang, Qiuyu and Wang, Wen and Cheng, Ka Leong and Wang, Hanlin and Li, Yixuan and Chen, Cheng and Zeng, Yanhong and Shen, Yujun and Qu, Huamin},
-   journal={arXiv preprint arXiv:2510.20822},
-   year={2025}
- }
- ```
-
- # License
-
- This project is licensed under CC BY-NC-SA 4.0 ([Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/)).
-
- The code is provided for academic research purposes only.
-
- For any questions, please contact [email protected].
-
 
+ ---
+ title: HoloCine Demo
+ emoji: 🎬
+ colorFrom: indigo
+ colorTo: purple
+ sdk: gradio
+ app_file: app.py
+ pinned: false
+ ---
+
+ # HoloCine Demo
+ This is a minimal Gradio app deployed on Hugging Face Spaces.
app.py ADDED
@@ -0,0 +1,15 @@
+ import gradio as gr
+
+ def hello(name):
+     return f"Hello, {name}! 🎬 HoloCine Space is live."
+
+ demo = gr.Interface(
+     fn=hello,
+     inputs=gr.Textbox(label="Your name"),
+     outputs=gr.Textbox(label="Response"),
+     title="HoloCine Demo",
+     description="Minimal Gradio app running on Hugging Face Spaces."
+ )
+
+ if __name__ == "__main__":
+     demo.launch()
requirements.txt CHANGED
@@ -1,17 +1 @@
- torch>=2.0.0
- torchvision
- cupy-cuda12x
- transformers
- controlnet-aux==0.0.7
- imageio
- imageio[ffmpeg]
- safetensors
- einops
- sentencepiece
- protobuf
- modelscope
- ftfy
- pynvml
- pandas
- accelerate
-
+ gradio
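
For reference, a minimal sketch of how the `hello` endpoint added in `app.py` could be exercised once the app is running (for example, after `pip install -r requirements.txt` and `python app.py`). The local URL and the `/predict` endpoint name are assumptions based on Gradio's defaults for a `gr.Interface`, not part of this commit:

```python
# Sketch: query the running Gradio app with gradio_client.
# Assumes the app is serving on Gradio's default local address;
# a hosted Space id could be passed to Client() instead.
from gradio_client import Client

client = Client("http://127.0.0.1:7860")              # assumed local URL
result = client.predict("Ada", api_name="/predict")   # default endpoint name for gr.Interface
print(result)  # expected: "Hello, Ada! 🎬 HoloCine Space is live."
```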