Merge pull request #115 from shlee007/dev/docker

Unify docker images and update inference scripts.
Tencent · Dec 12, 2024 · fae0bec
2 parents 3ef9a88 + 80a1a69
commit fae0bec

Showing 8 changed files with 104 additions and 172 deletions.
diff --git a/README.md b/README.md
@@ -85,7 +85,6 @@ The video is heavily compressed to comply with GitHub's policy. The high-quality
     - [Run a Gradio Server](#run-a-gradio-server)
     - [More Configurations](#more-configurations)
   - [🚀 Parallel Inference on Multiple GPUs by xDiT](#-parallel-inference-on-multiple-gpus-by-xdit)
-    - [Install Dependencies Compatible with xDiT](#install-dependencies-compatible-with-xdit)
     - [Using Command Line](#using-command-line-1)
   - [🔗 BibTeX](#-bibtex)
   - [🧩 Projects that use HunyuanVideo](#-projects-that-use-hunyuanvideo)
@@ -201,24 +200,32 @@ cd HunyuanVideo
 
 ### Installation Guide for Linux
 
-We provide an `environment.yml` file for setting up a Conda environment.
-Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).
-
 We recommend CUDA versions 12.4 or 11.8 for the manual installation.
 
+Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).
+
 ```shell
-# 1. Prepare conda environment
-conda env create -f environment.yml
+# 1. Create conda environment
+conda create -n HunyuanVideo python==3.10.9
 
 # 2. Activate the environment
 conda activate HunyuanVideo
 
-# 3. Install pip dependencies
+# 3. Install PyTorch and other dependencies using conda
+# For CUDA 11.8
+conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
+# For CUDA 12.4
+conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
+
+# 4. Install pip dependencies
 python -m pip install -r requirements.txt
 
-# 4. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
+# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
 python -m pip install ninja
 python -m pip install git+https://github.com/Dao-AILab/[email protected]
+
+# 6. Install xDiT for parallel inference (It is recommended to use torch 2.4.0 and flash-attn 2.6.3)
+python -m pip install xfuser==0.4.0
 ```
 
 If you run into a floating point exception (core dump) on a specific GPU type, you may try the following solutions:
@@ -230,9 +237,12 @@ export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/
 
 # Option 2: Force the explicit use of the CUDA 11.8 compiled version of PyTorch and all the other packages
 pip uninstall -r requirements.txt  # uninstall all packages
+pip uninstall -y xfuser
 pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
 pip install -r requirements.txt
-python -m pip install git+https://github.com/Dao-AILab/[email protected]
+pip install ninja
+pip install git+https://github.com/Dao-AILab/[email protected]
+pip install xfuser==0.4.0
 ```
 
 Additionally, HunyuanVideo also provides a pre-built Docker image. Use the following command to pull and run the docker image.
@@ -306,26 +316,6 @@ We list some more useful configurations for easy usage:
 [xDiT](https://github.com/xdit-project/xDiT) is a Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters.
 It has successfully provided low-latency parallel inference solutions for a variety of DiTs models, including mochi-1, CogVideoX, Flux.1, SD3, etc. This repo adopted the [Unified Sequence Parallelism (USP)](https://arxiv.org/abs/2405.07719) APIs for parallel inference of the HunyuanVideo model.
 
-### Install Dependencies Compatible with xDiT
-
-```
-# 1. Create a blank conda environment
-conda create -n hunyuanxdit python==3.10.9
-conda activate hunyuanxdit
-
-# 2. Install PyTorch components with CUDA 11.8
-conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0  pytorch-cuda=11.8 -c pytorch -c nvidia
-
-# 3. Install pip dependencies
-python -m pip install -r requirements_xdit.txt
-```
-
-You can skip the above steps and pull the pre-built docker image directly, which is built from [docker/Dockerfile_xDiT](./docker/Dockerfile_xDiT)
-
-```
-docker pull thufeifeibear/hunyuanvideo:latest
-```
-
 ### Using Command Line
 
 For example, to generate a video with 8 GPUs, you can use the following command:

diff --git a/README_zh.md b/README_zh.md
@@ -81,12 +81,11 @@
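   - [🛠️ Installation and Dependencies](#️-安装和依赖)
     - [Installation Guide for Linux](#linux-安装指引)
   - [🧱 Download Pretrained Models](#-下载预训练模型)
-  - [🔑 Inference](#-推理)
+  - [🔑 Single-GPU Inference](#-单卡推理)
     - [Using Command Line](#使用命令行)
     - [Run a Gradio Server](#运行gradio服务)
     - [More Configurations](#更多配置)
   - [🚀 Parallel Inference on Multiple GPUs by xDiT](#-使用-xdit-实现多卡并行推理)
-    - [Install Dependencies Compatible with xDiT](#安装与-xdit-兼容的依赖项)
     - [Using Command Line](#使用命令行-1)
   - [🔗 BibTeX](#-bibtex)
   - [🧩 Projects that use HunyuanVideo](#-使用-hunyuanvideo-的项目)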
@@ -194,46 +193,58 @@ cd HunyuanVideo
 
 ### Installation Guide for Linux
 
-We provide an `environment.yml` file for setting up a Conda environment. Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).
+We recommend CUDA version 12.4 or 11.8.
 
-We use CUDA version 12.4 or 11.8 for inference.
+Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).
 
 ```shell
-# 1. Prepare conda environment
-conda env create -f environment.yml
+# 1. Create conda environment
+conda create -n HunyuanVideo python==3.10.9
 
 # 2. Activate the environment
 conda activate HunyuanVideo
 
-# 3. Install pip dependencies
+# 3. Install PyTorch and other dependencies using conda
+# For CUDA 11.8
+conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
+# For CUDA 12.4
+conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
+
+# 4. Install pip dependencies
 python -m pip install -r requirements.txt
 
-# 4. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
+# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
 python -m pip install ninja
 python -m pip install git+https://github.com/Dao-AILab/[email protected]
+
+# 6. Install xDiT for parallel inference (It is recommended to use torch 2.4.0 and flash-attn 2.6.3)
+python -m pip install xfuser==0.4.0
 ```
 
-If you run into a float point exception (core dump) on specific GPU models, you may try the following fixes:
+If you run into a float point exception (core dump) on specific GPU models, you may try the following fixes:
 
 ```shell
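-# Option 1: Make sure CUDA 12.4, CUBLAS>=12.4.5.8, and CUDNN>=9.00 are installed correctly (or simply use our CUDA 12 docker image)
+# Option 1: Make sure CUDA 12.4, CUBLAS>=12.4.5.8, and CUDNN>=9.00 are installed correctly (or simply use our CUDA 12 docker image)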
 pip install nvidia-cublas-cu12==12.4.5.8
 export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/
 
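-# Option 2: Force the explicit use of the CUDA 11.8 compiled version of PyTorch and all the other packages
+# Option 2: Force the explicit use of the CUDA 11.8 compiled version of PyTorch and all the other packages
 pip uninstall -r requirements.txt  # make sure all dependencies are uninstalled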
+pip uninstall -y xfuser
 pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
 pip install -r requirements.txt
-python -m pip install git+https://github.com/Dao-AILab/[email protected]
+pip install ninja
+pip install git+https://github.com/Dao-AILab/[email protected]
+pip install xfuser==0.4.0
 ```
 
 Additionally, we provide a pre-built Docker image, which can be pulled and run with the following commands.
 ```shell
-# For CUDA 12.4 (updated to avoid float point exception)
+# For CUDA 12.4 (updated to avoid float point exception)
 docker pull hunyuanvideo/hunyuanvideo:cuda_12
 docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
 
-# For CUDA 11.8
+# For CUDA 11.8
 docker pull hunyuanvideo/hunyuanvideo:cuda_11
 docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
 ```
@@ -242,7 +253,7 @@ docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyua
 
 To download the pretrained models, see [here](ckpts/README.md).
 
-## 🔑 Inference
+## 🔑 Single-GPU Inference
 The supported height/width/frame settings are listed in the table below.
 
 |    Resolution     |           h/w=9:16           |    h/w=16:9     |     h/w=4:3     |     h/w=3:4     |     h/w=1:1     |
@@ -297,26 +308,6 @@ python3 gradio_server.py --flow-reverse
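 [xDiT](https://github.com/xdit-project/xDiT) is a scalable inference engine for Diffusion Transformers (DiTs) on multi-GPU clusters.
 It has successfully provided low-latency parallel inference solutions for a variety of DiT models, including mochi-1, CogVideoX, Flux.1, SD3, etc. This repo adopted the [Unified Sequence Parallelism (USP)](https://arxiv.org/abs/2405.07719) APIs for parallel inference of the HunyuanVideo model.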
 
-### Install Dependencies Compatible with xDiT
-
-```
-# 1. Create a blank conda environment
-conda create -n hunyuanxdit python==3.10.9
-conda activate hunyuanxdit
-
-# 2. Install PyTorch components with CUDA 11.8
-conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0  pytorch-cuda=11.8 -c pytorch -c nvidia
-
-# 3. Install pip dependencies
-python -m pip install -r requirements_xdit.txt
-```
-
-You can skip the above steps and pull the pre-built Docker image directly, which is built from [docker/Dockerfile_xDiT](./docker/Dockerfile_xDiT)
-
-```
-docker pull thufeifeibear/hunyuanvideo:latest
-```
-
 ### Using Command Line
 
 For example, to run inference with 8 GPUs, you can use the following command:

diff --git a/docker/Dockerfile_xDiT b/docker/Dockerfile_xDiT
diff --git a/environment.yml b/environment.yml
diff --git a/requirements.txt b/requirements.txt
@@ -1,6 +1,5 @@
-torchvision==0.16.1
 opencv-python==4.9.0.80
-diffusers==0.30.2
+diffusers==0.31.0
 transformers==4.46.3
 tokenizers==0.20.3
 accelerate==1.1.1
@@ -12,5 +11,4 @@ loguru==0.7.2
 imageio==2.34.0
 imageio-ffmpeg==0.5.1
 safetensors==0.4.3
-gradio==4.43.0
-urllib3==1.26.6
+gradio==4.43.0
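
The Linux installation guide in the README hunks above lists one `conda install` line per supported CUDA version (11.8 and 12.4). As a minimal sketch, the choice could be wrapped in a small shell helper that prints the matching command for review instead of running it. The function name `pytorch_install_cmd` is hypothetical and not part of this PR; the command string is copied from the README.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR): print the conda install command
# from the README for a given CUDA version. Only 11.8 and 12.4 are listed there.
pytorch_install_cmd() {
    local cuda_version="$1"
    case "$cuda_version" in
        11.8|12.4)
            echo "conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=${cuda_version} -c pytorch -c nvidia"
            ;;
        *)
            # Any other version is rejected rather than guessed.
            echo "Unsupported CUDA version: ${cuda_version} (expected 11.8 or 12.4)" >&2
            return 1
            ;;
    esac
}
```

Calling `pytorch_install_cmd 12.4` prints the CUDA 12.4 line from step 3 of the guide; any other version returns a non-zero status.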
diff --git a/requirements_xdit.txt b/requirements_xdit.txt
diff --git a/scripts/run_sample_video.sh b/scripts/run_sample_video.sh
@@ -1,42 +1,14 @@
 #!/bin/bash
 # Description: This script demonstrates how to run inference to generate a video with the HunyuanVideo model
 
-TOKENIZERS_PARALLELISM=false torchrun --nproc_per_node=8 \
-	sample_video.py --video-size 1280 720 --video-length 129 \
-     	--infer-steps 50 --prompt "A cat walks on the grass, realistic style." \
-    	--flow-reverse --ulysses-degree=8 --ring-degree=1 --seed 42 --save-path ./results
-
-TOKENIZERS_PARALLELISM=false torchrun --nproc_per_node 4 \
-	        sample_video.py --video-size 1280 720 --video-length 129 \
-		        --infer-steps 50 --prompt "A cat walks on the grass, realistic style." \
-			        --flow-reverse --ulysses-degree=4 --ring-degree=1 --seed 42 --save-path ./results
-
-TOKENIZERS_PARALLELISM=false torchrun --nproc_per_node=2 \
-	        sample_video.py --video-size 1280 720 --video-length 129 \
-		        --infer-steps 50 --prompt "A cat walks on the grass, realistic style." \
-			        --flow-reverse --ulysses-degree=2 --ring-degree=1 --seed 42 --save-path ./results
-
-TOKENIZERS_PARALLELISM=false torchrun --nproc_per_node=1 \
-	        sample_video.py --video-size 1280 720 --video-length 129 \
-		        --infer-steps 50 --prompt "A cat walks on the grass, realistic style." \
-			        --flow-reverse --ulysses-degree=1 --ring-degree=1 --seed 42 --save-path ./results
-
-TOKENIZERS_PARALLELISM=false torchrun --nproc_per_node=6 \
-	        sample_video.py --video-size 960 960 --video-length 129 \
-		        --infer-steps 50 --prompt "A cat walks on the grass, realistic style." \
-			        --flow-reverse --ulysses-degree=6 --ring-degree=1 --seed 42 --save-path ./results
-
-TOKENIZERS_PARALLELISM=false torchrun --nproc_per_node=3 \
-	        sample_video.py --video-size 960 960 --video-length 129 \
-		        --infer-steps 50 --prompt "A cat walks on the grass, realistic style." \
-			        --flow-reverse --ulysses-degree=3 --ring-degree=1 --seed 42 --save-path ./results
-
-TOKENIZERS_PARALLELISM=false torchrun --nproc_per_node=2 \
-	        sample_video.py --video-size 960 960 --video-length 129 \
-		        --infer-steps 50 --prompt "A cat walks on the grass, realistic style." \
-			        --flow-reverse --ulysses-degree=2 --ring-degree=1 --seed 42 --save-path ./results
-
-TOKENIZERS_PARALLELISM=false torchrun --nproc_per_node=1 \
-	        sample_video.py --video-size 1280 720 --video-length 129 \
-		        --infer-steps 50 --prompt "A cat walks on the grass, realistic style." \
-			        --flow-reverse --ulysses-degree=1 --ring-degree=1 --seed 42 --save-path ./results
+python3 sample_video.py \
+    --video-size 720 1280 \
+    --video-length 129 \
+    --infer-steps 50 \
+    --prompt "A cat walks on the grass, realistic style." \
+    --seed 42 \
+    --embedded-cfg-scale 6.0 \
+    --flow-shift 7.0 \
+    --flow-reverse \
+    --use-cpu-offload \
+    --save-path ./results
diff --git a/scripts/run_sample_video_multigpu.sh b/scripts/run_sample_video_multigpu.sh
@@ -0,0 +1,46 @@
+#!/bin/bash
+# Description: This script demonstrates how to run inference to generate a video with the HunyuanVideo model
+
+# Supported Parallel Configurations
+# |     --video-size     | --video-length | --ulysses-degree x --ring-degree | --nproc_per_node |
+# |----------------------|----------------|----------------------------------|------------------|
+# | 1280 720 or 720 1280 | 129            | 8x1,4x2,2x4,1x8                  | 8                |
+# | 1280 720 or 720 1280 | 129            | 1x5                              | 5                |
+# | 1280 720 or 720 1280 | 129            | 4x1,2x2,1x4                      | 4                |
+# | 1280 720 or 720 1280 | 129            | 3x1,1x3                          | 3                |
+# | 1280 720 or 720 1280 | 129            | 2x1,1x2                          | 2                |
+# | 1104 832 or 832 1104 | 129            | 4x1,2x2,1x4                      | 4                |
+# | 1104 832 or 832 1104 | 129            | 3x1,1x3                          | 3                |
+# | 1104 832 or 832 1104 | 129            | 2x1,1x2                          | 2                |
+# | 960 960              | 129            | 6x1,3x2,2x3,1x6                  | 6                |
+# | 960 960              | 129            | 4x1,2x2,1x4                      | 4                |
+# | 960 960              | 129            | 3x1,1x3                          | 3                |
+# | 960 960              | 129            | 1x2,2x1                          | 2                |
+# | 960 544 or 544 960   | 129            | 6x1,3x2,2x3,1x6                  | 6                |
+# | 960 544 or 544 960   | 129            | 4x1,2x2,1x4                      | 4                |
+# | 960 544 or 544 960   | 129            | 3x1,1x3                          | 3                |
+# | 960 544 or 544 960   | 129            | 1x2,2x1                          | 2                |
+# | 832 624 or 624 832   | 129            | 4x1,2x2,1x4                      | 4                |
+# | 832 624 or 624 832   | 129            | 3x1,1x3                          | 3                |
+# | 832 624 or 624 832   | 129            | 2x1,1x2                          | 2                |
+# | 720 720              | 129            | 1x5                              | 5                |
+# | 720 720              | 129            | 3x1,1x3                          | 3                |
+
+export TOKENIZERS_PARALLELISM=false
+
+export NPROC_PER_NODE=8
+export ULYSSES_DEGREE=8
+export RING_DEGREE=1
+
+torchrun --nproc_per_node=$NPROC_PER_NODE sample_video.py \
+	--video-size 720 1280 \
+	--video-length 129 \
+	--infer-steps 50 \
+	--prompt "A cat walks on the grass, realistic style." \
+	--seed 42 \
+	--embedded-cfg-scale 6.0 \
+	--flow-shift 7.0 \
+	--flow-reverse \
+	--ulysses-degree=$ULYSSES_DEGREE \
+	--ring-degree=$RING_DEGREE \
+	--save-path ./results
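
In every row of the supported-configurations table in `scripts/run_sample_video_multigpu.sh`, the product of `--ulysses-degree` and `--ring-degree` equals `--nproc_per_node` (for example, 4x2 with 8 processes). A minimal sketch of a guard that could run before `torchrun` is shown below. The function name `check_parallel_config` is hypothetical and not part of this PR, and the check covers only this product rule, not which video sizes support which degrees.

```shell
#!/bin/bash
# Hypothetical guard (not part of this PR): verify that
# ulysses_degree * ring_degree == nproc_per_node, as in every row of the table.
check_parallel_config() {
    local nproc="$1" ulysses="$2" ring="$3"
    if [ $((ulysses * ring)) -ne "$nproc" ]; then
        echo "ulysses-degree (${ulysses}) x ring-degree (${ring}) must equal nproc_per_node (${nproc})" >&2
        return 1
    fi
}

# Example with the values exported in the script.
NPROC_PER_NODE=8
ULYSSES_DEGREE=8
RING_DEGREE=1
check_parallel_config "$NPROC_PER_NODE" "$ULYSSES_DEGREE" "$RING_DEGREE" && echo "parallel config OK"
```

Placing such a check just before the `torchrun` line would stop a mismatched setting, such as 4x1 with 8 processes, before any model loading starts.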