[doc] update the docker tags and flux.1 performance (#317)
feifeibear authored Oct 25, 2024
1 parent a3fa0c4 commit f9e35f7
Showing 3 changed files with 35 additions and 8 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -180,7 +180,11 @@ Note that we use two self-maintained packages:

The [flash_attn](https://github.com/Dao-AILab/flash-attention) version used with yunchang should be >= 2.6.0

-### 3. Usage
+### 3. Docker

We provide a docker image for developing with xDiT: [thufeifeibear/xdit-dev](https://hub.docker.com/r/thufeifeibear/xdit-dev).
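
As a quick sketch (the container mount and working directory below are illustrative assumptions, not documented defaults), the dev image can be pulled and started like this:

```
docker pull thufeifeibear/xdit-dev
# Start an interactive container with GPU access; mounting the local
# xDiT checkout into the container is an illustrative assumption.
docker run --gpus all -it --ipc=host \
  -v $PWD:/workspace/xDiT -w /workspace/xDiT \
  thufeifeibear/xdit-dev /bin/bash
```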

### 4. Usage

We provide examples demonstrating how to run models with xDiT in the [./examples/](./examples/) directory.
You can easily modify the model type, model directory, and parallel options in [examples/run.sh](examples/run.sh) to run the already supported DiT models.
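
For instance, a direct launch without run.sh might look like the sketch below; the script name and flag names follow the patterns in the examples directory and should be verified there:

```
# Run a Flux.1 example on 8 GPUs with hybrid parallelism
# (ulysses=2 x ring=2 x pipefusion=2 = 8 GPUs). Flag names are
# assumptions based on xDiT's example scripts; check --help for
# the authoritative list.
torchrun --nproc_per_node=8 examples/flux_example.py \
  --model black-forest-labs/FLUX.1-dev \
  --ulysses_degree 2 --ring_degree 2 --pipefusion_parallel_degree 2 \
  --num_inference_steps 28 --height 1024 --width 1024 \
  --prompt "a photo of a cat"
```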
6 changes: 3 additions & 3 deletions docs/developer/Http_Service_zh.md
@@ -3,18 +3,18 @@
### Building the Service Image

```
-docker build -t xdit-server:0.3.1 -f ./docker/Dockerfile .
+docker build -t xdit-service -f ./docker/Dockerfile .
```

Or pull it directly from Docker Hub (the version tag may need updating):
```
-docker pull thufeifeibear/xdit-service:0.3.1
+docker pull thufeifeibear/xdit-service
```

Start a service as follows; the service parameters are written in the configuration file config.json. We mount a host path into the docker container because the downloaded model files need to be passed in. Note that port 6000 is mapped; change it if it conflicts.

```
-docker run --gpus all -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -p 6000:6000 -v /cfs:/cfs xdit-server:0.3.1 --config ./config.json
+docker run --gpus all -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -p 6000:6000 -v /cfs:/cfs xdit-service --config ./config.json
```
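
Once the service is running, a request might look like the following sketch; the endpoint path and JSON fields are illustrative assumptions, not the service's documented API:

```
# Hypothetical request shape; verify the actual route and fields
# against the service implementation before relying on this.
curl -X POST http://localhost:6000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a photo of a cat", "num_inference_steps": 28}'
```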

The contents of ./config.json are shown below, including the number of GPUs to use, the hybrid parallel strategy, the output image size, where generated images are stored, and other settings.
31 changes: 27 additions & 4 deletions docs/performance/flux.md
@@ -9,15 +9,37 @@ Deploying Flux.1 in real-time presents several challenges:

2. VAE OOM: The VAE component experiences Out Of Memory (OOM) issues when attempting to generate images larger than 2048px on an A100 GPU with 80GB VRAM, despite the DiT backbone's capability to handle higher resolutions.

-To address these challenges, xDiT employs a hybrid sequence parallel [USP](https://arxiv.org/abs/2405.07719) and [VAE Parallel](https://github.com/xdit-project/DistVAE) to scale Flux.1 inference across multiple GPUs.
+To address these challenges, xDiT employs hybrid parallelism combining [USP](https://arxiv.org/abs/2405.07719), [PipeFusion](https://arxiv.org/abs/2405.14430), and [VAE Parallel](https://github.com/xdit-project/DistVAE) to scale Flux.1 inference across multiple GPUs.
Since Flux.1 does not utilize Classifier-Free Guidance (CFG), it is not compatible with cfg parallel.

-Currently, xDiT does not support PipeFusion for the Flux.1 schnell variant due to its minimal sampling steps, as PipeFusion requires a warmup phase which is not suitable for this scenario. However, applying PipeFusion for the Pro and Dev versions is considered necessary and is still under development.
### Scalability of Flux.1 Dev

-Additionally, since Flux.1 does not utilize Classifier-Free Guidance (CFG), it is not compatible with cfg parallel.
We conducted performance benchmarking using FLUX.1 [dev] with 28 diffusion steps.

-### Scalability
The following figure shows the scalability of Flux.1 on two 8xL40 nodes, 16 L40 GPUs in total.
No performance improvement is achieved by moving to 16 GPUs for the 1024px and 2048px tasks, likely because inter-node communication overhead outweighs the compute gains at these resolutions.
For the 4096px task, however, hybrid parallelism still pays off on 16 L40s: latency is 1.9x lower than with 8 GPUs, using the configuration ulysses=2, ring=2, and pipefusion=4 (a launch sketch follows the figure below).

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/scalability/Flux-16L40-crop.png"
alt="scalability-flux_l40">
</div>
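
For reference, the 16-GPU hybrid configuration above (ulysses=2, ring=2, pipefusion=4) might be launched roughly as follows; the rendezvous settings and flag names are assumptions to adapt against the current example scripts:

```
# Hypothetical two-node launch (run once per node, adjusting --node_rank).
# ulysses=2 x ring=2 x pipefusion=4 covers all 16 GPUs.
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
  --master_addr=<node0-ip> --master_port=29500 \
  examples/flux_example.py \
  --model black-forest-labs/FLUX.1-dev \
  --ulysses_degree 2 --ring_degree 2 --pipefusion_parallel_degree 4 \
  --height 4096 --width 4096 --num_inference_steps 28
```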


The following figure demonstrates the scalability of Flux.1 on 8xA100 GPUs.
For both the 1024px and 2048px image generation tasks, SP-Ulysses exhibits the lowest latency among the single parallel methods, and the optimal hybrid strategy in this case is also pure SP-Ulysses.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/scalability/Flux-A100-crop.png"
alt="scalability-flux_l40">
</div>

Note that the latency shown in the above figure does not yet include the use of torch.compile, which would provide further performance improvements.

### Scalability of Flux.1 Schnell

We conducted performance benchmarking using FLUX.1 [schnell] with 4 steps.
Because the number of steps is so small, we do not apply PipeFusion, which requires warmup steps.

On a machine with 8xA100 (80GB) GPUs interconnected via NVLink, the optimal USP strategy for generating a 1024px image is to set ulysses_degree to the number of GPUs. With `torch.compile`, generating a 1024px image takes only 0.82 seconds!
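
That setting corresponds to a sketch like the one below; `--use_torch_compile` is an assumed flag spelling to verify against the example scripts:

```
# FLUX.1 [schnell], 4 steps, pure SP-Ulysses across all 8 GPUs.
# --use_torch_compile is an assumed flag name; verify with --help.
torchrun --nproc_per_node=8 examples/flux_example.py \
  --model black-forest-labs/FLUX.1-schnell \
  --ulysses_degree 8 --num_inference_steps 4 \
  --height 1024 --width 1024 --use_torch_compile
```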

@@ -65,3 +87,4 @@ The quality of image generation at 2048px, 3072px, and 4096px resolutions is as
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/flux/flux_image.png"
alt="latency-flux_l40">
</div>
