[doc] update the docker tags and flux.1 performance (#317)
feifeibear authored Oct 25, 2024
1 parent a3fa0c4 commit f9e35f7
Showing 3 changed files with 35 additions and 8 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -180,7 +180,11 @@ Note that we use two self-maintained packages:

The [flash_attn](https://github.com/Dao-AILab/flash-attention) version used with yunchang should be >= 2.6.0

-### 3. Usage
+### 3. Docker

We provide a docker image for developing with xDiT: [thufeifeibear/xdit-dev](https://hub.docker.com/r/thufeifeibear/xdit-dev).
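
As a quick sketch (the container mount and working directory below are illustrative assumptions, not documented defaults), the dev image can be pulled and started like this:

```
docker pull thufeifeibear/xdit-dev
# Start an interactive container with GPU access; mounting the local
# xDiT checkout into the container is an illustrative assumption.
docker run --gpus all -it --ipc=host \
  -v $PWD:/workspace/xDiT -w /workspace/xDiT \
  thufeifeibear/xdit-dev /bin/bash
```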

### 4. Usage

We provide examples demonstrating how to run models with xDiT in the [./examples/](./examples/) directory.
You can easily modify the model type, model directory, and parallel options in [examples/run.sh](examples/run.sh) to run the already supported DiT models.
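
For instance, a direct launch without run.sh might look like the sketch below; the script name and flag names follow the patterns in the examples directory and should be verified there:

```
# Run a Flux.1 example on 8 GPUs with hybrid parallelism
# (ulysses=2 x ring=2 x pipefusion=2 = 8 GPUs). Flag names are
# assumptions based on xDiT's example scripts; check --help for
# the authoritative list.
torchrun --nproc_per_node=8 examples/flux_example.py \
  --model black-forest-labs/FLUX.1-dev \
  --ulysses_degree 2 --ring_degree 2 --pipefusion_parallel_degree 2 \
  --num_inference_steps 28 --height 1024 --width 1024 \
  --prompt "a photo of a cat"
```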
6 changes: 3 additions & 3 deletions docs/developer/Http_Service_zh.md
@@ -3,18 +3,18 @@
### Building the Service Image

```
-docker build -t xdit-server:0.3.1 -f ./docker/Dockerfile .
+docker build -t xdit-service -f ./docker/Dockerfile .
```

Or pull it directly from Docker Hub (the version tag may need updating):
```
-docker pull thufeifeibear/xdit-service:0.3.1
+docker pull thufeifeibear/xdit-service
```

Start a service as follows; the service parameters are written in the configuration file config.json. We mount a host path into the docker container because the downloaded model files need to be passed in. Note that port 6000 is mapped; change it if it conflicts.

```
-docker run --gpus all -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -p 6000:6000 -v /cfs:/cfs xdit-server:0.3.1 --config ./config.json
+docker run --gpus all -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -p 6000:6000 -v /cfs:/cfs xdit-service --config ./config.json
```
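
Once the service is running, a request might look like the following sketch; the endpoint path and JSON fields are illustrative assumptions, not the service's documented API:

```
# Hypothetical request shape; verify the actual route and fields
# against the service implementation before relying on this.
curl -X POST http://localhost:6000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a photo of a cat", "num_inference_steps": 28}'
```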

The contents of ./config.json are shown below, including the number of GPUs to use, the hybrid parallel strategy, the output image size, where generated images are stored, and other settings.
31 changes: 27 additions & 4 deletions docs/performance/flux.md
@@ -9,15 +9,37 @@ Deploying Flux.1 in real-time presents several challenges:

2. VAE OOM: The VAE component experiences Out Of Memory (OOM) issues when attempting to generate images larger than 2048px on an A100 GPU with 80GB VRAM, despite the DiT backbone's capability to handle higher resolutions.

-To address these challenges, xDiT employs a hybrid sequence parallel [USP](https://arxiv.org/abs/2405.07719) and [VAE Parallel](https://github.com/xdit-project/DistVAE) to scale Flux.1 inference across multiple GPUs.
+To address these challenges, xDiT employs hybrid parallelism combining [USP](https://arxiv.org/abs/2405.07719), [PipeFusion](https://arxiv.org/abs/2405.14430), and [VAE Parallel](https://github.com/xdit-project/DistVAE) to scale Flux.1 inference across multiple GPUs.
Since Flux.1 does not utilize Classifier-Free Guidance (CFG), it is not compatible with cfg parallel.

-Currently, xDiT does not support PipeFusion for the Flux.1 schnell variant due to its minimal sampling steps, as PipeFusion requires a warmup phase which is not suitable for this scenario. However, applying PipeFusion for the Pro and Dev versions is considered necessary and is still under development.
### Scalability of Flux.1 Dev

-Additionally, since Flux.1 does not utilize Classifier-Free Guidance (CFG), it is not compatible with cfg parallel.
We conducted performance benchmarking using FLUX.1 [dev] with 28 diffusion steps.

-### Scalability
The following figure shows the scalability of Flux.1 on two 8xL40 nodes, 16 L40 GPUs in total.
No performance improvement is achieved by moving to 16 GPUs for the 1024px and 2048px tasks, likely because inter-node communication overhead outweighs the compute gains at these resolutions.
For the 4096px task, however, hybrid parallelism still pays off on 16 L40s: latency is 1.9x lower than with 8 GPUs, using the configuration ulysses=2, ring=2, and pipefusion=4 (a launch sketch follows the figure below).

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/scalability/Flux-16L40-crop.png"
alt="scalability-flux_l40">
</div>
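
For reference, the 16-GPU hybrid configuration above (ulysses=2, ring=2, pipefusion=4) might be launched roughly as follows; the rendezvous settings and flag names are assumptions to adapt against the current example scripts:

```
# Hypothetical two-node launch (run once per node, adjusting --node_rank).
# ulysses=2 x ring=2 x pipefusion=4 covers all 16 GPUs.
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
  --master_addr=<node0-ip> --master_port=29500 \
  examples/flux_example.py \
  --model black-forest-labs/FLUX.1-dev \
  --ulysses_degree 2 --ring_degree 2 --pipefusion_parallel_degree 4 \
  --height 4096 --width 4096 --num_inference_steps 28
```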


The following figure demonstrates the scalability of Flux.1 on 8xA100 GPUs.
For both the 1024px and 2048px image generation tasks, SP-Ulysses exhibits the lowest latency among the single parallel methods, and the optimal hybrid strategy in this case is also pure SP-Ulysses.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/scalability/Flux-A100-crop.png"
alt="scalability-flux_l40">
</div>

Note that the latency shown in the above figure does not yet include the use of torch.compile, which would provide further performance improvements.

### Scalability of Flux.1 Schnell

We conducted performance benchmarking using FLUX.1 [schnell] with 4 steps.
Because the number of steps is so small, we do not apply PipeFusion, which requires warmup steps.

On a machine with 8xA100 (80GB) GPUs interconnected via NVLink, the optimal USP strategy for generating a 1024px image is to set ulysses_degree to the number of GPUs. With `torch.compile`, generating a 1024px image takes only 0.82 seconds!
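
That setting corresponds to a sketch like the one below; `--use_torch_compile` is an assumed flag spelling to verify against the example scripts:

```
# FLUX.1 [schnell], 4 steps, pure SP-Ulysses across all 8 GPUs.
# --use_torch_compile is an assumed flag name; verify with --help.
torchrun --nproc_per_node=8 examples/flux_example.py \
  --model black-forest-labs/FLUX.1-schnell \
  --ulysses_degree 8 --num_inference_steps 4 \
  --height 1024 --width 1024 --use_torch_compile
```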

@@ -65,3 +87,4 @@ The quality of image generation at 2048px, 3072px, and 4096px resolutions is as
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/flux/flux_image.png"
alt="latency-flux_l40">
</div>
