add time cost and GPU memory to README and add corresponding configs

Can-Zhao · Can-Zhao · commit c78bf7bb30a7 · 2024-11-15T08:58:31.000Z
Signed-off-by: Can-Zhao <canz@nvidia.com>
diff --git a/generation/maisi/README.md b/generation/maisi/README.md
@@ -31,6 +31,34 @@ We retrained several state-of-the-art diffusion model-based methods using our da
 
 </div>
 
+## Inference Time Cost and GPU Memory Usage
+
+| `output_size` | `autoencoder_sliding_window_infer_size` | `autoencoder_tp_num_splits` | Peak Memory | DM Time | VAE Time |
+|---------------|:--------------------------------------:|:---------------------------:|:-----------:|:-------:|:--------:|
+| 256x256x128   | >=[64,64,32], not used                 | 2                           | 14G         | 57s     | 1s       |
+| 256x256x256   | [48,48,64], 4 patches                  | 2                           | 14G         | 81s     | 7s       |
+| 512x512x128   | [64,64,32], 9 patches                  | 1                           | 14G         | 138s    | 7s       |
+|---------------|----------------------------------------|-----------------------------|-------------|---------|----------|
+| 256x256x256   | >=[64,64,64], not used                 | 4                           | 22G         | 81s     | 2s       |
+| 512x512x128   | [80,80,32], 4 patches                  | 1                           | 18G         | 138s    | 9s       |
+| 512x512x512   | [64,64,48], 36 patches                 | 2                           | 22G         | 569s    | 29s      |
+|---------------|----------------------------------------|-----------------------------|-------------|---------|----------|
+| 512x512x512   | [64,64,64], 27 patches                 | 2                           | 26G         | 569s    | 40s      |
+|---------------|----------------------------------------|-----------------------------|-------------|---------|----------|
+| 512x512x128   | >=[128,128,32], not used               | 4                           | 37G         | 138s    | 140s     |
+| 512x512x512   | [80,80,80], 8 patches                  | 2                           | 44G         | 569s    | 30s      |
+| 512x512x768   | [80,80,112], 8 patches                 | 4                           | 55G         | 904s    | 48s      |
+
+
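+The experiments were run on an NVIDIA A100 80GB GPU. `DM Time` is the time the diffusion model takes to generate the latent features, and `VAE Time` is the time the autoencoder takes to decode them into the output image.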
+
+During inference, peak GPU memory usage occurs while the autoencoder decodes the latent features into images.
+To reduce GPU memory usage, either increase `autoencoder_tp_num_splits` or reduce `autoencoder_sliding_window_infer_size`.
+Increasing `autoencoder_tp_num_splits` has a smaller impact on the quality of the generated images.
+Reducing `autoencoder_sliding_window_infer_size`, in contrast, may introduce stitching artifacts and has a larger impact on image quality.
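The patch counts in the table follow from a short calculation, which also shows why a smaller `autoencoder_sliding_window_infer_size` produces more patches and therefore more seams. This is an illustrative sketch, not MAISI's inference code. It assumes that (1) the latent features are the output size downsampled 4x along each axis, as the first table row implies (256x256x128 maps to 64x64x32), and (2) neighboring windows overlap by 25%, the default `overlap` of MONAI's `sliding_window_inference`. The helper names `latent_size`, `windows_per_axis`, and `num_patches` are hypothetical. Under these assumptions, the sketch reproduces every patch count in the table.

```python
import math
from typing import Optional, Sequence, Tuple

# Assumption: the autoencoder downsamples each spatial axis by 4
# (consistent with 256x256x128 -> 64x64x32 in the table).
LATENT_DOWNSAMPLE = 4
# Assumption: 25% overlap between neighboring windows (MONAI's default).
OVERLAP = 0.25


def latent_size(output_size: Sequence[int], factor: int = LATENT_DOWNSAMPLE) -> Tuple[int, ...]:
    """Spatial size of the latent features that the autoencoder decodes."""
    return tuple(s // factor for s in output_size)


def windows_per_axis(size: int, roi: int, overlap: float = OVERLAP) -> int:
    """Number of sliding windows needed to cover one latent axis."""
    if roi >= size:
        return 1  # a single window already covers the whole axis
    step = max(int(roi * (1 - overlap)), 1)  # stride between window starts
    return math.ceil((size - roi) / step) + 1


def num_patches(output_size: Sequence[int], window_size: Sequence[int],
                overlap: float = OVERLAP) -> Optional[int]:
    """Total number of decoded patches, or None when the window covers the
    whole latent volume (the "not used" rows of the table)."""
    latent = latent_size(output_size)
    if all(w >= s for w, s in zip(window_size, latent)):
        return None
    return math.prod(windows_per_axis(s, w, overlap) for s, w in zip(latent, window_size))


if __name__ == "__main__":
    for out, win in [((256, 256, 256), (48, 48, 64)), ((512, 512, 512), (64, 64, 48))]:
        print(out, win, num_patches(out, win))
```

For example, shrinking the window for a 512x512x512 output from [80,80,80] to [64,64,48] raises the count from 8 to 36 patches, which lowers peak memory but adds more boundaries where stitching artifacts can appear.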
+
+
+
 ## MAISI Model Workflow
 The training and inference workflows of MAISI are depicted in the figure below. It begins by training an autoencoder in pixel space to encode images into latent features. Following that, it trains a diffusion model in the latent space to denoise the noisy latent features. During inference, it first generates latent features from random noise by applying multiple denoising steps using the trained diffusion model. Finally, it decodes the denoised latent features into images using the trained autoencoder.
 <p align="center">