- Introduction
- Installation
- Dataset Preparation
- Text-to-Image
- Image-to-Image
- ControlNet
- T2I Adapter
- Advanced Usage
This repository integrates state-of-the-art Stable Diffusion models including SD1.5, SD2.0, and SD2.1, supporting various generation tasks and pipelines. Efficient training and fast inference are implemented based on MindSpore.
New models and features will be continuously updated.
SD Model | Text-to-Image | Image Variation | Inpainting | Depth-to-Image | ControlNet | T2I Adapter |
---|---|---|---|---|---|---|
1.5 | Inference, Training | N.A. | N.A. | N.A. | Inference, Training | Inference |
2.0 & 2.1 | Inference, Training | Inference, Training | Inference | Inference | N.A. | Inference, Training |
wukong | Inference, Training | N.A. | Inference | N.A. | N.A. | N.A. |
Although some combinations are not currently supported (because no checkpoint pretrained for that task and SD model is available), you can use the Model Conversion tool to convert a checkpoint (e.g. from Hugging Face) and then adapt it to an existing pipeline (e.g. the image variation pipeline with SD 1.5).
You may click the link in the table to access the running instructions directly.
For model performance, please refer to benchmark.
Our code is mainly developed and tested on Ascend 910 platforms with MindSpore framework. The compatible framework versions that are well-tested are listed as follows.
Ascend | MindSpore | CANN | driver | Python | MindONE |
---|---|---|---|---|---|
910 | 2.0 | 6.3 RC1 | 23.0.rc1 | 3.7.16 | master (4c33849) |
910 | 2.1 | 6.3 RC2 | 23.0.rc2 | 3.9.18 | master (4c33849) |
910* | 2.2.1 (20231124) | 7.1 | 23.0.rc3.6 | 3.7.16 | master (4c33849) |
For detailed instructions to install CANN and MindSpore, please refer to the official webpage MindSpore Installation.
Note: Running on other platforms (such as GPUs) and MindSpore versions may not be reliable. It's highly recommended to use the verified CANN and MindSpore versions. More compatible versions will be continuously updated.
git clone https://github.com/mindspore-lab/mindone.git
cd mindone/examples/stable_diffusion_v2
pip install -r requirements.txt
This section describes the data format and protocol for diffusion model training.
The text-image pair dataset should be organized as follows.
data_path
├── img1.jpg
├── img2.jpg
├── img3.jpg
└── img_txt.csv
Here, `img_txt.csv` is the image-caption file, annotated in the following format.
dir,text
img1.jpg,a cartoon character with a potted plant on his head
img2.jpg,a drawing of a green pokemon with red eyes
img3.jpg,a red and white ball with an angry look on its face
The first column is the image path relative to `data_path`, and the second column is the corresponding prompt.
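The format above can be checked before training with a small helper. This is a minimal sketch using only the Python standard library; `load_pairs` is a hypothetical name and not part of this repository, and it only assumes the `dir,text` header and the directory layout described above.

```python
import csv
from pathlib import Path


def load_pairs(data_path):
    """Return (image_path, caption) pairs listed in <data_path>/img_txt.csv."""
    root = Path(data_path)
    pairs = []
    with open(root / "img_txt.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects the header "dir,text"
            image = root / row["dir"]  # image paths are relative to data_path
            if not image.is_file():
                raise FileNotFoundError(f"missing image: {image}")
            pairs.append((image, row["text"]))
    return pairs
```

Calling `load_pairs("datasets/pokemon_blip/train")` would raise early if a listed image is missing, which is cheaper than discovering the problem in the middle of a training run.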
For convenience, we have prepared two public text-image datasets obeying the above format.
- pokemon-blip-caption dataset, containing 833 pokemon-style images with BLIP-generated captions.
- Chinese-art blip caption dataset, containing 100 Chinese art-style images with BLIP-generated captions.
To use them, please download `pokemon_blip.zip` and `chinese_art_blip.zip` from the openi dataset website. Then unzip them into a local directory, e.g. `./datasets/pokemon_blip`.
To generate images from a text prompt, please download one of the following checkpoints and put it in the `models` folder:
SD Version | Lang. | MindSpore Checkpoint | Ref. Official Model | Resolution |
---|---|---|---|---|
1.5 | EN | sd_v1.5-d0ab7146.ckpt | stable-diffusion-v1-5 | 512x512 |
1.5-wukong | CN | wukong-huahua-ms.ckpt | N.A. | 512x512 |
2.0 | EN | sd_v2_base-57526ee4.ckpt | stable-diffusion-2-base | 512x512 |
2.0-v | EN | sd_v2_768_v-e12e3a9b.ckpt | stable-diffusion-2 | 768x768 |
2.1 | EN | sd_v2-1_base-7c8d09ce.ckpt | stable-diffusion-2-1-base | 512x512 |
2.1-v | EN | sd_v2-1_768_v-061732d1.ckpt | stable-diffusion-2-1 | 768x768 |
Take SD 1.5 for example:
cd examples/stable_diffusion_v2
wget https://download.mindspore.cn/toolkits/mindone/stable_diffusion/sd_v1.5-d0ab7146.ckpt -P models
After preparing the pretrained weight, you can run text-to-image generation by:
python text_to_image.py --prompt {text prompt} -v {model version}
`-v`: model version. For valid values, refer to the SD Version column in the table above.
For more details on the arguments, please run `python text_to_image.py -h`.
Take SD 1.5 as an example:
# Generate images with the provided prompt using SD 1.5
python text_to_image.py --prompt "elven forest" -v 1.5
Take SD 2.0 as an example:
# Use SD 2.0 instead and add negative prompt guidance to eliminate artifacts
python text_to_image.py --prompt "elven forest" -v 2.0 --negative_prompt "moss" --scale 9.0 --seed 42
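As a rough illustration of what `--scale` and `--negative_prompt` do, the sketch below shows the usual classifier-free guidance combination of two noise predictions. It assumes `text_to_image.py` follows this standard formulation, with the negative prompt taking the place of the empty unconditional prompt. `guided_noise` and its plain-list inputs are hypothetical and are not taken from this repository.

```python
def guided_noise(eps_cond, eps_neg, scale):
    """Combine two noise predictions with classifier-free guidance.

    eps_cond: prediction conditioned on the prompt.
    eps_neg:  prediction conditioned on the negative (or empty) prompt.
    scale:    guidance scale; 1.0 returns eps_cond unchanged.
    """
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_neg)]
```

A larger scale pushes the result further away from the negative-prompt prediction, which is why raising `--scale` together with a negative prompt suppresses the unwanted content more strongly.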
For parallel inference, take SD1.5 on the Chinese art dataset as an example:
mpirun --allow-run-as-root -n 2 python text_to_image.py \
--config "configs/v1-inference.yaml" \
--data_path "datasets/chinese_art_blip/test/prompts.txt" \
--output_path "output/chinese_art_inference/txt2img" \
--ckpt_path "models/sd_v1.5-d0ab7146.ckpt" \
--use_parallel True
Note: Parallel inference can only be used with multiple prompts.
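One way to see why several prompts are needed: each process has to receive its own share of the prompt list. The sketch below is an illustration only, assuming a simple round-robin split; `shard_prompts` is a hypothetical helper, and the actual splitting logic in `text_to_image.py` may differ.

```python
def shard_prompts(prompts, rank, world_size):
    """Return the prompts handled by process `rank` out of `world_size`."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return prompts[rank::world_size]  # round-robin assignment
```

With a single prompt and two processes, the second process receives an empty list, so nothing is gained from running in parallel.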
Long Prompts Support
By default, SD V2 (and SD 1.5) only supports token sequences of up to 77 tokens. Longer sequences are truncated to 77 tokens, which can cause information loss.
To avoid information loss for long text prompts, one long token sequence (N>77) can be divided into several shorter sub-sequences (N<=77) to bypass the context-length constraint of the text encoders. This feature is enabled by `args.support_long_prompts` in `text_to_image.py`.
When running inference with `text_to_image.py`, you can set the arguments as below.
# "..." stands for your other argument configurations
python text_to_image.py \
... \
--support_long_prompts True  # allow long text prompts
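The splitting idea can be sketched in a few lines. This is a simplified illustration, not the code behind `--support_long_prompts`: `split_tokens` is a hypothetical helper, and a real implementation would also need to handle the start and end tokens of each chunk and merge the encoder outputs afterwards.

```python
def split_tokens(token_ids, max_len=77):
    """Split a token sequence into consecutive chunks of at most max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]
```

Every token ends up in exactly one chunk, so nothing is truncated. Each chunk fits the 77-token context of the text encoder.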
Flash-Attention Support
MindONE supports flash attention by setting the argument `enable_flash_attention` to `True` in `configs/v1-inference.yaml` or `configs/v2-inference.yaml`. For example, in `configs/v1-inference.yaml`:
unet_config:
  target: ldm.modules.diffusionmodules.openaimodel.UNetModel
  params:
    ...
    enable_flash_attention: False
    fa_max_head_dim: 256 # max head dim of flash attention. In case of OOM, reduce it to 128
Set `enable_flash_attention` to `True` to turn it on. In case of an OOM (out of memory) error, please reduce `fa_max_head_dim` to 128.
Here are some generation results.
Prompt: "elven forest" With negative prompt: "moss"
Vanilla fine-tuning refers to training the whole UNet while freezing the CLIP-TextEncoder and VAE modules in the SD model.
To run vanilla fine-tuning, we will use the `train_text_to_image.py` script following the instructions below.
1. Prepare the pretrained checkpoint referring to pretrained weights.
2. Prepare the training dataset referring to Dataset Preparation.
3. Select a training configuration template from `configs/train` and specify the `--train_config` argument. The selected config file should match the pretrained weight.
   - For SD1.5, use `configs/train/train_config_vanilla_v1.yaml`
   - For SD2.0 or SD2.1, use `configs/train/train_config_vanilla_v2.yaml`
   - For SD2.x with v-prediction, use `configs/train/train_config_vanilla_v2_vpred.yaml`

   Note that the model architecture (defined via `model_config`) and training recipes are preset in the yaml file. You may edit the file to adjust hyper-parameters like learning rate, training epochs, and batch size for your task.
4. Launch the training script after specifying the `data_path`, `pretrained_model_path`, and `train_config` arguments.
   python train_text_to_image.py \
   --train_config {path to pre-defined training config yaml} \
   --data_path {path to training data directory} \
   --output_path {path to output directory} \
   --pretrained_model_path {path to pretrained checkpoint file}
   If overflow is found on Ascend 910*, please enable INFNAN mode by running
   export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
Take fine-tuning SD1.5 on the Pokemon dataset as an example:
python train_text_to_image.py \
--train_config "configs/train/train_config_vanilla_v1.yaml" \
--data_path "datasets/pokemon_blip/train" \
--output_path "output/finetune_pokemon/txt2img" \
--pretrained_model_path "models/sd_v1.5-d0ab7146.ckpt"
The trained checkpoints will be saved in {output_path}.
For more details on the arguments, please run `python train_text_to_image.py -h`.
For parallel training on multiple Ascend NPUs, please refer to the instructions below.
1. Generate the rank table file for the target Ascend server.
   python tools/hccl_tools/hccl_tools.py --device_num="[0,8)"
   `--device_num` specifies which cards to train on, e.g. "[4,8)".
   A json file, e.g. `hccl_8p_10234567_127.0.0.1.json`, will be generated in the current directory after running.
2. Edit the distributed training script `scripts/run_train_distributed.sh` to specify `rank_table_file` with the path to the rank table file generated in step 1, as well as `data_path`, `pretrained_model_path`, and `train_config` according to your task.
3. Launch the distributed training script by
   bash scripts/run_train_distributed.sh
   If overflow is found on Ascend 910*, please enable INFNAN mode by running
   export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
   After launching, the training process can be traced by running `tail -f ouputs/train_txt2img/rank_0/train.log`.
   The trained checkpoints will be saved in `ouputs/train_txt2img`.
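The `--device_num` value is written as a half-open interval, so "[0,8)" selects devices 0 through 7 and "[4,8)" selects devices 4 through 7. The sketch below shows one way to read that notation. `parse_device_range` is a hypothetical helper written for illustration and is not the parser used inside `hccl_tools.py`.

```python
import re


def parse_device_range(spec):
    """Parse a half-open range such as "[0,8)" into a list of device ids."""
    match = re.fullmatch(r"\[\s*(\d+)\s*,\s*(\d+)\s*\)", spec.strip())
    if match is None:
        raise ValueError(f"expected a range like '[0,8)', got {spec!r}")
    start, stop = int(match.group(1)), int(match.group(2))
    if start >= stop:
        raise ValueError("range start must be smaller than its end")
    return list(range(start, stop))  # the end of the range is excluded
```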
Note: For distributed training on large-scale datasets such as LAION, please refer to LAION Dataset Preparation.
Low-Rank Adaptation (LoRA) is a parameter-efficient finetuning method for large models.
Please refer to the tutorial of LoRA for Stable Diffusion Finetuning for detailed instructions.
DreamBooth allows users to generate contextualized images of one subject using just 3-5 images of the subject, e.g., your dog.
Please refer to the tutorial of DreamBooth for Stable Diffusion Finetuning for detailed instructions.
Textual Inversion learns one or a few text embedding vectors for a new concept, e.g., object or style, with only 3~5 images.
Please refer to the tutorial of Textual Inversion for Stable Diffusion Finetuning for detailed instructions.
This pipeline uses a fine-tuned version of Stable Diffusion 2.1, which can be used to create image variations (image-to-image).
The pipeline comes with two pre-trained models, `2.1-unclip-l` and `2.1-unclip-h`, which use the pretrained CLIP Image embedder and OpenCLIP Image embedder respectively.
You can use the `-v` argument to decide which model to use.
The amount of image variation can be controlled by the noise injected into the image embedding, which is set with the `--noise_level` argument.
A value of 0 means no noise, while a value of 1000 means full noise.
To generate variant images from a source image, please download one of the following checkpoints and put it in the `models` folder:
SD Version | Lang. | MindSpore Checkpoint | Ref. Official Model | Resolution |
---|---|---|---|---|
2.1-unclip-l | EN | sd21-unclip-l-baa7c8b5.ckpt | stable-diffusion-2-1-unclip | 768x768 |
2.1-unclip-h | EN | sd21-unclip-h-6a73eca5.ckpt | stable-diffusion-2-1-unclip | 768x768 |
Also download the image encoder checkpoint ViT-L-14_stats-b668e2ca.ckpt to the `models` folder.
After preparing the pretrained weights, you can run image variation generation by:
python unclip_image_variation.py \
-v {model version} \
--image_path {path to input image} \
--prompt "your magic prompt to run image variation."
`-v`: model version. For valid values, refer to the SD Version column in the table above.
For more details on the arguments, please run `python unclip_image_variation.py --help`.
Using the `2.1-unclip-l` model as an example, you may generate variant images based on the example image by
python unclip_image_variation.py \
-v 2.1-unclip-l \
--image_path tarsila_do_amaral.png \
--prompt "a cute cat sitting in the garden"
The output images will be saved in the `output/samples` directory.
You can also add extra noise to the image embedding to increase the amount of variation in the generated images.
python unclip_image_variation.py -v 2.1-unclip-l --image_path tarsila_do_amaral.png --prompt "a cute cat sitting in the garden" --noise_level 200
For image-to-image fine-tuning, please refer to the tutorial of Stable Diffusion unCLIP Finetuning for detailed instructions.
Text-guided image inpainting allows users to edit specific regions of an image by providing a mask and a text prompt, which amounts to an erase-and-replace editing operation. When the prompt is empty, it can be used to auto-fill the masked regions to fit the image context (similar to the AI fill and extend operations in Photoshop beta).
To perform inpainting on an input image, please download one of the following checkpoints and put it in the `models` folder:
SD Version | Lang. | MindSpore Checkpoint | Ref. Official Model | Resolution |
---|---|---|---|---|
2.0-inpaint | EN | sd_v2_inpaint-f694d5cf.ckpt | stable-diffusion-2-inpainting | 512x512 |
1.5-wukong-inpaint | CN | wukong-huahua-inpaint-ms.ckpt | N.A. | 512x512 |
After preparing the pretrained weight, you can run image inpainting by:
python inpaint.py \
-v {model version} \
--image {path to input image} \
--mask {path to mask image} \
--prompt "your magic prompt to paint the masked region"
`-v`: model version. For valid values, refer to the SD Version column in the table above.
For more details on the arguments, please run `python inpaint.py --help`.
Using `2.0-inpaint` as an example, you can download the example image and mask. Then execute
python inpaint.py \
-v 2.0-inpaint \
--image overture-creations-5sI6fQgYIuo.png \
--mask overture-creations-5sI6fQgYIuo_mask.png \
--prompt "Face of a yellow cat, high resolution, sitting on a park bench"
The output images will be saved in the `output/samples` directory. Here are some generated results.
Text-guided image inpainting. From left to right: input image, mask, generated images.
With an empty prompt (`--prompt=""`), the masked part will be auto-filled to fit the context and background.
Image inpainting. From left to right: input image, mask, generated images.
This pipeline allows you to generate new images conditioned on a depth map (preserving image structure) and a text prompt. If you pass an initial image instead of a depth map, the pipeline will automatically extract the depth from it (using the MiDaS depth estimation model) and generate new images conditioned on the image depth, the image, and the text prompt.
SD Version | Lang. | MindSpore Checkpoint | Ref. Official Model | Resolution |
---|---|---|---|---|
2.0 | EN | sd_v2_depth-186e18a0.ckpt | stable-diffusion-2-depth | 512x512 |
Also download the depth estimation checkpoint midas_v3_dpt_large-c8fd1049.ckpt to the `models/depth_estimator` directory.
After preparing the pretrained weight, you can run depth-to-image by:
# depth to image given a depth map and text prompt
python depth_to_image.py \
--prompt {text prompt} \
--depth_map {path to depth map}
If you don't have a depth map, you can input a source image instead. The pipeline will extract the depth map from the source image.
# depth to image conditioning on an input image and text prompt
python depth_to_image.py \
--prompt {text prompt} \
--image {path to initial image} \
--strength 0.7
`--strength` indicates how strongly the pipeline will transform the initial image. A lower value preserves more content of the input image. A value of 1 ignores the initial image and conditions only on the depth and text prompt.
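To build intuition for `--strength`, the sketch below uses the convention found in many image-to-image diffusion pipelines, where the strength decides how many of the sampling steps are applied to a noised copy of the initial image. Whether `depth_to_image.py` computes it exactly this way is an assumption, and `steps_to_run` is a hypothetical helper.

```python
def steps_to_run(num_steps, strength):
    """Number of denoising steps applied to the noised initial image."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be within [0, 1]")
    return min(int(num_steps * strength), num_steps)
```

Under this convention, a strength of 0.5 with 50 sampling steps runs 25 denoising steps, while a strength of 1 runs all 50 steps from pure noise, so the initial image no longer influences the result.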
The output images will be saved in the `output/samples` directory.
Example:
Download the two-cat image and save it in the current folder. Then execute
python depth_to_image.py --image 000000039769.jpg --prompt "two tigers" --negative_prompt "bad, deformed, ugly, bad anatomy"
Here are some generated results.
Text-guided depth-to-image. From left to right: input image, estimated depth map, generated images
The two cats are replaced with two tigers while the background and image structure are mostly preserved in the generated images.
ControlNet is a type of model for controllable image generation. It helps make image diffusion models more controllable by conditioning the model with an additional input image. Stable Diffusion can be augmented with ControlNets to enable conditional inputs like canny edge maps, segmentation maps, keypoints, etc.
For detailed instructions on inference and training with ControlNet, please refer to Stable Diffusion with ControlNet.
T2I-Adapter is a simple and lightweight network that provides extra visual guidance for Stable Diffusion models without re-training them. The adapter acts as a plug-in to SD models, making it easy to integrate and use.
For detailed instructions on inference and training with T2I-Adapters, please refer to T2I-Adapter.
We provide tools to convert SD 1.x or SD 2.x model weights from PyTorch to MindSpore format. Please refer to this doc.
Currently, we support the following diffusion schedulers.
- DDIM
- DPM Solver
- DPM Solver++
- PLMS
- UniPC
Detailed illustrations and comparison of these schedulers can be viewed in Diffusion Process Schedulers.
The default objective function in SD training is to minimize the noise prediction error (noise-prediction). To change the objective to v-prediction, which is used in SD 2.0-v and SD 2.1-v, please refer to v-prediction.md.
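The difference between the two objectives is the regression target. With a noised sample x_t = alpha_t * x0 + sigma_t * eps, noise-prediction regresses eps, while v-prediction regresses the velocity v = alpha_t * eps - sigma_t * x0. The sketch below computes that target on plain Python lists. `v_target` is a hypothetical helper for illustration and is not the training code of this repository.

```python
def v_target(x0, eps, alpha_t, sigma_t):
    """Velocity target v = alpha_t * eps - sigma_t * x0, element-wise.

    x0:  clean sample values.
    eps: the Gaussian noise that was added to x0.
    alpha_t, sigma_t: signal and noise scales at timestep t.
    """
    return [alpha_t * e - sigma_t * x for x, e in zip(x0, eps)]
```

At a nearly clean timestep (alpha_t close to 1, sigma_t close to 0) the target is close to the noise itself, and at a nearly pure-noise timestep it approaches the negated clean sample.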
We provide different evaluation methods, including FID and CLIP-score, to evaluate the quality of the generated images. For detailed usage, please refer to Evaluation for Diffusion Models.
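CLIP-score style metrics compare two embeddings, image against text or image against image, by their cosine similarity. The sketch below shows only that final comparison on toy vectors. Extracting the embeddings with a CLIP model is omitted, and `cosine_similarity` is a hypothetical helper rather than the evaluation code shipped with this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm
```

A higher value means the generated image sits closer to the prompt, or to the reference image, in the CLIP embedding space.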
Coming soon
- 2024.01.10
  - Add Textual Inversion fine-tuning
- 2023.12.01
  - Add ControlNet v1
  - Add unclip image variation pipeline, supporting both inference and training.
  - Add image inpainting pipeline
  - Add depth-to-image pipeline
  - Fix bugs and improve compatibility to support more Ascend chip types
  - Refactor documents
- 2023.08.30
  - Add T2I-Adapter support for text-guided Image-to-Image translation.
- 2023.08.24
  - Add Stable Diffusion v2.1 and v2.1-v (768)
  - Support checkpoint auto-download
- 2023.08.17
  - Add Stable Diffusion v1.5
  - Add DreamBooth fine-tuning
  - Add text-guided image inpainting
  - Add CLIP score metrics (CLIP-I, CLIP-T) for evaluating visual and textual fidelity
- 2023.07.05
  - Add negative prompts
  - Improve logger
  - Fix bugs for MS 2.0.
- 2023.06.30
  - Add LoRA fine-tuning and FID evaluation.
- 2023.06.12
  - Add velocity parameterization for DDPM prediction type. Usage: set `parameterization: velocity` in `configs/your_train.yaml`
We appreciate all kinds of contributions, including making issues or pull requests to make our work better.