T2I-Adapters are simple and lightweight networks that provide Stable Diffusion models with additional visual guidance, alongside the built-in text guidance, to exploit the control capabilities these models have learned implicitly. The adapters act as plug-ins to SD models, making them easy to integrate and use. The overall architecture of T2I-Adapters is as follows:
Overall T2I-Adapter architecture
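At a high level, the adapter maps a condition image (e.g., a canny or depth map) to a pyramid of feature maps, which are added to the frozen SD UNet encoder features at matching resolutions during sampling. The sketch below only illustrates this idea; module names, channel sizes, and injection points are simplified assumptions rather than the exact implementation in this repository:

```python
# Illustrative T2I-Adapter sketch (simplified; not the exact code in this repo).
import mindspore.nn as nn

class TinyAdapter(nn.Cell):
    """Downsampling conv stack emitting one guidance feature map per UNet encoder scale."""
    def __init__(self, cin=3, channels=(320, 640, 1280, 1280)):
        super().__init__()
        blocks, prev = [], cin
        for cout in channels:
            blocks.append(nn.SequentialCell(
                nn.Conv2d(prev, cout, 3, stride=2, pad_mode="same"),
                nn.ReLU(),
                nn.Conv2d(cout, cout, 3, stride=1, pad_mode="same"),
            ))
            prev = cout
        self.blocks = nn.CellList(blocks)

    def construct(self, cond_image):
        feats, x = [], cond_image
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # multi-scale guidance features
        return feats

# During sampling, the UNet encoder adds these features to its own activations,
# conceptually: h_i = encoder_block_i(h_{i-1}, t_emb, text_ctx) + w * feats[i].
# The SD weights stay frozen; only the adapter parameters are trained.
```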
There are multiple advantages of this architecture:
- Plug-and-play: T2I-Adapters do not modify the weights of Stable Diffusion models, and training an adapter does not require retraining the SD model itself.
- Simple and lightweight: 77M parameters for full and 5M parameters for light adapters.
- Composable: Several adapters can be combined to achieve multi-condition control.
- Generalizable: Can be directly used on custom models as long as they are fine-tuned from the same base model (e.g., T2I-Adapters trained on SD 1.4 can be used with SD 1.5 or the Anything anime model).
| SD Compatibility | Task | SD Train Version | Dataset | Recipe | Weights |
|---|---|---|---|---|---|
| SDXL | Canny | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | Depth (MiDaS) | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | LineArt | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | OpenPose | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | Sketch | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| 2.x | Segmentation | 2.1 | COCO-Stuff | yaml | Download |
| 1.x | Canny | 1.5 | COCO-Stuff | | Download |
| 1.x | Color | 1.4 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | Depth (MiDaS) | 1.5 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | KeyPose | 1.4 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | OpenPose | 1.4 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | Segmentation | 1.4 | COCO-Stuff | | Download |
| 1.x | Sketch | 1.5 | COCO-Stuff | | Download |
| 1.x | Style | 1.4 | | | Download |
Notes:
- As mentioned in the Introduction, T2I-Adapters generalize well and thus can be used with custom models (as long as they are fine-tuned from the same base model), e.g., T2I-Adapters trained on SD 1.4 can be used with SD 1.5 or the Anything anime model.
⚠️ T2I-Adapters trained on SD 1.x are not compatible with SD 2.x due to differences in architecture.
The weights above were converted from the original PyTorch versions. If you want to convert another custom model, you can do so by using t2i_tools/convert.py. For example:
python t2i_tools/convert.py --diffusion_model SDXL \
--pt_weights_file PATH_TO_YOUR_TORCH_MODEL \
--task CONDITION \
--out_dir PATH_TO_OUTPUT_DIR
For detailed information on possible parameters and usage, please execute the following commands:
python adapter_image2image_sd.py --help # for SD
python adapter_image2image_sdxl.py --help # for SDXL
Additionally, you can find some sample use cases for SD and SDXL below. The condition images used in the examples can be found here and here.
Prompt: Cute toy, best quality, extremely detailed
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "Cute toy, best quality, extremely detailed" \
--adapter_ckpt_path models/t2iadapter_canny_sd15v2-c484cd69.ckpt \
--ddim \
--adapter_condition canny \
--condition_image samples/canny/toy_canny.png
Prompt: Mystical fairy in real, magic, 4k picture, high quality
Execution command
python adapter_image2image_sdxl.py \
--config=configs/sdxl_inference.yaml \
--SDXL.checkpoints=models/sd_xl_base_1.0_ms.ckpt \
--adapter.condition=canny \
--adapter.ckpt_path=models/adapter_xl_canny-aecfc7d6.ckpt \
--adapter.cond_weight=0.8 \
--adapter.image=samples/canny/figs_SDXLV1.0_cond_canny.png \
--prompt="Mystical fairy in real, magic, 4k picture, high quality" \
--negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" \
--n_samples=4
Prompt: Ice dragon roar, 4k photo
Execution command
python adapter_image2image_sdxl.py \
--config=configs/sdxl_inference.yaml \
--SDXL.checkpoints=models/sd_xl_base_1.0_ms.ckpt \
--adapter.condition=lineart \
--adapter.ckpt_path=models/adapter_xl_lineart-6110edd0.ckpt \
--adapter.cond_weight=0.8 \
--adapter.image=samples/lineart/figs_SDXLV1.0_cond_lin.png \
--prompt="Ice dragon roar, 4k photo" \
--negative_prompt="anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured" \
--n_samples=4
Prompt: A photo of scenery
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A photo of scenery" \
--adapter_ckpt_path models/t2iadapter_color_sd14v1-7cb31ebd.ckpt \
--ddim \
--adapter_condition color \
--condition_image samples/color/color_0002.png \
--scale 9
Prompt: desk, best quality, extremely detailed
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "desk, best quality, extremely detailed" \
--adapter_ckpt_path models/t2iadapter_depth_sd15v2-dc86209b.ckpt \
--ddim \
--adapter_condition depth \
--condition_image samples/depth/desk_depth.png
Prompt: Iron man, high-quality, high-res
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "Iron man, high-quality, high-res" \
--adapter_ckpt_path models/t2iadapter_openpose_sd14v1-ebcdb5cb.ckpt \
--ddim \
--adapter_condition openpose \
--condition_image samples/openpose/iron_man_pose.png
Prompt: A black Honda motorcycle parked in front of a garage, best quality, extremely detailed
SD1.5 output on the left and SD2.1 output on the right.
Execution command
# StableDiffusion v2.1
python adapter_image2image_sd.py \
--version 2.1 \
--prompt "A black Honda motorcycle parked in front of a garage, best quality, extremely detailed" \
--ckpt_path=models/sd_v2-1_base-7c8d09ce.ckpt \
--adapter_ckpt_path models/t2iadapter_seg_sd21-86d4e0db.ckpt \
--ddim \
--adapter_condition seg \
--condition_image samples/seg/motor.png
# StableDiffusion v1.5
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A black Honda motorcycle parked in front of a garage, best quality, extremely detailed" \
--adapter_ckpt_path models/t2iadapter_seg_sd14v1-1d2e8478.ckpt \
--ddim \
--adapter_condition seg \
--condition_image samples/seg/motor.png
Prompt: A car with flying wings
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A car with flying wings" \
--adapter_ckpt_path models/t2iadapter_sketch_sd15v2-6c537e26.ckpt \
--ddim \
--adapter_condition sketch \
--condition_image samples/sketch/car.png \
--cond_tau 0.5
Prompt: a robot, mount fuji in the background, 4k photo, highly detailed
Execution command
python adapter_image2image_sdxl.py \
--config=configs/sdxl_inference.yaml \
--SDXL.checkpoints=models/sd_xl_base_1.0_ms.ckpt \
--adapter.condition=sketch \
--adapter.ckpt_path=models/adapter_xl_sketch-98dbd348.ckpt \
--adapter.cond_weight=0.9 \
--adapter.image=samples/sketch/figs_SDXLV1.0_cond_sketch.png \
--prompt="a robot, mount fuji in the background, 4k photo, highly detailed" \
--negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" \
--n_samples=4
Individual T2I-Adapters can also be combined without retraining to condition on multiple images.
Prompt: A car with flying wings
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A car with flying wings" \
--adapter_ckpt_path models/t2iadapter_sketch_sd15v2.ckpt models/t2iadapter_color_sd14v1.ckpt \
--adapter_condition sketch color \
--condition_image samples/sketch/car.png samples/color/color_0004.png \
--cond_weight 1.0 1.2 \
--ddim
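Under the hood, composition amounts to a weighted sum of the individual adapters' feature pyramids before they are injected into the UNet; the --cond_weight values above play the role of those weights. A minimal sketch under this assumption (the helper name and shapes are illustrative):

```python
# Minimal sketch of multi-adapter composition (assumed behaviour, simplified).
def combine_adapter_features(feature_pyramids, cond_weights):
    """feature_pyramids: one list of multi-scale feature maps per adapter (same shapes)."""
    combined = []
    for level_feats in zip(*feature_pyramids):  # iterate over scales
        combined.append(sum(w * f for w, f in zip(cond_weights, level_feats)))
    return combined

# e.g., sketch + color adapters weighted as in the command above:
# combined = combine_adapter_features([sketch_feats, color_feats], [1.0, 1.2])
```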
The following table summarizes the training details of T2I-Adapters:
| Task | SD Version | Dataset | Context | Train Time | Throughput | Recipe |
|---|---|---|---|---|---|---|
| Segmentation | 2.1 | COCO-Stuff Train | D910Ax4-MS2.1-G | 10h 35m / epoch | 39.2 img / s | yaml |
Context: the training context is denoted as {device}x{pieces}-{MS version}{MS mode}, where the MindSpore mode can be G (graph mode) or F (pynative mode with ms_function). For example, D910Ax4-MS2.1-G means training on 4 Ascend 910A NPUs with MindSpore 2.1 in graph mode.
Condition images must be in RGB format, so some datasets may require preprocessing before training.
Segmentation masks are usually grayscale images whose pixel values correspond to class labels. However, to train T2I-Adapters, the masks must be converted to RGB images. To do so, first download the grayscale masks (stuffthingmaps_trainval2017.zip) and unpack them. Then execute the following command to convert the masks to RGB images:
python t2i_tools/cocostuff_colorize_mask.py PATH_TO_GRAY_MASKS_DIR PATH_TO_OUTPUT_DIR
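Conceptually, the script just assigns a fixed RGB color to every class ID found in the grayscale mask. The sketch below illustrates this idea with a random palette; the actual colors and class count used by cocostuff_colorize_mask.py may differ:

```python
# Illustrative grayscale-label -> RGB mask conversion (random palette; the real
# script's palette and class count may differ).
import numpy as np
from PIL import Image

def colorize_mask(gray_mask_path, out_path, num_classes=182, seed=0):
    mask = np.array(Image.open(gray_mask_path))             # HxW, values = class IDs
    palette = np.random.default_rng(seed).integers(
        0, 256, size=(num_classes + 1, 3), dtype=np.uint8)  # one color per class
    rgb = palette[np.clip(mask, 0, num_classes)]            # HxWx3 lookup by class ID
    Image.fromarray(rgb).save(out_path)
```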
Annotation labels for COCO-Stuff can be found in annotations_trainval2017.zip/annotations/.
After the data preparation is completed, the following command can be used to train T2I-Adapter:
mpirun --allow-run-as-root -n 4 python train_t2i_adapter_sd.py \
--config configs/sd_v2.1_train.yaml \
--train.dataset.init_args.image_dir PATH_TO_IMAGES_DIR \
--train.dataset.init_args.masks_path PATH_TO_RGB_MASKS_DIR \
--train.dataset.init_args.label_path PATH_TO_LABELS \
--train.output_dir PATH_TO_OUTPUT_DIR
T2I-Adapters are evaluated on the COCO-Stuff validation dataset (see Data Preparation for more details), using only the first prompt for each image. The following table summarizes the performance of T2I-Adapters:
| Task | SD Version | Dataset | FID ↓ | CLIP Score ↑ | Recipe |
|---|---|---|---|---|---|
| Segmentation | 2.1 | COCO-Stuff Val | 26.10 | 26.32 | yaml |
To evaluate T2I-Adapters yourself, first generate images with adapter_image2image_sd.py (see Inference and Examples for more details). Then, to calculate FID, run the following command:
python examples/stable_diffusion_v2/tools/eval/eval_fid.py \
--backend=ms \
--real_dir=PATH_TO_VALIDATION_IMAGES \
--gen_dir=PATH_TO_GENERATED_IMAGES \
--batch_size=50
The CLIP score is calculated by using the clip_vit_l_14 model (more information and weights can be found here). To calculate the score, run the following command:
python examples/stable_diffusion_v2/tools/eval/eval_clip_score.py \
--backend=ms \
--config=examples/stable_diffusion_v2/tools/_common/clip/configs/clip_vit_l_14.yaml \
--ckpt_path=PATH/TO/clip_vit_l_14.ckpt \
--tokenizer_path=examples/stable_diffusion_v2/ldm/models/clip/bpe_simple_vocab_16e6.txt.gz \
--image_path_or_dir=PATH_TO_GENERATED_IMAGES \
--prompt_or_path=PATH_TO_PROMPTS \
--save_result=False \
--quiet
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. arXiv:2302.08453, 2023.