T2I-Adapter

Introduction

T2I-Adapters are simple and lightweight networks that provide additional visual guidance to Stable Diffusion (SD) models, alongside the built-in text guidance, to leverage the capabilities these models have learned implicitly. The adapters act as plug-ins to SD models, making them easy to integrate and use. The overall architecture of T2I-Adapters is as follows:

T2I-Adapter Architecture
Overall T2I-Adapter architecture
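
Conceptually, the adapter encodes a condition image into a small set of multi-scale feature maps, which are added to the frozen SD UNet's encoder features at the matching resolutions. Below is a minimal NumPy sketch of this injection step (illustrative shapes and helper names only, not the actual MindSpore implementation):

import numpy as np

def inject_adapter_features(unet_features, adapter_features, cond_weight=1.0):
    # Both arguments are lists of feature maps, one per UNet encoder resolution.
    # Adapter features are scaled by cond_weight and added element-wise.
    return [u + cond_weight * a for u, a in zip(unet_features, adapter_features)]

# Toy example with four feature levels (channel/spatial sizes are illustrative).
unet_feats = [np.zeros((1, c, s, s), dtype=np.float32)
              for c, s in [(320, 64), (640, 32), (1280, 16), (1280, 8)]]
adapter_feats = [np.ones_like(f) for f in unet_feats]
guided_feats = inject_adapter_features(unet_feats, adapter_feats, cond_weight=0.8)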

There are multiple advantages of this architecture:

  • T2I-Adapters do not affect the weights of Stable Diffusion models. Moreover, training T2I-Adapters does not require training of an SD model itself.
  • Simple and lightweight: 77M parameters for the full adapter and 5M parameters for the light adapter.
  • Composable: Several adapters can be combined to achieve multi-condition control.
  • Generalizable: Can be directly used on custom models as long as they are fine-tuned from the same base model (e.g., use T2I-Adapters trained on SD 1.4 with SD 1.5 or the Anything anime model).

Pretrained Models

| SD Compatibility | Task | SD Train Version | Dataset | Recipe | Weights |
|------------------|------|------------------|---------|--------|---------|
| SDXL | Canny | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | Depth (MiDaS) | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | LineArt | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | OpenPose | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | Sketch | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| 2.x | Segmentation | 2.1 | COCO-Stuff | yaml | Download |
| 1.x | Canny | 1.5 | COCO-Stuff | | Download |
| 1.x | Color | 1.4 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | Depth (MiDaS) | 1.5 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | KeyPose | 1.4 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | OpenPose | 1.4 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | Segmentation | 1.4 | COCO-Stuff | | Download |
| 1.x | Sketch | 1.5 | COCO-Stuff | | Download |
| 1.x | Style | 1.4 | | | Download |

Notes:

  • As mentioned in the Introduction, T2I-Adapters generalize well and thus can be used with custom models as long as they are fine-tuned from the same base model, e.g., T2I-Adapters trained on SD 1.4 can be used with SD 1.5 or the Anything anime model.
  • ⚠️ T2I-Adapters trained on SD 1.x are not compatible with SD 2.x due to differences in architecture.

The weights above were converted from the original PyTorch versions. If you want to convert another custom model, you can do so using t2i_tools/convert.py. For example:

python t2i_tools/convert.py --diffusion_model SDXL \
--pt_weights_file PATH_TO_YOUR_TORCH_MODEL \
--task CONDITION \
--out_dir PATH_TO_OUTPUT_DIR

Inference and Examples

For detailed information on possible parameters and usage, please execute the following commands:

python adapter_image2image_sd.py --help # for SD
python adapter_image2image_sdxl.py --help # for SDXL

Additionally, you can find some sample use cases for SD and SDXL below. The condition images used in the examples can be found here and here.

Individual Adapters

Canny Adapter

SD

SD Canny input SD Canny output
Prompt: Cute toy, best quality, extremely detailed

Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "Cute toy, best quality, extremely detailed" \
--adapter_ckpt_path models/t2iadapter_canny_sd15v2-c484cd69.ckpt \
--ddim \
--adapter_condition canny \
--condition_image samples/canny/toy_canny.png
SDXL

SDXL Canny input SDXL Canny output
Prompt: Mystical fairy in real, magic, 4k picture, high quality

Execution command
python adapter_image2image_sdxl.py \
--config=configs/sdxl_inference.yaml \
--SDXL.checkpoints=models/sd_xl_base_1.0_ms.ckpt \
--adapter.condition=canny \
--adapter.ckpt_path=models/adapter_xl_canny-aecfc7d6.ckpt \
--adapter.cond_weight=0.8 \
--adapter.image=samples/canny/figs_SDXLV1.0_cond_canny.png \
--prompt="Mystical fairy in real, magic, 4k picture, high quality" \
--negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" \
--n_samples=4

LineArt Adapter

SDXL

SDXL LineArt input SDXL LineArt output
Prompt: Ice dragon roar, 4k photo

Execution command
python adapter_image2image_sdxl.py \
--config=configs/sdxl_inference.yaml \
--SDXL.checkpoints=models/sd_xl_base_1.0_ms.ckpt \
--adapter.condition=lineart \
--adapter.ckpt_path=models/adapter_xl_lineart-6110edd0.ckpt \
--adapter.cond_weight=0.8 \
--adapter.image=samples/lineart/figs_SDXLV1.0_cond_lin.png \
--prompt="Ice dragon roar, 4k photo" \
--negative_prompt="anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured" \
--n_samples=4

Spatial Palette (Color) Adapter (SD only)

Color input Color output
Prompt: A photo of scenery

Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A photo of scenery" \
--adapter_ckpt_path models/t2iadapter_color_sd14v1-7cb31ebd.ckpt \
--ddim \
--adapter_condition color \
--condition_image samples/color/color_0002.png \
--scale 9

Depth Adapter

SD

Depth input Depth output
Prompt: desk, best quality, extremely detailed

Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "desk, best quality, extremely detailed" \
--adapter_ckpt_path models/t2iadapter_depth_sd15v2-dc86209b.ckpt \
--ddim \
--adapter_condition depth \
--condition_image samples/depth/desk_depth.png

OpenPose Adapter

SD

OpenPose input OpenPose output
Prompt: Iron man, high-quality, high-res

Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "Iron man, high-quality, high-res" \
--adapter_ckpt_path models/t2iadapter_openpose_sd14v1-ebcdb5cb.ckpt \
--ddim \
--adapter_condition openpose \
--condition_image samples/openpose/iron_man_pose.png

Segmentation Adapter

SD

Segmentation input Segmentation output SDv1.5 Segmentation output SDv2.1
Prompt: A black Honda motorcycle parked in front of a garage, best quality, extremely detailed
SD1.5 output on the left and SD2.1 output on the right.

Execution command
# StableDiffusion v2.1
python adapter_image2image_sd.py \
--version 2.1 \
--prompt "A black Honda motorcycle parked in front of a garage, best quality, extremely detailed" \
--ckpt_path=models/sd_v2-1_base-7c8d09ce.ckpt \
--adapter_ckpt_path models/t2iadapter_seg_sd21-86d4e0db.ckpt \
--ddim \
--adapter_condition seg \
--condition_image samples/seg/motor.png
# StableDiffusion v1.5
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A black Honda motorcycle parked in front of a garage, best quality, extremely detailed" \
--adapter_ckpt_path models/t2iadapter_seg_sd14v1-1d2e8478.ckpt \
--ddim \
--adapter_condition seg \
--condition_image samples/seg/motor.png

Sketch Adapter

SD

SD Sketch input SD Sketch output
Prompt: A car with flying wings

Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A car with flying wings" \
--adapter_ckpt_path models/t2iadapter_sketch_sd15v2-6c537e26.ckpt \
--ddim \
--adapter_condition sketch \
--condition_image samples/sketch/car.png \
--cond_tau 0.5
SDXL

SDXL Sketch input SDXL Sketch output
Prompt: a robot, mount fuji in the background, 4k photo, highly detailed

Execution command
python adapter_image2image_sdxl.py \
--config=configs/sdxl_inference.yaml \
--SDXL.checkpoints=models/sd_xl_base_1.0_ms.ckpt \
--adapter.condition=sketch \
--adapter.ckpt_path=models/adapter_xl_sketch-98dbd348.ckpt \
--adapter.cond_weight=0.9 \
--adapter.image=samples/sketch/figs_SDXLV1.0_cond_sketch.png \
--prompt="a robot, mount fuji in the background, 4k photo, highly detailed" \
--negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" \
--n_samples=4

Combined Adapters

Individual T2I-Adapters can also be combined without retraining to condition on multiple images.
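
Conceptually, this amounts to a weighted sum of the per-adapter feature maps before they are injected into the UNet, with the weights controlled by --cond_weight. A minimal NumPy sketch (hypothetical helper name, not the actual implementation):

import numpy as np

def combine_adapter_features(per_adapter_features, cond_weights):
    # per_adapter_features: one list of multi-scale feature maps per adapter.
    # Returns a single list of feature maps, weighted and summed across adapters.
    n_levels = len(per_adapter_features[0])
    return [sum(w * feats[level] for feats, w in zip(per_adapter_features, cond_weights))
            for level in range(n_levels)]

# Toy example: one feature level, two adapters (sketch and color) with weights 1.0 and 1.2.
sketch_feats = [np.ones((1, 320, 64, 64), dtype=np.float32)]
color_feats = [np.full((1, 320, 64, 64), 2.0, dtype=np.float32)]
combined = combine_adapter_features([sketch_feats, color_feats], cond_weights=[1.0, 1.2])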

Color + Sketch

SD

Sketch input
Prompt: A car with flying wings

Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A car with flying wings" \
--adapter_ckpt_path models/t2iadapter_sketch_sd15v2.ckpt models/t2iadapter_color_sd14v1.ckpt \
--adapter_condition sketch color \
--condition_image samples/sketch/car.png samples/color/color_0004.png \
--cond_weight 1.0 1.2 \
--ddim

Training

The following table summarizes T2I-Adapters training details:

| Task | SD Version | Dataset | Context | Train Time | Throughput | Recipe |
|------|------------|---------|---------|------------|------------|--------|
| Segmentation | 2.1 | COCO-Stuff Train | D910Ax4-MS2.1-G | 10h 35m / epoch | 39.2 img/s | yaml |

Context: The training context is denoted as {device}x{pieces}-{MS version}{MS mode}, where the MindSpore mode can be G (graph mode) or F (pynative mode with ms function). For example, D910x8-G denotes training on 8 Ascend 910 NPUs in graph mode.

Data preparation

Conditional images must be in RGB format. Therefore, some datasets may require preprocessing before training.
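
For example, a quick way to check that condition images are RGB and convert them when they are not (a minimal Pillow sketch; the directory names are placeholders):

from pathlib import Path
from PIL import Image

src_dir, dst_dir = Path("PATH_TO_RAW_CONDITIONS"), Path("PATH_TO_RGB_CONDITIONS")
dst_dir.mkdir(parents=True, exist_ok=True)
for path in sorted(src_dir.glob("*.png")):
    img = Image.open(path)
    if img.mode != "RGB":  # e.g. grayscale ("L") or palette ("P") images
        img = img.convert("RGB")
    img.save(dst_dir / path.name)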

Segmentation (COCO-Stuff)

Segmentation masks are usually grayscale images with values corresponding to class labels. However, to train T2I-Adapters, masks must be converted to RGB images. To do so, first download grayscale masks (stuffthingmaps_trainval2017.zip) and unpack them. Then execute the following command to convert mask to RGB images:

python t2i_tools/cocostuff_colorize_mask.py PATH_TO_GRAY_MASKS_DIR PATH_TO_OUTPUT_DIR

Annotation labels for COCO-Stuff can be found in annotations_trainval2017.zip/annotations/.
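
For illustration, the conversion boils down to looking up each grayscale class ID in a fixed color palette (a minimal NumPy/Pillow sketch with a hypothetical random palette; the actual color mapping is defined by t2i_tools/cocostuff_colorize_mask.py):

import numpy as np
from PIL import Image

# Hypothetical palette: one (R, G, B) triple per possible class ID (0-255).
palette = np.random.default_rng(0).integers(0, 256, size=(256, 3), dtype=np.uint8)

mask = np.array(Image.open("gray_mask.png"))  # grayscale mask, pixel values are class IDs
rgb_mask = palette[mask]                      # (H, W) -> (H, W, 3) color lookup
Image.fromarray(rgb_mask).save("rgb_mask.png")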

Segmentation

After the data preparation is completed, the following command can be used to train T2I-Adapter:

mpirun --allow-run-as-root -n 4 python train_t2i_adapter_sd.py \
--config configs/sd_v2.1_train.yaml \
--train.dataset.init_args.image_dir PATH_TO_IMAGES_DIR \
--train.dataset.init_args.masks_path PATH_TO_RGB_MASKS_DIR \
--train.dataset.init_args.label_path PATH_TO_LABELS \
--train.output_dir PATH_TO_OUTPUT_DIR

Evaluation

T2I-Adapters are evaluated on COCO-Stuff Validation dataset (see Data Preparation for more details) by using the first prompt per image only. The following table summarizes the performance of T2I-Adapters:

| Task | SD Version | Dataset | FID ↓ | CLIP Score ↑ | Recipe |
|------|------------|---------|-------|--------------|--------|
| Segmentation | 2.1 | COCO-Stuff Val | 26.10 | 26.32 | yaml |

To evaluate T2I-Adapters yourself, first you will need to generate images with adapter_image2image_sd.py (see Inference and Examples for more details). Then, to calculate FID, run the following command:

python examples/stable_diffusion_v2/tools/eval/eval_fid.py \
--backend=ms \
--real_dir=PATH_TO_VALIDATION_IMAGES \
--gen_dir=PATH_TO_GENERATED_IMAGES \
--batch_size=50

CLIP score is calculated by using the clip_vit_l_14 model (more information and weights can be found here). To calculate the score, run the following command:

python examples/stable_diffusion_v2/tools/eval/eval_clip_score.py \
--backend=ms \
--config=examples/stable_diffusion_v2/tools/_common/clip/configs/clip_vit_l_14.yaml \
--ckpt_path=PATH/TO/clip_vit_l_14.ckpt \
--tokenizer_path=examples/stable_diffusion_v2/ldm/models/clip/bpe_simple_vocab_16e6.txt.gz \
--image_path_or_dir=PATH_TO_GENERATED_IMAGES \
--prompt_or_path=PATH_TO_PROMPTS \
--save_result=False \
--quiet

Acknowledgements

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. arXiv:2302.08453, 2023.