T2I-Adapters are simple and lightweight networks that provide Stable Diffusion models with additional visual guidance, alongside the built-in text guidance, to exploit the control capabilities these models have learned implicitly. The adapters act as plug-ins to SD models, making them easy to integrate and use. The overall architecture of T2I-Adapters is as follows:
Overall T2I-Adapter architecture
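At a high level, the adapter maps a condition image (e.g., a canny or depth map) to a pyramid of feature maps, which are added to the frozen SD UNet encoder features at matching resolutions during sampling. The sketch below only illustrates this idea; module names, channel sizes, and injection points are simplified assumptions rather than the exact implementation in this repository:

```python
# Illustrative T2I-Adapter sketch (simplified; not the exact code in this repo).
import mindspore.nn as nn

class TinyAdapter(nn.Cell):
    """Downsampling conv stack emitting one guidance feature map per UNet encoder scale."""
    def __init__(self, cin=3, channels=(320, 640, 1280, 1280)):
        super().__init__()
        blocks, prev = [], cin
        for cout in channels:
            blocks.append(nn.SequentialCell(
                nn.Conv2d(prev, cout, 3, stride=2, pad_mode="same"),
                nn.ReLU(),
                nn.Conv2d(cout, cout, 3, stride=1, pad_mode="same"),
            ))
            prev = cout
        self.blocks = nn.CellList(blocks)

    def construct(self, cond_image):
        feats, x = [], cond_image
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # multi-scale guidance features
        return feats

# During sampling, the UNet encoder adds these features to its own activations,
# conceptually: h_i = encoder_block_i(h_{i-1}, t_emb, text_ctx) + w * feats[i].
# The SD weights stay frozen; only the adapter parameters are trained.
```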
There are multiple advantages of this architecture:
- Plug-and-play: T2I-Adapters do not modify the weights of Stable Diffusion models, and training an adapter does not require retraining the SD model itself.
- Simple and lightweight: 77M parameters for full and 5M parameters for light adapters.
- Composable: Several adapters can be combined to achieve multi-condition control.
- Generalizable: Can be directly used on custom models as long as they are fine-tuned from the same base model (e.g., T2I-Adapters trained on SD 1.4 can be used with SD 1.5 or the Anything anime model).
| SD Compatibility | Task | SD Train Version | Dataset | Recipe | Weights |
|---|---|---|---|---|---|
| SDXL | Canny | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | Depth (MiDaS) | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | LineArt | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | OpenPose | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| SDXL | Sketch | SDXL 1.0 | LAION-Aesthetics V2 (3M) | | Download |
| 2.x | Segmentation | 2.1 | COCO-Stuff | yaml | Download |
| 1.x | Canny | 1.5 | COCO-Stuff | | Download |
| 1.x | Color | 1.4 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | Depth (MiDaS) | 1.5 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | KeyPose | 1.4 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | OpenPose | 1.4 | LAION-Aesthetics V2 (625K) | | Download |
| 1.x | Segmentation | 1.4 | COCO-Stuff | | Download |
| 1.x | Sketch | 1.5 | COCO-Stuff | | Download |
| 1.x | Style | 1.4 | | | Download |
Notes:
- As mentioned in the Introduction, T2I-Adapters generalize well and thus can be used with custom models (as long as they are fine-tuned from the same base model), e.g., T2I-Adapters trained on SD 1.4 can be used with SD 1.5 or the Anything anime model.
⚠️ T2I-Adapters trained on SD 1.x are not compatible with SD 2.x due to differences in architecture.
The weights above were converted from the original PyTorch versions. If you want to convert another custom model, you can do so by using t2i_tools/convert.py. For example:
python t2i_tools/convert.py --diffusion_model SDXL \
--pt_weights_file PATH_TO_YOUR_TORCH_MODEL \
--task CONDITION \
--out_dir PATH_TO_OUTPUT_DIR
For detailed information on possible parameters and usage, please execute the following commands:
python adapter_image2image_sd.py --help # for SD
python adapter_image2image_sdxl.py --help # for SDXL
Additionally, you can find some sample use cases for SD and SDXL below. The condition images used in the examples can be found here and here.
Prompt: Cute toy, best quality, extremely detailed
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "Cute toy, best quality, extremely detailed" \
--adapter_ckpt_path models/t2iadapter_canny_sd15v2-c484cd69.ckpt \
--ddim \
--adapter_condition canny \
--condition_image samples/canny/toy_canny.png
Prompt: Mystical fairy in real, magic, 4k picture, high quality
Execution command
python adapter_image2image_sdxl.py \
--config=configs/sdxl_inference.yaml \
--SDXL.checkpoints=models/sd_xl_base_1.0_ms.ckpt \
--adapter.condition=canny \
--adapter.ckpt_path=models/adapter_xl_canny-aecfc7d6.ckpt \
--adapter.cond_weight=0.8 \
--adapter.image=samples/canny/figs_SDXLV1.0_cond_canny.png \
--prompt="Mystical fairy in real, magic, 4k picture, high quality" \
--negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" \
--n_samples=4
Prompt: Ice dragon roar, 4k photo
Execution command
python adapter_image2image_sdxl.py \
--config=configs/sdxl_inference.yaml \
--SDXL.checkpoints=models/sd_xl_base_1.0_ms.ckpt \
--adapter.condition=lineart \
--adapter.ckpt_path=models/adapter_xl_lineart-6110edd0.ckpt \
--adapter.cond_weight=0.8 \
--adapter.image=samples/lineart/figs_SDXLV1.0_cond_lin.png \
--prompt="Ice dragon roar, 4k photo" \
--negative_prompt="anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured" \
--n_samples=4
Prompt: A photo of scenery
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A photo of scenery" \
--adapter_ckpt_path models/t2iadapter_color_sd14v1-7cb31ebd.ckpt \
--ddim \
--adapter_condition color \
--condition_image samples/color/color_0002.png \
--scale 9
Prompt: desk, best quality, extremely detailed
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "desk, best quality, extremely detailed" \
--adapter_ckpt_path models/t2iadapter_depth_sd15v2-dc86209b.ckpt \
--ddim \
--adapter_condition depth \
--condition_image samples/depth/desk_depth.png
Prompt: Iron man, high-quality, high-res
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "Iron man, high-quality, high-res" \
--adapter_ckpt_path models/t2iadapter_openpose_sd14v1-ebcdb5cb.ckpt \
--ddim \
--adapter_condition openpose \
--condition_image samples/openpose/iron_man_pose.png
Prompt: A black Honda motorcycle parked in front of a garage, best quality, extremely detailed
SD1.5 output on the left and SD2.1 output on the right.
Execution command
# StableDiffusion v2.1
python adapter_image2image_sd.py \
--version 2.1 \
--prompt "A black Honda motorcycle parked in front of a garage, best quality, extremely detailed" \
--ckpt_path=models/sd_v2-1_base-7c8d09ce.ckpt \
--adapter_ckpt_path models/t2iadapter_seg_sd21-86d4e0db.ckpt \
--ddim \
--adapter_condition seg \
--condition_image samples/seg/motor.png
# StableDiffusion v1.5
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A black Honda motorcycle parked in front of a garage, best quality, extremely detailed" \
--adapter_ckpt_path models/t2iadapter_seg_sd14v1-1d2e8478.ckpt \
--ddim \
--adapter_condition seg \
--condition_image samples/seg/motor.png
Prompt: A car with flying wings
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A car with flying wings" \
--adapter_ckpt_path models/t2iadapter_sketch_sd15v2-6c537e26.ckpt \
--ddim \
--adapter_condition sketch \
--condition_image samples/sketch/car.png \
--cond_tau 0.5
Prompt: a robot, mount fuji in the background, 4k photo, highly detailed
Execution command
python adapter_image2image_sdxl.py \
--config=configs/sdxl_inference.yaml \
--SDXL.checkpoints=models/sd_xl_base_1.0_ms.ckpt \
--adapter.condition=sketch \
--adapter.ckpt_path=models/adapter_xl_sketch-98dbd348.ckpt \
--adapter.cond_weight=0.9 \
--adapter.image=samples/sketch/figs_SDXLV1.0_cond_sketch.png \
--prompt="a robot, mount fuji in the background, 4k photo, highly detailed" \
--negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" \
--n_samples=4
Individual T2I-Adapters can also be combined without retraining to condition on multiple images.
Prompt: A car with flying wings
Execution command
python adapter_image2image_sd.py \
--version 1.5 \
--prompt "A car with flying wings" \
--adapter_ckpt_path models/t2iadapter_sketch_sd15v2.ckpt models/t2iadapter_color_sd14v1.ckpt \
--adapter_condition sketch color \
--condition_image samples/sketch/car.png samples/color/color_0004.png \
--cond_weight 1.0 1.2 \
--ddim
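Under the hood, composition amounts to a weighted sum of the individual adapters' feature pyramids before they are injected into the UNet; the --cond_weight values above play the role of those weights. A minimal sketch under this assumption (the helper name and shapes are illustrative):

```python
# Minimal sketch of multi-adapter composition (assumed behaviour, simplified).
def combine_adapter_features(feature_pyramids, cond_weights):
    """feature_pyramids: one list of multi-scale feature maps per adapter (same shapes)."""
    combined = []
    for level_feats in zip(*feature_pyramids):  # iterate over scales
        combined.append(sum(w * f for w, f in zip(cond_weights, level_feats)))
    return combined

# e.g., sketch + color adapters weighted as in the command above:
# combined = combine_adapter_features([sketch_feats, color_feats], [1.0, 1.2])
```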
The following table summarizes the training details of T2I-Adapters:
| Task | SD Version | Dataset | Context | Train Time | Throughput | Recipe |
|---|---|---|---|---|---|---|
| Segmentation | 2.1 | COCO-Stuff Train | D910Ax4-MS2.1-G | 10h 35m / epoch | 39.2 img / s | yaml |
Context: the training context is denoted as {device}x{pieces}-{MS version}{MS mode}, where the MindSpore mode can be G (graph mode) or F (pynative mode with ms_function). For example, D910Ax4-MS2.1-G means training on 4 Ascend 910A NPUs with MindSpore 2.1 in graph mode.
Condition images must be in RGB format, so some datasets may require preprocessing before training.
Segmentation masks are usually grayscale images whose pixel values correspond to class labels. However, to train T2I-Adapters, the masks must be converted to RGB images. To do so, first download the grayscale masks (stuffthingmaps_trainval2017.zip) and unpack them. Then execute the following command to convert the masks to RGB images:
python t2i_tools/cocostuff_colorize_mask.py PATH_TO_GRAY_MASKS_DIR PATH_TO_OUTPUT_DIR
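Conceptually, the script just assigns a fixed RGB color to every class ID found in the grayscale mask. The sketch below illustrates this idea with a random palette; the actual colors and class count used by cocostuff_colorize_mask.py may differ:

```python
# Illustrative grayscale-label -> RGB mask conversion (random palette; the real
# script's palette and class count may differ).
import numpy as np
from PIL import Image

def colorize_mask(gray_mask_path, out_path, num_classes=182, seed=0):
    mask = np.array(Image.open(gray_mask_path))             # HxW, values = class IDs
    palette = np.random.default_rng(seed).integers(
        0, 256, size=(num_classes + 1, 3), dtype=np.uint8)  # one color per class
    rgb = palette[np.clip(mask, 0, num_classes)]            # HxWx3 lookup by class ID
    Image.fromarray(rgb).save(out_path)
```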
Annotation labels for COCO-Stuff can be found in annotations_trainval2017.zip/annotations/.
After the data preparation is completed, the following command can be used to train T2I-Adapter:
mpirun --allow-run-as-root -n 4 python train_t2i_adapter_sd.py \
--config configs/sd_v2.1_train.yaml \
--train.dataset.init_args.image_dir PATH_TO_IMAGES_DIR \
--train.dataset.init_args.masks_path PATH_TO_RGB_MASKS_DIR \
--train.dataset.init_args.label_path PATH_TO_LABELS \
--train.output_dir PATH_TO_OUTPUT_DIR
T2I-Adapters are evaluated on the COCO-Stuff validation dataset (see Data Preparation for more details), using only the first prompt for each image. The following table summarizes the performance of T2I-Adapters:
| Task | SD Version | Dataset | FID ↓ | CLIP Score ↑ | Recipe |
|---|---|---|---|---|---|
| Segmentation | 2.1 | COCO-Stuff Val | 26.10 | 26.32 | yaml |
To evaluate T2I-Adapters yourself, first generate images with adapter_image2image_sd.py (see Inference and Examples for more details). Then, to calculate FID, run the following command:
python examples/stable_diffusion_v2/tools/eval/eval_fid.py \
--backend=ms \
--real_dir=PATH_TO_VALIDATION_IMAGES \
--gen_dir=PATH_TO_GENERATED_IMAGES \
--batch_size=50
The CLIP score is calculated by using the clip_vit_l_14 model (more information and weights can be found here). To calculate the score, run the following command:
python examples/stable_diffusion_v2/tools/eval/eval_clip_score.py \
--backend=ms \
--config=examples/stable_diffusion_v2/tools/_common/clip/configs/clip_vit_l_14.yaml \
--ckpt_path=PATH/TO/clip_vit_l_14.ckpt \
--tokenizer_path=examples/stable_diffusion_v2/ldm/models/clip/bpe_simple_vocab_16e6.txt.gz \
--image_path_or_dir=PATH_TO_GENERATED_IMAGES \
--prompt_or_path=PATH_TO_PROMPTS \
--save_result=False \
--quiet
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. arXiv:2302.08453, 2023.