# Stable Video Diffusion (VideoLDM)

## Introduction

Stable Video Diffusion is an Image-to-Video generation model that extends Stable Diffusion to video generation by introducing temporal layers into the architecture (a.k.a. VideoLDM). Additionally, it uses a modified decoder with added temporal layers to counteract flickering artifacts.

*VideoLDM U-Net block architecture: an example of a single U-Net block with added temporal layers (for more information, please refer to [2]).*
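
Conceptually, each temporal layer processes the frame axis and its output is blended with the output of the corresponding spatial layer through a learned mixing factor, following the idea in [2]. Below is a minimal, illustrative sketch of that mechanism; it is not the repository's actual implementation, and all class and parameter names are hypothetical.

```python
import mindspore as ms
from mindspore import nn, ops


class TemporalMixBlock(nn.Cell):
    """Toy block: a per-frame spatial conv blended with a temporal conv."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        # Spatial layer: processes every frame independently.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, pad_mode="same")
        # Temporal layer: 1D convolution along the frame axis.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, pad_mode="same")
        # Learnable mixing factor; a sigmoid keeps the blend weight in (0, 1).
        self.mix_factor = ms.Parameter(ms.Tensor(0.0, ms.float32))

    def construct(self, x):
        # x: (batch * frames, channels, height, width)
        bt, c, h, w = x.shape
        b = bt // self.num_frames

        z_spatial = self.spatial(x)

        # Rearrange so the temporal conv sees (batch * h * w, channels, frames).
        z = z_spatial.reshape(b, self.num_frames, c, h, w)
        z = z.transpose(0, 3, 4, 2, 1).reshape(b * h * w, c, self.num_frames)
        z = self.temporal(z)
        z = z.reshape(b, h, w, c, self.num_frames)
        z_temporal = z.transpose(0, 4, 3, 1, 2).reshape(bt, c, h, w)

        # Blend the spatial and temporal paths with the learned factor.
        alpha = ops.sigmoid(self.mix_factor)
        return alpha * z_spatial + (1.0 - alpha) * z_temporal
```

With the mixing factor initialized near zero, the block starts as an even blend and learns how much temporal information to inject during training.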

## Pretrained Models

| SD Base Version | SVD Version | Trained for          | Config | Checkpoint      |
|-----------------|-------------|----------------------|--------|-----------------|
| v2.0 & v2.1     | SVD         | 14 frames generation | yaml   | Download (9GB)  |
| v2.0 & v2.1     | SVD-XT      | 25 frames generation | yaml   | Download (9GB)  |

The weights above were converted from the PyTorch version. To convert another custom model, use `svd_tools/convert.py`. For example:

```shell
python svd_tools/convert.py \
--pt_weights_file PATH_TO_YOUR_TORCH_MODEL \
--config CONFIG_FILE \
--out_dir PATH_TO_OUTPUT_DIR
```

## Inference

Currently, only Image-to-Video generation is supported. For video generation from text, an image must first be created using either SD or SDXL (recommended resolution is 1024x576). Once the image is created, the video can be generated using the following command:

```shell
python image_to_video.py --mode=1 \
--SVD.config=configs/svd.yaml \
--SVD.checkpoint=PATH_TO_YOUR_SVD_CHECKPOINT \
--SVD.num_frames=NUM_FRAMES_TO_GENERATE \
--SVD.fps=FPS \
--image=PATH_TO_INPUT_IMAGE
```

> [!TIP]
> If you encounter an out-of-memory (OOM) error while running the above command, first try setting the `--SVD.decode_chunk_size` argument to a lower value (the default is `num_frames`) before reducing `num_frames`, since decoding is very memory-intensive.
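
For example, the command above can be rerun with a smaller decode chunk; the frame count and chunk size below are purely illustrative values:

```shell
python image_to_video.py --mode=1 \
--SVD.config=configs/svd.yaml \
--SVD.checkpoint=PATH_TO_YOUR_SVD_CHECKPOINT \
--SVD.num_frames=25 \
--SVD.fps=FPS \
--SVD.decode_chunk_size=4 \
--image=PATH_TO_INPUT_IMAGE
```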

For more information on possible parameters and usage, please execute the following command:

```shell
python image_to_video.py --help
```

## Training

### Dataset Preparation

Video labels should be stored in a CSV file in the following format:

```text
path,length,motion_bucket_id
path_to_video1,video_length1,motion_bucket_id1
path_to_video2,video_length2,motion_bucket_id2
...
```

The generation of motion bucket IDs is described in detail in the SVD [1] paper. Please refer to Appendix C of the paper for more information.
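
The following is a minimal sketch (not part of the repository) for producing such a CSV from a directory of videos. It assumes OpenCV is available for reading frame counts, that relative paths are acceptable in the `path` column, and that real motion bucket IDs are computed separately as described in [1]; the placeholder value here is illustrative only.

```python
import csv
import os

import cv2  # assumed dependency, used only to read frame counts

DATA_DIR = "PATH_TO_DATASET"        # directory containing the training videos
OUT_CSV = "PATH_TO_LABELS"          # metadata CSV consumed by train.py
PLACEHOLDER_MOTION_BUCKET_ID = 127  # replace with real per-video values ([1], Appendix C)

with open(OUT_CSV, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "length", "motion_bucket_id"])
    for name in sorted(os.listdir(DATA_DIR)):
        if not name.lower().endswith((".mp4", ".avi", ".mov")):
            continue
        # Read the number of frames in the video.
        cap = cv2.VideoCapture(os.path.join(DATA_DIR, name))
        length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
        writer.writerow([name, length, PLACEHOLDER_MOTION_BUCKET_ID])
```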

### Training

Currently, only Image-to-Video generation training is supported. To train Stable Video Diffusion, execute the following command:

```shell
python train.py --config=configs/svd_train.yaml \
--svd_config=configs/svd.yaml \
--train.pretrained=PATH_TO_YOUR_SVD_CHECKPOINT \
--train.output_dir=PATH_TO_OUTPUT_DIR \
--environment.mode=0 \
--train.temporal_only=True \
--train.epochs=NUM_EPOCHS \
--train.dataset.init_args.frames=NUM_FRAMES \
--train.dataset.init_args.step=FRAMES_FETCHING_STEP \
--train.dataset.init_args.data_dir=PATH_TO_DATASET \
--train.dataset.init_args.metadata=PATH_TO_LABELS
```

> [!NOTE]
> More details on the training arguments can be found in the training config (`configs/svd_train.yaml`) and the model config (`configs/svd.yaml`).

> [!IMPORTANT]
> For Ascend 910* devices, please set `export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"` before running training.

## Acknowledgements

  1. Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. Stability AI, 2023.
  2. Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. arXiv:2304.08818, 2023.