
sayanshaw24/videogen

 
 


MakeLongVideo - PyTorch

Implementation of long video generation based on a diffusion model.

"Ironman is surfing" "a car is racing" "a cat eating food of a bowl, in von Gogh style" "a giraffe underneath the microwave"
"a glass bead falling into water with huge splash" "a video of Earth rotating in space" "A teddy bear running in New York City" "A stunning aerial drone footage time lapse of El Capitan in Yosemite National Park at sunset"

Setup

Requirements

python3 -m pip install -r requirements.txt

Training

Prepare Stable Diffusion v1-4 pretrained weights

Download the weights from Hugging Face and place them in the 'checkpoints' directory, which is the location configured in configs/makelongvideo.yaml.

Download WebVid dataset

Download the WebVid dataset into the directory 'data/webvid' using the https://github.com/m-bain/webvid repository, then prepare the dataset with the following command:

python3 genvideocap.py

Download LAION400M dataset

Download the LAION400M dataset into the directory 'data/laion400m'.

Train

First, train at a resolution of 128x128:

accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo.yaml

Then fine-tune at a resolution of 256x256. Before launching, edit the last line of configs/makelongvideo256x256.yaml so that it matches your local epoch checkpoint:

accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo256x256.yaml
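
To decide which checkpoint to reference in the last line of the 256x256 config, a small helper like the one below may be convenient. It is a minimal sketch that assumes checkpoints are saved as directories named checkpoint-<step> inside the output directory (the pattern seen in ./outputs/makelongvideo/checkpoint-5200 under Inference). The function name latest_checkpoint is hypothetical and is not part of this repository.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: directories called "checkpoint-<step>".
_CKPT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CKPT_PATTERN.match(child.name)
        # Ignore plain files and directories that do not follow the scheme.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


if __name__ == "__main__":
    # Prints None when the directory does not exist yet.
    print(latest_checkpoint("./outputs/makelongvideo"))
```

The printed path can then be copied by hand into the config; the script deliberately does not edit the YAML file itself, since the exact key on that last line is not documented here.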

Inference

Pretrained weights: https://huggingface.co/xiexiecn/MakeLongVideo

# unwrap checkpoint first
TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch train.py --config configs/makelongvideo.yaml --unwrap ./outputs/makelongvideo/checkpoint-5200

Run inference directly:

python3 infer.py  --width 256 --height 256 --prompt "a panda is surfing"

Run inference with latents initialized from a sample video:

python3 infer.py  --width 256 --height 256 --prompt "a panda is surfing" --sample_video_path your_sample_video

Run inference with a sample frame rate of 6 (the actual frame rate is 24/6 == 4):

python3 infer.py  --width 256 --height 256 --prompt "a panda is surfing" --speed 6
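
The relation between --speed and the resulting frame rate can be spelled out as follows. This only illustrates the arithmetic stated above (24/6 == 4) and assumes a base frame rate of 24 fps; effective_fps is a hypothetical helper, not a function from infer.py.

```python
def effective_fps(speed: int, base_fps: int = 24) -> float:
    """Frame rate obtained when sampling at the given speed.

    Assumes the base frame rate is divided by the speed value,
    as in the example above (24 / 6 == 4).
    """
    if speed <= 0:
        raise ValueError("speed must be a positive integer")
    return base_fps / speed


if __name__ == "__main__":
    for speed in (1, 2, 6):
        print(f"--speed {speed} -> {effective_fps(speed):g} fps")
```

Under this assumption, larger --speed values give a lower actual frame rate.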

Todo

  • generate 24-frame videos at 256x256
  • add fps control
  • release pretrained checkpoint
  • remove watermark
  • improve resolution to 512x512
  • 1-2 minute video generation
  • make story videos

References

Citations

@misc{Singer2022,
    author  = {Uriel Singer},
    url     = {https://makeavideo.studio/Make-A-Video.pdf}
}
@article{wu2022tuneavideo,
    title   = {Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author  = {Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal = {arXiv preprint arXiv:2212.11565},
    year    = {2022},
    note    = {under review}
}
