
# VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling

[arXiv: 2408.01181](https://arxiv.org/abs/2408.01181)

Qian Zhang, Xiangzi Dai, Ninghua Yang, Xiang An, Ziyong Feng, Xingyu Ren
Institute of Applied Physics and Computational Mathematics, DeepGlint, Shanghai Jiao Tong University

Some examples of text-conditional generation:


Some examples of class-conditional generation:


## TODO

- Released pre-trained model on ImageNet.
- Released training code.
- Released arXiv paper.
- Training T2I on the ImageNet dataset has been completed.
- Class-conditional training on the ImageNet dataset has been completed.

## Getting Started

### Requirements

```bash
pip install -r requirements.txt
```

### Download Pretrained Models / Dataset

1. Place the downloaded ImageNet train/val splits under **train/** and **val/** in the directory **./imagenet/** (see the layout sketch below).
2. Download the **CLIP and VAE** pretrained models and put them in **pretrained/**.
3. Download the **VAR_CLIP_d16** pretrained model and put it in **local_output/**.

- Download CLIP_L14
- Download VAE
- Download VAR_CLIP model weights
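
With the files in place, the directory layout should look roughly like this (only the paths named above come from the steps; the rest is your checkout):

```
.
├── imagenet/
│   ├── train/        # ImageNet training split
│   └── val/          # ImageNet validation split
├── pretrained/       # CLIP and VAE pretrained weights
└── local_output/     # VAR_CLIP_d16 weights
```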

## Training Scripts

```bash
# Training VAR-CLIP-d16 for 1000 epochs on ImageNet 256x256 takes about 4.1 days on 64 A100s.
# Before running, configure the IP addresses of the machines and data_path in run.py.
python run.py
```
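
For orientation only, the multi-machine setup in run.py amounts to pointing the launcher at each node and at the dataset root. The names below are hypothetical placeholders, not the actual variables in run.py:

```python
# Hypothetical sketch -- consult run.py for the real configuration names.
NODE_IPS = ["10.0.0.1", "10.0.0.2"]  # one entry per training machine
DATA_PATH = "./imagenet/"            # root directory containing train/ and val/
```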

## Demo Scripts

```bash
# After training completes, run demo_sample.py to get text-conditional generation results.
python demo_sample.py
```
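
Under the hood, the text condition comes from CLIP embeddings. Below is a minimal sketch of the text-encoding step alone, assuming the OpenAI `clip` package and the ViT-L/14 weights linked above; the full sampling pipeline lives in demo_sample.py:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)

# Encode a prompt into the CLIP text-embedding space used for conditioning.
tokens = clip.tokenize(["a photo of a golden retriever"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)  # shape [1, 768] for ViT-L/14
```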

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Citations

```bibtex
@misc{zhang2024varclip,
      title={VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling},
      author={Qian Zhang and Xiangzi Dai and Ninghua Yang and Xiang An and Ziyong Feng and Xingyu Ren},
      year={2024},
      journal={arXiv:2408.01181},
}
```