# 🦜 Parrot: Multilingual Visual Instruction Tuning

<p align="center">
  <a href="#-introduction">🎉Introduction</a> •
  <a href="#-whats-new">📰What's New</a> •
  <a href="#%EF%B8%8F-install">☄️Install</a> •
  <a href="#-model">🦜Model</a> •
  <a href="#-train">🔥Train</a> •
  <a href="#-datasets">🌟Datasets</a> •
  <a href="#-mmmb">🎄MMMB</a> <br />
  <a href="#-evaluation">🔑Evaluation</a> •
  <a href="#-quick-start">📍Quick Start</a> •
  <a href="#-acknowledgement">👨🏫Acknowledgement</a> •
  <a href="#-contact">🤗Contact</a>
</p>

---

<p align="center">
  <a href=""><img src="https://img.shields.io/badge/Parrot-v1.0-darkcyan"></a>
  <a href='https://sun-hailong.github.io/projects/Parrot'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
  <a href='https://arxiv.org/abs/2406.02539'><img src='https://img.shields.io/badge/Arxiv-2406.02539-b31b1b.svg?logo=arXiv'></a>
  <a href=""><img src="https://img.shields.io/github/stars/AIDC-AI/Parrot?color=4fb5ee"></a>
  <a href=""><img src="https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FAIDC-AI%2FParrot&count_bg=%23FFA500&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=visitors&edge_flat=false"></a>
</p>

> Thanks to [Hai-Long Sun](https://github.com/sun-hailong) for his contributions to Parrot!

## 🎉 Introduction
Welcome to Parrot [[paper](https://arxiv.org/abs/2406.02539)], a novel method that uses textual guidance to drive visual token alignment at the language level. Parrot conditions the visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to promote the alignment of multilingual tokens. Moreover, given the current lack of benchmarks for evaluating multilingual capabilities in this field, we collect and release MMMB, a Massive Multilingual Multimodal Benchmark covering 6 languages, 15 categories, and 12,000 questions.

**If you find Parrot useful for your research and applications, please cite using this BibTeX:**
```bibtex
@article{sun2024parrot,
  title={Parrot: Multilingual Visual Instruction Tuning},
  author={Sun, Hai-Long and Zhou, Da-Wei and Li, Yang and Lu, Shiyin and Yi, Chao and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Zhan, De-Chuan and others},
  journal={arXiv preprint arXiv:2406.02539},
  year={2024}
}
```

## 📰 What's New
- [08/02] 🔥 We release the [code](https://github.com/AIDC-AI/Parrot), our in-house multilingual [dataset](https://huggingface.co/datasets/AIDC-AI/Parrot-dataset/tree/main/sharegpt_4v), the [MMMB](https://huggingface.co/datasets/AIDC-AI/Parrot-dataset/tree/main/mmmb) benchmark, and the [model](https://huggingface.co/AIDC-AI/Parrot-7B). Feel free to try them out!
- [06/05] 🔥 Parrot is coming! We release the [paper](https://arxiv.org/abs/2406.02539)!

## ☄️ Install

Please follow the instructions below to install the required packages.

1. Clone this repository and navigate to the Parrot folder:
```bash
git clone https://github.com/AIDC-AI/Parrot.git
cd Parrot
```

2. Install the package:
```bash
conda create -n parrot python=3.10 -y
conda activate parrot
pip install --upgrade pip
pip install -e .
```
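
A quick way to confirm the editable install worked is to import the package. This assumes the repository's top-level `parrot` package, which the paths elsewhere in this README suggest:

```python
# Minimal sanity check: the editable install should make the package importable.
import parrot  # noqa: F401

print("Parrot package imported successfully")
```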

### Upgrade to the latest code base

```bash
git pull
pip install -e . --no-deps
```

## 🦜 Model
Parrot is a multilingual multimodal large language model. We provide our fully finetuned models below:

| Model | Base LLM | Vision Encoder | Stage | Download |
| --- | --- | :---: | :---: | :---: |
| Parrot-7B | Qwen-1.5-7B-Chat | CLIP-ViT-Large-patch14-336 | Pretrain | [ckpt](https://huggingface.co/AIDC-AI/Parrot_S1_7B-Qwen15Clip) |
| Parrot-7B | Qwen-1.5-7B-Chat | CLIP-ViT-Large-patch14-336 | SFT | [ckpt](https://huggingface.co/AIDC-AI/Parrot_S2_7B-Qwen15Clip) |
| Parrot-14B | Qwen-1.5-14B-Chat | CLIP-ViT-Large-patch14-336 | Pretrain | [ckpt](https://huggingface.co/AIDC-AI/Parrot_S1_14B-Qwen15Clip) |
| Parrot-14B | Qwen-1.5-14B-Chat | CLIP-ViT-Large-patch14-336 | SFT | [ckpt](https://huggingface.co/AIDC-AI/Parrot_S2_14B-Qwen15Clip) |

<div align="center">
  <img src="./images/teaser.png" width="600px" />
</div>

## 🔥 Train

Parrot is trained in two stages: modality alignment and instruction tuning for multilingual alignment. Each stage's training script is provided in the `scripts` folder. Before starting the training, ensure you properly set the `ROOT` variable in the training script. Below are the commands to train Parrot for each stage:

```bash
bash scripts/train/pretrain.sh
bash scripts/train/finetune.sh
```

#### Hyperparameters
We use a set of hyperparameters similar to Vicuna's during finetuning. The hyperparameters used in pretraining and finetuning are provided below.

1. Pretraining

| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
| Parrot-7B | 256 | 1e-3 | 1 | 2048 | 0 |

2. Finetuning

| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
| Parrot-7B | 128 | 2e-5 | 1 | 2048 | 0 |

#### Download Qwen1.5-7B-Chat checkpoints

Our base model Qwen1.5-7B-Chat, which is an instruction-tuned chatbot, can be downloaded from [here](https://huggingface.co/Qwen/Qwen1.5-7B-Chat).
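
If you prefer to fetch the checkpoint programmatically, here is a minimal sketch using the `huggingface_hub` library (an extra install, not a declared dependency of this repo); the local directory is an example and should match whatever path your training script expects:

```python
# Sketch: download the base LLM with huggingface_hub
# (install with `pip install huggingface_hub` if needed).
from huggingface_hub import snapshot_download

# Example target directory; point it wherever your ROOT/scripts expect.
local_dir = snapshot_download(
    repo_id="Qwen/Qwen1.5-7B-Chat",
    local_dir="./checkpoints/Qwen1.5-7B-Chat",
)
print(f"Checkpoint downloaded to {local_dir}")
```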

## 🌟 Datasets

All training datasets are summarized in the Python file `parrot/train/utils/utils.py`. Each dataset is a collection of samples, where each sample consists of text and, optionally, an image. The text is embedded directly in the JSON file, while the image is represented by its filename, which refers to an image file under the corresponding `image_dir`.
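
To make the format concrete, here is a hedged sketch of loading one sample from a meta file, using the example folder structure shown further below. The `image`/`conversations` keys follow the LLaVA-style layout this codebase builds on (see Acknowledgement); treat the exact key names as an assumption and check the released JSON files if they differ:

```python
# Sketch: read one training sample from a meta JSON file and resolve its image.
# Assumes LLaVA-style keys ("image", "conversations"); verify against the real files.
import json
from pathlib import Path

meta_file = Path("mllm_datasets/meta_files/llava-pretrain-558k.json")  # example path
image_dir = Path("mllm_datasets/images/llava_pretrain")                # matching image dir

samples = json.loads(meta_file.read_text())
sample = samples[0]

# Text-only samples may have no "image" field, so guard the lookup.
image_path = image_dir / sample["image"] if "image" in sample else None
print(image_path, sample.get("conversations", [])[:1])
```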

We provide the JSON file for each training dataset on [Huggingface](https://huggingface.co/datasets/AIDC-AI/Parrot-dataset/tree/main/sharegpt_4v). The images can be downloaded from their respective sources listed below.

| Dataset name | Image dir | Image source |
|:-------------------------------|:---------------|:--------------------------------------------------------------|
| llava-pretrain-558k | llava_pretrain | https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain |
| laion-12k | parrot_laion | https://huggingface.co/datasets/AIDC-AI/Parrot-dataset |
| cc12m-645k | parrot_cc12m | https://huggingface.co/datasets/AIDC-AI/Parrot-dataset |
| llava-finetune-665k | llava_finetune | https://github.com/haotian-liu/LLaVA |
| sharegpt4v-sft-zh | multilingual_sft | https://huggingface.co/datasets/AIDC-AI/Parrot-dataset/tree/main/sharegpt_4v |
| sharegpt4v-sft-pt | multilingual_sft | https://huggingface.co/datasets/AIDC-AI/Parrot-dataset/tree/main/sharegpt_4v |
| sharegpt4v-sft-ar | multilingual_sft | https://huggingface.co/datasets/AIDC-AI/Parrot-dataset/tree/main/sharegpt_4v |
| sharegpt4v-sft-tr | multilingual_sft | https://huggingface.co/datasets/AIDC-AI/Parrot-dataset/tree/main/sharegpt_4v |
| sharegpt4v-sft-ru | multilingual_sft | https://huggingface.co/datasets/AIDC-AI/Parrot-dataset/tree/main/sharegpt_4v |

Below is an example of the folder structure. You can alter the folder structure as needed and modify the function `name2data` in `parrot/train/utils/utils.py` accordingly; a hypothetical sketch of such a mapping follows the tree.
```
|-- mllm_datasets
    |-- meta_files
        |-- llava-pretrain-558k.json
        |-- laion-12k.json
        |-- llava-finetune-665k.json
        ...
    |-- images
        |-- llava_pretrain
        |-- sharegpt4v
        |-- laion
        ...
```
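
The exact signature of `name2data` is not shown here, so the following is a purely hypothetical illustration of the kind of name-to-paths mapping you would keep in sync with your folder layout; the real function in `parrot/train/utils/utils.py` may differ:

```python
# Hypothetical illustration only: map each dataset name to its meta JSON file
# and image directory, mirroring the folder structure above.
from pathlib import Path

ROOT = Path("mllm_datasets")  # adjust to your own layout

NAME2DATA = {
    "llava-pretrain-558k": {
        "meta_file": ROOT / "meta_files" / "llava-pretrain-558k.json",
        "image_dir": ROOT / "images" / "llava_pretrain",
    },
    "laion-12k": {
        "meta_file": ROOT / "meta_files" / "laion-12k.json",
        "image_dir": ROOT / "images" / "laion",
    },
}
```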

## 🎄 MMMB

We provide the MMMB benchmark on [Huggingface](https://huggingface.co/datasets/AIDC-AI/Parrot-dataset/tree/main/mmmb). It contains 6 languages, 15 categories, and 12,000 questions. You can download the dataset and use it for your own experiments. The dataset is stored as TSV files, which makes it easy to evaluate with `VLMEvalKit`.
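
For a quick look at the data, here is a hedged sketch for loading one language's TSV with pandas (an extra install, not a repo dependency); the file name is illustrative, so check the actual names in the Huggingface repo:

```python
# Sketch: inspect an MMMB TSV file with pandas (pip install pandas).
import pandas as pd

# Illustrative file name; use the actual TSV names from the Huggingface dataset.
df = pd.read_csv("mmmb/english.tsv", sep="\t")

print(df.shape)    # number of questions x number of fields
print(df.columns)  # inspect the schema rather than assuming column names
print(df.head(2))
```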

<div align="center">
  <img src="./images/mmmb.png" width="600px" />
</div>

## 🔑 Evaluation

> We use the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate MLLMs.

To evaluate the multilingual capabilities of Parrot, we compare it comprehensively with state-of-the-art approaches on multilingual benchmarks. Additionally, we compare Parrot with leading models across a range of multimodal tasks. To ensure reproducibility, we evaluate the models using VLMEvalKit. You can find the evaluation script in `VLMEvalKit/run.sh`. **Before running the script, please replace the paths related to the model and the dataset in the script.**

<div align="center">
  <img src="./images/performance_table.png" width="600px" />
</div>

<div align="center">
  <img src="./images/performance.png" width="300px" />
</div>

## 📍 Quick Start

We provide a quick start demo in `parrot/deploy/runner.py`, which can be used as a template to run Parrot for inference.

1. Before running the demo, make sure you have downloaded the [Parrot checkpoint](https://huggingface.co/AIDC-AI/Parrot-7B) and the [CLIP checkpoint](https://huggingface.co/openai/clip-vit-large-patch14-336).
2. Replace the paths in `runner.py` with your local checkpoint paths.
3. Run the Python file; a minimal loading check for the CLIP checkpoint is sketched below.
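
Before launching the full demo, you can verify that the downloaded CLIP checkpoint loads via `transformers` (the class names below are standard `transformers` APIs; the local path is a placeholder for wherever you saved the checkpoint):

```python
# Sketch: confirm the CLIP vision tower loads before wiring paths into runner.py.
from transformers import CLIPImageProcessor, CLIPVisionModel

clip_path = "./checkpoints/clip-vit-large-patch14-336"  # placeholder local path

vision_tower = CLIPVisionModel.from_pretrained(clip_path)
image_processor = CLIPImageProcessor.from_pretrained(clip_path)

print(type(vision_tower).__name__, "loaded with",
      sum(p.numel() for p in vision_tower.parameters()) / 1e6, "M parameters")
```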

<div align="center">
  <img src="./images/example1.png" width="600px" />
</div>

<div align="center">
  <img src="./images/example2.png" width="600px" />
</div>

## 👨🏫 Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA): the codebase we built upon.
- [Qwen1.5-Chat](https://github.com/QwenLM/Qwen1.5): the LLM backbone we used.
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): the evaluation toolkit we used.

## 🤗 Contact

If you have any questions or want to propose new features, feel free to open an issue or contact the author: **Hai-Long Sun** ([[email protected]](mailto:[email protected])). Enjoy the code!

## 🚀 Star History

[![Star History Chart](https://api.star-history.com/svg?repos=AIDC-AI/Parrot&type=Date)](https://star-history.com/#AIDC-AI/Parrot&Date)