During the image-based training stage, our dataset comprises 558K image-caption pairs from LAION-CCSBU and 708K image-caption pairs from the ALLaVA-4V-Caption dataset, for a total of 1.26M image-caption pairs for pretraining.
The instruction-tuning data comprises the 665K mixture dataset from LLaVA-Instruct, 692K instructions from the ALLaVA-4V-Instruction dataset, and an additional 25K instructions drawn from ShareGPT4V, DocVQA, DVQA, and AI2D, totaling more than 1.3M image-text conversations.
LLaVA-1.5 pretrain images -> data/LLaVA-Pretrain/images
ALLaVA-4V-LAION and ALLaVA-4V-Vision-FLAN images -> data/allava_laion/images, data/allava_vflan/images
COCO -> data/coco/train2017
GQA -> data/gqa/images
OCR-VQA -> data/ocr_vqa/images
TextVQA -> data/textvqa/train_images
VG-Part1, VG-Part2 -> data/vg/VG_100K, data/vg/VG_100K_2
Web-Celebrity, Web-Landmark, WikiArt, Share-TextVQA (from ShareGPT-4V) -> data/web-celebrity/images, data/web-landmark/images, data/wikiart/images, data/share_textvqa/images
AI2D -> data/ai2d/images
DocVQA -> data/docvqa/images
DVQA -> data/dvqa/images
The complete structure is as follows:
MG-LLAVA
├── data
│   ├── LLaVA-Pretrain
│   │   ├── images
│   ├── ai2d
│   │   ├── images
│   ├── allava_laion
│   │   ├── images
│   ├── allava_vflan
│   │   ├── images
│   ├── coco
│   │   ├── train2017
│   ├── docvqa
│   │   ├── images
│   ├── dvqa
│   │   ├── images
│   ├── gqa
│   │   ├── images
│   ├── ocr_vqa
│   │   ├── images
│   ├── share_textvqa
│   │   ├── images
│   ├── textvqa
│   │   ├── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   ├── web-celebrity
│   │   ├── images
│   ├── web-landmark
│   │   ├── images
│   ├── wikiart
│   │   ├── images
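Once the datasets are in place, a quick check that every expected folder exists can save a failed training run. The short script below is a sketch (not part of the repository) that mirrors the tree above and assumes it is run from the MG-LLAVA root:

```python
import os

# Expected image folders, mirroring the directory tree above.
EXPECTED_DIRS = [
    "data/LLaVA-Pretrain/images",
    "data/ai2d/images",
    "data/allava_laion/images",
    "data/allava_vflan/images",
    "data/coco/train2017",
    "data/docvqa/images",
    "data/dvqa/images",
    "data/gqa/images",
    "data/ocr_vqa/images",
    "data/share_textvqa/images",
    "data/textvqa/train_images",
    "data/vg/VG_100K",
    "data/vg/VG_100K_2",
    "data/web-celebrity/images",
    "data/web-landmark/images",
    "data/wikiart/images",
]

missing = [d for d in EXPECTED_DIRS if not os.path.isdir(d)]
if missing:
    print("Missing directories:")
    for d in missing:
        print(f"  {d}")
else:
    print("All expected data directories are in place.")
```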
We employ RAM-Plus and OWL-ViT to generate bounding boxes for training and evaluation. Our training annotation files and bounding-box annotation files are available on Hugging Face. Please download them and modify data_path and box_json_path in your config file accordingly.
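For reference, the relevant fields might be set as in the sketch below. The file names are placeholders for the annotation files downloaded from Hugging Face, and the surrounding config structure is omitted, so this is not the repository's exact config:

```python
# Hypothetical excerpt from a training config; the file names below are
# placeholders for the annotation files downloaded from Hugging Face.
data_path = 'data/mg_llava_train_annotations.json'   # training annotation file
box_json_path = 'data/mg_llava_train_bboxes.json'    # bounding-box annotation file
```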
If you want to generate the bounding boxes yourself, refer to image_offline_to_bbox.py: download the RAM-Plus and OWL-ViT2 models, modify data_file, image_folder, and save_json_path, then run the following command:
torchrun --nproc_per_node=8 mg_llava/bbox_generation/image_offline_to_bbox.py
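The script above is the reference implementation; the sketch below only illustrates the general idea of the two-stage pipeline (open-set tagging followed by open-vocabulary detection). It uses the Hugging Face OWL-ViT API and assumes the RAM-Plus tags for each image are already available as a list of strings; the checkpoint ID, score threshold, and JSON layout are illustrative, not the repository's exact settings:

```python
import json
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Open-vocabulary detector; the checkpoint ID is illustrative.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32").eval()


@torch.no_grad()
def boxes_for_image(image_path, tags, score_threshold=0.2):
    """Detect boxes in one image for the given tags (assumed to come from RAM-Plus)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[tags], images=image, return_tensors="pt")
    outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=score_threshold, target_sizes=target_sizes
    )[0]
    return [
        {"label": tags[label], "box": [round(v, 1) for v in box.tolist()], "score": float(score)}
        for score, label, box in zip(results["scores"], results["labels"], results["boxes"])
    ]


# Example usage: hypothetical image and tag list.
annotations = {"example.jpg": boxes_for_image("example.jpg", ["dog", "frisbee", "grass"])}
with open("bboxes.json", "w") as f:
    json.dump(annotations, f)
```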
Most of the evaluation benchmarks used in our paper can be found in LLaVA.
The bounding-box annotation files for evaluation are also available on Hugging Face.
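One convenient way to fetch these annotation files is huggingface_hub; the repository ID below is a placeholder and should be replaced with the actual MG-LLaVA annotation repository:

```python
from huggingface_hub import snapshot_download

# Placeholder repo_id: replace with the MG-LLaVA annotation repository on Hugging Face.
snapshot_download(
    repo_id="<org>/MG-LLaVA-annotations",
    repo_type="dataset",
    local_dir="data",
)
```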