During the image-based training stage, our dataset comprises 558K image-caption pairs from LAION-CCSBU and 708K image-caption pairs from the ALLaVA-4V-Caption dataset, for a total of 1.26M image-caption pairs for pretraining.
The instruction-tuning data comprises the 665K mixture dataset from LLaVA-Instruct, 692K instructions from the ALLaVA-4V-Instruction dataset, and an additional 25K instructions drawn from ShareGPT4V, DocVQA, DVQA, and AI2D, totaling more than 1.3M image-text conversations.
LLaVA-1.5 pretrain images -> data/LLaVA-Pretrain/images
ALLaVA-4V-LAION and ALLaVA-4V-Vision-FLAN images -> data/allava_laion/images, data/allava_vflan/images
COCO -> data/coco/train2017
GQA -> data/gqa/images
OCR-VQA -> data/ocr_vqa/images
TextVQA -> data/textvqa/train_images
VG-Part1, VG-Part2 -> data/vg/VG_100K, data/vg/VG_100K_2
Web-Celebrity, Web-Landmark, WikiArt, Share-TextVQA (from ShareGPT-4V) -> data/web-celebrity/images, data/web-landmark/images, data/wikiart/images, data/share_textvqa/images
AI2D -> data/ai2d/images
DocVQA -> data/docvqa/images
DVQA -> data/dvqa/images
The complete structure is as follows:
MG-LLAVA
├── data
│   ├── LLaVA-Pretrain
│   │   ├── images
│   ├── ai2d
│   │   ├── images
│   ├── allava_laion
│   │   ├── images
│   ├── allava_vflan
│   │   ├── images
│   ├── coco
│   │   ├── train2017
│   ├── docvqa
│   │   ├── images
│   ├── dvqa
│   │   ├── images
│   ├── gqa
│   │   ├── images
│   ├── ocr_vqa
│   │   ├── images
│   ├── share_textvqa
│   │   ├── images
│   ├── textvqa
│   │   ├── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   ├── web-celebrity
│   │   ├── images
│   ├── web-landmark
│   │   ├── images
│   ├── wikiart
│   │   ├── images
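Once the datasets are in place, a quick check that every expected folder exists can save a failed training run. The short script below is a sketch (not part of the repository) that mirrors the tree above and assumes it is run from the MG-LLAVA root:

```python
import os

# Expected image folders, mirroring the directory tree above.
EXPECTED_DIRS = [
    "data/LLaVA-Pretrain/images",
    "data/ai2d/images",
    "data/allava_laion/images",
    "data/allava_vflan/images",
    "data/coco/train2017",
    "data/docvqa/images",
    "data/dvqa/images",
    "data/gqa/images",
    "data/ocr_vqa/images",
    "data/share_textvqa/images",
    "data/textvqa/train_images",
    "data/vg/VG_100K",
    "data/vg/VG_100K_2",
    "data/web-celebrity/images",
    "data/web-landmark/images",
    "data/wikiart/images",
]

missing = [d for d in EXPECTED_DIRS if not os.path.isdir(d)]
if missing:
    print("Missing directories:")
    for d in missing:
        print(f"  {d}")
else:
    print("All expected data directories are in place.")
```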
We employ RAM-Plus and OWL-ViT to generate bounding boxes for training and evaluation. Our training annotation files and bounding-box annotation files are available on Hugging Face. Please download them and modify data_path and box_json_path in your config file accordingly.
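For reference, the relevant fields might be set as in the sketch below. The file names are placeholders for the annotation files downloaded from Hugging Face, and the surrounding config structure is omitted, so this is not the repository's exact config:

```python
# Hypothetical excerpt from a training config; the file names below are
# placeholders for the annotation files downloaded from Hugging Face.
data_path = 'data/mg_llava_train_annotations.json'   # training annotation file
box_json_path = 'data/mg_llava_train_bboxes.json'    # bounding-box annotation file
```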
If you want to generate the bounding boxes yourself, refer to image_offline_to_bbox.py: download the RAM-Plus and OWL-ViT2 models, modify data_file, image_folder, and save_json_path, then run the following command:
torchrun --nproc_per_node=8 mg_llava/bbox_generation/image_offline_to_bbox.py
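The script above is the reference implementation; the sketch below only illustrates the general idea of the two-stage pipeline (open-set tagging followed by open-vocabulary detection). It uses the Hugging Face OWL-ViT API and assumes the RAM-Plus tags for each image are already available as a list of strings; the checkpoint ID, score threshold, and JSON layout are illustrative, not the repository's exact settings:

```python
import json
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Open-vocabulary detector; the checkpoint ID is illustrative.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32").eval()


@torch.no_grad()
def boxes_for_image(image_path, tags, score_threshold=0.2):
    """Detect boxes in one image for the given tags (assumed to come from RAM-Plus)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[tags], images=image, return_tensors="pt")
    outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=score_threshold, target_sizes=target_sizes
    )[0]
    return [
        {"label": tags[label], "box": [round(v, 1) for v in box.tolist()], "score": float(score)}
        for score, label, box in zip(results["scores"], results["labels"], results["boxes"])
    ]


# Example usage: hypothetical image and tag list.
annotations = {"example.jpg": boxes_for_image("example.jpg", ["dog", "frisbee", "grass"])}
with open("bboxes.json", "w") as f:
    json.dump(annotations, f)
```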
Most of the evaluation benchmarks used in our paper can be found in LLaVA.
The bounding-box annotation files for evaluation are also available on Hugging Face.
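One convenient way to fetch these annotation files is huggingface_hub; the repository ID below is a placeholder and should be replaced with the actual MG-LLaVA annotation repository:

```python
from huggingface_hub import snapshot_download

# Placeholder repo_id: replace with the MG-LLaVA annotation repository on Hugging Face.
snapshot_download(
    repo_id="<org>/MG-LLaVA-annotations",
    repo_type="dataset",
    local_dir="data",
)
```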