🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page | Paper
A curated list of Multimodal Large Language Models (MLLMs), including datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, llm-aided visual reasoning, foundation models, and others. This list will be updated in real time. ✨
Welcome to join our WeChat group of MLLM communication!
Please add WeChat ID (wmd_rz_ustc) to join the group. 🌟
🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [Leaderboards] | Paper
Leaderboards of 16 advanced MLLMs, including BLIP-2, InstructBLIP, LLaVA, MiniGPT-4, mPLUG-Owl, LLaMA-Adapter V2, ImageBind_LLM, Otter, VisualGLM-6B, Multimodal-GPT, PandaGPT, VPGTrans, LaVIN, Lynx, Octopus, and LRV-Instruction.
If you want to add your model in our leaderboards, please feel free to email [email protected]. We will update the leaderboards in time. ✨
Download MME 🌟🌟
The benchmark dataset is collected by Xiamen University for academic research only. You can email [email protected] to obtain the dataset, according to the following requirement.
Requirement: A real-name system is encouraged for better academic communication. Your email suffix needs to match your affiliation, such as [email protected] and Xiamen University. Otherwise, you need to explain why. Please include the information bellow when sending your application email.
Name: (tell us who you are.)
Affiliation: (the name/url of your university or company)
Job Title: (e.g., professor, PhD, and researcher)
Email: (your email address)
How to use: (only for non-commercial use)
If you find our projects helpful to your research, please cite the following papers:
@article{yin2023survey,
title={A Survey on Multimodal Large Language Models},
author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
journal={arXiv preprint arXiv:2306.13549},
year={2023}
}
@article{fu2023mme,
title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Qiu, Zhenyu and Lin, Wei and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and Ji, Rongrong},
journal={arXiv preprint arXiv:2306.13394},
year={2023}
}
Table of Contents
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Generative Pretraining in Multimodality |
arXiv | 2023-07-11 | Github | Demo |
Kosmos-2: Grounding Multimodal Large Language Models to the World |
arXiv | 2023-06-26 | Github | Demo |
Transfer Visual Prompt Generator across LLMs |
arXiv | 2023-05-02 | Github | Demo |
GPT-4 Technical Report | arXiv | 2023-03-15 | - | - |
PaLM-E: An Embodied Multimodal Language Model | arXiv | 2023-03-06 | - | Demo |
Prismer: A Vision-Language Model with An Ensemble of Experts |
arXiv | 2023-03-04 | Github | Demo |
Language Is Not All You Need: Aligning Perception with Language Models |
arXiv | 2023-02-27 | Github | - |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
arXiv | 2023-01-30 | Github | Demo |
VIMA: General Robot Manipulation with Multimodal Prompts |
ICML | 2022-10-06 | Github | Local Demo |
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge |
NeurIPS | 2022-06-17 | Github | - |
Language Models are General-Purpose Interfaces |
arXiv | 2022-06-13 | Github | - |
Title | Venue | Date | Page |
---|---|---|---|
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | arXiv | 2023-08-07 | Coming soon |
Tiny LVLM-eHub: Early Multimodal Experiments with Bard |
arXiv | 2023-08-07 | Github |
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities |
arXiv | 2023-08-04 | Github |
MMBench: Is Your Multi-modal Model an All-around Player? |
arXiv | 2023-07-12 | Github |
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models |
arXiv | 2023-06-23 | Github |
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models |
arXiv | 2023-06-15 | Github |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark |
arXiv | 2023-06-11 | Github |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models |
arXiv | 2023-06-08 | Github |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Planting a SEED of Vision in Large Language Model |
arxiv | 2023-07-16 | Github | |
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Coming soon | - |
Contextual Object Detection with Multimodal Large Language Models |
arXiv | 2023-05-29 | Github | Demo |
Generating Images with Multimodal Language Models |
arXiv | 2023-05-26 | Github | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models |
arXiv | 2023-05-26 | Github | - |
Evaluating Object Hallucination in Large Vision-Language Models |
arXiv | 2023-05-17 | Github | - |
Grounding Language Models to Images for Multimodal Inputs and Outputs |
ICML | 2023-01-31 | Github | Demo |
Name | Paper | Link | Notes |
---|---|---|---|
MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data |
SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 3.2M informative instruction tuning data |
mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. |
LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding |
LRV-Instruction | Aligning Large Multi-Modal Model with Robust Instruction Tuning | Link | Visual instruction tuning dataset for addressing hallucination issue |
Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue |
LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | 100K high-quality video instruction dataset |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Coming soon | Multimodal in-context instruction tuning |
M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset |
LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks |
DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | Multimodal aligned dataset for improving model's usability and generation's fluency |
LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
Name | Paper | Link | Notes |
---|---|---|---|
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Coming soon | Multimodal in-context instruction dataset |
Name | Paper | Link | Notes |
---|---|---|---|
EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for explainable emotion reasoning task |
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |
Name | Paper | Link | Notes |
---|---|---|---|
I4 | Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | Link | A benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions |
SciGraphQA | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | Coming soon | A large-scale chart-visual question-answering dataset |
MM-Vet | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | Link | An evaluation benchmark that examines large multimodal models on complicated multimodal tasks |
MMBench | MMBench: Is Your Multi-modal Model an All-around Player? | Link | A systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models |
Lynx | What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Coming soon | A comprehensive evaluation benchmark including both image and video tasks |
MME | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Link | A comprehensive MLLM Evaluation benchmark |
LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Link | An evaluation platform for MLLMs |
LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A benchmark for evaluating the quantitative performance of MLLMs on various2D/3D vision tasks |
M3Exam | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | Link | A multilingual, multimodal, multilevel benchmark for evaluating MLLM |
OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Link | Dataset for evaluation on multiple capabilities |
Name | Paper | Link | Notes |
---|---|---|---|
IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Coming soon | A VQA dataset that focuses on asking information-seeking questions |