This repository is built in association with our position paper "Multimodality for NLP-Centered Applications: Resources, Advances and Frontiers".
As part of this release, we share information about recent multimodal datasets that are available for research purposes.
We found that although more than 100 multimodal language resources are available in the literature for various NLP tasks, publicly available multimodal datasets remain under-explored for reuse in subsequent problem domains.
- Sentiment Analysis
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
EmoDB | A Database of German Emotional Speech | Paper | Dataset |
VAM | The Vera am Mittag German Audio-Visual Emotional Speech Database | Paper | Dataset |
IEMOCAP | IEMOCAP: interactive emotional dyadic motion capture database | Paper | Dataset |
Mimicry | A Multimodal Database for Mimicry Analysis | Paper | Dataset |
YouTube | Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web | Paper | Dataset |
HUMAINE | The HUMAINE database | Paper | Dataset |
Large Movies | Sentiment classification on Large Movie Review | Paper | Dataset |
SEMAINE | The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent | Paper | Dataset |
AFEW | Collecting Large, Richly Annotated Facial-Expression Databases from Movies | Paper | Dataset |
SST | Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank | Paper | Dataset |
ICT-MMMO | YouTube Movie Reviews: Sentiment Analysis in an AudioVisual Context | Paper | Dataset |
RECOLA | Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions | Paper | Dataset |
MOUD | Utterance-Level Multimodal Sentiment Analysis | Paper | |
CMU-MOSI | MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos | Paper | Dataset |
POM | Multimodal Analysis and Prediction of Persuasiveness in Online Social Multimedia | Paper | Dataset |
MELD | MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations | Paper | Dataset |
CMU-MOSEI | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | Paper | Dataset |
AMMER | Towards Multimodal Emotion Recognition in German Speech Events in Cars using Transfer Learning | Paper | On Request |
SEWA | SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild | Paper | Dataset |
Fakeddit | r/fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection | Paper | Dataset |
CMU-MOSEAS | CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French | Paper | Dataset |
MultiOFF | Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text | Paper | Dataset |
MEISD | MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations | Paper | Dataset |
TASS | Overview of TASS 2020: Introducing Emotion | Paper | Dataset |
CH SIMS | CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality | Paper | Dataset |
Creep-Image | A Multimodal Dataset of Images and Text | Paper | Dataset |
Entheos | Entheos: A Multimodal Dataset for Studying Enthusiasm | Paper | Dataset |
- Machine Translation
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
Multi30K | Multi30K: Multilingual English-German Image Descriptions | Paper | Dataset |
How2 | How2: A Large-scale Dataset for Multimodal Language Understanding | Paper | Dataset |
MLT | Multimodal Lexical Translation | Paper | Dataset |
IKEA | A Visual Attention Grounding Neural Model for Multimodal Machine Translation | Paper | Dataset |
Flickr30K (EN–hi-IN) | Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data | Paper | On Request |
Hindi Visual Genome | Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation | Paper | Dataset |
HowTo100M | Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models | Paper | Dataset |
- Information Retrieval
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
MUSICLEF | MusiCLEF: a Benchmark Activity in Multimodal Music Information Retrieval | Paper | Dataset |
Moodo | The Moodo dataset: Integrating user context with emotional and color perception of music for affective music information retrieval | Paper | Dataset |
ALF-200k | ALF-200k: Towards Extensive Multimodal Analyses of Music Tracks and Playlists | Paper | Dataset |
MQA | Can Image Captioning Help Passage Retrieval in Multimodal Question Answering? | Paper | Dataset |
WAT2019 | WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset | Paper | Dataset |
ViTT | Multimodal Pretraining for Dense Video Captioning | Paper | Dataset |
MTD | MTD: A Multimodal Dataset of Musical Themes for MIR Research | Paper | Dataset |
MusiClef | A professionally annotated and enriched multimodal data set on popular music | Paper | Dataset |
Schubert Winterreise | Schubert Winterreise dataset: A multimodal scenario for music analysis | Paper | Dataset |
WIT | WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Paper | Dataset |
- Question Answering
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
MQA | A Dataset for Multimodal Question Answering in the Cultural Heritage Domain | Paper | - |
MovieQA | MovieQA: Understanding Stories in Movies through Question-Answering | Paper | Dataset |
PororoQA | DeepStory: Video Story QA by Deep Embedded Memory Networks | Paper | Dataset |
MemexQA | MemexQA: Visual Memex Question Answering | Paper | Dataset |
VQA | Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering | Paper | Dataset |
TDIUC | An analysis of visual question answering algorithms | Paper | Dataset |
TGIF-QA | TGIF-QA: Toward spatio-temporal reasoning in visual question answering | Paper | Dataset |
MSVD QA, MSRVTT QA | Video question answering via attribute augmented attention network learning | Paper | Dataset |
YouTube2Text | Video Question Answering via Gradually Refined Attention over Appearance and Motion | Paper | Dataset |
MovieFIB | A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering | Paper | Dataset |
Video Context QA | Uncovering the temporal context for video question answering | Paper | Dataset |
MarioQA | MarioQA: Answering Questions by Watching Gameplay Videos | Paper | Dataset |
TVQA | TVQA: Localized, Compositional Video Question Answering | Paper | Dataset |
VQA-CP v2 | Don’t just assume; look and answer: Overcoming priors for visual question answering | Paper | Dataset |
RecipeQA | RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes | Paper | Dataset |
GQA | GQA: A new dataset for real-world visual reasoning and compositional question answering | Paper | Dataset |
Social IQ | Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence | Paper | Dataset |
MIMOQA | MIMOQA: Multimodal Input Multimodal Output Question Answering | Paper | - |
- Summarization
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
SumMe | Creating Summaries from User Videos | Paper | Dataset |
TVSum | TVSum: Summarizing Web Videos Using Titles | Paper | Dataset |
QFVS | Query-focused video summarization: Dataset, evaluation, and a memory network based approach | Paper | Dataset |
MMSS | Multi-modal Sentence Summarization with Modality Attention and Image Filtering | Paper | - |
MSMO | MSMO: Multimodal Summarization with Multimodal Output | Paper | - |
Screen2Words | Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning | Paper | Dataset |
AVIATE | See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization | Paper | Dataset |
Multimodal Microblog Summarization | On Multimodal Microblog Summarization | Paper | - |
- Human Computer Interaction
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
CUAVE | CUAVE: A New Audio-Visual Database for Multimodal Human-Computer Interface Research | Paper | Dataset |
MHAD | Berkeley MHAD: A Comprehensive Multimodal Human Action Database | Paper | Dataset |
Multi-party interactions | A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction | Paper | - |
MHHRI | Multimodal human-human-robot interactions (mhhri) dataset for studying personality and engagement | Paper | Dataset |
Red Hen Lab | Red Hen Lab: Dataset and Tools for Multimodal Human Communication Research | Paper | - |
EMRE | Generating a Novel Dataset of Multimodal Referring Expressions | Paper | Dataset |
Chinese Whispers | Chinese whispers: A multimodal dataset for embodied language grounding | Paper | Dataset |
uulmMAC | The uulmMAC database—A multimodal affective corpus for affective computing in human-computer interaction | Paper | Dataset |
- Semantic Analysis
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
WN9-IMG | Image-embodied Knowledge Representation Learning | Paper | Dataset |
Wikimedia Commons | A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions | Paper | Dataset |
Starsem18-multimodalKB | A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning | Paper | Dataset |
MUStARD | Towards Multimodal Sarcasm Detection | Paper | Dataset |
YouMakeup | YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension | Paper | Dataset |
MDID | Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts | Paper | Dataset |
Social media posts from Flickr (Mental Health) | Inferring Social Media Users’ Mental Health Status from Multimodal Information | Paper | Dataset |
Twitter MEL | Building a Multimodal Entity Linking Dataset From Tweets | Paper | Dataset |
MultiMET | MultiMET: A Multimodal Dataset for Metaphor Understanding | Paper | - |
MSDS | Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline | Paper | Dataset |
- Miscellaneous
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
MS COCO | Microsoft COCO: Common objects in context | Paper | Dataset |
ILSVRC | ImageNet Large Scale Visual Recognition Challenge | Paper | Dataset |
YFCC100M | YFCC100M: The new data in multimedia research | Paper | Dataset |
COGNIMUSE | COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization | Paper | Dataset |
SNAG | SNAG: Spoken Narratives and Gaze Dataset | Paper | Dataset |
UR-Funny | UR-FUNNY: A Multimodal Language Dataset for Understanding Humor | Paper | Dataset |
Bag-of-Lies | Bag-of-Lies: A Multimodal Dataset for Deception Detection | Paper | Dataset |
MARC | A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks | Paper | Dataset |
MuSE | MuSE: a Multimodal Dataset of Stressed Emotion | Paper | Dataset |
BabelPic | Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts | Paper | Dataset |
Eye4Ref | Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations | Paper | - |
Troll Memes | A Dataset for Troll Classification of TamilMemes | Paper | Dataset |
SEMD | EmoSen: Generating sentiment and emotion controlled responses in a multimodal dialogue system | Paper | - |
Chat talk Corpus | Construction and Analysis of a Multimodal Chat-talk Corpus for Dialog Systems Considering Interpersonal Closeness | Paper | - |
EMOTyDA | Towards Emotion-aided Multi-modal Dialogue Act Classification | Paper | Dataset |
MELINDA | MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification | Paper | Dataset |
NewsCLIPpings | NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media | Paper | Dataset |
R2VQ | Designing Multimodal Datasets for NLP Challenges | Paper | Dataset |
M2H2 | M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations | Paper | Dataset |