Task-Oriented Dialogue Dataset Survey

A survey of task-oriented dialogue datasets. The following information is included for each dataset:

Name
Introduction
Link (Download & Paper)
Multi or single turn
Task
Task detail
Whether Publicly Accessible
Size & Stats
Included Label
Missing Label

See Survey Here or in Excel File

Name	Introduction	Links	Multi/Single Turn	Task	Task Detail	Public Accessible	Size & Stats	Included Label	Missing Label
MultiWOZ 2.0	1. Proposed in the EMNLP 2018 best paper. 2. Largest at the time of this survey and covers multiple domains. 3. Human2Human. 4. Goal changes are encouraged.	Download: http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/ Paper: https://arxiv.org/pdf/1810.00278.pdf	M	Task Oriented	7 domains: Attraction, Hospital, Police, Hotel, Restaurant, Taxi, Train.	Yes	10,438 dialogues in total. The average number of turns is 8.93 for single-domain and 15.39 for multi-domain dialogues. 115,434 turns in total.	Belief state User Act(inform, request slots) Agent Act(inform, request slots)	NLU(Intent, Slots)
Medical DS	1. Collected from the pediatric department of a Chinese online healthcare community. 2. Task-oriented dialogue system for automatic diagnosis.	Download: http://www.sdspeople.fudan.edu.cn/zywei/data/acl2018-mds.zip Paper: http://www.sdspeople.fudan.edu.cn/zywei/paper/liu-acl2018.pdf	M	Task Oriented	Automatic Diagnosis	Yes	4 diseases, 67 symptoms	Slot Action
Snips	1. Collected by Snips for model evaluation. 2. For natural language understanding. 3. Homepage: https://medium.com/snips-ai/benchmarking-natural-language-understanding-systems-google-facebook-microsoft-and-snips-2b8ddcf9fb19	Download: https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines	S	Task Oriented	7 tasks, including weather, play music, search, add to list, book, movie	Yes	Train: 13,084 Test: 700 7 intents, 72 slot labels	Intent Slots
MIT Restaurant Corpus	1. The MIT Restaurant Corpus is a semantically tagged training and test corpus in BIO format. 2. For natural language understanding.	Download: https://groups.csail.mit.edu/sls/downloads/restaurant/	S	Task Oriented	Restaurant	Yes	Train: 6,894 Dev: 766 Test: 1,521	Slot	Intent
MIT Movie Corpus	1. The MIT Movie Corpus is a semantically tagged training and test corpus in BIO format. The eng corpus consists of simple queries, and the trivia10k13 corpus consists of more complex queries. 2. For natural language understanding.	Download: https://groups.csail.mit.edu/sls/downloads/movie/	S	Task Oriented	Movie	Yes	MIT Movie Eng: Train 8,798, Dev 977, Test 2,443. MIT Movie Trivia: Train 7,035, Dev 781, Test 1,953. Refer to: Data Augmentation for Spoken Language Understanding via Joint Variational Generation	Slot	Intent
ATIS	1. The ATIS (Airline Travel Information Systems) dataset (Tur et al., 2010) is widely used in SLU research. 2. For natural language understanding.	Download: 1. https://github.com/AtmaHou/Bi-LSTM_PosTagger/tree/master/data 2. https://github.com/yvchen/JointSLU/tree/master/data	S	Task Oriented	Airline Travel Information	Yes	Train: 4,478 Test: 893 120 slots and 21 intents	Intent Slots
Microsoft Dialogue Challenge	1. Contains human-annotated conversational data in three domains. 2. Provides an experiment platform with built-in simulators in each domain, for training and evaluation purposes.	Paper: https://arxiv.org/pdf/1807.11125.pdf	M	Task Oriented	Movie-Ticket Booking, Restaurant Reservation, Taxi Ordering	Yes	Movie-Ticket Booking: 11 intents, 29 slots, 2,890 dialogues. Restaurant Reservation: 11 intents, 30 slots, 4,103 dialogues. Taxi Ordering: 11 intents, 29 slots, 3,094 dialogues.	Intent Slots	Database API-call
CamRest676	The CamRest676 Human2Human dataset contains the following three JSON files: 1. CamRest676.json: the WOZ dialogue dataset, which contains the conversations between users and wizards, as well as a set of coarse labels for each user turn. 2. CamRestDB.json: the Cambridge restaurant database file, containing restaurants in the Cambridge, UK area and a set of attributes. 3. The ontology file, specifying all the values the three informable slots can take.	Download: https://www.repository.cam.ac.uk/handle/1810/260970 Paper: https://arxiv.org/abs/1604.04562	M	Task Oriented	Booking restaurant	Yes	676 dialogues in total, 1,500 turns in total. Train:Dev:Test 3:1:1 (test set not given)	Slot User Act(inform, request slots) Agent Act(inform, request slots)	Intent API call Database
Human-human goal oriented dataset	1. Maluuba released a travel booking dataset. 2. Designed for a new task: frame tracking (allows comparison between entities in the dialogue history). 3. Homepage: https://datasets.maluuba.com/Frames 4. Human2Human	Download: https://datasets.maluuba.com/Frames/dl Paper: https://arxiv.org/abs/1706.01690 https://1drv.ms/b/s!Aqj1OvgfsHB7dsg42yp2BzDUK6U	M	Task Oriented	Travel Booking	Yes	Dialogues: 1,369 Turns: 19,986 Average user satisfaction (from 1-5): 4.58	Frame User agenda User Act(inform, request slots) Agent Act(inform, request slots) API Call User's satisfaction Task successful Database Entity reference	Intent
Dialog bAbI tasks data	1. Facebook's task-oriented dialogue dataset, consisting of 6 different tasks. 2. The data for tasks 1-5 is constructed automatically from bot-to-bot chat (Bot2Bot). The data for task 6 is simply the reformatted DSTC2 dataset. 3. A shared database is included. 4. This is the only task-oriented dataset among the bAbI tasks. 5. Its goal is to evaluate end-to-end systems, so there are no intent or slot labels.	Download: https://research.fb.com/downloads/babi/ Paper: http://arxiv.org/abs/1605.07683	M	Task Oriented	Book a table at a restaurant	Yes	For each task: 1,000 training, 1,000 development, and 1,000 test dialogs. Tasks 1-5 also have a second test set (with suffix -OOV.txt) that contains dialogs including entities not present.	API call Full Database	Slot Intent User Act Agent Act
Stanford Dialog Dataset	1. Stanford NLP group's data for a car autopilot agent. 2. Human2Human 3. A quick intro: http://m.sohu.com/n/499803391/	Download: http://nlp.stanford.edu/projects/kvret/kvret_dataset_public.zip Paper: https://arxiv.org/abs/1705.05414	M	Task Oriented	car autopilot agent: schedule, weather, navigation	Yes	Training Dialogues: 2,425 Validation Dialogues: 302 Test Dialogues: 304 Avg. # of Utterances Per Dialogue: 5.25	Dialogue level database User Act(inform, request slots) Agent Act(inform, request slots)	API call Intent Slot
Stanford Dialog Dataset LU	1. The Stanford data relabeled by HIT with slots and intents. 2. Human2Human 3. A quick intro to the Stanford data: http://m.sohu.com/n/499803391/ 4. Annotation handbook: https://docs.google.com/document/d/1ROARKf8AJNnG2_nPINe1Xm5Rza7V0jPnQV8io09hcFY/edit	N/A	M	Task Oriented	car autopilot agent: schedule, weather, navigation	No	Training Dialogues: 2,425 Validation Dialogues: 302 Test Dialogues: 304 Avg. # of Utterances Per Dialogue: 5.25	Slot Intent	API call. Sample alignment with the original data is needed to get the following: Dialogue level database User Act(inform, request slots) Agent Act(inform, request slots) Agent Reply
DSTC-2	1. Human2Bot restaurant booking dataset. 2. For usage, refer to: http://camdial.org/~mh521/dstc/downloads/handbook.pdf 3. Each dialogue is stored in a separate folder, which contains a log and a label file.	http://camdial.org/~mh521/dstc/	M	Task Oriented	Booking restaurant	Yes	Train: 1,612 calls Dev: 506 calls Test: 1,117 dialogs	Slot User Act(inform, request slots) Agent Act(inform, request slots)	Intent API call Database
DSTC4	1. The data, named TourSG, consists of 35 dialog sessions on touristic information for Singapore, collected from Skype calls between three tour guides and 35 tourists. 2. All the recorded dialogs, with a total length of 21 hours, have been manually transcribed and annotated with speech act and semantic labels at the turn level. 3. Homepage: http://www.colips.org/workshop/dstc4/data.html 4. Human2Human	N/A	M	Task Oriented	Query touristic information	No	Train: 20 dialogs Test: 15 dialogs	speech act (User & Agent) semantic labels(Intent? User & Agent) topic for turn (Intent?)	N/A
Movie Booking Dataset	1. (Microsoft) Raw conversational data collected via Amazon Mechanical Turk, with annotations provided by domain experts. 2. Human2Human	Download: https://github.com/MiuLab/TC-Bot#data Paper: TC-bot	M	Task Oriented	Booking Movie	Yes	280 dialogues, approximately 11 turns per dialogue	User Act(inform, request slots) Agent Act(inform, request slots) Intent Slots	Database API-call
Lingxi	1. The data consists entirely of single-turn user inputs that have already been word-segmented. It is fairly noisy. 2. Part-of-speech tagging and slot labeling are completed. 3. Language: Chinese	N/A	S	Task Oriented	conversational robot service user log	No	Utterances: 5,132	Slot POS	Agent reply Intent API call Database
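
Several of the single-turn datasets above (for example the MIT Restaurant and MIT Movie corpora) provide slot labels in BIO format. As a rough illustration of what that means in practice, the sketch below groups BIO-tagged tokens into slot spans. The function name `bio_to_spans`, the example sentence, and the tag names are made up for illustration and are not taken from any of the corpora; real files may use a different column order or tag inventory.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (slot_name, slot_text) pairs.

    Hypothetical helper for illustration; not part of any dataset's tooling.
    """
    spans: List[Tuple[str, str]] = []
    current_slot = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the currently open span, if there is one.
        nonlocal current_slot, current_tokens
        if current_slot is not None:
            spans.append((current_slot, " ".join(current_tokens)))
        current_slot, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_slot, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_slot == tag[2:]:
            # An I- tag with the same slot name continues the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()
            if tag.startswith("I-"):
                # Be lenient: treat a stray I- tag as the start of a span.
                current_slot, current_tokens = tag[2:], [token]
    flush()
    return spans


# Made-up example sentence and tags (assumed label names).
example_tokens = ["show", "me", "cheap", "italian", "food", "near", "times", "square"]
example_tags = ["O", "O", "B-Price", "B-Cuisine", "O",
                "B-Location", "I-Location", "I-Location"]
print(bio_to_spans(example_tokens, example_tags))
```

Treating a stray `I-` tag as the start of a new span is one possible design choice for noisy annotations; a stricter loader could raise an error instead.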
