MS MARCO Open Data Set

Microsoft is trying to help create machines that can have conversations by releasing a new set of data for free.

The data, called the Microsoft Machine Reading Comprehension dataset (MS MARCO), is a bundle of 100,000 English queries with corresponding answers. It is intended to help people build artificial intelligence systems that can understand written human language.

The queries in MS MARCO are based on anonymized questions that were submitted to Microsoft’s Bing search engine and Cortana virtual assistant. The answers are based on information found online, written by humans and checked for accuracy.

The data set can be downloaded from the link.

HTC-Vive Demo Project

Imagine a scenario where a medical student is operating on a patient and, during the operation, wants to ask his or her supervisor a question. Or an engineering student wants to see the condition of an aircraft engine while it is running at an altitude of 10,000 feet and a speed of 320 km/hr, and needs answers to various questions about aircraft engineering. Beyond these examples, we can create any scenario where the physical presence of a human is impossible.

We can create all such scenarios, and a virtual agent can answer the user's queries. As a proof of concept, the MS MARCO dataset is used, and as an initial step the models are built on a subset of the data.

Python Notebook Description

This section briefly describes the process followed, along with the corresponding IPython notebooks. Instead of using the whole dataset, only the subset with "yes" and "no" answers is filtered out for machine learning; the subsetting code is in Marco subsetting yes_no answers.ipynb. On this subset, we trained a Gradient Boosting Machine and a Deep Learning model on bag-of-words and tf-idf features. The initial results and curves were poor. We found an imbalanced class distribution between "yes" and "no", so we increased the sample of "no" data by a fraction of 0.7. This raised validation AUC from roughly 0.52 to 0.56 up to 0.61 to 0.80, depending on the model and features. The following sections compare the results without and with handling of the class imbalance.
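The filtering and feature steps described above can be sketched in plain Python. This is only an illustration, not the code from the notebooks: the record layout (the `query` and `answers` field names), the sample records, and the helper names `yes_no_label`, `tokenize`, and `tfidf` are all assumptions, and the tf-idf weighting shown is one common unsmoothed variant that may differ from what the notebooks use.

```python
import math
import re
from collections import Counter

# Hypothetical records; the "query" and "answers" field names are assumptions.
records = [
    {"query": "is the sky blue", "answers": ["Yes"]},
    {"query": "what is tf-idf", "answers": ["A term weighting scheme."]},
    {"query": "is the moon made of cheese", "answers": ["No"]},
    {"query": "is water wet", "answers": ["yes"]},
]

def yes_no_label(record):
    """Return 'yes' or 'no' if the first answer is exactly that, else None."""
    answers = record.get("answers") or []
    if not answers:
        return None
    first = answers[0].strip().lower().rstrip(".")
    return first if first in ("yes", "no") else None

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Step 1: keep only the records whose answer is "yes" or "no".
subset = [(r["query"], yes_no_label(r)) for r in records if yes_no_label(r)]

# Step 2: bag of words, i.e. raw term counts per query.
bags = [Counter(tokenize(query)) for query, _ in subset]

# Step 3: tf-idf, using tf = count / query length and idf = log(N / df).
n_docs = len(bags)
doc_freq = Counter(term for bag in bags for term in bag)

def tfidf(bag):
    """Weight each term count by how rare the term is across all queries."""
    total = sum(bag.values())
    return {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in bag.items()}

tfidf_vectors = [tfidf(bag) for bag in bags]
```

With this weighting, a word that appears in every query (here "is") gets weight zero, while rarer words such as "sky" get the largest weights. That is the main difference from the raw bag-of-words counts.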

Without handling class imbalance:

Scoring history:

[Plots: scoring history for Gradient BM and Deep Learning, each on Bag of Words and tf-idf features]

ROC:

[Plots: ROC curves for Gradient BM and Deep Learning, each on Bag of Words and tf-idf features]

Area Under the Curve:

Model	Bag of Words	tf-idf
Gradient BM	Training AUC: 0.868, Validation AUC: 0.518	Training AUC: 0.987, Validation AUC: 0.526
Deep Learning	Training AUC: 0.927, Validation AUC: 0.556	Training AUC: 0.658, Validation AUC: 0.536
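To make the AUC figures above easier to read, here is a minimal sketch of how the area under the ROC curve can be computed from labels and model scores. It uses the pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the toy inputs are illustrative assumptions, not the evaluation code used in the notebooks.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels (1 = positive class)
    scores: iterable of model scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
mixed = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
uninformative = roc_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

A score of 0.5 means the model ranks examples no better than chance, so validation AUCs between 0.518 and 0.556 indicate that these models learned almost nothing that carries over to unseen data, despite their high training AUCs.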

With class imbalance handled:
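The rebalancing step can be sketched as random oversampling of the "no" class. The description only says the "no" sample was increased by a fraction of 0.7, so this sketch assumes that means adding randomly drawn copies (with replacement) equal to 70% of the original "no" count. The function name `oversample`, the row layout, and the class counts in the example are hypothetical.

```python
import random

def oversample(rows, label_key, minority_label, fraction, seed=0):
    """Return a shuffled copy of rows with extra minority-class samples.

    Adds round(fraction * n_minority) rows drawn with replacement from the
    minority class. The input list itself is left unchanged.
    """
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == minority_label]
    n_extra = int(round(fraction * len(minority)))
    extra = [rng.choice(minority) for _ in range(n_extra)] if minority else []
    result = rows + extra
    rng.shuffle(result)
    return result

# Hypothetical imbalanced data: 30 "yes" rows and 10 "no" rows.
rows = [{"query": f"q{i}", "label": "yes"} for i in range(30)]
rows += [{"query": f"q{i}", "label": "no"} for i in range(30, 40)]

balanced = oversample(rows, "label", "no", fraction=0.7)
n_no = sum(1 for r in balanced if r["label"] == "no")
n_yes = sum(1 for r in balanced if r["label"] == "yes")
```

In this toy example the "no" class grows from 10 to 17 rows while the "yes" class stays at 30. Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the validation data.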

Scoring history:

[Plots: scoring history for Gradient BM and Deep Learning, each on Bag of Words and tf-idf features]

ROC:

[Plots: ROC curves for Gradient BM and Deep Learning, each on Bag of Words and tf-idf features]

Area Under the Curve:

Model	Bag of Words	tf-idf
Gradient BM	Training AUC: 0.954, Validation AUC: 0.742	Training AUC: 0.998, Validation AUC: 0.772
Deep Learning	Training AUC: 0.998, Validation AUC: 0.802	Training AUC: 0.763, Validation AUC: 0.614

hamzafar/vr_chat
