Quora Question pair similarity

Motivation

Quora is a place to gain and share knowledge about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. A large number of questions are added to Quora every month, and many of them share the same intent. These duplicates make seekers spend more time finding the best answer and make writers feel they need to answer multiple versions of the same question.

⚪Task:

In this case study we have to identify whether a pair of questions is a duplicate or not.
Duplicate questions can then be removed so that the user can find the best answer already available.
This is a binary classification problem, and the metric used is the ROC-AUC score.

Acknowledgements

Data Overview

The dataset contains question IDs, question pairs, and the target variable, where (0) = not a duplicate pair and (1) = a duplicate pair of questions.

The dataset is slightly imbalanced, as we can see below:

Below is a snapshot of how the data looks initially:

The following are examples of duplicate and non-duplicate pairs of questions:

A small segment of questions is repeated across pairs. The question IDs also give us the number of unique questions asked.

Feature Extractions

We have two stages for feature extraction, the first stage involves extracting "basic features" and the latter stage consists of "advanced features".

The basic features are:

Length of each question (number of words in the question).
Common words between a pair of questions (common_word_count).
Word_share: common_word_count / total number of words.
cwc_min & cwc_max: ratio of common_word_count to the minimum and maximum word count of Q1 and Q2, respectively.
Whether the first word of both questions is equal (1) or not (0).
Whether the last word of both questions is equal (1) or not (0).
Difference in the number of words between the two questions.
Average number of words in a question pair.
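
The basic features can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the function name `basic_features`, the whitespace tokenization, the use of unique words for the common-word count, and taking "total number of words" as the combined word count of both questions are all assumptions.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def basic_features(q1: str, q2: str) -> dict:
    """Compute the basic pairwise features for two questions (hypothetical helper)."""
    t1, t2 = q1.lower().split(), q2.lower().split()
    len1, len2 = len(t1), len(t2)
    # Assumption: common words are counted over unique lowercase tokens.
    common = len(set(t1) & set(t2))
    both_non_empty = bool(t1) and bool(t2)
    return {
        "q1_len": len1,
        "q2_len": len2,
        "common_words": common,
        # Assumption: "total number of words" = words in Q1 + words in Q2.
        "word_share": _ratio(common, len1 + len2),
        "cwc_min": _ratio(common, min(len1, len2)),
        "cwc_max": _ratio(common, max(len1, len2)),
        "first_word_eq": int(both_non_empty and t1[0] == t2[0]),
        "last_word_eq": int(both_non_empty and t1[-1] == t2[-1]),
        "len_diff": abs(len1 - len2),
        "mean_len": (len1 + len2) / 2,
    }


features = basic_features("How do I learn Python", "How can I learn Java")
```

Empty questions produce zeros rather than raising, which matches replacing a missing question with an empty string.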

We now add the above features to the dataset:

Advanced features (fuzzywuzzy library)

See the fuzzywuzzy library documentation for details on these ratios.

Longest substring ratio: (length of longest common substring) / min(length of question 1, length of question 2).
Partial ratio.
Token sort ratio.
Token set ratio.
WRatio.
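
The longest substring ratio is the only feature in this list given by an explicit formula, so it is the one sketched here. This is a minimal version using only the standard library, assuming character-level matching and character lengths; the function name `longest_substring_ratio` is hypothetical. The other four scores correspond to `fuzz.partial_ratio`, `fuzz.token_sort_ratio`, `fuzz.token_set_ratio`, and `fuzz.WRatio` in fuzzywuzzy and are not reimplemented.

```python
from difflib import SequenceMatcher


def longest_substring_ratio(q1: str, q2: str) -> float:
    """Length of the longest common substring divided by the shorter question's length."""
    if not q1 or not q2:
        return 0.0
    # autojunk=False disables difflib's heuristic that ignores very frequent characters.
    matcher = SequenceMatcher(None, q1, q2, autojunk=False)
    match = matcher.find_longest_match(0, len(q1), 0, len(q2))
    return match.size / min(len(q1), len(q2))


ratio = longest_substring_ratio("what is python", "what is java")
```

Identical questions score 1.0, and questions with no characters in common score 0.0.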

We add these advanced features to the dataset:

Data Visualisation

Pair plots for advanced features:

Wordcloud for duplicate pairs: This shows which words contribute most to a question pair being labelled "duplicate".

Wordcloud for non-duplicate pairs: The words below occur most often in "non-duplicate" question pairs.

How well the WRatio and partial ratio features identify duplicate pairs:

Data Cleaning

⚪Missing Values:

Since this is a text data problem, missing values have no serious effect. We found one question with a NaN value and replaced it with an empty string.

⚪Cleaning Text data:

Stopwords, punctuation, and numeric values are removed.
Certain words are taken out of the list of stopwords so that they stay in the text and preserve the meaning of the question.
Contracted words are expanded (won't ---> will not).
Lemmatization is used to return the dictionary form of words.
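
A minimal sketch of these cleaning steps, under stated assumptions: the `CONTRACTIONS` map and `STOPWORDS` set below are tiny hypothetical stand-ins for the full lists, "not" is deliberately left out of the stopwords as an example of a word kept to preserve meaning, and lemmatization is omitted because it needs an external NLP library.

```python
import re

# Hypothetical, deliberately small lookup tables for illustration only.
CONTRACTIONS = {"won't": "will not", "can't": "cannot", "don't": "do not", "what's": "what is"}
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "my"}  # "not" is kept on purpose


def clean_question(text: str) -> str:
    """Lowercase, expand contractions, drop punctuation/digits, and remove stopwords."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = clean_question("Why won't my 2 laptops start?")
```

Expanding contractions before stripping punctuation matters: otherwise "won't" would be split into "won" and "t" and the negation would be lost.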

Data preprocessing

We use TF-IDF Word2Vec to convert sentences into vectors. It involves the following steps:

Apply TF-IDF vectorization on clean data.
Get the IDF values for each feature (max features = 1000).
Create a list of words from the entire corpus.
Form the W2V for each of these words (size = 200), so that each word is represented by a 200-dimensional vector.

Below we can see how this W2V is created for every word.

The TF-IDF W2V of a sentence is obtained by multiplying the W2V of each word by its TF-IDF value, summing the resulting vectors, and finally dividing the result by the sum of the TF-IDF values.
```
   TF-IDF W2V = Σ [TFIDF(Wi) * W2V(Wi)] / Σ TFIDF(Wi)
   Wi = the i-th word of the sentence.
```

We can see the code snippet for this below:
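
A minimal sketch of this weighting in plain Python, not the notebook's own code. The function name `tfidf_weighted_vector` and the toy `tfidf` and `w2v` dictionaries are hypothetical; in practice the weights would come from the fitted TF-IDF vectorizer and the vectors from the trained 200-dimensional Word2Vec model.

```python
def tfidf_weighted_vector(tokens, tfidf, w2v, dim):
    """Return sum(tfidf[w] * w2v[w]) / sum(tfidf[w]) over tokens known to both lookups."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        # Words missing from either lookup are skipped.
        if tok in tfidf and tok in w2v:
            weight = tfidf[tok]
            vector = w2v[tok]
            for k in range(dim):
                total[k] += weight * vector[k]
            weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: the sentence maps to the zero vector
    return [component / weight_sum for component in total]


# Toy 2-dimensional example (hypothetical values).
tfidf = {"learn": 2.0, "python": 1.0}
w2v = {"learn": [1.0, 0.0], "python": [0.0, 3.0]}
sentence_vector = tfidf_weighted_vector(["learn", "python", "fast"], tfidf, w2v, dim=2)
```

Dividing by the sum of the weights keeps the sentence vector on the same scale as the word vectors, regardless of sentence length.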

Best Model : LGBM

Comparisons of Model:
