GitHub - Kaayrapetov/Classifying-Reddit-Posts: Comparing Performance Metrics of Four Classifiers on Labeling Text Data

Introduction

A significant portion of the data that exists in the world today is text written by humans to communicate with other humans. Computers cannot interpret human language directly, so this text must first be converted into a form they can process and analyze. Natural Language Processing (NLP) is a field of linguistics and computer science that provides the computational tools for analyzing human language. Combined with machine learning classifiers, NLP makes it possible to analyze and label human language data. This study compares a handful of machine learning classifiers in their ability to classify text data into the appropriate category.

Business Problem Statement

Larson Texts, Inc. is a company in Erie, PA that has been producing mathematics textbooks and educational material for high schools for over 45 years. The company is developing a subsidiary company, Big Ideas Learning, which will produce textbooks for elementary school students. The marketing department wants to use subreddits related to teaching to advertise its textbooks to teachers. We will use existing Reddit data from elementary school and high school teachers to train a model that categorizes a block of text as coming from one group or the other. We can then use the model to check that the text of an advertisement is classified into, and fits stylistically with, the intended education level.

Data

The data for this study was obtained from Reddit using the Pushshift API. The API was used to scrape the contents of posts on two subreddits: r/ElementaryTeachers and r/HighSchoolTeachers. These two subreddits were chosen because they are quite distinct in many ways but share the overarching theme of education. This allowed me to use the data in two ways: as is, which gives the models very distinct keywords for distinguishing between the two groups, or with subreddit-specific keywords removed, which makes the two groups more similar. About 24,000 posts were scraped in total. Of those, about 15,500 posts were from high school teachers and about 8,500 were from elementary school teachers.
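As a rough illustration of the keyword-removal step, the sketch below strips a list of subreddit-specific terms from a post. The keyword list and the `remove_keywords` helper are hypothetical, since the words actually removed in this study are not listed here.

```python
import re

# Hypothetical keyword list; the words actually removed are not stated in this README.
SUBREDDIT_KEYWORDS = ["elementary", "kindergarten", "high school", "highschool"]


def remove_keywords(text, keywords=SUBREDDIT_KEYWORDS):
    """Delete whole-word, case-insensitive matches of each keyword from text."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    stripped = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the extra whitespace left behind by the removed words.
    return " ".join(stripped.split())


cleaned = remove_keywords("My High School seniors and my kindergarten class")
print(cleaned)
```

With the giveaway words gone, a classifier has to rely on subtler vocabulary and style differences between the two groups.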

Models

The models I tested in this study are:

Logistic Regression Classifier, which classifies an observation based on its estimated probability of belonging to one class or the other.
Multinomial Naive Bayes Classifier, which applies Bayes' theorem under the "naive" assumption that features are independent of one another to determine which class an observation belongs to.
Support Vector Machine, which identifies the optimal decision boundary in a multi-dimensional feature space to classify observations.
Voting Classifier, an ensemble classifier in which an AdaBoost Classifier, a Gradient Boosting Classifier, and a Logistic Regression Classifier each contribute a vote toward the final prediction for each observation.
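The voting ensemble described in the list above could be assembled roughly as follows with scikit-learn. This is a sketch under assumptions: the toy posts are invented, and hard (majority) voting, the random seeds, and `max_iter` are my choices rather than the settings used in this study.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy posts invented for illustration (0 = elementary, 1 = high school).
texts = [
    "phonics games for my first graders",
    "kindergarten circle time songs",
    "recess and reading groups in second grade",
    "sight words practice for third grade",
    "AP calculus exam review ideas",
    "grading chemistry lab reports for sophomores",
    "seniors and college application essays",
    "algebra two homework policy for juniors",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

voter = VotingClassifier(
    estimators=[
        ("ada", AdaBoostClassifier(random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # assumption: each model casts one vote, majority wins
)

# The vectorizer turns text into word counts before the ensemble sees it.
model = make_pipeline(CountVectorizer(), voter)
model.fit(texts, labels)

predictions = model.predict(
    ["calculus homework for seniors", "kindergarten phonics songs"]
)
print(predictions)
```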

None of these models can process text data as is. Text data must first be converted into an appropriate numerical form using NLP tools that are referred to as transformers. The models were tested with two transformers, Count Vectorizer and TF-IDF Vectorizer, to find the combination that worked best.

Methodology

Data Acquisition and Cleaning

I acquired data from two subreddits, r/ElementaryTeachers and r/HighSchoolTeachers, using the Pushshift API. I pulled around 5000 posts from each subreddit using a custom function that used the API to scrape posts in batches of 100, going back in time.
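The batching logic can be sketched as below. This is not the custom function from this repository: the names `fetch_batch` and `scrape_posts` are hypothetical, and the endpoint URL and the `subreddit`, `size`, and `before` parameters are assumptions based on how the Pushshift API was publicly documented. The service may not be available in the same form today.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint, based on the publicly documented Pushshift API.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"


def fetch_batch(subreddit, before, size):
    """Request one batch of posts created before the `before` epoch timestamp."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before
    url = PUSHSHIFT_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["data"]


def scrape_posts(subreddit, total, batch_size=100, get_batch=fetch_batch):
    """Collect up to `total` posts, walking backwards in time batch by batch."""
    posts = []
    before = None  # None means "start from the newest post"
    while len(posts) < total:
        batch = get_batch(subreddit, before, min(batch_size, total - len(posts)))
        if not batch:
            break  # no older posts are left
        posts.extend(batch)
        # The next request asks only for posts older than the oldest one seen.
        before = min(post["created_utc"] for post in batch)
    return posts
```

A call such as `scrape_posts("ElementaryTeachers", total=5000)` would then issue about 50 requests of 100 posts each. The `get_batch` argument exists so the loop can be exercised with a stand-in function instead of a live network call.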

I then binarized the subreddit names to 0 for r/ElementaryTeachers and 1 for r/HighSchoolTeachers, and split the data into train and test sets with a test size of 30%.
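A minimal sketch of the label encoding and the 70/30 split, using scikit-learn's `train_test_split`. The records are placeholders for the scraped posts, and `stratify` and `random_state` are my assumptions, not settings confirmed by this study.

```python
from sklearn.model_selection import train_test_split

# Placeholder records standing in for the scraped posts.
posts = [
    {"subreddit": "ElementaryTeachers", "text": f"elementary post {i}"}
    for i in range(10)
] + [
    {"subreddit": "HighSchoolTeachers", "text": f"high school post {i}"}
    for i in range(10)
]

# Binarize the subreddit names: 0 = elementary, 1 = high school.
label_map = {"ElementaryTeachers": 0, "HighSchoolTeachers": 1}
X = [post["text"] for post in posts]
y = [label_map[post["subreddit"]] for post in posts]

# Hold out 30% of the posts for testing.
# Assumption: stratify keeps the class balance equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```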

Gridsearching Through Model Hyperparameters

I used a pipeline and gridsearching to find the best parameters optimized for recall score for the following combinations of transformer and model:

Logistic Regression Classifier with Count Vectorizer
Logistic Regression Classifier with TF-IDF Vectorizer
Support Vector Machine with Count Vectorizer
Support Vector Machine with TF-IDF Vectorizer
Multinomial Naive Bayes Classifier with Count Vectorizer
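For one of the combinations above (Logistic Regression with TF-IDF Vectorizer), a pipeline and grid search optimized for recall might look like the following sketch. The parameter grid, the toy posts, and `cv=2` are illustrative assumptions. The grids actually searched in this study are not listed here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy posts invented for illustration (0 = elementary, 1 = high school).
texts = [
    "phonics games for my first graders",
    "kindergarten circle time songs",
    "recess and reading groups in second grade",
    "sight words practice for third grade",
    "AP calculus exam review ideas",
    "grading chemistry lab reports for sophomores",
    "seniors and college application essays",
    "algebra two homework policy for juniors",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline refits the vectorizer inside each cross-validation fold,
# so the held-out fold never influences the vocabulary.
pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "tvec__ngram_range": [(1, 1), (1, 2)],
    "tvec__stop_words": [None, "english"],
    "logreg__C": [0.1, 1.0, 10.0],
}

# scoring="recall" selects the parameters with the best recall on class 1.
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=2)
search.fit(texts, labels)

best_params = search.best_params_
print(best_params, search.best_score_)
```

Swapping `TfidfVectorizer` for `CountVectorizer`, or `LogisticRegression` for another estimator, gives the other combinations in the list with the same structure.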

Results

Model performance when trained on full dataset

All models tested did well in classifying posts into the correct subreddit. The SVM and Logistic Regression models performed comparably (accuracy 0.92 to 0.94, recall 0.95 to 0.96), while Multinomial Naive Bayes trailed at 0.85 accuracy and 0.86 recall, as shown in the table below. From these, I selected SVM-TVec, MNBayes, and LogReg-TVec to be tested for performance on classifying the modified dataset.

Model	Accuracy	Recall	Precision
SVM-TVec	0.93	0.95	0.94
MNBayes	0.85	0.86	0.90
LogReg-TVec	0.93	0.96	0.94
LogReg-CVec	0.94	0.95	0.95
SVM-CVec	0.92	0.95	0.93
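For reference, the three metrics in the table are computed from the counts of true and false positives and negatives. The sketch below shows the standard definitions on made-up predictions, treating class 1 (high school) as the positive class. The `classification_metrics` helper is illustrative and the example labels are not results from this study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, and precision for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        # Share of all posts labeled correctly.
        "accuracy": (tp + tn) / len(pairs),
        # Share of actual positives that the model found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of predicted positives that were correct.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Made-up labels: 3 true positives, 1 false negative,
# 2 false positives, 2 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```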

Conclusion and Future Directions

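Overall, Logistic Regression performed the best: on the full dataset, LogReg-CVec had the highest accuracy (0.94) and precision (0.95), and LogReg-TVec had the highest recall (0.96).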
