Text classification of quotes from candidates vying to be the Democratic nominee in the 2020 US presidential election.
All data was extracted from debates between the candidates. I built an NLP classification model to identify who said what for a subset of unlabeled data.
The quotes go through basic text-preprocessing steps:
- Stopword removal
- Punctuation removal
- Lemmatization
- Tokenization into unigrams
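
The preprocessing steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's actual code: the `STOPWORDS` set, the `LEMMAS` lookup table, and the `preprocess` function are hypothetical stand-ins. A real pipeline would more likely use nltk's stopword corpus and `WordNetLemmatizer`, so the dictionary lookup here is only a simplified substitute for lemmatization.

```python
import string
from typing import List

# Hypothetical, deliberately tiny resources for illustration only.
# A real pipeline would likely rely on nltk's stopword list and lemmatizer.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "we", "in", "for"}
LEMMAS = {"workers": "worker", "wages": "wage", "families": "family"}


def preprocess(quote: str) -> List[str]:
    """Apply punctuation removal, unigram tokenization,
    stopword removal, and a lookup-based lemmatization."""
    # Punctuation removal: drop every ASCII punctuation character.
    no_punct = quote.translate(str.maketrans("", "", string.punctuation))
    # Unigram tokenization: lowercase and split on whitespace.
    tokens = no_punct.lower().split()
    # Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Lemmatization stand-in: map known inflected forms to a base form.
    return [LEMMAS.get(t, t) for t in tokens]


tokens = preprocess("The workers deserve better wages, and families are struggling!")
print(tokens)
```

The order of the steps is a design choice: stripping punctuation before tokenizing keeps tokens such as `wages,` from slipping past the stopword and lemma lookups.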
To prepare the data for modeling, I engineered features based on counts of various text components, such as characters, words, and punctuation marks.
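
As a rough illustration of count-based features, the sketch below computes three counts for a single quote. The function name `count_features` and the feature names are assumptions made for this example. The project's actual feature set may differ or include more counts.

```python
import string
from typing import Dict


def count_features(quote: str) -> Dict[str, int]:
    """Return simple count-based features for one quote (illustrative only)."""
    return {
        # Total number of characters, including spaces and punctuation.
        "char_count": len(quote),
        # Number of whitespace-separated words.
        "word_count": len(quote.split()),
        # Number of ASCII punctuation characters.
        "punctuation_count": sum(ch in string.punctuation for ch in quote),
    }


features = count_features("Healthcare is a right, not a privilege.")
print(features)
```

Features like these are typically computed on the raw quote, before punctuation is stripped, since the punctuation count would otherwise always be zero.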
The text classification is done using supervised and semi-supervised techniques. The following models were explored:
- Regularized Logistic Regression
- Random Forest
- XGBoost
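
To show how one of these models might be wired up, here is a minimal supervised sketch that pairs `TfidfVectorizer` with an L2-regularized `LogisticRegression` in a scikit-learn pipeline. The toy sentences, the `speaker_a`/`speaker_b` labels, and the hyperparameters are all invented for illustration. They are not the project's data or tuned settings, and the semi-supervised part is not covered here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up training data with placeholder speaker labels.
train_quotes = [
    "medicare for all is a human right",
    "we need medicare for all now",
    "a wealth tax on billionaires",
    "tax the billionaires two cents",
]
train_speakers = ["speaker_a", "speaker_a", "speaker_b", "speaker_b"]

# Unigram TF-IDF features feeding an L2-regularized logistic regression.
# C is the inverse regularization strength; 1.0 is the library default,
# not a tuned value.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
model.fit(train_quotes, train_speakers)

# Predict the speaker for quotes without labels.
unlabeled_quotes = ["medicare for all", "tax billionaires"]
predictions = model.predict(unlabeled_quotes)
print(list(predictions))
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on the training quotes only, so the same transformation is applied to the unlabeled quotes at prediction time. Random Forest or XGBoost classifiers could be swapped in as the final pipeline step in the same way.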
The following tools and libraries were used:
1. NLP: nltk, TfidfVectorizer, CountVectorizer
2. ML: sklearn, xgboost, scipy
3. Visualization: Seaborn, Matplotlib
4. Exploration: Jupyter Notebooks