This is an entity-level sentiment analysis dataset of Twitter messages. Given a message and an entity, the task is to judge the sentiment of the message toward the entity. There are four classes in this dataset: Positive, Negative, Neutral, and Irrelevant.
- Category: NLP, Multiclass Classification problem
- Tech Stack: Python, Regular expression, Word cloud, NLTK, TF-IDF, Bag of Words, Pandas, Matplotlib, Sklearn
https://parisrohan.medium.com/twitter-sentiment-analysis-and-classification-7060d4444a27
- EDA_TextCleaning.ipynb - EDA and text cleaning code
- Model_building.ipynb - Model building code
- The Twitter sentiment analysis dataset from Kaggle has been used to build a multiclass classification model. The dataset is available at: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
- The dataset contains 74682 rows and 4 columns
- Distribution of target feature is as below
- The dataset columns have been renamed to {0:'Tweet_ID',1:'Topic',2:'Sentiment',3:'Tweet'} to get a better sense of the data.
- 0.9% of the data has been dropped as it contains null values
- On average, each tweet contains 23 tokens, and a few tweets are extreme outliers in length
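
The renaming, null handling, and token counting described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the three-row `raw` frame is a made-up stand-in for the Kaggle CSV, the `token_count` column name is an assumption, and tokens are counted by splitting on whitespace, which may differ from the tokenizer used in the notebook.

```python
import pandas as pd

# Made-up stand-in for the Kaggle CSV, which has no header row.
# With the real file one would typically use pd.read_csv(path, header=None).
raw = pd.DataFrame([
    [1, "Amazon", "Positive", "Loving the fast delivery from Amazon"],
    [2, "Google", "Negative", "Search results keep getting worse"],
    [3, "Google", "Neutral", None],  # row with a missing tweet
])

# Give the positional columns descriptive names.
df = raw.rename(columns={0: "Tweet_ID", 1: "Topic", 2: "Sentiment", 3: "Tweet"})

# Drop every row that contains a null value.
df = df.dropna().reset_index(drop=True)

# Whitespace token count per tweet (assumed tokenization, for illustration only).
df["token_count"] = df["Tweet"].str.split().str.len()

# Class distribution and length statistics, useful for spotting outliers.
print(df["Sentiment"].value_counts())
print(df["token_count"].describe())
```

On the real dataset, `describe()` on the token counts is one way to see both the mean length and the extreme values mentioned above.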
- The following steps are applied to the 'Tweet' feature to clean the text and keep the informative words:
- Remove user mentions
- Remove hashtags
- Expand contractions
- Remove urls
- Remove special characters
- Convert tweets into lowercase
- Remove stopwords
- Normalize text by lemmatizing each word
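
A rough sketch of the cleaning steps listed above, using only the standard library. Several things here are assumptions rather than the notebook's code: the `clean_tweet` name, the tiny `CONTRACTIONS` map, and the hand-written `STOPWORDS` set, which stands in for NLTK's English stopword list. The steps run in a slightly different order than listed (lowercasing and URL removal come first) so that the later patterns match reliably. Lemmatization is left out because it needs NLTK's downloaded WordNet data; in practice each token could be passed through `nltk.stem.WordNetLemmatizer` as a final step.

```python
import re

# Small illustrative maps; a real pipeline would use fuller lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
STOPWORDS = {"the", "a", "an", "is", "it", "and", "to", "of", "at", "i", "am", "this"}


def clean_tweet(text):
    """Apply the cleaning steps to one tweet and return the cleaned string."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # remove mentions and hashtags
    for short, full in CONTRACTIONS.items():             # expand contractions
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^a-z\s]", " ", text)                # remove special characters
    tokens = [t for t in text.split() if t not in STOPWORDS]  # remove stopwords
    return " ".join(tokens)


example = "@user I can't believe it's this good! #wow https://t.co/xyz"
print(clean_tweet(example))
```

Note that this sketch drops hashtags entirely, word included; keeping the word and removing only the `#` symbol is an equally reasonable choice.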
- Generate word clouds for each sentiment on the cleaned tweets
- Perform one-hot encoding on the 'Topic' feature
- Drop features like 'Tweet_ID','Tweet','Topic' as they are no longer required
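
The encoding and column-dropping steps could look roughly like this. The sample rows and the `clean_tweet` column name are hypothetical; `pd.get_dummies` is one common way to one-hot encode a column and is not necessarily what the notebook uses.

```python
import pandas as pd

# Hypothetical frame after text cleaning; 'clean_tweet' is an assumed column name.
df = pd.DataFrame({
    "Tweet_ID": [1, 2, 3],
    "Topic": ["Amazon", "Google", "Amazon"],
    "Sentiment": ["Positive", "Negative", "Neutral"],
    "Tweet": ["Great delivery!", "Search is bad", "Ordered a book"],
    "clean_tweet": ["great delivery", "search bad", "ordered book"],
})

# One indicator column per distinct topic value.
topic_dummies = pd.get_dummies(df["Topic"], prefix="Topic", dtype=int)

# Attach the indicators, then drop the columns that are no longer needed.
features = pd.concat([df, topic_dummies], axis=1).drop(
    columns=["Tweet_ID", "Tweet", "Topic"]
)

print(features)
```

The raw 'Tweet' column can be dropped at this point only because the cleaned text is kept in its own column for the later vectorization step.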