Python 3.12 is used.
- Clone the repository:
```
git clone https://github.com/LukichevaPolina/nlp_lab.git
cd nlp_lab
```
- Install the requirements:
```
pip3 install -r requirements.txt
```
- Set up `PYTHONPATH`:
```
export PYTHONPATH=$PYTHONPATH:$PWD
```
The dataset is taken from Kaggle (see the dataset page for a fuller description). In brief, it consists of two columns: `statement` and `status`. The `status` column is our target; it can take seven different values, so we deal with multiclass classification. The total number of rows is 53043; some of these rows contain NaN values, so we remove them, which leaves 52681 usable rows. Below you can see the distribution of our targets (classes).
As you can see, the classes are imbalanced; we deal with this using the class-weight approach. Below we provide some more statistics of our data.
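One common form of the class-weight approach (an assumption here: the repository follows scikit-learn's usual `class_weight="balanced"` scheme, where weights are inversely proportional to class frequency) can be sketched as:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: class 0 is frequent, class 2 is rare
y = np.array([0, 0, 0, 0, 1, 1, 2])
classes = np.unique(y)

# "balanced" gives n_samples / (n_classes * count(class)),
# so rarer classes receive larger weights
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_map = dict(zip(classes, weights))
print(weight_map)
```

These weights are then passed to the classifier (e.g. the `class_weight` parameter of scikit-learn estimators) so that mistakes on rare classes cost more.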
The chart consists of histograms of sentence lengths. Note that the highest frequencies correspond to relatively short sentences. The longest sentences belong to Suicidal and Personality disorder, followed by Anxiety, Depression, Bipolar, and Stress. The shortest sentences belong to Normal.
The chart consists of histograms of the amount of punctuation per sentence (punctuation length). Note that sentences most frequently contain only one punctuation mark.
The chart consists of histograms of the number of digits. Note that most sentences contain few digits.
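The three per-statement statistics behind these histograms (word count, punctuation count, digit count) can be computed with a small helper along these lines (the function name is illustrative, not taken from the repo):

```python
import string

def text_stats(s: str) -> dict:
    """Word count, punctuation count, and digit count for one statement."""
    return {
        "n_words": len(s.split()),
        "n_punct": sum(ch in string.punctuation for ch in s),
        "n_digits": sum(ch.isdigit() for ch in s),
    }

stats = text_stats("I can't sleep, 3 nights in a row!")
print(stats)
```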
We conclude that our data include short sentences of around 5-50 words for Normal, around 1000-1500 words for Suicidal and Personality disorder, and about 500-600 words for Anxiety, Depression, Bipolar, and Stress, with few punctuation marks and few digits. So the statistics suggest that people in the Suicidal and Personality disorder classes have numerous thoughts, probably unnecessary ones. Note also that almost all histograms are left-shifted.
Below we provide histograms of the most frequent words in our data.
The chart consists of a histogram of the top 20 most frequent words. The top 3 words are `feel`, `like`, and `want`.
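The frequency count behind such a chart can be sketched with `collections.Counter` on a toy corpus (the real chart is computed over the preprocessed statements):

```python
from collections import Counter

# Toy corpus standing in for the preprocessed statements
corpus = [
    "i feel like i want to sleep",
    "i feel tired",
    "want to feel ok",
]

# Flatten to tokens, then count; most_common(k) gives the top-k words
tokens = [word for sentence in corpus for word in sentence.split()]
top = Counter(tokens).most_common(3)
print(top)
```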
Our preprocessing includes the following steps: `drop_nan` -> `remove_punctuation` -> `remove_digits` -> `remove_stop_words` -> `tokenize` -> `lemmatization`. The embedding strategy is `tfidf`. We examine four models: `svm`, `decision_tree`, `cnn`, and `linear`.
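A minimal, dependency-free sketch of the cleaning steps is below. It covers punctuation removal, digit removal, tokenization, and stop-word filtering; the lemmatization step (e.g. via NLTK's `WordNetLemmatizer`) is omitted to keep the sketch self-contained, and the tiny stop-word list is purely illustrative:

```python
import re

STOP_WORDS = {"i", "am", "the", "a", "to", "and"}  # toy stop-word list

def preprocess(text: str) -> list[str]:
    """Illustrative pipeline: remove punctuation and digits, tokenize,
    drop stop-words. Lemmatization would follow as a final step."""
    text = re.sub(r"[^\w\s]", "", text.lower())  # strip punctuation
    text = re.sub(r"\d+", "", text)              # strip digits
    tokens = text.split()                        # tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("I am anxious, 24/7!"))
```

The resulting token lists would then be joined back and fed to a TF-IDF vectorizer to obtain the embeddings.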
To train:
```
python3 main.py --dataset_path {dataset_path} --algorithm svm --embeddigns tfidf --class_balancer class-weight --preprocessor remove-all --mode train
```
To evaluate:
```
python3 main.py --dataset_path {dataset_path} --algorithm svm --embeddigns tfidf --class_balancer class-weight --preprocessor remove-all --mode eval
```
class | f1_score |
---|---|
0 | 0.78 |
1 | 0.77 |
2 | 0.69 |
3 | 0.90 |
4 | 0.64 |
5 | 0.54 |
6 | 0.65 |
accuracy = 0.73, f1_weighted = 0.75
A grid search for the optimal parameters was conducted. The best parameters found are: {'C': 1, 'multi_class': 'ovr'}.
We used the Stratified K-Fold approach because it is suitable for unbalanced data.
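A toy sketch of the cross-validated grid search is below. Synthetic features stand in for the TF-IDF vectors, and only the `C` grid is shown; the estimator and scoring choices are assumptions for illustration, not lifted from the repo:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic multiclass data standing in for the TF-IDF features
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

# Stratified folds preserve the class ratios in each split,
# which matters for imbalanced data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LinearSVC(class_weight="balanced", max_iter=5000),
    param_grid={"C": [1, 10, 100, 1000]},
    cv=cv,
    scoring="f1_weighted",
)
search.fit(X, y)
print(search.best_params_)
```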
To train:
```
python3 main.py --dataset_path {dataset_path} --algorithm decision-tree --embeddigns tfidf --class_balancer class-weight --preprocessor remove-all --mode train
```
To evaluate:
```
python3 main.py --dataset_path {dataset_path} --algorithm decision-tree --embeddigns tfidf --class_balancer class-weight --preprocessor remove-all --mode eval
```
class | f1_score |
---|---|
0 | 0.59 |
1 | 0.58 |
2 | 0.58 |
3 | 0.78 |
4 | 0.50 |
5 | 0.32 |
6 | 0.52 |
accuracy = 0.57, f1_weighted = 0.62
A grid search for the optimal parameters was conducted. The best parameters found are: {'criterion': 'gini', 'max_depth': 12}.
We used the Stratified K-Fold approach because it is suitable for unbalanced data.
To train:
```
python3 main.py --dataset_path {dataset_path} --algorithm cnn --embeddigns tfidf --class_balancer class-weight --preprocessor remove-all --mode train
```
To evaluate:
```
python3 main.py --dataset_path {dataset_path} --algorithm cnn --embeddigns tfidf --class_balancer class-weight --preprocessor remove-all --mode eval
```
Learning curve, accuracy curve, and F1 curve plots for the CNN are provided in the repository.
class | f1_score |
---|---|
0 | 0.76 |
1 | 0.78 |
2 | 0.69 |
3 | 0.88 |
4 | 0.69 |
5 | 0.56 |
6 | 0.64 |
accuracy = 0.71, f1_weighted = 0.74
To train:
```
python3 main.py --dataset_path {dataset_path} --algorithm linear --embeddigns tfidf --class_balancer class-weight --preprocessor remove-all --mode train
```
To evaluate:
```
python3 main.py --dataset_path {dataset_path} --algorithm linear --embeddigns tfidf --class_balancer class-weight --preprocessor remove-all --mode eval
```
Learning curve, accuracy curve, and F1 curve plots for the linear model are provided in the repository.
class | f1_score |
---|---|
0 | 0.63 |
1 | 0.56 |
2 | 0.61 |
3 | 0.84 |
4 | 0.49 |
5 | 0.42 |
6 | 0.59 |
accuracy = 0.61, f1_weighted = 0.67
The best model is SVM, because this algorithm finds the hyperplane that maximizes the margin between classes, and it also handles imbalanced datasets well.
We experimented with preprocessing on the classical models. We ran three types of experiments:
- preprocessing (removing all punctuation, digits, and stop-words) + lemmatization
- preprocessing (removing all punctuation) + lemmatization
- preprocessing (removing all punctuation, digits, and stop-words) + stemming.
As the plots show, there is no significant difference between the results. This is likely because the models focus on words and phrases specific to a person's mental state, and also because, as the EDA showed, there is no significant amount of punctuation or digits in the data.
The corresponding plots for SVM and Decision Tree are provided in the repository.
Grid search was applied. For the SVC model the parameters {"C": [1, 10, 100, 1000], "multi_class": ["ovr", "crammer_singer"]} were checked. For the Decision Tree the parameters {'criterion': ['gini', 'entropy'], 'max_depth': np.arange(3, 15)} were checked.
A series of experiments was conducted to test the impact of varying the learning rate (1e-3, 2e-5, 2e-3) in the optimizer, the regularization weight, and the scheduler. Additionally, the batch size was changed from 64 to 128. To account for class imbalance, different class weights were used in the cross-entropy loss. We also experimented with activation layers in the CNN model (adding ReLU), but the baseline model achieved higher metrics.
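The class-weighted cross entropy can be sketched in NumPy. The normalization below mirrors PyTorch's `CrossEntropyLoss(weight=...)` with the default `'mean'` reduction (an assumption about how the repo implements the balancing); the toy logits show how up-weighting a rare class raises the loss on its mistakes:

```python
import numpy as np

def weighted_cross_entropy(logits, targets, class_weights):
    # Log-softmax with the max-subtraction trick for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Each sample is weighted by the weight of its true class
    w = class_weights[targets]
    per_sample = -w * log_probs[np.arange(len(targets)), targets]
    # 'mean' reduction with weights normalizes by the sum of applied weights
    return per_sample.sum() / w.sum()

logits = np.array([[2.0, 0.5, 0.1],    # confident, correct prediction
                   [0.2, 1.5, 0.3]])   # wrong prediction on the rare class
targets = np.array([0, 2])
uniform = np.array([1.0, 1.0, 1.0])
upweighted = np.array([1.0, 1.0, 3.0])  # rare class 2 counts 3x

loss_u = weighted_cross_entropy(logits, targets, uniform)
loss_w = weighted_cross_entropy(logits, targets, upweighted)
```

Up-weighting the rare class makes the optimizer pay more attention to exactly the samples the imbalanced data would otherwise let it ignore.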