Natural-Language-Processing
Given the Amazon Reviews dataset, four models (RNN, LSTM, GRU, BiLSTM) are built here to predict customers' ratings from their reviews, and the best of them is further improved with attention. In addition, Hugging Face pretrained models are introduced for transfer learning. Finally, a Seq2Seq model is built to predict customers' summaries based on their reviews.
The training data is about 190 MB and the validation data about 13 MB; the files can be downloaded here:
https://drive.google.com/file/d/1fdfRa6frQGBIGxDg4zDB1r8eOtZxALw_/view?usp=sharing
https://drive.google.com/file/d/1pTqDFL9CiTMgHsP_VL4G2CL745DCWa2Y/view?usp=sharing
https://drive.google.com/file/d/1NPN0XjMOref0T8JSTy1413QxrNnOCQ7F/view?usp=sharing
The whole procedure can be divided into preprocessing the data, building the recurrent models, transfer learning, and establishing the sequence-to-sequence model. Details can be found in the code; the framework is listed below.
Part 1 (4 different models: RNN, LSTM, GRU, BiLSTM)
- Build the Baseline
- lemmatize text
- evaluate ratio of each review
- realize threshold baseline
- Featurize the Dataset Using Torchtext (see the torchtext sketch after this outline)
- create torchtext data fields
- create a tabular dataset
- build vocab
- create an iterator for the dataset
- Establish the Recurrent Models
- design an embedding layer
- pack the embedded data (optional)
- build an rnn/lstm/gru/bilstm layer
- add "attention", "teacher forcing", or "beam search" module (optional)
- design a fully connected layer
- Train/Evaluate the Model
Part 2 (transfer learning)
- Transfer Learning Using Hugging Face
Part 3 (seq2seq model)
- Develop the Seq2Seq Model
- featurize the dataset like before
- build the encoder
- build the decoder
- build encoder-decoder combined model
- train/evaluate the model
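As referenced in the Part 1 outline, a minimal sketch of the torchtext featurization step is given below. It assumes the legacy torchtext API (`torchtext.legacy.data` on torchtext ≥ 0.9, plain `torchtext.data` on older releases) and CSV files with `review` and `rating` columns; the paths, column names, and sizes are placeholders, not necessarily what the actual code uses.

```python
import torch
from torchtext.legacy import data  # on torchtext <= 0.8 use `from torchtext import data`

# Data fields: whitespace-tokenized review text plus an integer rating label.
TEXT = data.Field(lower=True, include_lengths=True)
LABEL = data.LabelField(dtype=torch.long)

# Tabular dataset built from CSV files (placeholder paths and column names).
train_data, valid_data = data.TabularDataset.splits(
    path=".", train="train.csv", validation="valid.csv", format="csv",
    skip_header=True, fields=[("review", TEXT), ("rating", LABEL)])

# Vocabulary over the training split, capped to keep the embedding table small.
TEXT.build_vocab(train_data, max_size=25_000)
LABEL.build_vocab(train_data)

# Bucket iterators group reviews of similar length to reduce padding in each batch.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iter, valid_iter = data.BucketIterator.splits(
    (train_data, valid_data), batch_size=64, device=device,
    sort_within_batch=True, sort_key=lambda ex: len(ex.review))
```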
To begin with, I need to define a non-deep-learning baseline in case the deep learning methods perform poorly or do not fit here. To do this, I use the "ratio" below as the criterion to evaluate reviews' ratings.
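The exact formula appears in the code; as an illustration only, a minimal sketch of such a threshold baseline is shown here. It lemmatizes the review with NLTK and maps the share of positive sentiment words to a 1-5 rating; the word lists and thresholds are made-up placeholders, not the project's actual criterion.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

# Placeholder sentiment lexicons -- the real baseline's word lists differ.
POSITIVE = {"good", "great", "love", "excellent", "perfect"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "broken"}

def ratio_baseline(review: str) -> int:
    """Map a review to a 1-5 rating from the ratio of positive to sentiment words."""
    tokens = [lemmatizer.lemmatize(w) for w in review.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return 3                                   # no sentiment words: middle rating
    ratio = pos / (pos + neg)
    for rating, cutoff in enumerate((0.2, 0.4, 0.6, 0.8), start=1):
        if ratio < cutoff:                         # fixed placeholder thresholds
            return rating
    return 5

print(ratio_baseline("I love this product , the quality is excellent"))  # -> 5
```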
During this process, I need to lemmatize each word; a sample of original words vs. lemmatized words is as follows:

After implementing the baseline, I get the confusion matrix below, where the y-axis represents the customers' true ratings and the x-axis the ratings predicted by the non-deep-learning criterion.

Then I can start to build the recurrent models (RNN, LSTM, GRU, BiLSTM); their confusion matrices are shown respectively below:
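A minimal sketch of one of these classifiers (the BiLSTM variant; the RNN/LSTM/GRU versions just swap the recurrent layer) is shown here, assuming the padded batches and vocabulary from the torchtext step above; the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=5, pad_idx=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)     # 2x for the two directions

    def forward(self, text, lengths):
        # text: (seq_len, batch) token ids; lengths: (batch,) true lengths before padding
        embedded = self.embedding(text)
        packed = pack_padded_sequence(embedded, lengths.cpu(), enforce_sorted=False)
        _, (hidden, _) = self.lstm(packed)
        # Concatenate the final forward and backward hidden states.
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        return self.fc(hidden)                               # (batch, num_classes) logits

model = BiLSTMClassifier(vocab_size=25_002)  # 25,000 words + <unk> and <pad>
```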
From these graphs, we can see that the RNN performs worst, followed by the LSTM; the GRU and BiLSTM stand out among the four models with similar performance, the BiLSTM being slightly better than the GRU. Therefore, I tried to improve the best model, the BiLSTM, with an "attention" module; the confusion matrix of the BiLSTM with attention is below:
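A minimal sketch of the attention idea is shown here: instead of classifying from the final hidden state alone, every BiLSTM output step is scored and the attention-weighted sum is fed to the fully connected layer. This is an additive-style attention under illustrative dimensions; the actual module in the code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Score each BiLSTM output step and return their weighted sum."""
    def __init__(self, enc_dim):
        super().__init__()
        self.proj = nn.Linear(enc_dim, enc_dim)
        self.score = nn.Linear(enc_dim, 1, bias=False)

    def forward(self, outputs, mask):
        # outputs: (seq_len, batch, enc_dim); mask: (seq_len, batch), True on real tokens
        energy = self.score(torch.tanh(self.proj(outputs))).squeeze(-1)  # (seq_len, batch)
        energy = energy.masked_fill(~mask, float("-inf"))                # ignore padding
        weights = F.softmax(energy, dim=0)                               # over time steps
        context = (weights.unsqueeze(-1) * outputs).sum(dim=0)           # (batch, enc_dim)
        return context, weights

# Inside the classifier's forward pass, the attended context replaces the final hidden state:
#   outputs, _ = pad_packed_sequence(self.lstm(packed)[0])
#   context, _ = self.attention(outputs, mask)
#   logits = self.fc(context)
```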
To make the comparison clear, I plot the training loss and F1-score curves of the RNN, LSTM, GRU, BiLSTM, and BiLSTM with attention as follows:
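These curves come from a standard training/evaluation routine; a minimal sketch of that routine (not of the plotting) is given here, assuming the model and iterators defined above, batch attributes named `review`/`rating`, and macro-averaged F1 from scikit-learn.

```python
import torch
import torch.nn as nn
from sklearn.metrics import f1_score

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def run_epoch(iterator, train=True):
    """One pass over the data; returns mean loss and macro F1."""
    model.train(train)
    losses, preds, golds = [], [], []
    with torch.set_grad_enabled(train):
        for batch in iterator:
            text, lengths = batch.review            # include_lengths=True yields a pair
            logits = model(text, lengths)
            loss = criterion(logits, batch.rating)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            losses.append(loss.item())
            preds.extend(logits.argmax(dim=1).tolist())
            golds.extend(batch.rating.tolist())
    return sum(losses) / len(losses), f1_score(golds, preds, average="macro")

for epoch in range(5):
    train_loss, train_f1 = run_epoch(train_iter, train=True)
    valid_loss, valid_f1 = run_epoch(valid_iter, train=False)
    print(f"epoch {epoch}: train loss {train_loss:.3f} F1 {train_f1:.3f} | "
          f"valid loss {valid_loss:.3f} F1 {valid_f1:.3f}")
```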
For comparison, Hugging Face transfer learning is introduced: two pretrained models (RoBERTa and CamemBERT) are used to predict the ratings; their confusion matrices are shown below:
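A minimal sketch of this transfer-learning setup with the `transformers` library is shown here, using `roberta-base` as an example checkpoint (swapping the name in `from_pretrained`, e.g. to `camembert-base`, gives the other variant). The sample reviews, labels, and single forward/backward pass are placeholders for the full fine-tuning loop.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-base"                 # e.g. "camembert-base" for the other variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

# Placeholder data: review texts and 0-indexed rating labels.
reviews = ["Great product, works as described.", "Stopped working after two days."]
labels = torch.tensor([4, 0])

batch = tokenizer(reviews, padding=True, truncation=True, max_length=256, return_tensors="pt")
outputs = model(**batch, labels=labels)     # pretrained encoder + fresh classification head

outputs.loss.backward()                     # gradients for fine-tuning (optimizer step follows)
predictions = outputs.logits.argmax(dim=-1) # predicted rating indices
```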
The transferred models perform better than the RNN and LSTM, but worse than the GRU and BiLSTM (even without attention), let alone the BiLSTM with attention. To sum up, pretrained models generalize well but do not necessarily outrun our own models; given specific training data, a suitably chosen task-specific model may beat the transferred one.
At last, I developed a Seq2Seq model to predict the summary of each review; it consists of an encoder, a decoder, and a combined encoder-decoder model, sketched below.
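Here is a minimal sketch of that architecture: a GRU encoder compresses the review into a context vector and a GRU decoder generates the summary token by token with teacher forcing. The use of GRUs, the dimensions, and the vocabulary handling are illustrative assumptions rather than the exact configuration in the code.

```python
import random
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim)

    def forward(self, src):                          # src: (src_len, batch) review token ids
        _, hidden = self.gru(self.embedding(src))
        return hidden                                 # context vector handed to the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden):                 # token: (batch,) previous summary token
        output, hidden = self.gru(self.embedding(token).unsqueeze(0), hidden)
        return self.fc(output.squeeze(0)), hidden     # logits over the summary vocabulary

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, trg, teacher_forcing=0.5):
        hidden = self.encoder(src)
        token, outputs = trg[0], []                   # trg[0] holds the <sos> tokens
        for t in range(1, trg.size(0)):
            logits, hidden = self.decoder(token, hidden)
            outputs.append(logits)
            # Teacher forcing: sometimes feed the gold token instead of the prediction.
            token = trg[t] if random.random() < teacher_forcing else logits.argmax(dim=1)
        return torch.stack(outputs)                   # (trg_len - 1, batch, vocab_size)
```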
The raw text (reviews) looks like:

Below are the ground-truth summaries and our predicted summaries:

Special thanks to the CIS522 course professor and TAs for providing the dataset and guidance.