Natural-Language-Processing
Given the Amazon Reviews dataset, four models (RNN, LSTM, GRU, BiLSTM) are built here to predict customers' ratings from their reviews, and the best of them is further improved with attention. In addition, Hugging Face pretrained models are introduced for transfer learning. Finally, a Seq2Seq model is built to predict customers' summaries based on their reviews.
The training data is about 190 MB and the validation data about 13 MB; the files can be downloaded here:
https://drive.google.com/file/d/1fdfRa6frQGBIGxDg4zDB1r8eOtZxALw_/view?usp=sharing
https://drive.google.com/file/d/1pTqDFL9CiTMgHsP_VL4G2CL745DCWa2Y/view?usp=sharing
https://drive.google.com/file/d/1NPN0XjMOref0T8JSTy1413QxrNnOCQ7F/view?usp=sharing
The whole procedure can be divided into preprocessing the data, building the recurrent models, transfer learning, and establishing the sequence-to-sequence model. Details can be found in the code; the framework is listed below.
Part 1 (4 different models: RNN, LSTM, GRU, BiLSTM)
- Build the Baseline
- lemmatize text
- evaluate ratio of each review
- realize threshold baseline
- Featurize the Dataset Using Torchtext (see the torchtext sketch after this outline)
- create torchtext data fields
- create a tabular dataset
- build vocab
- create an iterator for the dataset
- Establish the Recurrent Models
- design an embedding layer
- pack the embedded data (optional)
- build an rnn/lstm/gru/bilstm layer
- add "attention", "teacher forcing", or "beam search" module (optional)
- design a fully connected layer
- Train/Evaluate the Model
Part 2 (transfer learning)
- Transfer Learning Using Hugging Face
Part 3 (seq2seq model)
- Develop the Seq2Seq Model
- featurize the dataset like before
- build the encoder
- build the decoder
- build encoder-decoder combined model
- train/evaluate the model
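As referenced in the Part 1 outline, a minimal sketch of the torchtext featurization step is given below. It assumes the legacy torchtext API (`torchtext.legacy.data` on torchtext ≥ 0.9, plain `torchtext.data` on older releases) and CSV files with `review` and `rating` columns; the paths, column names, and sizes are placeholders, not necessarily what the actual code uses.

```python
import torch
from torchtext.legacy import data  # on torchtext <= 0.8 use `from torchtext import data`

# Data fields: whitespace-tokenized review text plus an integer rating label.
TEXT = data.Field(lower=True, include_lengths=True)
LABEL = data.LabelField(dtype=torch.long)

# Tabular dataset built from CSV files (placeholder paths and column names).
train_data, valid_data = data.TabularDataset.splits(
    path=".", train="train.csv", validation="valid.csv", format="csv",
    skip_header=True, fields=[("review", TEXT), ("rating", LABEL)])

# Vocabulary over the training split, capped to keep the embedding table small.
TEXT.build_vocab(train_data, max_size=25_000)
LABEL.build_vocab(train_data)

# Bucket iterators group reviews of similar length to reduce padding in each batch.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iter, valid_iter = data.BucketIterator.splits(
    (train_data, valid_data), batch_size=64, device=device,
    sort_within_batch=True, sort_key=lambda ex: len(ex.review))
```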
To begin with, I need to define a non-deep-learning baseline in case the deep learning methods perform poorly or do not fit here. To do this, I use the "ratio" below as the criterion to evaluate reviews' ratings.
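The exact formula appears in the code; as an illustration only, a minimal sketch of such a threshold baseline is shown here. It lemmatizes the review with NLTK and maps the share of positive sentiment words to a 1-5 rating; the word lists and thresholds are made-up placeholders, not the project's actual criterion.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

# Placeholder sentiment lexicons -- the real baseline's word lists differ.
POSITIVE = {"good", "great", "love", "excellent", "perfect"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "broken"}

def ratio_baseline(review: str) -> int:
    """Map a review to a 1-5 rating from the ratio of positive to sentiment words."""
    tokens = [lemmatizer.lemmatize(w) for w in review.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return 3                                   # no sentiment words: middle rating
    ratio = pos / (pos + neg)
    for rating, cutoff in enumerate((0.2, 0.4, 0.6, 0.8), start=1):
        if ratio < cutoff:                         # fixed placeholder thresholds
            return rating
    return 5

print(ratio_baseline("I love this product , the quality is excellent"))  # -> 5
```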
During this process, I need to lemmatize each word; a sample of original words vs. lemmatized words is as follows:

After implementing the baseline, I get the confusion matrix below, where the y-axis represents the customers' true ratings and the x-axis the ratings predicted by the non-deep-learning criterion.

Then I can start to build the recurrent models (RNN, LSTM, GRU, BiLSTM); their confusion matrices are shown respectively below:
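A minimal sketch of one of these classifiers (the BiLSTM variant; the RNN/LSTM/GRU versions just swap the recurrent layer) is shown here, assuming the padded batches and vocabulary from the torchtext step above; the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=5, pad_idx=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)     # 2x for the two directions

    def forward(self, text, lengths):
        # text: (seq_len, batch) token ids; lengths: (batch,) true lengths before padding
        embedded = self.embedding(text)
        packed = pack_padded_sequence(embedded, lengths.cpu(), enforce_sorted=False)
        _, (hidden, _) = self.lstm(packed)
        # Concatenate the final forward and backward hidden states.
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        return self.fc(hidden)                               # (batch, num_classes) logits

model = BiLSTMClassifier(vocab_size=25_002)  # 25,000 words + <unk> and <pad>
```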
From these graphs, we can see that the RNN performs worst, followed by the LSTM; the GRU and BiLSTM stand out among the four models with similar performance, the BiLSTM being slightly better than the GRU. Therefore, I tried to improve the best model, the BiLSTM, with an "attention" module; the confusion matrix of the BiLSTM with attention is below:
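A minimal sketch of the attention idea is shown here: instead of classifying from the final hidden state alone, every BiLSTM output step is scored and the attention-weighted sum is fed to the fully connected layer. This is an additive-style attention under illustrative dimensions; the actual module in the code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Score each BiLSTM output step and return their weighted sum."""
    def __init__(self, enc_dim):
        super().__init__()
        self.proj = nn.Linear(enc_dim, enc_dim)
        self.score = nn.Linear(enc_dim, 1, bias=False)

    def forward(self, outputs, mask):
        # outputs: (seq_len, batch, enc_dim); mask: (seq_len, batch), True on real tokens
        energy = self.score(torch.tanh(self.proj(outputs))).squeeze(-1)  # (seq_len, batch)
        energy = energy.masked_fill(~mask, float("-inf"))                # ignore padding
        weights = F.softmax(energy, dim=0)                               # over time steps
        context = (weights.unsqueeze(-1) * outputs).sum(dim=0)           # (batch, enc_dim)
        return context, weights

# Inside the classifier's forward pass, the attended context replaces the final hidden state:
#   outputs, _ = pad_packed_sequence(self.lstm(packed)[0])
#   context, _ = self.attention(outputs, mask)
#   logits = self.fc(context)
```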
To make the comparison clear, I plot the training loss and F1-score curves of the RNN, LSTM, GRU, BiLSTM, and BiLSTM with attention as follows:
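These curves come from a standard training/evaluation routine; a minimal sketch of that routine (not of the plotting) is given here, assuming the model and iterators defined above, batch attributes named `review`/`rating`, and macro-averaged F1 from scikit-learn.

```python
import torch
import torch.nn as nn
from sklearn.metrics import f1_score

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def run_epoch(iterator, train=True):
    """One pass over the data; returns mean loss and macro F1."""
    model.train(train)
    losses, preds, golds = [], [], []
    with torch.set_grad_enabled(train):
        for batch in iterator:
            text, lengths = batch.review            # include_lengths=True yields a pair
            logits = model(text, lengths)
            loss = criterion(logits, batch.rating)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            losses.append(loss.item())
            preds.extend(logits.argmax(dim=1).tolist())
            golds.extend(batch.rating.tolist())
    return sum(losses) / len(losses), f1_score(golds, preds, average="macro")

for epoch in range(5):
    train_loss, train_f1 = run_epoch(train_iter, train=True)
    valid_loss, valid_f1 = run_epoch(valid_iter, train=False)
    print(f"epoch {epoch}: train loss {train_loss:.3f} F1 {train_f1:.3f} | "
          f"valid loss {valid_loss:.3f} F1 {valid_f1:.3f}")
```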
For comparison, Hugging Face transfer learning is introduced: two pretrained models (RoBERTa and CamemBERT) are used to predict the ratings; their confusion matrices are shown below:
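A minimal sketch of this transfer-learning setup with the `transformers` library is shown here, using `roberta-base` as an example checkpoint (swapping the name in `from_pretrained`, e.g. to `camembert-base`, gives the other variant). The sample reviews, labels, and single forward/backward pass are placeholders for the full fine-tuning loop.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-base"                 # e.g. "camembert-base" for the other variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

# Placeholder data: review texts and 0-indexed rating labels.
reviews = ["Great product, works as described.", "Stopped working after two days."]
labels = torch.tensor([4, 0])

batch = tokenizer(reviews, padding=True, truncation=True, max_length=256, return_tensors="pt")
outputs = model(**batch, labels=labels)     # pretrained encoder + fresh classification head

outputs.loss.backward()                     # gradients for fine-tuning (optimizer step follows)
predictions = outputs.logits.argmax(dim=-1) # predicted rating indices
```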
The transferred models perform better than the RNN and LSTM, but worse than the GRU and BiLSTM (even without attention), let alone the BiLSTM with attention. To sum up, pretrained models generalize well but do not necessarily outrun our own models; given specific training data, a suitably chosen task-specific model may beat the transferred one.
At last, I developed a Seq2Seq model to predict the summary of each review; it consists of an encoder, a decoder, and a combined encoder-decoder model, sketched below.
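Here is a minimal sketch of that architecture: a GRU encoder compresses the review into a context vector and a GRU decoder generates the summary token by token with teacher forcing. The use of GRUs, the dimensions, and the vocabulary handling are illustrative assumptions rather than the exact configuration in the code.

```python
import random
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim)

    def forward(self, src):                          # src: (src_len, batch) review token ids
        _, hidden = self.gru(self.embedding(src))
        return hidden                                 # context vector handed to the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden):                 # token: (batch,) previous summary token
        output, hidden = self.gru(self.embedding(token).unsqueeze(0), hidden)
        return self.fc(output.squeeze(0)), hidden     # logits over the summary vocabulary

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, trg, teacher_forcing=0.5):
        hidden = self.encoder(src)
        token, outputs = trg[0], []                   # trg[0] holds the <sos> tokens
        for t in range(1, trg.size(0)):
            logits, hidden = self.decoder(token, hidden)
            outputs.append(logits)
            # Teacher forcing: sometimes feed the gold token instead of the prediction.
            token = trg[t] if random.random() < teacher_forcing else logits.argmax(dim=1)
        return torch.stack(outputs)                   # (trg_len - 1, batch, vocab_size)
```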
The raw text (reviews) looks like:

Below are the ground-truth summaries and our predicted summaries:

Special thanks to the CIS522 course professor and TAs for providing the dataset and guidance.