This repository contains three sub-projects focused on Natural Language Processing (NLP): ProbingGPT, Postagging, and Autocorrection. Each sub-project addresses a different aspect of NLP, using a range of techniques and algorithms.
The ProbingGPT project, inspired by the methodology outlined in the paper, uses the Baukit library to probe the GPT-2 small model downloaded from Hugging Face. Focusing on layers h.0.mlp, h.3.mlp, h.9.mlp, and h.9.attn, the project feeds the SNLI corpus through the model, captures the hidden states of those layers with Baukit, and trains linear classifiers on them. The evaluation assesses how well each layer encodes the entailment, neutral, and contradiction labels, i.e., how well the model learns at each layer. Everything runs in Google Colab for seamless collaboration and execution.
Import the Jupyter notebook into Google Colab and follow the instructions it contains. All results and evaluations are in the same notebook.
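As a hedged sketch of the probing step (the real activations come from Baukit traces of GPT-2; the random features and labels below are stand-ins for illustration), a linear probe over captured hidden states might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for hidden states captured from one probed layer:
# one 768-dim vector per SNLI pair (GPT-2 small's hidden size), labelled
# 0=entailment, 1=neutral, 2=contradiction.
rng = np.random.default_rng(0)
n_examples, hidden_size = 600, 768
X = rng.normal(size=(n_examples, hidden_size))
y = rng.integers(0, 3, size=n_examples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A linear probe: if this classifier beats chance on held-out pairs, the
# layer's activations linearly encode the entailment label.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy at this layer: {probe.score(X_te, y_te):.3f}")
```

On random features the probe scores near chance (~0.33); repeating this per probed layer and comparing accuracies is what reveals which layers learn the task.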
The Postagging sub-project explores part-of-speech tagging using Hidden Markov Models (HMM) with bigram and trigram implementations. The study covers three languages: English, Japanese, and Bulgarian. Additionally, the performance of Vanilla RNN, LSTM, and Bidirectional LSTM models is compared for part-of-speech tagging.
- For HMM with the Viterbi algorithm:

```shell
python3 train_hmm.py data/ptb.2-21.tgs data/ptb.2-21.txt > my.hmm  # training
python3 viterbi.py my.hmm < data/ptb.22.txt > my.out               # Viterbi decoding
python3 tag_acc.py data/ptb.22.tgs my.out                          # evaluation
```
- For VRNN, LSTM, and BiLSTM:

```shell
python3 vrnn_lstm_bidlstm.py data/ptb.2-21.tgs data/ptb.2-21.txt data/ptb.22.tgs data/ptb.22.txt 22_1.out  # training and evaluation
```
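The HMM commands above estimate transition and emission probabilities, then decode with Viterbi. A minimal sketch of bigram Viterbi decoding (with a toy tag set and hand-picked probabilities, not the repo's trained model) might be:

```python
import math

def viterbi(words, tags, trans, emit):
    """Most likely tag sequence under a bigram HMM.
    trans[(prev, cur)] and emit[(tag, word)] are probabilities;
    missing entries count as zero (log-prob -inf)."""
    def lp(p):  # safe log
        return math.log(p) if p > 0 else float("-inf")

    # best[t] = best log-prob of any path ending in tag t
    best = {t: lp(trans.get(("<s>", t), 0)) + lp(emit.get((t, words[0]), 0))
            for t in tags}
    backs = []  # back-pointers, one dict per word after the first
    for w in words[1:]:
        new, back = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + lp(trans.get((p, t), 0)))
            new[t] = best[prev] + lp(trans.get((prev, t), 0)) + lp(emit.get((t, w), 0))
            back[t] = prev
        best, backs = new, backs + [back]
    # follow back-pointers from the best final tag
    seq = [max(tags, key=lambda t: best[t])]
    for back in reversed(backs):
        seq.append(back[seq[-1]])
    return list(reversed(seq))

# Toy model: determiner (D), noun (N), verb (V)
tags = ["D", "N", "V"]
trans = {("<s>", "D"): 0.8, ("<s>", "N"): 0.2, ("D", "N"): 1.0,
         ("N", "V"): 0.7, ("N", "N"): 0.3, ("V", "D"): 0.9, ("V", "N"): 0.1}
emit = {("D", "the"): 0.9, ("N", "dog"): 0.5, ("N", "barks"): 0.1,
        ("V", "barks"): 0.6}
print(viterbi(["the", "dog", "barks"], tags, trans, emit))  # ['D', 'N', 'V']
```

A trigram HMM extends this by conditioning transitions on the previous two tags, enlarging the state space from tags to tag pairs.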
Detailed evaluation results are provided in the POSTagging/README.md file.
Autocorrection evaluates different spell-correction methods: unigram, bigram, and trigram models, smoothed unigram and bigram models, and bigram and trigram models with backoff. The project uses an edit model to propose corrections and compares the methods on evaluation metrics.
```shell
python3 EditModel.py     # sanity-check the edit model
python3 SpellCorrect.py  # evaluate all the language models
```
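As a hedged sketch of how an edit model feeds a language model (Norvig-style edit-distance-1 candidates and a unigram scorer standing in for the smoothed n-gram scores the project actually compares):

```python
import string

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab, unigram_counts):
    """Prefer the word itself if known, else the most frequent
    in-vocabulary candidate one edit away."""
    candidates = ({word} & vocab) or (edits1(word) & vocab) or {word}
    return max(candidates, key=lambda w: unigram_counts.get(w, 0))

counts = {"spelling": 10, "spewing": 2}
print(correct("speling", set(counts), counts))  # 'spelling'
```

Swapping the unigram scorer for smoothed bigram/trigram scores over the surrounding context gives the other methods the project compares.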
If you would like to contribute or report issues, please email Mrudhul Guda.