Named Entity Recognition (NER) Labeling

A deep learning model for named entity recognition (NER), built in PyTorch with a bidirectional LSTM (BLSTM) and GloVe word embeddings. Two models were trained on the CoNLL-2003 corpus. Much of the difficulty in predicting the correct NER tag comes from unknown words: if a word is not in the training corpus, the model cannot be expected to classify it correctly. Strategies for handling unknown words can lessen the severity of this issue. For example, adding <UNK> tags, and variants such as <UNK-UPPERCASE> and <UNK-NUMBER>, helps the model learn how to handle unknown words.
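The unknown-word strategy above could look roughly like the following. This is a minimal sketch, not the repository's code: the helper names (`build_vocab`, `is_number`, `normalize`), the `min_count` threshold, and the order in which the variants are checked are all assumptions made for illustration.

```python
from collections import Counter

UNK = "<UNK>"
UNK_UPPER = "<UNK-UPPERCASE>"
UNK_NUMBER = "<UNK-NUMBER>"


def build_vocab(sentences, min_count=2):
    """Keep words seen at least `min_count` times (threshold is an assumption)."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, count in counts.items() if count >= min_count}


def is_number(word):
    """Treat tokens such as '1,234.5' or '-7' as numbers."""
    stripped = word.replace(",", "").replace(".", "", 1).lstrip("+-")
    return stripped.isdigit()


def normalize(word, vocab):
    """Map an out-of-vocabulary word to one of the <UNK> variants."""
    if word in vocab:
        return word
    if is_number(word):
        return UNK_NUMBER
    if word.isupper():
        return UNK_UPPER
    return UNK
```

Applying `normalize` to the training data as well as the test data means rare training words are also replaced, so the model sees the <UNK> variants during training and can learn weights for them.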

For more details on the dataset, click here.

First Model: BLSTM with random embeddings

This model is a typical one-layer bidirectional LSTM with dropout, using randomly initialized embeddings. The model architecture is shown below:

Embedding (100-dim) > BLSTM (size 256) > Linear (size 512) > ELU > Linear (size 10)
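A minimal PyTorch sketch of this pipeline follows. Several details are assumptions rather than facts from the repository: the class name `BLSTMTagger`, the dropout rate of 0.33 and where dropout is applied, the use of index 0 for padding, and the reading of "size" as the hidden size per direction for the BLSTM and as the output size for each linear layer.

```python
import torch
import torch.nn as nn


class BLSTMTagger(nn.Module):
    """Embedding > BLSTM > Linear > ELU > Linear, one score vector per token."""

    def __init__(self, vocab_size, num_tags=10, embed_dim=100,
                 hidden_size=256, linear_size=512, dropout=0.33):
        super().__init__()
        # Randomly initialized embeddings; index 0 is reserved for padding (assumption).
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.blstm = nn.LSTM(embed_dim, hidden_size, num_layers=1,
                             batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        # A bidirectional LSTM concatenates both directions: 2 * hidden_size features.
        self.linear = nn.Linear(2 * hidden_size, linear_size)
        self.elu = nn.ELU()
        self.classifier = nn.Linear(linear_size, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor of word indices
        x = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        x, _ = self.blstm(x)               # (batch, seq_len, 2 * hidden_size)
        x = self.dropout(x)
        x = self.elu(self.linear(x))       # (batch, seq_len, linear_size)
        return self.classifier(x)          # (batch, seq_len, num_tags)
```

For a batch of 2 sentences of 7 tokens each, `BLSTMTagger(vocab_size=50)(torch.randint(1, 50, (2, 7)))` returns a tensor of shape `(2, 7, 10)`, one logit per tag for every token.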

Second Model: BLSTM with GloVe embeddings

This model is similar to the previous one, but the embedding layer is replaced with a pretrained GloVe embedding layer. In addition, spelling features of each word are concatenated to the word embedding, yielding a richer word representation. Examples of spelling features include ALL_UPPER (e.g. IBM), ALL_LOWER (e.g. cat), NUMBER, FIRST_UPPER (e.g. John), and OTHERS. The improvement is clear from the jump in F1 score between the previous model and this one (80.33 to 93.23). The model architecture is shown below:

GloVe Embedding + Spelling Embedding (120-dim) > BLSTM (size 256) > Linear (size 512) > ELU > Linear (size 10)
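The sketch below illustrates one way the spelling features and the concatenated embedding might be wired together. It is not the repository's implementation. The split of the 120 dimensions into a 100-dimensional GloVe vector plus a 20-dimensional spelling embedding is an inference from the numbers above. The names `spelling_feature` and `GloveSpellingTagger`, the dropout rate, the check order in `spelling_feature`, and the choice to fine-tune the GloVe weights (`freeze=False`) are assumptions.

```python
import torch
import torch.nn as nn

SPELLING_CLASSES = ["ALL_UPPER", "ALL_LOWER", "NUMBER", "FIRST_UPPER", "OTHERS"]


def spelling_feature(word):
    """Classify a word's surface form into one of the spelling classes."""
    if word.isdigit():
        return "NUMBER"
    if word.isupper():
        return "ALL_UPPER"
    if word.islower():
        return "ALL_LOWER"
    if word[:1].isupper():
        return "FIRST_UPPER"
    return "OTHERS"


class GloveSpellingTagger(nn.Module):
    """GloVe + spelling embedding > BLSTM > Linear > ELU > Linear."""

    def __init__(self, glove_weights, num_tags=10, spelling_dim=20,
                 hidden_size=256, linear_size=512, dropout=0.33):
        super().__init__()
        # glove_weights: (vocab_size, 100) float tensor of pretrained vectors.
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.spell_emb = nn.Embedding(len(SPELLING_CLASSES), spelling_dim)
        input_dim = glove_weights.size(1) + spelling_dim  # 100 + 20 = 120
        self.blstm = nn.LSTM(input_dim, hidden_size,
                             batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(2 * hidden_size, linear_size)
        self.elu = nn.ELU()
        self.classifier = nn.Linear(linear_size, num_tags)

    def forward(self, token_ids, spelling_ids):
        # Concatenate word and spelling embeddings along the feature axis.
        x = torch.cat([self.word_emb(token_ids),
                       self.spell_emb(spelling_ids)], dim=-1)
        x, _ = self.blstm(x)
        x = self.dropout(x)
        x = self.elu(self.linear(x))
        return self.classifier(x)
```

Here `spelling_ids` would hold `SPELLING_CLASSES.index(spelling_feature(word))` for each token. Because GloVe vectors are commonly looked up by lowercased word, a separate spelling embedding is one way to give capitalization information back to the model.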

Results

Model	Precision	Recall	Accuracy	F1 score
BLSTM + Random Embeddings	84.22%	76.79%	96.03%	80.33
BLSTM + GloVe + Spelling Embedding	92.54%	93.92%	98.83%	93.23
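As a sanity check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The snippet below is an illustrative sketch; the function name `f1_score` and the `rows` dictionary are introduced here and are not part of the repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table
rows = {
    "BLSTM + Random Embeddings": (84.22, 76.79),
    "BLSTM + GloVe + Spelling Embedding": (92.54, 93.92),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```

The recomputed values agree with the reported F1 scores to within about 0.01. A difference that small is what one would expect from precision and recall having been rounded to two decimals before being tabulated.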

While these results look promising, keep in mind that the average English sentence is about 14 words long. If per-token errors are treated as independent, the chance of labeling every token in such a sentence correctly is (98.83%)^14, which is approximately 85%. The state of the art at the time of writing has an F1 score of 94.6, so there is still room for improvement on this task. One idea for future work is to add another embedding layer that captures character-level features in addition to the spelling features, perhaps using a CNN.
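The sentence-level estimate can be reproduced in a few lines. This sketch assumes per-token errors are independent, which is a simplification, and the function name `sentence_accuracy` is chosen here for illustration.

```python
def sentence_accuracy(token_accuracy, sentence_length):
    """Probability that every token in a sentence is tagged correctly,
    assuming each token is an independent trial."""
    return token_accuracy ** sentence_length


AVG_SENTENCE_LENGTH = 14  # average English sentence length used in the text

for name, accuracy in [("random embeddings", 0.9603),
                       ("GloVe + spelling", 0.9883)]:
    estimate = sentence_accuracy(accuracy, AVG_SENTENCE_LENGTH)
    print(f"{name}: {estimate:.1%} of 14-word sentences fully correct")
```

Under the same assumption, the random-embedding model's 96.03% token accuracy corresponds to only about 57% of 14-word sentences being labeled without error, which shows how quickly small per-token gaps compound.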
