Named Entity Recognition (NER) Labeling

A deep learning model for named entity recognition (NER) using a bidirectional LSTM (BLSTM) and GloVe word embeddings in PyTorch. Two models were trained on the CoNLL-2003 corpus. The difficulty in correctly predicting NER tags comes from encountering unknown words: if a word is not in the corpus, the model cannot be expected to classify its NER tag correctly. However, strategies for handling unknown words can lessen the severity of this issue. For example, adding <UNK> tokens, along with variants such as <UNK-UPPERCASE> and <UNK-NUMBER>, helps the model learn how to handle unknown words.
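The unknown-word strategy above can be sketched as a small preprocessing function. This is an illustrative example, not the repository's code; the exact token names and the digit pattern are assumptions:

```python
import re

def unk_token(word: str, vocab: set) -> str:
    """Map an out-of-vocabulary word to a pseudo-token that preserves
    coarse surface features (capitalization, digits), so the model can
    still exploit them at test time. Illustrative sketch only."""
    if word in vocab:
        return word
    if word.isupper():
        return "<UNK-UPPERCASE>"
    if re.fullmatch(r"[\d.,-]+", word):
        return "<UNK-NUMBER>"
    return "<UNK>"

vocab = {"the", "cat", "sat"}
print([unk_token(w, vocab) for w in ["the", "IBM", "1999", "zebra"]])
# → ['the', '<UNK-UPPERCASE>', '<UNK-NUMBER>', '<UNK>']
```

During training, rare words in the corpus would be replaced by these pseudo-tokens so their embeddings receive gradient updates.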

For more details on the dataset, see the CoNLL-2003 shared task description.

First Model: BLSTM with random embeddings

This model is a typical one-layer bidirectional LSTM with dropout; however, the embeddings were randomly initialized. The model architecture is shown below:

Embedding (100-dim) > BLSTM (size 256) > Linear (size 512) > ELU > Linear (size 10)
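The architecture above could be expressed in PyTorch roughly as follows. This is a sketch under stated assumptions: the dropout rate, vocabulary size, and dropout placement are not given in the README and are illustrative:

```python
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """Sketch of the first model: randomly initialized 100-dim embeddings,
    a one-layer BLSTM (hidden size 256, so 512-dim output), then
    Linear(512) > ELU > Linear(10). Dropout rate is an assumption."""
    def __init__(self, vocab_size: int, num_tags: int = 10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 100)      # random init
        self.blstm = nn.LSTM(100, 256, num_layers=1,
                             bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.33)                     # assumed rate
        self.fc1 = nn.Linear(2 * 256, 512)                  # BLSTM concat -> 512
        self.elu = nn.ELU()
        self.fc2 = nn.Linear(512, num_tags)                 # one logit per tag

    def forward(self, token_ids):                           # (batch, seq_len)
        x = self.embedding(token_ids)
        x, _ = self.blstm(x)
        x = self.dropout(x)
        return self.fc2(self.elu(self.fc1(x)))              # (batch, seq_len, tags)

model = BLSTMTagger(vocab_size=20000)
logits = model(torch.randint(0, 20000, (2, 14)))
print(logits.shape)  # torch.Size([2, 14, 10])
```

The per-token logits would then be trained with a cross-entropy loss against the gold NER tags.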

Second Model: BLSTM with GloVe embeddings

This model is similar to the previous one, but the embedding layer is replaced with a pretrained GloVe embedding layer. Moreover, spelling features of the words were concatenated to the embeddings, yielding a richer word representation. Examples of spelling features include ALL_UPPER (e.g. IBM), ALL_LOWER (e.g. cat), NUMBER, FIRST_UPPER (e.g. John), and OTHERS. The improvement is clearly visible in the F1 score jump from the previous model (80.33 to 93.23). The model architecture is shown below:

GloVe Embedding + Spelling Embedding (120-dim) > BLSTM (size 256) > Linear (size 512) > ELU > Linear (size 10)
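The concatenation of GloVe and spelling embeddings might look like the sketch below. The 20-dim spelling embedding is inferred from the 120-dim total (100-dim GloVe + 20); freezing the GloVe weights and the random stand-in matrix are assumptions for illustration:

```python
import torch
import torch.nn as nn

NUM_SPELLING = 5  # ALL_UPPER, ALL_LOWER, NUMBER, FIRST_UPPER, OTHERS

class GloveSpellingTagger(nn.Module):
    """Sketch of the second model: pretrained GloVe vectors concatenated
    with a learned spelling-feature embedding (20-dim assumed), feeding
    the same BLSTM > Linear > ELU > Linear stack."""
    def __init__(self, glove_weights: torch.Tensor, num_tags: int = 10):
        super().__init__()
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.spell_emb = nn.Embedding(NUM_SPELLING, 20)
        self.blstm = nn.LSTM(120, 256, bidirectional=True, batch_first=True)
        self.fc1 = nn.Linear(512, 512)
        self.elu = nn.ELU()
        self.fc2 = nn.Linear(512, num_tags)

    def forward(self, token_ids, spelling_ids):
        x = torch.cat([self.word_emb(token_ids),
                       self.spell_emb(spelling_ids)], dim=-1)  # (B, T, 120)
        x, _ = self.blstm(x)
        return self.fc2(self.elu(self.fc1(x)))

glove = torch.randn(20000, 100)  # stand-in for real GloVe vectors
model = GloveSpellingTagger(glove)
logits = model(torch.randint(0, 20000, (2, 14)),
               torch.randint(0, NUM_SPELLING, (2, 14)))
print(logits.shape)  # torch.Size([2, 14, 10])
```

Because the spelling embedding is learned jointly with the tagger, the model can weight surface cues such as capitalization without hand-tuned feature values.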

Results

| Model | Precision | Recall | Accuracy | F1 score |
|---|---|---|---|---|
| BLSTM + Random Embeddings | 84.22% | 76.79% | 96.03% | 80.33 |
| BLSTM + GloVe + Spelling Embedding | 92.54% | 93.92% | 98.83% | 93.23 |

While these results look very promising, keep in mind that the average English sentence is about 14 words long. Assuming independent per-token errors, the probability of labeling an entire 14-word sentence with no mistakes is (98.83%)^14, which is approximately 85%. The current state of the art has an F1 score of 94.6, so there is still room for improvement on this task. One future direction that could enhance this method is adding an embedding layer that captures not only spelling features but also character-level features, perhaps by utilizing a CNN.
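The sentence-level figure follows directly from compounding the per-token accuracy, under the (simplifying) assumption that token errors are independent:

```python
# Per-token accuracy compounds over a sentence: assuming independent
# errors, a 14-token sentence is fully correct with probability acc**14.
token_acc = 0.9883
sentence_acc = token_acc ** 14
print(f"{sentence_acc:.2%}")  # → 84.81%
```

In practice errors are correlated (a misread entity often spans several tokens), so this is a rough lower-bound intuition rather than an exact figure.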
