Medical Named Entity Recognition (MedicalNER)

An implementation of several models (BiLSTM-CRF, BiLSTM-CNN, BiLSTM-BiLSTM) for Medical Named Entity Recognition (NER)

Abstract

With the development of medical Artificial Intelligence (AI) systems, Natural Language Processing (NLP) has played an essential role in processing medical texts and building intelligent machines. Named Entity Recognition (NER), one of the most basic NLP tasks, is widely studied because it is the cornerstone of downstream NLP tasks such as Relation Extraction. In this work, character-level Bidirectional Long Short-Term Memory (BiLSTM)-based models were introduced to tackle the challenges of medical texts. The input character embedding vectors were randomly initialized and then updated during training. The character-level BiLSTM extracted features from the order-sensitive sequential medical data, and a Conditional Random Field (CRF) layer then predicted the final entity tags. Results show that the presented method takes advantage of the recurrent architecture and achieves competitive performance on medical texts. These promising results pave the road towards building robust and powerful medical AI engines.


Topic and Study

Task: Named Entity Recognition (NER) implemented using PyTorch

Background: Medical & Clinical Healthcare

Level: Character (and Word) Level

Data Annotation: BIOES tagging scheme (see the example after the method list below)

Method:

  1. CRF++

  2. Character-level BiLSTM + CRF

  3. Character-level BiLSTM + Word-level BiLSTM + CRF

  4. Character-level BiLSTM + Word-level CNN + CRF
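
To make the BIOES scheme concrete, here is a hypothetical character-level example (B = Begin, I = Inside, E = End, S = Single, O = Outside; the SYMPTOM and DRUG types are illustrative placeholders, not the anonymized entity IDs used in this dataset):

# "头疼" (headache) is a two-character entity; "阿司匹林" (aspirin) spans four characters
chars = ["头", "疼", "服", "阿", "司", "匹", "林"]
tags  = ["B-SYMPTOM", "E-SYMPTOM", "O", "B-DRUG", "I-DRUG", "I-DRUG", "E-DRUG"]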

Results:

 Results of this work can be downloaded here.

Prerequisites:

 For Word-level models:

 The pre-trained word vectors can be downloaded here.

import codecs

import numpy as np

def load_word_vector(self):
    """
    Load pre-trained word vectors from `word_vectors.vec`
    (one entry per line: the word followed by `word_dim` floats).
    """
    print("Start to load pre-trained word vectors!!")
    pre_trained = {}
    for line in codecs.open(self.model_path + "word_vectors.vec", 'r', encoding='utf-8'):
        line = line.rstrip().split()
        # Keep only well-formed lines: the word plus exactly `word_dim` values
        if len(line) == self.word_dim + 1:
            pre_trained[line[0]] = np.array([float(x) for x in line[1:]]).astype(np.float32)
    return pre_trained
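
The loaded vectors are then typically copied into the word-embedding weight matrix before training. A minimal sketch of that step (the word_to_id mapping and the sizes are illustrative assumptions, not the repo's exact code):

import numpy as np
import torch
import torch.nn as nn

word_to_id = {"<unk>": 0, "患者": 1, "头疼": 2}   # hypothetical vocabulary
word_dim = 100
pre_trained = {}                                  # as returned by load_word_vector()

# Random init for out-of-vocabulary words, overwritten by pre-trained vectors
weights = np.random.uniform(-0.1, 0.1, (len(word_to_id), word_dim)).astype(np.float32)
for word, idx in word_to_id.items():
    if word in pre_trained:
        weights[idx] = pre_trained[word]

word_embed = nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)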

 For Character-level models:

 The character embeddings are randomly initialized and then updated during training via PyTorch's nn.Embedding module.

self.char_embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=self.char_dim)
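
For instance, a batch of character indices is mapped to trainable vectors like this (the sizes are illustrative; the repo reads the real values from config.yml):

import torch
import torch.nn as nn

vocab_size, char_dim = 2258, 100                 # illustrative sizes
char_embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=char_dim)

char_ids = torch.tensor([[5, 17, 42, 8]])        # one sentence of 4 character indices
vectors = char_embed(char_ids)                   # shape: (1, 4, 100)
print(char_embed.weight.requires_grad)           # True: updated by back-propagation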

Some Statistics

Number of entity types: 34

No.  Entity     Count  Recognized
1    E95f2a617   3221  ✔︎
2    E320ca3f6   6338  ✔︎
3    E340ca71c  22209  ✔︎
4    E1ceb2bd7   3706  ✔︎
5    E1deb2d6a   9744  ✔︎
6    E370cabd5   6196  ✔︎
7    E360caa42   5268  ✔︎
8    E310ca263   6948  ✔︎
9    E300ca0d0   9490  ✔︎
10   E18eb258b   4526  ✔︎
11   E3c0cb3b4   6280  ✔︎
12   E1beb2a44   1663  ✔︎
13   E3d0cb547   1025  ✔︎
14   E14eb1f3f    406
15   E8ff29ca5   1676  ✔︎
16   E330ca589   1487  ✔︎
17   E89f29333   1093
18   E8ef29b12    217
19   E1eeb2efd   1637  ✔︎
20   E1aeb28b1    209
21   E17eb23f8    670  ✔︎
22   E87f05176    407  ✔︎
23   E88f05309    355  ✔︎
24   E19eb271e    152
25   E8df2997f    135
26   E94f2a484    584  ✔︎
27   E13eb1dac     58
28   E85f04e50      6
29   E8bf057c2      7
30   E8cf297ec      6
31   E8ff05e0e      6  ⨉︎
32   E87e38583     18  ⨉︎
33   E86f04fe3      6  ⨉︎
34   E8cf05955     64  ⨉︎

train data: 6494 sentences   vocab size: 2258   unique tags: 74

dev data: 865 sentences   vocab size: 2258   unique tags: 74

(data = number of sentences; vocab size = size of the character vocabulary; unique tags = number of prefix + entity combinations)


Structure of the code

At the root of the project, you will see:

├── data
|  └── train            # Training set
|  └── val              # Validation set
|  └── test             # Testing set
├── models
|  └── data.pkl         # Contains all the processed data, e.g., the look-up tables
|  └── params.pkl       # Saved PyTorch model
├── preprocess-data.py  # Preprocesses the original dataset
├── data_manager.py     # Loads the train/val/test data
├── model.py            # BiLSTM-CRF with Attention model
├── main.py             # Main code for training and prediction
├── utils.py            # Helper functions for the prediction stage and evaluation metrics
├── config.yml          # Contains the hyper-parameter settings
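
As an example of how such a config.yml is typically consumed (a sketch assuming PyYAML and a flat key layout; the actual keys are listed under "Hyperparameters settings" below):

import yaml

with open("config.yml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

hidden_size = config["hidden_size"]      # e.g., 128 or 256
batch_size = config["batch_size"]        # e.g., 8 / 16 / 32 / 64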

Basic Model Architecture

    Character Input
          |                         
     Lookup Layer  <----------------|    Update Character Embedding
          |                         |
     Bi-LSTM Model  <---------------|        Extract Features
          |                         |     Back-propagation Errors
     Linear Layer  <----------------|   Update Trainable Parameters
          |                         |
       CRF Model  <-----------------|    
          |                         |
Output corresponding tags  ---> [NLL Loss] <---  Target tags
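
In PyTorch terms, the stack above boils down to roughly the following skeleton (a minimal sketch, not the repo's exact model.py; the CRF layer with its transition matrix, NLL loss, and Viterbi decoding is only indicated):

import torch.nn as nn

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, char_dim, hidden_size, num_tags):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, char_dim)        # lookup layer
        self.bilstm = nn.LSTM(char_dim, hidden_size // 2,
                              bidirectional=True, batch_first=True) # feature extractor
        self.linear = nn.Linear(hidden_size, num_tags)              # emission scores
        # A CRF layer (transition scores + NLL loss + Viterbi decoding)
        # sits on top of the emissions; it is omitted in this sketch.

    def forward(self, char_ids):
        embeds = self.char_embed(char_ids)    # (batch, seq_len, char_dim)
        features, _ = self.bilstm(embeds)     # (batch, seq_len, hidden_size)
        return self.linear(features)          # (batch, seq_len, num_tags)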

Limitations

  1. Currently only supports CPU training

    Training on a GPU is actually much slower than on the CPU because Viterbi decoding is implemented as a sequential Python for loop (see the sketch below).

  2. Cannot reliably recognize entity types with few examples (< 500 samples)
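
For reference, this is the kind of per-timestep loop that limits GPU utilization (a minimal Viterbi sketch under assumed shapes, not the repo's exact implementation):

import torch

def viterbi_decode(emissions, transitions):
    # emissions: (seq_len, num_tags) emission scores for one sentence
    # transitions: (num_tags, num_tags), transitions[i, j] = score of tag i -> tag j
    seq_len, num_tags = emissions.shape
    score = emissions[0]                 # best score ending in each tag so far
    history = []
    for t in range(1, seq_len):          # sequential Python loop: no GPU parallelism
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)   # best previous tag for each current tag
        history.append(best_prev)
    best_tag = int(score.argmax())
    path = [best_tag]
    for best_prev in reversed(history):  # backtrack through the stored pointers
        best_tag = int(best_prev[best_tag])
        path.append(best_tag)
    return list(reversed(path))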


Final Results

Overall F1 score on 18 entities:

Separate F1 score on each entity:

No.  Entity     Count  F1 Score
1    E95f2a617   3221  ✔︎
2    E320ca3f6   6338  ✔︎
3    E340ca71c  22209  ✔︎
4    E1ceb2bd7   3706  ✔︎
5    E1deb2d6a   9744  ✔︎
6    E370cabd5   6196  ✔︎
7    E360caa42   5268  ✔︎
8    E310ca263   6948  ✔︎
9    E300ca0d0   9490  ✔︎
10   E18eb258b   4526  ✔︎
11   E3c0cb3b4   6280  ✔︎
12   E1beb2a44   1663  ✔︎
13   E3d0cb547   1025  ✔︎
14   E8ff29ca5   1676  ✔︎
15   E330ca589   1487  ✔︎
16   E1eeb2efd   1637  ✔︎
17   E17eb23f8    670  ✔︎
18   E94f2a484    584  ✔︎

Hyperparameters settings

Name            Value
embedding_size  30 / 40 / 50 / 100
hidden_size     128 / 256
batch_size      8 / 16 / 32 / 64
dropout rate    0.50 / 0.75
learning rate   0.01 / 0.001
epochs          100
weight decay    0.0005
max length      100 / 120

Model Deployment

The BiLSTM + CRF model has been deployed as a web app using Docker + Flask.

The code and demos are open-sourced in this repo.
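
A minimal Flask endpoint of the kind such a deployment exposes (a sketch; the route, payload format, and predict_entities stub are hypothetical, not the deployed app's actual API):

from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_entities(text):
    # Hypothetical stand-in for the repo's BiLSTM + CRF prediction routine
    return {"text": text, "entities": []}

@app.route("/ner", methods=["POST"])
def ner():
    text = request.get_json()["text"]          # e.g., {"text": "患者头疼..."}
    return jsonify(predict_entities(text))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)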

Screenshot of the model output:


Reference

Traditional Methods for NER: BiLSTM + CNN + CRF

  1. Neural Architectures for Named Entity Recognition

  2. Log-Linear Models, MEMMs, and CRFs

  3. Named Entity Recognition with Bidirectional LSTM-CNNs

  4. A Survey on Deep Learning for Named Entity Recognition

SOTA Methods for NER (in my opinion)

  1. Lattice LSTM

  2. The Right Way to Do Chinese NER: A Summary of Lexicon Enhancement Methods (from Lattice LSTM to FLAT)

  3. How Industry Tackles the NER Problem


License

MIT License
