With the development of medical Artificial Intelligence (AI) systems, Natural Language Processing (NLP) has played an essential role in processing medical texts and building intelligent machines. Named Entity Recognition (NER), one of the most basic NLP tasks, is studied first because it is the cornerstone of downstream NLP tasks such as Relation Extraction. In this work, character-level Bidirectional Long Short-Term Memory (BiLSTM)-based models are introduced to tackle the challenges of medical texts. The input character embedding vectors are randomly initialized and then updated during training. The character-level BiLSTM extracts features from the order-sensitive sequential medical data, and the Conditional Random Field (CRF) that follows predicts the final entity tags. Results show that the presented method takes advantage of the recurrent architecture and achieves competitive performance on medical texts. These promising results pave the road towards building robust and powerful medical AI engines.
Task: Named Entity Recognition (NER) implemented using PyTorch
Background: Medical & Clinical Healthcare
Level: Character (and Word) Level
Data Annotation: BIOES tagging Scheme
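Under the BIOES scheme, each character is tagged as the Beginning, Inside, or End of a multi-character entity, as a Single-character entity, or as Outside any entity. A minimal sketch of how a labeled span could be expanded into BIOES tags (the `DISEASE` type and the helper below are hypothetical, not the project's actual tag set):

```python
def to_bioes(length, spans):
    """Expand (start, end, entity_type) spans with inclusive indices into BIOES tags."""
    tags = ["O"] * length                     # Outside by default
    for start, end, etype in spans:
        if start == end:
            tags[start] = "S-" + etype        # Single-character entity
        else:
            tags[start] = "B-" + etype        # Beginning
            for i in range(start + 1, end):
                tags[i] = "I-" + etype        # Inside
            tags[end] = "E-" + etype          # End
    return tags

# A 6-character sentence containing one hypothetical DISEASE entity at positions 1-3
print(to_bioes(6, [(1, 3, "DISEASE")]))
# ['O', 'B-DISEASE', 'I-DISEASE', 'E-DISEASE', 'O', 'O']
```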
Method:

- CRF++
- Character-level BiLSTM + CRF
- Character-level BiLSTM + Word-level BiLSTM + CRF
- Character-level BiLSTM + Word-level CNN + CRF
Results:
Results of this work can be downloaded here.
Prerequisites:
For Word-level models:
The pre-trained word vectors can be downloaded here.
```python
import codecs
import numpy as np

def load_word_vector(self):
    """
    Load pre-trained word vectors from `word_vectors.vec`.
    Each line holds a word followed by `word_dim` float values.
    """
    print("Start to load pre-trained word vectors!")
    pre_trained = {}
    for line in codecs.open(self.model_path + "word_vectors.vec", 'r', encoding='utf-8'):
        line = line.rstrip().split()
        # Skip malformed lines whose dimension does not match word_dim.
        if len(line) == self.word_dim + 1:
            pre_trained[line[0]] = np.array([float(x) for x in line[1:]]).astype(np.float32)
    return pre_trained
```
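Once loaded, these vectors are typically copied into the word embedding layer's weight matrix. A minimal sketch, assuming a `word_to_id` vocabulary mapping built from the training data (the function and variable names here are illustrative, not the project's actual attributes):

```python
import numpy as np
import torch
import torch.nn as nn

def build_word_embedding(pre_trained, word_to_id, word_dim):
    """Create an nn.Embedding whose rows are initialized from pre-trained vectors."""
    weights = np.random.uniform(-0.25, 0.25, (len(word_to_id), word_dim)).astype(np.float32)
    for word, idx in word_to_id.items():
        if word in pre_trained:              # out-of-vocabulary words keep their random rows
            weights[idx] = pre_trained[word]
    embed = nn.Embedding(num_embeddings=len(word_to_id), embedding_dim=word_dim)
    embed.weight.data.copy_(torch.from_numpy(weights))
    return embed
```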
For Character-level models:
The character embeddings are randomly initialized and updated during training through PyTorch's `nn.Embedding` layer:

```python
self.char_embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=self.char_dim)
```
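A quick, self-contained sketch of the lookup step with a hypothetical character-to-index mapping (not part of the repository code):

```python
import torch
import torch.nn as nn

# Hypothetical character vocabulary; unknown characters fall back to the <UNK> index.
char_to_id = {"<PAD>": 0, "<UNK>": 1, "头": 2, "痛": 3}
char_embed = nn.Embedding(num_embeddings=len(char_to_id), embedding_dim=50)

char_ids = torch.tensor([[char_to_id.get(c, 1) for c in "头痛"]])  # shape: (1, 2)
char_vectors = char_embed(char_ids)                               # shape: (1, 2, 50)
```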
Number of entities: 34
No. | Entity | Number | Recognized |
---|---|---|---|
1 | E95f2a617 | 3221 | ✔︎ |
2 | E320ca3f6 | 6338 | ✔︎ |
3 | E340ca71c | 22209 | ✔︎ |
4 | E1ceb2bd7 | 3706 | ✔︎ |
5 | E1deb2d6a | 9744 | ✔︎ |
6 | E370cabd5 | 6196 | ✔︎ |
7 | E360caa42 | 5268 | ✔︎ |
8 | E310ca263 | 6948 | ✔︎ |
9 | E300ca0d0 | 9490 | ✔︎ |
10 | E18eb258b | 4526 | ✔︎ |
11 | E3c0cb3b4 | 6280 | ✔︎ |
12 | E1beb2a44 | 1663 | ✔︎ |
13 | E3d0cb547 | 1025 | ✔︎ |
14 | E14eb1f3f | 406 | ⨉ |
15 | E8ff29ca5 | 1676 | ✔︎ |
16 | E330ca589 | 1487 | ✔︎ |
17 | E89f29333 | 1093 | ⨉ |
18 | E8ef29b12 | 217 | ⨉ |
19 | E1eeb2efd | 1637 | ✔︎ |
20 | E1aeb28b1 | 209 | ⨉ |
21 | E17eb23f8 | 670 | ✔︎ |
22 | E87f05176 | 407 | ✔︎ |
23 | E88f05309 | 355 | ✔︎ |
24 | E19eb271e | 152 | ⨉ |
25 | E8df2997f | 135 | ⨉ |
26 | E94f2a484 | 584 | ✔︎ |
27 | E13eb1dac | 58 | ⨉ |
28 | E85f04e50 | 6 | ⨉ |
29 | E8bf057c2 | 7 | ⨉ |
30 | E8cf297ec | 6 | ⨉ |
31 | E8ff05e0e | 6 | ⨉︎ |
32 | E87e38583 | 18 | ⨉︎ |
33 | E86f04fe3 | 6 | ⨉︎ |
34 | E8cf05955 | 64 | ⨉︎ |
train data: 6494    vocab size: 2258    unique tags: 74
dev data: 865    vocab size: 2258    unique tags: 74
Here, "data" is the number of sentences, "vocab size" is the size of the character vocabulary, and "unique tags" is the number of (prefix + entity) tag combinations.
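A sketch of how such statistics could be computed, assuming a CoNLL-style file with one character and its BIOES tag per line and blank lines separating sentences (this file layout is an assumption, not confirmed by the repository):

```python
def corpus_stats(path):
    """Count sentences, unique characters, and unique tags in a CoNLL-style file."""
    sentences, chars, tags = 0, set(), set()
    in_sentence = False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                       # a blank line closes the current sentence
                sentences += int(in_sentence)
                in_sentence = False
                continue
            columns = line.split()
            chars.add(columns[0])              # first column: character
            tags.add(columns[-1])              # last column: BIOES tag
            in_sentence = True
    sentences += int(in_sentence)              # handle a file that does not end with a blank line
    return sentences, len(chars), len(tags)
```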
At the root of the project, you will see:
├── data
| └── train # Training set
| └── val # Validation set
| └── test # Testing set
├── models
| └── data.pkl # Contains all the data used, e.g., the look-up tables
| └── params.pkl # Saved PyTorch model parameters
├── preprocess-data.py # Preprocesses the original dataset
├── data_manager.py # Loads the train/val/test data
├── model.py # BiLSTM-CRF with Attention model
├── main.py # Main script for training and prediction
├── utils.py # Helper functions for the prediction stage and evaluation criteria
├── config.yml # Hyper-parameter settings
Character Input
|
Lookup Layer <----------------| Update Character Embedding
| |
Bi-LSTM Model <---------------| Extract Features
| | Back-propagation Errors
Linear Layer <----------------| Update Trainable Parameters
| |
CRF Model <-----------------|
| |
Output corresponding tags ---> [NLL Loss] <--- Target tags
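The diagram above corresponds roughly to the following minimal BiLSTM-CRF sketch, written as a simplified single-sequence version without START/STOP tags or batching; it is not the exact implementation in `model.py`:

```python
import torch
import torch.nn as nn

class BiLSTMCRFSketch(nn.Module):
    """Character embedding -> BiLSTM -> Linear emissions -> CRF loss (simplified)."""

    def __init__(self, vocab_size, tagset_size, char_dim=100, hidden_size=128):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, char_dim)
        self.bilstm = nn.LSTM(char_dim, hidden_size // 2,
                              bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_size, tagset_size)
        # transitions[i, j]: score of moving from tag j to tag i
        self.transitions = nn.Parameter(torch.randn(tagset_size, tagset_size))

    def emissions(self, char_ids):
        # char_ids: (batch, seq_len) -> per-tag emission scores (batch, seq_len, tagset_size)
        lstm_out, _ = self.bilstm(self.char_embed(char_ids))
        return self.hidden2tag(lstm_out)

    def nll(self, emission, tags):
        """Negative log-likelihood of one tag sequence; emission is (seq_len, tagset_size)."""
        seq_len = emission.size(0)
        # Score of the gold tag path.
        gold = emission[0, tags[0]]
        for t in range(1, seq_len):
            gold = gold + self.transitions[tags[t], tags[t - 1]] + emission[t, tags[t]]
        # Log partition function via the forward algorithm.
        alpha = emission[0]
        for t in range(1, seq_len):
            alpha = torch.logsumexp(alpha.unsqueeze(0) + self.transitions, dim=1) + emission[t]
        return torch.logsumexp(alpha, dim=0) - gold
```

During training the NLL is minimized; at prediction time, Viterbi decoding over the same emission and transition scores produces the output tags.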
- Currently only CPU training is supported. Training on a GPU is actually much slower than on the CPU because Viterbi decoding relies on a Python for loop (see the sketch below).
- Entities with few examples (< 500 samples) cannot be recognized.
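The sequential nature of that loop is visible in a minimal Viterbi decoder for the CRF layer (again a simplified single-sequence sketch consistent with the model above, not the repository's exact code):

```python
import torch

def viterbi_decode(emission, transitions):
    """Return the best tag sequence for one sentence.

    emission: (seq_len, n_tags) scores from the BiLSTM + Linear layers.
    transitions[i, j]: score of moving from tag j to tag i.
    """
    seq_len, _ = emission.shape
    score = emission[0]                      # best score of paths ending in each tag
    backpointers = []
    for t in range(1, seq_len):              # this per-step Python loop is what limits GPU speed-ups
        combined = score.unsqueeze(0) + transitions   # [i, j] = score[j] + transition(j -> i)
        best_prev = combined.argmax(dim=1)            # best previous tag for each current tag i
        score = combined.max(dim=1).values + emission[t]
        backpointers.append(best_prev)
    # Trace the highest-scoring path backwards.
    best_tag = int(score.argmax())
    path = [best_tag]
    for best_prev in reversed(backpointers):
        best_tag = int(best_prev[best_tag])
        path.append(best_tag)
    return list(reversed(path))
```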
Overall F1 score on 18 entities:
Separate F1 score on each entity:
No. | Entity | Number | F1 Score |
---|---|---|---|
1 | E95f2a617 | 3221 | ✔︎ |
2 | E320ca3f6 | 6338 | ✔︎ |
3 | E340ca71c | 22209 | ✔︎ |
4 | E1ceb2bd7 | 3706 | ✔︎ |
5 | E1deb2d6a | 9744 | ✔︎ |
6 | E370cabd5 | 6196 | ✔︎ |
7 | E360caa42 | 5268 | ✔︎ |
8 | E310ca263 | 6948 | ✔︎ |
9 | E300ca0d0 | 9490 | ✔︎ |
10 | E18eb258b | 4526 | ✔︎ |
11 | E3c0cb3b4 | 6280 | ✔︎ |
12 | E1beb2a44 | 1663 | ✔︎ |
13 | E3d0cb547 | 1025 | ✔︎ |
14 | E8ff29ca5 | 1676 | ✔︎ |
15 | E330ca589 | 1487 | ✔︎ |
16 | E1eeb2efd | 1637 | ✔︎ |
17 | E17eb23f8 | 670 | ✔︎ |
18 | E94f2a484 | 584 | ✔︎ |
Name | Value |
---|---|
embedding_size | 30 / 40 / 50 / 100 |
hidden_size | 128 / 256 |
batch_size | 8 / 16 / 32 / 64 |
dropout rate | 0.50 / 0.75 |
learning rate | 0.01 / 0.001 |
epochs | 100 |
weight decay | 0.0005 |
max length | 100 / 120 |
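These settings are read from `config.yml` in the project tree above. A minimal sketch of loading them, with hypothetical key names and one illustrative value picked from each row of the table:

```python
import yaml  # PyYAML

# Hypothetical contents of config.yml (key names are illustrative, not the repository's):
#   embedding_size: 100
#   hidden_size: 128
#   batch_size: 32
#   dropout: 0.5
#   learning_rate: 0.001
#   epochs: 100
#   weight_decay: 0.0005
#   max_length: 120
with open("config.yml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(config["embedding_size"], config["learning_rate"])
```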
The BiLSTM + CRF model has been deployed as a web app using Docker + Flask.
The code and demos are open-sourced in this repo.
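A minimal sketch of what such a Flask prediction endpoint might look like; the route, request format, and `predict_tags` helper are hypothetical and not taken from the deployed app:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_tags(text):
    """Hypothetical wrapper: load the saved PyTorch model and return one tag per character."""
    return ["O"] * len(text)   # placeholder output; the real app would run the BiLSTM-CRF here

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"text": "..."} and return a BIOES tag for each character.
    text = request.get_json(force=True).get("text", "")
    return jsonify({"text": text, "tags": predict_tags(text)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```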
MIT License