Source code: Neural Architectures for Nested NER through Linearization
======================================================================
Jana Straková, Milan Straka and Jan Hajič
https://aclweb.org/anthology/papers/P/P19/P19-1527/
{strakova,straka,hajic}@ufal.mff.cuni.cz
License
-------
Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of
Mathematics and Physics, Charles University, Czech Republic.
This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, You can obtain one at http://mozilla.org/MPL/2.0/.
Please cite as:
---------------
@inproceedings{strakova-etal-2019-neural,
title = {{Neural Architectures for Nested {NER} through Linearization}},
author = {Jana Strakov{\'a} and Milan Straka and Jan Haji\v{c}},
booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
month = jul,
year = {2019},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/P19-1527},
pages = {5326--5331},
}
How to run the tagger
---------------------
1. Install requirements
pip install -r requirements.txt
2. Download the data
ACE-2004: https://catalog.ldc.upenn.edu/LDC2005T09
ACE-2005: https://catalog.ldc.upenn.edu/LDC2006T06
GENIA: http://www.geniaproject.org/
3. Create inputs
The tagger expects input in the CoNLL-2003 format with BILOU-encoded labels.
The CoNLL-2003 shared task data format is described here:
https://www.clips.uantwerpen.be/conll2003/ner/ . The BILOU encoding is
described here (Ratinov and Roth, 2009):
https://www.aclweb.org/anthology/W09-1119 .
The input has one token per line, with sentences delimited by an empty line.
For each token, the columns are separated by tabs. The first column is the
surface token, the second column is the lemma, the third column is the POS tag
and the fourth column is the BILOU-encoded NE label.
For flat corpora (e.g. CoNLL-2003 English and German), the fourth column
contains exactly one NE label, e.g. (example from CoNLL-2003 English):
-DOCSTART- -docstart- NN O
EU EU NNP U-ORG
rejects reject VBZ O
German german JJ U-MISC
call call NN O
to to TO O
boycott boycott VB O
British british JJ U-MISC
lamb lamb NN O
. . . O
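The four-column format described above can be read with a few lines of Python.
This is only a minimal sketch and not code from the tagger; the function name
read_conll is hypothetical, and it assumes every non-empty line has exactly
four tab-separated columns.

```python
def read_conll(lines):
    """Group tab-separated token lines into sentences.

    Returns a list of sentences; each sentence is a list of
    (form, lemma, pos, label) tuples. Empty lines end a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # An empty line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        form, lemma, pos, label = line.split("\t")
        current.append((form, lemma, pos, label))
    if current:
        # The last sentence may lack a trailing empty line.
        sentences.append(current)
    return sentences


sample = "EU\tEU\tNNP\tU-ORG\nrejects\treject\tVBZ\tO\n\nlamb\tlamb\tNN\tO\n"
sentences = read_conll(sample.splitlines())
```

A file can be passed in directly, e.g. read_conll(open(path, encoding="utf-8")),
since iterating over a file object yields its lines.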
For nested NE corpora, the NE tags are linearized (flattened) according to
rules described in the paper, e.g. (example from ACE-2004):
The the DT B-GPE
Chinese chinese JJ I-GPE|U-GPE
government government NN L-GPE
and and CC O
the the DT B-GPE
Australian australian JJ I-GPE|U-GPE
government government NN L-GPE
signed sign VBD O
an an DT O
agreement agreement NN O
today today NN O
, , , O
wherein wherein WRB O
the the DT B-GPE
Australian australian JJ I-GPE|U-GPE
party party NN L-GPE
would would MD O
provide provide VB O
China China NNP U-GPE
with with IN O
a a DT O
preferential preferential JJ O
financial financial JJ O
loan loan NN O
of of IN O
150 150 CD O
million million CD O
Australian australian JJ U-GPE
dollars dollar NNS O
. . . O
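To illustrate how linearized labels such as "I-GPE|U-GPE" relate to nested
entities, the sketch below recovers entity spans from the fourth column. It is
not the decoding used by the tagger or the exact rules from the paper; it
assumes a simplified scheme in which labels are joined by "|", a B- label opens
an entity, an L- label closes the most recently opened entity of the same type,
and a U- label marks a single-token entity. The function name decode_nested is
hypothetical.

```python
def decode_nested(labels):
    """Turn per-token linearized labels into (start, end, type) spans.

    Token indices are 0-based and the end index is inclusive.
    """
    open_starts = {}  # entity type -> stack of start indices
    spans = []
    for i, multi in enumerate(labels):
        for label in multi.split("|"):
            if label == "O":
                continue
            prefix, _, etype = label.partition("-")
            if prefix == "U":
                spans.append((i, i, etype))
            elif prefix == "B":
                open_starts.setdefault(etype, []).append(i)
            elif prefix == "L":
                stack = open_starts.get(etype)
                if stack:  # ignore an L- label with no matching B-
                    spans.append((stack.pop(), i, etype))
            # I- labels only continue an open entity; nothing to record.
    return sorted(spans)


# First three tokens of the ACE-2004 example: "The Chinese government"
spans = decode_nested(["B-GPE", "I-GPE|U-GPE", "L-GPE", "O"])
```

For the three tokens "The Chinese government" this yields the outer GPE
spanning all three tokens and the inner single-token GPE "Chinese".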
The lemmatization and POS tagging can be done with e.g. UDPipe
(http://ufal.mff.cuni.cz/udpipe), with MorphoDiTa
(http://ufal.mff.cuni.cz/morphodita) or with any tool of your choice. If you
don't have a POS tagger or lemmatizer, simply fill the respective columns
with a dummy value (e.g. "_").
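If no lemmatizer or POS tagger is available, the input lines can be produced
from tokens and labels alone. The following is a small sketch under that
assumption; the helper name to_conll_rows and its placeholder parameter are
hypothetical and not part of the tagger.

```python
def to_conll_rows(tokens, labels, placeholder="_"):
    """Build tab-separated input lines with dummy lemma and POS columns."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [
        "\t".join((token, placeholder, placeholder, label))
        for token, label in zip(tokens, labels)
    ]


rows = to_conll_rows(["EU", "rejects"], ["U-ORG", "O"])
# One sentence: its rows followed by the empty line that delimits sentences.
sentence_text = "\n".join(rows) + "\n\n"
```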
4. Get word embeddings
Obtain the following embeddings from the sources described in the paper:
- word2vec,
- FastText,
- BERT,
- ELMo,
- Flair.
The input formats are:
- word2vec: The native word2vec text file.
- FastText: The native FastText binary.
- contextualized embeddings (BERT, ELMo, Flair): A text file with one token per
line. The first column is the token and all remaining columns are the
real-valued components of the vector; columns are separated by spaces. The
format is human-readable but quite large, sorry for the inconvenience. The
per-token BERT contextualized word embeddings are created as an average of
all BERT subwords corresponding to the token. The ELMo and Flair embeddings
are generated using this code: https://github.com/zalandoresearch/flair.
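The text format for contextualized embeddings and the subword averaging can be
sketched as follows. This is an illustration only, not the script used to
produce the embeddings for the paper; the function names are hypothetical, the
subword vectors are assumed to be already computed as plain lists of floats,
and tokens are assumed to contain no spaces.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors (one per subword)."""
    if not vectors:
        raise ValueError("need at least one subword vector")
    dim = len(vectors[0])
    if any(len(vector) != dim for vector in vectors):
        raise ValueError("all subword vectors must have the same size")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def format_line(token, vector):
    """Token first, then the vector values, all separated by spaces."""
    return " ".join([token] + [repr(float(value)) for value in vector])


def parse_line(line):
    """Inverse of format_line: return (token, list of floats)."""
    parts = line.rstrip("\n").split(" ")
    return parts[0], [float(value) for value in parts[1:]]


# Two made-up 2-dimensional subword vectors for one token.
token_vector = average_vectors([[1.0, 2.0], [3.0, 4.0]])
line = format_line("lamb", token_vector)
```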
You can also run the tagger without pretrained word embeddings, using only
end-to-end word embeddings and character-level embeddings (created inside the
tagger), or with a subset of the above-mentioned pretrained word embeddings.
5. Run the tagger
Usage example:
./tagger.py --corpus=CoNLL_en \
  --train_data=conll_en/train_dev_bilou.conll \
  --test_data=conll_en/test_bilou.conll \
  --decoding=seq2seq \
  --epochs=10:1e-3,8:1e-4 \
  --form_wes_model=word_embeddings/conll_en_form.txt \
  --lemma_wes_model=word_embeddings/conll_en_lemma.txt \
  --bert_embeddings_train=bert_embeddings/conll_en_train_dev_bert_large_embeddings.txt \
  --bert_embeddings_test=bert_embeddings/conll_en_test_bert_large_embeddings.txt \
  --flair_train=flair_embeddings/conll_en_train_dev.txt \
  --flair_test=flair_embeddings/conll_en_test.txt \
  --elmo_train=elmo_embeddings/conll_en_train_dev.txt \
  --elmo_test=elmo_embeddings/conll_en_test.txt \
  --name=seq2seq+ELMo+BERT+Flair
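The --epochs value in the usage example, 10:1e-3,8:1e-4, looks like a
comma-separated list of EPOCHS:LEARNING_RATE pairs, i.e. 10 epochs at 1e-3
followed by 8 epochs at 1e-4. That reading is an assumption and is not stated
in this README; check tagger.py for the authoritative behavior. Under that
assumption, such a value could be parsed like this (parse_epochs is a
hypothetical helper, not a function of the tagger):

```python
def parse_epochs(spec):
    """Parse 'EPOCHS:LR,EPOCHS:LR,...' into (int, float) pairs.

    Assumed interpretation of the --epochs argument, not confirmed here.
    """
    schedule = []
    for part in spec.split(","):
        epochs, learning_rate = part.split(":")
        schedule.append((int(epochs), float(learning_rate)))
    return schedule


schedule = parse_epochs("10:1e-3,8:1e-4")
total_epochs = sum(epochs for epochs, _ in schedule)
```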