Working with all lowercase dataset #32

Open
ghost opened this issue Jun 17, 2019 · 4 comments
ghost commented Jun 17, 2019

Thanks for the wonderful work here!

I have some text files and want to extract named entities (NE) from them by running nerTagger.py. However, my files are entirely lowercase, so of course I can't get any NE results.

For instance:

  • [Normal sentence]: I live in New York.
    Output:
...
"text": "I live in New York.",
"entities": [
                {
                    "text": "New York",
                    "class": "LOC",
                    "score": 1.0,
                    "beginOffset": 10,
                    "endOffset": 17
                }
            ]
...
  • [Lowercase sentence]: i live in new york.
    Output:
...
"text": "i live in new york.",
"entities": []
...

Expected:

...
"text": "i live in new york.",
"entities": [
                {
                    "text": "new york",
                    "class": "LOC",
                    "score": 1.0,
                    "beginOffset": 10,
                    "endOffset": 17
                }
            ]
...

Therefore, should we develop a caseless NER model?
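A common stopgap while waiting for a dedicated caseless model is to derive an uncased variant of the training data, so the model never sees casing as a feature. As a minimal sketch (the two-column "token tag" CoNLL-style layout is an assumption about the dataset format):

```python
def lowercase_conll(lines):
    """Lowercase the surface tokens of a CoNLL-style dataset,
    keeping blank lines (sentence boundaries) and tags intact."""
    out = []
    for line in lines:
        if not line.strip():
            # blank line separates sentences; keep it unchanged
            out.append(line)
            continue
        parts = line.split()
        parts[0] = parts[0].lower()  # lowercase the token column only
        out.append(" ".join(parts))
    return out
```

For example, `["New B-LOC", "York I-LOC"]` becomes `["new B-LOC", "york I-LOC"]`; training on such a derived corpus forces the model to rely on context rather than capitalization.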

kermitt2 (Owner) commented

Hi @Protossnam and thanks!

That's a very good point, I will prepare some uncased models for NER (with the uncased embedding data).

@kermitt2 kermitt2 self-assigned this Jun 17, 2019
ghost (Author) commented Jun 18, 2019

@kermitt2 Hi there, friend! I've tried configuring and using another uncased GloVe word embedding, glove.42B.300d.txt (downloaded from https://nlp.stanford.edu/projects/glove/ and extracted from glove.42B.300d.zip). I also trained an uncased model, but its performance is not very good. I'm attaching the lowercase input file and the output JSON here for you as well ;) Keep up your wonderful work!
English-test-noise_niveau_0-SS.txt (This is the lowercase text for NER)
Noise0SS.txt (Result/Output JSON file from Bi-LSTM_CNN_CRF with uncased embedding.)
Note: GitHub doesn't allow uploading .json files, so change the file extension from .txt to .json.

Edit: I added 3 more train/dev/test files
uncased-data-for-delft.zip
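One detail worth checking when swapping in glove.42B.300d: it is an uncased table, so every key is lowercase, and tokens must be normalized before lookup or most words in cased text will miss. A minimal sketch of this (the dict-based table and helper names are illustrative assumptions, not DeLFT's actual embedding API):

```python
def load_glove(path, dim=300):
    """Parse a GloVe .txt file into a {word: vector} dict.
    Each line is: word v1 v2 ... v_dim (space-separated)."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            table[parts[0]] = [float(x) for x in parts[1:1 + dim]]
    return table

def lookup(token, table, dim=300):
    """Look up an embedding in an uncased table such as glove.42B.300d:
    keys are stored lowercase, so normalize first; unknown tokens
    fall back to a zero vector."""
    return table.get(token.lower(), [0.0] * dim)
```

With a cased model this lowercasing would lose information, but against an uncased table it is required for the lookup to hit at all.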

ghost (Author) commented Jun 19, 2019

@kermitt2 Oh, I hadn't configured my nerTagger.py correctly :D I had changed the 'train' code but was using the 'train_eval' option. I've corrected my mistake (changed the dataset in 'train_eval' and used 'tag' to calculate the F1-score). Voilà, here is the output.
Noise0SS.txt
Note: GitHub doesn't allow uploading .json files, so change the file extension from .txt to .json.

kermitt2 (Owner) commented

@Protossnam your results are excellent, great! I wasn't expecting such good results with lowercase text.

I've only downloaded the uncased embeddings, but I have not yet started any training or adapted the code. I was thinking of simply adding an --uncased parameter to the command line, setting the relevant parameter accordingly, and extending the model name with -uncased to distinguish it from the other models.
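The proposed switch could look roughly like the sketch below: lowercase the input before tagging and suffix the model name with "-uncased". This is an illustration of the idea only, not the actual nerTagger.py code; the function names and the default model name are assumptions.

```python
import argparse

def build_parser():
    """Argument parser sketch for the proposed --uncased switch."""
    parser = argparse.ArgumentParser(description="NER tagging (sketch)")
    parser.add_argument("--model", default="ner",
                        help="base model name (hypothetical default)")
    parser.add_argument("--uncased", action="store_true",
                        help="use the uncased model and lowercase the input")
    return parser

def resolve(args, text):
    """Derive the effective model name and the text to tag."""
    model_name = args.model + ("-uncased" if args.uncased else "")
    return model_name, (text.lower() if args.uncased else text)
```

Keeping the cased and uncased variants under distinct names means both can coexist on disk and the selection stays explicit on the command line.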

Unfortunately, there is no uncased ELMo model, so I don't think it will be possible to exploit ELMo for caseless text. However, there are uncased BERT models; I am currently adding BERT support, and the base model is already quite good.

Thanks a lot for the news and the 3 train/dev/test files.
