Working with all lowercase dataset #32

Open
ghost opened this issue Jun 17, 2019 · 4 comments
ghost commented Jun 17, 2019

Thanks for the wonderful work here!

I have some text files and want to extract named entities (NE) from them by running nerTagger.py. However, my files are entirely lowercase, so of course I can't get any NE results.

For instance:

  • [Normal sentence]: I live in New York.
    Output:
...
"text": "I live in New York.",
"entities": [
                {
                    "text": "New York",
                    "class": "LOC",
                    "score": 1.0,
                    "beginOffset": 10,
                    "endOffset": 17
                }
            ]
...
  • [Lowercase sentence]: i live in new york.
    Output:
...
"text": "i live in new york.",
"entities": []
...

Expected:

...
"text": "i live in new york.",
"entities": [
                {
                    "text": "new york",
                    "class": "LOC",
                    "score": 1.0,
                    "beginOffset": 10,
                    "endOffset": 17
                }
            ]
...

Therefore, should we develop a caseless NER model?
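A common stopgap while waiting for a dedicated caseless model is to derive an uncased variant of the training data, so the model never sees casing as a feature. As a minimal sketch (the two-column "token tag" CoNLL-style layout is an assumption about the dataset format):

```python
def lowercase_conll(lines):
    """Lowercase the surface tokens of a CoNLL-style dataset,
    keeping blank lines (sentence boundaries) and tags intact."""
    out = []
    for line in lines:
        if not line.strip():
            # blank line separates sentences; keep it unchanged
            out.append(line)
            continue
        parts = line.split()
        parts[0] = parts[0].lower()  # lowercase the token column only
        out.append(" ".join(parts))
    return out
```

For example, `["New B-LOC", "York I-LOC"]` becomes `["new B-LOC", "york I-LOC"]`; training on such a derived corpus forces the model to rely on context rather than capitalization.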

kermitt2 (Owner) commented

Hi @Protossnam and thanks!

That's a very good point, I will prepare some uncased models for NER (with the uncased embedding data).

@kermitt2 kermitt2 self-assigned this Jun 17, 2019
ghost (Author) commented Jun 18, 2019

@kermitt2 Hi there, friend! I've tried configuring and using another uncased GloVe word embedding, glove.42B.300d.txt (downloaded from https://nlp.stanford.edu/projects/glove/ and extracted from glove.42B.300d.zip). I also trained an uncased model, but its performance is not very good. I'm attaching the lowercase input file and the output JSON here for you as well ;) Keep up your wonderful work!
English-test-noise_niveau_0-SS.txt (This is the lowercase text for NER)
Noise0SS.txt (Result/Output JSON file from Bi-LSTM_CNN_CRF with uncased embedding.)
Note: GitHub doesn't allow uploading .json files, so change the file extension from .txt to .json.

Edit: I added 3 more train/dev/test files
uncased-data-for-delft.zip
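One detail worth checking when swapping in glove.42B.300d: it is an uncased table, so every key is lowercase, and tokens must be normalized before lookup or most words in cased text will miss. A minimal sketch of this (the dict-based table and helper names are illustrative assumptions, not DeLFT's actual embedding API):

```python
def load_glove(path, dim=300):
    """Parse a GloVe .txt file into a {word: vector} dict.
    Each line is: word v1 v2 ... v_dim (space-separated)."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            table[parts[0]] = [float(x) for x in parts[1:1 + dim]]
    return table

def lookup(token, table, dim=300):
    """Look up an embedding in an uncased table such as glove.42B.300d:
    keys are stored lowercase, so normalize first; unknown tokens
    fall back to a zero vector."""
    return table.get(token.lower(), [0.0] * dim)
```

With a cased model this lowercasing would lose information, but against an uncased table it is required for the lookup to hit at all.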

ghost (Author) commented Jun 19, 2019

@kermitt2 Oh, I hadn't configured my nerTagger.py correctly :D I had changed the 'train' code but was using the 'train_eval' option. I've corrected my mistake (changed the dataset in 'train_eval' and used 'tag' to calculate the F1-score). Voilà, here is the output.
Noise0SS.txt
Note: GitHub doesn't allow uploading .json files, so change the file extension from .txt to .json.

kermitt2 (Owner) commented

@Protossnam your results are excellent, great! I wasn't expecting such good results with lowercase text.

I've only downloaded the uncased embeddings, but I have not yet started any training or adapted the code. I was thinking of simply adding an --uncased parameter to the command line, setting the relevant parameter accordingly, and extending the model name with -uncased to distinguish it from the other models.
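The proposed switch could look roughly like the sketch below: lowercase the input before tagging and suffix the model name with "-uncased". This is an illustration of the idea only, not the actual nerTagger.py code; the function names and the default model name are assumptions.

```python
import argparse

def build_parser():
    """Argument parser sketch for the proposed --uncased switch."""
    parser = argparse.ArgumentParser(description="NER tagging (sketch)")
    parser.add_argument("--model", default="ner",
                        help="base model name (hypothetical default)")
    parser.add_argument("--uncased", action="store_true",
                        help="use the uncased model and lowercase the input")
    return parser

def resolve(args, text):
    """Derive the effective model name and the text to tag."""
    model_name = args.model + ("-uncased" if args.uncased else "")
    return model_name, (text.lower() if args.uncased else text)
```

Keeping the cased and uncased variants under distinct names means both can coexist on disk and the selection stays explicit on the command line.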

Unfortunately, there is no uncased ELMo model, so I don't think it will be possible to exploit ELMo for caseless text. However, there are uncased BERT models; I am currently adding BERT support, and the base model is already quite good.

Thanks a lot for the news and the 3 train/dev/test files.
