Masking PROPN when training/predicting with textcat #7713
-
Hi! As a maintenance note: this is a good question for the discussion forum, so I'll move it there. The original issue will be closed, but will link through to the new, open discussion thread.
-
You're right to note that spaCy doesn't do this by default. If you want this kind of preprocessing, it should be relatively straightforward to implement a custom function tailored to your use-case.

Whether you should mask the actual texts of some entities is up for debate, and depends on the use-case. In general, I agree with you that masking can help prevent overfitting, but that's really only a concern when the dataset is limited or the distribution is somehow skewed. For instance, say you're training a biomedical relation extraction system on articles describing cancer pathways, and you're NOT masking the entity texts: your system might actually be learning the pathways themselves (X inhibits Y) rather than the grammatical/lexical structures in the sentences that describe that X inhibits Y. This could mean that your system will be less likely to pick up lesser-known or new relations, which is definitely bad.

On the other hand, if we're creating a Named Entity Recognizer that is supposed to pick up specific names (e.g. of people, or genes/proteins), it would really benefit from the lexical properties of the actual texts (use of capitalization, hyphens, numbers, ...), and if there is enough data and the network is big enough, it should also be able to generalize properly from the given examples.
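For the textcat use-case specifically, such a custom function could look roughly like the sketch below. This is a minimal illustration under stated assumptions, not spaCy API: the `mask_propn` helper and the `[NAME]` placeholder are made-up names, and it assumes an English pipeline such as `en_core_web_sm` is installed.

```python
import spacy

# Assumes the small English pipeline has been downloaded, e.g. via:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def mask_propn(text: str, placeholder: str = "[NAME]") -> str:
    """Replace every proper-noun token with a placeholder string.

    A hypothetical preprocessing helper: runs the pipeline to get POS
    tags, then rebuilds the text, swapping PROPN tokens for the
    placeholder while preserving the original whitespace.
    """
    doc = nlp(text)
    return "".join(
        (placeholder if token.pos_ == "PROPN" else token.text) + token.whitespace_
        for token in doc
    )

print(mask_propn("Alice moved to Berlin last year."))
# -> "[NAME] moved to [NAME] last year."
```

If you go this route, apply the same masking to the texts both when creating your training data and at inference time, so the textcat model sees a consistent input distribution.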
-
General guidance suggests masking PROPN tokens before training/predicting with text classification models, to prevent over-fitting to specific proper nouns. Some datasets, such as GoEmotions, mask PROPN beforehand (e.g. replacing names with [NAME]). But spaCy's sample training code and datasets don't. Do I need to mask PROPNs beforehand to avoid over-fitting, or does spaCy's training code mask them by itself? Please provide a best practice for preprocessing text data.
Thank you.