Masking PROPN when training/predicting with textcat #7713
-
Hi! As a maintenance note: this is a good question for the discussion forum, so I'll move it there. The original issue will be closed, but will link through to the new, open discussion thread.
-
You're right to note that spaCy doesn't do this by default. If you want this kind of preprocessing, it should be relatively straightforward to implement a custom function tailored to your use-case.

Whether you should mask the actual texts of some entities is up for debate, and depends on the use-case. In general, I agree with you that masking can help prevent overfitting, but that's really only a concern when the dataset is limited or the distribution is somehow skewed. For instance, say you're training a biomedical relation extraction system on articles describing cancer pathways, and you're NOT masking the entity texts: your system might actually be learning the pathways themselves (X inhibits Y) rather than the grammatical/lexical structures in the sentences that describe that X inhibits Y. This could mean that your system will be less likely to pick up lesser-known or new relations, which is definitely bad.

On the other hand, if we're creating a Named Entity Recognizer that is supposed to pick up specific names (e.g. of people, or genes/proteins), it would really benefit from the lexical properties of the actual texts (use of capitalization, hyphens, numbers, ...), and if there is enough data and the network is big enough, it should also be able to generalize properly from the given examples.
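For the textcat use-case specifically, such a custom function could look roughly like the sketch below. This is a minimal illustration under stated assumptions, not spaCy API: the `mask_propn` helper and the `[NAME]` placeholder are made-up names, and it assumes an English pipeline such as `en_core_web_sm` is installed.

```python
import spacy

# Assumes the small English pipeline has been downloaded, e.g. via:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def mask_propn(text: str, placeholder: str = "[NAME]") -> str:
    """Replace every proper-noun token with a placeholder string.

    A hypothetical preprocessing helper: runs the pipeline to get POS
    tags, then rebuilds the text, swapping PROPN tokens for the
    placeholder while preserving the original whitespace.
    """
    doc = nlp(text)
    return "".join(
        (placeholder if token.pos_ == "PROPN" else token.text) + token.whitespace_
        for token in doc
    )

print(mask_propn("Alice moved to Berlin last year."))
# -> "[NAME] moved to [NAME] last year."
```

If you go this route, apply the same masking to the texts both when creating your training data and at inference time, so the textcat model sees a consistent input distribution.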
-
General guidance suggests masking PROPN tokens before training/predicting with text classification models, to prevent over-fitting to specific proper nouns. Some datasets, such as GoEmotions, mask PROPN beforehand (e.g. replacing names with [NAME]). But spaCy's sample training code and datasets don't. Do I need to mask PROPNs beforehand to avoid over-fitting, or does spaCy's training code mask them by itself? Please provide a best practice for preprocessing text data.
Thank you.