Spell Checking in spaCy for feature cleaning #8448
Replies: 3 comments 7 replies
-
I think the reason you won't find many pre-existing solutions for this is that for many applications it's fine to ignore spelling errors. For most datasets, errors should be relatively rare and more or less randomly distributed. If errors are not rare you have "noisy" text, and how you deal with that depends on what kind of noise you have. If you have enough data, your errors should be rare enough that they don't influence the model as a whole.

To give a practical example: if you're doing document classification and assume an error rate of 5% of words, you have 1 in 20 words with spelling errors. That's quite high, but if you're classifying documents into "biology" and "mathematics", you could probably still tell which is which even if you deleted 5% of words. For even moderately formal documents the error rate is likely lower, and the occasional spelling error shouldn't matter. (Mailing list posts would be noisier, but I can't imagine they're that noisy.)

One common recent trend in dealing with noisy text is using data augmentation to make models more robust to it. This means modifying training data to create automatic variation, for example by randomly modifying case. The latest spaCy models accidentally left this out of training, but we should address that soon; for your own training, take a look at the Augmenters docs.

I am pretty surprised that pretrained Transformers didn't work well on the 20 Newsgroups for you. What do you consider "good" results? Can you describe in more detail what kind of texts you're working on and what you're trying to do with them?

Another point: while historically feature selection and cleaning were very important, with recent neural methods it's often preferred to leave the data as-is unless something is quite messy about it. You shouldn't use stop word removal or stemming before using the default spaCy models, for example.
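To make the augmentation idea concrete, here is a minimal sketch of case-based augmentation in plain Python (this is an illustration of the concept only, not spaCy's actual augmenter API; the function name and texts are made up for the example):

```python
import random

def lowercase_augment(examples, level=0.1, seed=42):
    """Return a copy of the training texts where roughly `level`
    (a fraction) of the examples have been lower-cased, simulating
    the kind of case noise found in informal text."""
    rng = random.Random(seed)  # fixed seed for reproducible augmentation
    return [text.lower() if rng.random() < level else text
            for text in examples]

texts = ["The Apollo program", "NASA launched Saturn V", "Orbital mechanics"]
augmented = lowercase_augment(texts, level=0.5)
```

In spaCy v3 you would instead register an augmenter (e.g. the built-in `spacy.lower_case.v1`) in your training config rather than preprocessing texts by hand; see the Augmenters documentation.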
-
It would be nice to know more about the kinds of errors you would like to handle, and what is or isn't working. As I don't want to post unsolicited advertising here, if you contact me using the links on my profile (I don't see one on yours), I will send you a few links to other projects (open-source or proprietary) that you may try. ;-)
-
I suppose I should have described the problem more thoroughly in my original post, so here goes:

We're analyzing documents that were written and sent in before and after a "treatment", where the treatment is an assignment to some level of support for a startup company by national authorities. The ex-ante reports contain information about the plans of the given startups, their identified challenges, and so forth.

First of all, the data is very limited in size (tens to hundreds of documents) compared to other text datasets for which one might typically use machine learning. Secondly, there is reason to believe that the errors have some relation to identity or latent class. Additionally, the spelling errors might differ and apply to different words, yet still relate to the underlying latent class. All in all, being able to label which documents contain spelling errors and then doing a corrected analysis appears valuable for this problem, and I imagine the same holds for similar small-dataset problems.

The end goal is still quite open, but the current vision is to do some simple descriptive statistics with word frequencies and then some clustering, maybe hierarchical. The fears are that (1) the errors are not random, and (2) the features become too many relative to the data points, i.e. p > n.

One might ask why we don't commit more human resources to the problem; the answer is continuity: data is still being collected, and not all resources will be available as time passes. Additionally, there are strict limitations on who can access the data.
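The "label which documents contain spelling errors, then compute frequencies per group" step could be sketched like this (the vocabulary and example texts here are stand-ins for illustration; in practice the wordlist would come from a dictionary or the corpus itself):

```python
import re
from collections import Counter

# Stand-in vocabulary; a real one would be a dictionary or corpus wordlist.
VOCAB = {"startup", "plans", "and", "support", "challenges", "funding"}

def tokenize(text):
    """Lower-case and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def label_docs(docs, vocab=VOCAB):
    """Return per-document error flags plus overall word frequencies,
    so descriptive statistics can be computed separately per group."""
    flags, freqs = [], Counter()
    for doc in docs:
        tokens = tokenize(doc)
        flags.append(any(t not in vocab for t in tokens))
        freqs.update(tokens)
    return flags, freqs

docs = ["Startup plans and funding", "Suport challenges"]
flags, freqs = label_docs(docs)
# flags -> [False, True]  ("Suport" is not in the stand-in vocabulary)
```

With p > n a concern, restricting the frequency table to, say, the top-k most frequent in-vocabulary words before clustering is one simple way to cap the feature count.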
-
Hi all,
For my current project, I am assisting researchers with text analysis.
In that regard, I would like the feature space to be as clean and representative as possible.
Currently, we are observing some spelling errors (non-word errors, NWE) that we would like to correct for.
As far as I can tell, spaCy only has the third-party project ContextualSpellChecker.
I don't wish to knock the authors of that project, but the solution does not work for me.
I have tried multiple pre-trained transformer models on HF, but none give good results on the sci.space subset of the 20 Newsgroups dataset in scikit-learn.
I can imagine that the problem of feature selection and cleaning is common among NLP researchers, so I can't be the only one with this issue.
Is spell-checking functionality something spaCy plans to address in the future, or am I missing some existing modules or functionality?
In the meantime, I may try to implement a solution myself based on k-NN and neural networks, if the scope of the problem calls for it.
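A nearest-neighbour correction along those lines could start from plain edit distance, something like the sketch below (a 1-NN lookup over a vocabulary; a real solution would prune candidates and weight by word frequency or context):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary, max_dist=2):
    """Return the closest in-vocabulary word (1-NN over edit distance),
    or the word unchanged if nothing is within `max_dist` edits."""
    if word in vocabulary:
        return word
    best = min(vocabulary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

vocab = ["space", "shuttle", "orbit", "launch"]
# correct("spcae", vocab) -> "space"
```

The `max_dist` cutoff matters on small datasets: without it, every out-of-vocabulary token gets mapped to some neighbour, which would silently "correct" legitimate rare words.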