Spell Checking in spaCy for feature cleaning #8448
Replies: 3 comments 7 replies
-
I think the reason you won't find many pre-existing solutions for this is that for many applications it's fine to ignore spelling errors. For most datasets, errors should be relatively rare and more or less randomly distributed. If errors are not rare you have "noisy" text, and how you deal with that depends on what kind of noise you have. If you have enough data, your errors should be rare enough that they don't influence the model as a whole.

To give a practical example: if you're doing document classification and assume an error rate of 5% of words, you have 1 in 20 words with spelling errors. That's quite high, but if you're classifying documents into "biology" and "mathematics", you could probably still tell which is which even if you deleted 5% of words. For even moderately formal documents the error rate is likely lower, and the occasional spelling error shouldn't matter. (Mailing list posts would be noisier, but I can't imagine they're that noisy.)

One common recent trend in dealing with noisy text is using data augmentation to make models more robust to it. This means modifying training data to create automatic variation, for example by randomly modifying case. The latest spaCy models accidentally left this out of training, but we should address that soon; for your own training, take a look at the Augmenters docs.

I am pretty surprised that pretrained Transformers didn't work well on the 20 Newsgroups for you. What do you consider "good" results? Can you describe in more detail what kind of texts you're working on and what you're trying to do with them?

Another point: while historically feature selection and cleaning were very important, with recent neural methods it's often preferred to leave the data as-is unless something is quite messy about it. You shouldn't use stop word removal or stemming before using the default spaCy models, for example.
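To make the augmentation idea concrete, here is a minimal sketch of case-based augmentation in plain Python (this is an illustration of the concept only, not spaCy's actual augmenter API; the function name and texts are made up for the example):

```python
import random

def lowercase_augment(examples, level=0.1, seed=42):
    """Return a copy of the training texts where roughly `level`
    (a fraction) of the examples have been lower-cased, simulating
    the kind of case noise found in informal text."""
    rng = random.Random(seed)  # fixed seed for reproducible augmentation
    return [text.lower() if rng.random() < level else text
            for text in examples]

texts = ["The Apollo program", "NASA launched Saturn V", "Orbital mechanics"]
augmented = lowercase_augment(texts, level=0.5)
```

In spaCy v3 you would instead register an augmenter (e.g. the built-in `spacy.lower_case.v1`) in your training config rather than preprocessing texts by hand; see the Augmenters documentation.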
-
It would be nice to know more about the kinds of errors you would like to handle, and what is or isn't working. As I don't want to post unsolicited advertising here, if you contact me using the links on my profile (I don't see one on yours), I will send you a few links to other projects (open-source or proprietary) that you may try. ;-)
-
I suppose I should have described the problem more thoroughly in my original post, so here goes:

We're analyzing documents that were written and sent in before and after a "treatment", where the treatment is an assignment to some level of support for a startup company by national authorities. The ex-ante reports contain information about the plans of the given startups, their identified challenges, and so forth.

First of all, the data is very limited in size (tens to hundreds of documents) compared to other text datasets for which one might typically use machine learning. Secondly, there is reason to believe that the errors have some relation to identity or latent class. Additionally, the spelling errors might differ and apply to different words, yet still relate to the underlying latent class. All in all, being able to label which documents contain spelling errors and then doing a corrected analysis appears valuable for this problem, and I imagine the same holds for similar small-dataset problems.

The end goal is still quite open, but the current vision is to do some simple descriptive statistics with word frequencies and then some clustering, maybe hierarchical. The fears are that (1) the errors are not random, and (2) the features become too many relative to the data points, i.e. p > n.

One might ask why we don't commit more human resources to the problem; the answer is continuity: data is still being collected, and not all resources will be available as time passes. Additionally, there are strict limitations on who can access the data.
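The "label which documents contain spelling errors, then compute frequencies per group" step could be sketched like this (the vocabulary and example texts here are stand-ins for illustration; in practice the wordlist would come from a dictionary or the corpus itself):

```python
import re
from collections import Counter

# Stand-in vocabulary; a real one would be a dictionary or corpus wordlist.
VOCAB = {"startup", "plans", "and", "support", "challenges", "funding"}

def tokenize(text):
    """Lower-case and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def label_docs(docs, vocab=VOCAB):
    """Return per-document error flags plus overall word frequencies,
    so descriptive statistics can be computed separately per group."""
    flags, freqs = [], Counter()
    for doc in docs:
        tokens = tokenize(doc)
        flags.append(any(t not in vocab for t in tokens))
        freqs.update(tokens)
    return flags, freqs

docs = ["Startup plans and funding", "Suport challenges"]
flags, freqs = label_docs(docs)
# flags -> [False, True]  ("Suport" is not in the stand-in vocabulary)
```

With p > n a concern, restricting the frequency table to, say, the top-k most frequent in-vocabulary words before clustering is one simple way to cap the feature count.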
-
Hi all,
For my current project, I am assisting researchers with text analysis.
In that regard, I would like the feature space to be as clean and representative as possible.
Currently, we are observing some spelling errors (non-word errors, NWE) that we would like to correct for.
As far as I can tell, spaCy only has the third-party project ContextualSpellChecker.
I don't wish to knock the authors of that project, but the solution does not work for me.
I have tried multiple pre-trained transformer models on HF, but none give good results on the sci.space subset of the 20 Newsgroups dataset in scikit-learn.
I can imagine that the problem of feature selection and cleaning is common among NLP researchers, so I can't be the only one with this issue.
Is spell-checking functionality something spaCy plans to address in the future, or am I missing some existing modules or functionality?
In the meantime, I may try to implement a solution myself based on k-NN and neural networks, if the scope of the problem calls for it.
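A nearest-neighbour correction along those lines could start from plain edit distance, something like the sketch below (a 1-NN lookup over a vocabulary; a real solution would prune candidates and weight by word frequency or context):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary, max_dist=2):
    """Return the closest in-vocabulary word (1-NN over edit distance),
    or the word unchanged if nothing is within `max_dist` edits."""
    if word in vocabulary:
        return word
    best = min(vocabulary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

vocab = ["space", "shuttle", "orbit", "launch"]
# correct("spcae", vocab) -> "space"
```

The `max_dist` cutoff matters on small datasets: without it, every out-of-vocabulary token gets mapped to some neighbour, which would silently "correct" legitimate rare words.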