Preprocessing text and using sentencizer #9094

farrandi · 2021-08-31T00:20:00Z

I would like to do two things:

1. Preprocess the text: replace every hidden (whitespace) character with a single space " ".
2. Split my long string into sentences at the hidden character \n.

This is what I am doing currently:

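import re

# "..." stands for further punctuation characters elided in the post
puncts = ['!', '\n', ...]
config = {"punct_chars": puncts}
nlp.add_pipe("sentencizer", config=config)

doc = nlp(text)
for sent in doc.sents:
    # collapse whitespace runs (including "\n") and lowercase the sentence
    sentence = re.sub(r'\s+', ' ', sent.text.lower())
    # second spaCy pass over the cleaned sentence
    new_doc = nlp(sentence)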

As seen above, I am making a new doc object after the text has been split into sentences and "preprocessed". Is there a better way to do this without making a new doc? Any other way to make this run faster would also be appreciated. Thanks!

polm · 2021-08-31T04:03:33Z

Since step 1 is just a regex on the raw text, it would be a lot faster to do that before you pass the text to spaCy. (Also, if you're only replacing whitespace, there's no reason to call text.lower().)

Besides that, maybe see the Speed FAQ (#8402), in particular the parts about disabling components you aren't using and about nlp.pipe.

farrandi · 2021-08-31T16:42:51Z

Thanks for the suggestion on nlp.pipe, will use it.

However, the regex replacement is my main problem. I would like to replace all the hidden characters, including "\n", but I would also like to separate the sentences at "\n". Currently I solve this by running spaCy twice: the first pass splits on "\n", then I remove the "\n" and feed each sentence into a new doc.

[On another note, I am lowercasing the text before passing it to spaCy in the hope of increasing accuracy, because I plan to make the training data lower case too. Is that a good idea?]

polm Sep 1, 2021

> However, the regex replacement is my main problem. I would like to replace all the hidden characters, including "\n", but I would also like to separate the sentences at "\n". Currently I solve this by running spaCy twice: the first pass splits on "\n", then I remove the "\n" and feed each sentence into a new doc.

OK, how about this:

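import re

text = ...
# split on "\n" first so the sentence boundaries survive the cleanup
texts = text.split("\n")
# collapse the remaining whitespace runs inside each line
texts = [re.sub(r'\s+', ' ', tt) for tt in texts]
text = "\n".join(texts)
doc = nlp(text)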

This will be faster than calling spaCy twice.
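
A variation on the same idea, sketched in plain Python so it runs without spaCy: split on "\n", collapse whitespace inside each line, and drop empty lines. The function name `split_and_clean` and the sample text are made up for illustration and are not from the thread or from spaCy.

```python
import re

# Precompiled pattern for any run of whitespace (spaces, tabs, "\r", ...).
_WHITESPACE = re.compile(r"\s+")


def split_and_clean(text):
    """Split on "\\n", collapse whitespace runs in each line, drop empty lines.

    Hypothetical helper for illustration only.
    """
    cleaned = []
    for line in text.split("\n"):
        line = _WHITESPACE.sub(" ", line).strip()
        if line:  # skip lines that were empty or whitespace-only
            cleaned.append(line)
    return cleaned


raw = "First  finding:\tnormal.\n\nSecond   finding: mild  dilation.\n"
sentences = split_and_clean(raw)
print(sentences)

# Rejoin with "\n" if a single doc is wanted, as in the snippet above.
joined = "\n".join(sentences)
```

If one doc per line is acceptable, the resulting list could presumably be passed straight to nlp.pipe instead of being rejoined, which would make the "\n" entry in the sentencizer config unnecessary.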

> [On another note, I am lowercasing the text before passing it to spaCy in the hope of increasing accuracy, because I plan to make the training data lower case too. Is that a good idea?]

I'm a little confused. Is your data in lower case or are you just making it lower case for some reason?

The default spaCy models are trained on formal text with proper casing, like newspaper articles, so they can be confused by all lower case text. If your real data is lower case then it may make sense for you to train a model using only lower case text.

farrandi Sep 1, 2021
Author

Thanks for the suggestion for splitting the "\n" characters.

As for the second part, my data is not in lower case, but I am making it lower case. I am planning to train the model and was wondering whether it would be better to train it with the original casing or with everything lowercased. I will mostly work with medical reports, which might contain variants like "left ventricle", "Left Ventricle", "left Ventricle", or "Left ventricle". What would you suggest?

polm Sep 2, 2021

If your data has some case information, I would first try using it as-is, and making sure to use augmentation. If your model seems to be sensitive to case after that, then I would move on to experimenting with lower case only, but not before.

In general you want to use the raw data as much as possible, as smoothing it over can remove useful information that the model can pick up on. So it's good to be careful about removing information before you're sure it's noise.
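
One possible reading of "augmentation" here is to keep the original casing and add lowercased copies of some training texts, so the model sees both forms. The sketch below is plain Python and does not use spaCy's built-in augmenters (those are set in the training config; check the spaCy docs for their registered names). The function name `augment_lowercase`, the `level` and `seed` parameters, and the sample reports are assumptions made for illustration.

```python
import random


def augment_lowercase(texts, level=0.3, seed=0):
    """Return the original texts plus lowercased copies of roughly `level` of them.

    Hypothetical helper for illustration; not spaCy's augmentation API.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    augmented = list(texts)
    for text in texts:
        lowered = text.lower()
        # Skip texts that are already lower case: the copy would be a duplicate.
        if lowered != text and rng.random() < level:
            augmented.append(lowered)
    return augmented


reports = [
    "Left Ventricle is dilated.",
    "left ventricle appears normal.",
    "Left ventricle: mild hypertrophy.",
]
# level=1.0 lowercases every text that is not already lower case.
print(augment_lowercase(reports, level=1.0))
```

The originals are always kept, so no case information is removed; only extra lowercased examples are added.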

farrandi Sep 2, 2021
Author

Oh alright, I will try that.
Thanks for the suggestion!
