Preprocessing text and using sentencizer #9094
-
Since step 1 is just a regex on the raw text, it would be a lot faster to do that before you pass the text to spaCy. (Also, if you're only replacing whitespace, there's no reason to call [...].) Besides that, maybe see the Speed FAQ (#8402), in particular the parts about disabling components you aren't using and [...].
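A minimal sketch of the "regex on the raw text first" idea, using only the standard library. The function name `normalize_whitespace` and the sample string are made up for illustration, and the pattern assumes that "hidden characters" means whitespace such as tabs and newlines; the original regex from the question is not shown in this thread.

```python
import re

# One or more whitespace characters (spaces, tabs, newlines, ...).
# Assumption: this is what "hidden characters" refers to.
_WHITESPACE = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return _WHITESPACE.sub(" ", text).strip()


raw = "First line\twith a tab.\n\nSecond   line."
cleaned = normalize_whitespace(raw)
print(cleaned)
```

The cleaned string is what would then be handed to spaCy, so the pipeline only has to process each text once.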
-
Thanks for the suggestion on [...]. However, replacing the regex is my main problem. I would like to replace all the hidden characters, including "\n", but I would also like to split the sentences on "\n". Currently I solve this by running the doc twice: the first time to split on "\n", then removing the "\n" and feeding the result into a new doc. [On another note, I am lowercasing the text before passing it to spaCy, hoping to increase accuracy, because I plan to make the training data lowercase too. Is that a good idea?]
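One possible way to avoid the second pass, sketched with the standard library only: split the raw text on "\n" first, then clean each piece, so the newline is used as a boundary before it is removed. The helper name `split_and_clean` and the sample text are hypothetical, and the sketch assumes that every non-empty line is exactly one sentence.

```python
import re
from typing import Iterator

# Runs of whitespace left inside a single line (tabs, repeated spaces, "\r").
_INLINE_WS = re.compile(r"\s+")


def split_and_clean(raw: str) -> Iterator[str]:
    """Yield one cleaned string per non-empty line of `raw`."""
    for line in raw.split("\n"):
        cleaned = _INLINE_WS.sub(" ", line).strip()
        if cleaned:  # skip lines that were empty or whitespace-only
            yield cleaned


raw = "First sentence\there.\nSecond  sentence.\n\n"
sentences = list(split_and_clean(raw))
print(sentences)
```

Each resulting string could then be passed to spaCy once, for example in a batch through `nlp.pipe`, instead of building a doc, splitting it, and building a second doc.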
-
So, I would like to do two things:
1. Replace hidden characters with " "
2. Split the text into sentences on \n
So I am doing this currently by:
As seen above, I am making a new doc object after the text has been split into sentences and "preprocessed". Is there a better way to do this without making a new doc? Any other method to make this run faster is also appreciated. Thanks!