\n character tokenization and sentence segmentation #7864
Replies: 1 comment
The main thing is that your training data and the data you're going to process later should be similar, so if you preprocess your training data, plan to preprocess any future inputs the same way to get good results. The pretrained pipeline like |
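As a rough illustration of the "preprocess both sides the same way" advice, here is a minimal sketch that collapses runs of newlines before text reaches the pipeline. The helper name `normalize_newlines` and the choice of replacement string are assumptions made for this example, not something from spaCy or from the thread; whether a single newline or a space works better will depend on your data.

```python
import re

# Hypothetical helper, not part of spaCy: matches a newline together with
# any spaces/tabs before it and any whitespace (including more newlines) after it.
_NEWLINE_RUN = re.compile(r"[ \t]*\n\s*")


def normalize_newlines(text: str, replacement: str = "\n") -> str:
    """Collapse each run of newlines (plus surrounding whitespace) into `replacement`."""
    return _NEWLINE_RUN.sub(replacement, text).strip()


raw = "First sentence.\n\n\nSecond sentence.\n \n"
print(repr(normalize_newlines(raw)))
```

The same function would need to be applied to the training texts before they are annotated or converted and to every text passed to the pipeline later. Keep in mind that normalizing after annotation shifts character offsets, so it is safer to normalize first.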
The raw text that I work with contains a lot of repeated \n characters. I noticed that \n\n\n is tokenized and marked as is_sent_end. Any suggestions on best practice so that I get the best quality sentence segmentation? Should I
I am using spaCy 3 and SentenceRecognizer.