\n character tokenization and sentence segmentation #7864

Phat-Loc · 2021-04-23T01:49:26Z

The raw text that I work with has a lot of repeated \n characters in it. I noticed that \n\n\n gets tokenized and marked as is_sent_end. Any suggestions on best practice so that I get the best quality sentence segmentation? Should I:

Pre-process raw text and replace \n with space
Replace multiple \n with a single \n
Leave it alone

I am using spaCy 3 and the SentenceRecognizer.

adrianeboyd · 2021-04-23T08:05:51Z

The main thing is that you want your training data and the data you're going to process later to be similar, so if you preprocess your training data, you'll have to plan to preprocess any future inputs for good results.
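As a rough sketch of that point, the two preprocessing options from the question could be written as small functions, with the same one applied to the training texts and to anything processed later. The function names here are hypothetical and only the standard library `re` module is used. Note that the first function collapses a whole run of \n into a single space, which is one possible reading of "replace \n with space".

```python
import re


def newlines_to_space(text: str) -> str:
    """Option 1 (assumed reading): replace each run of newlines with one space."""
    return re.sub(r"\n+", " ", text)


def collapse_newlines(text: str) -> str:
    """Option 2: replace two or more consecutive newlines with a single newline."""
    return re.sub(r"\n{2,}", "\n", text)


# Pick ONE normalizer and use it everywhere, so that training data and
# later inputs look the same to the model.
normalize = collapse_newlines

raw = "First sentence.\n\n\nSecond sentence.\nThird sentence."
train_text = normalize(raw)    # applied when building training data
runtime_text = normalize(raw)  # applied again before calling the pipeline
```

Whichever option is chosen, the important part is that `normalize` is the same function on both sides. Leaving the text alone is also fine as long as the training data contains the same kind of whitespace.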

The pretrained pipelines like en_core_web_sm typically haven't been trained with a lot of whitespace tokens, so their predictions on them may be poor. If you train a new model from scratch or fine-tune the senter from en_core_web_sm, it can learn how you want it to classify your whitespace tokens. Obviously if you annotate them really inconsistently in your training data it won't classify them well, but if they're always is_sent_start = False, then the model should be able to learn that quickly.
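A minimal sketch of what consistent annotation could look like when building training examples by hand: every whitespace token is forced to is_sent_start = False, and if a sentence boundary had been placed on a whitespace token, it is moved to the next non-whitespace token. The helper names `relabel_whitespace` and `build_training_doc` are hypothetical, and moving the boundary forward is an assumption about how you might want to annotate, not something the pipeline requires. The second helper assumes spaCy 3 is installed and uses the `sent_starts` argument of `Doc`.

```python
from typing import List


def relabel_whitespace(words: List[str], sent_starts: List[bool]) -> List[bool]:
    """Return sentence-start labels where whitespace tokens are always False.

    A boundary found on a whitespace token is carried over to the next
    non-whitespace token (an assumed annotation policy).
    """
    if len(words) != len(sent_starts):
        raise ValueError("words and sent_starts must have the same length")
    fixed = []
    pending = False  # True if a skipped whitespace token was marked as a start
    for word, is_start in zip(words, sent_starts):
        if word.isspace():
            pending = pending or bool(is_start)
            fixed.append(False)
        else:
            fixed.append(bool(is_start) or pending)
            pending = False
    return fixed


def build_training_doc(words: List[str], sent_starts: List[bool]):
    """Build a spaCy Doc with the cleaned labels (requires spaCy 3)."""
    import spacy
    from spacy.tokens import Doc

    nlp = spacy.blank("en")
    return Doc(nlp.vocab, words=words, sent_starts=relabel_whitespace(words, sent_starts))


words = ["Hello", ".", "\n\n\n", "Bye", "."]
labels = [True, False, True, False, False]  # boundary wrongly placed on "\n\n\n"
cleaned = relabel_whitespace(words, labels)
```

Docs built this way can then be used as the reference side of the training examples for the senter.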