Why does the German sentence tokenizer consider a semicolon a sentence ending? #13352
-
Hi Tamara, good to see you here! To clarify a bit and avoid confusion: it's not the "tokenizer" component in the spaCy pipelines that decides sentence boundaries. spaCy's tokenizer only decides on token boundaries; it's what you get when you initialize an "empty" pipeline with spacy.blank. For sentence segmentation, spaCy has 3 different components:
- the parser (DependencyParser), which predicts sentence boundaries as a by-product of dependency parsing and is what the trained pipelines use by default;
- the senter (SentenceRecognizer), a trained statistical component dedicated to sentence segmentation;
- the sentencizer (Sentencizer), a rule-based component that splits on sentence-final punctuation characters.
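To see the tokenizer/segmenter distinction concretely, here is a minimal sketch (the German sample sentence is my own, not from the thread):

```python
import spacy

# An "empty" pipeline: spacy.blank gives you just the tokenizer,
# with no sentence segmentation component at all.
nlp = spacy.blank("de")
doc = nlp("Das ist ein Satz; er geht noch weiter.")

# Token boundaries are set, but no sentence boundaries exist yet.
print([t.text for t in doc])
print(doc.has_annotation("SENT_START"))  # False
```

Calling list(doc.sents) on such a doc raises an error, because no component has set sentence boundaries.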
As we've found, this semicolon behaviour comes from the trained German parser rather than from a fixed grammatical rule. If you'd like to have the English behaviour for your German text preprocessing, and/or you want a more uniform/predictable behaviour, you can include the rule-based sentencizer in the pipeline so that its boundaries take precedence.
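A minimal sketch of that setup (the sample sentence is my own; here I use a blank pipeline so no trained model needs to be installed):

```python
import spacy

# The rule-based sentencizer splits only on characters like . ! ?
# (its default punct_chars), so a semicolon does NOT end a sentence.
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")

doc = nlp("Das ist ein Satz; er geht noch weiter.")
print(len(list(doc.sents)))  # 1
```

In a trained pipeline such as de_core_news_sm, you would add it before the parser, e.g. nlp.add_pipe("sentencizer", before="parser"), so that the parser respects the pre-set boundaries instead of predicting its own.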
With the sentencizer included, this prints a single sentence (instead of 2). Hope that helps!
-
I am using the sentence tokenizers for English and German to process a book in both languages. I noticed that the German tokenizer considers a semicolon a valid sentence ending and cuts off the parts of sentences after one. While that might make linguistic sense in some isolated cases, it often doesn't, and what gets separated is not a full, complete sentence. Is this grammatically grounded in some way, or does it have more to do with the training data?
Besides being a problem in its own right, it poses a considerable challenge when the tokenizers are used on multilingual parallel data.
Is there a way to avoid this?
Thanks for the clarification!