Why does the German sentence tokenizer consider a semicolon a sentence ending? #13352
-
Hi Tamara, good to see you here! To clarify a bit and avoid confusion: it's not the "tokenizer" component in the spaCy pipelines that decides sentence boundaries. spaCy's tokenizer only decides on token boundaries; it's what you get when you initialize an "empty" pipeline with spacy.blank. For sentence segmentation, spaCy has 3 different components:
- the parser (DependencyParser), which predicts sentence boundaries as a by-product of dependency parsing and is what the trained pipelines use by default;
- the senter (SentenceRecognizer), a trained statistical component dedicated to sentence segmentation;
- the sentencizer (Sentencizer), a rule-based component that splits on sentence-final punctuation characters.
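To see the tokenizer/segmenter distinction concretely, here is a minimal sketch (the German sample sentence is my own, not from the thread):

```python
import spacy

# An "empty" pipeline: spacy.blank gives you just the tokenizer,
# with no sentence segmentation component at all.
nlp = spacy.blank("de")
doc = nlp("Das ist ein Satz; er geht noch weiter.")

# Token boundaries are set, but no sentence boundaries exist yet.
print([t.text for t in doc])
print(doc.has_annotation("SENT_START"))  # False
```

Calling list(doc.sents) on such a doc raises an error, because no component has set sentence boundaries.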
As we've found, this semicolon behaviour comes from the trained German parser rather than from a fixed grammatical rule. If you'd like to have the English behaviour for your German text preprocessing, and/or you want a more uniform/predictable behaviour, you can include the rule-based sentencizer in the pipeline so that its boundaries take precedence.
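A minimal sketch of that setup (the sample sentence is my own; here I use a blank pipeline so no trained model needs to be installed):

```python
import spacy

# The rule-based sentencizer splits only on characters like . ! ?
# (its default punct_chars), so a semicolon does NOT end a sentence.
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")

doc = nlp("Das ist ein Satz; er geht noch weiter.")
print(len(list(doc.sents)))  # 1
```

In a trained pipeline such as de_core_news_sm, you would add it before the parser, e.g. nlp.add_pipe("sentencizer", before="parser"), so that the parser respects the pre-set boundaries instead of predicting its own.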
With the sentencizer included, this prints a single sentence (instead of 2). Hope that helps!
-
I am using the sentence tokenizers for English and German to process a book in both languages. I noticed that the German tokenizer considers a semicolon a valid sentence ending and cuts off the parts of sentences after one. While that might make linguistic sense in some isolated cases, it often doesn't, and what gets separated is not a full, complete sentence. Is this grammatically grounded in some way, or does it have more to do with the training data?
Besides being a problem in its own right, it poses a considerable challenge when the tokenizers are used on multilingual parallel data.
Is there a way to avoid this?
Thanks for the clarification!