`Spacy` has inconsistency when dividing sentences #13346

DhruvSondhi · 2024-02-22T19:00:32Z

Hello,

I am using spaCy to split text into sentences after joining a set of words with whitespace. To my dismay, the results are unpredictable and I cannot explain them. I have a custom segmentation component in which I set custom sentence boundaries (i.e. `is_sent_start`).

Custom Function:

import spacy
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    i = 0
    while i < len(doc[:-1]):
        if doc[i].text.lower() in ["eq", "fig", "al", 'table', "fig."]:
            doc[i+1].is_sent_start = False
            i+=1
        elif doc[i].text in ["(", "'s"]:
            doc[i].is_sent_start = False
            i+=1
        elif doc[i].text in [".", ")."]:
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False
        i+=1
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("segm", before="parser")
nlp.pipeline

This is my nlp.pipeline.

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x29f4c3ee0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x29f4c3f40>),
 ('segm', <function __main__.set_custom_segmentation(doc)>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x29f8380b0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x29e3ee4c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x29dd1f100>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x29f7f7f40>)]
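One thing that may be worth checking in the custom function above is which tokens the loop never assigns. The following is a minimal sketch that replays the same branching on plain strings, with no spaCy involved. It assumes the hand-written token list matches how the tokenizer splits the text, which is not verified here, and `trace_boundaries` is a hypothetical helper name. It is one reading of the posted loop, not a confirmed cause of the behaviour reported below.

```python
def trace_boundaries(tokens):
    """Replay the posted loop on plain strings; None means 'never assigned'."""
    flags = [None] * len(tokens)
    i = 0
    while i < len(tokens) - 1:
        text = tokens[i]
        if text.lower() in ["eq", "fig", "al", "table", "fig."]:
            flags[i + 1] = False
            i += 1  # extra increment: token i+1 is never visited
        elif text in ["(", "'s"]:
            flags[i] = False
            i += 1  # extra increment: token i+1 is never visited
        elif text in [".", ")."]:
            flags[i + 1] = True
        else:
            flags[i + 1] = False
        i += 1
    return flags


# Assumed tokenization of "... in Fig. 2 . We refer ..." (hand-written).
tokens = ["in", "Fig.", "2", ".", "We", "refer"]
flags = trace_boundaries(tokens)
# flags -> [None, False, False, None, True, False]
for tok, flag in zip(tokens, flags):
    print(f"{tok!r:10} {flag}")
```

In this trace the standalone `.` after `2` keeps the value `None`, because the extra `i += 1` in the first branch skips the iteration that would have assigned it. Any flag left as `None` is not fixed by the component and is open for a later pipeline component to decide.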

How to reproduce the behaviour

doc = nlp("Massive ETGs are summarized in a schematic way in Fig. 2 . ##(this is the sentence to consider)## We refer the reader to fig. 1 of Forbes et al. ( 2011 ) and fig. 10 of Faifer et al. ( 2011 ) for real-world examples of our schematic plot, which show not only the mean gradients but also the individual GC data points. Figure 2.")

for sent in doc.sents:
    print(sent)

This is the current output:

The form of the tokens `Fig. 2 .` produces different outputs for the sentences. Each example below changes only the end of the first sentence, Massive ETGs are summarized in a schematic way in Fig. 2 .

Fig. 21 . (changed 2 . to 21 .)

Fig. 21. (removed the space between 21 and the period)

Fig. 2. (removed the space between 2 and the period)

Fig. 1 . (changed 2 to 1)

Fig. 3 . (changed 2 to 3)

Fig. 4 . (changed 2 to 4)

Fig. 4. (changed 2 to 4 and removed the space)

Fig. 200. (changed 2 to 200 and removed the space)

Fig. 200 . (changed 2 to 200)

There is inconsistent behaviour in the way the sentence boundaries are categorised here. I have other examples as well so if needed I can share them here.

Any help in understanding this would be appreciated.

Your Environment

spaCy version: 3.6.0
Platform: macOS-14.3.1-arm64-arm-64bit
Python version: 3.10.12
Pipelines: en_core_web_lg (3.6.0), en_core_web_sm (3.6.0)


danieldk · 2024-02-22T19:30:25Z

One issue you might be running into is that the dependency parser is responsible for finding and setting sentence boundaries in the pretrained spaCy pipelines:

https://spacy.io/api/dependencyparser#assigned-attributes

If you have your own pipe that sets boundaries, you may want to run this pipe after the dependency parser for this reason. Could you try to see if this improves things for you?
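To see which components in a pipeline declare that they write sentence boundaries, `Language.analyze_pipes` can be queried. The following is a minimal sketch under some assumptions: it uses a blank English pipeline with an untrained parser so that no pretrained model is needed (only component metadata is inspected, nothing is run on text), and `segm_demo` is a hypothetical stand-in for the custom component. A component only appears in this report if it is registered with `assigns=[...]`; the `segm` component as posted declares no metadata.

```python
import spacy
from spacy.language import Language


# Hypothetical stand-in for the custom component. `assigns` only declares
# metadata; it does not change what the function does.
@Language.component("segm_demo", assigns=["token.is_sent_start"])
def segm_demo(doc):
    return doc


nlp = spacy.blank("en")      # assumption: blank pipeline, no model download
nlp.add_pipe("segm_demo")
nlp.add_pipe("parser")       # untrained; only its metadata is inspected

analysis = nlp.analyze_pipes()
# Names of the components that declare they assign token.is_sent_start.
writers = analysis["attrs"]["token.is_sent_start"]["assigns"]
print(writers)
```

If more than one component is listed, each of them may write `token.is_sent_start`, so the order in which they run matters.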

DhruvSondhi · 2024-02-22T20:49:13Z

Hello @danieldk,

Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the parser in the nlp.pipeline. But I am facing an error.

ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

I don't think this will work, because the parsing done here interferes with the custom segmentation boundaries that I need to set for certain edge cases such as Fig., e.g., etc.

I saw a similar issue here: #3569.

koder-ua · 2024-02-24T21:03:56Z

I got a bit different, but similar issue

spacy == 3.7.4, mac


In [69]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one").sents))
Out[69]: 1   <<<<<<<<<<<<<<<<<<<<<< WRONG

In [70]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one.").sents))
Out[70]: 3

In [71]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one").sents))
Out[71]: 3

In [72]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one.").sents))
Out[72]: 3

danieldk · 2024-02-25T19:46:16Z

> I got a bit different, but similar issue

This is a different question. Could you open a topic on the discussion forum?

danieldk · 2024-02-25T19:57:46Z

> Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the parser in the nlp.pipeline. But I am facing an error.

Ah right, sorry, I overlooked that. The issue with changing the boundaries after parsing is that it could result in dependency relations that cross sentence boundaries, which is one of the reasons why we disallow this. We'll have to look into this more deeply, because the parser should in principle respect boundaries that were set earlier. Also see

#11107
#7716

for more background.

danieldk added the feat / parser and more-info-needed labels Feb 22, 2024

github-actions bot removed the more-info-needed label Feb 29, 2024
