Spacy has inconsistency when dividing sentences #13346
Comments
One issue you might be running into is that the dependency parser is responsible for finding and setting sentence boundaries in the pretrained spaCy pipelines (https://spacy.io/api/dependencyparser#assigned-attributes). If you have your own pipe that sets boundaries, you may want to run this pipe after the dependency parser for this reason. Could you try to see if this improves things for you?
Hello @danieldk, Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the parser, but this raises: ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state. I don't think this will work, as the parsing done here interferes with the custom segmentation boundaries that I need to set for certain edge cases. I also saw another, similar issue here: #3569.
I ran into a slightly different but similar issue with spacy == 3.7.4 on a Mac.
This is a different question. Could you open a topic on the discussion forum?
Ah right, sorry, I overlooked that. The issue with changing the boundaries after parsing is that it could result in dependency relations that cross sentence boundaries, which is one of the reasons why we disallow this. We'll have to look into this more deeply, because the parser should in principle respect boundaries that were set earlier. Also see for more background.
Hello,
I am using Spacy to divide sentences after joining a set of words with whitespaces. But to my dismay, this process has unpredictable and unexplainable behaviour. I have a custom segmentation function where I am trying to set custom sentence boundaries (i.e. is_sent_start).
Custom Function:
This is my nlp.pipeline.
How to reproduce the behaviour
This is the current output:
The form of the tokens here, "Fig. 2 .", produces different outputs for the sentences. Please see the following examples.
Massive ETGs are summarized in a schematic way in Fig. 21 . (changed 2 . to 21 . )
Massive ETGs are summarized in a schematic way in Fig. 21. (removed space b/w 21 & period)
Massive ETGs are summarized in a schematic way in Fig. 2. (removed space b/w 2 & period)
Massive ETGs are summarized in a schematic way in Fig. 1 . (changed 2 to 1)
Massive ETGs are summarized in a schematic way in Fig. 3 . (changed 2 to 3)
Massive ETGs are summarized in a schematic way in Fig. 4 . (changed 2 to 4)
Massive ETGs are summarized in a schematic way in Fig. 4. (changed 2 to 4 and removed whitespace)
Massive ETGs are summarized in a schematic way in Fig. 200. (changed 2 to 200 and removed space)
Massive ETGs are summarized in a schematic way in Fig. 200 . (changed 2 to 200)
There is inconsistent behaviour in the way the sentence boundaries are categorised here. I have other examples as well, so if needed I can share them here.
Any help in understanding this would be appreciated.
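For anyone comparing the variants above, a deterministic rule-based baseline may make it easier to see which splits come from the statistical parser and which from the input form. The following is only a minimal plain-Python sketch: it is not spaCy's API and not the custom function from this issue (which is not shown here). The names sent_start_flags, split_sentences and ABBREVIATIONS, as well as the abbreviation list itself, are assumptions made for illustration.

```python
# Hypothetical rule-based baseline over whitespace-separated tokens.
# Not spaCy's API and not the custom function from this issue.

# Assumed list of abbreviations whose trailing period should not end a sentence.
ABBREVIATIONS = {"Fig.", "Figs.", "Eq.", "Sec."}


def sent_start_flags(tokens):
    """Return one boolean per token: True if that token starts a sentence."""
    flags = []
    for i, _tok in enumerate(tokens):
        if i == 0:
            # The first token always starts a sentence.
            flags.append(True)
            continue
        prev = tokens[i - 1]
        # A new sentence starts after a token ending in '.', '!' or '?',
        # unless that previous token is a known abbreviation such as "Fig.".
        ends_sentence = prev.endswith((".", "!", "?")) and prev not in ABBREVIATIONS
        flags.append(ends_sentence)
    return flags


def split_sentences(text):
    """Split text on whitespace and group the tokens by the flags above."""
    tokens = text.split()
    sentences = []
    for tok, starts in zip(tokens, sent_start_flags(tokens)):
        if starts:
            sentences.append([tok])
        else:
            sentences[-1].append(tok)
    return [" ".join(sentence) for sentence in sentences]


if __name__ == "__main__":
    for variant in ("Fig. 2 .", "Fig. 2.", "Fig. 21 .", "Fig. 200."):
        text = f"Massive ETGs are summarized in a schematic way in {variant} Next sentence ."
        print(split_sentences(text))
```

Because the rule only looks at the previous token, "Fig. 2 ." and "Fig. 2." are treated the same way, so any difference seen with the pretrained pipeline on these inputs would come from the pipeline and not from the token form.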
Your Environment