Clause segmentation #6557

9j7axvsLuF · 2020-12-11T20:48:58Z

Hello,

I have been using spaCy for a while and love it. I particularly appreciate the built-in pipeline for sentence segmentation, which I use regularly to produce sentence embeddings with a Transformer language model.

However, on average, the length of a sentence is inversely correlated with the quality of its embedding. For this reason, I want to break down long or convoluted sentences into simpler clauses that could each be given their own embedding.

spaCy does not offer a built-in pipeline component for clause segmentation. One solution would be to implement this from scratch myself using dependency parsing, but this looks like a non-trivial problem: there are most likely edge cases that would not be adequately handled by a quick-and-dirty script relying on this method.

(A more sophisticated solution would be to use some kind of seq2seq model that could take a sentence like "Matt made eggs and had breakfast" and output two reconstructed sentences like "Matt made eggs" and "Matt had breakfast". But I am not aware of any pretrained model that would allow me to do that.)

Would you have any recommendation to address this issue with spaCy? Generally, I feel like it would be a great benefit to the NLP community if spaCy could handle both clause and sentence segmentation in the future.

svlandeg (Maintainer) · 2020-12-17T21:04:43Z

FYI - we converted this question to a thread on the discussion board.

We're aiming to keep the Issue tracker focused mainly on bug reports, while the new GH Discussion board can be a place where the community can come together, help each other, discuss best practices, and so on.

1 reply

9j7axvsLuF Dec 17, 2020
Author

Thanks a lot!

9j7axvsLuF (Author) · 2020-12-17T21:08:08Z

I found an interesting rule-based model for clause segmentation (sentence splitting) in this paper. Unfortunately the code is not shared by IBM, but here is the description of the model from Appendix A:

Given a complex sentence, the model runs the following processes once each:

Wh Handling

Using semantic role labeling, the model looks for a Relational Argument (R-ARG), and the Subject Argument (asserted to be the ARG preceding the R-ARG). Then, a split is made with the Relational Argument replaced by the Subject Argument.

Conjunction Handling

The model looks for the word “and”. Using semantic role labeling, if the word following “and” is an argument (ARG), assert that “and” is followed by a sentence, and a split is made. Or, if the word following “and” is a verb (V), the model asserts the Subject Argument to be the ARG preceding the V; a split is made with “and” replaced by the Subject Argument.

Insertion Handling

Using dependency parsing, the model looks for a node with type participle modifier, relative clause modifier, prepositional modifier, adjective modifier, or appositional modifier. The clause with the node as the root is extracted, prepended with the subject, and split as a new simple sentence. The rest of the original complex sentence is split as another new simple sentence.
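
To make the Conjunction Handling step more concrete, here is a rough sketch of how I imagine it could look. Note that this is not the paper's method: spaCy has no built-in semantic role labeler, so I swapped in the dependency parse instead (`nsubj`, `conj`, `cc` labels). The `Tok` class and the function names are my own inventions for illustration, and the example parse is written by hand to mirror what I would expect an English spaCy pipeline to produce, so treat it as an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass
class Tok:
    text: str
    dep: str   # dependency label, e.g. "nsubj", "conj", "cc", "ROOT"
    head: int  # index of the head token; the root points to itself


def from_spacy(doc) -> List[Tok]:
    """Convert a single-sentence spaCy Doc into Tok records."""
    return [Tok(t.text, t.dep_, t.head.i) for t in doc]


def children(toks: List[Tok], i: int) -> List[int]:
    return [j for j, t in enumerate(toks) if t.head == i and j != i]


def subtree(toks: List[Tok], root: int, exclude: FrozenSet[int] = frozenset()) -> Set[int]:
    """Indices of `root` and its descendants, not descending into `exclude`."""
    out = {root}
    for c in children(toks, root):
        if c not in exclude:
            out |= subtree(toks, c, exclude)
    return out


def split_conjoined_verbs(toks: List[Tok]) -> List[str]:
    """Split at conjunctions attached to the root, copying the subject over."""
    root = next(i for i, t in enumerate(toks) if t.dep == "ROOT")
    clauses: List[str] = []
    subject: Set[int] = set()
    queue = [root]
    while queue:
        verb = queue.pop(0)
        kids = children(toks, verb)
        # Only conjunction links hanging directly off this verb are cut,
        # so a coordinated object like "eggs and bacon" stays intact.
        links = frozenset(k for k in kids if toks[k].dep in ("conj", "cc"))
        own_subj = [k for k in kids if toks[k].dep == "nsubj"]
        if own_subj:
            subject = subtree(toks, own_subj[0])
        clause = subtree(toks, verb, exclude=links) | subject
        clauses.append(" ".join(toks[i].text for i in sorted(clause)))
        queue.extend(k for k in sorted(links) if toks[k].dep == "conj")
    return clauses


# Hand-written parse of "Matt made eggs and had breakfast" (assumed labels).
example = [
    Tok("Matt", "nsubj", 1), Tok("made", "ROOT", 1), Tok("eggs", "dobj", 1),
    Tok("and", "cc", 1), Tok("had", "conj", 1), Tok("breakfast", "dobj", 4),
]
print(split_conjoined_verbs(example))  # ['Matt made eggs', 'Matt had breakfast']
```

This only covers "and" followed by a verb, where the subject is shared. It assumes the root is a verb, handles one sentence at a time, and does nothing for the Wh Handling or Insertion Handling steps.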

Would anyone be able to help me implement this with spaCy?

1 reply

thomashacker Jul 15, 2021

Hey 😄
I'm not sure if you're still interested, but I'm working on something similar for clause segmentation. It's a very simple approach that only focuses on splitting compound sentences that contain two or more contextual statements.

For example: "This product has helped me recover from a back injury but made me nauseous"
This sentence includes two different statements about two different conditions.

My approach was to first detect verb chunks ("has helped","made") and check if a conjunction lies between them.
If so, split the sentence at the respective index.

At the end I receive these two clauses:

"This product has helped me recover from a back injury"
"But made me nauseous" (You could also delete the conjunction here)
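
A minimal sketch of that idea, under some assumptions: it works on (text, POS) pairs using the coarse tags spaCy exposes as `token.pos_`, treats `VERB` and `AUX` as verb tokens, and only splits at a `CCONJ` that has a verb on both sides. The function names are made up for this example, and the tags in the sample are written by hand rather than taken from a real pipeline run.

```python
from typing import List, Tuple

VERB_TAGS = {"VERB", "AUX"}


def tag_with_spacy(nlp, text: str) -> List[Tuple[str, str]]:
    """Get (text, coarse POS) pairs from an already loaded spaCy pipeline."""
    return [(t.text, t.pos_) for t in nlp(text)]


def split_on_conjunctions(tagged: List[Tuple[str, str]]) -> List[str]:
    """Split before each conjunction that sits between two verb chunks."""
    clauses: List[str] = []
    start = 0
    conj_idx = [i for i, (_, pos) in enumerate(tagged) if pos == "CCONJ"]
    for n, i in enumerate(conj_idx):
        # Look for a verb between this conjunction and the next one (or the end).
        end = conj_idx[n + 1] if n + 1 < len(conj_idx) else len(tagged)
        verb_before = any(pos in VERB_TAGS for _, pos in tagged[start:i])
        verb_after = any(pos in VERB_TAGS for _, pos in tagged[i + 1:end])
        if verb_before and verb_after:
            clauses.append(" ".join(text for text, _ in tagged[start:i]))
            start = i  # the conjunction stays at the head of the next clause
    clauses.append(" ".join(text for text, _ in tagged[start:]))
    return clauses


# Hand-tagged version of the example sentence (tags are assumed).
sample = [
    ("This", "DET"), ("product", "NOUN"), ("has", "AUX"), ("helped", "VERB"),
    ("me", "PRON"), ("recover", "VERB"), ("from", "ADP"), ("a", "DET"),
    ("back", "NOUN"), ("injury", "NOUN"), ("but", "CCONJ"), ("made", "VERB"),
    ("me", "PRON"), ("nauseous", "ADJ"),
]
print(split_on_conjunctions(sample))
# ['This product has helped me recover from a back injury', 'but made me nauseous']
```

Joining with spaces loses the original whitespace and punctuation spacing. With real spaCy tokens you could slice the `Doc` at the same indices and use the span text instead.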

So, if you're still interested, we could definitely work on an approach that might suit your use case. 🎉

BramVanroy · 2021-07-15T13:36:30Z

Why not use constituency parsing instead of clauses? With your end goal in mind, that seems a reasonable alternative. IIRC the Berkeley Neural Parser (benepar) has a plugin for spaCy as well, so it should fit into your workflow easily.

1 reply

thomashacker Jul 15, 2021

That's a great plug-in, and it also seems to work with the newest version of spaCy (3.1).
Here's the link: https://spacy.io/universe/project/self-attentive-parser
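
For the clause use case, one option is to pull the clause-level constituents out of the bracketed parse. The sketch below assumes a Penn Treebank-style parse string, which as far as I know is what benepar exposes as `sent._.parse_string`. It does not call benepar itself; the parse in the example is written by hand, and the set of clause labels is my own choice.

```python
from typing import List, Set, Tuple

# Penn Treebank clause-level labels (an assumption about what counts as a clause).
CLAUSE_LABELS = {"S", "SBAR", "SINV", "SQ", "SBARQ"}


def clause_spans(parse: str, labels: Set[str] = CLAUSE_LABELS) -> List[str]:
    """Return the text of every constituent whose label is in `labels`."""
    tokens = parse.replace("(", " ( ").replace(")", " ) ").split()
    stack: List[Tuple[str, int]] = []  # (label, index of first word covered)
    words: List[str] = []
    found: List[Tuple[int, int]] = []  # (start, end) word offsets
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token right after "(" is the constituent label.
            stack.append((tokens[i + 1], len(words)))
            i += 2
        elif tok == ")":
            label, start = stack.pop()
            if label in labels:
                found.append((start, len(words)))
            i += 1
        else:
            words.append(tok)
            i += 1
    # Outermost clauses first, then the ones nested inside them.
    found.sort(key=lambda span: (span[0], -span[1]))
    return [" ".join(words[a:b]) for a, b in found]


# Hand-written parse of "I left because it rained" (not real parser output).
parse = (
    "(S (NP (PRP I)) (VP (VBD left) "
    "(SBAR (IN because) (S (NP (PRP it)) (VP (VBD rained))))))"
)
print(clause_spans(parse))
# ['I left because it rained', 'because it rained', 'it rained']
```

The spans are nested, so for embeddings you would still have to decide which level to keep, for example only the innermost `S` nodes.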
