Clause segmentation #6557
-
FYI - we converted this question to a thread on the discussion board. We're aiming to keep the Issue tracker focused mainly on bug reports, while the new GH Discussion board can be a place where the community can come together, help each other, discuss best practices, and so on.
-
I found an interesting rule-based model for clause segmentation (sentence splitting) in this paper. Unfortunately the code is not shared by IBM, but here is the description of the model from Appendix A:

Given a complex sentence, the model runs the following processes once each:

Wh Handling: Using semantic role labeling, the model looks for a Relational Argument (R-ARG), and the Subject Argument (asserted to be the ARG preceding the R-ARG). Then, a split is made with the Relational Argument replaced by the Subject Argument.

Conjunction Handling: The model looks for the word “and”. Using semantic role labeling, if the word following “and” is an argument (ARG), assert that “and” is followed by a sentence, and a split is made. Or, if the word following “and” is a verb (V), the model asserts the Subject Argument to be the ARG preceding the V; a split is made with “and” replaced by the Subject Argument.

Insertion Handling: Using dependency parsing, the model looks for a node with type participle modifier, relative clause modifier, prepositional modifier, adjective modifier, or appositional modifier. The clause with the node as the root is extracted, prepended with the subject, and split as a new simple sentence. The rest of the original complex sentence is split as another new simple sentence.

Would anyone be able to help me implement this with spaCy?
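As a possible starting point, here is a minimal sketch of the Insertion Handling step only. It is my own reading of the description, not the paper's code. To keep it runnable without a model, it works on hand-annotated tokens; with spaCy the same fields would come from `token.text`, `token.dep_` and `token.head.i`. The `Tok` class, the helper names, the example annotation, and the mapping of the five modifier types to the labels `acl`, `relcl`, `prep`, `amod` and `appos` are all assumptions.

```python
from dataclasses import dataclass

# Assumed mapping of the modifier types in the description to dependency labels.
INSERTION_DEPS = {"acl", "relcl", "prep", "amod", "appos"}


@dataclass
class Tok:
    text: str
    dep: str   # dependency label
    head: int  # index of the head token; the root points to itself


def descendants(tokens, root):
    """Return the set of indices of `root` and everything below it."""
    keep = {root}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if i not in keep and tok.head in keep:
                keep.add(i)
                changed = True
    return keep


def words(tokens, indices):
    """Join token texts in the given order, dropping punctuation."""
    return " ".join(tokens[i].text for i in indices if tokens[i].dep != "punct")


def split_insertion(tokens):
    """Split off the first insertion as '<subject> <insertion>' plus the rest."""
    root = next(i for i, t in enumerate(tokens) if t.head == i)
    subject = next(
        (i for i, t in enumerate(tokens) if t.dep == "nsubj" and t.head == root),
        None,
    )
    node = next((i for i, t in enumerate(tokens) if t.dep in INSERTION_DEPS), None)
    everything = range(len(tokens))
    if subject is None or node is None:
        return [words(tokens, everything)]
    clause = descendants(tokens, node)
    # The subject phrase without the insertion itself.
    subject_span = descendants(tokens, subject) - clause
    inserted = sorted(subject_span) + sorted(clause)
    rest = [i for i in everything if i not in clause]
    return [words(tokens, inserted), words(tokens, rest)]


# Hand-annotated parse of "Matt, living in Paris, made eggs" (assumed labels).
EXAMPLE = [
    Tok("Matt", "nsubj", 6), Tok(",", "punct", 0), Tok("living", "acl", 0),
    Tok("in", "prep", 2), Tok("Paris", "pobj", 3), Tok(",", "punct", 0),
    Tok("made", "ROOT", 6), Tok("eggs", "dobj", 6),
]

print(split_insertion(EXAMPLE))
```

Following the description literally gives "Matt living in Paris" and "Matt made eggs", so the extracted clause is not a grammatical sentence on its own. The description does not say how (or whether) that is repaired, and the sketch does not try to.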
-
Why not use constituency parsing instead of clauses? With your end goal in mind, that seems a reasonable alternative. IIRC the Berkeley Neural Parser (benepar) has a plugin for spaCy as well, so it should fit into your workflow easily.
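To illustrate the idea, here is a rough sketch of pulling clause-level constituents out of a bracketed constituency parse. As far as I know, benepar's spaCy integration exposes such a string as `sent._.parse_string`, but to stay self-contained the sketch uses a hand-written parse instead of running the parser, so the tree itself is an assumption, as are the helper names `parse_tree`, `leaves` and `clauses`.

```python
import re


def parse_tree(bracketed):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words under a node, in sentence order."""
    _, children = tree
    out = []
    for child in children:
        out.extend(leaves(child) if isinstance(child, tuple) else [child])
    return out


def clauses(tree, labels=("S",)):
    """Collect the text of every constituent whose label is in `labels`."""
    label, children = tree
    found = [" ".join(leaves(tree))] if label in labels else []
    for child in children:
        if isinstance(child, tuple):
            found.extend(clauses(child, labels))
    return found


# Hand-written parse of "Matt made eggs because he was hungry" (assumed tree).
PARSE = (
    "(S (NP (NNP Matt)) "
    "(VP (VBD made) (NP (NNS eggs)) "
    "(SBAR (IN because) "
    "(S (NP (PRP he)) (VP (VBD was) (ADJP (JJ hungry)))))))"
)

print(clauses(parse_tree(PARSE)))
```

This returns the full sentence and the embedded clause "he was hungry". Turning nested constituents into non-overlapping simple sentences would need extra rules on top.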
-
Hello,
I have been using spaCy for a while and love it. I particularly appreciate the built-in pipeline for sentence segmentation, which I use regularly to produce sentence embeddings with a Transformer language model.
However, on average, the length of sentences is inversely correlated to the quality of their embeddings. For this reason, I want to break down long/convoluted sentences into simpler clauses that could each be given their own embedding.
spaCy does not offer a ready-made pipeline component for clause segmentation. One solution would be to implement it from scratch using dependency parsing, but this looks non-trivial: there are most likely edge cases that would not be adequately handled by a quick-and-dirty script relying on this method.
(A more sophisticated solution would be to use some kind of seq2seq model that could take a sentence like "Matt made eggs and had breakfast" and output two reconstructed sentences like "Matt made eggs" and "Matt had breakfast". But I am not aware of any pretrained model that would allow me to do that.)
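For reference, the quick and dirty dependency-based version of that example might look like the sketch below. It only handles verbs coordinated directly with the root, and it runs on a hand-annotated parse rather than on real parser output. With spaCy the fields would come from `token.text`, `token.dep_` and `token.head.i`. The `Tok` class, the function names, and the assumption that both the `cc` and `conj` tokens attach to the first verb are mine.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    dep: str   # dependency label
    head: int  # index of the head token; the root points to itself


def subtree(tokens, root, skip=frozenset()):
    """Sorted indices of `root` and its descendants, never entering `skip`."""
    keep = {root}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if i not in keep and i not in skip and tok.head in keep:
                keep.add(i)
                changed = True
    return sorted(keep)


def text(tokens, indices):
    return " ".join(tokens[i].text for i in indices)


def split_coordinated_verbs(tokens):
    """Split 'X did A and did B' into 'X did A' and 'X did B'."""
    root = next(i for i, t in enumerate(tokens) if t.head == i)
    conjuncts = [i for i, t in enumerate(tokens) if t.dep == "conj" and t.head == root]
    subject = next(
        (i for i, t in enumerate(tokens) if t.dep == "nsubj" and t.head == root),
        None,
    )
    if not conjuncts or subject is None:
        return [text(tokens, range(len(tokens)))]
    # Leave the conjunction word and the coordinated verbs out of the first clause.
    skip = set(conjuncts) | {
        i for i, t in enumerate(tokens) if t.dep == "cc" and t.head == root
    }
    result = [text(tokens, subtree(tokens, root, skip))]
    subject_words = subtree(tokens, subject)
    for verb in conjuncts:
        part = subtree(tokens, verb)
        has_subject = any(t.dep == "nsubj" and t.head == verb for t in tokens)
        # Copy the root's subject only when the conjunct has none of its own.
        result.append(text(tokens, part if has_subject else subject_words + part))
    return result


# Hand-annotated parse of "Matt made eggs and had breakfast" (assumed labels).
EXAMPLE = [
    Tok("Matt", "nsubj", 1), Tok("made", "ROOT", 1), Tok("eggs", "dobj", 1),
    Tok("and", "cc", 1), Tok("had", "conj", 1), Tok("breakfast", "dobj", 4),
]

print(split_coordinated_verbs(EXAMPLE))
```

Even this small case shows where the edge cases come from: chained conjuncts, coordinated nouns rather than verbs, and shared objects or modifiers are all ignored here.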
Would you have any recommendation to address this issue with spaCy? Generally, I feel like it would be a great benefit to the NLP community if spaCy could handle both clause and sentence segmentation in the future.