Valency: better sentence splitting #958
Labels: backend, enhancement
Current sentence splitting in valency data extraction is rather ad hoc: it uses a very simple algorithm by Pavel Grashchenkov based on a list of possible sentence-ending punctuation tokens `{'.', '!', '?', '...', '?!', '...»'}`, see https://github.com/ispras/lingvodoc/blob/2c121263ffe26773bcc34aca1ed6e12c68939060/lingvodoc/scripts/valency.py#L17. We should consider upgrading to a sentence splitter closer to the state of the art, e.g. NLTK's, for better overall quality of sentence splitting.
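For reference, the current approach amounts to breaking a token stream after any token from the punctuation set above. A minimal sketch (the function name and token-list interface are illustrative, not the actual `valency.py` implementation):

```python
# Sentence-ending punctuation tokens, as listed in lingvodoc/scripts/valency.py.
SENTENCE_ENDING = {'.', '!', '?', '...', '?!', '...»'}

def split_sentences(tokens):
    """Group a flat token list into sentences, starting a new sentence
    after any token found in SENTENCE_ENDING."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_ENDING:
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens without a final punctuation mark still form a sentence.
        sentences.append(current)
    return sentences

# A weakness of this approach: abbreviations like 'Mr' + '.' trigger a
# spurious sentence break, which a trained splitter such as NLTK's Punkt
# tokenizer (nltk.tokenize.sent_tokenize) is designed to avoid.
```

Note how `['Mr', '.', 'Smith', 'left', '.']` would be split into two "sentences" at the abbreviation period; handling such cases is the main quality gain from a statistical splitter.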
However, if the resulting sentence structure can change, we would need to carefully and accurately enhance the valency data updating procedures (see #775), so that's a point to keep in mind.