Valency: better sentence splitting #958
Labels: backend, enhancement
Current sentence splitting in valency data extraction is rather ad hoc: it uses a very simple algorithm by Pavel Grashchenkov based on a list of possible sentence-ending punctuation tokens `{'.', '!', '?', '...', '?!', '...»'}`, see https://github.com/ispras/lingvodoc/blob/2c121263ffe26773bcc34aca1ed6e12c68939060/lingvodoc/scripts/valency.py#L17. We should consider upgrading to a sentence splitter closer to the state of the art, e.g. NLTK's, for better overall quality of sentence splitting.
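For reference, the current approach amounts to breaking a token stream after any token from the punctuation set above. A minimal sketch (the function name and token-list interface are illustrative, not the actual `valency.py` implementation):

```python
# Sentence-ending punctuation tokens, as listed in lingvodoc/scripts/valency.py.
SENTENCE_ENDING = {'.', '!', '?', '...', '?!', '...»'}

def split_sentences(tokens):
    """Group a flat token list into sentences, starting a new sentence
    after any token found in SENTENCE_ENDING."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_ENDING:
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens without a final punctuation mark still form a sentence.
        sentences.append(current)
    return sentences

# A weakness of this approach: abbreviations like 'Mr' + '.' trigger a
# spurious sentence break, which a trained splitter such as NLTK's Punkt
# tokenizer (nltk.tokenize.sent_tokenize) is designed to avoid.
```

Note how `['Mr', '.', 'Smith', 'left', '.']` would be split into two "sentences" at the abbreviation period; handling such cases is the main quality gain from a statistical splitter.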
However, if the resulting sentence structure can change, we would need to carefully and accurately enhance the valency data updating procedures (see #775), so that's a point to keep in mind.