Input text must be sanitized before processing #23

F1uctus · 2022-05-06T01:46:19Z

Any text to be processed by ttc should be sanitized first.
Please collapse repeated spaces into one and trim the whitespace before and after each line break. Do not replace line breaks with spaces, as the replica extraction algorithm relies on them.
Excess whitespace is not handled well by the underlying library, spaCy. For example, the training dataset for its dependency parser does not contain such whitespace, so predictions may be less accurate.
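
A minimal sketch of the sanitization described above, in plain Python. The function name `sanitize` is hypothetical and not part of ttc; it assumes `\n` line breaks and treats tabs and other non-newline whitespace the same as spaces.

```python
import re

# Matches a run of whitespace that does not include a line break.
_INLINE_WS = re.compile(r"[^\S\n]+")


def sanitize(text: str) -> str:
    """Collapse repeated whitespace and trim each line, keeping line breaks."""
    cleaned_lines = []
    for line in text.split("\n"):
        # Collapse runs of spaces/tabs into one space, then trim both ends.
        cleaned_lines.append(_INLINE_WS.sub(" ", line).strip())
    # Re-join with the original line breaks so they are never replaced by spaces.
    return "\n".join(cleaned_lines)
```

The result can then be passed to ttc in place of the raw text. Empty lines are kept as empty lines, so the number of line breaks does not change.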

If your use case requires the text to be left as is, you can still sanitize the input and then map the output spans back onto the original text by means of accumulated indices.
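
One possible way to do that mapping, sketched under assumptions: the sanitizer records, for every character it keeps, the index that character had in the original text. The names `sanitize_with_offsets` and `map_span` are hypothetical helpers, not part of ttc or spaCy, and spans are assumed to be non-empty `(start, end)` character offsets into the sanitized text.

```python
def sanitize_with_offsets(text: str) -> tuple[str, list[int]]:
    """Return the sanitized text plus the original index of each kept character."""
    out_chars: list[str] = []
    offsets: list[int] = []
    pending_ws = None  # original index where the current whitespace run began

    for i, ch in enumerate(text):
        if ch == "\n":
            # Whitespace right before a line break is dropped.
            pending_ws = None
            out_chars.append(ch)
            offsets.append(i)
        elif ch.isspace():
            if pending_ws is None:
                pending_ws = i
        else:
            # Emit one space for the run, unless it sits at the start of a line.
            if pending_ws is not None and out_chars and out_chars[-1] != "\n":
                out_chars.append(" ")
                offsets.append(pending_ws)
            pending_ws = None
            out_chars.append(ch)
            offsets.append(i)

    # A whitespace run at the very end of the text is never emitted.
    return "".join(out_chars), offsets


def map_span(offsets: list[int], start: int, end: int) -> tuple[int, int]:
    """Map a non-empty [start, end) span in sanitized text to the original text."""
    return offsets[start], offsets[end - 1] + 1
```

For example, if a span covers `How are` in the sanitized text, `map_span` returns the bounds of `How  are` (with both spaces) in the original.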

F1uctus added the "good first issue" and "wontfix" labels May 6, 2022

F1uctus pinned this issue Mar 26, 2023
