Input text must be sanitized before processing #23

Open
F1uctus opened this issue May 6, 2022 · 0 comments
Labels
good first issue (Good for newcomers), wontfix (This will not be worked on)

Comments

@F1uctus
Owner

F1uctus commented May 6, 2022

See also: explosion/spaCy#7735

Any text passed to ttc should be sanitized first.
Please collapse repeated spaces and trim whitespace before and after each line break. Do not replace line breaks with spaces: the replica extraction algorithm relies on them.
The underlying library, spaCy, does not handle excessive whitespace well because, for example, the training data for its dependency parser does not contain it. As a result, predictions may be less accurate.
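
The sanitization described above could be sketched roughly as follows. This is only an illustration, not code from ttc: the function name `sanitize` is hypothetical, and it assumes that "whitespace" means spaces and tabs and that lines are separated by `\n`.

```python
import re


def sanitize(text: str) -> str:
    """Collapse runs of spaces/tabs and trim every line, keeping line breaks.

    Hypothetical helper, not part of ttc or spaCy.
    """
    cleaned_lines = []
    for line in text.split("\n"):
        # Collapse each run of spaces/tabs into a single space.
        line = re.sub(r"[ \t]+", " ", line)
        # Trim whitespace at both ends of the line (this also drops a
        # trailing "\r" left over from Windows line endings).
        cleaned_lines.append(line.strip())
    # Rejoin with the original line breaks; they are never replaced by spaces.
    return "\n".join(cleaned_lines)
```

The result of `sanitize(raw_text)` is what would then be handed to ttc in place of the raw input.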

If your use case requires the text to be left as-is, you can still sanitize the input and then map the output spans back onto the original text using accumulated index offsets.
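
One possible way to keep such offsets is to record, for every character that survives sanitization, its index in the original text. The sketch below is an assumption about how this could be done, not the method used by ttc; `sanitize_with_offsets` and `to_original_span` are hypothetical names, spans are taken to be half-open character ranges, and lines are assumed to end with `\n`.

```python
from typing import List, Tuple


def sanitize_with_offsets(text: str) -> Tuple[str, List[int]]:
    """Sanitize `text` and return, for each kept character, its original index.

    Hypothetical helper, not part of ttc or spaCy.
    """
    chars: List[str] = []
    offsets: List[int] = []
    pending = None  # original index where the current space/tab run started
    line_has_content = False
    for i, ch in enumerate(text):
        if ch in " \t":
            if pending is None:
                pending = i
        elif ch == "\n":
            # Whitespace right before a line break is dropped.
            chars.append(ch)
            offsets.append(i)
            pending = None
            line_has_content = False
        else:
            # Emit one space for an inner run; leading runs are dropped.
            if pending is not None and line_has_content:
                chars.append(" ")
                offsets.append(pending)
            pending = None
            chars.append(ch)
            offsets.append(i)
            line_has_content = True
    return "".join(chars), offsets


def to_original_span(offsets: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map a half-open [start, end) span of the sanitized text to the original."""
    if not 0 <= start < end <= len(offsets):
        raise ValueError("span is empty or out of range")
    return offsets[start], offsets[end - 1] + 1


original = "  Hello,   world!  \nBye"
clean, offsets = sanitize_with_offsets(original)
start = clean.index("world")
orig_start, orig_end = to_original_span(offsets, start, start + len("world"))
print(original[orig_start:orig_end])
```

If the output spans are spaCy `Span` objects, their `start_char` and `end_char` attributes give the character range in the sanitized text that would be passed to `to_original_span`.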

@F1uctus F1uctus added the good first issue (Good for newcomers) and wontfix (This will not be worked on) labels May 6, 2022
@F1uctus F1uctus pinned this issue Mar 26, 2023