PR: To address performance issues with stopword removal #141
PR to address the performance regression reported in #140. This brings the run time down from 940 s to 0.27 s on my test dataset (~3.4 MB).
The primary change is the replacement of the `remove_patterns` method, which forced a modification of the `strip_whitespace` implementation of the `prepare!` method.

I have also modified the test cases to make them consistent: stripping punctuation or stripping a pattern now replaces the matched pattern with a `0`-length string, i.e. it deletes the matched pattern.

This required special handling for whitespace removal, where a run of one or more spaces is replaced with a `blank_space` of length 1, and all leading and trailing spaces are stripped.

I don't think there is a single right way to do certain pre-processing tasks. For example, with `strip_punctuation`, what is the correct way to handle the following strings when removing punctuation?

- `don't mind!` => `don t mind` or `dont mind`
- `Intel(tm) Core i5-3300k` => `Intel tm Core i5 3300k` or `Inteltm Core i53300k`
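
To make the two candidate behaviours concrete, here is a minimal sketch in Python (not the package's Julia code, and not its API). The helper names `collapse_whitespace`, `strip_punct_delete`, and `strip_punct_space` are hypothetical, and treating tabs and newlines the same as spaces is an assumption on my part. The first variant deletes each punctuation character (zero-length replacement), the second replaces it with a space. Both then collapse whitespace runs to a single space and trim the ends.

```python
import re
import string

# Hypothetical helpers for illustration only; none of these names exist in the package.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")
_SPACES = re.compile(r"\s+")  # assumption: any whitespace run, not just ' '


def collapse_whitespace(text: str) -> str:
    """Replace each whitespace run with one space and strip both ends."""
    return _SPACES.sub(" ", text).strip()


def strip_punct_delete(text: str) -> str:
    """Delete punctuation outright (zero-length replacement)."""
    return collapse_whitespace(_PUNCT.sub("", text))


def strip_punct_space(text: str) -> str:
    """Replace punctuation with a space, then normalise whitespace."""
    return collapse_whitespace(_PUNCT.sub(" ", text))


for sample in ("don't mind!", "Intel(tm) Core i5-3300k"):
    print(repr(strip_punct_delete(sample)), "vs", repr(strip_punct_space(sample)))
# 'dont mind' vs 'don t mind'
# 'Inteltm Core i53300k' vs 'Intel tm Core i5 3300k'
```

Deleting keeps contractions as one token (`dont`) but fuses `i5-3300k` into `i53300k`. Replacing with a space keeps `i5` and `3300k` apart but splits `don't` into `don t`. Neither is clearly right for every corpus, which is the trade-off in question here.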