remove_words! fails for long terms & terms with punctuation #74

Open
enkiv2 opened this issue Apr 6, 2018 · 6 comments

enkiv2 commented Apr 6, 2018

Because remove_words! uses regex matching even for string input, it fails on terms that are actually present in the document if those terms exceed the maximum pattern size accepted by PCRE, or if they contain punctuation that is meaningful in regex syntax. The resulting error message doesn't identify the offending pattern, and the failure aborts remove_words! entirely.
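For illustration (not from the original report; exact behaviour depends on the TextAnalysis.jl version), a minimal reproduction of the two failure modes might look like this:

```julia
using TextAnalysis

doc = StringDocument("the term foo(bar) appears here, and so does :-)")

# Punctuation that happens to be regex syntax: the parentheses become a
# capture group, so the compiled pattern matches "foobar" and the literal
# occurrence of "foo(bar)" is left in place.
remove_words!(doc, ["foo(bar)"])

# Unbalanced punctuation such as ":-)" may not even compile as a pattern,
# so the whole call can abort with a PCRE error instead of removing anything.
remove_words!(doc, [":-)"])

# A single very long term can exceed PCRE's maximum pattern size, with the
# same result: an error that doesn't name the offending term, and no removal.
remove_words!(doc, [repeat("a", 10^6)])
```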

The same problem occurs in remove_sparse_terms! and remove_frequent_terms!, since these also boil down to a call to remove_pattern.

Would it be possible to use plain string-literal substitution when an array of Strings is passed, and only fall back to regex matching when the items are actually typed as regular expressions?
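The distinction being asked for here already exists in base Julia's `replace`: a String pattern is matched literally, while a Regex pattern goes through PCRE, so metacharacters and pattern-size limits only bite in the second case. A minimal illustration:

```julia
s = "see foo(bar) for details"

# Literal substitution: the parentheses are just characters.
replace(s, "foo(bar)" => "")            # "see  for details"

# Regex substitution: the same characters are now pattern syntax, so this
# compiles to a pattern matching "foobar" and leaves the text unchanged.
replace(s, Regex("foo(bar)") => "")     # "see foo(bar) for details"
```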

aviks (Member) commented Apr 8, 2018

Might be a good idea. Will need some thought on how to deprecate the existing behaviour. Care to do a PR?

enkiv2 (Author) commented Apr 11, 2018

Well, I have a naive solution that works for StringDocument; unfortunately, it has major performance problems. Should I submit a PR or wait until I've figured out how to get reasonable performance out of it?
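(Not the code from the eventual PR, just a sketch of what a naive literal-substitution approach for a StringDocument could look like, and why it's slow: each word triggers a full pass over, and a fresh copy of, the document text. Function name hypothetical.)

```julia
using TextAnalysis

# Hypothetical naive version: literal (non-regex) removal, but one full
# scan and one new String allocation per word in the list.
function remove_words_naive!(doc::StringDocument, words::Vector{String})
    txt = text(doc)
    for w in words
        txt = replace(txt, w => "")   # String pattern => literal match, new copy each time
    end
    text!(doc, txt)
    return doc
end
```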

aviks (Member) commented Jun 11, 2018

I'd say submit a PR. We can figure out performance later. Slow code is better than no code.

enkiv2 (Author) commented Jun 11, 2018

PR: #76

Because of string-copying overhead, it's unusably slow on inputs large enough that the regex size limit would matter in the first place. However, in the absence of reliable regex escaping, it does have reliability benefits.

Ayushk4 (Member) commented May 14, 2019

I think a better way to solve this issue could be to use a TokenBuffer, which is currently used by the tokenizers in WordTokenizers.jl. (#143)
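For context, the token-level idea is: tokenize once, filter tokens by exact equality (no regex is compiled at all), and rebuild the text. A rough sketch of that idea using WordTokenizers.tokenize rather than the TokenBuffer machinery itself (function name hypothetical):

```julia
using TextAnalysis, WordTokenizers

# Single pass over tokens; neither term length nor punctuation inside a
# term can break this, since nothing is compiled into a pattern.
# (Rejoining with spaces does lose the original whitespace layout.)
function remove_words_tokenwise!(doc::StringDocument, words::Vector{String})
    stop = Set(words)
    kept = [tok for tok in tokenize(text(doc)) if !(tok in stop)]
    text!(doc, join(kept, " "))
    return doc
end
```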

enkiv2 (Author) commented May 14, 2019 via email

Ayushk4 mentioned this issue Jun 23, 2019