PCRE compilation error: regular expression is too large at offset 592769 #258

Open
jfb-h opened this issue Aug 12, 2021 · 1 comment

jfb-h commented Aug 12, 2021

Upon trying to remove sparse terms from a corpus via

remove_sparse_terms!(corp, .05)

I run into the following error message:

PCRE compilation error: regular expression is too large at offset 592769

    error(::String)@error.jl:33
    compile(::String, ::UInt32)@pcre.jl:128
    compile(::Regex)@regex.jl:79
    Regex(::String, ::UInt32, ::UInt32)@regex.jl:44
    Regex@regex.jl:67[inlined]
    mk_regex(::String)@preprocessing.jl:31
    _combine_regex(::Set{AbstractString})@preprocessing.jl:547
    _build_regex(::Languages.English, ::UInt32, ::Set{AbstractString}, ::Set{AbstractString})@preprocessing.jl:542
    var"#prepare!#14"(::Set{AbstractString}, ::Set{AbstractString}, ::typeof(TextAnalysis.prepare!), ::TextAnalysis.Corpus{TextAnalysis.StringDocument{String}}, ::UInt32)@preprocessing.jl:414
    remove_words!@preprocessing.jl:227[inlined]
    remove_sparse_terms!(::TextAnalysis.Corpus{TextAnalysis.StringDocument{String}}, ::Float64)@preprocessing.jl:341
    top-level scope@Local: 18

Is this a bug, or does it just mean there is something wrong with one of the documents? The latter is possible, since I'm dealing with patents, which can get pretty messy.

I'm on Julia 1.6.1 and TextAnalysis v0.7.3.
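
For context, the trace shows `_combine_regex(::Set{AbstractString})` handing a single pattern to `mk_regex`, and the error offset (592769) suggests that this pattern is several hundred thousand characters long, which would happen if a very large set of sparse terms is joined into one regex. A possible workaround, sketched below in plain Julia, filters tokens by set lookup instead of compiling a regex. The helpers `sparse_words` and `drop_words` are hypothetical and not part of TextAnalysis; the sparsity rule (a word counts as sparse if it appears in at most `alpha` of the documents) and the whitespace tokenization are assumptions made for illustration.

```julia
# Hypothetical helper, not part of TextAnalysis.
# Assumption: a word is "sparse" if it occurs in at most `alpha * ndocs` documents.
function sparse_words(docs::Vector{String}, alpha::Float64)
    doc_freq = Dict{String,Int}()
    for doc in docs
        # Count each word once per document (document frequency).
        for word in Set(lowercase.(split(doc)))
            doc_freq[word] = get(doc_freq, word, 0) + 1
        end
    end
    limit = alpha * length(docs)
    return Set{String}(w for (w, n) in doc_freq if n <= limit)
end

# Hypothetical helper, not part of TextAnalysis.
# Drops unwanted words by set lookup, so no regex is ever compiled.
# Assumption: tokens are separated by whitespace.
function drop_words(text::AbstractString, unwanted::Set{String})
    kept = [tok for tok in split(text) if !(lowercase(tok) in unwanted)]
    return join(kept, " ")
end

docs = [
    "the patent claims a widget",
    "the patent describes a gadget",
    "the claims cite a widget",
]

sparse = sparse_words(docs, 0.34)
cleaned = [drop_words(doc, sparse) for doc in docs]
```

The cleaned strings could then be wrapped in `StringDocument`s to rebuild the corpus. Set lookup stays cheap no matter how many sparse terms there are, whereas a single alternation pattern grows with every term added.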

aviks (Member) commented Aug 18, 2021

Might well be a bug. Are the documents you use public? If so, would you be able to provide an example that fails?
