stringr functions not working for strings with \r or \n #517

Closed
Athospd opened this issue Jul 10, 2023 · 7 comments

Comments

@Athospd

Athospd commented Jul 10, 2023

Is it a bug?
PS: str_detect() has this behavior too.

pattern <- "@.*@" 
input <- "NOT WANTED @ WANTED\r @ NOT WANTED"

# stringr
stringr::str_extract_all(input, pattern)
#> [[1]]
#> character(0)

# idiomatic (base R)
regmatches(input, regexpr(pattern, input))
#> [1] "@ WANTED\r @"


pattern <- "@.*@" 
input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# stringr
stringr::str_extract_all(input, pattern)
#> [[1]]
#> character(0)

# idiomatic (base R)
regmatches(input, regexpr(pattern, input))
#> [1] "@ WANTED\n @"

Created on 2023-07-10 with reprex v2.0.2

@gagolews
Contributor

The regex dot does not match a newline character by default; see the 'dotall' flag/setting described in the stringi paper: https://www.jstatsoft.org/article/view/v103i02
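
As a rough illustration of the 'dotall' setting (a sketch only, assuming the stringi package is installed; the variable names are illustrative and the input is the `\n` example from the original report), the same pattern can be tried with and without it:

```r
library(stringi)

input <- "NOT WANTED @ WANTED\n @ NOT WANTED"
pattern <- "@.*@"

# Default ICU behaviour: `.` stops at the line break, so no match spans both "@".
default_hit <- stri_detect_regex(input, pattern)

# dotall = TRUE lets `.` match line terminators as well.
dotall_hit <- stri_detect_regex(
  input, pattern,
  opts_regex = stri_opts_regex(dotall = TRUE)
)

# The inline flag (?s) requests the same thing from inside the pattern.
inline_hit <- stri_detect_regex(input, "(?s)@.*@")

print(c(default = default_hit, dotall = dotall_hit, inline = inline_hit))
#> default  dotall  inline
#>   FALSE    TRUE    TRUE
```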

@Athospd
Author

Athospd commented Jul 11, 2023

But "@[^(@@@@)]*@" makes it work again.
How is it not a bug? I couldn't find the answer at that link.

@gagolews
Contributor

The meaning of [^(@@@@)] is: any character except (, @, and ). This includes the newline.
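
A small check of that reading (a sketch, assuming stringr is installed; the shorter class "[^()@]" is introduced here only for comparison and does not come from the thread):

```r
library(stringr)

input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# "[^(@@@@)]" lists "(", "@" and ")"; repeating "@" adds nothing,
# so it should behave like the shorter class "[^()@]".
long_class  <- str_extract(input, "@[^(@@@@)]*@")
short_class <- str_extract(input, "@[^()@]*@")

# A negated class matches any character it does not list, "\n" included,
# which is why this pattern crosses the line break while "@.*@" does not.
newline_in_class <- str_detect("\n", "[^(@@@@)]")
paren_in_class   <- str_detect("(", "[^(@@@@)]")

print(identical(long_class, short_class))
print(c(newline = newline_in_class, paren = paren_in_class))
```

So the negated class does not change how the dot works. It simply avoids the dot altogether.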

Another good tutorial on regexes is https://www.regular-expressions.info/

@Athospd
Author

Athospd commented Jul 11, 2023

But they are behaving differently. Could I ask you for a more specific explanation? Those links are too vague.

They should return the same output, but they do not. Don't you agree with that?

When using base R, things work differently from stringr. In my mind the logic is: the regex pattern is the same, so it should return the same output. Where am I going wrong?

@gagolews
Contributor

If you pass perl=TRUE to the R regex functions, you will get the same behaviour, i.e., . not matching a newline by default. Regexes are a powerful tool, but the regex engines differ from each other. This is how they are implemented; it is part of their specification.

For ICU regexes (used in stringi and hence stringr), see https://unicode-org.github.io/icu/userguide/strings/regexp.html

For PCRE regexes (perl=TRUE) in base R, see https://www.pcre.org/current/doc/html/pcre2pattern.html

For TRE regexes (perl=FALSE - default), see https://github.com/laurikari/tre/ -- but these are not particularly well-documented.

I would rather say it is TRE that does things differently, not ICU/PCRE.

Even Python regexes (https://docs.python.org/3/howto/regex.html) have the DOTALL distinction.
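
The engine difference can be seen in base R alone. The following is a minimal sketch using the `\n` input from the original report; the variable names are illustrative. The `\r` case is left out on purpose, since which characters PCRE treats as a newline depends on how it was built.

```r
input <- "NOT WANTED @ WANTED\n @ NOT WANTED"
pattern <- "@.*@"

# TRE (perl = FALSE, the default): `.` matches "\n", so the pattern matches.
tre_hit <- grepl(pattern, input)

# PCRE (perl = TRUE): `.` does not match "\n" by default, so there is no match.
pcre_hit <- grepl(pattern, input, perl = TRUE)

# PCRE with the inline (?s) flag: `.` matches "\n" again.
pcre_dotall_hit <- grepl("(?s)@.*@", input, perl = TRUE)

print(c(tre = tre_hit, pcre = pcre_hit, pcre_dotall = pcre_dotall_hit))
#>         tre        pcre pcre_dotall
#>        TRUE       FALSE        TRUE
```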

HTH

@Athospd
Author

Athospd commented Jul 12, 2023

Hmm, interesting, thank you for this. I'm closing the issue!

@Athospd Athospd closed this as completed Jul 12, 2023
@hadley
Member

hadley commented Aug 7, 2023

Also see the dotall argument to regex().
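
For the example from the original report, that might look like the following (a sketch, assuming stringr is installed; the variable names are illustrative):

```r
library(stringr)

input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# A plain pattern string uses the default regex() options: `.` skips "\n".
default_result <- str_extract_all(input, "@.*@")

# Wrapping the pattern in regex(dotall = TRUE) lets `.` match "\n" too.
dotall_result <- str_extract_all(input, regex("@.*@", dotall = TRUE))

print(dotall_result)
#> [[1]]
#> [1] "@ WANTED\n @"
```

The same regex() wrapper can be passed to str_detect(), which the original report mentions as showing the same behaviour.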
