Confusing error for non-ascii input #1521

Open
mine-cetinkaya-rundel opened this issue Nov 9, 2023 · 2 comments

Comments

@mine-cetinkaya-rundel
Member

This is from R4DS.

library(readr)
 
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)$text
#> Warning in grepl("\n", path): unable to translate 'text
#> El Ni<f1>o was particularly bad this year' to a wide string
#> Warning in grepl("\n", path): input string 1 is invalid
#> Warning in grepl("^((http|ftp)s?|sftp)://", path): unable to translate 'text
#> El Ni<f1>o was particularly bad this year' to a wide string
#> Warning in grepl("^((http|ftp)s?|sftp)://", path): input string 1 is invalid
#> Warning in regexpr(regex, path, perl = TRUE): input string 1 is invalid UTF-8
#> Warning in grepl("^(/|[A-Za-z]:|\\\\|~)", path): unable to translate 'text
#> El Ni<f1>o was particularly bad this year' to a wide string
#> Warning in grepl("^(/|[A-Za-z]:|\\\\|~)", path): input string 1 is invalid
#> Error: 'text El Ni<f1>o was particularly bad this year' does not exist in
#> current working directory
#> ('/private/var/folders/3_/xjzgh7dj511d5t7996fq_1sm0000gn/T/RtmpqIT6YG/reprex-a72c13762e55-cilia-stag').

Created on 2023-11-09 with reprex v2.0.2

@jennybc
Member

jennybc commented Nov 9, 2023

In terms of what's happening, read_csv() is really calling vroom::vroom(delim = ",", locale = default_locale()) and the default locale has UTF-8 encoding. The escape sequence \xf1 is not valid UTF-8, which is the root cause of all the problems.
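One way to confirm this from the console is base R's validUTF8(), which checks the bytes of a string regardless of how the string is marked. A minimal sketch (the names `bad` and `good` are only for illustration):

```r
# The same text written two ways: a raw byte escape vs. a Unicode escape.
bad  <- "El Ni\xf1o"     # single byte 0xF1, the latin1 code for "ñ"
good <- "El Ni\u00f1o"   # code point U+00F1, stored by R as UTF-8

# validUTF8() inspects the bytes themselves.
bad_ok  <- validUTF8(bad)    # FALSE: 0xF1 is not followed by continuation bytes
good_ok <- validUTF8(good)   # TRUE

print(c(bad = bad_ok, good = good_ok))
```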

A complicating factor is that the input x1 is clearly intended as literal input, but it's being processed as a file path. The main error is reported last above:

#> Error: 'text El Ni<f1>o was particularly bad this year' does not exist in
#> current working directory

Then we're also getting lots of base R warnings from a failed file existence check, since the file path is not valid UTF-8.
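The first warning in the report names the check involved: grepl("\n", path). The base R sketch below imitates only that one call; it is not readr's or vroom's actual code. What the plain grepl() call returns or warns about depends on the R version and locale, so the sketch records whether a warning occurred rather than assuming a result.

```r
x1 <- "text\nEl Ni\xf1o was particularly bad this year"

# Run the call that appears in the warning, recording any warning instead of
# printing it. The exact behaviour varies by R version and locale.
warned <- FALSE
found <- withCallingHandlers(
  grepl("\n", x1),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

# After converting to valid UTF-8, the newline is found without complaint.
x1_utf8 <- iconv(x1, "latin1", "UTF-8")
found_utf8 <- grepl("\n", x1_utf8)

print(c(found = found, warned = warned, found_utf8 = found_utf8))
```

If `found` comes back FALSE, that is consistent with the string no longer being recognized as literal data and falling through to the file-path branch.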

I'm pretty surprised this code ever worked or that it worked in recent memory. I'll have a think on whether we can improve on the error. But the error and warnings above do actually explain what's wrong, albeit in a rather cryptic way.

If you want to update the code, here are some ideas. Key changes are to explicitly convert from latin1 to UTF-8 and to use I() to indicate literal input.

If you want to keep using the \x escape sequence, then you'll need to convert that string to UTF-8:

library(readr)

x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(I(iconv(x1, "latin1", "utf-8")), show_col_types = FALSE)$text
#> [1] "El Niño was particularly bad this year"

library(vroom)

x1 <- "text\nEl Ni\xf1o was particularly bad this year"
vroom(I(iconv(x1, "latin1", "utf-8")), delim = ",", show_col_types = FALSE)$text
#> [1] "El Niño was particularly bad this year"
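
An equivalent route, in case it reads more naturally, is to declare the string's encoding and let base R re-encode it. This sketch uses only base R (Encoding() and enc2utf8()); the variable names are illustrative.

```r
x1 <- "text\nEl Ni\xf1o was particularly bad this year"

# Declare that the bytes are latin1, then re-encode to UTF-8.
x1_declared <- x1
Encoding(x1_declared) <- "latin1"
x1_utf8 <- enc2utf8(x1_declared)

# Compare with the iconv() conversion used in the examples above.
same_as_iconv <- identical(x1_utf8, iconv(x1, "latin1", "UTF-8"))
print(c(valid = validUTF8(x1_utf8), same_as_iconv = same_as_iconv))
```

The resulting `x1_utf8` can be passed to read_csv(I(x1_utf8), show_col_types = FALSE) in the same way as the iconv() output.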

But if this is just about using an accented character in literal input, then use a \u escape sequence instead, to get a UTF-8 string.

library(readr)
x1 <- "text\nEl Ni\u00F1o was particularly bad this year"
read_csv(I(x1), show_col_types = FALSE)$text
#> [1] "El Niño was particularly bad this year"
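
To see the difference between the two escapes at the byte level, charToRaw() is handy. A small base R sketch:

```r
# \x inserts one raw byte; \u inserts a code point, which R stores as UTF-8.
bytes_x <- charToRaw("\xf1")
bytes_u <- charToRaw("\u00f1")

print(bytes_x)  # one byte: f1
print(bytes_u)  # two bytes: c3 b1, the UTF-8 encoding of U+00F1
```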

Created on 2023-11-09 with reprex v2.0.2.9000

@jennybc
Member

jennybc commented Nov 9, 2023

The noise/errors around the original example (which originates in R4DS) have probably gotten worse over time due to changes in base R. Some relevant items from NEWS:

  • R 4.3.0: Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8).
  • R 4.0.0: Most functions with file-path inputs will give an explicit error if a file-path input in a marked encoding cannot be translated (to the native encoding or in some cases on Windows to UTF-8), rather than translate to a different file path using escapes. Some (such as dir.exists(), file.exists(), file.access(), file.info(), list.files(), normalizePath() and path.expand()) treat this like any other non-existent file, often with a warning.
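
Given those stricter checks, one user-side option is to validate literal input before handing it to a reader, so that a failure is reported in terms of encoding rather than a missing file. The helper below, `as_utf8_literal()`, is hypothetical: it is not part of readr, vroom, or base R, and it assumes the caller knows the source encoding. It uses only base R's validUTF8() and iconv().

```r
# Hypothetical helper: return `x` as valid UTF-8, converting from `from`
# when needed, and fail with an encoding-specific message otherwise.
as_utf8_literal <- function(x, from = NULL) {
  if (all(validUTF8(x))) {
    return(x)  # bytes are already valid UTF-8; nothing to do
  }
  if (is.null(from)) {
    stop("Input is not valid UTF-8; supply `from`, e.g. \"latin1\".",
         call. = FALSE)
  }
  out <- iconv(x, from, "UTF-8")
  if (anyNA(out)) {
    stop("Input could not be converted from ", from, " to UTF-8.",
         call. = FALSE)
  }
  out
}

x1 <- "text\nEl Ni\xf1o was particularly bad this year"
x1_utf8 <- as_utf8_literal(x1, from = "latin1")
print(validUTF8(x1_utf8))
```

The converted string can then be read with read_csv(I(x1_utf8), show_col_types = FALSE), as in the examples earlier in the thread.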
