multi-byte grouping_mark doesn't work when source file is different encoding #1459

Open
cjyetman opened this issue Jan 13, 2023 · 4 comments
Labels
bug an unexpected problem or unintended behavior locale 🌏 read 📖

@cjyetman

Using a multi-byte character as the grouping_mark doesn't work when the source file's encoding is "windows-1252", while single-byte marks (even uncommon ones) work as expected. I'm on macOS in a UTF-8 locale. Is there a way to specify the grouping_mark so that it matches when the source file is in "windows-1252"? #796 seems related; it was closed by tidyverse/vroom@959b4b7.

mark <- " "
charToRaw(mark)
#> [1] 20
txt <- paste0("x\n123", mark, "456.78")
writeLines(txt, file("test_1252.csv", encoding = "windows-1252"))
writeLines(txt, file("test_utf8.csv", encoding = "UTF-8"))

readr::read_delim(
  file = "test_1252.csv",
  locale = readr::locale(encoding = "windows-1252", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>         x
#>     <dbl>
#> 1 123457.

readr::read_delim(
  file = "test_utf8.csv",
  locale = readr::locale(encoding = "UTF-8", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>         x
#>     <dbl>
#> 1 123457.


mark <- "’"
charToRaw(mark)
#> [1] e2 80 99
txt <- paste0("x\n123", mark, "456.78")
writeLines(txt, file("test_1252.csv", encoding = "windows-1252"))
writeLines(txt, file("test_utf8.csv", encoding = "UTF-8"))

readr::read_delim(
  file = "test_1252.csv",
  locale = readr::locale(encoding = "windows-1252", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1   123

readr::read_delim(
  file = "test_utf8.csv",
  locale = readr::locale(encoding = "UTF-8", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>         x
#>     <dbl>
#> 1 123457.

mark <- "\U00A0"
charToRaw(mark)
#> [1] c2 a0
txt <- paste0("x\n123", mark, "456.78")
writeLines(txt, file("test_1252.csv", encoding = "windows-1252"))
writeLines(txt, file("test_utf8.csv", encoding = "UTF-8"))

readr::read_delim(
  file = "test_1252.csv",
  locale = readr::locale(encoding = "windows-1252", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1   123

readr::read_delim(
  file = "test_utf8.csv",
  locale = readr::locale(encoding = "UTF-8", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>         x
#>     <dbl>
#> 1 123457.
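
One thing worth checking alongside the reprex above is what bytes each mark turns into once it is written as windows-1252. The snippet below is a minimal sketch using base R's iconv(). It assumes the platform's iconv accepts the encoding name "windows-1252", and it only shows the byte-level difference between the two encodings, not what readr does internally.

```r
# Bytes of each grouping mark as a UTF-8 string in R vs. in a windows-1252 file.
marks <- c(space = " ", nbsp = "\u00A0", apostrophe = "\u2019")

utf8_bytes <- lapply(marks, charToRaw)
cp1252_bytes <- iconv(marks, from = "UTF-8", to = "windows-1252", toRaw = TRUE)
names(cp1252_bytes) <- names(marks)

# TRUE where both encodings use the same byte sequence for the mark
same_bytes <- mapply(identical, utf8_bytes, cp1252_bytes)
print(same_bytes)
```

Of the three marks, only the ASCII space has the same bytes in both encodings, which lines up with it being the only mark that parses correctly from both files.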
@cjyetman
Author

Similarly, readr::parse_number() doesn't parse as expected when the grouping_mark is a multi-byte character...

mark <- " "
charToRaw(mark)
#> [1] 20
txt <- paste0("123", mark, "456.78")
txt
#> [1] "123 456.78"
readr::parse_number(txt, locale = readr::locale(grouping_mark = mark))
#> [1] 123456.8

mark <- "\U00A0"
charToRaw(mark)
#> [1] c2 a0
txt <- paste0("123", mark, "456.78")
txt
#> [1] "123 456.78"
readr::parse_number(txt, locale = readr::locale(grouping_mark = mark))
#> [1] 123

mark <- "’"
charToRaw(mark)
#> [1] e2 80 99
txt <- paste0("123", mark, "456.78")
txt
#> [1] "123’456.78"
readr::parse_number(txt, locale = readr::locale(grouping_mark = mark))
#> [1] 123
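
A possible workaround while this is open is to strip the grouping mark from the text before converting it. This is a sketch in base R only: strip_grouping() is a hypothetical helper, not part of readr, and it assumes "." is the decimal mark.

```r
# Hypothetical helper: remove the grouping mark as a literal string, then convert.
strip_grouping <- function(x, grouping_mark) {
  as.numeric(gsub(grouping_mark, "", x, fixed = TRUE))
}

nbsp_value <- strip_grouping("123\u00A0456.78", "\u00A0")
apos_value <- strip_grouping("123\u2019456.78", "\u2019")
print(c(nbsp_value, apos_value))
```

For a file, the same idea should apply after reading the affected column as character, though that is untested here.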

@hadley
Member

hadley commented Aug 1, 2023

Place to start is probably to figure out what's going on here:

library(readr)

parse_number("123--456", locale = locale(grouping_mark = "--"))
#> [1] 123456
parse_number("123\U00A0456", locale = locale(grouping_mark = "\U00A0"))
#> [1] 123

Created on 2023-08-01 with reprex v2.0.2

@cjyetman
Author

"--" seems to work because of some kind of recycling. Multi-byte grouping marks whose bytes differ from each other do not work:

readr::parse_number("123-456", locale = readr::locale(grouping_mark = "-"))
#> [1] 123456
readr::parse_number("123|456", locale = readr::locale(grouping_mark = "|"))
#> [1] 123456

readr::parse_number("123-456", locale = readr::locale(grouping_mark = "---"))
#> [1] 123456
readr::parse_number("123---456", locale = readr::locale(grouping_mark = "-"))
#> [1] 123456

readr::parse_number("123|456", locale = readr::locale(grouping_mark = "|||"))
#> [1] 123456
readr::parse_number("123|||456", locale = readr::locale(grouping_mark = "|"))
#> [1] 123456

readr::parse_number("123|-456", locale = readr::locale(grouping_mark = "|-"))
#> [1] 123
readr::parse_number("123-|456", locale = readr::locale(grouping_mark = "-|"))
#> [1] 123
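
The results above are consistent with only the first byte of the grouping mark being compared. The toy model below is a guess at that behavior, written in base R to check it against the observed outputs. toy_parse_number() is hypothetical and is not how readr is implemented.

```r
# Toy model (an assumption, not readr's code): walk the input byte by byte,
# skip any byte equal to the FIRST byte of the grouping mark, keep digits and
# the decimal mark, and stop at the first byte that is anything else.
toy_parse_number <- function(x, grouping_mark, decimal_mark = ".") {
  bytes <- charToRaw(x)
  group_byte <- charToRaw(grouping_mark)[1]
  dec_byte <- charToRaw(decimal_mark)[1]
  digit_bytes <- charToRaw("0123456789")

  kept <- raw(0)
  for (i in seq_along(bytes)) {
    b <- bytes[i]
    if (b == group_byte) {
      next
    }
    if (b %in% digit_bytes || b == dec_byte) {
      kept <- c(kept, b)
    } else {
      break
    }
  }
  as.numeric(rawToChar(kept))
}

repeated_mark <- toy_parse_number("123--456", "--")
mixed_mark <- toy_parse_number("123|-456", "|-")
nbsp_mark <- toy_parse_number("123\u00A0456", "\u00A0")
print(c(repeated_mark, mixed_mark, nbsp_mark))
```

Under this model a mark made of one repeated byte behaves like its single-byte version, while a mark whose second byte differs stops the parse right after the first group of digits.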

@cjyetman
Author

cjyetman commented Jan 14, 2024

pretty sure this is iterating through bytes, not characters

readr/src/parse.cpp

Lines 165 to 178 in e529cb2

for (int i = 0; i < n; ++i) {
  Token t;
  if (x[i] == NA_STRING) {
    t = Token(TOKEN_MISSING, i, -1);
  } else {
    SEXP string = x[i];
    t = Token(CHAR(string), CHAR(string) + Rf_length(string), i, -1, false);
    if (trim_ws) {
      t.trim();
    }
    t.flagNA(na);
  }
  col->setValue(i, t);
}
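
To make the byte-versus-character distinction concrete, here is a small base R sketch. It does not inspect readr. It only shows that a single-character mark such as "’" spans several bytes, so a scan that compares one byte at a time cannot match the whole mark in one step.

```r
mark <- "\u2019"

n_chars <- nchar(mark, type = "chars")  # number of characters in the mark
n_bytes <- nchar(mark, type = "bytes")  # number of bytes in its UTF-8 encoding
mark_bytes <- charToRaw(mark)

# The same input seen as bytes: three digits, the mark's bytes, three digits.
input_bytes <- charToRaw(paste0("123", mark, "456"))
print(list(chars = n_chars, bytes = n_bytes, input_length = length(input_bytes)))
```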
