multi-byte `grouping_mark` doesn't work when source file is different encoding #1459

cjyetman · 2023-01-13T17:30:57Z

Using a multi-byte character as a grouping_mark doesn't work when the source file encoding is "windows-1252", while other uncommon single-byte marks work as expected. I'm on macOS in a UTF-8 locale. Is there a way to specify the grouping_mark so that it matches when the source file is in "windows-1252"? #796 seems related; it was closed by tidyverse/vroom@959b4b7.

mark <- " "
charToRaw(mark)
#> [1] 20
txt <- paste0("x\n123", mark, "456.78")
writeLines(txt, file("test_1252.csv", encoding = "windows-1252"))
writeLines(txt, file("test_utf8.csv", encoding = "UTF-8"))

readr::read_delim(
  file = "test_1252.csv",
  locale = readr::locale(encoding = "windows-1252", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>         x
#>     <dbl>
#> 1 123457.

readr::read_delim(
  file = "test_utf8.csv",
  locale = readr::locale(encoding = "UTF-8", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>         x
#>     <dbl>
#> 1 123457.


mark <- "’"
charToRaw(mark)
#> [1] e2 80 99
txt <- paste0("x\n123", mark, "456.78")
writeLines(txt, file("test_1252.csv", encoding = "windows-1252"))
writeLines(txt, file("test_utf8.csv", encoding = "UTF-8"))

readr::read_delim(
  file = "test_1252.csv",
  locale = readr::locale(encoding = "windows-1252", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1   123

readr::read_delim(
  file = "test_utf8.csv",
  locale = readr::locale(encoding = "UTF-8", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>         x
#>     <dbl>
#> 1 123457.

mark <- "\U00A0"
charToRaw(mark)
#> [1] c2 a0
txt <- paste0("x\n123", mark, "456.78")
writeLines(txt, file("test_1252.csv", encoding = "windows-1252"))
writeLines(txt, file("test_utf8.csv", encoding = "UTF-8"))

readr::read_delim(
  file = "test_1252.csv",
  locale = readr::locale(encoding = "windows-1252", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1   123

readr::read_delim(
  file = "test_utf8.csv",
  locale = readr::locale(encoding = "UTF-8", grouping_mark = mark),
  delim = ";",
  show_col_types = FALSE
)
#> # A tibble: 1 × 1
#>         x
#>     <dbl>
#> 1 123457.
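
For reference, a small base R sketch of how the three marks used above differ in characters versus bytes, and what they become in windows-1252. It only inspects the strings and makes no claim about what readr does internally. The variable names `marks` and `cp1252` are illustrative, and the conversion assumes the platform's iconv knows "windows-1252".

```r
# The three grouping marks used in the examples above, as UTF-8 strings
marks <- c(space = " ", nbsp = "\u00a0", apostrophe = "\u2019")

# Each mark is one character ...
nchar(marks, type = "chars")

# ... but the non-ASCII marks occupy more than one byte in UTF-8
nchar(marks, type = "bytes")

# Bytes of the same marks after conversion to windows-1252
# (assumes the local iconv supports this encoding name)
cp1252 <- iconv(marks, from = "UTF-8", to = "windows-1252", toRaw = TRUE)
cp1252
```

In windows-1252 each of these marks is a single byte, so the on-disk bytes of the mark never match the UTF-8 bytes of the `grouping_mark` string unless the input is re-encoded first.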


cjyetman · 2023-01-17T09:16:27Z

Similarly, readr::parse_number() doesn't parse as expected when the grouping_mark is a multi-byte character...

mark <- " "
charToRaw(mark)
#> [1] 20
txt <- paste0("123", mark, "456.78")
txt
#> [1] "123 456.78"
readr::parse_number(txt, locale = readr::locale(grouping_mark = mark))
#> [1] 123456.8

mark <- "\U00A0"
charToRaw(mark)
#> [1] c2 a0
txt <- paste0("123", mark, "456.78")
txt
#> [1] "123 456.78"
readr::parse_number(txt, locale = readr::locale(grouping_mark = mark))
#> [1] 123

mark <- "’"
charToRaw(mark)
#> [1] e2 80 99
txt <- paste0("123", mark, "456.78")
txt
#> [1] "123’456.78"
readr::parse_number(txt, locale = readr::locale(grouping_mark = mark))
#> [1] 123
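
Until this is resolved, one possible workaround is to strip the grouping mark as a whole string before parsing. This is a minimal base R sketch; `parse_grouped_number()` is a hypothetical helper, not a readr function, and it assumes the input contains nothing but digits, the grouping mark and the decimal mark.

```r
# Hypothetical helper (not part of readr): remove the grouping mark as a
# literal character sequence, then parse what is left.
parse_grouped_number <- function(x, grouping_mark, decimal_mark = ".") {
  x <- enc2utf8(x)
  grouping_mark <- enc2utf8(grouping_mark)
  # fixed = TRUE matches the mark literally, so a multi-byte mark is
  # removed as a unit instead of being interpreted as a regex
  x <- gsub(grouping_mark, "", x, fixed = TRUE)
  if (decimal_mark != ".") {
    # as.numeric() expects "." as the decimal separator
    x <- sub(decimal_mark, ".", x, fixed = TRUE)
  }
  as.numeric(x)
}

parse_grouped_number("123\u00a0456.78", grouping_mark = "\u00a0")
parse_grouped_number("123\u2019456.78", grouping_mark = "\u2019")
parse_grouped_number("123\u00a0456,78", grouping_mark = "\u00a0", decimal_mark = ",")
```

For files, the same idea could presumably be applied by reading the column as character first (for example with `col_types = "c"`) and cleaning it afterwards.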

hadley · 2023-08-01T12:33:21Z

Place to start is probably to figure out what's going on here:

library(readr)

parse_number("123--456", locale = locale(grouping_mark = "--"))
#> [1] 123456
parse_number("123\U00A0456", locale = locale(grouping_mark = "\U00A0"))
#> [1] 123

Created on 2023-08-01 with reprex v2.0.2

cjyetman · 2024-01-14T21:33:58Z

> Place to start is probably to figure out what's going on here:
>
> library(readr)
>
> parse_number("123--456", locale = locale(grouping_mark = "--"))
> #> [1] 123456
> parse_number("123\U00A0456", locale = locale(grouping_mark = "\U00A0"))
> #> [1] 123
>
> Created on 2023-08-01 with reprex v2.0.2

"--" seems to work because of some kind of recycling. Multi-byte grouping marks whose bytes differ from one another do not work:

readr::parse_number("123-456", locale = readr::locale(grouping_mark = "-"))
#> [1] 123456
readr::parse_number("123|456", locale = readr::locale(grouping_mark = "|"))
#> [1] 123456

readr::parse_number("123-456", locale = readr::locale(grouping_mark = "---"))
#> [1] 123456
readr::parse_number("123---456", locale = readr::locale(grouping_mark = "-"))
#> [1] 123456

readr::parse_number("123|456", locale = readr::locale(grouping_mark = "|||"))
#> [1] 123456
readr::parse_number("123|||456", locale = readr::locale(grouping_mark = "|"))
#> [1] 123456

readr::parse_number("123|-456", locale = readr::locale(grouping_mark = "|-"))
#> [1] 123
readr::parse_number("123-|456", locale = readr::locale(grouping_mark = "-|"))
#> [1] 123
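
One hypothesis that would be consistent with every result above is that only the first byte of the grouping mark is compared against the input, one byte at a time. The sketch below is a toy model of that hypothesis in base R. It is not readr's code, `toy_parse_number()` is a made-up name, and it only handles unsigned integers.

```r
# Toy model only: NOT readr's implementation. It assumes the parser skips any
# input byte equal to the FIRST byte of grouping_mark and stops at the first
# byte that is neither a digit nor that byte.
toy_parse_number <- function(x, grouping_mark) {
  bytes <- charToRaw(x)
  mark_byte <- charToRaw(grouping_mark)[1]
  digit_bytes <- charToRaw("0123456789")
  kept <- raw(0)
  for (i in seq_along(bytes)) {
    b <- bytes[i]
    if (any(b == digit_bytes)) {
      kept <- c(kept, b)   # keep digits
    } else if (b == mark_byte) {
      next                 # skip bytes matching the first mark byte
    } else {
      break                # anything else ends the number
    }
  }
  as.numeric(rawToChar(kept))
}

toy_parse_number("123--456", "--")
#> [1] 123456
toy_parse_number("123-456", "---")
#> [1] 123456
toy_parse_number("123|-456", "|-")
#> [1] 123
toy_parse_number("123\u00a0456", "\u00a0")
#> [1] 123
```

The toy reproduces the observed outputs, which supports the hypothesis without proving it; the actual comparison in the C++ sources would still need to be checked.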

cjyetman · 2024-01-14T23:08:48Z

Pretty sure this is iterating through bytes, not characters:

readr/src/parse.cpp

Lines 165 to 178 in e529cb2

    
    for (int i = 0; i < n; ++i) {
      Token t;
      if (x[i] == NA_STRING) {
        t = Token(TOKEN_MISSING, i, -1);
      } else {
        SEXP string = x[i];
        t = Token(CHAR(string), CHAR(string) + Rf_length(string), i, -1, false);
        if (trim_ws) {
          t.trim();
        }
        t.flagNA(na);
      }
      col->setValue(i, t);
    }

cjyetman mentioned this issue Jan 17, 2023

multi-byte character grouping_mark is not parsed correctly when source file is not ASCII/UTF-8 RMI-PACTA/pacta.portfolio.import#34

Open

hadley added bug an unexpected problem or unintended behavior multibyte 🦋 read 📖 locale 🌏 and removed multibyte 🦋 labels Jul 31, 2023
