multi-byte character `grouping_mark` is not parsed correctly when source file is not ASCII/UTF-8 #34

cjyetman · 2023-01-17T09:43:24Z

If a portfolio file uses a grouping mark that is a multi-byte character in UTF-8, and the file is encoded in something other than ASCII/UTF-8, numbers containing the grouping mark are read incorrectly: the value is truncated at the mark (e.g. `1’234` is read as `1`). I have submitted an upstream issue with {readr}: tidyverse/readr#1459

mark <- "’"
charToRaw(mark)
#> [1] e2 80 99

data <-
  data.frame(
    isin = c("US13872348", "US13872348"),
    market_value = c("1234", paste0("1", mark, "234")),
    currency = c("USD", "USD")
  )

write.csv(data, file = "test_1252.csv", fileEncoding = "windows-1252", row.names = FALSE)
write.csv(data, file = "test_utf8.csv", fileEncoding = "UTF-8", row.names = FALSE)

pacta.portfolio.import::read_portfolio_csv("test_1252.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348            1 USD
pacta.portfolio.import::read_portfolio_csv("test_utf8.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD


mark <- "\U00A0"
charToRaw(mark)
#> [1] c2 a0

data <-
  data.frame(
    isin = c("US13872348", "US13872348"),
    market_value = c("1234", paste0("1", mark, "234")),
    currency = c("USD", "USD")
  )

write.csv(data, file = "test_1252.csv", fileEncoding = "windows-1252", row.names = FALSE)
write.csv(data, file = "test_utf8.csv", fileEncoding = "UTF-8", row.names = FALSE)

pacta.portfolio.import::read_portfolio_csv("test_1252.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348            1 USD
pacta.portfolio.import::read_portfolio_csv("test_utf8.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD


mark <- "*"
charToRaw(mark)
#> [1] 2a

data <-
  data.frame(
    isin = c("US13872348", "US13872348"),
    market_value = c("1234", paste0("1", mark, "234")),
    currency = c("USD", "USD")
  )

write.csv(data, file = "test_1252.csv", fileEncoding = "windows-1252", row.names = FALSE)
write.csv(data, file = "test_utf8.csv", fileEncoding = "UTF-8", row.names = FALSE)

pacta.portfolio.import::read_portfolio_csv("test_1252.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD
pacta.portfolio.import::read_portfolio_csv("test_utf8.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD

One potential workaround that avoids waiting for an upstream fix would be to convert any file encoded in something other than ASCII/UTF-8 to UTF-8 before trying to import it, e.g.

mark <- "’"
charToRaw(mark)
#> [1] e2 80 99

data <-
  data.frame(
    isin = c("US13872348", "US13872348"),
    market_value = c("1234", paste0("1", mark, "234")),
    currency = c("USD", "USD")
  )

filepath <- "test_1252.csv"
write.csv(data, file = filepath, fileEncoding = "windows-1252", row.names = FALSE)

pacta.portfolio.import::read_portfolio_csv(filepath)
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348            1 USD

encoding <- pacta.portfolio.import:::guess_file_encoding(filepath)
# re-encode to UTF-8 via a temporary file, then overwrite the original file
command <- sprintf("iconv -f %s -t UTF-8 %s > tmpfile && mv -f tmpfile %s", shQuote(encoding), shQuote(filepath), shQuote(filepath))
system(command)
pacta.portfolio.import::read_portfolio_csv(filepath)
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD
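The same conversion could probably be done without shelling out to the `iconv` command-line tool, which may not be available on every system. Below is a minimal sketch using base R only. `reencode_to_utf8()` is a hypothetical helper, not part of {pacta.portfolio.import}, and it assumes the source encoding is already known (e.g. from `guess_file_encoding()`) and that the file contains no embedded NUL bytes (so it would not work for UTF-16 files).

```r
# Hypothetical helper (not part of pacta.portfolio.import): re-encode a text
# file to UTF-8 in place, using base R only.
reencode_to_utf8 <- function(filepath, from_encoding) {
  # read the raw bytes so nothing is reinterpreted on the way in
  bytes <- readBin(filepath, what = "raw", n = file.size(filepath))
  text <- rawToChar(bytes)

  # iconv() returns NA if the input cannot be converted
  converted <- iconv(text, from = from_encoding, to = "UTF-8")
  if (is.na(converted)) {
    stop("could not convert ", filepath, " from ", from_encoding)
  }

  # write to a temporary file first, then copy over the original
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeBin(charToRaw(converted), tmp)
  file.copy(tmp, filepath, overwrite = TRUE)

  invisible(filepath)
}
```

With the file from the example above, this would be called as `reencode_to_utf8(filepath, encoding)` before `read_portfolio_csv(filepath)`.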


cjyetman · 2023-01-17T09:52:34Z

This was prompted by a portfolio using "right single quotation mark" as the grouping mark.

right single quotation mark
’
Unicode code point: U+2019
Hex UTF-8 bytes: E2 80 99
Hex windows-1252 code point: 92 ("\x92")
Decimal windows-1252 code point: 146

"’"
#> [1] "’"
charToRaw("’")
#> [1] e2 80 99
iconv("’", from = "UTF-8", to = "windows-1252")
#> [1] "\x92"
iconv("\x92", from = "windows-1252", to = "UTF-8")
#> [1] "’"
`Encoding<-`("\x92", "latin1")
#> [1] "’"
0x92
#> [1] 146
rawToChar(as.raw(146))
#> [1] "\x92"
rawToChar(as.raw(0x92))
#> [1] "\x92"

cjyetman added the bug and priority labels on Jan 17, 2023
