multi-byte character grouping_mark is not parsed correctly when source file is not ASCII/UTF-8 #34

Open
cjyetman opened this issue Jan 17, 2023 · 1 comment
Labels
bug, priority

Comments

@cjyetman
Member

If a portfolio file uses a multi-byte character as its grouping_mark and the file is encoded in something other than ASCII/UTF-8, numbers containing the grouping mark are read in incorrectly: in the examples below, "1’234" is parsed as 1 instead of 1234. I have submitted an issue with {readr} here: tidyverse/readr#1459

mark <- "’"
charToRaw(mark)
#> [1] e2 80 99

data <-
  data.frame(
    isin = c("US13872348", "US13872348"),
    market_value = c("1234", paste0("1", mark, "234")),
    currency = c("USD", "USD")
  )

write.csv(data, file = "test_1252.csv", fileEncoding = "windows-1252", row.names = FALSE)
write.csv(data, file = "test_utf8.csv", fileEncoding = "UTF-8", row.names = FALSE)

pacta.portfolio.import::read_portfolio_csv("test_1252.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348            1 USD
pacta.portfolio.import::read_portfolio_csv("test_utf8.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD


mark <- "\U00A0"
charToRaw(mark)
#> [1] c2 a0

data <-
  data.frame(
    isin = c("US13872348", "US13872348"),
    market_value = c("1234", paste0("1", mark, "234")),
    currency = c("USD", "USD")
  )

write.csv(data, file = "test_1252.csv", fileEncoding = "windows-1252", row.names = FALSE)
write.csv(data, file = "test_utf8.csv", fileEncoding = "UTF-8", row.names = FALSE)

pacta.portfolio.import::read_portfolio_csv("test_1252.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348            1 USD
pacta.portfolio.import::read_portfolio_csv("test_utf8.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD


mark <- "*"
charToRaw(mark)
#> [1] 2a

data <-
  data.frame(
    isin = c("US13872348", "US13872348"),
    market_value = c("1234", paste0("1", mark, "234")),
    currency = c("USD", "USD")
  )

write.csv(data, file = "test_1252.csv", fileEncoding = "windows-1252", row.names = FALSE)
write.csv(data, file = "test_utf8.csv", fileEncoding = "UTF-8", row.names = FALSE)

pacta.portfolio.import::read_portfolio_csv("test_1252.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD
pacta.portfolio.import::read_portfolio_csv("test_utf8.csv")
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD

One potential workaround, to avoid waiting for an upstream fix, would be to convert any file encoded in something other than ASCII/UTF-8 to UTF-8 before trying to import it, e.g.

mark <- "’"
charToRaw(mark)
#> [1] e2 80 99

data <-
  data.frame(
    isin = c("US13872348", "US13872348"),
    market_value = c("1234", paste0("1", mark, "234")),
    currency = c("USD", "USD")
  )

filepath <- "test_1252.csv"
write.csv(data, file = filepath, fileEncoding = "windows-1252", row.names = FALSE)

pacta.portfolio.import::read_portfolio_csv(filepath)
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348            1 USD

encoding <- pacta.portfolio.import:::guess_file_encoding(filepath)
command <- sprintf("iconv -f %s -t utf8 %s > tmpfile && mv -f tmpfile %s", encoding, shQuote(filepath), shQuote(filepath))
system(command)
pacta.portfolio.import::read_portfolio_csv(filepath)
#> # A tibble: 2 × 3
#>   isin       market_value currency
#>   <chr>             <dbl> <chr>   
#> 1 US13872348         1234 USD     
#> 2 US13872348         1234 USD
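
The same conversion could also be done without shelling out to `iconv`. The following is only a sketch in base R: `convert_file_to_utf8()` is a hypothetical helper, not something that exists in pacta.portfolio.import, and it assumes the file fits in memory and contains no embedded NUL bytes (so it would not work for UTF-16 input).

```r
# Hypothetical helper (not part of pacta.portfolio.import): re-encode a text
# file to UTF-8 in place, using only base R.
convert_file_to_utf8 <- function(filepath, from) {
  # Read the raw bytes so nothing is reinterpreted on the way in.
  bytes <- readBin(filepath, what = "raw", n = file.size(filepath))

  # rawToChar() errors on embedded NULs, hence the assumption stated above.
  text <- rawToChar(bytes)

  # iconv() returns NA when the input is not valid in the `from` encoding.
  converted <- iconv(text, from = from, to = "UTF-8")
  if (is.na(converted)) {
    stop("could not convert ", filepath, " from ", from, " to UTF-8")
  }

  # Write the UTF-8 bytes back as-is, without any further translation.
  writeBin(charToRaw(converted), filepath)
  invisible(filepath)
}
```

It would slot in where the `system(command)` call is above, e.g. `convert_file_to_utf8(filepath, from = encoding)`, with `encoding` coming from `guess_file_encoding()` as before.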
@cjyetman added the bug and priority labels Jan 17, 2023
@cjyetman
Member Author

This was prompted by a portfolio using "right single quotation mark" as the grouping mark.

right single quotation mark

Unicode code point (hex): 2019 (U+2019)
UTF-8 bytes (hex): E2 80 99
windows-1252 code point (hex): 92 ("\x92")
windows-1252 code point (decimal): 146

"’"
#> [1] "’"
charToRaw("’")
#> [1] e2 80 99
iconv("’", from = "UTF-8", to = "windows-1252")
#> [1] "\x92"
iconv("\x92", from = "windows-1252", to = "UTF-8")
#> [1] "’"
`Encoding<-`("\x92", "latin1")
#> [1] "’"
0x92
#> [1] 146
rawToChar(as.raw(146))
#> [1] "\x92"
rawToChar(as.raw(0x92))
#> [1] "\x92"
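
To make the byte-level difference concrete, here is a small base-R sketch. It only shows that the three UTF-8 bytes of the mark never occur in the windows-1252 encoded value; whether a byte-wise comparison like this is what actually goes wrong inside {readr} is an assumption, not something confirmed here. `contains_bytes()` is a hypothetical helper written for this illustration.

```r
mark_utf8  <- charToRaw("\u2019")      # e2 80 99
value_utf8 <- charToRaw("1\u2019234")  # 31 e2 80 99 32 33 34
value_1252 <- charToRaw(
  iconv("1\u2019234", from = "UTF-8", to = "windows-1252")
)                                      # 31 92 32 33 34

# Hypothetical helper: does the byte sequence `needle` occur in `haystack`?
contains_bytes <- function(haystack, needle) {
  n <- length(needle)
  if (n == 0 || length(haystack) < n) {
    return(FALSE)
  }
  starts <- seq_len(length(haystack) - n + 1)
  any(vapply(
    starts,
    function(i) identical(haystack[i:(i + n - 1)], needle),
    logical(1)
  ))
}

contains_bytes(value_utf8, mark_utf8)
#> [1] TRUE
contains_bytes(value_1252, mark_utf8)
#> [1] FALSE
```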
