Introduce default_locale2() #1445

Open
dpprdan opened this issue Oct 27, 2022 · 3 comments
Labels
feature a feature request or enhancement locale 🌏

Comments

@dpprdan
Contributor

dpprdan commented Oct 27, 2022

I’ve recently run into the “read_csv2() with dot as decimal separator” issue that has been mentioned here several times before.
Basically, the problem is that read_csv2() cannot handle CSV files that use a semicolon as the delimiter and a dot as the decimal separator.

library(readr)
read_csv2(I("a;b\n1.0;2.0"), locale = locale(decimal_mark = "."))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> num (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1    10    20

The reply so far has been (paraphrasing): “This is not what read_csv2() is intended for, because you don’t have a comma as decimal separator. Use read_delim(delim = ";") instead.”
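For reference, that workaround looks roughly like the following. This is a minimal sketch using the same inline data as above; `show_col_types = FALSE` is only there to keep the output short.

```r
library(readr)

# read_delim() does not override the locale, so the default decimal mark "."
# is applied as-is to the semicolon-delimited input.
df <- read_delim(I("a;b\n1.0;2.0"), delim = ";", show_col_types = FALSE)
df
```

It does the job, but it means giving up the read_csv2() shortcut for any file that does not use the comma as decimal separator.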

I think this reasoning and its current implementation are problematic.

First, overriding the locale() setting is quite smelly (sorry, Hadley!)

readr/R/read_delim.R

Lines 333 to 337 in 41447ce

if (locale$decimal_mark == ".") {
  cli::cli_alert_info("Using {.val ','} as decimal and {.val '.'} as grouping mark. Use {.fn read_delim} for more control.")
  locale$decimal_mark <- ","
  locale$grouping_mark <- "."
}

Because of that, read_csv2() does not apply custom decimal_mark and grouping_mark locale settings.
This affects not only the dot as decimal mark, as shown above: a custom grouping_mark is ignored by read_csv2() as well:

read_csv2(I("a;b\n1,8;2'345"), locale = locale(grouping_mark = "'"))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (1): b
#> dbl (1): a
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a b    
#>   <dbl> <chr>
#> 1   1.8 2'345

Compare this with read_csv(), which respects the custom grouping_mark:

read_csv(I("a,b\n1.8,2'345"), locale = locale(grouping_mark = "'"))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (1): a
#> num (1): b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.8  2345

In addition, the warning message is confusing (e.g. it is not clear that this is indeed an override) and easily overlooked among the other messages. (Can you spot it in the read_csv2() output above? It’s not obvious IMO, even with this minimal column spec.)

Even worse, this warning is always shown by default. See the read_csv2() example:

read_csv2(I("a;b\n1,0;2,0"))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

This is because default_locale() implies decimal_mark = ".", which read_csv2() then has to override and throw the warning. But this essentially means that the locale = default_locale() default in read_csv2() is useless, because to get rid of the warning, you have to define a different locale() anyway.

read_csv2(I("a;b\n1,0;2,0"), locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

Conceptually, it would be more sound to treat read_csv2() the same as read_csv() and read_tsv():

read_csv() and read_tsv() are special cases of the more general read_delim().
They’re useful for reading the most common types of flat file data, comma separated values and tab separated values, respectively.

I.e. set the delimiter, but grant flexibility on the decimal separator.

E.g. with read_tsv() you can easily do the following:

read_tsv(I("a\tb\n1,1\t2,2"), locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.1   2.2

read_csv2() would of course need a different default locale/decimal separator (but it needs that anyway, because the current default is useless): either a default_locale2() or locale(decimal_mark = ",", grouping_mark = ".").
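To make the idea concrete, here is a rough sketch of what I have in mind. Note that neither function exists in readr: `default_locale2()` is the proposed helper, and `read_csv2_flexible()` is just a hypothetical stand-in for a read_csv2() that sets the delimiter and the default locale but does not override what the user passes.

```r
library(readr)

# Hypothetical helper (not part of readr): the proposed default locale
# for read_csv2(), with comma as decimal and dot as grouping mark.
default_locale2 <- function() {
  locale(decimal_mark = ",", grouping_mark = ".")
}

# Hypothetical stand-in for read_csv2(): fixes the delimiter to ";" and
# passes the locale through unchanged, the way read_csv() and read_tsv() do.
read_csv2_flexible <- function(file, locale = default_locale2(), ...) {
  read_delim(file, delim = ";", locale = locale, ...)
}

# Default: comma as decimal separator, no override message needed.
x1 <- read_csv2_flexible(I("a;b\n1,5;2,5"), show_col_types = FALSE)

# Dot as decimal separator is respected when requested.
x2 <- read_csv2_flexible(
  I("a;b\n1.5;2.5"),
  locale = locale(decimal_mark = "."),
  show_col_types = FALSE
)

# A custom grouping mark is respected as well.
x3 <- read_csv2_flexible(
  I("a;b\n1,8;2'345"),
  locale = locale(decimal_mark = ",", grouping_mark = "'"),
  show_col_types = FALSE
)
```

With a default like this, the override (and its message) in read_csv2() would no longer be necessary.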

Is that something you would consider? Happy to draft a PR.

One last thing:

This format is common in some European countries.
https://readr.tidyverse.org/reference/read_delim.html

It’s not just “some European countries”. Half the world uses the comma as a decimal separator and while I don’t have any numbers on the semicolon as delimiter in CSVs, it is clear that it cannot be the comma in those countries. Maybe it’s just me, but frankly that line also sounds a tad dismissive to me - I know it’s not meant that way, but still.

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-10-27
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.2.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  cli           3.4.1   2022-09-23 [1] CRAN (R 4.2.1)
#>  crayon        1.5.2   2022-09-29 [1] RSPM
#>  digest        0.6.30  2022-10-18 [1] CRAN (R 4.2.1)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.17    2022-10-07 [1] RSPM
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms           1.1.2   2022-08-19 [1] CRAN (R 4.2.1)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
#>  knitr         1.40    2022-08-24 [1] CRAN (R 4.2.1)
#>  lifecycle     1.0.3   2022-10-07 [1] RSPM
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.5   2022-10-06 [1] RSPM
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.0  2022-06-28 [1] CRAN (R 4.2.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.3   2022-10-01 [1] CRAN (R 4.2.1)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.1)
#>  rmarkdown     2.17    2022-10-07 [1] RSPM
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
#>  styler        1.8.0   2022-10-22 [1] CRAN (R 4.2.1)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyselect    1.2.0   2022-10-10 [1] RSPM
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.5.0   2022-10-22 [1] CRAN (R 4.2.1)
#>  vroom         1.6.0   2022-09-30 [1] RSPM
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.34    2022-10-18 [1] CRAN (R 4.2.1)
#>  yaml          2.3.6   2022-10-18 [1] CRAN (R 4.2.1)
#> 
#>  [1] C:/Users/Daniel/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@dpprdan
Contributor Author

dpprdan commented Oct 27, 2022

Edit: This was/is a separate, albeit related, issue #1468.

@dpprdan
Contributor Author

dpprdan commented Feb 26, 2023

@phish108 has brought to my attention that institutions in Switzerland, like the Federal Statistical Office, publish at least some CSV files with semicolon as delimiter and dot as decimal sign.

Here is one example (metadata)

library(readr)
ch_url <- "https://www.bfs.admin.ch/bfsstatic/dam/assets/21765619/master"
read_lines(ch_url, n_max = 5)
#> [1] "\"TIME_PERIOD\";\"GEO\";\"POP\";\"ERWP\";\"ERWL\";\"UNIT_MEA\";\"OBS_VALUE\";\"OBS_CONFIDENCE\";\"OBS_STATUS\""
#> [2] "\"2010/2012\";\"101\";\"Total\";\"0\";\"Total\";\"pers\";12374.03;5.620;\"A\""                                 
#> [3] "\"2010/2012\";\"101\";\"Total\";\"1\";\"Total\";\"pers\";28394.11;3.718;\"A\""                                 
#> [4] "\"2010/2012\";\"101\";\"Total\";\"Total\";\"Total\";\"pers\";40768.14;3.074;\"A\""                             
#> [5] "\"2010/2012\";\"102\";\"Total\";\"0\";\"Total\";\"pers\";7288.64;7.321;\"A\""

Another example (metadata):

ch_url2 <- "https://www.web.statistik.zh.ch/ogd/data/KANTON_ZUERICH_375.csv"
read_lines(ch_url2, n_max = 5)
#> [1] "BFS_NR;GEBIET_NAME;THEMA_NAME;SET_NAME;SUBSET_NAME;INDIKATOR_ID;INDIKATOR_NAME;INDIKATOR_JAHR;INDIKATOR_VALUE;EINHEIT_KURZ;EINHEIT_LANG;"            
#> [2] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1990;2.6;Mio.Fr.;Millionen Franken;"
#> [3] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1991;3.0;Mio.Fr.;Millionen Franken;"
#> [4] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1992;3.3;Mio.Fr.;Millionen Franken;"
#> [5] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1993;3.6;Mio.Fr.;Millionen Franken;"

Apparently this spec is not used consistently. Here is an example from the Statistical Office where they use comma as delimiter (metadata)

ch_url3 <- "https://dam-api.bfs.admin.ch/hub/api/dam/assets/24106318/master"
read_lines(ch_url3, n_max = 5)
#> [1] "\"REGION\",\"CANTON\",\"PERIOD\",\"VALUE\",\"STATUS\",\"OBS_COEF\""
#> [2] "\"Total\",\"Total\",\"2023-01\",\"2.19208880770041\",\"A\",\"A\""  
#> [3] "\"1\",\"Total\",\"2023-01\",\"3.47967832561423\",\"A\",\"A\""      
#> [4] "\"1\",\"22\",\"2023-01\",\"3.49266747468504\",\"A\",\"A\""         
#> [5] "\"1\",\"23\",\"2023-01\",\"2.97652113777818\",\"A\",\"A\""

But be that as it may, why not support semicolon as delimiter and dot as decimal sign with read_csv2(), when even official institutions publish files like that?

@hadley hadley changed the title from “make read_csv2() flexible re decimal separator” to “Introduce defaul_locale2()” Aug 1, 2023
@hadley hadley added feature a feature request or enhancement locale 🌏 labels Aug 1, 2023
@hadley
Member

hadley commented Aug 1, 2023

Should fix at same time as #1468

@hadley hadley changed the title from “Introduce defaul_locale2()” to “Introduce default_locale2()” Aug 1, 2023