Introduce `default_locale2()` #1445

dpprdan · 2022-10-27T16:34:28Z

I’ve recently run into the read_csv2() issue with a dot as decimal separator, which has been mentioned here several times before.
Basically, the problem is that read_csv2() cannot handle CSV files that use a semicolon as delimiter and a dot as decimal separator.

library(readr)
read_csv2(I("a;b\n1.0;2.0"), locale = locale(decimal_mark = "."))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> num (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1    10    20

The reply so far has been (paraphrasing) “This is not what read_csv2() is intended for, because you don’t have a comma as decimal separator. Use read_delim(delim = ";") instead.”
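For reference, the suggested workaround can be sketched as follows. This is a minimal example that reuses the inline data from above; `show_col_types = FALSE` is only there to suppress the column specification message.

```r
library(readr)

# Same semicolon-delimited input as above, with a dot as decimal mark.
# read_delim() does not touch the locale, so the dot is read as a decimal mark.
x <- read_delim(
  I("a;b\n1.0;2.0"),
  delim = ";",
  locale = locale(decimal_mark = "."),
  show_col_types = FALSE
)
x
```

Here `a` and `b` come back as 1 and 2 rather than 10 and 20.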

I think this reasoning and its current implementation are problematic.

First, overriding the locale() setting is quite smelly (sorry, Hadley!)

readr/R/read_delim.R

Lines 333 to 337 in 41447ce

if (locale$decimal_mark == ".") {
  cli::cli_alert_info("Using {.val ','} as decimal and {.val '.'} as grouping mark. Use {.fn read_delim} for more control.")
  locale$decimal_mark <- ","
  locale$grouping_mark <- "."
}

Because of that, read_csv2() does not apply custom decimal_mark and grouping_mark locale settings.
Not only is the dot ignored as a decimal mark, as shown above, but a custom grouping_mark is ignored as well:

read_csv2(I("a;b\n1,8;2'345"), locale = locale(grouping_mark = "'"))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (1): b
#> dbl (1): a
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a b    
#>   <dbl> <chr>
#> 1   1.8 2'345

Compare this with read_csv(), which does respect the grouping_mark:

read_csv(I("a,b\n1.8,2'345"), locale = locale(grouping_mark = "'"))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (1): a
#> num (1): b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.8  2345
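To round off the comparison, here is a sketch of how the semicolon-delimited input can currently be read with a custom grouping mark via read_delim(). The column types are spelled out explicitly so that the example does not depend on type guessing; that is a choice made for this sketch, not something the issue requires.

```r
library(readr)

# Semicolon as delimiter, comma as decimal mark, apostrophe as grouping mark.
# col_number() is the parser that honours the locale's grouping_mark.
y <- read_delim(
  I("a;b\n1,8;2'345"),
  delim = ";",
  locale = locale(decimal_mark = ",", grouping_mark = "'"),
  col_types = cols(a = col_double(), b = col_number())
)
y
```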

In addition, the warning message is confusing (e.g. it is not clear that it announces an override) and is easily overlooked among the other messages. (Can you spot it immediately above? It’s not obvious IMO, even with this minimal column spec.)

Even worse, this warning is always shown by default. See the read_csv2() example:

read_csv2(I("a;b\n1,0;2,0"))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

This is because default_locale() implies decimal_mark = ".", which read_csv2() then has to override and throw the warning. But this essentially means that the locale = default_locale() default in read_csv2() is useless, because to get rid of the warning, you have to define a different locale() anyway.

read_csv2(I("a;b\n1,0;2,0"), locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
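The mechanism can be checked directly on the locale objects. This is a small sketch that mirrors the condition from read_delim.R quoted earlier; it assumes that no custom default locale has been set via options.

```r
library(readr)

# The default locale uses a dot as decimal mark ...
default_locale()$decimal_mark

# ... so the condition that read_csv2() checks before overriding is TRUE
# whenever the locale argument is left at its default.
default_locale()$decimal_mark == "."

# With decimal_mark = "," the condition is FALSE, so nothing is overridden
# and no message is shown.
loc <- locale(decimal_mark = ",")
loc$decimal_mark == "."
```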

Conceptually, it would be more sound to treat read_csv2() the same as read_csv() and read_tsv():

read_csv() and read_tsv() are special cases of the more general read_delim().
They’re useful for reading the most common types of flat file data, comma separated values and tab separated values, respectively.

I.e. set the delimiter, but grant flexibility on the decimal separator.

E.g. with read_tsv() you can easily do the following:

read_tsv(I("a\tb\n1,1\t2,2"), locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.1   2.2

read_csv2() would need a different locale/decimal separator default of course (but would need that anyway, because the current default is useless). Either a default_locale2() or ~~locale(decimal_mark = ",")~~ locale(decimal_mark = ",", grouping_mark = ".").
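A rough sketch of what I have in mind. Both `default_locale2()` and `read_csv2_sketch()` are hypothetical names that do not exist in readr; the wrapper merely stands in for a read_csv2() that fixes the delimiter and passes the locale through unchanged.

```r
library(readr)

# Hypothetical, not part of readr: a default locale for semicolon-delimited files.
default_locale2 <- function() {
  locale(decimal_mark = ",", grouping_mark = ".")
}

# Hypothetical stand-in for read_csv2(): only the delimiter is fixed,
# the locale is used exactly as supplied.
read_csv2_sketch <- function(file, locale = default_locale2(), ...) {
  read_delim(file, delim = ";", locale = locale, ...)
}

# Comma as decimal mark works out of the box, with no override message.
d1 <- read_csv2_sketch(I("a;b\n1,5;2,5"), show_col_types = FALSE)

# A dot as decimal mark is respected when requested explicitly.
d2 <- read_csv2_sketch(
  I("a;b\n1.5;2.5"),
  locale = locale(decimal_mark = "."),
  show_col_types = FALSE
)
```

With a default like this, the override in read_delim.R would no longer be needed, since the default already carries the comma as decimal mark.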

Is that something you would consider? Happy to draft a PR.

One last thing:

This format is common in some European countries.
https://readr.tidyverse.org/reference/read_delim.html

It’s not just “some European countries”. Half the world uses the comma as a decimal separator and while I don’t have any numbers on the semicolon as delimiter in CSVs, it is clear that it cannot be the comma in those countries. Maybe it’s just me, but frankly that line also sounds a tad dismissive to me - I know it’s not meant that way, but still.

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-10-27
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.2.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  cli           3.4.1   2022-09-23 [1] CRAN (R 4.2.1)
#>  crayon        1.5.2   2022-09-29 [1] RSPM
#>  digest        0.6.30  2022-10-18 [1] CRAN (R 4.2.1)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.17    2022-10-07 [1] RSPM
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms           1.1.2   2022-08-19 [1] CRAN (R 4.2.1)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
#>  knitr         1.40    2022-08-24 [1] CRAN (R 4.2.1)
#>  lifecycle     1.0.3   2022-10-07 [1] RSPM
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.5   2022-10-06 [1] RSPM
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.0  2022-06-28 [1] CRAN (R 4.2.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.3   2022-10-01 [1] CRAN (R 4.2.1)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.1)
#>  rmarkdown     2.17    2022-10-07 [1] RSPM
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
#>  styler        1.8.0   2022-10-22 [1] CRAN (R 4.2.1)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyselect    1.2.0   2022-10-10 [1] RSPM
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.5.0   2022-10-22 [1] CRAN (R 4.2.1)
#>  vroom         1.6.0   2022-09-30 [1] RSPM
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.34    2022-10-18 [1] CRAN (R 4.2.1)
#>  yaml          2.3.6   2022-10-18 [1] CRAN (R 4.2.1)
#> 
#>  [1] C:/Users/Daniel/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


dpprdan · 2022-10-27T18:04:13Z

Edit: This was/is a separate, albeit related, issue #1468.

dpprdan · 2023-02-26T21:35:38Z

@phish108 has brought to my attention that institutions in Switzerland, like the Federal Statistical Office, publish at least some CSV files with semicolon as delimiter and dot as decimal sign.

Here is one example (metadata)

library(readr)
ch_url <- "https://www.bfs.admin.ch/bfsstatic/dam/assets/21765619/master"
read_lines(ch_url, n_max = 5)
#> [1] "\"TIME_PERIOD\";\"GEO\";\"POP\";\"ERWP\";\"ERWL\";\"UNIT_MEA\";\"OBS_VALUE\";\"OBS_CONFIDENCE\";\"OBS_STATUS\""
#> [2] "\"2010/2012\";\"101\";\"Total\";\"0\";\"Total\";\"pers\";12374.03;5.620;\"A\""                                 
#> [3] "\"2010/2012\";\"101\";\"Total\";\"1\";\"Total\";\"pers\";28394.11;3.718;\"A\""                                 
#> [4] "\"2010/2012\";\"101\";\"Total\";\"Total\";\"Total\";\"pers\";40768.14;3.074;\"A\""                             
#> [5] "\"2010/2012\";\"102\";\"Total\";\"0\";\"Total\";\"pers\";7288.64;7.321;\"A\""

Another example (metadata)

ch_url2 <- "https://www.web.statistik.zh.ch/ogd/data/KANTON_ZUERICH_375.csv"
read_lines(ch_url2, n_max = 5)
#> [1] "BFS_NR;GEBIET_NAME;THEMA_NAME;SET_NAME;SUBSET_NAME;INDIKATOR_ID;INDIKATOR_NAME;INDIKATOR_JAHR;INDIKATOR_VALUE;EINHEIT_KURZ;EINHEIT_LANG;"            
#> [2] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1990;2.6;Mio.Fr.;Millionen Franken;"
#> [3] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1991;3.0;Mio.Fr.;Millionen Franken;"
#> [4] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1992;3.3;Mio.Fr.;Millionen Franken;"
#> [5] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1993;3.6;Mio.Fr.;Millionen Franken;"

Apparently this spec is not used consistently. Here is an example from the Statistical Office where they use comma as delimiter (metadata)

ch_url3 <- "https://dam-api.bfs.admin.ch/hub/api/dam/assets/24106318/master"
read_lines(ch_url3, n_max = 5)
#> [1] "\"REGION\",\"CANTON\",\"PERIOD\",\"VALUE\",\"STATUS\",\"OBS_COEF\""
#> [2] "\"Total\",\"Total\",\"2023-01\",\"2.19208880770041\",\"A\",\"A\""  
#> [3] "\"1\",\"Total\",\"2023-01\",\"3.47967832561423\",\"A\",\"A\""      
#> [4] "\"1\",\"22\",\"2023-01\",\"3.49266747468504\",\"A\",\"A\""         
#> [5] "\"1\",\"23\",\"2023-01\",\"2.97652113777818\",\"A\",\"A\""

But be that as it may, why not support semicolon as delimiter and dot as decimal sign with read_csv2(), when even official institutions publish files like that?

hadley · 2023-08-01T12:22:36Z

Should fix at same time as #1468

phish108 mentioned this issue Nov 21, 2022

documentation issue for read_delim(), retire read_csv() and read_csv2() #1452

Closed

dpprdan mentioned this issue Feb 7, 2023

read_csv2_chunked() needs to adjust locale like read_csv2() #1468

Open

hadley changed the title ~~make read_csv2() flexible re decimal separator~~ Introduce defaul_locale2() Aug 1, 2023

hadley added the "feature" (a feature request or enhancement) and "locale 🌏" labels Aug 1, 2023

hadley changed the title ~~Introduce defaul_locale2()~~ Introduce default_locale2() Aug 1, 2023
