
documentation issue for read_delim(), retire read_csv() and read_csv2() #1452

Closed
phish108 opened this issue Nov 21, 2022 · 12 comments

@phish108

This issue is directly related to #1445, but separate from it in that it is a documentation issue.

As noted, the read_csv*() family does not allow locales to be overridden properly. In my case I cannot set the Swiss locale ("ch", "de-ch", "fr-ch", or "it-ch"); none of those locales is recognised by locale(). Naturally, Swiss users would reach for read_csv2(), as many CSV files are semicolon separated, particularly those exported from Excel under these locales. In these cases, users need to be able to flip the decimal_mark and the grouping_mark. The documentation wrongly suggests the locale parameter for such cases.
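To make this concrete, a minimal sketch (the file name is a placeholder; the locale() error is the one reported further down in this thread for "de_CH"):

library(readr)

# None of the Swiss locale strings is recognised by locale():
try(locale("de-ch"))  # errors with "Unknown language ..."

# So the marks have to be set individually instead, e.g. for a
# semicolon-separated file that uses "." as decimal and "," as grouping mark:
read_delim("some_swiss_file.csv", delim = ";",
           locale = locale(decimal_mark = ".", grouping_mark = ","))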

However, read_delim() appears smart enough to handle pretty much all cases correctly without extra parameters.

In this respect, the documentation is misleading by being overly concise. It suggests that read_csv() and read_csv2() are more than mere convenience functions for read.csv() and read.csv2() and stresses the use of the locale parameter. However, in all of my cases every function pair in the code below leads to the same results, apart from read_csv2(). Therefore, I have the impression that read_delim() offers the best way to import CSV files correctly.

library(readr)

myDataA1 <- read_delim("some_komma_file.csv")
myDataA2 <- read_csv("some_komma_file.csv")

myDataB1 <- read_delim("some_semicolon_file.csv") # no problems with the Swiss locale
myDataB2 <- read_csv2("some_semicolon_file.csv")  # breaks with the Swiss locale, no matter what

myDataC1 <- read_delim("some_tabbed_file.tsv")
myDataC2 <- read_tsv("some_tabbed_file.tsv")

This smart behaviour of read_delim() should be stated more clearly in the reference documentation, because it makes life much easier: only one function is needed instead of three. This would enhance the learnability of this part of the tidyverse.

As the documentation stands right now, the examples emphasize the other functions instead of highlighting the smartness of read_delim(). However, it appears to me that read_delim() should be used for reading pretty much any delimited text file, not just to gain more control. The other functions make sense for (rare?) edge cases where the data is contradictory regarding the delimiter, but even then the parameters of read_delim() appear more useful. Therefore, I suggest considering a lifecycle update to the other three functions.

@phish108
Author

phish108 commented Nov 21, 2022

Related to #1411: read_delim() is not smart enough to handle single-column sources.

library(readr)
lines <- "C1
2011D2"
read_delim(lines) # Error: Could not guess the delimiter.
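A possible workaround, not part of the original report: give read_delim() an explicit delimiter, so it does not have to guess one for single-column input.

# Workaround sketch: with an explicit delimiter no guessing is attempted,
# and the single column is read as-is.
read_delim(I("C1\n2011D2"), delim = ",")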

@dpprdan
Contributor

dpprdan commented Feb 7, 2023

While I was 🤯 when I first read this, on further inspection I am not convinced this is the right approach (not that my view matters much).

You’re right that read_delim() is pretty good at guessing the delimiter (with a dot as decimal mark).

library(readr)
read_delim(I("a,b\n1.1,2.2")) # comma
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.1   2.2
read_delim(I("a;b\n1.1;2.2")) # semicolon
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.1   2.2
read_delim(I("a\tb\n1.1\t2.2")) # tab
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.1   2.2

But it fails to parse decimal numbers correctly when the decimal mark is a comma.

read_delim(I("a;b\n1,1;2,2")) # semicolon
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> num (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1    11    22
read_delim(I("a\tb\n1,1\t2,2")) # tab
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> num (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1    11    22

This can of course be fixed by specifying locale. But not having to specify the decimal mark (and separator) is the whole point of read_csv() and friends.

read_delim(I("a;b\n1,1;2,2"), locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.1   2.2

> In these cases (semicolon-separated CSVs), users would need to be able to flip the decimal_mark and the grouping_mark. The documentation wrongly suggests the locale parameter for such cases.

AFAICT the documentation is correct here, i.e. this is exactly how it works.
All read_*() functions use default_locale() as the default argument for locale, which uses a dot as the decimal mark and a comma as the grouping mark.

default_locale()
#> <locale>
#> Numbers:  123,456.78
#> Formats:  %AD / %AT
#> Timezone: UTC
#> Encoding: UTF-8
#> <date_names>
#> Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
#>         (Thu), Friday (Fri), Saturday (Sat)
#> Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
#>         June (Jun), July (Jul), August (Aug), September (Sep), October
#>         (Oct), November (Nov), December (Dec)
#> AM/PM:  AM/PM
read_delim(I("a;b\n1,800;2.345"))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (1): b
#> num (1): a
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1  1800  2.35

(Because of #1445 it is not possible to use read_csv2() for this, but as I argue over there, I believe it should be.)
If you want to flip the decimal_mark and the grouping_mark, you need to specify it with locale.

read_delim(I("a;b\n1,800;2.345"), locale = locale(decimal_mark = ",", grouping_mark = "."))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (1): a
#> num (1): b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.8  2345

Or use read_csv2()

read_csv2(I("a;b\n1,800;2.345"))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (1): a
#> num (1): b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.8  2345

> As noted, the read_csv*() family does not allow locales to be overridden properly.

This is only the case for read_csv2(), AFAICT. read_csv() even accepts locale(decimal_mark = ",") 😮

read_csv(I("a,b\n1_800,2_345"), locale = locale(decimal_mark = ",", grouping_mark = "_"))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> num (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1  1800  2345

> I cannot set the Swiss locale (“ch”, “de-ch”, “fr-ch”, or “it-ch”); none of those locales is recognised by locale().

This is technically true.

locale("de_CH")
#> Error: Unknown language 'de_CH'

However, locale() uses the language code only for the date_names argument, not for the other settings like decimal_mark and grouping_mark. (Whether locale() should accept extended locale strings for date_names is a different topic.)
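A small sketch of that split (not from the comment above): the language code only swaps the date names, while the number marks are separate arguments.

library(readr)

# German day and month names, but numbers are still parsed as 123,456.78:
locale("de")

# The decimal and grouping marks have to be set explicitly:
locale("de", decimal_mark = ",", grouping_mark = ".")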

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2023-02-07
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package     * version    date (UTC) lib source
#>  D archive       1.1.5      2022-05-06 [1] CRAN (R 4.2.2)
#>    bit           4.0.5      2022-11-15 [1] CRAN (R 4.2.2)
#>    bit64         4.0.5      2020-08-30 [1] CRAN (R 4.2.0)
#>    cli           3.6.0      2023-01-09 [1] CRAN (R 4.2.2)
#>    crayon        1.5.2      2022-09-29 [1] RSPM
#>    digest        0.6.31     2022-12-11 [1] CRAN (R 4.2.2)
#>    ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>    evaluate      0.20       2023-01-17 [1] CRAN (R 4.2.2)
#>    fansi         1.0.4      2023-01-22 [1] CRAN (R 4.2.2)
#>    fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>    fs            1.6.0      2023-01-23 [1] CRAN (R 4.2.2)
#>    glue          1.6.2.9000 2023-01-16 [1] Github (tidyverse/glue@5a16502)
#>    hms           1.1.2      2022-08-19 [1] CRAN (R 4.2.1)
#>    htmltools     0.5.4      2022-12-07 [1] CRAN (R 4.2.2)
#>    knitr         1.42       2023-01-25 [1] CRAN (R 4.2.2)
#>    lifecycle     1.0.3      2022-10-07 [1] RSPM
#>    magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>    pillar        1.8.1      2022-08-19 [1] CRAN (R 4.2.1)
#>    pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>    purrr         1.0.1      2023-01-10 [1] CRAN (R 4.2.2)
#>    R.cache       0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>    R.methodsS3   1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>    R.oo          1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>    R.utils       2.12.2     2022-11-11 [1] CRAN (R 4.2.2)
#>    R6            2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>    readr       * 2.1.3      2022-10-01 [1] CRAN (R 4.2.1)
#>    reprex        2.0.2      2022-08-17 [1] CRAN (R 4.2.1)
#>    rlang         1.0.6      2022-09-24 [1] CRAN (R 4.2.1)
#>    rmarkdown     2.20       2023-01-19 [1] CRAN (R 4.2.2)
#>    rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.2.1)
#>    sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>    styler        1.9.0      2023-01-15 [1] CRAN (R 4.2.2)
#>    tibble        3.1.8      2022-07-22 [1] CRAN (R 4.2.1)
#>    tidyselect    1.2.0      2022-10-10 [1] RSPM
#>    tzdb          0.3.0      2022-03-28 [1] CRAN (R 4.2.0)
#>    utf8          1.2.3      2023-01-31 [1] CRAN (R 4.2.2)
#>    vctrs         0.5.2      2023-01-23 [1] CRAN (R 4.2.2)
#>    vroom         1.6.1      2023-01-22 [1] CRAN (R 4.2.2)
#>    withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>    xfun          0.37       2023-01-31 [1] CRAN (R 4.2.2)
#>    yaml          2.3.7      2023-01-23 [1] CRAN (R 4.2.2)
#> 
#>  [1] C:/Users/Daniel/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.2/library
#> 
#>  D ── DLL MD5 mismatch, broken installation.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

@phish108
Author

@dpprdan thank you for your analysis. I share your experience but not your conclusions.

The point is that the docs focus almost entirely on read_csv() and read_csv2(). The docs also state that one needs to use the locale to change details like the decimal separator.

This is true for read_delim() but not for read_csv(), read_csv2(), and read_tsv(). Therefore, the docs are simply incorrect, because the functions do not work that way, as you remarked yourself. Moreover, the docs somewhat suggest that read_delim() should be the weapon of last resort and therefore present no examples of its proper use.

I ran the functions against several open datasets from the Swiss, Austrian, and German governments, plus our own CSV files exported from Excel under the Swiss and Austrian/German locales. There were no problems with read_csv() and read_csv2() for the Austrian and German data, where the comma is the decimal mark and the semicolon is the column separator. However, the same files also parsed absolutely fine with read_delim() with no parameters other than the file name.

The situation is different under 'de_CH': none of the Swiss datasets is parsed properly with read_csv2() + locale(), because the Swiss use the semicolon as the column separator and the dot as the decimal mark. Processing the same files with read_delim(), there are no problems when the file name is passed as the sole argument.
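A minimal sketch of that 'de_CH' case with inline data (not one of the government files); the read_csv2() comment reflects the behaviour discussed in #1445.

library(readr)

# Swiss-style data: ";" as delimiter, "." as decimal mark.
swiss <- I("a;b\n1.5;2.5\n")

# read_delim() guesses ";" and keeps the default "." decimal mark:
read_delim(swiss, show_col_types = FALSE)

# read_csv2() forces "," as decimal and "." as grouping mark (#1445),
# so the values are likely mangled (e.g. 1.5 read as 15):
read_csv2(swiss, show_col_types = FALSE)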

As the docs fail to present the smartness of read_delim(), my Swiss students fail to resolve this problem on their own: the docs create the false impression that read_delim() is complex and requires many parameters, while read_csv() and read_csv2() are the nice fellows that should be used pretty much all the time.

I have started teaching that one should begin with read_delim(), because in 99.9% of cases it just delivers without the need to think about delimiters and decimal marks (which I find too awesome to hide, BTW). This is pretty much the opposite of what the docs state. That is bad, because I also emphasize that students should read the docs in order to find out how to use functions correctly. 😖

Only for the rough edge cases that you describe do I find the more specialised functions actually useful. I might be biased in my selection of datasets, but I suspect that the cases that only read_csv2() can resolve easily and read_delim() cannot are not as common as you and the docs suggest.

For beginners, all of this is impossible to figure out on their own simply by looking at the docs. Even for me, it took several days of repeated attempts and finally turning to the code to arrive at these insights, all because the docs point in the wrong direction.

Again, all of this could easily be resolved by adding a few statements and examples to the docs, without changing any logic.

@dpprdan
Contributor

dpprdan commented Feb 26, 2023

Here is a random file from Statistik Austria where read_delim() doesn’t work correctly without extra parameters (metadata).

library(readr)
at_url <- "https://data.statistik.gv.at/data/OGD_veste309_Veste309_1.csv"

delimiter: semicolon, decimal mark: comma

read_lines(at_url) |> head(2)
#> [1] "C-A11-0;C-STAATS-0;C-VEBDL-0;C-BESCHV-0;F-VESTE_AM;F-VESTE_Q25;F-VESTE_Q50;F-VESTE_Q75;F-VESTE_UB"
#> [2] "A11-1;STAATS-9;VEBDL-10;BESCHV-1;17,60;11,65;15,09;20,12;2650938,00"

The comma is parsed as a grouping mark, and the numbers are therefore read as x * 10^(number of decimal places):

read_delim(at_url) |> head(2)
#> Rows: 72 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (4): C-A11-0, C-STAATS-0, C-VEBDL-0, C-BESCHV-0
#> num (5): F-VESTE_AM, F-VESTE_Q25, F-VESTE_Q50, F-VESTE_Q75, F-VESTE_UB
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 9
#>   `C-A11-0` `C-STAATS-0` C-VEB…¹ C-BES…² F-VES…³ F-VES…⁴ F-VES…⁵ F-VES…⁶ F-VES…⁷
#>   <chr>     <chr>        <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 A11-1     STAATS-9     VEBDL-… BESCHV…    1760    1165    1509    2012  2.65e8
#> 2 A11-1     STAATS-9     VEBDL-… BESCHV…    1895    1278    1622    2174  1.69e8
#> # … with abbreviated variable names ¹​`C-VEBDL-0`, ²​`C-BESCHV-0`, ³​`F-VESTE_AM`,
#> #   ⁴​`F-VESTE_Q25`, ⁵​`F-VESTE_Q50`, ⁶​`F-VESTE_Q75`, ⁷​`F-VESTE_UB`

compare with read_csv2()

read_csv2(at_url)  |> head(2)
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 72 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (4): C-A11-0, C-STAATS-0, C-VEBDL-0, C-BESCHV-0
#> dbl (5): F-VESTE_AM, F-VESTE_Q25, F-VESTE_Q50, F-VESTE_Q75, F-VESTE_UB
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 9
#>   `C-A11-0` `C-STAATS-0` C-VEB…¹ C-BES…² F-VES…³ F-VES…⁴ F-VES…⁵ F-VES…⁶ F-VES…⁷
#>   <chr>     <chr>        <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 A11-1     STAATS-9     VEBDL-… BESCHV…    17.6    11.6    15.1    20.1 2650938
#> 2 A11-1     STAATS-9     VEBDL-… BESCHV…    19.0    12.8    16.2    21.7 1685788
#> # … with abbreviated variable names ¹​`C-VEBDL-0`, ²​`C-BESCHV-0`, ³​`F-VESTE_AM`,
#> #   ⁴​`F-VESTE_Q25`, ⁵​`F-VESTE_Q50`, ⁶​`F-VESTE_Q75`, ⁷​`F-VESTE_UB`

read_delim()’s parsing isn’t consistent either; here is a file where some numbers are parsed as character vectors (metadata).

at_url2 <- "https://data.statistik.gv.at/data/OGD_vpi20_VPI_2020_1.csv"
read_delim(at_url2) |> head(2)
#> Rows: 351 Columns: 10
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (8): C-VPIZR-0, C-VPI5NEU-0, F-VPIMZVM, F-VPIPZVM, F-VPIPZVJM, F-VPIEFVM...
#> num (2): F-VPIMZBM, F-VPIMZVJM
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 10
#>   `C-VPIZR-0`  C-VPI5N…¹ F-VPI…² F-VPI…³ F-VPI…⁴ F-VPI…⁵ F-VPI…⁶ F-VPI…⁷ F-VPI…⁸
#>   <chr>        <chr>       <dbl> <chr>     <dbl> <chr>   <chr>   <chr>   <chr>  
#> 1 VPIZR-202101 VPI-0      1.00e7 101,10… 9950000 -0,800… 0,80000 -0,782… 0,80000
#> 2 VPIZR-202101 VPI-01     9.76e6 100,80… 9870000 -3,200… -1,100… -0,363… -0,115…
#> # … with 1 more variable: `F-VPIGEWBM` <chr>, and abbreviated variable names
#> #   ¹​`C-VPI5NEU-0`, ²​`F-VPIMZBM`, ³​`F-VPIMZVM`, ⁴​`F-VPIMZVJM`, ⁵​`F-VPIPZVM`,
#> #   ⁶​`F-VPIPZVJM`, ⁷​`F-VPIEFVM`, ⁸​`F-VPIEFVJM`
read_csv2(at_url2) |> head(2)
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 351 Columns: 10
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): C-VPIZR-0, C-VPI5NEU-0
#> dbl (8): F-VPIMZBM, F-VPIMZVM, F-VPIMZVJM, F-VPIPZVM, F-VPIPZVJM, F-VPIEFVM,...
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 10
#>   `C-VPIZR-0`  C-VPI5N…¹ F-VPI…² F-VPI…³ F-VPI…⁴ F-VPI…⁵ F-VPI…⁶ F-VPI…⁷ F-VPI…⁸
#>   <chr>        <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 VPIZR-202101 VPI-0       100.     101.    99.5    -0.8     0.8  -0.782   0.8  
#> 2 VPIZR-202101 VPI-01       97.6    101.    98.7    -3.2    -1.1  -0.363  -0.115
#> # … with 1 more variable: `F-VPIGEWBM` <dbl>, and abbreviated variable names
#> #   ¹​`C-VPI5NEU-0`, ²​`F-VPIMZBM`, ³​`F-VPIMZVM`, ⁴​`F-VPIMZVJM`, ⁵​`F-VPIPZVM`,
#> #   ⁶​`F-VPIPZVJM`, ⁷​`F-VPIEFVM`, ⁸​`F-VPIEFVJM`

A write_csv2()/read_delim() round trip also doesn’t work.

tf <- tempfile()
write_csv2(head(mtcars, 3), tf)
read_lines(tf)
#> [1] "mpg;cyl;disp;hp;drat;wt;qsec;vs;am;gear;carb"
#> [2] "21,0;6;160;110;3,90;2,620;16,46;0;1;4;4"     
#> [3] "21,0;6;160;110;3,90;2,875;17,02;0;1;4;4"     
#> [4] "22,8;4;108;93;3,85;2,320;18,61;1;1;4;1"

Again, numbers with a comma are read as x * 10^(number of decimal places):

read_delim(tf)
#> Rows: 3 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (7): cyl, disp, hp, vs, am, gear, carb
#> num (4): mpg, drat, wt, qsec
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 3 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1   210     6   160   110   390  2620  1646     0     1     4     4
#> 2   210     6   160   110   390  2875  1702     0     1     4     4
#> 3   228     4   108    93   385  2320  1861     1     1     4     1
read_csv2(tf)
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 3 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 3 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1

I’d hardly call these edge cases. I don’t know which files you tested (you did not provide any reproducible examples), but I’ve yet to see a CSV file with a semicolon as delimiter and a comma as decimal mark that read_delim() reads correctly (and even then it is easy to come up with counterexamples, which renders a "just use read_delim()" recommendation futile, see above).

Seeing how read_delim() garbles those files, I’d even be reluctant to trust its automatic guessing with file formats that seem to work fine at first glance. Looking at the open delimiter-guessing issues in vroom does not increase my confidence. (Delimiter guessing is a feature inherited from the vroom integration.)

It certainly sucks that the “Swiss file spec” (i.e. semicolon as delimiter, dot as decimal mark) isn’t handled well by readr at the moment (at least if you follow the docs). I suspect that much of the confusion stems from the hard-coded comma as decimal mark in read_csv2() (#1445), however. If that issue were solved, read_csv2(locale = locale(decimal_mark = ".")) would work as documented.
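For reference, the call that should then work as documented would look roughly like this (the file name is a placeholder; whether the supplied decimal_mark is honoured depends on the resolution of #1445):

read_csv2("swiss_file.csv",
          locale = locale(decimal_mark = ".", grouping_mark = ","))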

@phish108
Author

phish108 commented Feb 27, 2023

Thank you, @dpprdan, for elaborating on the case. But your examples are not really a problem, because everything you describe could easily be resolved by using locale() with read_delim(), which the documentation fails to mention. Instead, the documentation suggests solutions that consistently fail to work.

As I said, I am not against read_csv() and read_csv2() per se.

Personally, I feel more inclined towards fixing the documentation than the logic. However, right now neither is very user-friendly.

@hadley
Member

hadley commented Jul 31, 2023

There's a lot of discussion. Could someone please briefly summarise the problem for me?

@dpprdan
Contributor

dpprdan commented Aug 1, 2023

IIUC, @phish108 proposes to focus the documentation on read_delim() as the primary {readr} function instead of read_csv() and read_csv2(). In fact, he suggests retiring the latter two (see title), or using them only for special cases.

The motivation is

  1. that the delimiter guessing of read_delim() introduced by the {vroom} integration works well (enough), and
  2. that read_csv2() cannot handle "Swiss-formatted" CSV files (delim = ";", decimal_mark = ".", grouping_mark = ",") very well at the moment (see Introduce default_locale2() #1445 (comment) for examples).

My counter arguments are basically

  1. read_delim()'s delimiter guessing does not work well enough for it to be the default,
  2. read_csv2() could handle "Swiss-formatted" CSV files if it no longer overrode the decimal_mark (Introduce default_locale2() #1445), and
  3. an important design principle of {readr} is to encourage explicit specification (AFAIU); e.g. "explicit" is mentioned three times in that sense in the README (I am not sure I mentioned this point here before, uh, explicitly).

@hadley
Member

hadley commented Aug 1, 2023

In that case, I think the best fix is to introduce default_locale2() so that read_csv2() works as expected.

@hadley hadley closed this as completed Aug 1, 2023
@phish108
Author

phish108 commented Aug 1, 2023

Hi @hadley

There are two things:

  • the issue of locale / default locale "not behaving". I agree that this is settled, but it is not related to this issue.

  • the core of this issue: read_delim() not being properly documented.

From the docs and the vignettes, users won't get a clue about how these functions differ from vanilla R.

My point regarding read_delim() is that the docs should include both a minimal and a more extensive example of using read_delim(), plus a short note that the function uses a heuristic to guess the format. Or, if that makes no sense from your viewpoint, they should at least include a statement on when to use and when not to use read_delim().
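For example, such documentation examples might look roughly like this (hypothetical file names, not actual docs text):

library(readr)

# Minimal: let read_delim() guess the delimiter (with a note in the docs
# that this relies on a heuristic):
dat <- read_delim("measurements.csv")

# More explicit: spell out the delimiter and the locale when guessing is
# not appropriate:
dat <- read_delim("measurements.csv", delim = ";",
                  locale = locale(decimal_mark = ".", grouping_mark = ","))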

The locale vignette should also mention read_delim(), as it is the only function that makes full use of the locale, whereas the read_csv*() functions do not. The current examples in the vignette do not work most of the time, either in my work or in my teaching.

All the logic might be obvious to you; it is not to me. Therefore, I ask for the function to be documented a bit more verbosely than just listing the parameters. This alone would resolve all of our data import problems.

Therefore, I reopen the issue.

@hadley
Member

hadley commented Aug 1, 2023

I've created #1503 to track some of the documentation issues.

I don't understand what you mean by read_csv() not fully using the locale.

@phish108
Author

phish108 commented Aug 1, 2023

Thank you for the reference to the documentation issue.

As noted above, until very recently changing the decimal separator appeared to have no effect, at least in read_csv2(). This means that the spec given via locale() is not fully used.

For read_csv() this is not applicable, because the comma is already taken as the column separator. Again, the spec given via locale() is not fully used.

In this respect, it is not entirely clear whether the parameters or the locale win in case of a conflict.

This is what I mean when I write that not all locale() features are available to these functions. From the docs I was unable to deduce the behavior of read_csv2().

@hadley
Member

hadley commented Aug 1, 2023

The locale only affects the parsing of individual values; it does not affect the delimiter. We have #1445 to resolve this issue with read_csv2(). I have filed #1505 to address what happens if you use the same character for both the delimiter and the decimal mark.
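A small illustration of that distinction (inline data, added for clarity): the locale controls how individual values are parsed, while the delimiter is a separate argument.

library(readr)

# locale() affects value parsing ...
parse_double("1,5", locale = locale(decimal_mark = ","))
#> [1] 1.5

# ... but not the delimiter, which is passed separately:
read_delim(I("a;b\n1,5;2,5"), delim = ";",
           locale = locale(decimal_mark = ","))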
