Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_csv2_chunked() needs to adjust locale like read_csv2() #1468

Open
dpprdan opened this issue Feb 7, 2023 · 3 comments
Open

read_csv2_chunked() needs to adjust locale like read_csv2() #1468

dpprdan opened this issue Feb 7, 2023 · 3 comments
Labels
bug an unexpected problem or unintended behavior locale 🌏 read 📖

Comments

@dpprdan
Copy link
Contributor

dpprdan commented Feb 7, 2023

read_csv2_chunked() does not parse decimals correctly with the default_locale().

library(readr)
tf <- tempfile()
write_csv2(head(mtcars), tf)

Just to make sure: This really is a CSV2 and read_csv2() reads the data as expected

read_lines(tf) |> head(2)
#> [1] "mpg;cyl;disp;hp;drat;wt;qsec;vs;am;gear;carb"
#> [2] "21,0;6;160;110;3,90;2,620;16,46;0;1;4;4"
read_csv2(tf)
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 6 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

with default_locale(), i.e. decimal_mark = ".", the decimals are not parsed as decimals but as integers

read_csv2_chunked(tf, DataFrameCallback$new(data.frame), chunk_size = 3)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   mpg = col_number(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_number(),
#>   wt = col_number(),
#>   qsec = col_number(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )
#>   mpg cyl disp  hp drat   wt qsec vs am gear carb index
#> 1 210   6  160 110  390 2620 1646  0  1    4    4     1
#> 2 210   6  160 110  390 2875 1702  0  1    4    4     1
#> 3 228   4  108  93  385 2320 1861  1  1    4    1     1
#> 4 214   6  258 110  308 3215 1944  1  0    3    1     4
#> 5 187   8  360 175  315 3440 1702  0  0    3    2     4
#> 6 181   6  225 105  276 3460 2022  1  0    3    1     4

All is fine when setting the adequate locale.

read_csv2_chunked(tf, DataFrameCallback$new(data.frame), chunk_size = 3, locale = locale(decimal_mark = ",", grouping_mark = "."))
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb index
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     1
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     1
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1     1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1     4
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2     4
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1     4

Long story short, the default locale setting is wrong.
This could be fixed easily with a default_locale2()/locale(decimal_mark = ",", grouping_mark = ".") default. (cf #1445)

@hadley
Copy link
Member

hadley commented Jul 31, 2023

Could you please rework your reproducible example to use the reprex package ? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run in a local session.

@hadley hadley added the reprex needs a minimal reproducible example label Jul 31, 2023
@dpprdan
Copy link
Contributor Author

dpprdan commented Aug 1, 2023

reprex without prose:

library(readr)
tf <- tempfile()
write_csv2(head(mtcars), tf)

# decimals parse incorrectly with the `default_locale()`
read_csv2_chunked(tf, DataFrameCallback$new(data.frame), chunk_size = 3)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   mpg = col_number(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_number(),
#>   wt = col_number(),
#>   qsec = col_number(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )
#>   mpg cyl disp  hp drat   wt qsec vs am gear carb index
#> 1 210   6  160 110  390 2620 1646  0  1    4    4     1
#> 2 210   6  160 110  390 2875 1702  0  1    4    4     1
#> 3 228   4  108  93  385 2320 1861  1  1    4    1     1
#> 4 214   6  258 110  308 3215 1944  1  0    3    1     4
#> 5 187   8  360 175  315 3440 1702  0  0    3    2     4
#> 6 181   6  225 105  276 3460 2022  1  0    3    1     4

# decimals parse correctly with the adequate locale set
read_csv2_chunked(
  tf,
  DataFrameCallback$new(data.frame),
  chunk_size = 3,
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb index
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     1
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     1
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1     1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1     4
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2     4
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1     4
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2023-08-01
#>  pandoc   3.1.5 @ C:/Users/DANIEL~1.AK-/AppData/Local/Pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.0)
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>  crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.1)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>  hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
#>  knitr         1.43    2023-05-25 [1] CRAN (R 4.3.0)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown     2.23    2023-07-01 [1] CRAN (R 4.3.1)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  styler        1.10.1  2023-06-05 [1] CRAN (R 4.3.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
#>  tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
#>  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.1)
#>  vroom         1.6.3   2023-04-28 [1] CRAN (R 4.3.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
#>  xfun          0.39    2023-04-20 [1] CRAN (R 4.3.0)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] C:/Users/Daniel.AK-HAMBURG/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Long story short, the default locale setting is wrong.
This could be fixed easily with a default_locale2()/locale(decimal_mark = ",", grouping_mark = ".") default. (cf #1445)

@dpprdan dpprdan changed the title read_csv2_chunked() parses decimal numbers like integers default_locale() read_csv2_chunked() parses decimals incorrectly by default Aug 1, 2023
@hadley
Copy link
Member

hadley commented Aug 1, 2023

Somewhat more minimal reprex:

library(readr)

path <- tempfile()
write_lines(c("x;y", "1,3;2,3"), path)

# wrong
read_csv2_chunked(
  path,
  DataFrameCallback$new(data.frame),
  chunk_size = 3,
  col_types = list()
)
#>    x  y index
#> 1 13 23     1

# right
read_csv2(path, col_types = list())
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> # A tibble: 1 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1   1.3   2.3

# right
read_csv2_chunked(
  path,
  DataFrameCallback$new(data.frame),
  chunk_size = 3,
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  col_types = list()
)
#>     x   y index
#> 1 1.3 2.3     1

Created on 2023-08-01 with reprex v2.0.2

@hadley hadley added bug an unexpected problem or unintended behavior read 📖 and removed reprex needs a minimal reproducible example labels Aug 1, 2023
@hadley hadley changed the title read_csv2_chunked() parses decimals incorrectly by default read_csv2_chunked() needs to adjust locale like read_csv2() Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior locale 🌏 read 📖
Projects
None yet
Development

No branches or pull requests

2 participants