readr guesses unexpected column types for values containing a “D” #1484

Open
klmr opened this issue Mar 15, 2023 · 2 comments
Labels
bug an unexpected problem or unintended behavior col_types 🏥 read 📖

Comments

klmr commented Mar 15, 2023

parse_double apparently interprets D as an alternative to E for scientific exponent notation. For people used to R (unless they also know Fortran), this is quite unexpected, and does not seem to be documented anywhere. Compare to the behaviour of core R:

$ readr::parse_double('12d3')
[1] 12000

$ as.numeric('12d3')
[1] NA
Warning message:
NAs introduced by coercion

$ str2lang('12d3')
Error in str2lang("12d3") : <text>:1:3: unexpected symbol
1: 12d3
      ^

So far, so good. Unfortunately this leads to surprises during automatic column type guessing. For instance:

$ readr::read_csv(I("test\n12d3"))
Rows: 1 Columns: 1
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): test

Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 1
   test
  <dbl>
1 12000

I’d wager that this isn’t the expected or desired behaviour for most uses of readr. Is there a way to disable it? Something to the effect of “guess column types, but use a conservative parser for number formats”, or alternatively “guess column types, but do not consider scientific exponent notation”.
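One possible workaround, until the guessing itself changes, is to skip guessing for the affected column by declaring its type. The sketch below assumes readr 2.0 or later (for literal data via `I()`) and reuses the column name `test` from the example above. It does not make guessing more conservative; it only pins one column to character.

```r
library(readr)

# Pin the column to character so that no type guessing happens for it.
# `test` is the column name from the example above.
df <- read_csv(
  I("test\n12d3"),
  col_types = cols(test = col_character())
)

df$test
#> [1] "12d3"
```

When the affected columns are not known in advance, `cols(.default = col_character())` reads every column as character, at the cost of converting the genuinely numeric columns by hand afterwards.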

klmr changed the title (Mar 15, 2023): ’readr guesses unexpected column types for labels containing a “D” → ’readr guesses unexpected column types for values containing a “D”
hadley changed the title (Jul 31, 2023): ’readr guesses unexpected column types for values containing a “D” → readr guesses unexpected column types for values containing a “D”
hadley (Member) commented Jul 31, 2023

Somewhat more minimal reprex:

readr::parse_double('12d3')
#> [1] 12000
as.numeric('12d3')
#> Warning: NAs introduced by coercion
#> [1] NA

Created on 2023-07-31 with reprex v2.0.2

If we fix this, it's possible that it might break existing code, but that seems fairly unlikely.

hadley closed this as completed Jul 31, 2023
hadley reopened this Jul 31, 2023
hadley added the labels “bug an unexpected problem or unintended behavior” and “read 📖” Jul 31, 2023
hadley (Member) commented Jul 31, 2023

Looks like there are a few letters where this is a problem:

library(readr)

x <- paste0("12", LETTERS, "3")

df <- data.frame(x, parsed = parse_double(x))
#> Warning: 21 parsing failures.
#> row col               expected actual
#>   1  -- no trailing characters   12A3
#>   2  -- no trailing characters   12B3
#>   3  -- no trailing characters   12C3
#>   7  -- no trailing characters   12G3
#>   8  -- no trailing characters   12H3
#> ... ... ...................... ......
#> See problems(...) for more details.
subset(df, !is.na(parsed))
#>       x parsed
#> 4  12D3  12000
#> 5  12E3  12000
#> 6  12F3  12000
#> 12 12L3  12000
#> 19 12S3  12000

Created on 2023-07-31 with reprex v2.0.2
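To guard against every affected letter at once, one option is to validate strings against a strict pattern before handing them to `parse_double()`. This is only a sketch: `is_conventional_number()` and `parse_double_strict()` are hypothetical helpers, not part of readr, and the pattern is deliberately narrower than what `as.numeric()` accepts (no hexadecimal, `Inf` or `NaN`).

```r
library(readr)

# Hypothetical helper (not part of readr): TRUE only for plain decimal
# numbers with an optional e/E exponent.
is_conventional_number <- function(x) {
  grepl("^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?$", x)
}

# Hypothetical wrapper: parse only the strings that pass the strict check;
# everything else becomes NA instead of being read as an exponent.
parse_double_strict <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x) & is_conventional_number(x)
  out[ok] <- parse_double(x[ok])
  out
}

x <- paste0("12", LETTERS, "3")
x[!is.na(parse_double_strict(x))]
#> [1] "12E3"
```

With this check in place, only the `E` form from the list above is still read as a number; `12D3`, `12F3`, `12L3` and `12S3` come back as `NA`.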
