readr guesses unexpected column types for values containing a “D” #1484

klmr · 2023-03-15T08:35:29Z

parse_double apparently interprets D as an alternative to E for scientific exponent notation. For people used to R (unless they also know Fortran), this is quite unexpected, and does not seem to be documented anywhere. Compare to the behaviour of core R:

$ readr::parse_double('12d3')
[1] 12000

$ as.numeric('12d3')
[1] NA
Warning message:
NAs introduced by coercion

$ str2lang('12d3')
Error in str2lang("12d3") : <text>:1:3: unexpected symbol
1: 12d3
      ^

So far, so good. Unfortunately this leads to surprises during automatic column type guessing. For instance:

$ readr::read_csv(I("test\n12d3"))
Rows: 1 Columns: 1
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): test

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 1
   test
  <dbl>
1 12000

I’d wager that this isn’t the expected or desired behaviour for most uses of ‘readr’. Is there maybe a way to disable this? Something to the effect of “guess column types, but use a conservative parser for number formats”, or alternatively “guess column types, but do not consider scientific exponent notation”?
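Until there is such an option, one possible workaround is to skip guessing for the affected columns and convert them afterwards. This is a minimal sketch rather than an official recommendation: it assumes that reading every column as character is acceptable for the data at hand, and the column name test_num is made up for illustration.

```r
library(readr)

# Read every column as character so that no type guessing takes place.
df <- read_csv(
  I("test\n12d3"),
  col_types = cols(.default = col_character())
)

# Convert with base R, which does not treat "d" as an exponent marker,
# so "12d3" becomes NA instead of 12000.
# test_num is a hypothetical column name used only for this example.
df$test_num <- suppressWarnings(as.numeric(df$test))
```

The original string stays available in the test column, so values that fail to convert can be inspected rather than silently turned into numbers.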

hadley · 2023-07-31T21:56:11Z

Somewhat more minimal reprex:

readr::parse_double('12d3')
#> [1] 12000
as.numeric('12d3')
#> Warning: NAs introduced by coercion
#> [1] NA

Created on 2023-07-31 with reprex v2.0.2

Fixing this might break existing code, but that seems fairly unlikely.

hadley · 2023-07-31T22:42:31Z

Looks like there are a few letters where this is a problem:

library(readr)

x <- paste0("12", LETTERS, "3")

df <- data.frame(x, parsed = parse_double(x))
#> Warning: 21 parsing failures.
#> row col               expected actual
#>   1  -- no trailing characters   12A3
#>   2  -- no trailing characters   12B3
#>   3  -- no trailing characters   12C3
#>   7  -- no trailing characters   12G3
#>   8  -- no trailing characters   12H3
#> ... ... ...................... ......
#> See problems(...) for more details.
subset(df, !is.na(parsed))
#>       x parsed
#> 4  12D3  12000
#> 5  12E3  12000
#> 6  12F3  12000
#> 12 12L3  12000
#> 19 12S3  12000

Created on 2023-07-31 with reprex v2.0.2
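For comparison, a conservative check that only accepts e/E as the exponent marker could look roughly like this in base R. This is only a sketch: is_strict_number is a hypothetical helper, not part of readr, and the regular expression is one possible definition of a "plain" number.

```r
# Hypothetical helper (not part of readr): TRUE only for plain decimal
# numbers with an optional e/E exponent, e.g. "12", "-1.5", "12E3".
is_strict_number <- function(x) {
  grepl("^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?$", x)
}

x <- paste0("12", LETTERS, "3")

# Of the 26 candidates, only the E variant passes the strict check.
x[is_strict_number(x)]
#> [1] "12E3"
```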

klmr changed the title ~~’readr guesses unexpected column types for labels containing a “D”~~ ’readr guesses unexpected column types for values containing a “D” Mar 15, 2023

This was referenced Apr 4, 2023

read_csv parsing incorrectly #1411

Closed

read_csv sometimes mis-identifies what type of column it should be reading and drops part of the column #1406

Closed

hadley changed the title ~~’readr guesses unexpected column types for values containing a “D”~~ readr guesses unexpected column types for values containing a “D” Jul 31, 2023

hadley closed this as completed Jul 31, 2023

hadley reopened this Jul 31, 2023

hadley added the bug and read 📖 labels Jul 31, 2023

hadley added the col_types 🏥 label Aug 1, 2023

hadley mentioned this issue Aug 1, 2023

problem when guessing coltypes if values contains "L" as the last character tidyverse/vroom#476

Closed