read_csv sometimes mis-identifies what type of column it should be reading and drops part of the column #1406

bkohrn · 2022-06-08T18:02:08Z

I have a file that has a first column value of "8D":

##Comments
##About
##File
##Creation
#SAMPLE,REGION,MUTATION_TYPE,MUTATION_CLASS,COUNT,DENOMINATOR,FREQUENCY
8D,OVERALL,A>T,SNV,1,1000000,1.00e-06
8D,OVERALL,A>C,SNV,1,1000000,1.00e-06
8D,OVERALL,A>G,SNV,1,1000000,1.00e-06
8D,OVERALL,T>A,SNV,1,1000000,1.00e-06
8D,OVERALL,T>C,SNV,1,1000000,1.00e-06
8D,OVERALL,T>G,SNV,1,1000000,1.00e-06
8D,OVERALL,C>A,SNV,1,2000000,5.00e-07
8D,OVERALL,C>T,SNV,1,2000000,5.00e-07
8D,OVERALL,C>G,SNV,1,2000000,5.00e-07
8D,OVERALL,G>A,SNV,1,2000000,5.00e-07
8D,OVERALL,G>T,SNV,1,2000000,5.00e-07
8D,OVERALL,G>C,SNV,1,2000000,5.00e-07

When I run it through read_csv, I expect to get:

# A tibble: 12 x 7
   `#SAMPLE` REGION  MUTATION_TYPE MUTATION_CLASS COUNT DENOMINATOR FREQUENCY
       <chr> <chr>   <chr>         <chr>          <dbl>       <dbl>     <dbl>
 1        8D OVERALL A>T           SNV                1     1000000 0.000001 
 2        8D OVERALL A>C           SNV                1     1000000 0.000001 
 3        8D OVERALL A>G           SNV                1     1000000 0.000001 
 4        8D OVERALL T>A           SNV                1     1000000 0.000001 
 5        8D OVERALL T>C           SNV                1     1000000 0.000001 
 6        8D OVERALL T>G           SNV                1     1000000 0.000001 
 7        8D OVERALL C>A           SNV                1     2000000 0.0000005
 8        8D OVERALL C>T           SNV                1     2000000 0.0000005
 9        8D OVERALL C>G           SNV                1     2000000 0.0000005
10        8D OVERALL G>A           SNV                1     2000000 0.0000005
11        8D OVERALL G>T           SNV                1     2000000 0.0000005
12        8D OVERALL G>C           SNV                1     2000000 0.0000005

Instead, what I get is:

# A tibble: 12 x 7
   `#SAMPLE` REGION  MUTATION_TYPE MUTATION_CLASS COUNT DENOMINATOR FREQUENCY
       <dbl> <chr>   <chr>         <chr>          <dbl>       <dbl>     <dbl>
 1         8 OVERALL A>T           SNV                1     1000000 0.000001 
 2         8 OVERALL A>C           SNV                1     1000000 0.000001 
 3         8 OVERALL A>G           SNV                1     1000000 0.000001 
 4         8 OVERALL T>A           SNV                1     1000000 0.000001 
 5         8 OVERALL T>C           SNV                1     1000000 0.000001 
 6         8 OVERALL T>G           SNV                1     1000000 0.000001 
 7         8 OVERALL C>A           SNV                1     2000000 0.0000005
 8         8 OVERALL C>T           SNV                1     2000000 0.0000005
 9         8 OVERALL C>G           SNV                1     2000000 0.0000005
10         8 OVERALL G>A           SNV                1     2000000 0.0000005
11         8 OVERALL G>T           SNV                1     2000000 0.0000005
12         8 OVERALL G>C           SNV                1     2000000 0.0000005

The first column seems to be parsed as a dbl, rather than as a chr. I can force the expected behavior using col_types, but this still strikes me as an issue that needs to be fixed.
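For readers hitting the same problem, the `col_types` workaround mentioned above might look like the following sketch. It uses a shortened two-row copy of the file, so the variable names `csv_text` and `df` are illustrative and not from the original report. It pins only the `#SAMPLE` column to character and leaves the other columns to be guessed.

```r
library(readr)

# Shortened, illustrative copy of the input file (assumption: two data rows
# are enough to demonstrate the workaround).
csv_text <- "##Comments
#SAMPLE,REGION,MUTATION_TYPE,MUTATION_CLASS,COUNT,DENOMINATOR,FREQUENCY
8D,OVERALL,A>T,SNV,1,1000000,1.00e-06
8D,OVERALL,C>A,SNV,1,2000000,5.00e-07
"

# I() marks the string as literal data rather than a file path.
# Only the first column is forced to character; the rest are still guessed.
df <- read_csv(
  I(csv_text),
  comment = "##",
  trim_ws = TRUE,
  col_types = cols(`#SAMPLE` = col_character())
)

df$`#SAMPLE`
```

This avoids the type guess for the affected column, but it does not address the guessing behavior itself.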

Confusingly, if the first column instead has the value "8A", the code behaves as expected.

Minimal example:

library(readr)

fIn <- "##Comments
##About
##File
##Creation
#SAMPLE,REGION,MUTATION_TYPE,MUTATION_CLASS,COUNT,DENOMINATOR,FREQUENCY
8D,OVERALL,A>T,SNV,1,1000000,1.00e-06
8D,OVERALL,A>C,SNV,1,1000000,1.00e-06
8D,OVERALL,A>G,SNV,1,1000000,1.00e-06
8D,OVERALL,T>A,SNV,1,1000000,1.00e-06
8D,OVERALL,T>C,SNV,1,1000000,1.00e-06
8D,OVERALL,T>G,SNV,1,1000000,1.00e-06
8D,OVERALL,C>A,SNV,1,2000000,5.00e-07
8D,OVERALL,C>T,SNV,1,2000000,5.00e-07
8D,OVERALL,C>G,SNV,1,2000000,5.00e-07
8D,OVERALL,G>A,SNV,1,2000000,5.00e-07
8D,OVERALL,G>T,SNV,1,2000000,5.00e-07
8D,OVERALL,G>C,SNV,1,2000000,5.00e-07
"
read_csv(fIn, 
         comment = "##", 
         trim_ws = TRUE)


bkohrn · 2022-06-08T18:04:40Z

I'm using R v4.1.2 and readr v2.1.1, and have observed this behavior on multiple different systems.

phish108 · 2022-11-21T08:48:14Z

also #1411

cjyetman · 2023-04-04T12:49:26Z

related #1484

philippbayer · 2023-06-06T06:11:54Z

I have encountered the same issue today with readr 2.1.2 and R 4.3.0. For me it was columns with the popular marker gene '16S' which turns into 16 in a double column.

Interestingly, it's not the same for every character: columns made out of 16X, 16O, and 16Z are identified as character columns, columns made out of 16D, 16S, 16L are identified as numeric columns.

I wrote a quick script to test every value from 16A to 16Z; the following letters get cut off: D, E, F, L, and S. It also happens with their lower-case versions d, e, f, l, and s. Isn't that weird!

Edit: these mostly correspond with the C++ numeric types - double, float, long, and short. Not sure about E, might be just parsed as a scientific number like 1e3.
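The script itself was not posted. A minimal sketch of such a check, assuming a single-column CSV built in memory, could look like this (`guessed_type`, `values`, and `types` are hypothetical names introduced for illustration):

```r
library(readr)

# Hypothetical helper: build a one-column CSV in which every row holds
# `value`, read it with read_csv(), and report the class of that column.
guessed_type <- function(value, n = 5) {
  csv <- paste(c("x", rep(value, n)), collapse = "\n")
  # I() marks the string as literal data rather than a file path.
  df <- read_csv(I(csv), show_col_types = FALSE)
  class(df$x)[1]
}

# Candidate values 16A, 16B, ..., 16Z.
values <- paste0("16", LETTERS)
types <- vapply(values, guessed_type, character(1))

# Values whose column was NOT read as character.
names(types)[types != "character"]
```

Which values are flagged may depend on the installed readr version, so no expected output is shown here. Swapping `LETTERS` for `letters` would cover the lower-case cases.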

hadley · 2023-07-31T22:40:46Z

Duplicate of #1484

sbearrows added the col_types 🏥 label Aug 25, 2022

sbearrows self-assigned this Aug 25, 2022

sbearrows added bug and removed bug col_types 🏥 labels Sep 2, 2022

hadley unassigned sbearrows Jul 31, 2023

hadley marked this as a duplicate of #1484 Jul 31, 2023

hadley closed this as completed Jul 31, 2023
