read_csv sometimes mis-identifies what type of column it should be reading and drops part of the column #1406

Closed
bkohrn opened this issue Jun 8, 2022 · 5 comments


bkohrn commented Jun 8, 2022

I have a file whose first column contains the value "8D" in every row:

##Comments
##About
##File
##Creation
#SAMPLE,REGION,MUTATION_TYPE,MUTATION_CLASS,COUNT,DENOMINATOR,FREQUENCY
8D,OVERALL,A>T,SNV,1,1000000,1.00e-06
8D,OVERALL,A>C,SNV,1,1000000,1.00e-06
8D,OVERALL,A>G,SNV,1,1000000,1.00e-06
8D,OVERALL,T>A,SNV,1,1000000,1.00e-06
8D,OVERALL,T>C,SNV,1,1000000,1.00e-06
8D,OVERALL,T>G,SNV,1,1000000,1.00e-06
8D,OVERALL,C>A,SNV,1,2000000,5.00e-07
8D,OVERALL,C>T,SNV,1,2000000,5.00e-07
8D,OVERALL,C>G,SNV,1,2000000,5.00e-07
8D,OVERALL,G>A,SNV,1,2000000,5.00e-07
8D,OVERALL,G>T,SNV,1,2000000,5.00e-07
8D,OVERALL,G>C,SNV,1,2000000,5.00e-07

When I run it through read_csv, I expect to get:

# A tibble: 12 x 7
   `#SAMPLE` REGION  MUTATION_TYPE MUTATION_CLASS COUNT DENOMINATOR FREQUENCY
       <chr> <chr>   <chr>         <chr>          <dbl>       <dbl>     <dbl>
 1        8D OVERALL A>T           SNV                1     1000000 0.000001 
 2        8D OVERALL A>C           SNV                1     1000000 0.000001 
 3        8D OVERALL A>G           SNV                1     1000000 0.000001 
 4        8D OVERALL T>A           SNV                1     1000000 0.000001 
 5        8D OVERALL T>C           SNV                1     1000000 0.000001 
 6        8D OVERALL T>G           SNV                1     1000000 0.000001 
 7        8D OVERALL C>A           SNV                1     2000000 0.0000005
 8        8D OVERALL C>T           SNV                1     2000000 0.0000005
 9        8D OVERALL C>G           SNV                1     2000000 0.0000005
10        8D OVERALL G>A           SNV                1     2000000 0.0000005
11        8D OVERALL G>T           SNV                1     2000000 0.0000005
12        8D OVERALL G>C           SNV                1     2000000 0.0000005

Instead, what I get is:

# A tibble: 12 x 7
   `#SAMPLE` REGION  MUTATION_TYPE MUTATION_CLASS COUNT DENOMINATOR FREQUENCY
       <dbl> <chr>   <chr>         <chr>          <dbl>       <dbl>     <dbl>
 1         8 OVERALL A>T           SNV                1     1000000 0.000001 
 2         8 OVERALL A>C           SNV                1     1000000 0.000001 
 3         8 OVERALL A>G           SNV                1     1000000 0.000001 
 4         8 OVERALL T>A           SNV                1     1000000 0.000001 
 5         8 OVERALL T>C           SNV                1     1000000 0.000001 
 6         8 OVERALL T>G           SNV                1     1000000 0.000001 
 7         8 OVERALL C>A           SNV                1     2000000 0.0000005
 8         8 OVERALL C>T           SNV                1     2000000 0.0000005
 9         8 OVERALL C>G           SNV                1     2000000 0.0000005
10         8 OVERALL G>A           SNV                1     2000000 0.0000005
11         8 OVERALL G>T           SNV                1     2000000 0.0000005
12         8 OVERALL G>C           SNV                1     2000000 0.0000005

The first column is parsed as a dbl rather than as a chr, and the trailing "D" is silently dropped. I can force the expected behavior using col_types, but this still strikes me as an issue that needs to be fixed.
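For reference, the col_types workaround mentioned above might look like the sketch below. It assumes readr 2.x (where `I()` marks literal data) and uses a shortened, made-up three-column sample, `csv_text`, rather than the full file. Only `#SAMPLE` is pinned to character; the remaining columns are still guessed.

```r
library(readr)

# Shortened, hypothetical sample: same "8D" first column, fewer columns and rows
csv_text <- "#SAMPLE,REGION,COUNT\n8D,OVERALL,1\n8D,OVERALL,2\n"

# Pin the first column to character; all other columns are still guessed
fixed <- read_csv(
  I(csv_text),
  col_types = cols(`#SAMPLE` = col_character())
)

fixed[["#SAMPLE"]]
```

Because the column type is stated explicitly, the result does not depend on how the guesser treats the trailing letter.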

Confusingly, if the first column instead has the value "8A", the code behaves as expected.

Minimal example:

fIn <- "##Comments
##About
##File
##Creation
#SAMPLE,REGION,MUTATION_TYPE,MUTATION_CLASS,COUNT,DENOMINATOR,FREQUENCY
8D,OVERALL,A>T,SNV,1,1000000,1.00e-06
8D,OVERALL,A>C,SNV,1,1000000,1.00e-06
8D,OVERALL,A>G,SNV,1,1000000,1.00e-06
8D,OVERALL,T>A,SNV,1,1000000,1.00e-06
8D,OVERALL,T>C,SNV,1,1000000,1.00e-06
8D,OVERALL,T>G,SNV,1,1000000,1.00e-06
8D,OVERALL,C>A,SNV,1,2000000,5.00e-07
8D,OVERALL,C>T,SNV,1,2000000,5.00e-07
8D,OVERALL,C>G,SNV,1,2000000,5.00e-07
8D,OVERALL,G>A,SNV,1,2000000,5.00e-07
8D,OVERALL,G>T,SNV,1,2000000,5.00e-07
8D,OVERALL,G>C,SNV,1,2000000,5.00e-07
"
library(readr)

# I() marks fIn as literal data rather than a file path
read_csv(I(fIn),
         comment = "##",
         trim_ws = TRUE)

bkohrn commented Jun 8, 2022

I'm using R v4.1.2 and readr v2.1.1, and I have observed this behavior on multiple different systems.

@sbearrows sbearrows self-assigned this Aug 25, 2022
@sbearrows sbearrows added bug an unexpected problem or unintended behavior and removed bug an unexpected problem or unintended behavior col_types 🏥 labels Sep 2, 2022
@phish108

also #1411


cjyetman commented Apr 4, 2023

related #1484


philippbayer commented Jun 6, 2023

I encountered the same issue today with readr 2.1.2 and R 4.3.0. In my case it was columns containing the popular marker gene '16S', which turns into 16 in a double column.

Interestingly, it's not the same for every character: columns made out of 16X, 16O, and 16Z are identified as character columns, columns made out of 16D, 16S, 16L are identified as numeric columns.

I wrote a quick script to test 16A through 16Z. The following letters get cut off: D, E, F, L, and S. It also happens with their lower-case versions d, e, f, l, and s. Isn't that weird!

Edit: these mostly correspond to the C++ numeric types: double, float, long, and short. I'm not sure about E; it might just be parsed as scientific notation, like 1e3.
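The script itself was not posted, so the following is only a rough sketch of how such a check could be written, assuming readr 2.x. The helper name `guessed_type` and the one-column layout are made up for illustration. It builds a tiny CSV for each suffix and records the class that read_csv guesses for the column.

```r
library(readr)

# Hypothetical helper: read a one-column CSV holding e.g. "16D" twice
# and return the class readr guessed for that column
guessed_type <- function(suffix, prefix = "16") {
  value <- paste0(prefix, suffix)
  csv_text <- paste0("x\n", value, "\n", value, "\n")
  df <- read_csv(I(csv_text), show_col_types = FALSE)
  class(df$x)[1]
}

# Try every upper- and lower-case letter as a suffix
types <- vapply(c(LETTERS, letters), guessed_type, character(1))

# Suffixes whose column was NOT read as character
names(types)[types != "character"]
```

Which suffixes show up in the last line will depend on the installed readr and vroom versions, so no expected output is given here.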

hadley (Member) commented Jul 31, 2023

Duplicate of #1484

@hadley hadley marked this as a duplicate of #1484 Jul 31, 2023
@hadley hadley closed this as completed Jul 31, 2023