read_tsv() gives problems on gzipped file, not when uncompressed #1523

Open
gavinband opened this issue Nov 23, 2023 · 1 comment

Comments

@gavinband

Thanks for making readr and tidyverse!

I am using read_tsv() (readr 2.1.4) to parse this largeish file from a public repository:

http://ftp.ensembl.org/pub/release-110/gff3/mus_musculus/Mus_musculus.GRCm39.110.chr.gff3.gz

My code is:

system( 'curl -O http://ftp.ensembl.org/pub/release-110/gff3/mus_musculus/Mus_musculus.GRCm39.110.chr.gff3.gz' )
filename = "Mus_musculus.GRCm39.110.chr.gff3.gz"
X = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
)

However, this reports:

One or more parsing issues, call `problems()` on your data frame for details

Sure enough there are problems:

> readr::problems(X)
# A tibble: 3,853,670 × 5
      row   col expected   actual            file 
    <int> <int> <chr>      <chr>             <chr>
 1 433594     4 an integer ensembl           ""   
 2 433594     5 a double   ncRNA_gene        ""   
 3 433595     4 an integer ensembl           ""   
 4 433595     5 a double   miRNA             ""   
 5 433596     4 an integer ensembl           ""   
 6 433596     5 a double   exon              ""   
 7 433597     4 an integer cpg               ""   
 8 433597     5 a double   biological_region ""   
 9 433598     4 an integer Eponine           ""   
10 433598     5 a double   biological_region ""   
# ℹ 3,853,660 more rows
# ℹ Use `print(n = ...)` to see more rows

The parsed line 433594 looks like this:

> X[433594,]
# A tibble: 1 × 9
  seqid source type  start   end    score strand   phase attributes
  <chr> <chr>  <chr> <int> <dbl>    <dbl> <chr>    <int> <chr>     
1 #     "#\n"  3        NA    NA 60677110 60677223    NA -         

However if I unzip the file first then the problem goes away:

system( 'gunzip Mus_musculus.GRCm39.110.chr.gff3.gz' )
filename = 'Mus_musculus.GRCm39.110.chr.gff3'
X = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
)

With correct results on that line:

> X[433594,]
# A tibble: 1 × 9
  seqid source  type          start      end score strand phase attributes      
  <chr> <chr>   <chr>         <int>    <dbl> <dbl> <chr>  <int> <chr>           
1 13    ensembl ncRNA_gene 60677110 60677223    NA -         NA ID=gene:ENSMUSG…

(I can re-gzip the file to restore the problem)
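Since reading the uncompressed file works, one possible workaround is to decompress to a temporary file from within R and point `read_tsv()` at that. The following is only a sketch under that assumption: `decompress_to_temp()` is a hypothetical helper (not part of readr), it uses only base R connections, and it has not been checked against the file above.

```r
# Hypothetical helper: stream-decompress a gzipped file to a temporary file
# using base R connections, copying in fixed-size chunks so that the whole
# file never has to be held in memory at once.
decompress_to_temp <- function(gz_path, chunk_bytes = 1024 * 1024) {
  out_path <- tempfile(fileext = ".tsv")
  src <- gzfile(gz_path, open = "rb")   # reads yield decompressed bytes
  on.exit(close(src), add = TRUE)
  dst <- file(out_path, open = "wb")
  on.exit(close(dst), add = TRUE)
  repeat {
    chunk <- readBin(src, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break      # end of input
    writeBin(chunk, dst)
  }
  out_path
}

# Usage (assumes the .gz file from above is in the working directory):
# plain_file <- decompress_to_temp("Mus_musculus.GRCm39.110.chr.gff3.gz")
# X <- readr::read_tsv(plain_file, comment = "#", na = ".", ...)
# unlink(plain_file)
```

This costs temporary disk space equal to the uncompressed size, so it sidesteps the problem rather than fixing it.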

One thing that may be relevant is that the file seems to be sprinkled with comment lines (they are '###\n') including one just around this problem line (but lots of others before this as well):

% gunzip -c Mus_musculus.GRCm39.110.chr.gff3.gz| head -n 433594 | tail | cut -f1-5 
13  havana  ncRNA_gene  44304071    44305429
13  havana  lnc_RNA 44304071    44305429
13  havana  exon    44304071    44304143
13  havana  exon    44305240    44305429
###
13  havana  ncRNA_gene  44339236    44369963
13  havana  lnc_RNA 44339236    44369910
13  havana  exon    44339236    44343939
13  havana  exon    44348532    44348783
13  havana  exon    44369847    44369910
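To check how many rows a correct parse should produce, the comment and data lines can be counted independently of readr. The sketch below uses only base R; `count_gff_lines()` is a hypothetical helper name, and it treats any line starting with `#` as a comment, which is an assumption that matches the `comment = '#'` setting used above.

```r
# Hypothetical helper: count comment lines (starting with "#") and non-empty
# data lines in a plain or gzipped text file, reading in chunks of lines.
# gzfile() reads both gzipped and uncompressed files.
count_gff_lines <- function(path, chunk_lines = 100000L) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con), add = TRUE)
  n_comment <- 0   # doubles, so the counters cannot overflow a 32-bit integer
  n_data <- 0
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break
    is_comment <- startsWith(lines, "#")
    n_comment <- n_comment + sum(is_comment)
    n_data <- n_data + sum(!is_comment & nzchar(lines))
  }
  c(comment = n_comment, data = n_data)
}

# Usage: compare the "data" count with nrow(X) for the .gz and unzipped files.
# count_gff_lines("Mus_musculus.GRCm39.110.chr.gff3.gz")
```

If the `data` count matches `nrow(X)` for the uncompressed read but not for the gzipped one, that would support the problem being in how the compressed input is handled rather than in the file contents.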

Session info:

> packageVersion( 'readr' )
[1] ‘2.1.4’
> packageVersion( 'vroom' )
[1] ‘1.6.0’
> packageVersion( 'tidyverse' )
[1] ‘2.0.0’
> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.2.1

Many thanks for any help with this issue.

@johan-gson

johan-gson commented Aug 17, 2024

I have encountered exactly the same thing. I am quite sure that my file is fine: it contains only tab-separated count numbers. If I unzip the gz file, I can read all 451163 lines (and 13857 columns), although I only read a selection of them. With the gz file, I get only 309927 lines.

The gz file size is 5857388 × 1024 bytes. Interestingly, this means the code manages to read roughly 309927/451163 × 5857388 × 1024 = 4120309944 bytes, while 2^32 is 4294967296. Is there a 32-bit integer somewhere in the code where there should be a 64-bit integer? That looks very suspicious to me. I'm fairly convinced this is a bug, and that it likely involves a 32-bit variable of some kind. Could anyone look into this? It is pretty annoying, since these files are very large when uncompressed.

I'm running R on a Windows 10 64-bit machine, whereas gavinband seems to be on a Mac.
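The estimate in the comment above can be reproduced in a few lines of R. This is just the same arithmetic written out, using the figures quoted in the comment. It assumes lines are of roughly equal length, so the result is an approximation and not a measured byte offset.

```r
# Figures quoted in the comment above
lines_read  <- 309927            # lines obtained from the .gz file
lines_total <- 451163            # lines obtained from the unzipped file
size_bytes  <- 5857388 * 1024    # file size as quoted

# Approximate number of bytes consumed before reading stopped
bytes_read_estimate <- lines_read / lines_total * size_bytes

# Compare against the signed and unsigned 32-bit limits
limits <- c(int32_signed = 2^31, int32_unsigned = 2^32)
print(format(bytes_read_estimate, big.mark = ",", scientific = FALSE))
print(bytes_read_estimate / limits)
```

The estimate falls between 2^31 and 2^32, which is consistent with the 32-bit suspicion but does not by itself prove where the truncation happens.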
