`read_tsv()` gives problems on gzipped file, not when uncompressed #1523

gavinband · 2023-11-23T18:12:58Z

Thanks for making readr and tidyverse!

I am using read_tsv() (readr 2.1.4) to parse this largeish file from a public repository:

http://ftp.ensembl.org/pub/release-110/gff3/mus_musculus/Mus_musculus.GRCm39.110.chr.gff3.gz

My code is:

system( 'curl -O http://ftp.ensembl.org/pub/release-110/gff3/mus_musculus/Mus_musculus.GRCm39.110.chr.gff3.gz' )
filename = "Mus_musculus.GRCm39.110.chr.gff3.gz"
X = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
)

However, this reports:

One or more parsing issues, call `problems()` on your data frame for details

Sure enough there are problems:

> readr::problems(X)
# A tibble: 3,853,670 × 5
      row   col expected   actual            file 
    <int> <int> <chr>      <chr>             <chr>
 1 433594     4 an integer ensembl           ""   
 2 433594     5 a double   ncRNA_gene        ""   
 3 433595     4 an integer ensembl           ""   
 4 433595     5 a double   miRNA             ""   
 5 433596     4 an integer ensembl           ""   
 6 433596     5 a double   exon              ""   
 7 433597     4 an integer cpg               ""   
 8 433597     5 a double   biological_region ""   
 9 433598     4 an integer Eponine           ""   
10 433598     5 a double   biological_region ""   
# ℹ 3,853,660 more rows
# ℹ Use `print(n = ...)` to see more rows

The parsed line 433594 looks like this:

> X[433594,]
# A tibble: 1 × 9
  seqid source type  start   end    score strand   phase attributes
  <chr> <chr>  <chr> <int> <dbl>    <dbl> <chr>    <int> <chr>     
1 #     "#\n"  3        NA    NA 60677110 60677223    NA -

However, if I unzip the file first, the problem goes away:

system( 'gunzip Mus_musculus.GRCm39.110.chr.gff3.gz' )
filename = 'Mus_musculus.GRCm39.110.chr.gff3'
X = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
)

With correct results on that line:

> X[433594,]
# A tibble: 1 × 9
  seqid source  type          start      end score strand phase attributes      
  <chr> <chr>   <chr>         <int>    <dbl> <dbl> <chr>  <int> <chr>           
1 13    ensembl ncRNA_gene 60677110 60677223    NA -         NA ID=gene:ENSMUSG…

(Re-gzipping the file brings the problem back.)

One thing that may be relevant: the file is sprinkled with comment lines (they are '###\n'), including one close to this problem line, though there are many others before it as well:

% gunzip -c Mus_musculus.GRCm39.110.chr.gff3.gz| head -n 433594 | tail | cut -f1-5 
13  havana  ncRNA_gene  44304071    44305429
13  havana  lnc_RNA 44304071    44305429
13  havana  exon    44304071    44304143
13  havana  exon    44305240    44305429
###
13  havana  ncRNA_gene  44339236    44369963
13  havana  lnc_RNA 44339236    44369910
13  havana  exon    44339236    44343939
13  havana  exon    44348532    44348783
13  havana  exon    44369847    44369910
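
To cross-check the row count readr reports against the file itself, the comment and data lines can be counted straight from the decompressed stream. This is a minimal sketch, assuming `gunzip` and `grep` are available; `count_gff_lines` is a hypothetical helper name, and it only treats lines that start with `#` as comments, which may not match readr's `comment = '#'` handling exactly.

```shell
# Hypothetical helper: count comment and data lines in a gzipped GFF3 file.
# A "comment line" here is any line whose first character is '#'.
count_gff_lines() {
    local gz_file="$1"
    local comments data
    # grep -c prints the count; it exits non-zero on zero matches, hence '|| true'.
    comments=$(gunzip -c "$gz_file" | grep -c '^#' || true)
    data=$(gunzip -c "$gz_file" | grep -vc '^#' || true)
    echo "comment_lines=$comments data_lines=$data"
}

# Usage (file name taken from the report above):
#   count_gff_lines Mus_musculus.GRCm39.110.chr.gff3.gz
```

If the `data_lines` figure agrees with `nrow(X)` from the uncompressed read but not with the gzipped read, that would point at the gzipped code path rather than at the file contents.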

Session info:

> packageVersion( 'readr' )
[1] ‘2.1.4’
> packageVersion( 'vroom' )
[1] ‘1.6.0’
> packageVersion( 'tidyverse' )
[1] ‘2.0.0’

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.2.1

Many thanks for any help with this issue.


johan-gson · 2024-08-17T19:31:28Z

I have encountered exactly the same thing. I am quite sure that my file is OK; it contains only tab-separated count numbers. If I unzip the gz file, I can read all 451163 lines (and 13857 columns), although I only read a selection of them. With the gz file, I only get 309927 lines. The gz file size is 5857388 x 1024 bytes. Interestingly, this means that the code manages to read roughly 309927 / 451163 x 5857388 x 1024 = 4120309944 bytes, while 2^32 is 4294967296. Is there a 32-bit integer somewhere in the code where there should be a 64-bit integer? This looks very suspicious to me. I'm fairly convinced this is a bug, and that it likely involves a 32-bit variable of some kind. Could anyone look into this? It is pretty annoying, since these files are very large when uncompressed.
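
The arithmetic above can be reproduced with a few lines of shell. This is only a sketch using the figures quoted in the comment. It assumes roughly uniform line lengths (so the proportional estimate is approximate) and a shell with 64-bit integer arithmetic, such as bash on a 64-bit system. Integer division truncates, so the last digit can differ from the rounded figure quoted above.

```shell
# Figures quoted in the comment above.
lines_read=309927                    # rows obtained from the .gz file
lines_total=451163                   # rows obtained from the uncompressed file
size_bytes=$(( 5857388 * 1024 ))     # file size as quoted in the comment

# Proportional estimate of how many bytes were consumed before reading stopped.
estimate=$(( lines_read * size_bytes / lines_total ))

limit_i32=$(( 1 << 31 ))             # signed 32-bit range ends just below this
limit_u32=$(( 1 << 32 ))             # unsigned 32-bit range ends just below this

echo "estimated bytes read: $estimate"
echo "2^31:                 $limit_i32"
echo "2^32:                 $limit_u32"
echo "percent of 2^32:      $(( estimate * 100 / limit_u32 ))"
```

The estimate lands between 2^31 and 2^32, a few percent short of 2^32. That is compatible with an unsigned 32-bit limit only if line lengths vary enough to account for the gap, so it is suggestive rather than conclusive.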

I'm running R on a 64-bit Windows 10 machine, whereas gavinband appears to be on a Mac.
