read_csv returns empty data.frame for remote gzipped csv files #1555
This behaviour is not new to readr 2.1.5. It also doesn't occur for all remote csv.gz files, so it might be related to AWS. I wonder if there is a server setting that blocks reading and unpacking csv.gz? It is also odd that no error is returned.
library(readr)
packageVersion("readr")
#> [1] '2.1.0'
aws_csv_gz <- "https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/monthly/bejab/2023/bejab_vpts_202303.csv.gz"
readr::read_csv(aws_csv_gz)
#> Rows: 0 Columns: 26
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (26): radar, datetime, height, u, v, w, ff, dd, sd_vvp, gap, eta, dens, ...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 0 × 26
#> # ℹ 26 variables: radar <chr>, datetime <chr>, height <chr>, u <chr>, v <chr>,
#> # w <chr>, ff <chr>, dd <chr>, sd_vvp <chr>, gap <chr>, eta <chr>,
#> # dens <chr>, dbz <chr>, dbz_all <chr>, n <chr>, n_dbz <chr>, n_all <chr>,
#> # n_dbz_all <chr>, rcs <chr>, sd_vvp_threshold <chr>, vcp <chr>,
#> # radar_latitude <chr>, radar_longitude <chr>, radar_height <chr>,
#> # radar_wavelength <chr>, source_file <chr>
zenodo_aws_csv_gz <- "https://zenodo.org/records/5653311/files/O_ASSEN-gps-2018.csv.gz"
readr::read_csv(zenodo_aws_csv_gz)
#> Rows: 14893 Columns: 22
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): sensor-type, individual-taxon-canonical-name, study-name
#> dbl (14): event-id, location-long, location-lat, external-temperature, gps:...
#> lgl (4): visible, bar:barometric-pressure, import-marked-outlier, manually...
#> dttm (1): timestamp
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 14,893 × 22
#> `event-id` visible timestamp `location-long` `location-lat`
#> <dbl> <lgl> <dttm> <dbl> <dbl>
#> 1 20432407608 TRUE 2018-05-09 14:59:05 6.58 53.0
#> 2 20432407611 TRUE 2018-05-09 15:28:58 6.58 53.0
#> 3 20432407614 TRUE 2018-05-09 15:58:50 6.58 53.0
#> 4 20432407617 TRUE 2018-05-09 16:29:54 6.58 53.0
#> 5 20432407620 TRUE 2018-05-09 16:59:53 6.58 53.0
#> 6 20432407623 TRUE 2018-05-09 17:30:07 6.58 53.0
#> 7 20432407626 TRUE 2018-05-09 17:59:52 6.58 53.0
#> 8 20432407631 TRUE 2018-05-09 18:59:50 6.58 53.0
#> 9 20432407634 TRUE 2018-05-09 19:29:47 6.58 53.0
#> 10 20432407637 TRUE 2018-05-09 20:00:06 6.59 53.0
#> # ℹ 14,883 more rows
#> # ℹ 17 more variables: `bar:barometric-pressure` <lgl>,
#> # `external-temperature` <dbl>, `gps:dop` <dbl>, `gps:satellite-count` <dbl>,
#> # `gps-time-to-fix` <dbl>, `ground-speed` <dbl>, heading <dbl>,
#> # `height-above-msl` <dbl>, `import-marked-outlier` <lgl>,
#> # `location-error-numerical` <dbl>, `manually-marked-outlier` <lgl>,
#> # `vertical-error-numerical` <dbl>, `sensor-type` <chr>, …
Created on 2024-09-17 with reprex v2.1.0
I can replicate in 2.1.5.9000, also in vroom 1.6.5:
remote_file <-
"https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/monthly/bejab/2023/bejab_vpts_202303.csv.gz"
vroom::vroom(remote_file)
#> Rows: 0 Columns: 26
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (26): radar, datetime, height, u, v, w, ff, dd, sd_vvp, gap, eta, dens, ...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 0 × 26
#> # ℹ 26 variables: radar <chr>, datetime <chr>, height <chr>, u <chr>, v <chr>,
#> # w <chr>, ff <chr>, dd <chr>, sd_vvp <chr>, gap <chr>, eta <chr>,
#> # dens <chr>, dbz <chr>, dbz_all <chr>, n <chr>, n_dbz <chr>, n_all <chr>,
#> # n_dbz_all <chr>, rcs <chr>, sd_vvp_threshold <chr>, vcp <chr>,
#> # radar_latitude <chr>, radar_longitude <chr>, radar_height <chr>,
#> # radar_wavelength <chr>, source_file <chr>
Created on 2024-09-17 with reprex v2.1.0
However, the file reads just fine when it is downloaded to a temporary file first, and also with data.table::fread():
remote_file <-
  "https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/monthly/bejab/2023/bejab_vpts_202303.csv.gz"
read_remote_file <- function(remote_file) {
  # Fetch the raw (still gzipped) bytes of the remote file
  raw_response <-
    httr2::request(remote_file) |>
    httr2::req_perform() |>
    httr2::resp_body_raw()
  # Write the bytes to a local temporary file and read that instead
  temp_path <- tempfile()
  writeBin(raw_response, temp_path)
  readr::read_csv(temp_path)
}
read_remote_file(remote_file)
#> Rows: 162400 Columns: 26
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): radar, source_file
#> dbl (21): height, u, v, w, ff, dd, sd_vvp, eta, dens, dbz, dbz_all, n, n_db...
#> lgl (2): gap, vcp
#> dttm (1): datetime
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 162,400 × 26
#> radar datetime height u v w ff dd sd_vvp
#> <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 bejab 2023-03-03 22:00:00 0 NaN NaN NaN NaN NaN 4.77
#> 2 bejab 2023-03-03 22:00:00 0 NaN NaN NaN NaN NaN 4.59
#> 3 bejab 2023-03-03 22:00:00 0 NaN NaN NaN NaN NaN 4.83
#> 4 bejab 2023-03-03 22:00:00 200 2.46 -3.27 49.6 4.09 143. 4.63
#> 5 bejab 2023-03-03 22:00:00 200 5.08 -0.0754 16.7 5.08 90.9 4.91
#> 6 bejab 2023-03-03 22:00:00 200 1.80 0.815 28.9 1.97 65.6 4.56
#> 7 bejab 2023-03-03 22:00:00 400 1.57 0.738 5.90 1.74 64.9 3.42
#> 8 bejab 2023-03-03 22:00:00 400 1.02 -0.360 -4.19 1.08 110. 3.06
#> 9 bejab 2023-03-03 22:00:00 400 1.31 0.704 1.47 1.48 61.7 3.70
#> 10 bejab 2023-03-03 22:00:00 600 1.74 -0.335 1.48 1.77 101. 2.14
#> # ℹ 162,390 more rows
#> # ℹ 17 more variables: gap <lgl>, eta <dbl>, dens <dbl>, dbz <dbl>,
#> # dbz_all <dbl>, n <dbl>, n_dbz <dbl>, n_all <dbl>, n_dbz_all <dbl>,
#> # rcs <dbl>, sd_vvp_threshold <dbl>, vcp <lgl>, radar_latitude <dbl>,
#> # radar_longitude <dbl>, radar_height <dbl>, radar_wavelength <dbl>,
#> # source_file <chr>
data.table::fread(remote_file)
#> radar datetime height u v w ff
#> <char> <POSc> <int> <num> <num> <num> <num>
#> 1: bejab 2023-03-03 22:00:00 0 NaN NaN NaN NaN
#> 2: bejab 2023-03-03 22:00:00 0 NaN NaN NaN NaN
#> 3: bejab 2023-03-03 22:00:00 0 NaN NaN NaN NaN
#> 4: bejab 2023-03-03 22:00:00 200 2.455805 -3.2683022 49.64532 4.088126
#> 5: bejab 2023-03-03 22:00:00 200 5.075827 -0.0754339 16.73815 5.076387
#> ---
#> 162396: bejab 2023-03-31 23:45:00 4600 NaN NaN NaN NaN
#> 162397: bejab 2023-03-31 23:45:00 4600 NaN NaN NaN NaN
#> 162398: bejab 2023-03-31 23:45:00 4800 NaN NaN NaN NaN
#> 162399: bejab 2023-03-31 23:45:00 4800 NaN NaN NaN NaN
#> 162400: bejab 2023-03-31 23:45:00 4800 NaN NaN NaN NaN
#> dd sd_vvp gap eta dens dbz dbz_all
#> <num> <num> <lgcl> <num> <num> <num> <num>
#> 1: NaN 4.7692399 TRUE 107.288979 9.7535439 -5.262448 27.517876
#> 2: NaN 4.5863905 TRUE 124.144234 11.2858391 -4.628733 24.785967
#> 3: NaN 4.8284831 TRUE 155.976990 14.1797266 -3.637393 26.132660
#> 4: 143.07877 4.6271343 FALSE 61.823353 5.6203046 -7.656473 25.266855
#> 5: 90.85143 4.9096479 FALSE 53.051174 4.8228340 -8.321049 20.831184
#> ---
#> 162396: NaN NaN TRUE 4.426118 0.4023744 -19.107769 -3.006126
#> 162397: NaN NaN TRUE 10.359014 0.9417285 -15.414815 -3.090068
#> 162398: NaN 0.6628532 TRUE 16.421793 0.0000000 -13.413793 -3.526877
#> 162399: NaN 0.6688918 TRUE 7.906731 0.0000000 -16.588030 -3.623610
#> 162400: NaN 0.5773958 TRUE 11.081260 0.0000000 -15.122108 -4.020990
#> n n_dbz n_all n_dbz_all rcs sd_vvp_threshold vcp
#> <int> <int> <int> <int> <num> <num> <lgcl>
#> 1: 33 2207 178 8857 11 2 NA
#> 2: 44 2242 159 8854 11 2 NA
#> 3: 87 2484 177 8835 11 2 NA
#> 4: 224 18810 397 22903 11 2 NA
#> 5: 201 18990 398 22875 11 2 NA
#> ---
#> 162396: 33 648 689 1436 11 2 NA
#> 162397: 74 727 670 1435 11 2 NA
#> 162398: 178 1019 1174 2160 11 2 NA
#> 162399: 90 1037 1028 2159 11 2 NA
#> 162400: 138 1069 1047 2157 11 2 NA
#> radar_latitude radar_longitude radar_height radar_wavelength
#> <num> <num> <int> <num>
#> 1: 51.1917 3.0642 50 5.3
#> 2: 51.1917 3.0642 50 5.3
#> 3: 51.1917 3.0642 50 5.3
#> 4: 51.1917 3.0642 50 5.3
#> 5: 51.1917 3.0642 50 5.3
#> ---
#> 162396: 51.1917 3.0642 50 5.3
#> 162397: 51.1917 3.0642 50 5.3
#> 162398: 51.1917 3.0642 50 5.3
#> 162399: 51.1917 3.0642 50 5.3
#> 162400: 51.1917 3.0642 50 5.3
#> source_file
#> <char>
#> 1: bejab_vp_20230303T220000Z_0x9.h5
#> 2: bejab_vp_20230303T220500Z_0x9.h5
#> 3: bejab_vp_20230303T221000Z_0x9.h5
#> 4: bejab_vp_20230303T220000Z_0x9.h5
#> 5: bejab_vp_20230303T220500Z_0x9.h5
#> ---
#> 162396: bejab_vp_20230331T235000Z_0x9.h5
#> 162397: bejab_vp_20230331T235500Z_0x9.h5
#> 162398: bejab_vp_20230331T234500Z_0x9.h5
#> 162399: bejab_vp_20230331T235000Z_0x9.h5
#> 162400: bejab_vp_20230331T235500Z_0x9.h5
Created on 2024-09-17 with reprex v2.1.0
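Until the cause is found, a wrapper that falls back to downloading could guard against the silent empty result. This is only a sketch: `read_csv_with_fallback()` and its `reader` argument are names made up here, not part of readr or vroom. It assumes that zero rows from a URL means the direct read failed, so a genuinely empty file would trigger a needless download.

```r
# Hypothetical helper (not part of readr): try the direct read first and,
# if it silently returns zero rows, download the file and read it from disk.
read_csv_with_fallback <- function(url, reader = readr::read_csv, ...) {
  result <- reader(url, ...)
  if (nrow(result) > 0L) {
    return(result)
  }
  warning("Direct read of '", url, "' returned 0 rows; retrying from a local copy.")
  # Keep the .csv.gz extension so the reader can recognise the compression.
  temp_path <- tempfile(fileext = ".csv.gz")
  # mode = "wb" keeps the gzip stream intact on every platform.
  utils::download.file(url, temp_path, mode = "wb", quiet = TRUE)
  # The temporary file is left in place; R removes tempdir() at session end.
  reader(temp_path, ...)
}

# Usage (needs network access, so it is left commented out):
# read_csv_with_fallback(remote_file)
```

The `reader` argument is there so the same wrapper can be tried with `vroom::vroom` as well, since both packages show the behaviour.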
@PietrH I just did some experimentation. I don't think it is related to AWS as a server. I started a local server (on port 8080) and served both the original file and a recompressed copy from it:
require(vroom)
#> Loading required package: vroom
download.file("https://aloftdata.s3-eu-west-1.amazonaws.com/uva/monthly/bezav/2017/bezav_vpts_201708.csv.gz",'~/bezav_vpts_201708.csv.gz')
system("gunzip -k ~/bezav_vpts_201708.csv.gz")
system("gzip -kc ~/bezav_vpts_201708.csv > ~/bezav_vpts_201708.csv.recompress.gz")
# The original file reads fine from disk
vroom::vroom("~/bezav_vpts_201708.csv.gz")
#> Rows: 215175 Columns: 26
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): radar, source_file
#> dbl (21): height, u, v, w, ff, dd, sd_vvp, eta, dens, dbz, dbz_all, n, n_db...
#> lgl (2): gap, vcp
#> dttm (1): datetime
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 215,175 × 26
#> radar datetime height u v w ff dd sd_vvp
#> <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 bezav 2017-08-01 00:00:08 0 -2.39 -5.78 17.5 6.25 202. 2.92
#> 2 bezav 2017-08-01 00:00:08 200 -2.57 -7.29 2.20 7.73 199. 2.50
#> 3 bezav 2017-08-01 00:00:08 400 -4.33 -6.34 0.198 7.68 214. 2.57
#> 4 bezav 2017-08-01 00:00:08 600 -8.13 -4.22 -14.2 9.16 243. 3.99
#> 5 bezav 2017-08-01 00:00:08 800 -11.4 2.23 0.189 11.6 281. 4.53
#> 6 bezav 2017-08-01 00:00:08 1000 0.503 -7.48 16.4 7.50 176. 5.92
#> 7 bezav 2017-08-01 00:00:08 1200 0.172 -2.70 40.4 2.70 176. 4.75
#> 8 bezav 2017-08-01 00:00:08 1400 -8.36 -2.73 12.0 8.80 252. 6.25
#> 9 bezav 2017-08-01 00:00:08 1600 -9.63 -6.26 7.60 11.5 237. 5.66
#> 10 bezav 2017-08-01 00:00:08 1800 -1.01 0.244 36.3 1.03 284. 2.48
#> # ℹ 215,165 more rows
#> # ℹ 17 more variables: gap <lgl>, eta <dbl>, dens <dbl>, dbz <dbl>,
#> # dbz_all <dbl>, n <dbl>, n_dbz <dbl>, n_all <dbl>, n_dbz_all <dbl>,
#> # rcs <dbl>, sd_vvp_threshold <dbl>, vcp <lgl>, radar_latitude <dbl>,
#> # radar_longitude <dbl>, radar_height <dbl>, radar_wavelength <dbl>,
#> # source_file <chr>
# However if I read the file through a local server it fails
vroom::vroom("http://localhost:8080/bezav_vpts_201708.csv.gz")
#> Rows: 0 Columns: 26
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (26): radar, datetime, height, u, v, w, ff, dd, sd_vvp, gap, eta, dens, ...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 0 × 26
#> # ℹ 26 variables: radar <chr>, datetime <chr>, height <chr>, u <chr>, v <chr>,
#> # w <chr>, ff <chr>, dd <chr>, sd_vvp <chr>, gap <chr>, eta <chr>,
#> # dens <chr>, dbz <chr>, dbz_all <chr>, n <chr>, n_dbz <chr>, n_all <chr>,
#> # n_dbz_all <chr>, rcs <chr>, sd_vvp_threshold <chr>, vcp <chr>,
#> # radar_latitude <chr>, radar_longitude <chr>, radar_height <chr>,
#> # radar_wavelength <chr>, source_file <chr>
# The file compressed again seems to work fine again through the same local server
vroom::vroom("http://localhost:8080/bezav_vpts_201708.csv.recompress.gz")
#> Rows: 215175 Columns: 26
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): radar, source_file
#> dbl (21): height, u, v, w, ff, dd, sd_vvp, eta, dens, dbz, dbz_all, n, n_db...
#> lgl (2): gap, vcp
#> dttm (1): datetime
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 215,175 × 26
#> radar datetime height u v w ff dd sd_vvp
#> <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 bezav 2017-08-01 00:00:08 0 -2.39 -5.78 17.5 6.25 202. 2.92
#> 2 bezav 2017-08-01 00:00:08 200 -2.57 -7.29 2.20 7.73 199. 2.50
#> 3 bezav 2017-08-01 00:00:08 400 -4.33 -6.34 0.198 7.68 214. 2.57
#> 4 bezav 2017-08-01 00:00:08 600 -8.13 -4.22 -14.2 9.16 243. 3.99
#> 5 bezav 2017-08-01 00:00:08 800 -11.4 2.23 0.189 11.6 281. 4.53
#> 6 bezav 2017-08-01 00:00:08 1000 0.503 -7.48 16.4 7.50 176. 5.92
#> 7 bezav 2017-08-01 00:00:08 1200 0.172 -2.70 40.4 2.70 176. 4.75
#> 8 bezav 2017-08-01 00:00:08 1400 -8.36 -2.73 12.0 8.80 252. 6.25
#> 9 bezav 2017-08-01 00:00:08 1600 -9.63 -6.26 7.60 11.5 237. 5.66
#> 10 bezav 2017-08-01 00:00:08 1800 -1.01 0.244 36.3 1.03 284. 2.48
#> # ℹ 215,165 more rows
#> # ℹ 17 more variables: gap <lgl>, eta <dbl>, dens <dbl>, dbz <dbl>,
#> # dbz_all <dbl>, n <dbl>, n_dbz <dbl>, n_all <dbl>, n_dbz_all <dbl>,
#> # rcs <dbl>, sd_vvp_threshold <dbl>, vcp <lgl>, radar_latitude <dbl>,
#> # radar_longitude <dbl>, radar_height <dbl>, radar_wavelength <dbl>,
#> # source_file <chr>
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.0 (2024-04-24)
#> os Ubuntu 22.04.4 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Amsterdam
#> date 2024-10-15
#> pandoc 3.1.11 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> bit 4.5.0 2024-09-20 [1] CRAN (R 4.4.0)
#> bit64 4.5.2 2024-09-22 [1] CRAN (R 4.4.0)
#> cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)
#> crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.0)
#> curl 5.2.3 2024-09-20 [1] CRAN (R 4.4.0)
#> digest 0.6.35 2024-03-11 [1] CRAN (R 4.4.0)
#> evaluate 0.23 2023-11-01 [1] CRAN (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
#> fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
#> glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#> knitr 1.46 2024-04-06 [1] CRAN (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
#> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.4.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.4.0)
#> R.oo 1.26.0 2024-01-24 [1] CRAN (R 4.4.0)
#> R.utils 2.12.3 2023-11-18 [1] CRAN (R 4.4.0)
#> reprex 2.1.0 2024-01-11 [1] CRAN (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
#> rmarkdown 2.26 2024-03-05 [1] CRAN (R 4.4.0)
#> rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
#> styler 1.10.3 2024-04-07 [1] CRAN (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
#> tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
#> vroom * 1.6.5 2023-12-05 [1] CRAN (R 4.4.0)
#> withr 3.0.1 2024-07-31 [1] CRAN (R 4.4.0)
#> xfun 0.43 2024-03-25 [1] CRAN (R 4.4.0)
#> yaml 2.3.9 2024-07-05 [1] CRAN (R 4.4.0)
#>
#> [1] /home/bart/R/x86_64-pc-linux-gnu-library/4.4
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
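Since recompressing the file makes the problem go away, it may help to compare the gzip headers of the original and the recompressed file. A minimal sketch in base R: `gzip_header()` is a hypothetical helper, and it only looks at the first four bytes defined by the gzip format (magic number `1f 8b`, compression method, flags), so by itself it cannot show how the files differ further in.

```r
# Hypothetical helper: read the first four bytes of a file and report
# the gzip magic number, compression method and flag byte.
gzip_header <- function(path) {
  bytes <- readBin(path, what = "raw", n = 4L)
  list(
    # A gzip file starts with the two magic bytes 0x1f 0x8b.
    is_gzip = length(bytes) >= 2L &&
      identical(bytes[1:2], as.raw(c(0x1f, 0x8b))),
    # Third byte: compression method (8 means deflate).
    method = if (length(bytes) >= 3L) as.integer(bytes[3]) else NA_integer_,
    # Fourth byte: flags (optional header fields such as the file name).
    flags = if (length(bytes) >= 4L) as.integer(bytes[4]) else NA_integer_
  )
}

# Usage with the two files from the experiment above (left commented out
# because it depends on those files being present):
# gzip_header("~/bezav_vpts_201708.csv.gz")
# gzip_header("~/bezav_vpts_201708.csv.recompress.gz")
```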
read_csv() fails to return data when the file is a remote gzipped csv file. Reading the remote file directly incorrectly returns an empty data.frame, while downloading the file locally and then reading it in returns the correct data as expected.