vroom_fwf not reading the first line? #503

Open
sysilviakim opened this issue Jul 25, 2023 · 2 comments

Comments


sysilviakim commented Jul 25, 2023

Hello team,

Thank you for the amazing package! I have an import issue. I'm sure I'm doing something silly, but I can't quite figure it out. Both read_fwf and vroom_fwf return data that lack one line (the first line, to be precise) when importing fixed-width files. Edit: this only happens when n_max is set to a value other than Inf, or when the file is a local file on a Windows machine: see Stack Overflow post here.

There are two files: the fixed-width file (vroom_fwf_test.txt) and a CSV file (test.csv) whose columns supply the field names and widths.

Suppose that both files are in the root directory. The code I used (part of a larger codebase) is:

library(dplyr)
library(vroom)
library(data.table)

test <- fread(
  "test.csv",
  strip.white = TRUE, header = FALSE, blank.lines.skip = TRUE
) %>%
  filter(!is.na(V2)) %>%
  mutate(V1 = gsub(" |\\(", ".", gsub("\\)", "", V1)))
  
## gives one line
vroom::vroom_fwf(
  "vroom_fwf_test.txt", fwf_widths(test$V3, test$V1),
  n_max = 1000, col_types = cols(.default = "c"), id = "file_name"
)

This will only produce one row of data. But there are two lines in this raw file, as evidenced by

writeLines(readr::read_lines("vroom_fwf_test.txt")) ## two lines

which produces two lines as expected. If I leave only one line in the raw data, it'll produce zero imported rows.

Now, if n_max = Inf as in the default, it's fine:

## gives two lines as it should
vroom::vroom_fwf(
  "vroom_fwf_test.txt", fwf_widths(test$V3, test$V1),
  n_max = Inf, col_types = cols(.default = "c"), id = "file_name"
)
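
The comparison above can be reduced to a self-contained sketch that does not depend on the attached files. The two-line file, the widths, and the column names below are made up for illustration; only vroom_fwf, fwf_widths, and cols come from vroom. It counts the raw lines and the rows returned under each n_max setting, so any mismatch shows up in one place.

```r
library(vroom)

# Hypothetical two-line fixed-width file (not the file attached to this issue)
path <- tempfile(fileext = ".txt")
writeLines(c("AAA111", "BBB222"), path)

# Made-up layout: two fields of width 3
positions <- fwf_widths(c(3, 3), c("id", "value"))

# Number of rows vroom_fwf returns for a given n_max
read_n <- function(n_max) {
  nrow(vroom_fwf(
    path, positions,
    n_max = n_max, col_types = cols(.default = "c"), id = "file_name"
  ))
}

counts <- c(
  raw_lines  = length(readLines(path)),
  n_max_inf  = read_n(Inf),
  n_max_1000 = read_n(1000)
)
counts
```

If the problem reproduces, n_max_1000 would be smaller than raw_lines while n_max_inf matches it; on a setup without the problem all three counts agree.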

Even with n_max = 1000, the following works fine when the same files are read directly from the copies uploaded to GitHub:

## gives two lines as it should
vroom::vroom_fwf(
  file = "https://github.com/tidyverse/vroom/files/12156789/vroom_fwf_test.txt",
  col_positions = with(
    read.csv(
      "https://github.com/tidyverse/vroom/files/12156786/test.csv",
      header = FALSE
    ), vroom::fwf_widths(V3, V1)
  ),
  n_max = 1000,
  col_types = vroom::cols(.default = "c"),
  id = "file_name"
)

I am not sure where I've gone wrong. They are literally the same files (attached here as [test.csv](https://github.com/tidyverse/vroom/files/12156786/test.csv) and [vroom_fwf_test.txt](https://github.com/tidyverse/vroom/files/12156789/vroom_fwf_test.txt)), and I've checked that col_positions is not the problem. Perhaps it is a line-ending issue? My session info is as follows:

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] sf_1.0-9           censusxy_1.1.1     tidygeocoder_1.0.5
 [4] foreign_0.8-83     lubridate_1.9.0    timechange_0.1.1  
 [7] data.table_1.14.8  vroom_1.6.0        janitor_2.1.0     
[10] readxl_1.4.1       assertthat_0.2.1   here_1.0.1        
[13] stringi_1.7.8      forcats_0.5.2      stringr_1.5.0     
[16] dplyr_1.1.0        purrr_1.0.0        readr_2.1.3       
[19] tidyr_1.2.1        tibble_3.1.8       ggplot2_3.4.0     
[22] tidyverse_1.3.2    plyr_1.8.8         MASS_7.3-58.1     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9          class_7.3-20        rprojroot_2.0.3    
 [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
 [7] backports_1.4.1     reprex_2.0.2        e1071_1.7-12       
[10] httr_1.4.4          pillar_1.8.1        rlang_1.0.6        
[13] googlesheets4_1.0.1 rstudioapi_0.14     googledrive_2.0.0  
[16] bit_4.0.5           munsell_0.5.0       proxy_0.4-27       
[19] broom_1.0.2         compiler_4.2.2      modelr_0.1.10      
[22] pkgconfig_2.0.3     tidyselect_1.2.0    fansi_1.0.3        
[25] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1       
[28] withr_2.5.0         grid_4.2.2          jsonlite_1.8.4     
[31] gtable_0.3.1        lifecycle_1.0.3     DBI_1.1.3          
[34] magrittr_2.0.3      units_0.8-1         scales_1.2.1       
[37] KernSmooth_2.23-20  cli_3.6.0           renv_0.16.0        
[40] fs_1.5.2            snakecase_0.11.0    xml2_1.3.3         
[43] ellipsis_0.3.2      generics_0.1.3      vctrs_0.5.2        
[46] tools_4.2.2         bit64_4.0.5         glue_1.6.2         
[49] hms_1.1.2           parallel_4.2.2      colorspace_2.0-3   
[52] gargle_1.2.1        classInt_0.4-8      rvest_1.0.3        
[55] haven_2.5.1   
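
One way to probe the line-ending guess above is to look at the raw bytes of the local file. The helper below, inspect_line_endings, is a hypothetical name and uses base R only; it counts CRLF versus bare LF endings and reports whether the file ends with a newline. Neither property is confirmed as the cause here, so this is a diagnostic sketch rather than a fix.

```r
# Summarise line endings and the final byte of a file (base R only).
# inspect_line_endings is a hypothetical helper, not part of vroom or readr.
inspect_line_endings <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  lf <- which(bytes == as.raw(0x0a))
  # Byte immediately before each LF (a dummy 0x00 stands in before byte 1)
  prev <- c(as.raw(0x00), bytes)[lf]
  n_crlf <- sum(prev == as.raw(0x0d))
  list(
    n_crlf = n_crlf,
    n_lf_only = length(lf) - n_crlf,
    ends_with_newline = length(bytes) > 0 &&
      bytes[length(bytes)] == as.raw(0x0a)
  )
}

# Made-up example: two records, CRLF between them, no trailing newline
tmp <- tempfile()
writeBin(charToRaw("AAA111\r\nBBB222"), tmp)
res <- inspect_line_endings(tmp)
str(res)
```

Running inspect_line_endings("vroom_fwf_test.txt") on the local copy and on a freshly downloaded copy would show whether the two really are byte-for-byte alike in these respects.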

Has anybody encountered a similar problem? Thank you very much.

Member

jennybc commented Sep 29, 2023

Do the people giving this a thumbs up also have different examples of this problem? If so, please share! @bernardlf @jay-sf


jay-sf commented Oct 4, 2023

@jennybc Found an alternative solution here: https://stackoverflow.com/a/76759892/6574038
Cheers! J
