Warning message with build_index #1

Closed
wgmmaas opened this issue Apr 2, 2024 · 3 comments

Comments


wgmmaas commented Apr 2, 2024

Hi @lecy et al.,

Thanks for your work on this package. I get a warning message that I did not get before:

> index <- build_index(tax.years = 2019)

Warning message:
One or more parsing issues, call problems() on your data frame for
details, e.g.:
dat <- vroom(...)
problems(dat)

What could be the reason for the warning? And is it safe to ignore it, as I end up with the index of 523,999 observations (only two observations short of the 524,001 it should find for 2019 according to the README)?
Thanks, Wim

Member

lecy commented Apr 3, 2024

I am not familiar with the warning, but I suspect it is from the readr package and probably related to data types.

See: tidyverse/readr#1477

Or potentially dplyr when the disaggregated data frames are being stacked.

I suspect it's harmless - for example integers and doubles mixing, which impacts representation in memory in R but would not change how the data would appear once written to a CSV file.
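
A minimal sketch of that integer/double case, using made-up data frames rather than the real index: the two columns differ in storage type inside R but produce the same lines once written to CSV.

```r
# Two toy data frames: same values, different storage types (illustrative only)
a <- data.frame(n = 1:3)          # integer column
b <- data.frame(n = c(1, 2, 3))   # double column

class(a$n)
class(b$n)

# Write both to temporary CSV files and compare the resulting text
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(a, f1, row.names = FALSE)
write.csv(b, f2, row.names = FALSE)

same_on_disk <- identical(readLines(f1), readLines(f2))
same_on_disk
```

This only covers the integer/double situation described above; other type mismatches can behave differently.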

But please let me know if you discover otherwise.

Author

wgmmaas commented Apr 3, 2024

Thanks Jesse, you are correct. It is a parsing problem in readr. It is guessing the "LegalDomicileCountry" column type incorrectly (see below). As this does not affect the rest of my application, I will ignore it. Thanks.

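# Download the 2019 index directly and inspect readr's parsing problems
URL <- paste0("https://nccs-efile.s3.us-east-1.amazonaws.com/index/data-commons-efile-index-", 2019, ".csv")
d <- readr::read_csv(URL, show_col_types = FALSE)
# problems() is called with its namespace because readr is not attached
parsing_problems <- readr::problems(d)
if (nrow(parsing_problems) > 0) {
  print(parsing_problems)
}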

> print(parsing_problems)
# A tibble: 181 x 5
     row   col expected           actual file 
   <int> <int> <chr>              <chr>  <chr>
 1  1819    13 1/0/T/F/TRUE/FALSE CA     ""   
 2  3225    13 1/0/T/F/TRUE/FALSE NI     ""   
 3  5076    13 1/0/T/F/TRUE/FALSE CA     ""   
 4  5078    13 1/0/T/F/TRUE/FALSE CJ     ""   
 5  5502    13 1/0/T/F/TRUE/FALSE CA     ""   
 6  7666    13 1/0/T/F/TRUE/FALSE HO     ""   
 7  8408    13 1/0/T/F/TRUE/FALSE CA     ""   
 8  9305    13 1/0/T/F/TRUE/FALSE UK     ""   
 9 14025    13 1/0/T/F/TRUE/FALSE AU     ""   
10 21681    13 1/0/T/F/TRUE/FALSE BD     ""   
# i 171 more rows
# i Use `print(n = ...)` to see more rows
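
For anyone who stays on readr, here is a minimal sketch of one way to avoid the type guess: declare every column as character up front. The toy file and its rows are made up for illustration; only the LegalDomicileCountry column name comes from the output above, and EIN is an assumed second column. Note that readr stores NA for values it fails to parse, so it is worth checking whether the affected column matters to you.

```r
# Toy file standing in for the index CSV (rows are invented for illustration)
path <- tempfile(fileext = ".csv")
writeLines(c(
  "EIN,LegalDomicileCountry",
  "010000001,",
  "010000002,CA",
  "010000003,UK"
), path)

# Declare all columns as character so readr does not guess any types
d <- readr::read_csv(
  path,
  col_types = readr::cols(.default = readr::col_character())
)

# Count the parsing problems recorded for this read
n_problems <- nrow(readr::problems(d))
n_problems

d$LegalDomicileCountry
```

Reading everything as character also keeps leading zeros in identifier-like columns; numeric columns can be converted afterwards with as.integer() or as.numeric() where needed.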

Edit: I updated to the newest version, which uses data.table, and I no longer get the warning. Thanks!

@wgmmaas wgmmaas closed this as completed Apr 3, 2024
Member

lecy commented Apr 3, 2024

Ok, great. And yes, I updated the build_index() function so that all columns are loaded as strings (character vectors). Glad it worked!
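
For readers who want to see the idea in isolation, a rough sketch of loading every column as character with data.table follows. This is not the actual build_index() code; the toy file, its rows, and the EIN column are assumptions made for illustration.

```r
# Toy file standing in for the index CSV (rows are invented for illustration)
path <- tempfile(fileext = ".csv")
writeLines(c(
  "EIN,LegalDomicileCountry",
  "010000001,",
  "010000002,CA"
), path)

# colClasses = "character" is applied to every column, so no types are guessed
idx <- data.table::fread(path, colClasses = "character")

# Named character vector with the class of each column
col_classes <- sapply(idx, class)
col_classes
```

With all columns read as strings there is nothing for the reader to misguess; any column that needs to be numeric can be converted explicitly after loading.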
