Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

col_select in read_package() does not support tidyselection #162

Open
khusmann opened this issue Nov 11, 2023 · 5 comments
Open

col_select in read_package() does not support tidyselection #162

khusmann opened this issue Nov 11, 2023 · 5 comments
Labels
function:read_package Function read_package()
Milestone

Comments

@khusmann
Copy link
Contributor

I expected col_select in read_package to allow tidyselection like it's namesake arg in read_delim et al., but it looks like it only supports character vectors:

system.file("extdata", "datapackage.json", package="frictionless") %>%
read_package() %>%
read_resource("deployments", col_select=comments)

# Error in col_select %in% field_names : object 'comments' not found

system.file("extdata", "datapackage.json", package="frictionless") %>%
read_package() %>%
read_resource("deployments", col_select=starts_with("c"))

# Error in starts_with("c") : could not find function "starts_with"
@khusmann
Copy link
Contributor Author

Hmmm in read_resource is field_names basically the same as col_names?

field_names looks like it's only used with the col_select assertion at the top there, and is the only thing the tidyselect is breaking on. If you remove it entirely (lines 210-228), tidyselect works just fine (because it's directly passed to read_delim).

Was the all(col_select %in% field_names) assertion added to make the error message a little more readable? When I have the chance I'll look into the tidyselection stuff to see if there's a tidy way to make that assertion instead...

@khusmann
Copy link
Contributor Author

I looked into this a bit more, and it looks like the cryptic error messages thrown by col_select on invalid selections are a vroom issue. Once that is resolved, we should be able to directly forward col_select to read_delim, and always get nice error messages. So I don't think we need to make the assertion on our end; we can just remove lines 210-228 to resolve this issue.

@peterdesmet
Copy link
Member

peterdesmet commented Nov 13, 2023

Hi @khusmann, thanks for looking into this.

Note that the variable field_names contains the names of the fields in the schema (not the headers in the CSV).

  1. Lines 212-228 checks up front if col_select only contains names that were defined in the schema. If that can be simplified and adapted to support tidyselection, great!
  2. Indeed col_names is basically the same as field_names, but checked additionally to make sure there are no empty names (i.e. fields in the schema without a name). The code could be adapted to do that checking earlier (e.g. right after line 210 or even as part of get_schema() ). That would allow to use the variable field_names throughout (and not additionally define col_names).

@khusmann
Copy link
Contributor Author

Thanks for the feedback!

  1. I think the simplest way of doing this is to simply add .default=col_skip() to the col_types list in read_delim. Because col_types is created via the schema, this would prevent loading any columns in the csv that weren't also valid fields. Unfortunately, with the vroom issue I mention above, you get cryptic errors in the special case when the user selects no valid fields. If we want to throw our own errors instead, instead of waiting on vroom to have this fixed, lmk, and I can do that instead with our own tidyselection call.

Side note: right now it doesn't look like col_select applies on resources with read_from == "df" and read_from == "data"... do we want it to apply to all calls of read_resource, or just when reading CSVs? (I think all calls probably makes the most sense)

  1. Hmmm, it looks like the "no empty names check" happens already in check_schema() (L20) which is called by get_schema(). So if I'm reading this correctly I think we can treat the name field as a guaranteed attribute of a well-formed schema, no need to check again... what do you think?

@peterdesmet peterdesmet added the function:read_package Function read_package() label Jan 26, 2024
@peterdesmet
Copy link
Member

peterdesmet commented Mar 27, 2024

There have been some updates in the code, so I'll summarize what needs to be done:

  • Don't define col_names, just use field_names throughout (item 2 in this and this comment: 44b2c55
  • Convert a tidyselection in col_select so frictionless_error_colselect_mismatch (L211-221) can deal with that. Alternatively and not preferred: drop the frictionless_error_colselect_mismatch check and handle vroom errors like Error in UseMethod("collector_value").
  • Write tests for this behaviour.
  • Extend col_select from CSV resources to all resources (including df and data). Described in Extend col_select to data.frame and data resources #197

@peterdesmet peterdesmet changed the title col_select in read_package does not support tidyselection col_select in read_package() does not support tidyselection Mar 27, 2024
@peterdesmet peterdesmet added this to the 1.2.0 milestone Jul 8, 2024
@peterdesmet peterdesmet modified the milestones: 1.2.0, 1.3.0 Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
function:read_package Function read_package()
Projects
None yet
Development

No branches or pull requests

2 participants