`col_select` in `read_package()` does not support tidyselection #162

khusmann · 2023-11-11T18:55:07Z

I expected col_select in read_package to allow tidyselection like it's namesake arg in read_delim et al., but it looks like it only supports character vectors:

system.file("extdata", "datapackage.json", package="frictionless") %>%
read_package() %>%
read_resource("deployments", col_select=comments)

# Error in col_select %in% field_names : object 'comments' not found

system.file("extdata", "datapackage.json", package="frictionless") %>%
read_package() %>%
read_resource("deployments", col_select=starts_with("c"))

# Error in starts_with("c") : could not find function "starts_with"

The text was updated successfully, but these errors were encountered:

khusmann · 2023-11-11T19:15:25Z

Hmmm in read_resource is field_names basically the same as col_names?

field_names looks like it's only used with the col_select assertion at the top there, and is the only thing the tidyselect is breaking on. If you remove it entirely (lines 210-228), tidyselect works just fine (because it's directly passed to read_delim).

Was the all(col_select %in% field_names) assertion added to make the error message a little more readable? When I have the chance I'll look into the tidyselection stuff to see if there's a tidy way to make that assertion instead...

khusmann · 2023-11-12T20:42:00Z

I looked into this a bit more, and it looks like the cryptic error messages thrown by col_select on invalid selections are a vroom issue. Once that is resolved, we should be able to directly forward col_select to read_delim, and always get nice error messages. So I don't think we need to make the assertion on our end; we can just remove lines 210-228 to resolve this issue.

peterdesmet · 2023-11-13T08:52:58Z

Hi @khusmann, thanks for looking into this.

Note that the variable field_names contains the names of the fields in the schema (not the headers in the CSV).

Lines 212-228 checks up front if col_select only contains names that were defined in the schema. If that can be simplified and adapted to support tidyselection, great!
Indeed col_names is basically the same as field_names, but checked additionally to make sure there are no empty names (i.e. fields in the schema without a name). The code could be adapted to do that checking earlier (e.g. right after line 210 or even as part of get_schema() ). That would allow to use the variable field_names throughout (and not additionally define col_names).

khusmann · 2023-11-13T19:38:33Z

Thanks for the feedback!

I think the simplest way of doing this is to simply add .default=col_skip() to the col_types list in read_delim. Because col_types is created via the schema, this would prevent loading any columns in the csv that weren't also valid fields. Unfortunately, with the vroom issue I mention above, you get cryptic errors in the special case when the user selects no valid fields. If we want to throw our own errors instead, instead of waiting on vroom to have this fixed, lmk, and I can do that instead with our own tidyselection call.

Side note: right now it doesn't look like col_select applies on resources with read_from == "df" and read_from == "data"... do we want it to apply to all calls of read_resource, or just when reading CSVs? (I think all calls probably makes the most sense)

Hmmm, it looks like the "no empty names check" happens already in check_schema() (L20) which is called by get_schema(). So if I'm reading this correctly I think we can treat the name field as a guaranteed attribute of a well-formed schema, no need to check again... what do you think?

peterdesmet · 2024-03-27T07:44:34Z

There have been some updates in the code, so I'll summarize what needs to be done:

Don't define col_names, just use field_names throughout (item 2 in this and this comment: 44b2c55
Convert a tidyselection in col_select so frictionless_error_colselect_mismatch (L211-221) can deal with that. Alternatively and not preferred: drop the frictionless_error_colselect_mismatch check and handle vroom errors like Error in UseMethod("collector_value").
Write tests for this behaviour.
Extend col_select from CSV resources to all resources (including df and data). Described in Extend col_select to data.frame and data resources #197

peterdesmet added the function:read_package Function read_package() label Jan 26, 2024

peterdesmet changed the title ~~col_select in read_package does not support tidyselection~~ col_select in read_package() does not support tidyselection Mar 27, 2024

peterdesmet mentioned this issue Mar 27, 2024

Extend col_select to data.frame and data resources #197

Open

peterdesmet added this to the 1.2.0 milestone Jul 8, 2024

peterdesmet modified the milestones: 1.2.0, 1.3.0 Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`col_select` in `read_package()` does not support tidyselection #162

`col_select` in `read_package()` does not support tidyselection #162

khusmann commented Nov 11, 2023

khusmann commented Nov 11, 2023

khusmann commented Nov 12, 2023

peterdesmet commented Nov 13, 2023 •

edited

Loading

khusmann commented Nov 13, 2023

peterdesmet commented Mar 27, 2024 •

edited

Loading

col_select in read_package() does not support tidyselection #162

col_select in read_package() does not support tidyselection #162

Comments

khusmann commented Nov 11, 2023

khusmann commented Nov 11, 2023

khusmann commented Nov 12, 2023

peterdesmet commented Nov 13, 2023 • edited Loading

khusmann commented Nov 13, 2023

peterdesmet commented Mar 27, 2024 • edited Loading

`col_select` in `read_package()` does not support tidyselection #162

`col_select` in `read_package()` does not support tidyselection #162

peterdesmet commented Nov 13, 2023 •

edited

Loading

peterdesmet commented Mar 27, 2024 •

edited

Loading