added the parameter "columns" to remove_constant #458

mgacc0 · 2021-09-10T23:10:54Z

The new parameter "columns" to remove_constant specifies which columns to check.

The default is to check all columns.
Because the column names are not known in advance, that default is expressed as the empty vector "c()", which is interpreted as "all columns".

I would prefer that "columns" could be the second parameter (instead of the last) so it would be possible to write

   df1 %>%
    remove_constant(c("col2", "col3"))

instead of having to write

   df1 %>%
    remove_constant(columns = c("col2", "col3"))

But adding a second parameter could break compatibility for some users if they had code like

   df1 %>%
    remove_constant(TRUE)

meaning

   df1 %>%
    remove_constant(na.rm = TRUE)
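
To make the compatibility concern concrete, here is a minimal sketch of R's positional argument matching. Both functions are hypothetical stand-ins that only report how their arguments were matched; they are not janitor's code, and their signatures are assumptions based on the calls shown above.

```r
# Hypothetical stand-in: "columns" is appended after na.rm.
columns_last <- function(dat, na.rm = FALSE, columns = c()) {
  list(na.rm = na.rm, columns = columns)
}

# Hypothetical stand-in: "columns" is placed in second position.
columns_second <- function(dat, columns = c(), na.rm = FALSE) {
  list(na.rm = na.rm, columns = columns)
}

df1 <- data.frame(col1 = c(1, 1), col2 = c("a", "b"))

# An existing positional call keeps its meaning only when "columns" is last.
str(columns_last(df1, TRUE))    # TRUE is matched to na.rm
str(columns_second(df1, TRUE))  # TRUE is matched to columns, na.rm stays FALSE
```

With "columns" in second position the old call still runs, so the change in meaning would not be flagged by R itself.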


billdenney · 2021-09-10T23:47:26Z

Thanks for the PR. I agree that the feature is useful.

I have two comments on the code:

1. Please ensure that it is an error if the column name or number is not in the input dataset. People make typographical errors, and making it an error to have an invalid column name will prevent that error. (And, add a test for that error with both numbers and names.)
2. Please only run the tests for uniqueness on the selected column names rather than on all columns. When working with bigger datasets, that can make a difference in runtime.

And, please do keep it as the last argument to prevent an unnecessary break to backward compatibility. (My bias is to name all arguments in my code because sometimes people don't keep the order the same, but that's just me.)
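
The two requests above could look roughly like the following base R sketch. The function name remove_constant_checked, the NULL default, and the error messages are assumptions made for illustration; this is not the code in the PR or in janitor.

```r
# Hypothetical helper: validates "columns", then tests only those columns.
remove_constant_checked <- function(dat, na.rm = FALSE, columns = NULL) {
  if (is.null(columns)) {
    columns <- names(dat)  # default: check every column
  } else if (is.character(columns)) {
    missing_cols <- setdiff(columns, names(dat))
    if (length(missing_cols) > 0) {
      stop("columns not found in data: ", paste(missing_cols, collapse = ", "))
    }
  } else if (is.numeric(columns)) {
    bad_idx <- columns[!(columns %in% seq_along(dat))]
    if (length(bad_idx) > 0) {
      stop("column numbers out of range: ", paste(bad_idx, collapse = ", "))
    }
    columns <- names(dat)[columns]
  } else {
    stop("columns must be a character or numeric vector")
  }

  # Uniqueness is tested on the selected columns only.
  is_constant <- vapply(dat[columns], function(x) {
    if (na.rm) x <- x[!is.na(x)]
    length(unique(x)) <= 1
  }, logical(1))

  dat[, setdiff(names(dat), columns[is_constant]), drop = FALSE]
}

df1 <- data.frame(col1 = c(1, 1), col2 = c("a", "a"), col3 = c(1, 2))
names(remove_constant_checked(df1, columns = c("col2", "col3")))

# A misspelled name or an out-of-range number stops with an error.
try(remove_constant_checked(df1, columns = "colX"), silent = TRUE)
try(remove_constant_checked(df1, columns = 9), silent = TRUE)
```

In this sketch col1 is constant but is left in place because it was not selected for checking.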

sfirke · 2021-09-11T01:53:37Z

Thank you both, for the PR and for the review! I'm not able to look at this right now but I see the discussion about testing variable selection. If this uses tidyselect specifications, which is ideal, that has its own handling of invalid column names that we could piggyback on rather than reinvent. I recently added column selection to adorn functions in janitor, as an example.


jzadra · 2021-09-12T17:21:29Z

I would also like to suggest using the tidyverse select functionality rather than quoted column names. Thanks, Jon


-- Jonathan Zadra, PhD (he/him) Director, Data Science Sorenson Impact Center David Eccles School of Business, University of Utah www.sorensonimpact.com (801) 581-4815


sfirke · 2023-01-04T16:27:53Z

I am finally reviewing this (sorry). I agree with the above comments: this is a useful PR, thank you! Please:

- Keep columns as the last argument.
- Use dplyr::select() on a ... argument. That way tidyselect does all the work, and you won't need the lines that check whether the columns are in the data.frame.
- Subset the data.frame with dplyr::select() up front so that, as Bill notes, only the selected columns are analyzed for uniqueness.

I know this has been dormant for a while. I'll leave it open in case you want to finish it at some point.
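
One way the requested approach might look, as a rough sketch only: the function name remove_constant_sketch is hypothetical, the fallback to all columns when nothing is selected is an assumption, and the dplyr package is assumed to be installed. It is not the code that was merged or proposed.

```r
library(dplyr)

# Hypothetical sketch: the selection is passed through ... as the last argument.
remove_constant_sketch <- function(dat, na.rm = FALSE, ...) {
  # Assumption: with no selection supplied, every column is checked.
  if (...length() == 0) {
    candidates <- dat
  } else {
    # tidyselect (via dplyr::select) raises an error for unknown columns.
    candidates <- dplyr::select(dat, ...)
  }

  # Only the selected columns are analyzed for uniqueness.
  is_constant <- vapply(candidates, function(x) {
    if (na.rm) x <- x[!is.na(x)]
    length(unique(x)) <= 1
  }, logical(1))

  constant_names <- names(candidates)[is_constant]
  dat[, setdiff(names(dat), constant_names), drop = FALSE]
}

df1 <- data.frame(col1 = c(1, 1), col2 = c("a", "a"), col3 = c(1, 2))
names(remove_constant_sketch(df1, col2, col3))
```

Because the selection goes through dplyr::select(), helpers such as starts_with() should also work in place of bare column names.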

mgacc0 added 2 commits September 11, 2021 01:04

Update test-remove-empties.R

cfa5b34
