Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added the parameter "columns" to remove_constant #458

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mgacc0
Copy link

@mgacc0 mgacc0 commented Sep 10, 2021

The new parameter "columns" to remove_constant specifies which columns to check.

The default is to check all columns.
But since we don't know initially the names of the columns, it is expressed inversely as "c()".

I would prefer that "columns" could be the second parameter (instead of the last) so it would be possible to write

   df1 %>%
    remove_constant(c("col2", "col3"))

instead of having to write

   df1 %>%
    remove_constant(columns = c("col2", "col3"))

But adding a second parameter could break compatibility for some users if they had code like

   df1 %>%
    remove_constant(TRUE)

meaning

   df1 %>%
    remove_constant(na.rm = TRUE)

The new parameter "columns" to `remove_constant` specifies which columns to check.


The default is to check all columns.
But since we don't know initially the names of the columns, it is expressed inversely as "c()".

I would prefer that "columns" could be the second parameter (instead of the last) so it would be possible to write
```r
   df1 %>%
    remove_constant(c("col2", "col3"))
```
instead of having to write
```r
   df1 %>%
    remove_constant(columns = c("col2", "col3"))
```
But adding a second parameter could break compatibility for some users if they had code like
```r
   df1 %>%
    remove_constant(TRUE)
```
meaning
```r
   df1 %>%
    remove_constant(na.rm = TRUE)
```
@billdenney
Copy link
Collaborator

Thanks for the PR. I agree that the feature is useful.

I have two comments on the code:

  1. Please ensure that it is an error if the column name or number is not in the input dataset. People make typographical errors, and making it an error to have an invalid column name will prevent that error. (And, add a test for that error with both numbers and names.)
  2. Please only run the tests for uniqueness on the selected column names rather than on all columns. When working with bigger datasets, that can make a difference in runtime.

And, please do keep it as the last argument to prevent an unnecessary break to backward compatibility. (My bias is to name all arguments in my code because sometimes people don't keep the order the same, but that's just me.)

@sfirke
Copy link
Owner

sfirke commented Sep 11, 2021 via email

@jzadra
Copy link
Contributor

jzadra commented Sep 12, 2021 via email

@sfirke
Copy link
Owner

sfirke commented Jan 4, 2023

I am finally reviewing this (sorry). I agree with the above comments: this is a useful PR, thank you! Please:

  • keep columns as the last argument
  • use dplyr::select() on an argument ... . That way tidyselect does all the work, you won't need the lines where you check whether columns are in the data.frame
  • subset the data.frame with dplyr::select() up front so that as Bill notes, only the selected columns are analyzed for uniqueness

I know this has been dormant for a while. I'll leave it open in case you want to finish it at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants