compact syntax for summarizing #26

Open
moodymudskipper opened this issue Sep 9, 2020 · 6 comments
Comments

@moodymudskipper
Owner

moodymudskipper commented Sep 9, 2020

Note: largely outdated; keeping this around until all issues have moved to better places.

I feel both dplyr and data.table are too verbose for most summarizing operations.

Moreover, dplyr became very sophisticated with across, but it leads to summarize(across(where(...))) situations that I find awkward.

What about making these two equivalent:

starwars %.% {
  {
    max({"mass"; "birth_year"}, na.rm = TRUE)
    mean({is.integer}, na.rm = TRUE)
  } ~ sex + gender
}

starwars %>%
  group_by(sex, gender) %>%
  summarize(
    across(one_of("mass", "birth_year"), max, na.rm = TRUE),
    across(where(is.integer), mean, na.rm = TRUE)) %>%
  ungroup()

{} means "all", but it can be filled with symbols or literals, in which case it means "if" when it evaluates to a function, or "at" when it evaluates to a numeric or character vector.
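The "all" / "if" / "at" dispatch described above can be sketched in base R. This is only an illustration: resolve_selector is a hypothetical helper name, not an existing nakedpipe function, and it works on an already evaluated selector rather than on the {} syntax itself.

```r
# Hypothetical helper (not part of nakedpipe): turn an evaluated selector
# into the names of the columns it designates.
resolve_selector <- function(data, selector = NULL) {
  if (is.null(selector)) {
    # empty selector: "all" columns
    return(names(data))
  }
  if (is.function(selector)) {
    # "if": keep the columns for which the predicate returns TRUE
    keep <- vapply(data, function(col) isTRUE(selector(col)), logical(1))
    return(names(data)[keep])
  }
  if (is.character(selector)) {
    # "at", by name
    unknown <- setdiff(selector, names(data))
    if (length(unknown) > 0) {
      stop("Unknown columns: ", paste(unknown, collapse = ", "))
    }
    return(selector)
  }
  if (is.numeric(selector)) {
    # "at", by position
    return(names(data)[selector])
  }
  stop("Unsupported selector type")
}

df <- data.frame(
  a = 1:3,
  b = c(1.5, 2.5, 3.5),
  c = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

resolve_selector(df)               # all columns
resolve_selector(df, is.integer)   # columns satisfying the predicate
resolve_selector(df, c("a", "b"))  # columns by name
resolve_selector(df, 2)            # columns by position
```

Dispatching on the type of the evaluated value keeps a single entry point for the three behaviors, at the cost of making the meaning depend on what the expression evaluates to.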

{ already has several meanings, and that's the flaw:

  • There's a first { to start the pipe
  • Then { is used to turn off implicit dots
  • If used before ~, then it still turns off implicit dots but means we'll summarize
  • Inside of the latter, if it contains only symbols or literals, the special behavior applies.

Also, the use of ; is really unorthodox.

But all in all I think it might be worth it.

The last post in #21 proposes some simpler summarize behaviors to accompany it.

@moodymudskipper
Owner Author

Rather than {}, it would be better to use ?:

starwars %.% {
  {
    max(?c("mass", "birth_year"), na.rm = TRUE)
    mean(?is.integer, na.rm = TRUE)
  } ~ sex + gender
}

To say "all" we can keep the {} syntax or use ?names(.), which isn't horrible to type.

Leveraging tidyselect if it's installed would be nice.
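To make the ? idea concrete, here is a minimal base R sketch of how a call such as max(?c("mass", "birth_year"), na.rm = TRUE) could be expanded into one call per selected column. It relies on the fact that R parses ?x as a call to the `?` operator. summarize_call is a hypothetical name, the toy data frame is made up, and grouping is deliberately left out.

```r
# Hypothetical helper (not part of nakedpipe): evaluate `fun(?selector, ...)`
# once per selected column of `data`. Grouping is not handled here.
summarize_call <- function(data, call) {
  env <- parent.frame()
  args <- as.list(call)[-1]
  # find the argument that is a call to `?`
  is_q <- vapply(
    args,
    function(arg) is.call(arg) && identical(arg[[1]], as.name("?")),
    logical(1)
  )
  stopifnot(sum(is_q) == 1)
  pos <- unname(which(is_q)) + 1L  # element 1 of a call is the function itself
  # evaluate what follows `?`, e.g. c("mass", "birth_year") or is.integer
  selector <- eval(call[[pos]][[2]], env)
  cols <- if (is.function(selector)) {
    names(data)[vapply(data, selector, logical(1))]  # "if" behavior
  } else {
    selector                                         # "at" behavior, by name
  }
  out <- lapply(cols, function(col) {
    call[[pos]] <- as.name(col)  # replace `?selector` with the column symbol
    eval(call, data, env)
  })
  names(out) <- cols
  out
}

toy <- data.frame(
  mass = c(10, NA, 30),
  birth_year = c(5, 7, NA),
  height = c(1L, 2L, 3L)
)

by_name <- summarize_call(toy, quote(max(?c("mass", "birth_year"), na.rm = TRUE)))
by_type <- summarize_call(toy, quote(mean(?is.integer, na.rm = TRUE)))
```

Here by_name holds one maximum per named column, and by_type holds the mean of the only integer column, height.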

@moodymudskipper
Owner Author

A problem with the above syntax arises when the input is used several times in the function.

In these cases we could use ?. to avoid repeating the ?c("mass", "birth_year") expression.

@moodymudskipper
Owner Author

moodymudskipper commented Sep 9, 2020

The other problem is how to handle this with the debugging pipe.

I think we should translate the following:

{
  max(?c("mass", "birth_year"), na.rm = TRUE)
  mean(?is.integer, na.rm = TRUE)
} ~ sex + gender

to :

. <- nakedpipe::np_summarize(
  data = .,
  expr = { max(?c("mass", "birth_year"), na.rm = TRUE); mean(?is.integer, na.rm = TRUE) },
  by = sex + gender)

We would then do our development inside of compute_by_group, so that what the debugging pipe shows is what is really happening, and the standard syntax maps to it.

We could also have a function np_step, which takes the data as its first argument and a nakedpipe step expression as its second. It would work as a placeholder to help with the technical debt caused by the debugging and translating features.

@moodymudskipper
Owner Author

? can also be used on the rhs, and then we get the features of group_by_at and group_by_if.

@moodymudskipper
Owner Author

Some random ideas about the rhs:

  • While aggregate.formula documents only the use of +, other operators just work the same; we could make them work differently, though
  • Using only -, we group by all columns except ...
  • Using *, we create missing combinations and fill computed columns with NAs
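The - and * ideas above can be sketched in base R as follows. Both function names (grouping_columns and complete_groups) are hypothetical, the toy data is made up, and the operator handling is an assumption about how such a feature might behave, not existing nakedpipe behavior.

```r
# Hypothetical helper: derive grouping columns from the rhs of a formula.
# If the rhs uses only `-`, group by every column except the listed ones.
grouping_columns <- function(data, by) {
  rhs <- by[[length(by)]]                 # rhs of a one- or two-sided formula
  vars <- all.vars(rhs)                   # column names mentioned on the rhs
  ops <- setdiff(all.names(rhs), vars)    # operators used on the rhs
  if (length(ops) > 0 && all(ops == "-")) {
    setdiff(names(data), vars)
  } else {
    vars
  }
}

# Hypothetical helper for the `*` idea: add the missing combinations of the
# grouping columns to a summary table, leaving computed columns as NA.
complete_groups <- function(summary, by_cols) {
  grid <- expand.grid(lapply(summary[by_cols], unique), stringsAsFactors = FALSE)
  merge(grid, summary, by = by_cols, all.x = TRUE)
}

people <- data.frame(
  sex = c("f", "m", "m"),
  gender = c("a", "a", "b"),
  mass = c(1, 2, 3),
  stringsAsFactors = FALSE
)

grouping_columns(people, ~ sex + gender)  # the listed columns
grouping_columns(people, ~ -mass)         # every column except mass

# `people` already has one row per observed (sex, gender) pair, so it can
# stand in for a summary table; the (f, b) combination is filled with NA.
completed <- complete_groups(people, c("sex", "gender"))
```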
