data manipulation brainstorm #29

Open
moodymudskipper opened this issue Sep 18, 2020 · 1 comment

moodymudskipper commented Sep 18, 2020

QUESTION 1 : General syntax

We use the ? operator to select columns from our data. It :

  • describes a selection of columns to modify if on the lhs of =
  • returns a data frame with selected columns if used somewhere on the rhs of =
  • describes a selection of columns to use as groups, if on the rhs of ~

The first form will be used for the equivalent of dplyr::across() operations.

starwars %.% {
  {
    ?c("mass", "birth_year") = ~max(., na.rm = TRUE)
    ?is.integer = ~mean(., na.rm = TRUE)
  } ~ sex + gender
}

Note that we don't need := unlike in tidyverse because these are not named arguments.
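
For comparison, here is a rough base R sketch of what the block above is meant to compute. It is not how the package does it; `sw` is a made-up stand-in for starwars and `summarise_group()` is a hypothetical helper.

```r
# Toy stand-in for dplyr::starwars (values are made up for illustration)
sw <- data.frame(
  sex        = c("male", "male", "female", "female"),
  gender     = c("masculine", "masculine", "feminine", "feminine"),
  mass       = c(77, NA, 49, 55),
  birth_year = c(19, 41.9, 19, NA),
  height     = c(172L, 202L, 150L, 165L)
)

# One data frame per sex/gender combination, dropping empty combinations
groups <- split(sw, list(sw$sex, sw$gender), drop = TRUE)

summarise_group <- function(g) {
  out <- g[1, c("sex", "gender")]
  # ?c("mass", "birth_year") = ~max(., na.rm = TRUE)
  for (col in c("mass", "birth_year")) out[[col]] <- max(g[[col]], na.rm = TRUE)
  # ?is.integer = ~mean(., na.rm = TRUE)
  int_cols <- names(g)[vapply(g, is.integer, logical(1))]
  for (col in int_cols) out[[col]] <- mean(g[[col]], na.rm = TRUE)
  out
}

res <- do.call(rbind, lapply(groups, summarise_group))
rownames(res) <- NULL
res
```

The result has one row per group, with the grouping columns first and the summarized columns after.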

We can support custom names, because ? can be binary.

"max_{col}"?c("mass", "birth_year") = ~max(., na.rm = TRUE)
"mean_{col}"?is.integer = ~mean(., na.rm = TRUE)

In general the lhs of ? is used to rename the selection.
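
A minimal sketch of how a "{col}" template on the lhs could be expanded into output names, in base R only. `expand_template()` is a hypothetical name, and the real implementation might well rely on glue instead.

```r
# Replace the literal "{col}" placeholder with each selected column name
expand_template <- function(template, cols) {
  vapply(
    cols,
    function(col) sub("{col}", col, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

new_names <- expand_template("max_{col}", c("mass", "birth_year"))
new_names
```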

We could provide a vignette per dplyr help page, comparing all examples.

QUESTION 2: Should we support functions as rhs ?

so ?is.integer = ~max(.) could be written ?is.integer = max ?

It's unambiguous, but increases the chances of making mistakes.

Given that formulas are not much more verbose let's skip for now.

QUESTION 3: Should we support these for regular mutate/summarize calls ?

so we'd do for instance

data %.% {
  foo = ~toupper(.)
}

It seems harmless, and in fact more consistent. It's just that if we say yes to question 2, this would be risky:

data %.% {
  foo = toupper
}

I think we'd better hold back on 2 but say yes to 3.
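
To make the formula-vs-function distinction concrete, here is a small base R sketch of turning a one-sided formula into a function of `.`. `as_fun()` is a hypothetical helper; rlang::as_function() and purrr::as_mapper() do something similar for real.

```r
# Accept either a function (question 2) or a one-sided formula lambda
as_fun <- function(x) {
  if (is.function(x)) return(x)
  stopifnot(inherits(x, "formula"), length(x) == 2L)
  body <- x[[2]]          # the expression on the rhs of ~
  env  <- environment(x)  # where the formula was created
  function(.) eval(body, list(. = .), env)
}

upper <- as_fun(~toupper(.))
upper("foo")

safe_max <- as_fun(~max(., na.rm = TRUE))
safe_max(c(1, NA, 3))
```

With only formulas allowed, `foo = toupper` can never be mistaken for "apply toupper to foo", which is the risk mentioned above.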

QUESTION 4 : How to select by regex

By using regex and () or ! or ~ with the following syntax we can do what the tidyselect stuff does.

? ("^Petal") = ~toupper(.)
? {"^Petal"} = ~toupper(.)
?! "^Petal" = ~toupper(.)
?~ "^Petal" = ~toupper(.)
? "~^Petal" = ~toupper(.)
? "/^Petal" = ~toupper(.)
  • "~" is often associated with regex, but we already use it to aggregate, for lambdas, and for side effects (unary ~~). Another issue is that we then cannot use ~ for lambdas after ?, or lambdas will have to be wrapped in ().
  • we already have a lot of {} too
  • other unary ops are + and -, and I don't think they look good here.
  • ?! ".*" prevents us from using ! to negate, and "foo" ?! ".*" is not very readable.
  • ? (".*") doesn't look good when the regex contains "(" (with named captures the regex will contain both ? and "(").

An alternative solution would be to treat the rhs of ? differently if it is a string obeying a given format that we wouldn't expect for a column name, maybe starting with "~".
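
A rough sketch of that last idea in base R, assuming a leading "~" marks a regex and anything else is a literal column name. `select_cols()` is hypothetical and only handles a single string.

```r
select_cols <- function(data, spec) {
  stopifnot(is.character(spec), length(spec) == 1L)
  if (startsWith(spec, "~")) {
    # Strip the "~" marker and match the rest against the column names
    grep(substring(spec, 2L), names(data), value = TRUE)
  } else {
    # Plain string: treat it as a column name
    intersect(spec, names(data))
  }
}

petal   <- select_cols(iris, "~^Petal")
species <- select_cols(iris, "Species")
```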

QUESTION 5 : Should we expand this to select and with which features ?

I didn't really intend to propose shorthands for selecting, but with the above it seems to come for free. Instead of select_if or select(starts_with(...), ...) we can do:

df %.% {
  ?is.numeric
  ?~"^S"
}

I think this is also very good to introduce the fancy features, first selection, then how to rename, then mutate using ? to select the input and rename the output.
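
For reference, a base R sketch of what those two selection lines would do if applied one after the other to iris (iris is just my choice of example data here).

```r
df <- iris
df <- df[vapply(df, is.numeric, logical(1))]  # ?is.numeric : keep numeric columns
df <- df[grepl("^S", names(df))]              # ?~"^S"     : then keep names starting with S
kept <- names(df)
kept
```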

QUESTION 6 : how to combine selections ?

? selections on consecutive lines mean we keep the intersection of those.

How to select this OR that :

  • anonymous functions : ?(~sapply(., is.numeric) | grepl("^S", names(.)))
  • pile them up : ?is.numeric ?~"^S" : surprising but unambiguous and really compact.
  • use | or & with adjusted behaviors : ? is.numeric | {"^S"}

This can also be used for mutating and summarizing, though in that case we would miss a handy syntax for AND.
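
The OR and AND semantics can be sketched in base R with logical masks over the columns, whatever surface syntax we end up picking. The variable names are illustrative only.

```r
is_num   <- vapply(iris, is.numeric, logical(1))  # mask for ?is.numeric
starts_s <- grepl("^S", names(iris))              # mask for ?~"^S"

either <- names(iris)[is_num | starts_s]  # this OR that
both   <- names(iris)[is_num & starts_s]  # this AND that (consecutive lines)
```

On iris, `either` covers all five columns, since Species starts with "S" and the other four are numeric.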

QUESTION 7 : Should we support renaming and how ?

I think it's non essential, but from what we have it follows almost naturally that we can do :

data %.% {
  "new_name" ?"old_name"
  "updated_{col}" ?c("old1", "old2")
}

Once we're comfortable with the fact that ? is used to select on the right side and rename on the left side it becomes intuitive enough.
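
A minimal base R sketch of the renaming behaviour described here, covering both a fixed new name and a "{col}" template. `rename_cols()` and the toy data frame are hypothetical.

```r
rename_cols <- function(data, template, old) {
  # Expand "{col}" for each old name; a template without it is used as-is
  new <- vapply(
    old,
    function(col) sub("{col}", col, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
  names(data)[match(old, names(data))] <- new
  data
}

d  <- data.frame(old_name = 1:2, old1 = 3:4, old2 = 5:6)
d1 <- rename_cols(d, "new_name", "old_name")               # "new_name" ?"old_name"
d2 <- rename_cols(d, "updated_{col}", c("old1", "old2"))   # "updated_{col}" ?c("old1", "old2")
```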

QUESTION 8 : Should we force summarizing operations to return only one row per group ?

Our current way of doing it is to keep only the grouping columns and apply any transformation; if the transformation keeps the length, it behaves like a grouped transmute call.

summarize used to impose one row per group and fail otherwise; now it allows more. A third option is not to fail but to nest if the output is longer, but that is likely to behave unpredictably.

We can use ~1~ instead of ~ to force the summary to be one row by group.

If we force it to be one row, we'll sometimes need to unnest in the following step, and we cannot pivot longer using the aggregation syntax.

This is related to the 2 next questions.
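
To illustrate what forcing one row per group could mean, here is a base R sketch of a strict summary that errors when a group returns more than one value. `summarise_one_row()` is a hypothetical helper, not the ~1~ implementation.

```r
summarise_one_row <- function(data, group, col, fun) {
  pieces <- split(data[[col]], data[[group]])
  values <- lapply(pieces, fun)
  # Enforce the "one row per group" contract
  if (any(lengths(values) != 1L)) {
    stop("summary must return exactly one value per group")
  }
  out <- data.frame(names(values), unlist(values, use.names = FALSE))
  names(out) <- c(group, col)
  out
}

ok  <- summarise_one_row(iris, "Species", "Sepal.Length", mean)
# range() returns two values per group, so the strict version fails
bad <- tryCatch(
  summarise_one_row(iris, "Species", "Sepal.Length", range),
  error = function(e) "error"
)
```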

QUESTION 9 : How do we do grouped mutate calls ?

Simple ones can be handled with transform() :

starwars %.% {
  ?c("name", "mass", "homeworld")
  transform(rank = min_rank(desc(mass))) ~ homeworld
}

For grouped mutate_at etc we'd need another syntax I think

starwars %.% {
  ?c("name", "mass", "homeworld")
  {rank = min_rank(desc(mass))} ~keep_all~ homeworld
}

This means the default is not to keep all, here we do, so summarized calls will be recycled using the unchanged values, this makes sense. We can add ~keep_unused~ and ~keep_used~ to our special cases, and they might be used with ~ (none) if we need those without groups. The latter are scarcely useful though.

An issue with this naming is that we might keep all columns but still aggregate all. A better and shorter way might be ~w~ for "window".

We talked above about having ~1~. We could also have ~m~ to add margins to our summaries (see ?reshape::melt), and ~n~ might keep unaggregated columns and nest them.
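
For the "window" case, base R's ave() already gives the grouped-mutate behaviour: it computes per group and returns a vector as long as the input. A sketch with made-up data follows; rank(-x, ties.method = "min") is used here as an approximation of min_rank(desc(x)) for data without NAs.

```r
# Toy stand-in for the starwars columns used above (values are made up)
sw <- data.frame(
  name      = c("a", "b", "c", "d", "e"),
  homeworld = c("Tatooine", "Tatooine", "Naboo", "Naboo", "Naboo"),
  mass      = c(80, 100, 50, 70, 60)
)

# Rank by descending mass within each homeworld, keeping all rows and columns
sw$rank <- ave(
  -sw$mass, sw$homeworld,
  FUN = function(x) rank(x, ties.method = "min")
)
sw
```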

QUESTION 10 : Should we unnest and how ?

I'm really not sure we should, but tidyr::unnest has become a bit verbose and strict, and unnest_legacy() is long to type.

Maybe we can use ++ ?

++ ?c("col1", "col2")

The idea is that "+" is often used in UIs to mean "develop", so we'd have "+-" for unnest, "++" for unnest_longer, and "--" for unnest_wider, and we can use as many as we want on the same line. We can use ? notation or naked column names.
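
As a reminder of what the "longer" flavour of unnesting does, here is a base R sketch on a small list column. The data is made up, and this only covers the simplest case of one list column of atomic vectors.

```r
nested <- data.frame(id = 1:2)
nested$vals <- list(1:2, 3:5)  # list column: one vector per row

# Repeat each id once per element of its list entry, then flatten the list
longer <- data.frame(
  id   = rep(nested$id, lengths(nested$vals)),
  vals = unlist(nested$vals)
)
longer
```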

QUESTION 11 : Should we implement spread ?

see #28

QUESTION 12 : then how to gather

see #28

QUESTION 13 : how to group by "the other columns"

This can be handy and it's what spreading functions do implicitly.

In base R we often see the dot for this : vars1 ~ .

We use the dot a lot already, so I think vars1 ~ (unused) is better, the () signal that it's a special syntax. We could also have other special values, see 2 questions below.
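
The base R dot behaviour mentioned above can be seen with aggregate(), using the built-in ToothGrowth data as an example of my choosing. The setdiff() line is a sketch of how (unused) could be resolved.

```r
# "len ~ ." groups by every column other than len (here: supp and dose)
by_others <- aggregate(len ~ ., data = ToothGrowth, FUN = mean)

# What (unused) would have to resolve to: all columns not used on the lhs
group_cols <- setdiff(names(ToothGrowth), "len")
group_cols
```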

QUESTION 14: rowwise operations?

What about:

foo(...) ~ (row) 

QUESTION 15: summarize without group ?

What about:

foo(...) ~ (none) 

Or :

foo(...) ~ NULL

I think I prefer (none), more consistent.

It will be useful for summarizing with one big group, or to transmute.

QUESTION 16: Rethinking filtering

This arises from a few observations:

  • Filtering as it is now works only if we use a given set of operators, for example is.na(foo) won't work
  • If we have a long conditional expression it's confusing, because we have to read on to see whether it is a condition. If we lose in reading what we save in typing, this is not good.

If we use ? for column selection, we could use ?? for row selection, so:

  • we could use ?? is.na(foo), and of course things like ?? foo == 0
  • no matter how long the expressions are, it's obvious that we're subsetting

We see the same pattern as in our select vs rename, the unary call subsets while the binary call does not.

How about filter_if and filter_at?

?all? (?is.numeric) > 0 # or all?? (?is.numeric) > 0
?all? 0 < ?is.numeric # equivalent avoiding parens

The rhs of ?? should be a logical of the same length as nrow, or recyclable, or a numeric. ?any? and ?all? can be used if we have a logical df or matrix with a recyclable number of rows.

We'll still support current behavior, but discourage its use for complex conditions.
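
Here is a base R sketch of the semantics I have in mind for ??, ?all? and ?any?, on made-up data. Rows where the condition is NA are dropped via which(), which is an assumption on my part rather than a settled choice.

```r
df <- data.frame(
  foo = c(1, NA, 3, -1),
  bar = c(2, 5, 0, 4),
  lab = c("a", "b", "c", "d")
)

# ?? is.na(foo) : any logical vector of length nrow(df) works
na_rows <- df[is.na(df$foo), ]

# ?all? (?is.numeric) > 0 : every numeric column must be positive
num     <- df[vapply(df, is.numeric, logical(1))]
all_pos <- df[which(apply(num > 0, 1, all)), ]

# ?any? (?is.numeric) > 0 : at least one numeric column must be positive
any_pos <- df[which(apply(num > 0, 1, any)), ]
```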


moodymudskipper commented Nov 18, 2020

All questions are pretty well answered now, still hesitant about Q4 but we can pick one and change later.

  • Master branch should be cleaned from undocumented dsl magic (basically we keep subset and mutate only)
  • We should then push V1 to CRAN
  • A new issue should be opened translating these questions into action points; we should reference the present issue and close it
  • Features should be implemented in a "dsl" branch.
