QUESTION 1 : General syntax
We use the ? operator to select columns from our data. It :
The former will be used for the equivalent of dplyr::across() operations.
Note that we don't need := unlike in tidyverse because these are not named arguments.
We can support custom names, because ? can be binary.
In general the lhs of ? is used to rename the selection.
We could provide a vignette per dplyr help page, comparing all examples.
QUESTION 2: Should we support functions as rhs ?
So ?is.integer = ~max(.) could be written ?is.integer = max ?
It's unambiguous, but increases the chances of making mistakes.
Given that formulas are not much more verbose let's skip for now.
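For comparison, the same choice already exists in plain R: a predicate picks the columns, then either an explicit lambda or a bare function is applied. This is only a sketch in base R, not the ? syntax discussed here (which is still a proposal), and the data frame df and its columns are made up for illustration.

```r
# Toy data (hypothetical): a and b are integer, c is character.
df <- data.frame(a = 1:3, b = c(2L, 5L, 4L), c = c("x", "y", "z"))

# Keep only the columns for which the predicate is TRUE.
int_cols <- Filter(is.integer, df)

# Explicit lambda, the analogue of  ?is.integer = ~max(.)
via_lambda <- sapply(int_cols, function(x) max(x))

# Bare function, the analogue of  ?is.integer = max
via_function <- sapply(int_cols, max)

# Both forms give the same named vector.
identical(via_lambda, via_function)
```

The two forms are interchangeable here, which is why the bare-function form is tempting; the cost is only in how easily it can be confused with other uses of the rhs.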
QUESTION 3: Should we support these for regular mutate/summarize calls ?
so we'd do for instance
It seems harmless, and in fact more consistent. It's just that if we say yes to question 2, this would be risky.
I think it's better to hold back on 2 but say yes to 3.
QUESTION 4 : How to select by regex
By using regex and () or ! or ~ with the following syntax we can do what the tidyselect stuff does.
"~" is often associated with regex but we use it already to aggregate, for lambdas, and for side effects (unary ~~). Another issue is that we then cannot use ~ for lambdas after ?, or lambdas will have to be between ().
we already have a lot of {} too
other unary ops are +, - and I don't think they look good here.
?! ".*" prevents us from using ! to negate, and "foo" ?! ".*" is not very readable.
? (".*") doesn't look good when the regex contains "(" (with named captures the regex will contain both ? and ()).
An alternative solution would be to treat the rhs of ? differently if it is a string obeying a given format that we wouldn't expect for a column, maybe starting with "~".
QUESTION 5 : Should we expand this to select and with which features ?
I didn't really intend to propose shorthands for selecting, but with the above it seems to come for free. Instead of select_if or select(starts_with(...), ...) we can do :
df %.% {
?is.numeric
?~"^S"
}
I think this is also a very good way to introduce the fancy features: first selection, then how to rename, then mutate using ? to select the input and rename the output.
QUESTION 6 : how to combine selections ?
? selections on consecutive lines mean we keep the intersection of those.
How to select this OR that :
?(~sapply(., is.numeric) | grepl("^S", names(.)))
pile them up : ?is.numeric ?~"^S" : surprising but unambiguous and really compact.
use | or & with adjusted behaviors : ? is.numeric | {"^S"}
This can also be used for mutating and summarizing, though in that case we would miss a handy syntax for AND.
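To make the intersection and union semantics concrete, here is what the two selections above amount to in base R on the built-in iris data. It is a sketch of the intended result only, not of how the ? operator would be implemented; the variable names are made up for illustration.

```r
# iris ships with R: four numeric columns plus the factor Species.
is_num   <- sapply(iris, is.numeric)   # predicate selection, like ?is.numeric
starts_S <- grepl("^S", names(iris))   # regex selection, like ?~"^S"

# AND (intersection): numeric columns whose name starts with "S".
and_cols <- names(iris)[is_num & starts_S]
and_cols  # "Sepal.Length" "Sepal.Width"

# OR (union): numeric columns, or any column whose name starts with "S".
or_cols <- names(iris)[is_num | starts_S]
or_cols   # all five columns, since Species also starts with "S"
```

Consecutive ? lines would correspond to the first result; the OR forms listed above would correspond to the second.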
QUESTION 7 : Should we support renaming and how ?
I think it's non essential, but from what we have it follows almost naturally that we can do :
data %.% {
"new_name" ?"old_name"
"updated_{col}" ?c("old1", "old2")
}
Once we're comfortable with the fact that ? is used to select on the right side and rename on the left side it becomes intuitive enough.
QUESTION 8 : Should we force summarizing operations to return only one row per group ?
Our current way of doing it is just to keep the grouping columns and apply any transformation; if it keeps the length then it will be like a grouped transmute call.
summarize used to impose it and fail if not respected, now it allows it. A third option is not to fail but to nest if the output is longer, but it's likely to behave unpredictably.
We can use ~1~ instead of ~ to force the summary to be one row by group.
If we force it to be one row, we'll sometimes need to unnest in the following step, and we cannot pivot longer using the aggregation syntax.
This is related to the 2 next questions.
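The three output shapes at stake can be illustrated in base R on iris. This is a sketch of the behaviors being compared, assuming nothing about how ~ or ~1~ would be implemented; the variable names are made up.

```r
# One value per group: the classic "summarize" shape (3 groups, 3 rows).
one_row <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
nrow(one_row)

# Two values per group: range() returns c(min, max), so a strict
# one-row-per-group rule would have to fail or nest here.
two_vals <- lapply(split(iris$Sepal.Length, iris$Species), range)
lengths(two_vals)

# Length-preserving transformation per group: this is the case that
# behaves like a grouped transmute call.
centered <- ave(iris$Sepal.Length, iris$Species,
                FUN = function(x) x - mean(x))
length(centered) == nrow(iris)
```

The second case is the one where the choice matters: it either becomes two rows per group, gets nested, or raises an error.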
QUESTION 9 : How do we do grouped mutate calls ?
Simple ones can be handled with transform() :
For grouped mutate_at etc we'd need another syntax I think
This means the default is not to keep all, here we do, so summarized calls will be recycled using the unchanged values, this makes sense. We can add ~keep_unused~ and ~keep_used~ to our special cases, and they might be used with ~ (none) if we need those without groups. The latter are scarcely useful though.
An issue with this naming is that we might keep all columns but still aggregate all. A better and shorter way might be ~w~ for "window".
We talked above about having ~1~, we could also have ~m~ to have margins to our summaries (see ?reshape::melt), ~n~ might keep unaggregated columns and nest them.
QUESTION 10 : Should we unnest and how ?
I'm really not sure we should, but tidyr::unnest has become a bit verbose and strict, and unnest_legacy() is long to type.
Maybe we can use ++ ?
++ ?c("col1", "col2")
The idea is that "+" is often used in UIs to mean "expand", so we'd have "+-" for unnest, "++" for unnest_longer, and "--" for unnest_wider, and we can use as many as we want on the same line. We can use ? notation or naked column names.
QUESTION 11 : Should we implement spread ?
see #28
QUESTION 12 : then how to gather
see #28
QUESTION 13 : how to group by "the other columns"
This can be handy and it's what spreading functions do implicitly.
In base R we often see the dot for this : vars1 ~ .
We use the dot a lot already, so I think vars1 ~ (unused) is better, the () signals that it's a special syntax. We could also have other special values, see 2 questions below.
QUESTION 14: rowwise operations?
What about:
foo(...) ~ (row)
QUESTION 15: summarize without group ?
What about:
foo(...) ~ (none)
Or :
foo(...) ~ NULL
I think I prefer (none), more consistent.
It will be useful for summarizing with one big group, or to transmute
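As a point of reference, the two special groupings correspond to the following shapes in base R. This is a sketch with a made-up data frame; (row) and (none) are only proposals at this point, so the comments indicate intent rather than actual behavior.

```r
# Toy data (hypothetical).
df <- data.frame(x = c(1, 5, 3), y = c(4, 2, 6))

# Row-wise: one result per row, the kind of thing  foo(...) ~ (row)  targets.
row_max <- mapply(max, df$x, df$y)
row_max   # 4 5 6

# No grouping: the whole table is one big group, as with  foo(...) ~ (none).
overall <- data.frame(mean_x = mean(df$x), max_y = max(df$y))
overall   # one row: mean_x = 3, max_y = 6
```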
QUESTION 16: Rethinking filtering
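This arises from a few observations :
Filtering as it is now works only if we use a given set of operators, for example is.na(foo) won't work
If we have a long conditional expression it's confusing because we have to read on to see if it is a condition; if we lose in reading what we save in typing this is not good
If we use ? for column selection, we could use ?? for row selection so :
we could use ?? is.na(foo), and of course things like ?? foo == 0
no matter how long the expressions are, it's obvious that we're subsetting
We see the same pattern as in our select vs rename, the unary call subsets while the binary call does not.
How about filter_if filter_at?
The rhs of ?? should be a logical of same length as nrow, or recyclable, or a numeric, ?any? and ?all? can be used if we have a logical df or matrix of recyclable number of rows.
We'll still support current behavior, but discourage its use for complex conditions.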
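The kinds of rhs described above map onto ordinary base R subsetting as follows. This is a sketch on a made-up data frame; the ?? operator and the ?any? / ?all? helpers are proposals, so the row-wise collapsing with rowSums() is just one assumed reading of what they would do.

```r
# Toy data (hypothetical).
df <- data.frame(foo = c(0, NA, 2, 0), bar = c(NA, NA, 1, 3))

# A logical of length nrow(df): what  ?? is.na(foo)  and  ?? foo == 0  express.
na_rows   <- df[is.na(df$foo), ]
zero_rows <- df[which(df$foo == 0), ]   # which() drops the NA that == produces

# A numeric rhs selects rows by position.
by_pos <- df[c(1, 3), ]

# A logical matrix with one row per data row, collapsed row-wise.
na_mat <- is.na(df)
any_na <- rowSums(na_mat) > 0           # "any" across columns
all_na <- rowSums(na_mat) == ncol(df)   # "all" across columns
df[any_na, ]   # rows 1 and 2
df[all_na, ]   # row 2 only
```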