-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skipping missing values more easily #2314
Comments
This is a good idea, however I also like Perhaps DataFramesMeta could provide a block-level
Where
I'm not a fan of this idea since it's behavior depending on a global state that could be set a long ways from where the transform call is. It seems like it could cause debugging to be a pain. |
It's great you're thinking about how to make working with missing values easier! I 100% agree. A macro may be good. I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like A third possibility would be to allow users to change the default option for |
To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"? I've long been frustrated with the difficulty of working with |
@pdeffebach Yes that was more or less what I had in mind. Though if you have to repeat this for each call it's not much better than passing
@pdeffebach Yes. OTOH it's not so different from e.g. creating a column: if you get an error because it doesn't exist or it's incorrect you have to find where it's been defined, which can be quite far from where the error happens.
@matthieugomez Yes, losing the option after transformations could be annoying. Though it could be propagated across some operations which always preserve missing values: in these cases there's little point in forcing you to repeat that you know there are missing values. Where safety checks matter is when you could believe you got rid of missing values and for some reason it's not the case. But I admit that this option would decrease safety (even if not propagated automatically) as you could pass a data frame to a function which isn't prepared to handle missing values and it would silently skip them. A global setting would only aggravate these issues IMHO since it would affect completely unrelated operations, possibly in packages, some of which may rely on the implicit
@c42f My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values: using Cassette, Missings
Cassette.@context PassMissingCtx
Cassette.overdub(ctx::PassMissingCtx, f, args...) = passmissing(f)(args...)
# do not lift special functions which already handle missing
for f in (:ismissing, :(==), :isequal, :(===), :&, :|, :⊻)
@eval begin
Cassette.overdub(ctx::PassMissingCtx, ::typeof($f), args...) = $f(args...)
end
end
f(x) = x > 0 ? log(x) : -Inf
julia> Cassette.@overdub PassMissingCtx() f(missing)
missing
julia> Cassette.@overdub PassMissingCtx() f(1)
0.0
julia> Cassette.@overdub PassMissingCtx() ismissing(missing)
true
julia> Cassette.@overdub PassMissingCtx() missing | true
true Skipping missing values in reductions will be a little harder, but it's doable if we only want to handle a known list of functions. For example this quick hack works: Cassette.overdub(ctx::PassMissingCtx, ::typeof(sum), x) = sum(skipmissing(x))
julia> x = [1, missing];
julia> Cassette.@overdub PassMissingCtx() sum(x)
1 Maybe this kind of thing could be made simpler to use by providing a macro like In the end, maybe Cassette is too powerful for what we actually need in the context of DataFrames. In practice with |
Very cool. IIUC people have started to use other libraries like IRTools which plug into the compiler in a similar way to Cassette but are not Cassette itself, but yes, that's roughly what I had in mind. One downside is that it's a pretty heavy weight tool to deploy for something like missing.
I think this is the bigger problem; it's not clear how deep to recurse and it could definitely have unintended consequences. It's a very similar problem with floating point rounding modes, where having the rounding mode as dynamic state really doesn't work well as it infects other calculations which weren't programmed to deal with a different rounding mode. So I agree it does seem much safer and more sensible to have a macro which lowers only the syntax within the immediate expression to be more permissive with In terms of your examples, @linq filter(:col =>(x -> x > 1), df)
# means
linq_missing(filter, :col => (x -> linq_missing(>, x, 1)), df)
@linq combine(gd, :col => (x -> sum(x))
# means
linq_missing(combine, gd, :col => (x -> linq_missing(sum, x))) Something like this has very regular lowering rules and gives a measure of extensibility for user defined functions. |
One alternative is that filter/transform/combine could always skip missing (i.e. kwarg |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Could we think about making |
I would not have a problem with this. We already have it in
|
Regarding 2, note that as discussed at #2258 we have to pass views for |
I should have noted that before thinking about making it the default, we should first implement this keyword argument and see how it goes. |
@nalimilan Yes starting with a kwarg and see how it goes is the right way. The only reason I was mentioning the default value is that 1.0 may mean that this kind of stuff won't be able to change later on — I hope it's not the case. @bkamins I agree with 1 and 2.1 (views). I think for the case of |
Yes - for OK - so it seems we have a consensus here. I will propose a PR. |
That would be awesome, thanks! Jus to make sure I understand, what do you mean by `transform/select will throw an error in cases when there is no match and a vector is returned).'? If this happens, #2258 should definitely be closed, but this thread should be open if the default value is false, since it may still be cumbersome to add it at every command. |
I mean tat this would still error:
(but I guess this is natural to do it this way)
OK - we can keep it open then. For 1.0 release the default will be false to be non-breaking. |
Started thinking about it :). The trickiest part will be fast aggregation functions for |
Just to clarify, I think the following should work: select(df, :a => collect, skipmissing = true) and actually also select(df, :a => collect∘skipmissing , skipmissing = true) since the function collect∘skipmissing is one to one on the set of values for which a is not missing. In both cases, it just returns the same thing as However, the following should error combine(df, :a => collect, :b => collect, skipmissing = true) because the length of |
Thank you for commenting on this, as (sorry if I have forgotten some earlier discussions) you want to:
to work but
will fail. Which means to me that if we want to add such a kwarg it should not be called In particular:
and
would both work but produce different results. |
My proposal above about "spreading" missing values seems relevant here.
This takes
This takes
Takes
EDIT: After playing around with R, now I'm not so sure. I think the biggest annoyance is with |
Exactly. Even though I don’t like different keyword argument, I see the potential confusion for select/transform. |
In Stata, mean of a variables creates a variable equal to the mean on non-missing rows, and missing values otherwise. Edit: no sorry it creases the mean for every rows. I have also realized that, one issue with passmissing for functions that return vectors (as proposed by @nalimilan), is that lag will not work correctly. |
can you please give an example what you exactly mean? |
@matthieugomez I was trying to write a post comparing
But I don't have a Stata installation at the moment. Can you confirm when the I think the proposed behavior is fairly close to idiomatic stata. Enough that you can translate one-for-one, but I want to make sure . |
@nalimilan Yes in the second case the mean is only given on rows that are not missing. |
@bkamins sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing] |
That's fine, though.
|
Yes, But for simplicity I think it would be better to have a single argument called |
I understand you are talking about DataFramesMeta.jl here - right. In DataFrames.jl I do not think we will make it a default as it would be breaking.
Again - I understand you mean it for DataFramesMeta.jl - right? As for data frames, we want to have a wrapper around column selector in |
Regarding the discussion #2484. I think that
An element of Regarding the comment by @nalimilan:
I do not propose to skip missing values. What I propose is to use We could add a |
You need I think we can use |
I don't think we are ready to address |
fixed, it was a typo |
+1. It would also be nice to have a `where!` version, as well as a `view`
kwarg for `where`.
Last thing is the name. I like `where`. The only issue is that it is used
for special syntax in Julia. Is that a problem? In particular, I think it
creates some issues with using Lazy’s macro `@>`, but maybe it can be
fixed.
…On Mon, Oct 19, 2020 at 6:27 AM Bogumił Kamiński ***@***.***> wrote:
You need gdf above
fixed, it was a typo
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2314 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPPPXPIIASGWALSXHZIEZ3SLQ5FXANCNFSM4OTTLDUA>
.
|
Sure - we would mirror that - thank you for raising this, as it is easy to forget.
I think if we like it (and I guess we do not have a better name) then we should try to make sure that the "ecosystem" works with it. For sure this is not a problem for DataFramesMeta.jl. @pdeffebach - have you experimented with this name in DataFramesMeta.jl. |
|
Well I'd say you're playing with words. :-) Selecting rows for with TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that |
Is not it what dplyr does already (ie skip missing in filter but not in
mutate)?
…On Mon, Oct 19, 2020 at 1:13 PM Milan Bouchet-Valat < ***@***.***> wrote:
I do not propose to skip missing values. What I propose is to use isequal
not == for testing (and this is a logically valid rule). Actually in my
codes I normally use isequal because of this for logical conditions as a
rule.
We could add a errror_on_missing (or something similar) kwarg to where to
choose if we use isequal or == to do the test (but I would still make it
default to false, i.e. use isequal by default)
Well I'd say you're playing with words. :-) Selecting rows for with isequal(x,
true) is really skipping rows for which x is missing. And I would be
inclined to call the argument skipmissing=true instead of
errror_on_missing=false (that would be more standard given our
terminology).
TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit
inconsistent to consider that where shouldn't throw an error in the
presence of missing values (when you don't handle them manually), but that
combine/select/transform should. Of course it's much simpler and obvious
what to do to handle missing values with where, so it's easier and less
risky to implement. But in both cases 1) you lose some safety if you didn't
expect missing values to be present and they happen to be there, and 2)
it's still painful to work with combine/select/transform in the presence
of missing values.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2314 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPPPXNZOOAMIVCDP2VM3TLSLSMX3ANCNFSM4OTTLDUA>
.
|
The point is that we are not skipping missing in the sense how e.g.
For the above reason I have not proposed My point is that the decision what to do in |
Regarding
Which one do we prefer? |
|
It seems that dealing with missing values is one of the most painful issues we have, which goes against the very powerful and convenient DataFrames API. Having to write things like
filter(:col => x -> coalesce(x > 1, false), df)
orcombine(gd, :col => (x -> sum(skipmissing(x)))
isn't ideal. One proposal to alleviate this is #2258: add askipmissing
argument to functions likefilter
,select
,transform
andcombine
to unify the way one can skip missing values, instead of having to use different syntaxes which are hard to grasp for newcomers and make the code more complex to read.That would be one step towards being more user-friendly, but one would still have to repeat
skipmissing=true
all the time when dealing with missing values. I figured two solutions could be considered to improve this:@linqskipmissing
macro or a statement likeskipmissing
within a@linq
block that would automatically passskipmissing=true
to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.DataFrame
objects that would store the default value to use for theskipmissing
argument. By default it would befalse
, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to callskipmissing!(df, true)
once and then avoid repeating it.Somewhat similar discussions have happened a long time ago (but at the array rather than the data frame level) at JuliaStats/DataArrays.jl#39. I think it's fair to say that we know have enough experience now to make a decision. One argument against implementing this at the
DataFrame
level is that it will have no effect on operations applied directly to column vectors, likesum(df.col)
. But that's better than nothing.Cc: @bkamins, @matthieugomez, @pdeffebach, @mkborregaard
The text was updated successfully, but these errors were encountered: