Skipping missing values more easily #2314

nalimilan · 2020-07-07T20:57:12Z

It seems that dealing with missing values is one of the most painful issues we have, which goes against the very powerful and convenient DataFrames API. Having to write things like filter(:col => x -> coalesce(x > 1, false), df) or combine(gd, :col => (x -> sum(skipmissing(x)))) isn't ideal. One proposal to alleviate this is #2258: add a skipmissing argument to functions like filter, select, transform and combine to unify the way one can skip missing values, instead of having to use different syntaxes which are hard to grasp for newcomers and make the code more complex to read.
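For reference, the two idioms mentioned above look roughly like this on a small data frame. This is only a sketch of the current behavior; the sample data and the output column name col_sum are made up for illustration.

```julia
using DataFrames

# Made-up sample data: one missing value in :col.
df = DataFrame(g = [1, 1, 2, 2], col = [1, missing, 3, 4])

# Without coalesce, `x > 1` evaluates to missing for the second row and
# filter throws, so the predicate has to map missing to false by hand.
kept = filter(:col => x -> coalesce(x > 1, false), df)

# Reductions propagate missing unless the column is wrapped in skipmissing.
gd = groupby(df, :g)
sums = combine(gd, :col => (x -> sum(skipmissing(x))) => :col_sum)
```

Both calls work, but the missing-handling has to be spelled out separately, and differently, in each one.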

That would be one step towards being more user-friendly, but one would still have to repeat skipmissing=true all the time when dealing with missing values. I figured two solutions could be considered to improve this:

- Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated macro, e.g. @linqskipmissing, or a statement like skipmissing within a @linq block that would automatically pass skipmissing=true to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.
- Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.

Somewhat similar discussions have happened a long time ago (but at the array rather than the data frame level) at JuliaStats/DataArrays.jl#39. I think it's fair to say that we now have enough experience to make a decision. One argument against implementing this at the DataFrame level is that it will have no effect on operations applied directly to column vectors, like sum(df.col). But that's better than nothing.

Cc: @bkamins, @matthieugomez, @pdeffebach, @mkborregaard

pdeffebach · 2020-07-09T16:16:17Z

Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated e.g. @linqskipmissing macro or a statement like skipmissing within a @linq block that would automatically pass skipmissing=true to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.

This is a good idea, however I also like skipmissing = true at the level of a transform call or even at the level of argument because it's explicit.

Perhaps DataFramesMeta could provide a block-level skipmissing option as well as a macro like

@transform(df, @sm y = :x1 .+ mean(:x2))

where the @sm macro does the necessary transformation described in #2258:

function wrapper(fun, x, y)
    sx, sy = Missings.skipmissings(x, y) # need to be on Missings master
    sub_out = fun(sx, sy)
    full_out = Vector{Union{eltype(sub_out), Missing}}(missing, length(x))
    full_out[eachindex(sx)] .= sub_out # eachindex(sx) returns indices of complete cases

    return full_out
end
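The "spreading" idea can also be sketched without skipmissings by computing the complete-case mask by hand. The helper name spread_complete and the sample vectors below are hypothetical; this only illustrates the idea and is not the implementation proposed in #2258.

```julia
# Hypothetical helper, not part of DataFrames.jl or Missings.jl.
function spread_complete(fun, cols::AbstractVector...)
    n = length(first(cols))
    # A row is complete when none of the columns is missing at that index.
    complete = [all(c -> !ismissing(c[i]), cols) for i in 1:n]
    # identity.() narrows the element type so fun never sees Missing.
    args = map(c -> identity.(c[complete]), cols)
    sub_out = fun(args...)
    # Incomplete rows stay missing; complete rows receive the result.
    full_out = Vector{Union{eltype(sub_out), Missing}}(missing, n)
    full_out[complete] .= sub_out
    return full_out
end

x1 = [1.0, missing, 3.0]
x2 = [10.0, 20.0, missing]

# Only the first row is complete, so only that row gets a value.
y = spread_complete((a, b) -> a .+ sum(b), x1, x2)
```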

Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.

I'm not a fan of this idea since it's behavior depending on a global state that could be set a long ways from where the transform call is. It seems like it could cause debugging to be a pain.

matthieugomez · 2020-07-09T20:34:22Z

It's great you're thinking about how to make working with missing values easier! I 100% agree.

A macro may be good.

I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like df = merge(df, df_using) does not retain the option?

A third possibility would be to allow users to change the default option for transform, etc, say by writing SKIPMISSING = true at the start of a script.

c42f · 2020-07-10T02:04:36Z

To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?

I've long been frustrated with the difficulty of working with missing given that it infects all downstream operations and wished it worked differently in Base. (I acknowledge my frustration could be misguided — perhaps it's saved me from some horrible bugs without knowing it :-) )

nalimilan · 2020-07-10T09:11:04Z

Perhaps DataFramesMeta could provide a block-level skipmissing option as well as a macro like
@transform(df, @sm y = :x1 .+ mean(:x2))
Where @sm macro, does the necessary transformation described in #2258

@pdeffebach Yes that was more or less what I had in mind. Though if you have to repeat this for each call it's not much better than passing skipmissing=true. Being able to apply it to a chain of operations would already be more useful.

I'm not a fan of this idea since it's behavior depending on a global state that could be set a long ways from where the transform call is. It seems like it could cause debugging to be a pain.

@pdeffebach Yes. OTOH it's not so different from e.g. creating a column: if you get an error because it doesn't exist or it's incorrect you have to find where it's been defined, which can be quite far from where the error happens.

I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like df = merge(df, df_using) does not retain the option?

A third possibility would be to allow users to change the default option for transform, etc, say by writing SKIPMISSING = true at the start of a script.

@matthieugomez Yes, losing the option after transformations could be annoying. Though it could be propagated across some operations which always preserve missing values: in these cases there's little point in forcing you to repeat that you know there are missing values. Where safety checks matter is when you could believe you got rid of missing values and for some reason it's not the case. But I admit that this option would decrease safety (even if not propagated automatically) as you could pass a data frame to a function which isn't prepared to handle missing values and it would silently skip them.

A global setting would only aggravate these issues IMHO since it would affect completely unrelated operations, possibly in packages, some of which may rely on the implicit skipmissing=false.

To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?

@c42f My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:

using Cassette, Missings

Cassette.@context PassMissingCtx

Cassette.overdub(ctx::PassMissingCtx, f, args...) = passmissing(f)(args...)

# do not lift special functions which already handle missing
for f in (:ismissing, :(==), :isequal, :(===), :&, :|, :⊻)
    @eval begin
        Cassette.overdub(ctx::PassMissingCtx, ::typeof($f), args...) = $f(args...)
    end
end

f(x) = x > 0 ? log(x) : -Inf

julia> Cassette.@overdub PassMissingCtx() f(missing)
missing

julia> Cassette.@overdub PassMissingCtx() f(1)
0.0

julia> Cassette.@overdub PassMissingCtx() ismissing(missing)
true

julia> Cassette.@overdub PassMissingCtx() missing | true
true

Skipping missing values in reductions will be a little harder, but it's doable if we only want to handle a known list of functions. For example this quick hack works:

Cassette.overdub(ctx::PassMissingCtx, ::typeof(sum), x) = sum(skipmissing(x))

julia> x = [1, missing];

julia> Cassette.@overdub PassMissingCtx() sum(x)
1

Maybe this kind of thing could be made simpler to use by providing a macro like @passmissing as a shorthand for Cassette.@overdub PassMissingCtx(). But it's important to measure all the implications of this approach: since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects. For example, Cassette.@overdub PassMissingCtx() [missing] gives missing, which will break lots of package code. A safer approach would be to only apply passmissing to a whitelist of functions for which it makes sense (scalar functions, mainly), like DataValues does -- with the drawback that the list is kind of arbitrary.

In the end, maybe Cassette is too powerful for what we actually need in the context of DataFrames. In practice with select/transform/combine it makes sense to only apply passmissing to the top-level functions, rather than recursively. And for reductions, passing views of complete rows as proposed at #2258 should be enough, when combined with a convenient DataFramesMeta syntax.
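Applying passmissing only at the top level, as opposed to overdubbing every nested call, can be sketched with plain Missings.jl. The function f is the one from the example above; the name lifted and the input vector are just for illustration.

```julia
using Missings

f(x) = x > 0 ? log(x) : -Inf

# passmissing lifts only the outer call: if any argument is missing it
# returns missing without calling f. Calls made inside f are left untouched.
lifted = passmissing(f)

results = lifted.([1, missing, -1])
```

Since nothing inside f is rewritten, package code called from f keeps its usual behavior, which avoids the side effects described for the Cassette approach.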

c42f · 2020-07-11T12:23:07Z

My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:

Very cool. IIUC people have started to use other libraries like IRTools which plug into the compiler in a similar way to Cassette but are not Cassette itself, but yes, that's roughly what I had in mind. One downside is that it's a pretty heavyweight tool to deploy for something like missing.

since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects

I think this is the bigger problem; it's not clear how deep to recurse and it could definitely have unintended consequences. It's a very similar problem with floating point rounding modes, where having the rounding mode as dynamic state really doesn't work well as it infects other calculations which weren't programmed to deal with a different rounding mode.

So I agree it does seem much safer and more sensible to have a macro which lowers only the syntax within the immediate expression to be more permissive with missing. Actually I wonder whether you could do something more like broadcast lowering where all function calls within the expression are lifted in a certain way such that dispatch can be customized, so, eg, @linq f(x,y,z) becomes linq_missing(f, x, y, z).

In terms of your examples,

@linq filter(:col => (x -> x > 1), df)
# means
linq_missing(filter, :col => (x -> linq_missing(>, x, 1)), df)

@linq combine(gd, :col => (x -> sum(x)))
# means
linq_missing(combine, gd, :col => (x -> linq_missing(sum, x)))

Something like this has very regular lowering rules and gives a measure of extensibility for user defined functions.
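A minimal sketch of how such a lifted call could dispatch follows. The name linq_missing comes from the comment above, but the methods are assumptions made for illustration: a generic fallback plus two missing-permissive overrides.

```julia
# Hypothetical function: by default the lifted call is just a plain call.
linq_missing(f, args...) = f(args...)

# Comparisons treat a missing result as false.
linq_missing(::typeof(>), x, y) = coalesce(x > y, false)

# Reductions skip missing values.
linq_missing(::typeof(sum), x) = sum(skipmissing(x))

# What the lowered predicate from the filter example would do on a plain vector.
kept = filter(x -> linq_missing(>, x, 1), [0, missing, 2])
total = linq_missing(sum, [1, missing, 2])
```

User-defined functions fall through to the generic method unless their author adds a dedicated one, which is where the extensibility would come from.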

matthieugomez · 2020-08-02T18:32:53Z

One alternative is that filter/transform/combine could always skip missing (i.e. kwarg skipmissing = true is the default).

bkamins · 2020-08-06T17:12:10Z

Ah - sorry - I thought this issue superseded that one. So actually this issue should be on hold till we resolve #2258, because it only asks to make #2258 simpler - right?

matthieugomez · 2020-08-22T16:59:48Z

Could we think about making filter/transform/combine/select have the kwarg skipmissing = true by default? It's not unheard of: that's what Stata and pandas (for combine) do. Also, data.table and dplyr automatically skip missing in their versions of filter.

bkamins · 2020-08-22T19:31:26Z

I would not have a problem with this. We already have it in groupby so it would be consistent. So I assume you want:

1. for filter, wrap a predicate p in x -> coalesce(p(x), false)
2. for transform/combine/select we have two decisions:
   - do we pass views or skipmissing wrappers? (given the second question probably views, it is going to be a bit slow, but I think skipmissing=true does not have to be super fast as it is a convenience wrapper)
   - what do we do if multiple columns are passed - skip rows that have at least one missing? (this is what groupby does)

nalimilan · 2020-08-22T20:48:57Z

Regarding 2, note that as discussed at #2258 we have to pass views for select and transform as we need to be able to reassign values to the non-missing rows in the input. When multiple columns are passed, we should only keep complete observations, otherwise something like [:x, :y] => + wouldn't work due to different lengths.

nalimilan · 2020-08-22T20:49:41Z

I should have noted that before thinking about making it the default, we should first implement this keyword argument and see how it goes.

matthieugomez · 2020-08-22T21:28:52Z

@nalimilan Yes starting with a kwarg and see how it goes is the right way. The only reason I was mentioning the default value is that 1.0 may mean that this kind of stuff won't be able to change later on — I hope it's not the case.

@bkamins I agree with 1 and 2.1 (views). I think for the case of [:x, :y] => + it should skip missing on both (as @nalimilan points out), but for [:x, :y] .=> mean, or for :x => mean, :y => mean, it should skipmissing on x and y separately.

bkamins · 2020-08-22T22:24:48Z

Yes - for [:x, :y] .=> mean this is separate (which means that transform/select will throw an error in cases when there is no match and a vector is returned).

OK - so it seems we have a consensus here. I will propose a PR.

bkamins · 2020-08-22T22:26:16Z

Just a maintenance question - when we add skipmissing kwarg should both #2314 and #2258 be closed or something more should be done/kept track of?

matthieugomez · 2020-08-22T22:29:43Z

That would be awesome, thanks! Just to make sure I understand, what do you mean by "transform/select will throw an error in cases when there is no match and a vector is returned"?

If this happens, #2258 should definitely be closed, but this thread should be open if the default value is false, since it may still be cumbersome to add it at every command.

bkamins · 2020-08-22T22:35:42Z

transform/select will throw an error in cases when there is no match and a vector is returned

I mean that this would still error:

julia> df = DataFrame(a=[1,missing,2], b=[1,missing,missing])
3×2 DataFrame
│ Row │ a       │ b       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ 1       │
│ 2   │ missing │ missing │
│ 3   │ 2       │ missing │

julia> select(df, :a => collect∘skipmissing)
ERROR: ArgumentError: length 2 of vector returned from function #62 is different from number of rows 3 of the source data frame.

(but I guess this is natural to do it this way)

this thread should be open if the default value is false

OK - we can keep it open then. For 1.0 release the default will be false to be non-breaking.

bkamins · 2020-08-22T22:40:37Z

That would be awesome, thanks!

Started thinking about it :). The trickiest part will be fast aggregation functions for GroupedDataFrame case (as usual), but also they should be doable.

matthieugomez · 2020-08-22T22:50:46Z

transform/select will throw an error in cases when there is no match and a vector is returned

I mean that this would still error:

julia> df = DataFrame(a=[1,missing,2], b=[1,missing,missing])
3×2 DataFrame
│ Row │ a       │ b       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ 1       │
│ 2   │ missing │ missing │
│ 3   │ 2       │ missing │

julia> select(df, :a => collect∘skipmissing)
ERROR: ArgumentError: length 2 of vector returned from function #62 is different from number of rows 3 of the source data frame.

(but I guess this is natural to do it this way)

Just to clarify, I think the following should work:

select(df, :a => collect, skipmissing = true)

and actually also

select(df, :a => collect∘skipmissing , skipmissing = true)

since the function collect∘skipmissing is one to one on the set of values for which a is not missing. In both cases, it just returns the same thing as a.

However, the following should error

combine(df, :a => collect, :b => collect, skipmissing = true)

because the length of collect∘skipmissing(a) is different from the length of collect∘skipmissing(b)

bkamins · 2020-08-23T07:21:35Z

Just to clarify, I think the following should work:

select(df, :a => collect, skipmissing = true)

Thank you for commenting on this, as (sorry if I have forgotten some earlier discussions) you want to:

select(df, :a => collect , skipmissing = true)

to work but

select(df, :a => collect∘skipmissing)

will fail.

Which means to me that if we want to add such a kwarg it should not be called skipmissing as I am afraid it would lead to a confusion. Also I would even say that this kwarg should be reserved for select and transform as in combine you expect a different behaviour.

In particular:

select(df, :a => mean, skipmissing = true)

and

select(df, :a => mean∘skipmissing)

would both work but produce different results.

pdeffebach · 2020-08-23T12:28:32Z

My proposal above about "spreading" missing values seems relevant here.

select(df, :a => collect, skipmissing = true)

This takes :a, applies skipmissing, collects the result, then loops through indices of :a and, when not missing, fills it in.

select(df, :a => collect∘skipmissing , skipmissing = true)

This takes :a, applies skipmissing, then applies it again, collects the result, and loops through indices of :a as before.

select(df, :a => mean, skipmissing = true)

Takes :a, applies skipmissing, takes the mean, then loops through and fills in indices of :a where indices of :a are not missing. Perhaps we can make an exception for scalar values? If a scalar is returned, it gets spread across the entire vector, regardless of missing values? That way it would match

select(df, :a => mean∘skipmissing; skipmissing = false)

EDIT: After playing around with R, now I'm not so sure. I think the biggest annoyance is with filter currently and we should start with that since everyone agrees on that behavior and we are confident it won't result in unpredictable / inconsistent behavior.

matthieugomez · 2020-08-23T16:51:29Z

Just to clarify, I think the following should work:
select(df, :a => collect, skipmissing = true)

Thank you for commenting on this, as (sorry if I have forgotten some earlier discussions) you want to:
select(df, :a => collect , skipmissing = true)
to work but
select(df, :a => collect∘skipmissing)
will fail.

Which means to me that if we want to add such a kwarg it should not be called skipmissing as I am afraid it would lead to a confusion. Also I would even say that this kwarg should be reserved for select and transform as in combine you expect a different behaviour.

In particular:
select(df, :a => mean, skipmissing = true)
and
select(df, :a => mean∘skipmissing)
would both work but produce different results.

Exactly. Even though I don’t like different keyword argument, I see the potential confusion for select/transform.

matthieugomez · 2020-08-27T14:04:02Z

In Stata, the mean of a variable creates a variable equal to the mean on non-missing rows, and missing values otherwise. Edit: no, sorry, it creates the mean for every row.

I have also realized that, one issue with passmissing for functions that return vectors (as proposed by @nalimilan), is that lag will not work correctly.

bkamins · 2020-08-27T14:36:42Z

is that lag will not work correctly.

can you please give an example what you exactly mean?

pdeffebach · 2020-08-27T14:37:54Z

@matthieugomez I was trying to write a post comparing

gen y = mean(x)
gen y = mean(x) if !missing(x)

But I don't have a Stata installation at the moment. Can you confirm when the mean is different?

I think the proposed behavior is fairly close to idiomatic Stata. Enough that you can translate one-for-one, but I want to make sure.

matthieugomez · 2020-08-27T14:46:55Z

@nalimilan Yes in the second case the mean is only given on rows that are not missing.

matthieugomez · 2020-08-27T14:50:25Z

@bkamins sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]
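The lag concern can be reproduced with a small sketch. Both lag1 (a one-step lag standing in for a real lag function) and skip_then_spread (one reading of the proposed skip-then-fill behavior) are hypothetical helpers written only for this illustration.

```julia
# Hypothetical one-step lag: shift values down and pad with missing.
lag1(v) = [missing; v[1:end-1]]

# Hypothetical skip-then-fill: apply fun to the non-missing values only,
# then write the result back into the non-missing positions.
function skip_then_spread(fun, col)
    keep = .!ismissing.(col)
    res = fun(col[keep])
    out = Vector{Union{eltype(res), Missing}}(missing, length(col))
    out[keep] .= res
    return out
end

expected = lag1([1, missing])                    # lag applied to the full column
actual = skip_then_spread(lag1, [1, missing])    # lag applied after skipping
```

Here expected is [missing, 1], while actual is [missing, missing]: the lag only ever sees the single non-missing value, so the shifted value has nowhere to land.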

pdeffebach · 2020-08-27T15:14:22Z

sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]

That's fine, though. lag doesn't error with missing values so there is no need to do it (assuming we don't make DropMissing the default, which I don't think we should yet).

lags are hard anyways, you have to remember grouping etc.

nalimilan · 2020-08-27T15:28:32Z

Yes, lag/lead/diff are another case that would require opting-out explicitly of missing values propagation (if we make it the default). If we think this kind of situation (in addition to ismissing, etc.) is too common or too confusing, we could avoid it by default and require people to use skipmissing=true/ passmissing=true explicitly. Anyway we could first try this system and later see whether it should be the default.

But for simplicity I think it would be better to have a single argument called skipmissing that would do both, since we concluded that there are no major use cases in which passmissing and skipmissing would need to be separate. For row-wise operation, skipmissing somewhat makes sense as a way to say "only apply the operation to non-missing rows (and fill others with missing)". It's a slight abuse but probably OK. For column-wise operation, skipmissing would be exactly what happens to input columns, and the passmissing behavior would just be a consequence of that: if the function returns a single scalar, we can fill all rows, including missing ones (passmissing=false in our terminology), but if it returns a vector of the same length as its input, we can only fill non-missing rows (passmissing=true in our terminology).
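The scalar-versus-vector rule described above could look roughly like the sketch below. The helper apply_skipmissing is hypothetical and only handles a single column; it is an illustration of the rule, not the DataFrames.jl implementation.

```julia
using Statistics: mean

# Hypothetical helper illustrating the rule for one column.
function apply_skipmissing(fun, col::AbstractVector)
    keep = .!ismissing.(col)
    # identity.() narrows the element type so fun never sees Missing.
    res = fun(identity.(col[keep]))
    if res isa AbstractVector
        # Vector result: it must match the non-missing rows, which are the
        # only rows that can be filled.
        length(res) == count(keep) ||
            throw(ArgumentError("result length does not match non-missing rows"))
        out = Vector{Union{eltype(res), Missing}}(missing, length(col))
        out[keep] .= res
        return out
    else
        # Scalar result: fill every row, including those that were missing.
        return fill(res, length(col))
    end
end

col = [1, missing, 2]
means = apply_skipmissing(mean, col)     # scalar result, spread over all rows
sums = apply_skipmissing(cumsum, col)    # vector result, missing rows stay missing
```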

bkamins · 2020-08-27T15:38:15Z

(if we make it the default)

I understand you are talking about DataFramesMeta.jl here, right? In DataFrames.jl I do not think we will make it a default as it would be breaking.

But for simplicity I think it would be better to have a single argument called

Again - I understand you mean it for DataFramesMeta.jl - right? As for data frames, we want to have a wrapper around column selector in source => fun => destination (we need a name as we cannot use SkipMissing).

bkamins · 2020-10-19T10:11:23Z

Regarding the discussion #2484.

I think that where should have the following specification (I leave out some sanity checks but give a simplified idea of the implementation):

where(df::DataFrame, args...) = df[isequal.(.&(eachcol(select(df, args..., copycols=false))), true), :]
where(gdf::GroupedDataFrame, args...) = parent(gdf)[isequal.(.&(eachcol(select(gdf, args..., copycols=false, keepkeys=false))), true), :]

An element of args can only be of the form cols => fun and we transform it to cols => fun => x_i (to make sure we have unique output column names).

Regarding the comment by @nalimilan:

One potential issue is that having where skip missing values but other functions like combine, select and transform not skip them

I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.

We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)
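The difference between testing with isequal and with == can be seen on a single condition column. This is a small sketch with made-up data, using plain row indexing instead of the proposed where function.

```julia
using DataFrames

df = DataFrame(x = [1, missing, 3])
cond = df.x .> 1                 # [false, missing, true]

# isequal.(cond, true) maps missing to false, so the mask is a plain Bool
# vector and rows with a missing condition are simply not selected.
mask = isequal.(cond, true)
kept = df[mask, :]

# With ==, the mask itself still contains missing, so it cannot be used
# for row indexing without further handling.
strict = cond .== true
```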

pdeffebach · 2020-10-19T13:22:55Z

You need gdf above, as transformations are on the group level when given a grouped data frame above.

I think we can use isequal without addressing transform and select. We aren't skipping missing during the transformation, just when we choose which rows to keep.

pdeffebach · 2020-10-19T13:24:34Z

I don't think we are ready to address transform and select yet. There hasn't been consensus on behavior. Better to implement something in Missings.jl here and give users a chance to see if they like it.

bkamins · 2020-10-19T13:27:37Z

You need gdf above

fixed, it was a typo

matthieugomez · 2020-10-19T15:24:04Z

+1. It would also be nice to have a `where!` version, as well as a `view` kwarg for `where`. Last thing is the name. I like `where`. The only issue is that it is used for special syntax in Julia. Is that a problem? In particular, I think it creates some issues with using Lazy’s macro `@>`, but maybe it can be fixed.

bkamins · 2020-10-19T15:39:10Z

It would also be nice to have a where! version, as well as a view kwarg for where.

Sure - we would mirror that - thank you for raising this, as it is easy to forget.

I like where

I think if we like it (and I guess we do not have a better name) then we should try to make sure that the "ecosystem" works with it. For sure this is not a problem for DataFramesMeta.jl. @pdeffebach - have you experimented with this name in DataFramesMeta.jl?

pdeffebach · 2020-10-19T16:09:02Z

where works with linq, but the Lazy pipings all work with @where so there hasn't been a conflict.

nalimilan · 2020-10-19T20:13:32Z

I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.

We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)

Well I'd say you're playing with words. :-) Selecting rows for which isequal(x, true) holds is really skipping rows for which x is missing. And I would be inclined to call the argument skipmissing=true instead of errror_on_missing=false (that would be more standard given our terminology).

TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that where shouldn't throw an error in the presence of missing values (when you don't handle them manually), but that combine/select/transform should. Of course it's much simpler and obvious what to do to handle missing values with where, so it's easier and less risky to implement. But in both cases 1) you lose some safety if you didn't expect missing values to be present and they happen to be there, and 2) it's still painful to work with combine/select/transform in the presence of missing values.

matthieugomez · 2020-10-19T20:18:44Z

Isn't that what dplyr already does (i.e. skip missing in filter but not in mutate)?

bkamins · 2020-10-19T21:25:58Z

Well I'd say you're playing with words. :-)

The point is that we are not skipping missing in the sense how e.g. select would skip them. As for select skipmissing means REMOVE FROM THE COLLECTION or IGNORE THEM while in where it is TREAT THEM AS FALSE. And this is the point of my reasoning.
This is a different operation that we perform (skipping missings in the sense of select in where would make no sense).

skipmissing=true

For the above reason I have not proposed skipmissing kwarg but some other name (of course I am not attached to the proposed one as it is not nice obviously).

My point is that the decision what to do in select and how to implement where are logically orthogonal although they have very similar end effect.

bkamins · 2020-10-20T19:50:59Z

Regarding where for GroupedDataFrame we have two options:

it would produce rows in the order of the parent (i.e. it would internally rely on select; in this scenario subsetting groups in GroupedDataFrame is not allowed)
it would produce rows in the order of groups (i.e. it would internally rely on combine; in this scenario subsetting groups in GroupedDataFrame is allowed)

Which one do we prefer?

pdeffebach · 2020-10-20T20:27:30Z

As it's what I've already implemented in DataFramesMeta. But in general I think it's good to not re-order things too much, and this is way closer to the mental model of select than filter.

bkamins added the decision, feature, and non-breaking labels Jul 7, 2020

bkamins added this to the 1.0 milestone Jul 7, 2020

matthieugomez mentioned this issue Aug 6, 2020

Improwe workflows with filtered DataFrame #2354

Open

bkamins mentioned this issue Aug 6, 2020

Add a skipmissing kwarg to select/transform/combine #2258

Open

matthieugomez mentioned this issue Sep 1, 2020

Creating new columns on a view should fill in missings everywhere else. #2211

Closed

bkamins mentioned this issue Sep 7, 2020

Rename tail to avoid clash with Base.tail #2261

Closed

nalimilan mentioned this issue Sep 8, 2020

add predicate support for names and more tests #2417

Merged

matthieugomez mentioned this issue Oct 3, 2020

add WhereDataFrame #2467

Closed

nalimilan mentioned this issue Oct 19, 2020

Release 0.22 tracking #2484

Closed

20 tasks

bkamins mentioned this issue Oct 22, 2020

Add subset #2496

Merged

bkamins mentioned this issue Mar 4, 2021

Release 1.0 tracking #2640

Closed

19 tasks

bkamins modified the milestones: 1.0, 1.x Mar 25, 2021

nalimilan mentioned this issue Nov 28, 2023

feature request: allow skipmissing column types #3398

Open
