Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add isordered #26

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Add isordered #26

wants to merge 7 commits into from

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Oct 21, 2020

Fixes #23

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; @nalimilan ?

test/runtests.jl Outdated

@testset "isordered" begin

@test isordered([1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are failing because isordered isn't exported from DataAPI; I'd rather not export to be consistent with other DataAPI definitions, but I also think it's probably a fairly safe case if we really want to.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I will fix it - every time I think that the PR is simple (and I just edit it in the web browser) I mess up things. I should be more careful.

src/DataAPI.jl Outdated
@@ -89,6 +91,9 @@ If the collection is not sortable then the order of levels is unspecified.

Contrary to [`unique`](@ref), this function may return values which do not
actually occur in the data, and does not preserve their order of appearance in `x`.

If a type of `A` implements `levels` with a meaningful order then it should also
implement `isordered` as by defualt `isordered` returns `false`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
implement `isordered` as by defualt `isordered` returns `false`.
implement `isordered` as by default `isordered` returns `false`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence raises many issues about what we really want isordered to mean. Types that implement levels are AbstractArrays, and in general they don't know with what eltypes they will be used. So the ordering doesn't depend on them.

CategoricalArray is a very special case, but even there I would say that isordered works because the element type is CategoricalValue, which supports <. So everything is defined by the element type.

Likewise, below, I wouldn't say that < is consistent with levels. Since < is defined on elements, not on arrays, and levels is defined on arrays, there's only one way to make them consistent: have levels return values sorted according to </isless, which is what the levels fallback does. So actually it's < and isless that have to be consistent (this is probably always the case).

What could happen is that a special array type could define levels to return something which isn't consistent with </isless. But that doesn't mean that entries in the arrays wouldn't be comparable using <. Do we want to support this? That sounds a bit weird, but it could happen e.g. with a special type similar to CategoricalArray but which wouldn't wrap values in CategoricalValue (IOW a type similar to PooledArray but for which levels would return the pool). Should this kind of object define isordered to return false? That would be weird since entries could still be compared using < -- but the result would be different from the order in levels...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I was thinking how to put it best and I just "kind of" reused what CategoricalArrays.jl defined:

Test whether entries in A can be compared using <, > and similar operators, using the ordering of levels.

Which implies that isordered should tell us if what levels(A) returns is ordered if we use < for comparison of these elements (so technically if issorted(levels(A)) returns true).

In particular I understood that if levels returns something for a type that supports < that fails issorted(levels(A)) then isordered should return false.

This was my understanding of the original intent of what isordered should mean i.e. that two conditions must be met:

  1. eltype of A supports < etc.
  2. levels returns values in order consistent with it

(but maybe I was wrong).

But if I was right then levels and isordered must be in sync. This is different from Base.Ordered(), as Base.Ordered() is unrelated with levels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the problem is that isordered was initially designed with only CategoricalArrays in mind, and for them it happens to be the case that isordered, levels and </isless are all consistent. So now to extend it we have to decide whether that should always be the case (and document it) or not. That would be the simplest and most obvious definition. But it's also quite strict as e.g. a custom array for which levels(x) == ["c", "b", "a"] wouldn't respect it. In short, I think we should know what we want to use this trait for to choose the most useful solution.

@quinnj You mentioned at #23 (comment) that Arrow.jl could use this for a custom dict-encoded type. How would that type behave? How would it implement isordered?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The arrow spec says:

/// By default, dictionaries are not ordered, or the order does not have
/// semantic meaning. In some statistical, applications, dictionary-encoding
/// is used to represent ordered categorical data, and we provide a way to
/// preserve that metadata here
isOrdered: bool;

So it's not really doing much by itself other than preserving the metadata. The DictEncoded array type in Arrow.jl is basically just a PooledArray, but also includes the boolean isOrdered field; but any further "semantics" would need to be handled by the user or custom comparison code (or in the example of converting to a CategoricalArray, it could query the DataAPI.isordered property).

Copy link
Member Author

@bkamins bkamins Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be OK with both. The original definition came from CategoricalArrays.jl which are special, so we can be more flexible here I think (in which case isordered is giving us roughly the same as Ordered trait in Julia Base).

EDIT: so then the question is if we need both (the benefit of having it in DataAPI.jl is that we can change things more easily than in Julia Base).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Arrow.DictEncoded doesn't want to expose the custom order of levels using generic API, then I don't think it makes sense for it to implement isordered. Otherwise that function will be useless as it won't allow writing generic code that can do something with levels: you would know there's an order, but you wouldn't be able to retrieve it.

One issue with Base.OrderStyle is that it's supposed to work when passing only the type, but for CategoricalArray it's a runtime property...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But Arrow.DictEncoded does want to expose the custom order of levels? I'd like to implement DataAPI.refarray, DataAPI.refpool, DataAPI.levels and DataAPI.isordered on DictEncoded, so it can participate in all the optimizations built around those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I thought you didn't want that as you said:

Personally, I don't think we should require eltype of A supports < etc. and levels returns values in order consistent with it, if we don't have to.

Then we can require isordered to be consistent with levels at least, but mention that it may be inconsistent with < and isless. But we shouldn't say that isordered should return true for types that implement levels, as it wouldn't always be the case for CategoricalArray and Arrow.DictEncoded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yeah, that makes sense to me. Yeah, for Arrow.DictEncoded, isordered is an independent property.

src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins
Copy link
Member Author

bkamins commented Oct 27, 2020

As I have commented to @nalimilan on slack. I think it is the time to rethink the API that DataAPI.jl for ref values should expose in general, as we are getting more container types supporting it. For example another function that CategoricalArrays.jl exposes is is droplevels! (or have droplevels that we do not have now that would create a new object - in case we do not require mutability which would make sense) - should this be supported in general (it would be useful to have it but I am not sure if it is feasible). Another question is if we think that DataAPI.refarray an be allowed to hold anything, or we want to be more restrictive and require that it holds positive integers + require that DataAPI.refpool should support AbstractVector interface then (such changes would make the API more useful in practice as then you could do many more common operations not caring about the concrete type that implements the API).

@Nosferican
Copy link
Contributor

The use case I was thinking of was in Econometrics.jl, I need to assert whether the outcome variable is ordinal or nominal for choosing the appropriate estimator. When people use CategoricalArray I can query that through isordered. The issue that has been happening is when datasets use the PooledArray which doesn't allow me to distinguish between the two cases.

src/DataAPI.jl Outdated
using the ordering of `levels(A)`.

If a type of `A` implements `levels` with a meaningful order then it should also
implement `isordered` as by defualt `isordered` returns `false`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix typo, default

@nalimilan
Copy link
Member

The issue that has been happening is when datasets use the PooledArray which doesn't allow me to distinguish between the two cases.

Yes but I think that's on purpose: we have split the old PooledDataArray into PooledArray and CategoricalArray to avoid the confusion between "efficiently store an array with few unique values" and "represent categorical data". People who have ordinal variables should use CategoricalArrays, not PooledArrays. What would be the point of having two packages for the same thing?

@bkamins
Copy link
Member Author

bkamins commented Dec 14, 2020

What would be the point of having two packages for the same thing?

My perspective: PooledArrays.jl is lightweight and CategoricalArrays.jl is (unfortunately, but unavoidably) heavy because of this distinction.

src/DataAPI.jl Outdated Show resolved Hide resolved
@nalimilan
Copy link
Member

My perspective: PooledArrays.jl is lightweight and CategoricalArrays.jl is (unfortunately, but unavoidably) heavy because of this distinction.

I'm not sure that's true anymore now that we got rid of most invalidations.

@bkamins
Copy link
Member Author

bkamins commented Dec 14, 2020

I'm not sure that's true anymore now that we got rid of most invalidations.

By "heavy" I did not mean problematic for Julia compiler, but rather that it has much more complex logic (e.g. the code for merging levels between two CategoricalArrays, having a custom CategoricalValue type etc.).

@Nosferican
Copy link
Contributor

I am not opposed to having different representations (e.g., efficient storage of few values vs categorical). What would be useful is to have any representation that does not support or uses an ordinal property to be queryable about it. Basically, provide an API that allows to get the information: does this representation offer this information / property and if it does how to query it.

@bkamins
Copy link
Member Author

bkamins commented Dec 14, 2020

Maybe you should just assume that your users will be using scientific types from https://alan-turing-institute.github.io/MLJScientificTypes.jl/dev/?

@Nosferican
Copy link
Contributor

Likely not. Currently, the docs is to just use CategoricalArrays but every so often a user encounters an error due to using PooledArray. I am aiming to make the code more general and friendly rather than restricted / with more assumptions. However, having the API here might help clean up the MLJScientificTypes code.

@nalimilan
Copy link
Member

OK, then I think we should go with what I proposed above at #26 (comment).

Test whether entries in `A` can be compared using `<`, `>` and similar operators,
using the ordering of `levels(A)`.

If a type of `A` implements `levels` with a meaningful order then it should also
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - could you please comment in the implementation how your proposal should be incorporated? Thank you!

@Nosferican
Copy link
Contributor

Nosferican commented Jan 26, 2021

Notice that OrderedCollections.jl also defines isordered for ordered collections like OrderedDict.

@nalimilan
Copy link
Member

Good catch. Luckily it's not exported (JuliaCollections/OrderedCollections.jl#45).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

isordered
4 participants