How to compare to value? #363

tbenst · 2021-09-02T23:55:42Z

I'm not sure how to do x .> "Young" Is there a way to make Young a CategoricalValue?

julia> x = CategoricalArray(["Old", "Young", "Middle", "Young"], levels=["Young", "Middle", "Old"], ordered=true)
4-element CategoricalArray{String,1,UInt32}:
 "Old"
 "Young"
 "Middle"
 "Young"

julia> x[1] > x[2] 
true

julia> x .> x[2]
4-element BitVector:
 1
 0
 1
 0

julia> x .> "Young" # I'm not sure how to do this?
ERROR: ArgumentError: cannot compare a `CategoricalValue` to value `v` of type `CategoricalValue{String, UInt32}`

The text was updated successfully, but these errors were encountered:

tbenst · 2021-09-03T00:14:48Z

is there any reason not to implement:

function >(v::CategoricalValue{T,R},y::String) where {T,R}
    catv_y = CategoricalArrays.pool(v)[get(CategoricalArrays.pool(v), y)]
    v > catv_y
end

function <(v::CategoricalValue{T,R},y::String) where {T,R}
    catv_y = CategoricalArrays.pool(v)[get(CategoricalArrays.pool(v), y)]
    v < catv_y
end


function ==(v::CategoricalValue{T,R},y::String) where {T,R}
    catv_y = CategoricalArrays.pool(v)[get(CategoricalArrays.pool(v), y)]
    v < catv_y
end

bkamins · 2021-09-03T08:42:22Z

See #346 and use:

julia>  x = CategoricalArray(["Old", "Young", "Middle", "Young"], levels=["Young", "Middle", "Old"], ordered=true)       
4-element CategoricalArray{String,1,UInt32}:
 "Old"
 "Young"
 "Middle"
 "Young"

julia> x .< CategoricalValue("Young", x)
4-element BitVector:
 0
 0
 0
 0

julia> x .< CategoricalValue("Middle", x)
4-element BitVector:
 0
 1
 0
 1

tbenst · 2021-09-03T19:04:24Z

@bkamins thanks! That’s a bit more ergonomic but still fairly unwieldy—is there a reason to not do CategoricalValue("Middle", x) automatically on the users behalf? I’m a new user of this package, but the default behavior of R for factors/categoricals is to support such comparisons with strings, I think it’s quite ergonomic and follows principle of least surprise

bkamins · 2021-09-03T21:32:00Z

As explained in linked #346 doing it automatically leads to non transitivity of <, which breaks a basic contract of such operation.

An example from R:

> x <- ordered("1", levels=c("1", "2"))
> y <- ordered("1", levels=c("2", "1"))
> x < "2" && "2" < y
[1] TRUE

so should imply that x < y but we get:

> x < y
Error in Ops.ordered(x, y) : level sets of factors are different

The problem is that, as opposed to R Julia exposes CategoricalValue type, which can be stored in any container.

tbenst · 2021-09-04T00:32:04Z

Hm, I'm not sure I agree--I think R is principled for the example you give. There are two categories-- a) co-product of String and levels=c("1", "2"), and b) co-product of String and levels=c("2,"1"). Transitivity only holds within a category. x < y tries to compare objects from two different categories which is indeed an error.

Edit: to clarify, I think it is also principled to only allow comparison within the same categorical pool. I just wanted to point out that from a category theory standpoint, x < "2" and "2" < y are not the same function, they are two different morphisms. It would seem that CategoricalArrays supports total orders with CategoricalValue objects, while R allows for partial orders over the co-product / Union type including string. Both designs maintain transitivity, but there may be other reasons to prefer one over the other.

bkamins · 2021-09-04T05:43:07Z

Consider the following in Julia (this is not doable in R):

julia> x = categorical([1], levels=[1,2], ordered=true)
1-element CategoricalArray{Int64,1,UInt32}:
 1

julia> y = categorical([1], levels=[2,1], ordered=true)
1-element CategoricalArray{Int64,1,UInt32}:
 1

julia> z = Any[x[1], 2, y[1]]
3-element Vector{Any}:
  CategoricalValue{Int64, UInt32} 1 (1/2)
 2
  CategoricalValue{Int64, UInt32} 1 (2/2)

And now if you called:

issorted(z)

under your rules you would get true, while currently you get:

julia> issorted(z)
ERROR: ArgumentError

Also if you called sort(z) under your rule the result be undefined (it would either error or could produce different results depending on the sorting algorithm).

Now you might ask if things like z variable happen in practice. The answer is AFAICT yes. If I remember correctly in MLJ.jl ecosystem it is possible to get vectors mixing CategoricalValues from different pools in a single container in some operations.

Having said that, I keep the issue open as when doing the current design we were aware that it is inconvenient. However - design wise - it is better to be restrictive (as we are now) and then consider loosening the rules as this approach is non breaking. Let us discuss here the pros and cons.

tbenst · 2021-09-04T06:53:05Z

Thanks for the thoughtful discussion! I’ll reply to your comment here to consolidate.

I completely agree with you on isless, and on sorting. These need to be total orders.

However, note that < is explicitly for partial orders, and is not used for sorting. That same documentation goes on to say Types with a partial order should implement <. Here are the docs for <:

<(x, y)

Less-than comparison operator. Falls back to isless. Because of the behavior of
floating-point NaN values, this operator implements a partial order.

Implementation

New numeric types with a canonical partial order should implement this function
for two arguments of the new type. Types with a canonical total order should
implement isless instead. (x < y) | (x == y)

As such, I think the contract in Base explicitly endorses < for partial orders. So I think the case of comparing strings and CategoricalValue with < is a design decision that can be considered, but isless cannot be implemented in this fashion for the excellent points you made.

bkamins · 2021-09-04T07:47:03Z

This is an excellent point. Indeed the benefit of having isless and < separated in Julia could be employed here.

#346 was implemented to support isless and I did not spend enough thought on < in this context. Indeed non-categorical value could be treated as kind of NaN for floats (with the difference that all thre values: true, false, or missing could be returned in this case).

@nalimilan - what do you think?

nalimilan · 2021-09-04T10:45:51Z

I'm glad to hear that some people would want to use < with strings. I wasn't sure before removing it whether users really cared about it.

That said, AFAICT the docstring for < implies that transitivity is required, since that's one of the conditions for the existence of a partial order. (< on floating point is transitive, since comparison involving NaN always return false.) In practice, we could probably break this assumption without it being too much of an issue, but this deserves careful consideration. Originally, we allowed both isless and < mixing CategoricalValue and strings, and it took a few years to realize that the behavior of isless could be problematic in some contexts.

BTW, R is well-known for being full of dangerous corner cases (in particular in operations involving factors), so I wouldn't take it as a good reference for this kind of thing.

bkamins · 2021-09-04T11:41:36Z

@ablaom - could you please cross check/comment how breaking transitivity of < with categorical values would play with MLJ.jl ecosystem? Thank you! (we would still keep isless transitive)

tbenst · 2021-09-04T19:07:21Z

I’m still a bit puzzled by the comments that such a change could break transitivity. Transitivity is maintained, it’s just now a partial order: if x > “2” and y < “2”, then either x > y == true or x > y == missing [or errors or …]

edit: I suppose our confusion arises from how we differ in our consideration of categorical([1], levels=[1,2], ordered=true) and categorical([1], levels=[1,2], ordered=true). I think the fact that they are the same type in Julia is really a type system limitation, for performance reasons? I would argue that they are not the same type in a Type Theory sense, and hence x >1 “2” and y >2 “2” makes no prediction that x >1 y or x >2 y, since there are two different relations at play

oops and yes, meant to use levels=[2,1] for one of them

nalimilan · 2021-09-04T19:20:18Z

The transitivity issue that I'm referring to is this:

julia> x = categorical(["a", "b", "c"], levels=["c", "b", "a"], ordered=true)
3-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"

julia> x[3] < "b" && "b" < x[1]
true

julia> x[3] == "c" && x[1] == "a"
true

julia> "c" < "a"
false

tbenst · 2021-09-04T19:25:03Z

Ah I was editing my comment right as you posted that. Multiple dispatch is hiding that there are multiple methods at play there, transitivity is not broken for an individual method. So a question becomes, is transitivity supposed to be maintained at a method level or a function level? I think method level makes more sense—these are different relations represented by same symbol. We could rewrite to make this point more explicit that transitivity is not implied since these are different relations

julia> x[3] <1 "b" && "b" <1 x[1]
true

julia> x[3] ==1 "c" && x[1] ==1 "a"
true

julia> "c" <2 "a"
false

nalimilan · 2021-09-04T19:25:13Z

edit: I suppose our confusion arises from how we differ in our consideration of categorical([1], levels=[1,2], ordered=true) and categorical([1], levels=[1,2], ordered=true). I think the fact that they are the same type in Julia is really a type system limitation, for performance reasons?

I guess you meant to use different levels between the two arrays? If so yes, the decision to avoid making levels part of the type is due to performance (compilation cost) considerations, but also on practical considerations (you wouldn't be able to add, remove or reorder levels without first creating a new array).

I would argue that they are not the same type in a Type Theory sense, and hence x >1 “2” and y >2 “2” makes no prediction that x >1 y or x >2 y, since there are two different relations at play

Type theory is nice and all, but in practical terms what matters whether we think that people might use < and get confusing results or not.

nalimilan · 2021-09-04T19:30:29Z

So a question becomes, is transitivity supposed to be maintained at a method level or a function level? I think method level makes more sense—these are different relations represented by same symbol. We could rewrite to make this point more explicit that transitivity is not implied since these are different relations

In Julia, all methods defined for a function are supposed to follow the same API contract. Otherwise it's impossible to write generic code reliably. Of course we can make some minor exceptions to this rule, but that requires careful consideration. Otherwise, we could perfectly introduce a different function and/or operator.

tbenst · 2021-09-04T19:32:14Z

Transitivity is broken on current implementation (although maybe a bug?)

julia> x = categorical(["a", "b", "c"], levels=["c", "b", "a"], ordered=true)

julia> y = categorical(["a", "b", "c"], levels=["a","b","c"], ordered=true);

julia> x[1] == y[1] # y1 >= x1
true

julia> y[2] > y[1] # y2 >= y1
true

julia> y[2] > x[1] # y2 >= x1
ERROR: ArgumentError: CategoricalValue objects from unequal pools cannot be tested for order
Stacktrace:
 [1] <(x::CategoricalValue{String, UInt32}, y::CategoricalValue{String, UInt32})
   @ CategoricalArrays ~/.julia/packages/CategoricalArrays/rDwMt/src/value.jl:172
 [2] >(x::CategoricalValue{String, UInt32}, y::CategoricalValue{String, UInt32})
   @ Base ./operators.jl:305
 [3] top-level scope
   @ REPL[6]:1

nalimilan · 2021-09-04T19:54:18Z

Well yes, but at least it throws an error, so it's much less dangerous than silently returning a misleading result. There's no good solution here anyway: should y[2] > x[1] be equivalent to y[2] > y[1] == true or to x[2] > x[1] == false?

tbenst · 2021-09-04T20:24:39Z

I agree that the current implementation is reasonable! I'm more just pointing out the folly of aiming for transitivity for the comparison functions, across all methods. It's not currently the case in Julia nor is it reasonable to expect, imho. here's a silent "error"

julia> x[1] == "a"
true

julia> x[2] == "b"
true

julia> x[2] < x[1]
true

julia> "b" < "a"
false

nalimilan · 2021-09-11T12:57:43Z

Note that the full error message is relatively explicit after fixing the typo at #366:

julia> x .> "Young"
ERROR: ArgumentError: cannot compare a `CategoricalValue` to value `v` of type `String`: wrap `v` using `CategoricalValue(v, catvalue)` or `CategoricalValue(v, catarray)` first

nalimilan mentioned this issue Sep 5, 2021

Adding levels in CategoricalValue call #364

Open

nalimilan mentioned this issue Mar 6, 2022

LabeledArray and CategoricalArray junyuan-chen/ReadStatTables.jl#4

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to compare to value? #363

How to compare to value? #363

tbenst commented Sep 2, 2021

tbenst commented Sep 3, 2021 •

edited

Loading

bkamins commented Sep 3, 2021

tbenst commented Sep 3, 2021

bkamins commented Sep 3, 2021

tbenst commented Sep 4, 2021 •

edited

Loading

bkamins commented Sep 4, 2021 •

edited

Loading

tbenst commented Sep 4, 2021

bkamins commented Sep 4, 2021

nalimilan commented Sep 4, 2021

bkamins commented Sep 4, 2021

tbenst commented Sep 4, 2021 •

edited

Loading

nalimilan commented Sep 4, 2021

tbenst commented Sep 4, 2021

nalimilan commented Sep 4, 2021

nalimilan commented Sep 4, 2021

tbenst commented Sep 4, 2021 •

edited

Loading

nalimilan commented Sep 4, 2021

tbenst commented Sep 4, 2021 •

edited

Loading

nalimilan commented Sep 11, 2021

How to compare to value? #363

How to compare to value? #363

Comments

tbenst commented Sep 2, 2021

tbenst commented Sep 3, 2021 • edited Loading

bkamins commented Sep 3, 2021

tbenst commented Sep 3, 2021

bkamins commented Sep 3, 2021

tbenst commented Sep 4, 2021 • edited Loading

bkamins commented Sep 4, 2021 • edited Loading

tbenst commented Sep 4, 2021

bkamins commented Sep 4, 2021

nalimilan commented Sep 4, 2021

bkamins commented Sep 4, 2021

tbenst commented Sep 4, 2021 • edited Loading

nalimilan commented Sep 4, 2021

tbenst commented Sep 4, 2021

nalimilan commented Sep 4, 2021

nalimilan commented Sep 4, 2021

tbenst commented Sep 4, 2021 • edited Loading

nalimilan commented Sep 4, 2021

tbenst commented Sep 4, 2021 • edited Loading

nalimilan commented Sep 11, 2021

tbenst commented Sep 3, 2021 •

edited

Loading

tbenst commented Sep 4, 2021 •

edited

Loading

bkamins commented Sep 4, 2021 •

edited

Loading

tbenst commented Sep 4, 2021 •

edited

Loading

tbenst commented Sep 4, 2021 •

edited

Loading

tbenst commented Sep 4, 2021 •

edited

Loading