`LabeledArray` and `CategoricalArray` #4

nalimilan · 2022-03-06T17:27:46Z

I've just discovered this package, and it's really cool! I had been meaning to implement something just like this.

Something which I've been wondering for some time based on my experience in R is the relationship between LabeledArray and CategoricalArray (or labelled and factor in R). As I see it, value labels are just the way Stata, SAS and SPSS deal with categorical variables. Unfortunately that's a leaky abstraction which doesn't really allow knowing whether a variable is supposed to be categorical or continuous (see e.g. this discussion). In R, I find the divide between labelled vectors and factors to be annoying as it creates a schism which makes data handling even more complex than it needs to be. Maybe we can do better in Julia thanks to its powerful type system and by learning from previous experiences (since the fundamental design isn't set in stone as in R and Stata)?

Of course currently CategoricalArray isn't able to store variables with value labels as it only allows for consecutive reference codes starting from 1 -- so it loses the underlying values if they do not fit that scheme. But it could store an additional field giving the mapping from each level to a custom code. Do you think this would allow merging LabeledArray and CategoricalArray?

I know it's possible in Stata to assign value labels only to some values. This does not seem too problematic as a level could be generated automatically by calling string on the value. I've also read that some variables may have labels attached to some values even though they are truly continuous, but I couldn't find examples of this. Probably value labels are used for continuous variables only to attach labels to missing values, which makes more sense than attaching labels to arbitrary numeric values. Maybe Likert scales are an intermediate case which can be considered as continuous with some assumptions, but treating them as categorical by default sounds safer.
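To make the proposal concrete, here is a minimal sketch of a categorical container that keeps an extra field mapping each level back to its original code, and that generates a level with `string` for values that have no label. Everything named here (`CodedCategoricalVector`, `originalvalues`, the field names) is hypothetical and is not part of CategoricalArrays.jl; it only illustrates the idea under those assumptions.

```julia
# Hypothetical sketch: none of these names exist in CategoricalArrays.jl.
struct CodedCategoricalVector{V}
    refs::Vector{Int}       # consecutive reference codes starting from 1
    levels::Vector{String}  # levels[r] is the label for reference code r
    codes::Vector{V}        # codes[r] is the original (custom) value for reference code r
end

# Build the container from raw values and a possibly partial value => label mapping.
function CodedCategoricalVector(vals::AbstractVector{V}, labels::AbstractDict) where {V}
    codes = sort!(unique(vals))
    # Values without a label get a level generated by calling `string` on the value.
    levels = [string(get(labels, c, c)) for c in codes]
    index = Dict(c => r for (r, c) in enumerate(codes))
    refs = [index[v] for v in vals]
    return CodedCategoricalVector{V}(refs, levels, codes)
end

# Recover the original values, which a plain 1..n coding would have lost.
originalvalues(x::CodedCategoricalVector) = x.codes[x.refs]

x = CodedCategoricalVector([1, 5, 9, 5], Dict(1 => "low", 5 => "high"))
x.levels           # ["low", "high", "9"]
originalvalues(x)  # [1, 5, 9, 5]
```

In this sketch the value 9 has no label, so its level is simply "9", while the original codes 1, 5 and 9 remain retrievable.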

Thanks in advance for your feedback! My experience designing CategoricalArrays is that it's super hard to get right (and issues keep being spotted from time to time), so I figure we'd better join forces if that's possible.

Cc: @bkamins, @pdeffebach


junyuan-chen · 2022-03-06T18:22:24Z

Hi @nalimilan! I am glad to see you here, and thank you for your work on CategoricalArray.

The core idea behind LabeledArray is to separate two tasks: 1) the representation of data with labels and 2) the usage of such data, which varies with what users eventually want to do. In my own experience, it often happens that both the labels and the associated integer codes are meaningful and need to remain unchanged. Whether labels are assigned or not does not always have something to do with whether the data will be used as categorical data in the statistical sense. Since both CategoricalArray and PooledArray only treat the labels as meaningful, I decided to let LabeledArray fill that gap.

In some sense, what I did is simply to let users decide for themselves how they want to proceed. I feel that this adds flexibility without introducing the complexity of handling every hard-to-imagine usage scenario.

bkamins · 2022-03-06T19:18:19Z

@junyuan-chen - to better understand your design: what is the added value of LabeledArray over having, e.g., a Vector with an attached Dict as metadata holding the value-to-label mapping, and using the get method of that Dict when needed?
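For concreteness, the plain alternative described in this question could look like the following minimal sketch. The variable names `vals` and `labels` are illustrative only, and the fallback to `string(v)` for unlabeled values is an assumption.

```julia
# A plain Vector of values plus a Dict kept alongside it as metadata.
vals = [1, 2, 1, 3]
labels = Dict(1 => "a", 2 => "b")

# Look labels up on demand with `get`, falling back to the value itself
# (rendered as a string) when no label has been assigned.
shown = [get(labels, v, string(v)) for v in vals]  # ["a", "b", "a", "3"]

# Comparisons stay explicit: against the value, or against the looked-up label.
vals[1] == 1                       # true
get(labels, vals[1], nothing) == "a"  # true
```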

junyuan-chen · 2022-03-06T19:41:21Z

Hi, @bkamins! That's a great question.

You are correct that LabeledArray is really just an Array bundled with a Dict that maps values to labels. The added value is simply that using it feels similar to working in Stata. It saves the step of switching between values and labels when comparing, since comparing a LabeledValue to either the label or the value uses the same syntax. So, something like the following works:

julia> lv = LabeledValue(1,Dict(1=>"a"))
1 => a

julia> lv==1
true

julia> lv=="a"
true

This is really just syntactic sugar. The whole point, when I first thought about it, was to keep things simple so that there is no need to type more every time such a comparison is needed.

bkamins · 2022-03-06T20:24:13Z

This was exactly the reason for my question. The point is that with this syntactic sugar the transitivity of == is broken:
1 == lv and lv == "a" are both true, yet this does not imply that 1 == "a" is true.

I understand that this is quite convenient in interactive use, but my fear was that some production code could assume transitivity of == on your type and then silently produce incorrect results.
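The concern can be reproduced with a small stand-in type. `LooseValue` below is hypothetical and is not the package's actual LabeledValue; it only mimics a == that matches either the value or the label.

```julia
# Hypothetical stand-in for the behaviour under discussion; not the package's type.
struct LooseValue
    value::Int
    label::String
end

# `==` succeeds against either the underlying value or the label.
Base.:(==)(x::LooseValue, y::Integer) = x.value == y
Base.:(==)(x::Integer, y::LooseValue) = y == x
Base.:(==)(x::LooseValue, y::AbstractString) = x.label == y
Base.:(==)(x::AbstractString, y::LooseValue) = y == x

lv = LooseValue(1, "a")
1 == lv     # true
lv == "a"   # true
1 == "a"    # false, so transitivity does not hold
```

Generic code that relies on == being an equivalence relation can therefore reach inconsistent conclusions depending on which pairs it happens to compare.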

What do you think about it?

My question is guided by the following reasoning:

With @nalimilan we think that your package is very nice.
However, over the years of maintaining JuliaData packages, we have learned to be very careful when some functionality is non-standard. The current handling of == by LabeledValue is non-standard, so I think it is worth discussing this design before the package gets more widespread use (of course we fully respect that this is your package and you are free to design it the way you wish).

junyuan-chen · 2022-03-06T20:36:17Z

Thank you (@bkamins) for your comment! I really appreciate your effort.

Yes, I agree that the behavior of == is non-standard. The way I defined it has really turned it into something else that should not be called ==. I will keep that in mind, although I do not have an immediate solution off the top of my head (other than giving this kind of comparison a new name).

nalimilan · 2022-03-06T22:18:14Z

Thanks for the quick reply! Indeed it would be safer to avoid defining == in a non-transitive way. Maybe requiring users to write something like label(lv) == "a" would be acceptable in terms of convenience? We had a similar issue with < in CategoricalArrays (JuliaData/CategoricalArrays.jl#363) and I disallowed comparing a CategoricalValue with a non-CategoricalValue using < for this reason.
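A minimal sketch of what the explicit style could look like, assuming accessor functions named `label` and `value`. The `StrictValue` type and both accessors are defined here purely for illustration and do not describe the package's actual API.

```julia
# Hypothetical type with explicit accessors instead of an overloaded `==`.
struct StrictValue
    value::Int
    label::String
end

value(x::StrictValue) = x.value
label(x::StrictValue) = x.label

# Equality is defined only between two StrictValues, based on the underlying value.
Base.:(==)(x::StrictValue, y::StrictValue) = x.value == y.value
Base.hash(x::StrictValue, h::UInt) = hash(x.value, h)

lv = StrictValue(1, "a")
label(lv) == "a"   # true
value(lv) == 1     # true
lv == "a"          # false: falls back to the generic `==`
lv == 1            # false: same fallback
```

The cost is a few extra characters per comparison; in exchange, == keeps its usual properties.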

> The core idea behind LabeledArray is to separate two tasks: 1) the representation of data with labels and 2) the usage of such data, which varies with what users eventually want to do. In my own experience, it often happens that both the labels and the associated integer codes are meaningful and need to remain unchanged. Whether labels are assigned or not does not always have something to do with whether the data will be used as categorical data in the statistical sense. Since both CategoricalArray and PooledArray only treat the labels as meaningful, I decided to let LabeledArray fill that gap.
>
> In some sense, what I did is simply to let users decide for themselves how they want to proceed. I feel that this adds flexibility without introducing the complexity of handling every hard-to-imagine usage scenario.

I see the goal, though in practice what differences does it make compared with a CategoricalArray that would allow retrieving the original value? The main one I can see is that == for CategoricalValue compares the label while for LabeledValue it compares the underlying value (except for strings as discussed above). Other than that they are very similar, as e.g. you cannot do math on LabeledValue even if they are numeric, and print(lv) and string(lv) give the label. If CategoricalArrays allowed getting a zero-copy iterator/view of the underlying values using e.g. values(ca), and the value behind a particular CategoricalValue using e.g. value(cv), then the result would be very close. Am I missing something?
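As a rough illustration of the zero-copy idea, the sketch below uses only Base arrays: `refs`, `codes` and `levels` are hypothetical stand-ins for the internals such a CategoricalArray might have, and `view` indexes into the per-level tables without materializing a new array of values.

```julia
# Hypothetical internals: reference codes plus per-level tables.
refs   = [1, 2, 3, 2]             # one reference code per observation
codes  = [1, 5, 9]                # original value for each reference code
levels = ["low", "high", "9"]     # label for each reference code

# Views that index the per-level tables by `refs`; no new value array is allocated.
underlying = view(codes, refs)
shownlabels = view(levels, refs)

underlying[2]         # 5
shownlabels[2]        # "high"
collect(underlying)   # [1, 5, 9, 5]
```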

junyuan-chen · 2022-03-06T22:32:16Z

> If CategoricalArrays allowed getting a zero-copy iterator/view of the underlying values using e.g. values(ca), and the value behind a particular CategoricalValue using e.g. value(cv), then the result would be very close. Am I missing something?

You are right. If the original values are retained, then CategoricalArray would be sufficient for getting the job done.

nalimilan · 2022-03-07T08:29:02Z

OK, great. Then I'll try to add that feature in a CategoricalArrays branch and I'll let you know so we can see whether it would fit ReadStatTables's needs.

junyuan-chen · 2022-03-07T08:37:21Z

Awesome! @nalimilan

junyuan-chen · 2022-11-28T04:49:09Z

I am closing this issue due to the changes made here.

The transitivity of == has been fixed. LabeledArrays are also no longer stored inside a ReadStatTable. They are now just wrappers created when retrieving a data column via Tables.getcolumn.

nalimilan · 2022-11-28T09:00:14Z

Great!

nalimilan mentioned this issue Oct 2, 2022

Import R data frame attributes as metadata JuliaData/RData.jl#93

Merged

junyuan-chen mentioned this issue Nov 28, 2022

Improve LabeledArray and documentation #11

Merged

junyuan-chen closed this as completed Nov 28, 2022
