LabeledArray and CategoricalArray #4

Closed
nalimilan opened this issue Mar 6, 2022 · 11 comments

@nalimilan
Contributor

nalimilan commented Mar 6, 2022

I've just discovered this package, it's really cool! I had been meaning to implement something just like this.

Something I've been wondering about for some time, based on my experience in R, is the relationship between LabeledArray and CategoricalArray (or labelled and factor in R). As I see it, value labels are just the way Stata, SAS and SPSS deal with categorical variables. Unfortunately that's a leaky abstraction which doesn't really allow knowing whether a variable is supposed to be categorical or continuous (see e.g. this discussion). In R, I find the divide between labelled vectors and factors annoying, as it creates a schism which makes data handling even more complex than it needs to be. Maybe we can do better in Julia thanks to its powerful type system and by learning from previous experiences (since the fundamental design isn't set in stone as in R and Stata)?

Of course currently CategoricalArray isn't able to store variables with value labels as it only allows for consecutive reference codes starting from 1 -- so it loses the underlying values if they do not fit that scheme. But it could store an additional field giving the mapping from each level to a custom code. Do you think this would allow merging LabeledArray and CategoricalArray?
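
To make the idea concrete, here is a minimal sketch of what "an additional field giving the mapping from each level to a custom code" could look like. This is not the CategoricalArrays.jl API: the type `CodedCategorical`, its field names, and the accessors `label` and `code` are all hypothetical, and the sketch assumes every value has a label.

```julia
# Hypothetical sketch, not the CategoricalArrays.jl implementation.
struct CodedCategorical{V,L}
    refs::Vector{UInt32}   # consecutive reference codes starting from 1
    levels::Vector{L}      # levels[r] is the label for reference code r
    codes::Vector{V}       # codes[r] is the original custom code for reference code r
end

# Build from raw values plus a value => label mapping.
# Assumption: every distinct value has an entry in `labels`.
function CodedCategorical(values::AbstractVector{V}, labels::AbstractDict{V,L}) where {V,L}
    codes = sort!(unique(values))
    index = Dict(c => UInt32(i) for (i, c) in enumerate(codes))
    refs = UInt32[index[v] for v in values]
    levels = L[labels[c] for c in codes]
    return CodedCategorical{V,L}(refs, levels, codes)
end

# Both the label and the original code stay recoverable for each element.
label(x::CodedCategorical, i::Integer) = x.levels[x.refs[i]]
code(x::CodedCategorical, i::Integer) = x.codes[x.refs[i]]

x = CodedCategorical([1, 5, 9, 5], Dict(1 => "low", 5 => "medium", 9 => "high"))
label(x, 2)  # "medium"
code(x, 3)   # 9
```

The point of the extra `codes` field is only that non-consecutive codes such as 1, 5, 9 survive the round trip, while the storage stays the usual 1-based reference scheme.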

I know it's possible in Stata to assign value labels only to some values. This does not seem too problematic as a level could be generated automatically by calling string on the value. I've also read that some variables may have labels attached to some values even though they are truly continuous, but I couldn't find examples of this. Probably value labels are used for continuous variables only to attach labels to missing values, which makes more sense than attaching labels to arbitrary numeric values. Maybe Likert scales are an intermediate case which can be considered as continuous with some assumptions, but treating them as categorical by default sounds safer.
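
A rough illustration of the fallback for partially labeled variables, using only Base Julia. The helper name `labelof` and the example labels are made up for this sketch.

```julia
# Hypothetical helper: use the label when one exists,
# otherwise fall back to the printed form of the value.
labelof(labels::AbstractDict, v) = get(labels, v, string(v))

# Only the end points of the scale carry labels.
labels = Dict(1 => "Strongly disagree", 5 => "Strongly agree")
scores = [1, 2, 5]

levels = [labelof(labels, v) for v in scores]
# 3-element Vector{String}: "Strongly disagree", "2", "Strongly agree"
```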

Thanks in advance for your feedback! My experience designing CategoricalArrays is that it's super hard to get right (and issues keep being spotted from time to time), so I figure we'd better join forces if that's possible.

Cc: @bkamins, @pdeffebach

@junyuan-chen
Owner

Hi @nalimilan! I am glad to see you here, and thank you for your work on CategoricalArray.

The core idea behind LabeledArray is to separate two tasks: 1) the representation of data with labels and 2) the usage of such data, which varies based on what users eventually want to do. In my own experience, it often happens that both the labels and the associated integer codes are meaningful and need to remain unchanged. Whether labels are assigned or not does not always have something to do with whether the data will be used as categorical data in the statistical sense. Since both CategoricalArray and PooledArray only treat the labels as meaningful, I decided to let LabeledArray fill that gap.

In some sense, what I did is simply let users decide for themselves how they want to proceed. I feel that this adds flexibility without introducing the complexity of handling all the hard-to-imagine usage scenarios.

@bkamins

bkamins commented Mar 6, 2022

@junyuan-chen - to help me better understand your design: what is the added value of LabeledArray over having e.g. a Vector with an attached Dict as metadata holding the value-to-label mapping, and using the get method of that Dict when needed?

@junyuan-chen
Owner

Hi, @bkamins! That's a great question.

You are correct that LabeledArray is really just an Array bundled with a Dict that maps values to labels. The added value is simply that using it feels similar to the experience with Stata. It saves the step of switching between values and labels when comparing values, as comparing a LabeledValue to either the label or the value uses the same syntax. So, something like the following works:

```julia
julia> lv = LabeledValue(1, Dict(1=>"a"))
1 => a

julia> lv == 1
true

julia> lv == "a"
true
```

This is really just syntactic sugar. But the whole point, when I initially thought about it, was to keep things simple so that there is no need to type more every time such a comparison is needed.

@bkamins

bkamins commented Mar 6, 2022

This was exactly the reason for my question. The point is that with this syntactic sugar the transitivity of == is broken since:
if 1 == lv && lv == "a" is true this does not imply 1 == "a" is true.

I understand that this is quite convenient in interactive use, but my fear was that some production code could assume transitivity of == on your type and then silently produce incorrect results.
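
As an illustration of how this could bite, here is a self-contained toy example. `ToyLabeled` is a stand-in written for this sketch, not the package's LabeledValue, and `all_equal` is a hypothetical piece of generic code that relies on == being transitive.

```julia
# Toy stand-in for LabeledValue; the real type in the package may differ.
struct ToyLabeled
    value::Int
    label::String
end

# Permissive equality in the style discussed above (illustration only).
Base.:(==)(x::ToyLabeled, y::Integer) = x.value == y
Base.:(==)(x::Integer, y::ToyLabeled) = y == x
Base.:(==)(x::ToyLabeled, y::AbstractString) = x.label == y
Base.:(==)(x::AbstractString, y::ToyLabeled) = y == x

lv = ToyLabeled(1, "a")

# Generic code that assumes transitivity: if every element equals the
# first one, all elements are taken to be equal to each other.
function all_equal(xs)
    first_x = first(xs)
    return all(x -> x == first_x, xs)
end

1 == lv && lv == "a"          # true
1 == "a"                      # false
all_equal(Any[lv, 1, "a"])    # true, although 1 and "a" are not equal
```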

What do you think about it?

My question is guided by the following reasoning:

  1. With @nalimilan we think that your package is very nice.
  2. However, over the years of maintaining JuliaData packages, we have learned to be very careful when some functionality is non-standard. The current handling of == by LabeledValue is non-standard, so I think it is worth discussing this design before the package gets more widespread use (of course we fully respect that this is your package and you are free to design it the way you wish).

@junyuan-chen
Owner

Thank you (@bkamins) for your comment! I really appreciate your effort.

Yes, I agree that the behavior of == is non-standard. The way I defined it has really turned it into something else that should not be called ==. I will keep that in mind, although I do not have an immediate solution off the top of my head (except picking a new name for this kind of comparison).

@nalimilan
Contributor Author

Thanks for the quick reply! Indeed it would be safer to avoid defining == in a non-transitive way. Maybe requiring users to write something like label(lv) == "a" would be acceptable in terms of convenience? We had a similar issue with < in CategoricalArrays (JuliaData/CategoricalArrays.jl#363) and I disallowed comparing a CategoricalValue with a non-CategoricalValue using < for this reason.
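
A small sketch of what that could look like, again with a toy type rather than the package's own. The names `ToyLabeledValue`, `label` and `value` are assumptions made for this example, and it is not a description of how the package ended up fixing the issue.

```julia
# Toy type for illustration; not the package's LabeledValue.
struct ToyLabeledValue
    value::Int
    label::String
end

# Explicit accessors: the caller states which side is being compared.
label(x::ToyLabeledValue) = x.label
value(x::ToyLabeledValue) = x.value

# == is only specialized between two labeled values.
Base.:(==)(x::ToyLabeledValue, y::ToyLabeledValue) = x.value == y.value

lv = ToyLabeledValue(1, "a")

label(lv) == "a"   # true
value(lv) == 1     # true
lv == "a"          # false: falls back to the generic ==(x, y), which uses ===
lv == 1            # false, for the same reason
```

With this shape, == stays transitive because a labeled value is never equal to a bare string or integer, at the cost of a few extra characters per comparison.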

> The core idea behind LabeledArray is to separate two tasks: 1) the representation of data with labels and 2) the usage of such data, which varies based on what users eventually want to do. In my own experience, it often happens that both the labels and the associated integer codes are meaningful and need to remain unchanged. Whether labels are assigned or not does not always have something to do with whether the data will be used as categorical data in the statistical sense. Since both CategoricalArray and PooledArray only treat the labels as meaningful, I decided to let LabeledArray fill that gap.
>
> In some sense, what I did is simply let users decide for themselves how they want to proceed. I feel that this adds flexibility without introducing the complexity of handling all the hard-to-imagine usage scenarios.

I see the goal, though in practice what difference does it make compared with a CategoricalArray that would allow retrieving the original value? The main one I can see is that == for CategoricalValue compares the label while for LabeledValue it compares the underlying value (except for strings as discussed above). Other than that they are very similar, as e.g. you cannot do math on LabeledValue even if they are numeric, and print(lv) and string(lv) give the label. If CategoricalArrays allowed getting a zero-copy iterator/view of the underlying values using e.g. values(ca), and the value behind a particular CategoricalValue using e.g. value(cv), then the result would be very close. Am I missing something?
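
For illustration, a zero-copy view of that kind could be sketched as a small lazy AbstractVector. Everything here is hypothetical: `CodeView` and its fields are invented for this example and say nothing about how CategoricalArrays would actually implement values(ca).

```julia
# Hypothetical lazy view: maps reference codes to original values on access,
# without materializing a new array.
struct CodeView{V} <: AbstractVector{V}
    refs::Vector{UInt32}   # shared with the parent array, not copied
    codes::Vector{V}       # codes[r] is the original value for reference code r
end

# Minimal read-only AbstractArray interface.
Base.size(v::CodeView) = size(v.refs)
Base.getindex(v::CodeView, i::Int) = v.codes[v.refs[i]]
Base.IndexStyle(::Type{<:CodeView}) = IndexLinear()

refs = UInt32[1, 2, 3, 2]
codes = [1, 5, 9]
vals = CodeView(refs, codes)

collect(vals)   # [1, 5, 9, 5]
refs[1] = 3     # a change to the parent's refs is visible through the view
vals[1]         # 9
```

Because the view is an AbstractVector, generic functions such as sum or findall work on it directly.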

@junyuan-chen
Owner

> If CategoricalArrays allowed getting a zero-copy iterator/view of the underlying values using e.g. values(ca), and the value behind a particular CategoricalValue using e.g. value(cv), then the result would be very close. Am I missing something?

You are right. If the original values are retained, then CategoricalArray would be sufficient for getting the job done.

@nalimilan
Contributor Author

OK, great. Then I'll try to add that feature in a CategoricalArrays branch and I'll let you know so we can see whether it would fit ReadStatTables's needs.

@junyuan-chen
Owner

Awesome! @nalimilan

@junyuan-chen
Owner

I am closing this issue due to the changes made here.

The transitivity of == has been fixed. LabeledArrays are also no longer stored inside a ReadStatTable. They are now just wrappers created when retrieving a data column via Tables.getcolumn.

@nalimilan
Contributor Author

Great!
