Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix unique() behaviour, add unique!() #358

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

alyst
Copy link
Contributor

@alyst alyst commented Jun 23, 2021

A draft of a solution for #357, this PR renames unique(::CategoricalArray) to uniquelevels(::CategoricalArray) as it better reflects what this method does
This PR restores the Base.unique() behavior -- makes unique(x::CategoricalArray) return the categorical array of the same type as x and containing the unique values of x in the order of their first appearance.
The internal implementation of unique() is cleaned up and optimized a bit (avoid intermediate array sorting, generalize to AbstractCategoricalArray).

The PR also adds Base.unique!(x::CategoricalArray).

The new unique() is not yet covered by unit tests. I'll do that if the overall approach would be considered ok.

If you need the old unique(x::CategoricalArray) behavior, use unwrap.(unique(x)).
As this is backward-incompatible change, the new minor version would be required.

Needs to be rebased once #359 lands.

Fixes #357, #129.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this then. Unfortunately this is breaking and we released 0.10 recently, so I guess the PR will have to remain unmerged for a while. But better finish it anyway.

src/array.jl Outdated Show resolved Hide resolved
src/array.jl Outdated Show resolved Hide resolved
src/array.jl Outdated Show resolved Hide resolved
src/array.jl Show resolved Hide resolved
src/array.jl Outdated Show resolved Hide resolved
src/array.jl Outdated Show resolved Hide resolved
@ablaom
Copy link

ablaom commented Jul 13, 2021

This does sound like a good solution.

@alyst alyst marked this pull request as ready for review July 27, 2021 13:24
@alyst alyst changed the title Add uniquelevels(), fix unique() behaviour fix unique() behaviour, add unique!() Jul 27, 2021
@alyst
Copy link
Contributor Author

alyst commented Aug 13, 2021

gentle bump

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just waiting for #359 (and then for the next breaking release).

src/CategoricalArrays.jl Outdated Show resolved Hide resolved
src/array.jl Outdated
@@ -854,7 +841,15 @@ end
Drop levels which do not appear in categorical array `A` (so that they will no longer be
returned by [`levels`](@ref DataAPI.levels)).
"""
droplevels!(A::CategoricalArray) = levels!(A, intersect(levels(A), uniquelevels(A)))
function droplevels!(A::CategoricalArray)
# FIXME temporary code until PR#359 is merged
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note so that we don't forget this before merging (x-ref #359).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just rebased the PR against the updated droplevels!() and squashed the commits, so it should be ready for merging (when it's time for a new minor release).

so it conforms to the semantics of the Base.unique()

This is a breaking change that requires a new minor release.
@Moelf
Copy link

Moelf commented Nov 22, 2024

bump

@nalimilan
Copy link
Member

This is breaking so it requires tagging a new "major" release. Given the disruption it will create in the ecosystem, I'd like to be sure it's worth it. But so far we don't have lots of breaking changes to make: breaking

@bkamins
Copy link
Member

bkamins commented Nov 22, 2024

The list of breaking things is in breaking . The major decision is with levels which is related to unique.

There are the following aspects:

  • it would be nice to have levels and unique consistent
  • on the other hand there is a risk that a lot of code uses levels exactly to "unwrap" the categorical values into levels. We could emit a warning that this behavior changed, but it is something we never did, and is fragile.
  • finally - it would require reviewing the packages depending on CategoricalArrays.jl to update their version bounds.
  • in DataFrames.jl all uses of levels and unique are safe to this change

In summary - I think we could go for it, but we should announce on Slack and Discourse that we plan to make this change and wait for a response.

The decision as for cut behavior I think is simpler and we could add this behavior while making a breaking release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unique() returns levels, not the CategoricalArray
5 participants