Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Tips for Categorical Data #11

Open
jonas-schulze opened this issue Feb 18, 2019 · 7 comments
Open

Performance Tips for Categorical Data #11

jonas-schulze opened this issue Feb 18, 2019 · 7 comments

Comments

@jonas-schulze
Copy link

In your performance tips at In [8] you mentioned:

Allowing missing as well as categorical slows down computations

I might have an idea on how to fix this (at least partially). I did not do any research on whether there has already been done some work on this. If we compute the count map on the references into the pool, we only operate on UInts:

julia> function StatsBase.countmap(x::CategoricalArray; alg = :auto)
           cm = countmap(x.refs, alg=alg)
           Dict(x.pool[ref] => count for (ref, count) in cm)
       end

julia> x = rand(1:10, 10^6);

julia> y = categorical(x);

julia> z = compress(y);

julia> @btime countmap(x);
  2.927 ms (8 allocations: 7.63 MiB)

julia> @btime countmap(y);
  2.141 ms (14 allocations: 3.82 MiB)

julia> @btime countmap(z);
  541.943 μs (11 allocations: 3.27 KiB)

On top of that, if we guard x.pool[ref] for ref == 0, which corresponds to a missing value, we also fix the performance loss for missing values.

My question is, where should I propose and contribute a change like the above, CategoricalArrays.jl or StatsBase.jl or somewhere else? Neither of the two has the other as a dependency, but they need to be added to allow for type dispatch (I think; I'm still a Julia novice), so maybe it should go to DataFrames.jl?

@bkamins
Copy link
Owner

bkamins commented Feb 18, 2019

countmap in this tutorial is just an example.

Regarding your proposal it should rather go to CategoricalArrays.jl as otherwise it would be type piracy (also because you use direct field access which should not be done outside CategoricalArrays.jl).

@nalimilan - do you think it is worth to add this specific method (maybe a review of other standard methods would be in place?).

@jonas-schulze
Copy link
Author

countmap in this tutorial is just an example.

I know, but from the top of my head, it should also be trivial to optimize map, sum and prod.

it should rather go to CategoricalArrays.jl as otherwise it would be type piracy

Right. But how would one cope with the fact, that CategoricalArrays.jl otherwise does not rely on or need StatsBase.jl in any way. Every user of CategoricalArrays.jl would inherit that "blind" dependency.

If you give me some pointers on where to put these patches, I'd be happy to help. 🙂

@nalimilan
Copy link

FreqTables also has an optimized method, but I couldn't add it to StatsBase's countmap to avoid having it depend on CategoricalArrays. Maybe we could use @require, though. The same issue applies to PooledArrays.

@bkamins
Copy link
Owner

bkamins commented Feb 18, 2019

Having thought about it my conclusion is that it should be acceptable even to add StatsBase.jl as a dependency of CategoricalArrays.jl. The reason is that CategoricalArray is a statistics-oriented type.

This opposed to PooledArrays.jl which are general and should not depend on StatsBase.jl I think.

But in both cases we could consider using @require. My only question with this would be wouldn't then the behavior depend on code loading sequence?

@nalimilan
Copy link

But in both cases we could consider using @require. My only question with this would be wouldn't then the behavior depend on code loading sequence?

Not AFAIK, that's the strength of Requires.jl.

One thing to keep in mind is that StatsBase is supposed to be merged into Statistics at some point, and then it won't be possible to use @require I guess. But at that point it won't be a problem for CategoricalArrays and PooledArrays to depend on Statistics.

@xiaodaigh
Copy link
Contributor

Does the invention of DataAPI.jl change anything here.

@nalimilan
Copy link

Yes, it should be possible to add an optimized code path using refarray.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants