-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance Tips for Categorical Data #11
Comments
Regarding your proposal it should rather go to CategoricalArrays.jl as otherwise it would be type piracy (also because you use direct field access which should not be done outside CategoricalArrays.jl). @nalimilan - do you think it is worth to add this specific method (maybe a review of other standard methods would be in place?). |
I know, but from the top of my head, it should also be trivial to optimize
Right. But how would one cope with the fact, that If you give me some pointers on where to put these patches, I'd be happy to help. 🙂 |
FreqTables also has an optimized method, but I couldn't add it to StatsBase's |
Having thought about it my conclusion is that it should be acceptable even to add StatsBase.jl as a dependency of CategoricalArrays.jl. The reason is that This opposed to PooledArrays.jl which are general and should not depend on StatsBase.jl I think. But in both cases we could consider using |
Not AFAIK, that's the strength of Requires.jl. One thing to keep in mind is that StatsBase is supposed to be merged into Statistics at some point, and then it won't be possible to use |
Does the invention of DataAPI.jl change anything here. |
Yes, it should be possible to add an optimized code path using |
In your performance tips at
In [8]
you mentioned:I might have an idea on how to fix this (at least partially). I did not do any research on whether there has already been done some work on this. If we compute the count map on the references into the pool, we only operate on
UInt
s:On top of that, if we guard
x.pool[ref]
forref == 0
, which corresponds to amissing
value, we also fix the performance loss for missing values.My question is, where should I propose and contribute a change like the above,
CategoricalArrays.jl
orStatsBase.jl
or somewhere else? Neither of the two has the other as a dependency, but they need to be added to allow for type dispatch (I think; I'm still a Julia novice), so maybe it should go toDataFrames.jl
?The text was updated successfully, but these errors were encountered: