table creation extreme slowdown #327
Here is a profile/flamegraph produced by StatProfilerHTML.jl of just the IndexedTable creation using 4 vectors.
It may be that what's changed is that now, by default, StructArrays collects categorical values into categorical arrays, and it looks like categorical arrays somehow incur a performance penalty upon collection.

I definitely don't understand CategoricalArrays well enough, but performance seems fine at least on the latest version. Can you produce some sample dataset (generated at random with some snippet) that reproduces the slowdown when iterating rows?

As a stop-gap, you can always construct the table from plain (non-categorical) vectors.
Thanks. Even without the
Here is a MWE with some benchmarking; I should mention the JuliaDB version I'm on.

```julia
using JuliaDB, BenchmarkTools, CategoricalArrays

# Plain-vector version: 450,000 rows of (name, locus, genotype).
function tabletest()
    names = fill.(["red", "blue", "green"], 150000) |> Base.Iterators.flatten |> collect
    loci = "loc" .* string.(collect(1:150000))
    loci = fill.(loci, 3) |> Base.Iterators.flatten |> collect
    genotypes = fill((1, 2), 450000)
    return table((names = names, loci = loci, genotypes = genotypes), pkey = :names)
end

# Same data, but with the string columns converted to compressed
# categorical arrays first.
function tabletest_cat()
    names = fill.(["red", "blue", "green"], 150000) |> Base.Iterators.flatten |> collect
    loci = "loc" .* string.(collect(1:150000))
    loci = fill.(loci, 3) |> Base.Iterators.flatten |> collect
    genotypes = fill((1, 2), 450000)
    return table(
        (names = categorical(names, compress = true),
         loci = categorical(loci, compress = true),
         genotypes = genotypes),
        pkey = :names,
    )
end
```
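For completeness, a sketch of how the two constructors above might be compared with BenchmarkTools (timings will of course vary by machine and package versions; note that `@btime` runs its expression repeatedly, which can take a long while if the categorical version is as slow as reported):

```julia
using BenchmarkTools

# Compare the plain-vector and categorical constructors defined above.
# On an affected setup, tabletest_cat is the one expected to show the
# extreme slowdown described in this issue.
@btime tabletest();
@btime tabletest_cat();
```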
However, doing
So, this seems to be the simplest way to reproduce:

```julia
julia> using CategoricalArrays, StructArrays

julia> N = 500_000;

julia> values = categorical(string.(1:N));

julia> sa = StructArray((values = values,));

julia> sa[1:N] # takes forever
```

I think the issue is that normally CategoricalArrays has a fast non-scalar getindex, which StructArrays bypasses by indexing one element at a time. Maybe
Interesting. What changes could have been made in the last 2-3 weeks that would have caused this? This issue is only recent on my end.
No idea! Probably some change either in StructArrays or CategoricalArrays (I think CategoricalArrays released a new version recently).
Would there be merit to me posting this there as well?
It probably makes sense to file an issue on StructArrays to discuss non-scalar `getindex`.
This is due to the fact that CategoricalArrays now merges pools from source and destination arrays.

Another solution, which could work in the meantime and would be a good idea anyway, is to have StructArrays call a higher-level function on the underlying arrays, like non-scalar getindex.

EDIT: Also note that CategoricalArrays has been designed on the principle that the number of levels is very small compared to the number of elements. If that's not the case, you'd better use another structure.
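The "higher-level function on the underlying arrays" idea can be sketched roughly as follows. This is not the actual StructArrays implementation; in particular, `StructArrays.components` here stands for whatever accessor returns the field arrays, and the function name is hypothetical:

```julia
# Hypothetical sketch: non-scalar getindex delegating to each column's
# own optimized getindex, instead of copying element by element.
function getindex_by_columns(sa::StructArray, I::AbstractVector{<:Integer})
    cols = StructArrays.components(sa)           # NamedTuple of field arrays (accessor name assumed)
    return StructArray(map(col -> col[I], cols)) # each column uses its own fast path
end
```

With this approach, a categorical column would be sliced by CategoricalArrays' own optimized method, merging pools once per slice rather than once per element.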
I've been thinking about that as well. What happens now is that the overloads of getindex work element by element on the field arrays.

The reason for my design decision is that, if some transformation is needed (for example from linear to Cartesian index or vice versa), it can be performed once up front. If I redirected getindex to the field arrays, that conversion might need to be done once per field array.

A second concern is that it wouldn't be completely obvious (to me at least) from the array signature how to tell whether it is a "scalar getindex" or a "multi-index getindex", because in the second case I would want to wrap the result into a StructArray.
What I did in CategoricalArrays is call a higher-level function on the underlying arrays. Another solution would be to override non-scalar getindex directly.

Regarding the computation of linear/Cartesian indices, I think it would be worth checking whether the compiler is able to generate efficient code for the scalar case: the generated code is quite simple (a few additions), so it may well be able to eliminate redundant conversions. Anyway, for the non-scalar case, I'm not sure it's a big deal to make the conversion multiple times, since the cost should be negligible compared to allocating the new slice. What matters is that the right kind of index is used for each underlying array, and for this, computing indices per array can be a big win. If compiler optimizations for scalars aren't good enough, what you could do is convert the index once and call getindex on each field array with the converted index.
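The "convert the index once, reuse it for every field array" idea from the scalar discussion above could look something like this (again only a sketch; the `components` accessor name and the function name are assumptions):

```julia
# Hypothetical scalar getindex that performs the linear-to-Cartesian
# conversion a single time and reuses it for every field array.
function scalar_getindex(sa::StructArray, i::Integer)
    c = CartesianIndices(sa)[i]  # one conversion, shared by all columns
    return map(col -> col[c], StructArrays.components(sa))
end
```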
I've run into a strange (and tremendous) performance hit with JuliaDB/IndexedTables this past week. Previously, with a custom reading function using JuliaDB, CSV, and CategoricalArrays, it took ~1.6sec to read in a Genepop file. A few days ago I started working on the package again after updating the deps and that same file reader now clocks in at 120 seconds. I've been trying to figure out where this ~100x performance drop came from, and while my diagnosing skills are limited, I've come to a few conclusions.
The original benchmark:
Here is the table in question, and the function benchmarked to produce it as a DataFrame:
When calling `table` from JuliaDB/IndexedTables to convert a long-format DataFrame into an IndexedTable, it's a pretty instantaneous process, even if the DataFrame is 4 columns × 500,000 rows. The conversion goes DataFrame columns => `table`, i.e. `table(name = df.name, population = df.population, ...)`.
However, a few weeks ago all of this took ~1.6sec and didn't require the intervention of DataFrames. Using DataFrames to get things functional before creating an IndexedTable isn't a realistic long-term solution. Hopefully this information can help someone with more understanding of JuliaDB identify what's going on. I can provide whatever code/profiles are necessary.
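The DataFrame round-trip used as the temporary workaround can be sketched like this (the column names and input vectors are illustrative, following the mapping described above):

```julia
using DataFrames, JuliaDB

# Build the long-format DataFrame first, then hand its plain columns
# to table(); this sidesteps the slow categorical collection path.
df = DataFrame(name = names, population = populations, locus = loci)
t = table((name = df.name, population = df.population, locus = df.locus);
          pkey = :name)
```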