Extreme slowdown in IndexedTable creation with CategoricalArrays #128

Open
pdimens opened this issue May 14, 2020 · 0 comments

pdimens commented May 14, 2020

Hello, after a discussion with @piever, he suggested I open an issue with StructArrays. A recent update to CategoricalArrays has caused an extreme slowdown of the IndexedTables/JuliaDB table() function. The full details of that conversation can be found in this open issue posted on JuliaDB.

The short version, with an MWE:

using JuliaDB, BenchmarkTools, CategoricalArrays

function tabletest()
    names = fill.(["red","blue","green"], 150000) |> Base.Iterators.flatten |> collect
    loci = "loc".*string.(collect(1:1500))
    loci = fill.(loci, 300) |> Base.Iterators.flatten |> collect
    genotypes = fill((1,2), 450000)

    return table((names = names, loci = loci, genotypes = genotypes), pkey = :names)
end

function tabletest_cat()
    names = fill.(["red","blue","green"], 150000) |> Base.Iterators.flatten |> collect
    loci = "loc".*string.(collect(1:1500))
    loci = fill.(loci, 300) |> Base.Iterators.flatten |> collect
    genotypes = fill((1,2), 450000)

    return table(
            (names = categorical(names, compress = true), 
            loci = categorical(loci, compress = true), 
            genotypes = genotypes),
            pkey = :names
        )
end

julia> @benchmark tabletest()
BenchmarkTools.Trial: 
  memory estimate:  79.81 MiB
  allocs estimate:  900109
  --------------
  minimum time:     246.040 ms (7.45% GC)
  median time:      270.271 ms (12.39% GC)
  mean time:        280.085 ms (15.80% GC)
  maximum time:     436.054 ms (44.68% GC)
  --------------
  samples:          18
  evals/sample:     1

However, @benchmark tabletest_cat() ran for several hours and still hadn't finished by the time I killed the job, which is probably not a good sign.
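For anyone who wants to reproduce this without waiting hours, a scaled-down variant might help show how the categorical case grows with table size. This is only a sketch: the function name tabletest_cat_scaled, the use of repeat to build the columns, and the sizes in the loop are illustrative choices that were not part of the original discussion, and the sizes may need to be lowered if even these run long.

```julia
using JuliaDB, CategoricalArrays

# Hypothetical scaled-down version of tabletest_cat(): builds 3n rows
# instead of 450000, with the same column types as the MWE above.
function tabletest_cat_scaled(n)
    names = repeat(["red", "blue", "green"], inner = n)   # 3n strings, grouped
    loci = repeat("loc" .* string.(1:3), outer = n)       # 3n strings, cycling
    genotypes = fill((1, 2), 3n)

    return table(
            (names = categorical(names, compress = true),
            loci = categorical(loci, compress = true),
            genotypes = genotypes),
            pkey = :names
        )
end

# Time a few small sizes; the first call also includes compilation time.
for n in (100, 1_000, 5_000)
    t = @elapsed tabletest_cat_scaled(n)
    println("rows = ", 3n, "  seconds = ", t)
end
```

If the time grows much faster than the row count, that would suggest the slowdown is in the sorting or grouping done for pkey rather than a fixed overhead.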
