-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always keep missing in pool #65
base: main
Are you sure you want to change the base?
Conversation
This is WIP to show the changes so that pools can be shared between `PooledArray`s allowing and disallowing missing values. It seems we can have them without much complications.
@@ -30,13 +30,13 @@ end | |||
|
|||
mutable struct PooledArray{T, R<:Integer, N, RA} <: AbstractArray{T, N} | |||
refs::RA | |||
pool::Vector{T} | |||
invpool::Dict{T,R} | |||
pool::Vector{Union{Missing,T}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of think we should just keep this T
; and let Union{Missing, T}
be provided as the T
. That way if you don't have missing
values, you're not penalized. It seems like to so simillar_missing
, we could provide a function that just re-wraps the refs
, pool
, invpool
, etc in a new PooledArray
struct w/ Union{T, Missing}
, right?
But maybe you're right that it would be too expensive to have to change pool::Vector{T}
and invpool::Dict{T, R}
to support missing
as well in the similar_missing
operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to avoid having to pay the following cost:
julia> y = Dict{Int, Int}(1:10^8 .=> 1:10^8);
julia> @time convert(Dict{Union{Int, Missing}, Int}, y);
3.684057 seconds (253.19 k allocations: 4.514 GiB, 6.35% gc time, 2.88% compilation time)
(it would also reduce the memory burden of making this operation)
With this PR it would be avoided.
My original preference was different, i.e. to never allow Missing
in the refpool
and invrefpool
but have 0
to be a ref that is interpreted as missing
without making a lookup, but @nalimilan thought it would be overly complex. That is why I have made a draft here to discuss the design.
This would have a consequence similar to what we have in CategoricalArrays.jl:
julia> x = categorical(["a"])
1-element CategoricalArray{String,1,UInt32}:
"a"
julia> resize!(x, 2)
2-element CategoricalArray{String,1,UInt32}:
"a"
#undef
julia> convert(CategoricalArray{Union{String,Missing},1,UInt32}, x)
2-element CategoricalArray{Union{Missing, String},1,UInt32}:
"a"
missing
(so magically #undef
becomes missing
which is probably undesirable)
There is no rush to make a decision here. Let us think and decide what is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah neither solution is simple. Maybe it would be possible to make convert
faster for dicts when the target eltype is a supertype of the source eltype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT this has been an open issue in Julia Base for a long time. Who do you think is best to ask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea. I guess we (you :-p) would have to come up with at least a draft PR to get people's attention. But this may be quite easy to do. convert
calls this constructor when the origin type is the same as the destination:
https://github.com/JuliaLang/julia/blob/be6887fa9ffa3d08315c93f1551c41b44dd3d720/base/dict.jl#L92-L95
I wonder whether what's needed is just to call convert
on the keys and value when the types don't match. What do you think?
Assuming that works, do you think it would be an acceptable replacement for this PR or would it still not be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then we loose COW benefits. However, since we are not fully convinced what is best I would park this PR for now, so that @quinnj can finish CSV.jl update without this functionality (which is more pressing). We can go back to this PR and decide what to do later (especially that if we change the internals probably 2.0 release would be needed).
And maybe in the mean time what @nalimilan proposes to do in Julia Base would be implemented (I cannot do it as I do not know how to do it unfortunately as I know too little about internal design related to JuliaLang/julia#26681)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid the best person to implement no-copy convert
for Array{Union{T, Missing}}
is also @quinnj! :-D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quinnj - if you do not have time for this (which I assume is the case) - can you please point me to the part of the sources that would have to be changed? (I assume it is somewhere in the C codes)
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the relevant code is in https://github.com/JuliaLang/julia/blob/master/src/array.c. In the jl_array_typetagdata function, it returns the "selector bytes" for an isbits-union array. This is where we lengthen the requested array size to account for selectory bytes. We also have to ensure the selector bytes are always initialized to zero, since otherwise you could get selector bytes pointing to something not just garbage, but totally wrong.
So the operation we'd need is convert(::Type{Vector{Union{T, S}}}, x::Vector{T}) where {T, S}
, right? And we're ok if the two arrays potentially share data? So it seems we would add a check on the julia-side for if T
is isbitstype
and S
is isbitstype
, and if so, call a new array.c routine.
This is where it gets a little hairy though, because we have array_resize_buffer
, but that takes an existing array to resize it. We'd need to use some of the same logic in that routine and make a new jl_convert_to_isbitsunion
method that:
- asserts the input array eltype is
isbitstype
- asserts new desired
Union
eltype is alsoisbitstype
- checks that the existing array data isn't
a->flags.isshared
(means we can't resize it; like an mmap buffer) - then probably called the same gc method as this line to resize existing array data to new size (which is just old array size + length(old_array))
- then I think we could just call
jl_ptr_to_array_1d
with our realloc'd pointer? that will take care of making the new array and using the realloc'd data pointer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - I was afraid it would not be that simple. What we need is only convert(::Type{Vector{Union{T, Missing}}}, x::Vector{T}) where {T}
but T
may be any type (also non-bits, e.g. String
). So I guess it would require separate paths for T
being bits type and being non-bits type.
However, this in general shows the following problem. Assume that we have x
and y
sharing the same data. Where eltype of x
is T
and of y
is Union{Missing, T}
. What will happen to x
if we do y[1] = missing
? My intuition that this operation is super problematic - right? (i.e. would lead to an undefined state of x[1]
)
So the conclusion is that these arrays may not share data. Therefore we have an easier task of adding an internal constructor in PooledArrays.jl for Dict
type that instead of:
function Dict{K,V}(kv) where V where K
h = Dict{K,V}()
haslength(kv) && sizehint!(h, Int(length(kv))::Int)
for (k,v) in kv
h[k] = v
end
return h
end
would copy an internal structure of source Dict
and only do promotion of eltype of keys
field in the source dict.
So we should have a function like:
function widen_dict_missing(d::Dict{K,V}) where {K, V}
@assert !(Missing <: K) # this is handled otherwise in PooledArrays.jl
K′ = Union{K, Missing}
return Dict{K′,V}(copy(d.slots), Vector{K′}(d.keys), copy(d.vals),
d.ndel, d.count, d.age, d.idxfloor, d.maxprobe)
end
and the performance would be:
julia> d = Dict(1:10^7 .=> 1:10^7);
julia> @time Dict{Union{Int,Missing}, Int}(d);
0.282199 seconds (7 allocations: 288.001 MiB)
julia> @time Dict{Union{Int,Missing}, Int}(d);
0.301356 seconds (7 allocations: 288.001 MiB)
julia> @time Dict{Union{Int,Missing}, Int}(d);
0.313754 seconds (7 allocations: 288.001 MiB)
julia> @time widen_dict_missing(d);
0.080447 seconds (7 allocations: 288.000 MiB)
julia> @time widen_dict_missing(d);
0.078461 seconds (7 allocations: 288.000 MiB)
julia> @time widen_dict_missing(d);
0.085039 seconds (7 allocations: 288.000 MiB)
the benefit would be ~3x speedup.
src/PooledArrays.jl
Outdated
@@ -448,7 +468,8 @@ Base.convert(::Type{Array}, pa::PooledArray{T, R, N}) where {T, R, N} = convert( | |||
|
|||
Base.@propagate_inbounds function Base.getindex(A::PooledArray, I::Int) | |||
idx = DataAPI.refarray(A)[I] | |||
iszero(idx) && throw(UndefRefError()) | |||
# if eltype(A) does not allow Missing then also 1 is an error | |||
idx <= !(eltype(A) <: Missing) && throw(UndefRefError()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I think something like the following should be faster (as it avoids the core subtyping algorithm):
@assert Base.nonmissingtype(T) === T || !isone(idx)
I think I like doing an @assert
here instead since if idx == 1
, then it means something is corrupt, not #undef
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do this also. I wanted to squeeze it into one check (to reduce the cost of getindex
).
However, first we should decide if we go for the design I proposed here or the design similar to what CategoricalArrays.jl does (as the other approach does not have this issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other approach is having 0
ref be missing
, rigth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. The issue is the following:
julia> x = categorical(["a", "b"])
2-element CategoricalArray{String,1,UInt32}:
"a"
"b"
julia> y = categorical(["a", missing])
2-element CategoricalArray{Union{Missing, String},1,UInt32}:
"a"
missing
julia> xs = similar(x)
2-element CategoricalArray{String,1,UInt32}:
#undef
#undef
julia> ys = similar(y)
2-element CategoricalArray{Union{Missing, String},1,UInt32}:
missing
missing
so this would be breaking (i.e. it is impossible to have #undef
in CategoricalArray
if its eltype allows Missing
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then you have:
julia> DataAPI.refpool(ys)
1-element CategoricalArrays.CategoricalRefPool{Union{Missing, CategoricalValue{String, UInt32}}, CategoricalPool{String, UInt32, CategoricalValue{String, UInt32}}} with indices 0:0:
missing
which means one needs a custom type to handle non 1-based indexing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm, I don't think that's awful behavior, but indeed a bit inconsistent. I mean, not letting users get #undef
isn't the worst, IMO, but it does feel like cheating a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we'd have a similar problem with DataAPI.refpool
on PooledArray w/ no missing
values, right? Because pool[1]
would be missing
, even though the PooledArray doesn't allow missing
s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because
pool[1]
would be missing, even though the PooledArray doesn't allowmissing
s?
The option is to have a custom refpool
type (like CategoricalArrays.jl)
Hmmm, this approach may complicate construction a bit; for example: function _invert(d::Dict{K,V}) where {K,V}
d1 = Vector{K}(undef, length(d))
for (k, v) in d
d1[v] = k
end
return d1
end it doesn't seem like we can rely on |
You mean when one would use I always thought it was kind of reserved for internal use only |
Well, internal use and CSV.read 😛 . I think there may be other cases, like data ingestion, where this should be considered: i.e. the I can ensure |
Ok, I'm finishing up the CSV refactor and I think the approach in this PR would work well for how things look on the CSV.jl side. That is, it's nice to have |
But is this PR and release of PooledArrays.jl then required for your changes, or it is OK to do it later? (I am asking, because @nalimilan was not convinced which approach would be best) |
@quinnj - given you work with this branch I will finalize this PR so that it is not DRAFT. However, I still do not assume we would merge it, as we need to make sure this is a sound design (and I understand the reservations that @nalimilan has - in particular I will have to check if we do not have significant performance regressions anywhere). |
OK. There are two issues with the approach proposed by me in this PR:
|
Ok, and the alternative approach would be to have ref of |
The alternatives are (and I comment on the consequences):
|
If |
Yes - I have thought about it but it would require that in DataFrames.jl we:
That is why I would be hesitant to do it
Unfortunately this is my general conclusion also (and that we should just use a faster way to create a new refpool/invrefpool). @quinnj - would you be OK in CSV.jl with not having this change? |
Not related to this particular user case, but I'd love it to see |
This is quite a long discussion, apologies for not having read all but Bogumil pointed me here from a Slack thread where a user reported a performance hit in Here on a relatively small data set (~60k rows) with 3 pooled columns ( |
Yes, there are two issues as I explain in https://bkamins.github.io/julialang/2023/11/17/pooling.html:
|
This is WIP to show the changes so that pools can be shared between
PooledArray
s allowing and disallowing missing values. It seems we can have them without much complications.Please comment if we would be OK with this design. Then I will finalize if it is OK. In general this will have some small performance penalty (the usual cost of allowing
Missing
ineltype
).