Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect results for argmax with multithreaded parsing #1128

Closed
yurivish opened this issue Mar 1, 2024 · 1 comment
Closed

Incorrect results for argmax with multithreaded parsing #1128

yurivish opened this issue Mar 1, 2024 · 1 comment

Comments

@yurivish
Copy link

yurivish commented Mar 1, 2024

Hello. I was trying to find the index of the maximum value in a data frame that I loaded using CSV.jl.

Julia has a function called argmax which does this.

argmax(itr)

Return the index or key of the maximal element in a collection. If there are multiple maximal elements, then the first one will be returned.

Unfortunately, the results I was getting back were incorrect – I called argmax on an integer column and got back an answer that was a valid index, but not the index of the maximum value.

After some digging, I was able to reduce the issue to a small example, which you can find below.

julia> using SentinelArrays

julia> arrays = [
         [18, 70, 92, 15, 65],
         [25, 14, 95, 54, 57]
       ];

julia> cv = ChainedVector(arrays);

julia> argmax(cv)
95

julia> argmax(collect(cv))
8

Due to multithreading, the CSV column was loaded in as a chained vector, which comes from a package SentinelArrays, which incorrectly implements argmax. It returns the maximum value, rather than its index.

The implementation is here and is tested, though the test only tests a simple special case for which the values and indices are exactly equal.

This seems like a fairly serious bug to me because it allows silently incorrect results to be returned during data processing and analysis (this is what happened to me). If I collect the column before calling argmax, then correct value is returned. If I load the file in with CSV.read(..., ntasks=1) I also see the correct results.

I saw that SentinelArrays has a recent commit to "fix out-of-bounds bug in BufferedVector" and a few other indications of potential hidden issues with that package. In my experience, CSV and DataFrames have tended to have less bugs than other parts of the Julia ecosystem so I wanted to file this bug upstream in case removing the dependency is a reasonable idea in this case.

@mkitti
Copy link
Member

mkitti commented Jul 5, 2024

As of SentinelArrays.jl version 1.4.4, this specific issue should be fixed.

JuliaData/SentinelArrays.jl#99

Test engineering is on going.

julia> using SentinelArrays

julia> arrays = [
                [18, 70, 92, 15, 65],
                [25, 14, 95, 54, 57]
              ];

julia> cv = ChainedVector(arrays);

julia> argmax(cv)
8

julia> argmax(collect(cv))
8

Thank you for filing the issue.

@quinnj quinnj closed this as completed Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants