Hello. I was trying to find the index of the maximum value in a data frame that I loaded using CSV.jl.
Julia has a function called argmax which does this.
argmax(itr)
Return the index or key of the maximal element in a collection. If there are multiple maximal elements, then the first one will be returned.
Unfortunately, the results I was getting back were incorrect – I called argmax on an integer column and got back an answer that was a valid index, but not the index of the maximum value.
After some digging, I was able to reduce the issue to a small example, which you can find below.
Because the file was parsed with multiple threads, the CSV column was loaded as a chained vector (ChainedVector), a type from the SentinelArrays package. That package's argmax implementation is incorrect: it returns the maximum value rather than its index.
The implementation is here and is covered by a test, though the test only checks a simple special case in which the values and indices are exactly equal.
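To illustrate why a test whose values equal their indices cannot catch this, here is a minimal sketch using only Base Julia. It is not the SentinelArrays code: a plain vector of chunks stands in for a chained vector, and the names `chunked_argmax` and `buggy_argmax` are hypothetical, introduced only for this illustration.

```julia
# Hypothetical stand-in for a chained vector: a plain Vector of chunks.
# This is an illustration only, not the SentinelArrays implementation.

# Expected behaviour: return the global index of the first maximal element.
function chunked_argmax(chunks::Vector{<:AbstractVector})
    best_val = nothing
    best_idx = 0
    offset = 0                      # number of elements in the chunks already visited
    for chunk in chunks
        for (i, x) in enumerate(chunk)
            # Strict comparison keeps the first maximal element on ties.
            if best_val === nothing || isless(best_val, x)
                best_val = x
                best_idx = offset + i
            end
        end
        offset += length(chunk)
    end
    best_idx == 0 && throw(ArgumentError("collection must be non-empty"))
    return best_idx
end

# Behaviour described in this report: the maximum value is returned, not its index.
buggy_argmax(chunks) = maximum(maximum, chunks)

trivial = [[1, 2, 3], [4, 5]]       # values equal their indices
chunks  = [[10, 40, 20], [30, 5]]   # values differ from their indices

chunked_argmax(trivial) == buggy_argmax(trivial)  # true: the bug is invisible here
chunked_argmax(chunks)                            # 2
buggy_argmax(chunks)                              # 40
argmax(reduce(vcat, chunks))                      # 2, the reference answer from Base
```

With the `trivial` data both functions return 5, so a test built on it passes either way. Any data whose maximum value differs from its position separates the two.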
This seems like a fairly serious bug to me because it allows silently incorrect results to be returned during data processing and analysis (this is what happened to me). If I collect the column before calling argmax, then the correct value is returned. If I load the file with CSV.read(..., ntasks=1), I also see the correct results.
I saw that SentinelArrays has a recent commit to "fix out-of-bounds bug in BufferedVector" and a few other indications of potential hidden issues with that package. In my experience, CSV and DataFrames have tended to have fewer bugs than other parts of the Julia ecosystem, so I wanted to file this bug upstream in case removing the dependency is a reasonable option.
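The collect workaround, together with a cheap consistency check, might look roughly like the sketch below. The helper names `safe_argmax` and `argmax_is_consistent` are hypothetical, and `col` is a plain Vector standing in for a column loaded from a CSV file, so the check passes here by construction.

```julia
# Hypothetical helper: materialize the column into a plain Vector before
# calling argmax, so Base's implementation is the one that runs.
safe_argmax(col::AbstractVector) = argmax(collect(col))

# Hypothetical sanity check: the returned index must be valid and must
# point at an element equal to the maximum.
function argmax_is_consistent(col::AbstractVector)
    i = argmax(col)
    return i in eachindex(col) && col[i] == maximum(col)
end

col = [10, 40, 20, 30, 5]   # stand-in for a loaded column (assumption)

safe_argmax(col)            # 2
argmax_is_consistent(col)   # true
```

On a column type with the faulty method, `argmax_is_consistent` would be expected to return false whenever the maximum value is not also a valid index that points at the maximum.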