DataFrame doesn't decode boolean arrays correctly from Arrow #7115
Comments
@ericstj @luisquintanilla @michaelgsharp I investigated the defect, and it appears to be related to the same issue that was discussed in #7094. I see 2 possible solutions:
I can take this task on. Which approach is preferred?
It sounds like option 1 is preferable. Is the only side effect a breaking change that makes PrimitiveDataFrameColumn abstract? Do we know if anyone would actually be trying to create instances of it? cc @eerhardt
Looks like here is one. We haven't officially "1.0"d this library yet right? Since Or introducing a new We had to do something similar in Arrow where
Another side effect is performance. A new implementation that works with individual bits will be slower, which I think is acceptable given the 8x reduction in memory usage. The exception is where a Boolean column is used in the API itself, for example in filtering and cloning, as in DataFrame.
I think the next step could be rethinking this API, as it is already quite slow (because BooleanDataFrameColumn uses bit operations when accessing the validity bitmap; see #6164) and memory consuming. The current implementation is also not straightforward and handles null values incorrectly (#6820). For example, a new API could be implemented as an extension over IDataView:
So, instead of
var boolFilter = df["timestamp"].ElementwiseGreaterThanOrEqual(unixStartTime);
var hourlydata = df.Filter(boolFilter);
var boolFilter2 = hourlydata["timestamp"].ElementwiseLessThan(unixEndTime);
hourlydata = hourlydata.Filter(boolFilter2);
simpler code can be used:
var hourlydata = df.Filter("timestamp", x => x >= unixStartTime && x < unixEndTime).ToDataFrame();
Isn't the way we handle nulls basically doing the 1-bit-per-Boolean thing already? Would we be able to use the new BooleanDataFrameColumn as our implementation for nulls?
Jake, you are right: currently we store null information in the validity buffer using 1 bit per value. That is why using BooleanDataFrameColumn takes more time than just using a list of boolean values (we also have to deal with this validity buffer), and, to my mind, it should be avoided in the filtering API. Agreed that extracting some code from
System Information (please complete the following information):
Describe the bug
Creating a DataFrame from an Arrow record batch where a column is a boolean array produces incorrect results (and occasionally even throws exceptions).
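For context, Arrow stores boolean arrays bit-packed, with value i held in byte i / 8 at bit position i % 8 (least-significant bit first). The thread suggests the decoding problem comes from a mismatch between that layout and a column that keeps one value per byte. The following is only a minimal sketch of the bit arithmetic involved, not the library's actual fix; `ArrowBoolDecoding`, `GetBit`, and `Unpack` are hypothetical names introduced for illustration.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Microsoft.Data.Analysis.
public static class ArrowBoolDecoding
{
    // Reads value `index` from an Arrow-style bit-packed buffer.
    // Value i lives in byte i / 8, at bit position i % 8 (LSB first).
    public static bool GetBit(ReadOnlySpan<byte> packed, int index)
        => (packed[index >> 3] & (1 << (index & 7))) != 0;

    // Expands `length` bit-packed values into one bool per element.
    public static bool[] Unpack(ReadOnlySpan<byte> packed, int length)
    {
        if (length < 0 || packed.Length < (length + 7) / 8)
            throw new ArgumentException("Buffer is too short for the requested length.");

        var result = new bool[length];
        for (int i = 0; i < length; i++)
            result[i] = GetBit(packed, i);
        return result;
    }
}
```

With this sketch, `ArrowBoolDecoding.Unpack(new byte[] { 0b0000_0101 }, 3)` yields `true, false, true`. Reading the same buffer as one byte per value would see a single element instead of three, which is the kind of mismatch that can surface as wrong values or out-of-range access.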
To Reproduce
Run:
Expected behavior
Above test passes