-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Avoid scatter operation in ExpressionOrExpression case evaluation method
#18444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@alamb could you do your benchmark thing for this one as well? Looks like about -30% in my own local testing. |
|
🤖 |
821e50d to
2b34d81
Compare
|
Force push of 2b34d81 was just a squash to make this one easier to cherry-pick locally. No code changes. |
|
🤖: Benchmark completed Details
|
|
Benchmark results confirm 20-27% on the relevant benchmarks. That's about the percentage of time I was seeing in the profiler assigned to the scatter operation. |
|
For anyone else following along, here are the relevant benchmarks: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @pepijnve -- I think the code makes sense to me and the benchmark results check out ✅
The only thing I am not sure about is the new merge / zip kernel -- I think we may be able to reuse zip upstream in arrow-rs after the next arrow release
| filter.filter(array) | ||
| } | ||
|
|
||
| fn merge( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like an implementation of zip to me (rather than merge). It seems like it would be better to use consistent terminology if this is indeed the case
This looks very similar to the fancy new code that @rluvaton added to arrow for zip with scalars (will be released in arrow 57.1.0):
If you agree it is the same, perhaps we can either avoid adding this method to DataFusion or else we can add a comment that says we can revert to using just zip once apache/arrow-rs#8653 is available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's almost the same as zip, but different enough that it's necessary. Without this implementation you can't avoid the scatter step.
I've added a test case to show the difference. The short version is that merge([true, false, true], [A, C], [B]) will get you [A, B, C] while zip would return an error stating all arrays should have the same length.
I agree that these two merge kernels would be better off in arrow-rs which is why I made PR apache/arrow-rs#8753.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rluvaton's work on zip only covers the case of two scalar inputs BTW. That's why I chose to delegate to plain zip in that case. array/array, scalar/array and array/scalar still needs the specific logic here.
The subtle difference between this and zip is in
let falsy_length = start - filled;
let falsy_end = falsy_offset + falsy_length;
mutable.extend(1, falsy_offset, falsy_end);
falsy_offset = falsy_end;
vs
mutable.extend(1, filled, start);
where zip is using the slice indices from the mask directly, merge only uses the length of the slices and tracks the amount taken from truthy and falsy separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb I realised this morning I had chosen a rather poor example in the arrow-rs PR. I've updated it to illustrate the truthy/falsy length difference.
| } | ||
| }; | ||
|
|
||
| let optimize_filter = batch.num_columns() > 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might also be worth checking if there are any nested types (e.g. structarrays) and optimize the filter in that case too -- this is done elsewhere (maybe in the filter kernel itself 🤔 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The logic that handles that isn't pub in arrow-rs unfortunately. I can duplicate it here if you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed for now. Would it be useful to make this a method of DataType? I can prepare an arrow-rs PR for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A DataType method didn't make much sense to me after all since the predicate is very much tied to the actual filter implementation logic. I went for apache/arrow-rs#8782 instead.
|
Thanks again @pepijnve |
Which issue does this PR close?
Rationale for this change
The
ExpressionOrExpressioncase evaluation method currently useszipto combine thethenandelseresults for a batch. This requires a scatter operation to ensure the partial results are correctly lined up for thezipalgorithm.By using a custom
mergealgorithm, this scatter step can be avoided.What changes are included in this PR?
Are these changes tested?
Covered by existing case tests
Are there any user-facing changes?
No