-
Notifications
You must be signed in to change notification settings - Fork 1.7k
#17972 Restore case expr/expr optimisation while ensuring lazy evaluation #17973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 7 commits
cf78119
b782628
eed9a34
c8524d3
f9f67c5
b849738
efbd205
480a747
c8186ec
aff3f18
99a32fc
e1a4f79
7e85be2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -23,7 +23,7 @@ use std::sync::Arc; | |||||
|
|
||||||
| use crate::utils::scatter; | ||||||
|
|
||||||
| use arrow::array::{ArrayRef, BooleanArray}; | ||||||
| use arrow::array::{make_builder, ArrayBuilder, ArrayRef, BooleanArray}; | ||||||
| use arrow::compute::filter_record_batch; | ||||||
| use arrow::datatypes::{DataType, Field, FieldRef, Schema}; | ||||||
| use arrow::record_batch::RecordBatch; | ||||||
|
|
@@ -97,14 +97,26 @@ pub trait PhysicalExpr: Any + Send + Sync + Display + Debug + DynEq + DynHash { | |||||
| batch: &RecordBatch, | ||||||
| selection: &BooleanArray, | ||||||
| ) -> Result<ColumnarValue> { | ||||||
| let tmp_batch = filter_record_batch(batch, selection)?; | ||||||
| let selection_count = selection.true_count(); | ||||||
|
|
||||||
| let tmp_result = self.evaluate(&tmp_batch)?; | ||||||
| if batch.num_rows() == 0 || selection_count == batch.num_rows() { | ||||||
|
||||||
| if batch.num_rows() == 0 || selection_count == batch.num_rows() { | |
| if selection_count == batch.num_rows() { |
(i'd assume batch.num_rows() == 0 implies that also selection_count == 0. If it is not the case, we have a more permissive if condition, but requires a code comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was no assertion for this in place and as far as I can tell (but I did not verify yet), you can call evaluate_selection today with a mismatch between batch.num_rows() and selection.len() and it will do something. I haven't tested this yet, so I'm not 100% sure what the outcome would be.
I'll write an extra unit test to cover this.
The idea here is to avoid any extra work whatsoever in the trivial cases. Not sure what a useful, non-trivial comment for that would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb @findepi I've added some unit tests to help define the behaviour of evaluate_selection. Some are currently still failing. Could you guys take a look at the tests to see if what they're asserting is correct?
The issues are all related to record batch / selection vector size mismatches. I don't think the current behaviour makes sense tbh (which is also present on main). I would expect to either get an error or as many output rows as there were input rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken the liberty of adding the size mismatch check and returning an execution error in case of mismatch. Within the DataFusion library case is the only client of this API and for that this change should be fine. The behaviour in case of mismatch was pretty weird, so I doubt people would be making active use of this. You never know of course. I'll add this to the user visible changes list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a couple of tests guide by code coverage.expr_or_expr is definitely well covered.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -155,10 +155,7 @@ impl CaseExpr { | |
| && else_expr.as_ref().unwrap().as_any().is::<Literal>() | ||
| { | ||
| EvalMethod::ScalarOrScalar | ||
| } else if when_then_expr.len() == 1 | ||
| && is_cheap_and_infallible(&(when_then_expr[0].1)) | ||
| && else_expr.as_ref().is_some_and(is_cheap_and_infallible) | ||
| { | ||
| } else if when_then_expr.len() == 1 && else_expr.is_some() { | ||
| EvalMethod::ExpressionOrExpression | ||
| } else { | ||
| EvalMethod::NoExpression | ||
|
|
@@ -425,6 +422,16 @@ impl CaseExpr { | |
| ) | ||
| })?; | ||
|
|
||
| // For the true and false/null selection vectors, bypass `evaluate_selection` and merging | ||
| // results. This avoids materializing the array for the other branch which we will discard | ||
| // entirely anyway. | ||
| let true_count = when_value.true_count(); | ||
| if true_count == batch.num_rows() { | ||
| return self.when_then_expr[0].1.evaluate(batch); | ||
| } else if true_count == 0 { | ||
| return self.else_expr.as_ref().unwrap().evaluate(batch); | ||
| } | ||
|
|
||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My reading of this code is that it will still evaluate the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct. There's no way to avoid that. This particular bit of code is both an optimisation and a correctness thing. From a performance point of view, we already know the selection vector is redundant, so there's really no point in calling For correctness, what's being avoid here is calling either There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
isn't this reintroducing the bug that was fixed in #15384, just in a big more complex wrapping? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
No, why do you think that's the case? If you write The original issue was that in the example above for a single row with a non-null There are two ways to fix this:
The earlier fix had the effect of 2. as well, just in a less explicit way. The fix here does the same but adds the necessary checks in the explicit expr/expr code path. The code that's being seen as an optimisation is intended to prevent calling There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I woke up early this morning to the realisation that the actual bug was a subtlety in the implementation of I believe this properly addresses the original evaluation problem. All SQL logic tests pass even when commenting out the optimisation for true and false selection vectors in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with the change in (They look nice but my only concern is additional code complexity which is harder to cover with SLT tests. Now that we have branching here, we should have a bunch of SLT cases that clearly exercise all-true, all-false, some-true situations.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, more SLTs! I will add some for the various cases you mentioned. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
@findepi Yes I would prefer to keep these early outs since they're pretty trivial and I think they're appropriate for the Calling evaluate_selection still has to produce a value –it can't return None– so you pay at least a non-zero cost for calling it unnecessarily. If it's obvious that it will not perform any useful work, it makes sense to avoid it IMO. Just for context, the queries I'm working on are quite case heavy. Any work we can save in the inner loop of the queries seems worthwhile. |
||
| // Treat 'NULL' as false value | ||
| let when_value = match when_value.null_count() { | ||
| 0 => Cow::Borrowed(when_value), | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of new code which repeats logic of
filter_record_batch.What if we just changed this line only?
Note how this does not inspect
selection/selection_countdirectly, leveraging the work done byfilter_record_batch.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing it, but I don't see the overlap with
filter_record_batch. AFAICT there are no checks to avoid creating a new record batch. What this code is doing is preparing an empty result value whilefilter_record_batchhas optimised code to an empty record batch if the filter is all-false.