-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8366815: C2: Delay Mod/Div by constant transformation #27886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
👋 Welcome back hgreule! A progress list of the required criteria for merging this PR into |
❗ This change is not yet ready to be integrated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this, @SirYwell. This seems like a tricky problem. To be honest, the fix seems a bit hacky. Have you explored any alternatives to this method of delaying the optimizations?
I kicked off some testing in the meantime that I will report back upon completion.
// Keep this node as-is for now; we want Value() and | ||
// other optimizations checking for this node type to work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Keep this node as-is for now; we want Value() and | |
// other optimizations checking for this node type to work | |
// Keep this node as-is initially; we want Value() and | |
// other optimizations checking for this node type to work. |
This is a small nit: I find it confusing to talk about "for now" in a method that is called at almost every stage of the compilation. Perhaps "initially" conveys the intention of first letting other optimizations do their magic first a bit better.
Thanks for running tests. I tried delaying until post loop opts, but that prevents some vectorization and isn't really less hacky I guess. I didn't find any other good existing approach. Calculating Value before Ideal would work, but I assume that it is rarely useful, with Div/Mod being an exception. I a dream world, I guess we would have e-graphs or something similar, which would allow calculating a more precise type from different alternatives. If you can think of a better approach, please let me know. |
Thinking about it a bit more, I think your fix is too superficial. If the discovery of the constant is slightly delayed, nothing is folded again. Consider the followig program for an example: class Test {
static boolean test(int x, boolean flag) {
Integer a;
if (flag) {
a = 171384;
} else {
a = 2902;
}
return x % a >= a;
}
public static void main(String[] args) {
for (int i = 0; i < 20000; i++) {
if (test(i, false)) {
throw new RuntimeException("wrong result");
}
}
}
} In my opinion, the benefits do not outweigh the drawbacks for this PR. A better solution would probably be to delay the expansion of the Mod and Div nodes to post-loop optimizations and extend Superword to expand Div/Mod nodes to shifts. However, this is quite a bit of complexity, which raises if this complexity is worth it (@eme64 probably has opinions and/or guidance on this). |
I'm not sure about the drawbacks here, but I think optimizing this on the superword level doesn't make things less complicated. If cases where we end up idealizing before calling Value are a more general problem, I'd say it's worth to also address it on exactly that level: make sure that Value is called before Ideal. I'm just hesitant because I'm not aware of any other situations where this matters. One middle ground here would be some kind of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a bit of an offline discussion in the office yesterday. Here a summary of my thoughts.
Ordering optimizations/phases in compilers is a difficult problem, it is not at all unique to this problem or even C2, all compilers have this problem.
Doing what @SirYwell does here, with delaying to IGVN is a relatively simple fix, and it at least addresses all cases where the divisor
and the comparison
are already parse time constants. I would consider that a win already. But the solution is a bit hacky.
The alternative that was suggested: delay it to post-loop-opts. But that is equally hacky really, it would have the same kind of delay logic where it is proposed now, just with a different "destination" (IGVN vs post-loop-opts). And it has the downside of preventing auto vectorization (SuperWord does not know how to deal with Div/Mod
, no hardware I know of implements vectorized integer division, only floating division is supported). But delaying to post-loop-opts allows cases like @mhaessig showed, where control flow collapses during IGVN. We could also make a similar example where control flow collapses only during loop-opts, in some cases only after SuperWord even (though that would be very rare).
It is really difficult to handle all cases, and I don't know if we really need to. But it is hard to know which cases we should focus on.
Here a super intense solution that would be the most powerful I can think of right now:
- Delay
transform_int_divide
to post-loop-opts, so we can wait for constants to appear during IGVN and loop-opts. - That would mean we have to accept regressions for the currently vectorizing cases, or we have to do some
transform_int_divide
inside SuperWord: add anVTransform::optimize
pass somehow. This would take a "medium" amount of engineering, and it would be more C++ code to maintain and test. - Yet another possibility: during loop-opts, try to do
transform_int_divide
not just with constant divisor, but also loop-invariant divisor. We would have to find a way to do the logic oftransform_int_divide
that finds the magic constants in C2 IR instead of C++ code (there seem to be some "failure" cases in the computation, not sure if we can resolve those). If the loop has sufficient iterations, it can be profitable to do the magic constant calculation before the loop, and do only mul/shift/add inside the loop. But this seems like an optional add-on. But it would be really powerful. And it would make theVTransform::optimize
(SuperWord) step unnecessary.
So my current thinking is:
We have to do some kind of delay anyway, either to IGVN or post-loop-opts, or elsewhere. For now, IGVN is a step in the right direction. The "delay mechanism" is a bit hacky, but we use it in multiple places already (grep for record_for_igvn
). It is not @SirYwell 's fault that our delay mechanism is so hacky.
So I would vote for going with delay to IGVN for now, to at least support the parse-time constants. Then file some RFE that tracks the other ideas, and see if someone wants to pick that up (figure out a loop-opts pass that works for loop-invariant divisors, and otherwise delay to post-loop-opts).
// Keep this node as-is for now; we want Value() and | ||
// other optimizations checking for this node type to work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we only need Value
done first on the Div
node, or also on uses of it?
It might be worth explaining it in a bit more detail here.
If it was just about calling Value
on the Div
first, we could probably check what Value
returns here. But I fear that is not enough, right? Because it is the Value
here that returns some range, and then some use sees that this range has specific characteristics, and can constant fold a comparison, for example. Did I get this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the main reason why I'm including Div here is mainly because of #26143; before that the DivI/LNode::Value() is actually less precise than Value on the nodes created by transform_int_divide
. With #26143, some results are more precise even for constant divisors. In such case, uses can benefit from seeing the (then) more precise range. (@ichttt found a case where the replacement fails to constant-fold, but that's just due to missing constant folding in MulHiLNode)
A secondary reason is other optimizations checking for Div inputs, though I didn't find any existing check that would actually benefit. There might be optimization opportunities that want to detect division, but that's just
Generally from what I've found the benefit is bigger for Mod nodes, because there calling Value on the replacements is significantly worse. And there we also encounter typical usages in combination with range checks.
Do you want me to expand both Div and Mod comments to cover more concrete benefits, depending on the operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it would make sense to have an explanation at both ends. Your nice example with the "rounding error" of 0..1 for Div
makes a lot of sense. Seeing a similar example for Mod
(where it could be worse, you say) would also be nice 😊
You can copy the comments for the I/L cases, or only put it at one of them, and link from the other. There is an issue with a PR that refactors mod/div so that we only have one implementation each, and they can clean this up.
Thanks for the summary @eme64. I totally agree that it's a bit hacky, but the current state is the least invasive. I'd also be interested in going further steps in the same direction, but I feel like the work increases significantly more than the benefits (at least as long as we don't generalize it to also optimize for loop invariant non-constants, but that's also a lot of work). @mhaessig do you have test results already? |
Manuel and I discussed in the office a little more :) Can you show us a concrete example, where I suspect that it is the value range "truncation" on the lower bits that are lost in Because if there is a solution that just improves the But if we in the end need to build a |
One very straightforward example would be something like static boolean divFold(int a) {
return a / 100_000 >= 21475;
} which isn't folded to From my analysis, this comes from the the rounding adjustments: We need to round towards zero, so we need to add 1 (=subtract -1) for negative values. We achieve that by an right shift to produce either a 0 or a -1 and then do the subtraction with that value. ![]() The subtraction isn't aware of the relation between the param being negative and the adjustment, and as you said, to recognize that relation, you'd more or less need to recognize that these operations form a division. Now, I think this is the only case, and it's only off by 1 (and if the sign of the dividend is known, it also isn't a problem), so I'm wondering if there are any common patterns where this would be relevant, otherwise it might really make sense to just delay Mod and accept this edge case for Div. |
Thanks very much for the explanation and the nice graph 😊 That helps a lot. It also means that even for cases like @mhaessig showed above: We could still constant fold the comparison.... as long as the comparison is "relaxed enough". It might be worth having a handfull of examples like that: some that still constant fold, and some that don't because the comparison is too "sharp", and the "rounding error" too large. What do you think? |
The test cases show examples of code where
Value()
previously wasn't run because idealization took place before, resulting in less precise type analysis.Please let me know what you think.
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27886/head:pull/27886
$ git checkout pull/27886
Update a local copy of the PR:
$ git checkout pull/27886
$ git pull https://git.openjdk.org/jdk.git pull/27886/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 27886
View PR using the GUI difftool:
$ git pr show -t 27886
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27886.diff
Using Webrev
Link to Webrev Comment