-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges #22852
base: master
Are you sure you want to change the base?
Conversation
👋 Welcome back dlunden! A progress list of the required criteria for merging this PR into |
❗ This change is not yet ready to be integrated. |
/contributor add rcastanedalo |
@dlunde |
@dlunde |
@dlunde |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good analysis, Daniel! Given the presence of overlapping memory phis (memory phis that are placed in the same block and include aliasing memory slices), the general idea of this fix seems reasonable to me. As a more fundamental solution, it would be worth investigating (perhaps separately) the root cause of this overlap and exploring whether it is feasible to enforce disjointness (an invariant apparently assumed by the original PhaseCFG::insert_anti_dependences
algorithm), at least during code generation.
Does the comment above the definition of initial_mem
require any update as part of this change?
src/hotspot/share/opto/gcm.cpp
Outdated
@@ -769,6 +769,30 @@ Block* PhaseCFG::insert_anti_dependences(Block* LCA, Node* load, bool verify) { | |||
} | |||
} | |||
worklist_def_use_mem_states.push(nullptr, initial_mem); | |||
Block* initial_mem_block = get_block_for_node(initial_mem); | |||
if (load->in(0) && initial_mem_block != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the conditions needed for initial_mem_block == nullptr
? (I did a quick run with an additional assertion and could not find any). It would be great to narrow down this to understand better the completeness of the fix and convince ourselves we are not leaving interesting cases unaddressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was more of a sanity check, because I did notice memory nodes without a block elsewhere in the graph at this stage of compilation. But, you are right. It seems initial memory for loads here always have a block. I replaced the check with an assert and reran tests, and it looks fine.
src/hotspot/share/opto/gcm.cpp
Outdated
Block* b = early; | ||
// Stop searching when we run out of dominators (b == nullptr) or when we | ||
// step past the initial memory block (b == initial_mem_block->_idom). | ||
while (b != nullptr && b != initial_mem_block->_idom) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can b
be nullptr
here? The highest up we can go in the dominator tree is the start block, whose immediate dominator is the root block, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, you are right. b
is never nullptr
and the second condition on its own is sufficient. I've replaced the first condition with an assert.
// step past the initial memory block (b == initial_mem_block->_idom). | ||
while (b != nullptr && b != initial_mem_block->_idom) { | ||
if (b == initial_mem_block && !initial_mem->is_Phi()) { | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a brief code comment here explaining why the early break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, now added!
break; | ||
} | ||
for (uint i = 0; i < b->number_of_nodes(); ++i) { | ||
Node* n = b->get_node(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sanity (and efficiency), we might want to (in a separate RFE, and if at all possible) make GCM insert Phi nodes to blocks before any Mach node, and add an early break here. Meanwhile, could you add a comment here explaining that we have to traverse the entire block because Phi nodes might be interleaved with Mach nodes as LCM may not have run yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I've now added a comment. It seems that Phi nodes are usually predictably located close to the beginning of blocks, but there are exceptions. Sounds good to investigate separately, I'll create an RFE.
for (uint i = 0; i < b->number_of_nodes(); ++i) { | ||
Node* n = b->get_node(i); | ||
if (n->is_memory_phi() && C->can_alias(n->adr_type(), load_alias_idx)) { | ||
worklist_def_use_mem_states.push(nullptr, n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may push <nullptr, initial_mem>
again here, is that an issue? If not, maybe add a comment explaining why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not a problem because the push
method includes duplication handling for Phis. However, I think it is better for both clarity and efficiency to just check n != initial_mem
here explicitly (now added).
if (use_mem_state == initial_mem // root (exclusive) of tree we are searching | ||
|| op == Op_MergeMem // internal node of tree we are searching | ||
if (def_mem_state == nullptr // root of a tree we are searching | ||
|| op == Op_MergeMem // internal node of tree we are searching | ||
) { | ||
def_mem_state = use_mem_state; // It's not a possibly interfering store. | ||
if (use_mem_state == initial_mem) | ||
initial_mem = nullptr; // only process initial memory once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, of course. In the old version, we ensure that we only add the children of initial_mem
once by checking if use_mem_state == initial_mem
and then setting initial_mem = nullptr
after the first occurrence. In the new version, we make use of the fact that it is equivalent to instead simply check if def_mem_state == nullptr
. Entries on the worklist have def_mem_state == nullptr
iff they are initial memory states that were added before the main worklist loop. We never add any entries to the worklist with def_mem_state == nullptr
within the worklist loop itself, so we are guaranteed to process the children of entries with def_mem_state == nullptr
only once (and we no longer need to set initial_mem = nullptr
).
The simplification is especially important now that we can have multiple initial memory states. If we would do it the old way, we would instead need to check if use_mem_state
is a member of the set of initial memory states, and also keep track of which initial memory states we have already visited.
Thanks for the review @robcasloz!
Yes, I think it could be useful to investigate further. Based on my observations while working on this issue, the overlapping memory Phis likely result from loop peeling. However, the Phi overlap is not the sole cause of this issue, as the second example demonstrates. I suggest we write an RFE and investigate in a separate issue.
Yes, thanks. Added now. |
// | ||
// In some cases, there are other relevant initial memory states besides | ||
// initial_mem. In such cases, we are rather dealing with multiple trees and | ||
// their fringes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I look at these comments here (I reviewed a change by Roland a few months back, so my memory is coming back)...
I see that the load is supposed to be scheduled before any Memory state modifying nodes include Store and Phi
that is (transitively via any MergeMem) below the initial_mem
.
In you example1, why do we therefore not put an anti-dependency edge betweeen the 183 load
, and the 106 Phi
? Would that not be enough to ensure the load is scheduled before the other memory affecting nodes further below 106 Phi
?
Or is the issue that this traversal is somehow restricted to blocks - I don't remember that from last time...
I'll keep reading the changes now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments @eme64!
In you example1, why do we therefore not put an anti-dependency edge betweeen the 183 load, and the 106 Phi? Would that not be enough to ensure the load is scheduled before the other memory affecting nodes further below 106 Phi?
Or is the issue that this traversal is somehow restricted to blocks - I don't remember that from last time...
I'll keep reading the changes now.
Yes, Phis only result in LCA changes at the block level and we never add anti-dependence edges directly between the load and Phi nodes. In example 1, we do mark the last block in between the path from 107 Phi to 106 Phi (which is B27) for raising the LCA. However, when doing the joint LCA raising operation later on (raise_LCA_above_marks
), we start at the original LCA and stop when we reach the early block (B20). Therefore, we never even consider B27. My very first attempt at solving this issue was to try and identify some dominance relation between the early block and the blocks for and in between 107 Phi and 106 Phi and use this information to force the LCA to the early block. This kind of worked at the block level, but we still need to identify somehow that we need an anti-dependence edge to 64 membar_release. Otherwise, it can happen that the load is scheduled correctly in the early block, but incorrectly (after an overwriting store) within the block (easily verified with -XX:+StressLCM
).
And in example 2, we should schedule before the Phi as well:
Why don't we do that?
Same here as above, we do tag both B19 and B21 for raising the LCA, but never consider them in raise_LCA_above_marks
since they are above the early block B9.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we never add anti-dependence edges directly between the load and Phi nodes
Ah interesting. Do you know why we do not do that? Would that generate worse code? Because it seems to me that would add fewer edges, and would probably require a smaller traversal. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that anti-dependence edges are only relevant for local scheduling (within blocks). Because Phis merge control-flow paths (by definition at the start of blocks), I would say it makes little sense to add an anti-dependence edge to Phi nodes. Does it make sense semantically to schedule loads before Phi nodes within a block? I don't think so, but I may be wrong.
I think what you are getting at is, at the block level, whether or not it is possible to raise the LCA above the Phi itself, rather than before the relevant inputs. That would make the scheduling less conservative. Have a look at this comment in the source. There are previous attempts at making the LCA raising less conservative in this manner (see, e.g., JDK-8192992), but it turns out to be quite tricky to get right. It is definitely an issue separate from the present one!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, ok. I think the description here should be made more clear then, and explain what the strategy is, and attempt a kind of informal proof why this is an ok approach. What do you think?
I mean your comment says there are other cases, but it does not really reassure me that we have all cases covered 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another naming question:
other relevant initial memory states
What does the initial
mean to you? To me, it is just the memory state of the load. initial
only because we also look at other memory states.
If my definition is correct, then I wonder if talking about "other initial memory states" makes sense, or if they should have a different name for clarity. Maybe root_memory_state
or alike. That would make sense: We have multiple trees, starting at root_memory_state
each. initial_mem
is one of them. initial
only refers to things in the initial block, where the initial_mem
is located.
Maybe you have a different definition - but it would be good to have a clear one stated in the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, ok. I think the description here should be made more clear then, and explain what the strategy is, and attempt a kind of informal proof why this is an ok approach. What do you think?
Do you mean a general proof sketch of the approach in PhaseCFG::insert_anti_dependences
? This changeset does not really change the approach itself, but just extends the search to more initial memory states.
but it does not really reassure me that we have all cases covered
I am not fully reassured we cover all cases either, but we at least now cover more cases than before (and specifically the cases that appear for this issue). See also the earlier comment by Roberto on investigating whether we can somehow enfore in earlier phases the invariant that the original version of PhaseCFG::insert_anti_dependences
assumed: that we only need to search from one initial memory state. I would suggest doing this in a separate RFE.
What does the initial mean to you?
To me it means the initial memory states that we start our searches at. I'm fine with your renaming suggestion, calling them roots. I would perhaps suggest that we then also say input_mem
instead of initial_mem
to signify that it is the actual input to the load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting closer to understanding what you are doing 😅
I have some more questions and suggestions.
I think @rwestrel should also have a look at this, he has recently fixed a bug in this code.
// | ||
// In some cases, there are other relevant initial memory states besides | ||
// initial_mem. In such cases, we are rather dealing with multiple trees and | ||
// their fringes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another naming question:
other relevant initial memory states
What does the initial
mean to you? To me, it is just the memory state of the load. initial
only because we also look at other memory states.
If my definition is correct, then I wonder if talking about "other initial memory states" makes sense, or if they should have a different name for clarity. Maybe root_memory_state
or alike. That would make sense: We have multiple trees, starting at root_memory_state
each. initial_mem
is one of them. initial
only refers to things in the initial block, where the initial_mem
is located.
Maybe you have a different definition - but it would be good to have a clear one stated in the comments.
src/hotspot/share/opto/gcm.cpp
Outdated
if (b == initial_mem_block && !initial_mem->is_Phi()) { | ||
// If we are in the initial memory block, and initial_mem is not itself | ||
// a Phi, no Phis in the block can be initial memory states. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused when I read this. As said above, we need a clear definition of initial
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why no Phis in the block can be initial memory states.
? I'm probably missing something obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it that Phi's could exist for aliasing memory, but it would be "above" initial_mem
, and therefore irrelevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it that Phi's could exist for aliasing memory, but it would be "above" initial_mem, and therefore irrelevant?
Yes, exactly. I'll clarify it in the comment.
Co-authored-by: Emanuel Peter <[email protected]>
Issue Summary
When searching for load anti dependences in GCM, it is not always sufficient to just search starting at the direct initial memory input to the load. Specifically, there are cases when we must also search for anti dependences starting at relevant Phi memory nodes in between the load's early block and the initial memory input's block. Here, "in between" refers to blocks in the dominator tree in between the early and initial memory blocks.
Example 1
Consider the ideal graph below. The initial memory for 183 loadI is 107 Phi and there is an important anti dependency for node 64 membar_release. To discover this anti dependency, we must rather search from 119 Phi which contains overlapping memory slices with 107 Phi. Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20). If we only search from 107 Phi, we fail to add the anti dependency to 64 membar_release and do not force the load to schedule before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 after a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti dependency on 64 membar_release). The result is the failure we see in this issue (we load the wrong value).
Example 2
There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9.
Changeset
PhaseCFG::insert_anti_dependences
to also search for anti dependences from Phi nodes according to the above issue summary. We only apply the additional search when loads have an explicit control input, as the issue seems to only appear for such loads.TestGCMLoadPlacement.java
.For reference, I have also developed an alternative changeset that solves the issue in a different way (without using dominator information). However, using dominator information leads to a smaller and less invasive fix.
Testing
tier1
totier4
(and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.Progress
Issue
Contributors
<[email protected]>
<[email protected]>
<[email protected]>
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22852/head:pull/22852
$ git checkout pull/22852
Update a local copy of the PR:
$ git checkout pull/22852
$ git pull https://git.openjdk.org/jdk.git pull/22852/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 22852
View PR using the GUI difftool:
$ git pr show -t 22852
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22852.diff
Using Webrev
Link to Webrev Comment