Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges #22852

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

dlunde
Copy link
Member

@dlunde dlunde commented Dec 20, 2024

Issue Summary

When searching for load anti dependences in GCM, it is not always sufficient to just search starting at the direct initial memory input to the load. Specifically, there are cases when we must also search for anti dependences starting at relevant Phi memory nodes in between the load's early block and the initial memory input's block. Here, "in between" refers to blocks in the dominator tree in between the early and initial memory blocks.

Example 1

Consider the ideal graph below. The initial memory for 183 loadI is 107 Phi and there is an important anti dependency for node 64 membar_release. To discover this anti dependency, we must rather search from 119 Phi which contains overlapping memory slices with 107 Phi. Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20). If we only search from 107 Phi, we fail to add the anti dependency to 64 membar_release and do not force the load to schedule before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 after a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti dependency on 64 membar_release). The result is the failure we see in this issue (we load the wrong value).

failure-graph-1
failure-blocks-1

Example 2

There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9.

failure-graph-2
failure-blocks-2

Changeset

  • Update PhaseCFG::insert_anti_dependences to also search for anti dependences from Phi nodes according to the above issue summary. We only apply the additional search when loads have an explicit control input, as the issue seems to only appear for such loads.
  • Clean up an old leftover comment relating to loop Phis.
  • Add mulitple new regression tests in TestGCMLoadPlacement.java.

For reference, I have also developed an alternative changeset that solves the issue in a different way (without using dominator information). However, using dominator information leads to a smaller and less invasive fix.

Testing

  • GitHub Actions
  • tier1 to tier4 (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
  • Performance testing using DaCapo, SPECjbb, and SPECjvm on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. There are no significant improvements nor regressions.
  • C2 compilation time testing using DaCapo. There appears to be no significant regression.

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges (Bug - P3)

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22852/head:pull/22852
$ git checkout pull/22852

Update a local copy of the PR:
$ git checkout pull/22852
$ git pull https://git.openjdk.org/jdk.git pull/22852/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 22852

View PR using the GUI difftool:
$ git pr show -t 22852

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22852.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 20, 2024

👋 Welcome back dlunden! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Dec 20, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot changed the title 8333393 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges Dec 20, 2024
@dlunde
Copy link
Member Author

dlunde commented Dec 20, 2024

/contributor add rcastanedalo
/contributor add chagedorn
/contributor add thartmann

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 20, 2024
@openjdk
Copy link

openjdk bot commented Dec 20, 2024

@dlunde
Contributor Roberto Castañeda Lozano <[email protected]> successfully added.

@openjdk
Copy link

openjdk bot commented Dec 20, 2024

@dlunde
Contributor Christian Hagedorn <[email protected]> successfully added.

@openjdk
Copy link

openjdk bot commented Dec 20, 2024

@dlunde
Contributor Tobias Hartmann <[email protected]> successfully added.

@openjdk
Copy link

openjdk bot commented Dec 20, 2024

@dlunde The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

mlbridge bot commented Dec 20, 2024

Webrevs

Copy link
Contributor

@robcasloz robcasloz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good analysis, Daniel! Given the presence of overlapping memory phis (memory phis that are placed in the same block and include aliasing memory slices), the general idea of this fix seems reasonable to me. As a more fundamental solution, it would be worth investigating (perhaps separately) the root cause of this overlap and exploring whether it is feasible to enforce disjointness (an invariant apparently assumed by the original PhaseCFG::insert_anti_dependences algorithm), at least during code generation.

Does the comment above the definition of initial_mem require any update as part of this change?

src/hotspot/share/opto/gcm.cpp Outdated Show resolved Hide resolved
@@ -769,6 +769,30 @@ Block* PhaseCFG::insert_anti_dependences(Block* LCA, Node* load, bool verify) {
}
}
worklist_def_use_mem_states.push(nullptr, initial_mem);
Block* initial_mem_block = get_block_for_node(initial_mem);
if (load->in(0) && initial_mem_block != nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the conditions needed for initial_mem_block == nullptr? (I did a quick run with an additional assertion and could not find any). It would be great to narrow down this to understand better the completeness of the fix and convince ourselves we are not leaving interesting cases unaddressed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was more of a sanity check, because I did notice memory nodes without a block elsewhere in the graph at this stage of compilation. But, you are right. It seems initial memory for loads here always have a block. I replaced the check with an assert and reran tests, and it looks fine.

src/hotspot/share/opto/gcm.cpp Outdated Show resolved Hide resolved
Block* b = early;
// Stop searching when we run out of dominators (b == nullptr) or when we
// step past the initial memory block (b == initial_mem_block->_idom).
while (b != nullptr && b != initial_mem_block->_idom) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can b be nullptr here? The highest up we can go in the dominator tree is the start block, whose immediate dominator is the root block, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, you are right. b is never nullptr and the second condition on its own is sufficient. I've replaced the first condition with an assert.

// step past the initial memory block (b == initial_mem_block->_idom).
while (b != nullptr && b != initial_mem_block->_idom) {
if (b == initial_mem_block && !initial_mem->is_Phi()) {
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a brief code comment here explaining why the early break?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, now added!

break;
}
for (uint i = 0; i < b->number_of_nodes(); ++i) {
Node* n = b->get_node(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sanity (and efficiency), we might want to (in a separate RFE, and if at all possible) make GCM insert Phi nodes to blocks before any Mach node, and add an early break here. Meanwhile, could you add a comment here explaining that we have to traverse the entire block because Phi nodes might be interleaved with Mach nodes as LCM may not have run yet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I've now added a comment. It seems that Phi nodes are usually predictably located close to the beginning of blocks, but there are exceptions. Sounds good to investigate separately, I'll create an RFE.

for (uint i = 0; i < b->number_of_nodes(); ++i) {
Node* n = b->get_node(i);
if (n->is_memory_phi() && C->can_alias(n->adr_type(), load_alias_idx)) {
worklist_def_use_mem_states.push(nullptr, n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may push <nullptr, initial_mem> again here, is that an issue? If not, maybe add a comment explaining why.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a problem because the push method includes duplication handling for Phis. However, I think it is better for both clarity and efficiency to just check n != initial_mem here explicitly (now added).

Comment on lines -794 to -799
if (use_mem_state == initial_mem // root (exclusive) of tree we are searching
|| op == Op_MergeMem // internal node of tree we are searching
if (def_mem_state == nullptr // root of a tree we are searching
|| op == Op_MergeMem // internal node of tree we are searching
) {
def_mem_state = use_mem_state; // It's not a possibly interfering store.
if (use_mem_state == initial_mem)
initial_mem = nullptr; // only process initial memory once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain these changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, of course. In the old version, we ensure that we only add the children of initial_mem once by checking if use_mem_state == initial_mem and then setting initial_mem = nullptr after the first occurrence. In the new version, we make use of the fact that it is equivalent to instead simply check if def_mem_state == nullptr. Entries on the worklist have def_mem_state == nullptr iff they are initial memory states that were added before the main worklist loop. We never add any entries to the worklist with def_mem_state == nullptr within the worklist loop itself, so we are guaranteed to process the children of entries with def_mem_state == nullptr only once (and we no longer need to set initial_mem = nullptr).

The simplification is especially important now that we can have multiple initial memory states. If we would do it the old way, we would instead need to check if use_mem_state is a member of the set of initial memory states, and also keep track of which initial memory states we have already visited.

@dlunde
Copy link
Member Author

dlunde commented Jan 7, 2025

Thanks for the review @robcasloz!

Given the presence of overlapping memory phis (memory phis that are placed in the same block and include aliasing memory slices), the general idea of this fix seems reasonable to me. As a more fundamental solution, it would be worth investigating (perhaps separately) the root cause of this overlap and exploring whether it is feasible to enforce disjointness (an invariant apparently assumed by the original PhaseCFG::insert_anti_dependences algorithm), at least during code generation.

Yes, I think it could be useful to investigate further. Based on my observations while working on this issue, the overlapping memory Phis likely result from loop peeling. However, the Phi overlap is not the sole cause of this issue, as the second example demonstrates. I suggest we write an RFE and investigate in a separate issue.

Does the comment above the definition of initial_mem require any update as part of this change?

Yes, thanks. Added now.

//
// In some cases, there are other relevant initial memory states besides
// initial_mem. In such cases, we are rather dealing with multiple trees and
// their fringes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I look at these comments here (I reviewed a change by Roland a few months back, so my memory is coming back)...
I see that the load is supposed to be scheduled before any Memory state modifying nodes include Store and Phi that is (transitively via any MergeMem) below the initial_mem.

In you example1, why do we therefore not put an anti-dependency edge betweeen the 183 load, and the 106 Phi? Would that not be enough to ensure the load is scheduled before the other memory affecting nodes further below 106 Phi?

Or is the issue that this traversal is somehow restricted to blocks - I don't remember that from last time...
I'll keep reading the changes now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in example 2, we should schedule before the Phi as well:
image

Why don't we do that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments @eme64!

In you example1, why do we therefore not put an anti-dependency edge betweeen the 183 load, and the 106 Phi? Would that not be enough to ensure the load is scheduled before the other memory affecting nodes further below 106 Phi?

Or is the issue that this traversal is somehow restricted to blocks - I don't remember that from last time...
I'll keep reading the changes now.

Yes, Phis only result in LCA changes at the block level and we never add anti-dependence edges directly between the load and Phi nodes. In example 1, we do mark the last block in between the path from 107 Phi to 106 Phi (which is B27) for raising the LCA. However, when doing the joint LCA raising operation later on (raise_LCA_above_marks), we start at the original LCA and stop when we reach the early block (B20). Therefore, we never even consider B27. My very first attempt at solving this issue was to try and identify some dominance relation between the early block and the blocks for and in between 107 Phi and 106 Phi and use this information to force the LCA to the early block. This kind of worked at the block level, but we still need to identify somehow that we need an anti-dependence edge to 64 membar_release. Otherwise, it can happen that the load is scheduled correctly in the early block, but incorrectly (after an overwriting store) within the block (easily verified with -XX:+StressLCM).

And in example 2, we should schedule before the Phi as well:
Why don't we do that?

Same here as above, we do tag both B19 and B21 for raising the LCA, but never consider them in raise_LCA_above_marks since they are above the early block B9.

Copy link
Contributor

@eme64 eme64 Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we never add anti-dependence edges directly between the load and Phi nodes

Ah interesting. Do you know why we do not do that? Would that generate worse code? Because it seems to me that would add fewer edges, and would probably require a smaller traversal. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that anti-dependence edges are only relevant for local scheduling (within blocks). Because Phis merge control-flow paths (by definition at the start of blocks), I would say it makes little sense to add an anti-dependence edge to Phi nodes. Does it make sense semantically to schedule loads before Phi nodes within a block? I don't think so, but I may be wrong.

I think what you are getting at is, at the block level, whether or not it is possible to raise the LCA above the Phi itself, rather than before the relevant inputs. That would make the scheduling less conservative. Have a look at this comment in the source. There are previous attempts at making the LCA raising less conservative in this manner (see, e.g., JDK-8192992), but it turns out to be quite tricky to get right. It is definitely an issue separate from the present one!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok. I think the description here should be made more clear then, and explain what the strategy is, and attempt a kind of informal proof why this is an ok approach. What do you think?

I mean your comment says there are other cases, but it does not really reassure me that we have all cases covered 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another naming question:
other relevant initial memory states
What does the initial mean to you? To me, it is just the memory state of the load. initial only because we also look at other memory states.

If my definition is correct, then I wonder if talking about "other initial memory states" makes sense, or if they should have a different name for clarity. Maybe root_memory_state or alike. That would make sense: We have multiple trees, starting at root_memory_state each. initial_mem is one of them. initial only refers to things in the initial block, where the initial_mem is located.

Maybe you have a different definition - but it would be good to have a clear one stated in the comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok. I think the description here should be made more clear then, and explain what the strategy is, and attempt a kind of informal proof why this is an ok approach. What do you think?

Do you mean a general proof sketch of the approach in PhaseCFG::insert_anti_dependences? This changeset does not really change the approach itself, but just extends the search to more initial memory states.

but it does not really reassure me that we have all cases covered

I am not fully reassured we cover all cases either, but we at least now cover more cases than before (and specifically the cases that appear for this issue). See also the earlier comment by Roberto on investigating whether we can somehow enfore in earlier phases the invariant that the original version of PhaseCFG::insert_anti_dependences assumed: that we only need to search from one initial memory state. I would suggest doing this in a separate RFE.

What does the initial mean to you?

To me it means the initial memory states that we start our searches at. I'm fine with your renaming suggestion, calling them roots. I would perhaps suggest that we then also say input_mem instead of initial_mem to signify that it is the actual input to the load.

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting closer to understanding what you are doing 😅

I have some more questions and suggestions.

I think @rwestrel should also have a look at this, he has recently fixed a bug in this code.

src/hotspot/share/opto/gcm.cpp Outdated Show resolved Hide resolved
//
// In some cases, there are other relevant initial memory states besides
// initial_mem. In such cases, we are rather dealing with multiple trees and
// their fringes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another naming question:
other relevant initial memory states
What does the initial mean to you? To me, it is just the memory state of the load. initial only because we also look at other memory states.

If my definition is correct, then I wonder if talking about "other initial memory states" makes sense, or if they should have a different name for clarity. Maybe root_memory_state or alike. That would make sense: We have multiple trees, starting at root_memory_state each. initial_mem is one of them. initial only refers to things in the initial block, where the initial_mem is located.

Maybe you have a different definition - but it would be good to have a clear one stated in the comments.

Comment on lines 791 to 793
if (b == initial_mem_block && !initial_mem->is_Phi()) {
// If we are in the initial memory block, and initial_mem is not itself
// a Phi, no Phis in the block can be initial memory states.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused when I read this. As said above, we need a clear definition of initial.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why no Phis in the block can be initial memory states.? I'm probably missing something obvious.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it that Phi's could exist for aliasing memory, but it would be "above" initial_mem, and therefore irrelevant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it that Phi's could exist for aliasing memory, but it would be "above" initial_mem, and therefore irrelevant?

Yes, exactly. I'll clarify it in the comment.

src/hotspot/share/opto/gcm.cpp Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler [email protected] rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

3 participants