8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges #22852

dlunde · 2024-12-20T18:26:55Z

Issue Summary

When searching for load anti dependences in GCM, it is not always sufficient to just search starting at the direct initial memory input to the load. Specifically, there are cases when we must also search for anti dependences starting at relevant Phi memory nodes in between the load's early block and the initial memory input's block. Here, "in between" refers to blocks in the dominator tree in between the early and initial memory blocks.

Example 1

Consider the ideal graph below. The initial memory for 183 loadI is 107 Phi and there is an important anti dependency for node 64 membar_release. To discover this anti dependency, we must rather search from 119 Phi which contains overlapping memory slices with 107 Phi. Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20). If we only search from 107 Phi, we fail to add the anti dependency to 64 membar_release and do not force the load to schedule before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 after a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti dependency on 64 membar_release). The result is the failure we see in this issue (we load the wrong value).

Example 2

There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9.

Changeset

Update PhaseCFG::insert_anti_dependences to also search for anti dependences from Phi nodes according to the above issue summary. We only apply the additional search when loads have an explicit control input, as the issue seems to only appear for such loads.
Clean up an old leftover comment relating to loop Phis.
Add mulitple new regression tests in TestGCMLoadPlacement.java.

For reference, I have also developed an alternative changeset that solves the issue in a different way (without using dominator information). However, using dominator information leads to a smaller and less invasive fix.

Testing

GitHub Actions
tier1 to tier4 (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
Performance testing using DaCapo, SPECjbb, and SPECjvm on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. There are no significant improvements nor regressions.
C2 compilation time testing using DaCapo. There appears to be no significant regression.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges (Bug - P3)

Contributors

Roberto Castañeda Lozano <[email protected]>
Christian Hagedorn <[email protected]>
Tobias Hartmann <[email protected]>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22852/head:pull/22852
$ git checkout pull/22852

Update a local copy of the PR:
$ git checkout pull/22852
$ git pull https://git.openjdk.org/jdk.git pull/22852/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 22852

View PR using the GUI difftool:
$ git pr show -t 22852

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22852.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2024-12-20T18:27:51Z

👋 Welcome back dlunden! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2024-12-20T18:28:21Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

dlunde · 2024-12-20T18:29:16Z

/contributor add rcastanedalo
/contributor add chagedorn
/contributor add thartmann

openjdk · 2024-12-20T18:29:59Z

@dlunde
Contributor Roberto Castañeda Lozano <[email protected]> successfully added.

openjdk · 2024-12-20T18:30:37Z

@dlunde
Contributor Christian Hagedorn <[email protected]> successfully added.

openjdk · 2024-12-20T18:31:00Z

@dlunde
Contributor Tobias Hartmann <[email protected]> successfully added.

openjdk · 2024-12-20T18:31:31Z

@dlunde The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2024-12-20T18:34:47Z

Webrevs

robcasloz

Good analysis, Daniel! Given the presence of overlapping memory phis (memory phis that are placed in the same block and include aliasing memory slices), the general idea of this fix seems reasonable to me. As a more fundamental solution, it would be worth investigating (perhaps separately) the root cause of this overlap and exploring whether it is feasible to enforce disjointness (an invariant apparently assumed by the original PhaseCFG::insert_anti_dependences algorithm), at least during code generation.

Does the comment above the definition of initial_mem require any update as part of this change?

src/hotspot/share/opto/gcm.cpp

robcasloz · 2024-12-23T10:52:35Z

src/hotspot/share/opto/gcm.cpp

@@ -769,6 +769,30 @@ Block* PhaseCFG::insert_anti_dependences(Block* LCA, Node* load, bool verify) {
    }
  }
  worklist_def_use_mem_states.push(nullptr, initial_mem);
+  Block* initial_mem_block = get_block_for_node(initial_mem);
+  if (load->in(0) && initial_mem_block != nullptr) {


What would be the conditions needed for initial_mem_block == nullptr? (I did a quick run with an additional assertion and could not find any). It would be great to narrow down this to understand better the completeness of the fix and convince ourselves we are not leaving interesting cases unaddressed.

This was more of a sanity check, because I did notice memory nodes without a block elsewhere in the graph at this stage of compilation. But, you are right. It seems initial memory for loads here always have a block. I replaced the check with an assert and reran tests, and it looks fine.

src/hotspot/share/opto/gcm.cpp

robcasloz · 2024-12-23T10:58:56Z

src/hotspot/share/opto/gcm.cpp

+    Block* b = early;
+    // Stop searching when we run out of dominators (b == nullptr) or when we
+    // step past the initial memory block (b == initial_mem_block->_idom).
+    while (b != nullptr && b != initial_mem_block->_idom) {


Can b be nullptr here? The highest up we can go in the dominator tree is the start block, whose immediate dominator is the root block, no?

Thanks, you are right. b is never nullptr and the second condition on its own is sufficient. I've replaced the first condition with an assert.

robcasloz · 2024-12-23T10:59:39Z

src/hotspot/share/opto/gcm.cpp

+    // step past the initial memory block (b == initial_mem_block->_idom).
+    while (b != nullptr && b != initial_mem_block->_idom) {
+      if (b == initial_mem_block && !initial_mem->is_Phi()) {
+        break;


Could you add a brief code comment here explaining why the early break?

Sure, now added!

robcasloz · 2024-12-23T11:03:17Z

src/hotspot/share/opto/gcm.cpp

+        break;
+      }
+      for (uint i = 0; i < b->number_of_nodes(); ++i) {
+        Node* n = b->get_node(i);


For sanity (and efficiency), we might want to (in a separate RFE, and if at all possible) make GCM insert Phi nodes to blocks before any Mach node, and add an early break here. Meanwhile, could you add a comment here explaining that we have to traverse the entire block because Phi nodes might be interleaved with Mach nodes as LCM may not have run yet?

Sure, I've now added a comment. It seems that Phi nodes are usually predictably located close to the beginning of blocks, but there are exceptions. Sounds good to investigate separately, I'll create an RFE.

robcasloz · 2024-12-23T11:04:54Z

src/hotspot/share/opto/gcm.cpp

+      for (uint i = 0; i < b->number_of_nodes(); ++i) {
+        Node* n = b->get_node(i);
+        if (n->is_memory_phi() && C->can_alias(n->adr_type(), load_alias_idx)) {
+          worklist_def_use_mem_states.push(nullptr, n);


We may push <nullptr, initial_mem> again here, is that an issue? If not, maybe add a comment explaining why.

It is not a problem because the push method includes duplication handling for Phis. However, I think it is better for both clarity and efficiency to just check n != initial_mem here explicitly (now added).

robcasloz · 2024-12-23T11:06:05Z

src/hotspot/share/opto/gcm.cpp

-    if (use_mem_state == initial_mem    // root (exclusive) of tree we are searching
-        || op == Op_MergeMem    // internal node of tree we are searching
+    if (def_mem_state == nullptr // root of a tree we are searching
+        || op == Op_MergeMem     // internal node of tree we are searching
        ) {
      def_mem_state = use_mem_state;   // It's not a possibly interfering store.
-      if (use_mem_state == initial_mem)
-        initial_mem = nullptr;  // only process initial memory once


Could you explain these changes?

Yes, of course. In the old version, we ensure that we only add the children of initial_mem once by checking if use_mem_state == initial_mem and then setting initial_mem = nullptr after the first occurrence. In the new version, we make use of the fact that it is equivalent to instead simply check if def_mem_state == nullptr. Entries on the worklist have def_mem_state == nullptr iff they are initial memory states that were added before the main worklist loop. We never add any entries to the worklist with def_mem_state == nullptr within the worklist loop itself, so we are guaranteed to process the children of entries with def_mem_state == nullptr only once (and we no longer need to set initial_mem = nullptr).

The simplification is especially important now that we can have multiple initial memory states. If we would do it the old way, we would instead need to check if use_mem_state is a member of the set of initial memory states, and also keep track of which initial memory states we have already visited.

dlunde · 2025-01-07T18:00:21Z

Thanks for the review @robcasloz!

Given the presence of overlapping memory phis (memory phis that are placed in the same block and include aliasing memory slices), the general idea of this fix seems reasonable to me. As a more fundamental solution, it would be worth investigating (perhaps separately) the root cause of this overlap and exploring whether it is feasible to enforce disjointness (an invariant apparently assumed by the original PhaseCFG::insert_anti_dependences algorithm), at least during code generation.

Yes, I think it could be useful to investigate further. Based on my observations while working on this issue, the overlapping memory Phis likely result from loop peeling. However, the Phi overlap is not the sole cause of this issue, as the second example demonstrates. I suggest we write an RFE and investigate in a separate issue.

Does the comment above the definition of initial_mem require any update as part of this change?

Yes, thanks. Added now.

eme64 · 2025-01-08T14:00:24Z

src/hotspot/share/opto/gcm.cpp

+  //
+  // In some cases, there are other relevant initial memory states besides
+  // initial_mem. In such cases, we are rather dealing with multiple trees and
+  // their fringes.


If I look at these comments here (I reviewed a change by Roland a few months back, so my memory is coming back)...
I see that the load is supposed to be scheduled before any Memory state modifying nodes include Store and Phi that is (transitively via any MergeMem) below the initial_mem.

In you example1, why do we therefore not put an anti-dependency edge betweeen the 183 load, and the 106 Phi? Would that not be enough to ensure the load is scheduled before the other memory affecting nodes further below 106 Phi?

Or is the issue that this traversal is somehow restricted to blocks - I don't remember that from last time...
I'll keep reading the changes now.

And in example 2, we should schedule before the Phi as well:

Why don't we do that?

Thanks for the comments @eme64!

In you example1, why do we therefore not put an anti-dependency edge betweeen the 183 load, and the 106 Phi? Would that not be enough to ensure the load is scheduled before the other memory affecting nodes further below 106 Phi?

Or is the issue that this traversal is somehow restricted to blocks - I don't remember that from last time...
I'll keep reading the changes now.

Yes, Phis only result in LCA changes at the block level and we never add anti-dependence edges directly between the load and Phi nodes. In example 1, we do mark the last block in between the path from 107 Phi to 106 Phi (which is B27) for raising the LCA. However, when doing the joint LCA raising operation later on (raise_LCA_above_marks), we start at the original LCA and stop when we reach the early block (B20). Therefore, we never even consider B27. My very first attempt at solving this issue was to try and identify some dominance relation between the early block and the blocks for and in between 107 Phi and 106 Phi and use this information to force the LCA to the early block. This kind of worked at the block level, but we still need to identify somehow that we need an anti-dependence edge to 64 membar_release. Otherwise, it can happen that the load is scheduled correctly in the early block, but incorrectly (after an overwriting store) within the block (easily verified with -XX:+StressLCM).

And in example 2, we should schedule before the Phi as well:
Why don't we do that?

Same here as above, we do tag both B19 and B21 for raising the LCA, but never consider them in raise_LCA_above_marks since they are above the early block B9.

we never add anti-dependence edges directly between the load and Phi nodes

Ah interesting. Do you know why we do not do that? Would that generate worse code? Because it seems to me that would add fewer edges, and would probably require a smaller traversal. What do you think?

My understanding is that anti-dependence edges are only relevant for local scheduling (within blocks). Because Phis merge control-flow paths (by definition at the start of blocks), I would say it makes little sense to add an anti-dependence edge to Phi nodes. Does it make sense semantically to schedule loads before Phi nodes within a block? I don't think so, but I may be wrong.

I think what you are getting at is, at the block level, whether or not it is possible to raise the LCA above the Phi itself, rather than before the relevant inputs. That would make the scheduling less conservative. Have a look at this comment in the source. There are previous attempts at making the LCA raising less conservative in this manner (see, e.g., JDK-8192992), but it turns out to be quite tricky to get right. It is definitely an issue separate from the present one!

Hmm, ok. I think the description here should be made more clear then, and explain what the strategy is, and attempt a kind of informal proof why this is an ok approach. What do you think?

I mean your comment says there are other cases, but it does not really reassure me that we have all cases covered 😅

Another naming question:
other relevant initial memory states
What does the initial mean to you? To me, it is just the memory state of the load. initial only because we also look at other memory states.

If my definition is correct, then I wonder if talking about "other initial memory states" makes sense, or if they should have a different name for clarity. Maybe root_memory_state or alike. That would make sense: We have multiple trees, starting at root_memory_state each. initial_mem is one of them. initial only refers to things in the initial block, where the initial_mem is located.

Maybe you have a different definition - but it would be good to have a clear one stated in the comments.

Hmm, ok. I think the description here should be made more clear then, and explain what the strategy is, and attempt a kind of informal proof why this is an ok approach. What do you think?

Do you mean a general proof sketch of the approach in PhaseCFG::insert_anti_dependences? This changeset does not really change the approach itself, but just extends the search to more initial memory states.

but it does not really reassure me that we have all cases covered

I am not fully reassured we cover all cases either, but we at least now cover more cases than before (and specifically the cases that appear for this issue). See also the earlier comment by Roberto on investigating whether we can somehow enfore in earlier phases the invariant that the original version of PhaseCFG::insert_anti_dependences assumed: that we only need to search from one initial memory state. I would suggest doing this in a separate RFE.

What does the initial mean to you?

To me it means the initial memory states that we start our searches at. I'm fine with your renaming suggestion, calling them roots. I would perhaps suggest that we then also say input_mem instead of initial_mem to signify that it is the actual input to the load.

eme64

I'm getting closer to understanding what you are doing 😅

I have some more questions and suggestions.

I think @rwestrel should also have a look at this, he has recently fixed a bug in this code.

src/hotspot/share/opto/gcm.cpp

eme64 · 2025-01-09T12:40:10Z

src/hotspot/share/opto/gcm.cpp

+  //
+  // In some cases, there are other relevant initial memory states besides
+  // initial_mem. In such cases, we are rather dealing with multiple trees and
+  // their fringes.


Another naming question:
other relevant initial memory states
What does the initial mean to you? To me, it is just the memory state of the load. initial only because we also look at other memory states.

If my definition is correct, then I wonder if talking about "other initial memory states" makes sense, or if they should have a different name for clarity. Maybe root_memory_state or alike. That would make sense: We have multiple trees, starting at root_memory_state each. initial_mem is one of them. initial only refers to things in the initial block, where the initial_mem is located.

Maybe you have a different definition - but it would be good to have a clear one stated in the comments.

eme64 · 2025-01-09T12:43:31Z

src/hotspot/share/opto/gcm.cpp

+      if (b == initial_mem_block && !initial_mem->is_Phi()) {
+        // If we are in the initial memory block, and initial_mem is not itself
+        // a Phi, no Phis in the block can be initial memory states.


I'm confused when I read this. As said above, we need a clear definition of initial.

Can you explain why no Phis in the block can be initial memory states.? I'm probably missing something obvious.

Is it that Phi's could exist for aliasing memory, but it would be "above" initial_mem, and therefore irrelevant?

Is it that Phi's could exist for aliasing memory, but it would be "above" initial_mem, and therefore irrelevant?

Yes, exactly. I'll clarify it in the comment.

src/hotspot/share/opto/gcm.cpp

Co-authored-by: Emanuel Peter <[email protected]>

First review version

e154c9f

openjdk bot changed the title ~~8333393~~ 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges Dec 20, 2024

openjdk bot added the rfr Pull request is ready for review label Dec 20, 2024

openjdk bot added the hotspot-compiler [email protected] label Dec 20, 2024

robcasloz reviewed Dec 23, 2024

View reviewed changes

Updates after comments

9ec33d5

eme64 reviewed Jan 8, 2025

View reviewed changes

eme64 suggested changes Jan 9, 2025

View reviewed changes

dlunde and others added 3 commits January 9, 2025 17:51

Update src/hotspot/share/opto/gcm.cpp

fcd8bae

Co-authored-by: Emanuel Peter <[email protected]>

Update after comments

008abfd

Fix comma splice in comment

7390f51

8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges #22852

Are you sure you want to change the base?

8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges #22852

Conversation

dlunde commented Dec 20, 2024 • edited by openjdk bot Loading

Issue Summary

Example 1

Example 2

Changeset

Testing

Progress

Issue

Contributors

Reviewing

bridgekeeper bot commented Dec 20, 2024

openjdk bot commented Dec 20, 2024

dlunde commented Dec 20, 2024 • edited Loading

openjdk bot commented Dec 20, 2024

openjdk bot commented Dec 20, 2024

openjdk bot commented Dec 20, 2024

openjdk bot commented Dec 20, 2024

mlbridge bot commented Dec 20, 2024 • edited Loading

Webrevs

robcasloz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dlunde commented Jan 7, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eme64 Jan 9, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eme64 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dlunde commented Dec 20, 2024 •

edited by openjdk bot

Loading

dlunde commented Dec 20, 2024 •

edited

Loading

mlbridge bot commented Dec 20, 2024 •

edited

Loading

eme64 Jan 9, 2025 •

edited

Loading