[#2279] improvement(spark): Trigger the upstream rewrite when the read stage fails #2281

Merged: 3 commits, Dec 20, 2024

Conversation

yl09099
Contributor

@yl09099 yl09099 commented Dec 9, 2024

What changes were proposed in this pull request?

Currently, when a Reader fails to fetch shuffle data, the upstream stage is not triggered to rewrite that data, and a Shuffle Server failure does not trigger a stage retry. This PR triggers the upstream rewrite when the read stage fails.

Why are the changes needed?

Fix: #2279

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

@yl09099
Contributor Author

yl09099 commented Dec 9, 2024

Wait a minute. I'm not ready for this.


github-actions bot commented Dec 9, 2024

Test Results

 2 966 files  ±0   2 966 suites  ±0   6h 28m 22s ⏱️ + 2m 9s
 1 097 tests ±0   1 095 ✅ ±0   2 💤 ±0  0 ❌ ±0 
13 750 runs  ±0  13 720 ✅ ±0  30 💤 ±0  0 ❌ ±0 

Results for commit 93e9e5d. ± Comparison against base commit a1d3252.

♻️ This comment has been updated with latest results.

@jerqi
Contributor

jerqi commented Dec 13, 2024

> Wait a minute. I'm not ready for this.

If this is not ready to review, you can mark this as draft.

@jerqi jerqi marked this pull request as draft December 13, 2024 08:09
@yl09099
Contributor Author

yl09099 commented Dec 13, 2024

> > Wait a minute. I'm not ready for this.
>
> If this is not ready to review, you can mark this as draft.

ok

@yl09099 yl09099 marked this pull request as ready for review December 14, 2024 09:08
@@ -553,16 +576,36 @@ public void incPartitionFetchFailure(int stageAttempt, int partition) {
});
}

- public int getPartitionFetchFailureNum(int stageAttempt, int partition) {
+ public boolean getPartitionFetchFailureNum(
Contributor

This method has a weird return value.

Contributor Author

> This method has a weird return value.

The failure count is compared against the maximum number of failures set in the configuration. I moved that comparison into the method itself, so it only needs to return whether FetchFailed should be thrown.

Contributor

If we don't rename the method, its name no longer fits what it returns.
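
To make the exchange above concrete, here is a minimal sketch of what moving the comparison inside the method could look like. It is not the code from this PR: the class `PartitionFetchFailureTracker`, the method name `isFetchFailureLimitReached`, the `maxFailures` constructor argument, and the `>=` comparison are all assumptions made for illustration, and the stage-attempt handling of the real method is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and the threshold rule are assumptions, not the PR's code.
class PartitionFetchFailureTracker {
  // Assumed to be read from configuration by the caller.
  private final int maxFailures;
  private final Map<Integer, Integer> failuresByPartition = new HashMap<>();

  PartitionFetchFailureTracker(int maxFailures) {
    this.maxFailures = maxFailures;
  }

  // Records one fetch failure for the given partition.
  synchronized void incPartitionFetchFailure(int partition) {
    failuresByPartition.merge(partition, 1, Integer::sum);
  }

  // The comparison lives here, so callers receive a yes/no answer
  // instead of a raw count. The name says "is", matching the boolean result.
  synchronized boolean isFetchFailureLimitReached(int partition) {
    return failuresByPartition.getOrDefault(partition, 0) >= maxFailures;
  }
}
```

With a limit of 2, a partition that has failed once would report `false` and one that has failed twice would report `true`. A predicate-style name like this is one way to address the naming concern raised in the review.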

}
});
}

public void setClearedMapTrackerBlock(boolean isCleared) {
Contributor

Could you pick a better method name?

@@ -489,10 +509,13 @@ private static class RssShuffleStatus {
private final ReentrantReadWriteLock.WriteLock writeLock = lock.writeLock();
private final int[] partitions;
private int stageAttempt;
// Whether the Shuffle result has been cleared for the current number of attempts.
private boolean isClearedMapTrackerBlock;
Contributor

isClearedMapTrackerBlock -> hasClearedMapTrackerBlock

}
});
}

public void clearedMapTrackerBlock(boolean isCleared) {
Contributor

Maybe we don't need the parameter isCleared. It's redundant.

});
}

public boolean isClearedMapTrackerBlock() {
Contributor

isClearedMapTrackerBlock -> hasClearedMapTrackerBlock

Contributor Author

The above has been changed.
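
For readers following the naming comments in this thread, the sketch below shows one possible shape for the flag after the suggested changes: a `has...` getter and a setter without the redundant boolean parameter, both guarded by the read/write lock seen in the diff. The class name `ShuffleStatusSketch`, the method names `markMapTrackerBlockCleared` and `resetStageAttempt`, and the idea that a new stage attempt resets the flag are assumptions for illustration only, not the merged code.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; only the lock and field types mirror the diff fragment above.
class ShuffleStatusSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock.ReadLock readLock = lock.readLock();
  private final ReentrantReadWriteLock.WriteLock writeLock = lock.writeLock();
  private int stageAttempt;
  // Whether the shuffle result has been cleared for the current attempt.
  private boolean hasClearedMapTrackerBlock;

  // No boolean parameter: calling this method always means "cleared".
  void markMapTrackerBlockCleared() {
    writeLock.lock();
    try {
      hasClearedMapTrackerBlock = true;
    } finally {
      writeLock.unlock();
    }
  }

  boolean hasClearedMapTrackerBlock() {
    readLock.lock();
    try {
      return hasClearedMapTrackerBlock;
    } finally {
      readLock.unlock();
    }
  }

  // Assumption: moving to a new stage attempt resets the flag.
  void resetStageAttempt(int newAttempt) {
    writeLock.lock();
    try {
      stageAttempt = newAttempt;
      hasClearedMapTrackerBlock = false;
    } finally {
      writeLock.unlock();
    }
  }
}
```

Dropping the parameter leaves a single way to set the flag, and the reset path is the only place it returns to `false`.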

@codecov-commenter

Codecov Report

Attention: Patch coverage is 23.52941% with 26 lines in your changes missing coverage. Please review.

Project coverage is 52.30%. Comparing base (ac89c19) to head (c58aa9f).
Report is 8 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| ...fle/shuffle/manager/ShuffleManagerGrpcService.java | 23.52% | 24 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2281      +/-   ##
============================================
+ Coverage     51.78%   52.30%   +0.52%     
- Complexity     2966     3386     +420     
============================================
  Files           479      518      +39     
  Lines         22566    27900    +5334     
  Branches       2068     2628     +560     
============================================
+ Hits          11686    14594    +2908     
- Misses        10140    12345    +2205     
- Partials        740      961     +221     


Contributor

@jerqi jerqi left a comment

LGTM.

@jerqi
Contributor

jerqi commented Dec 18, 2024

cc @maobaolong

@jerqi jerqi merged commit e7e191b into apache:master Dec 20, 2024
43 checks passed
Development

Successfully merging this pull request may close these issues.

[Improvement] The upstream rewrite is triggered when the read stage fails.
3 participants