Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MINOR] Server heartbeat to coordinator while unregister shuffle #2132

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

maobaolong
Copy link
Member

What changes were proposed in this pull request?

Server send heartbeat to coordinator while receive unregister shuffle request.

Why are the changes needed?

Without this PR, server could not heartbeat the updated app info after unregister this app, so the coordinator and dashboard could display the outdate information.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested by a tiny spark job executed by spark-shell

  • shell command line
bin/spark-shell  --master  spark://localhost:7077  --deploy-mode client --conf spark.rss.coordinator.quorum=localhost:19999   --conf spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager  --conf spark.rss.storage.type=LOCALFILE --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.rss.test.mode.enable=true --conf spark.rss.client.type=GRPC_NETTY --conf spark.default.parallelism=16  -i test.scala
  • test.scala
val data = sc.parallelize(Seq(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5), ("A", 6), ("A", 7),("A", 7), ("A", 7), ("A", 7), ("A", 7), ("A", 7), ("A", 7), ("A", 7), ("A", 7), ("A", 7), ("A", 7), ("A", 7), ("A", 7)));
val result = data.reduceByKey(_ + _);
result.collect().foreach(println);
System.exit(0);
image

Copy link

github-actions bot commented Sep 20, 2024

Test Results

 2 950 files   -  6   2 950 suites   - 6   6h 14m 38s ⏱️ - 12m 37s
 1 092 tests ± 0   1 089 ✅  -  1   2 💤 ±0  1 ❌ +1 
13 673 runs   - 12  13 642 ✅  - 13  30 💤 ±0  1 ❌ +1 

For more details on these failures, see this check.

Results for commit 770177a. ± Comparison against base commit c1bfa04.

♻️ This comment has been updated with latest results.

@maobaolong maobaolong force-pushed the heartbeatWhileUnregister branch from 8d6a4b0 to 3e4ce0c Compare September 22, 2024 05:49
@zuston
Copy link
Member

zuston commented Sep 26, 2024

I hope the heartbeat could not be invoked by other thread like the unrelated unregister operations. If you want the latest info, you can decrease the heartbeat interval of server -> coordinator. WDYT? @maobaolong

@maobaolong
Copy link
Member Author

@zuston The affect of decrease interval can make it better but not a fundamental solution, the latest info could be lost also, and the frequency heartbeat could make Coordinator busy.

And could you elaborate a little more on what concern from your side if heartbeat invoked by another thread?

@zuston
Copy link
Member

zuston commented Sep 26, 2024

@zuston The affect of decrease interval can make it better but not a fundamental solution, the latest info could be lost also, and the frequency heartbeat could make Coordinator busy.

And could you elaborate a little more on what concern from your side if heartbeat invoked by another thread?

External client operation will control the frequency of internal heartbeat, this is unreasonable and danger.

@maobaolong
Copy link
Member Author

@zuston Hmm, well, is there another way to send the latest info?

Without the latest info, the app information is not completed. Do you think it could be better if I create a new RPC method to talk to coordinator for the latest info?

@maobaolong maobaolong force-pushed the heartbeatWhileUnregister branch from 3e4ce0c to 886bd76 Compare October 22, 2024 12:20
@maobaolong
Copy link
Member Author

@zuston Thanks for the discussion offline. I think I need to introduce the motivation of this PR.

The motivation of this PR aimed to resolve the scenario for our regression test, it is a simple test with tiny shuffle data as the description given, without this PR, the coordinator cannot collect any application information.

For the production scenario, it could be a little better, since lots of app have a huge mount of shuffle data, but the risky of lost last heartbeat still exist.

As a tradeoff, I add a configure option(default to false to keep the original behavior), it can be set to true if we need to get the exact information of application, and keep to false by default.

@maobaolong
Copy link
Member Author

ping @zuston

@@ -719,6 +719,11 @@ public class ShuffleServerConf extends RssBaseConf {
.booleanType()
.defaultValue(false)
.withDescription("Whether to enable app detail log");
public static final ConfigOption<Boolean> SERVER_TRIGGER_REPORT_WHILE_UNREGISTER_ENABLED =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making this option rename to rss.server.heartbeatReportOnUnregisterEnabled ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@maobaolong maobaolong force-pushed the heartbeatWhileUnregister branch from 0b12740 to 7c3289c Compare November 19, 2024 07:08
@maobaolong
Copy link
Member Author

@zuston Thanks for your suggestion, renamed by last commit . PTAL

@maobaolong maobaolong force-pushed the heartbeatWhileUnregister branch from 9e8042a to 770177a Compare November 19, 2024 07:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants