Conversation

@liviazhu (Contributor) commented Dec 9, 2025

What changes were proposed in this pull request?

Set the maintenance thread pool shutdown timeout to the minimum of the timeout from SQLConf and 60 seconds. Additionally, reduce test time by doing the following:

  • Replace Thread.sleep with eventually
  • Reduce shuffle partitions/iterations for extremely slow tests (e.g. "SPARK-51358: Snapshot uploads in $providerName are properly reported to the coordinator")

Why are the changes needed?

The test is currently very slow (>12 min).

Does this PR introduce any user-facing change?

No. The maintenance thread pool timeout is set by an internal conf, so the change does not affect users.

How was this patch tested?

Test-only change. Ran the test 20 times to ensure it is not flaky. New test time is ~5 min.

Was this patch authored or co-authored using generative AI tooling?

No

- // Wait a while for tasks to respond to being cancelled
- if (!threadPool.awaitTermination(60, TimeUnit.SECONDS)) {
+ // To avoid long test times, use minimum of timeout value or 60 seconds
+ if (!threadPool.awaitTermination(Math.min(60, shutdownTimeout),
Contributor commented:

If the timeout is set very low, let's say 1s, we would now wait only 2s in total before we can confirm the thread pool shutdown. Should we really repurpose the timeout here?

Contributor Author replied:

This is an internal config, so it should only be set lower than 300s in tests and shouldn't affect production queries. From running the test locally, it looks like the thread pool sometimes isn't shut down even after 60s, so I'm not sure waiting less time really makes a huge difference.
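For context, here is a minimal Java sketch of the capped-shutdown pattern under discussion. This is not the actual Spark code: `shutdownTimeoutSec` is a hypothetical stand-in for the SQLConf-driven `maintenanceShutdownTimeout`, and the class and method names are invented for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolShutdownSketch {
    // Hypothetical stand-in for the SQLConf-driven timeout (seconds).
    static long shutdownTimeoutSec = 300;

    // Shut the pool down, capping each wait at min(60s, configured timeout)
    // so a large configured value cannot stall a test run.
    static boolean stop(ExecutorService threadPool) {
        threadPool.shutdown();                       // stop accepting new tasks
        long waitSec = Math.min(60, shutdownTimeoutSec);
        try {
            if (!threadPool.awaitTermination(waitSec, TimeUnit.SECONDS)) {
                threadPool.shutdownNow();            // interrupt running tasks
                // Wait again (same cap) for tasks to respond to cancellation.
                return threadPool.awaitTermination(waitSec, TimeUnit.SECONDS);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            threadPool.shutdownNow();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> { /* short task */ });
        System.out.println(stop(pool));  // prints "true" when the pool drains in time
    }
}
```

Note the reviewer's concern: with `shutdownTimeoutSec = 1`, the two waits above total only ~2s before shutdown is reported as failed.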

// Verify only the partitions in badPartitions don't have a snapshot
verifySnapshotUploadEvents(coordRef, query, badPartitions)
verifySnapshotUploadEvents(coordRef, query2, badPartitions)
eventually(timeout(5.seconds)) {
Contributor commented:

How do we know this will be enough?

Contributor Author replied:

Ran the test 20 times and did not see any flakiness. Also, the previous Thread.sleep(500) waits only 500 milliseconds, which is far shorter than 5 seconds.
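The difference between the two approaches: Thread.sleep always pays the full delay, while ScalaTest's eventually polls and returns as soon as the condition holds, so a longer timeout is only an upper bound. A minimal Java approximation of that polling loop (the helper name and signature are invented, not ScalaTest's API):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class EventuallySketch {
    // Poll until the condition holds or the timeout elapses, instead of
    // sleeping a fixed duration. Returns whether the condition became true.
    static boolean eventually(BooleanSupplier cond, long timeoutMs, long intervalMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (cond.getAsBoolean()) return true;    // succeed as soon as it's true
            try {
                Thread.sleep(intervalMs);            // brief pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return cond.getAsBoolean();
            }
        }
        return cond.getAsBoolean();                  // final check at the deadline
    }

    public static void main(String[] args) {
        AtomicBoolean done = new AtomicBoolean(false);
        new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException e) { }
            done.set(true);                          // condition flips after ~100ms
        }).start();
        // 5s is only an upper bound; this returns shortly after the flag flips.
        System.out.println(eventually(done::get, 5000, 50));  // prints "true"
    }
}
```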

  withCoordinatorAndSQLConf(
    sc,
-   SQLConf.SHUFFLE_PARTITIONS.key -> "5",
+   SQLConf.SHUFFLE_PARTITIONS.key -> "3",
Contributor commented:

How much difference does this make? Should we do it in other tests also?

Contributor Author replied:

This was the test with the largest runtime, so I only decided to reduce the shuffle partitions here. It reduces the test time by around 16% (from 6s to 5s).

@dongjoon-hyun (Member) left a comment:

Please change the misleading PR title. Although the initial intention was about testing time, this PR introduces a major behavior change in src/main, @liviazhu.

@liviazhu (Contributor Author) replied:

> Please change the misleading PR title. Although the initial intention was about testing time, this PR introduces a major behavior change in src/main, @liviazhu.

@dongjoon-hyun The only way functionality changes is by setting the internal conf ("spark.sql.streaming.stateStore.maintenanceShutdownTimeout") lower than 300s (the current default), since we use the minimum of this value and the existing 60s. Currently, the only place this is changed is in tests. Given this, does it still count as a behavior change?

If it still counts as a behavior change, if I instead added another internal conf to configure this second timeout, would that be okay? Thanks!

@dongjoon-hyun (Member) commented Dec 10, 2025

When you touch src/main, you cannot say that.

> If it still counts as a behavior change

As you already described the behavior change in the main body of the PR description, you knew that, right?

- // Wait a while for tasks to respond to being cancelled
+ // To avoid long test times, use minimum of timeout value or 60 seconds

BTW, it seems you misread the review comment; I'm not against your code change. I'm simply asking you to revise the PR title. No need to make another code change, just mention the StateStore.MaintenanceThreadPool.stop behavior change clearly instead of describing it as test-only, @liviazhu.

> if I instead added another internal conf to configure this second timeout, would that be okay?

@liviazhu liviazhu changed the title [SPARK-54655] [SS] Reduce test time for StateStoreCoordinatorSuite [SPARK-54655] [SS] Reduce test time for StateStoreCoordinatorSuite and modify StateStore.MaintenanceThreadPool.stop to wait minimum of 60s and maintenanceShutdownTimeout Dec 10, 2025
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-54655] [SS] Reduce test time for StateStoreCoordinatorSuite and modify StateStore.MaintenanceThreadPool.stop to wait minimum of 60s and maintenanceShutdownTimeout [SPARK-54655] [SS] Make StateStore.MaintenanceThreadPool.stop to wait minimum of 60s and maintenanceShutdownTimeout Dec 10, 2025
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-54655] [SS] Make StateStore.MaintenanceThreadPool.stop to wait minimum of 60s and maintenanceShutdownTimeout [SPARK-54655][SS] Make StateStore.MaintenanceThreadPool.stop to wait minimum of 60s and maintenanceShutdownTimeout Dec 10, 2025
@liviazhu liviazhu closed this Dec 10, 2025
@liviazhu (Contributor Author):

Resolved by #53432
