[SPARK-51397][SS] Fix maintenance pool shutdown handling issue causing long test times #50168

anishshri-db · 2025-03-05T07:51:51Z

What changes were proposed in this pull request?

Fix maintenance pool shutdown handling issue causing long test times

Why are the changes needed?

Some of the snapshot lag verification tests were taking a long time. This was because of the maintenance thread pool shutdown getting stuck due to a driver RPC with the coordinator which could already be destroyed.

Exception in thread "state-store-maintenance-thread-3" Exception in thread "state-store-maintenance-thread-0" java.lang.InterruptedException
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:248)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
        at org.apache.spark.util.SparkThreadUtils$.awaitResultNoSparkExceptionConversion(SparkThreadUtils.scala:60)
        at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:45)
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:519)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:107)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
        at org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:161)
        at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$verifyIfStoreInstanceActive$1(StateStore.scala:1293)
        at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$verifyIfStoreInstanceActive$1$adapted(StateStore.scala:1293)
        at scala.Option.map(Option.scala:230)

Before the change: > 6m

23:40:57.292 WARN org.apache.spark.scheduler.DAGScheduler: Failed to cancel job group 6f26b37e-1743-45e1-ab42-d85c5b0c6ded. Cannot find active jobs for it.
23:40:57.295 WARN org.apache.spark.scheduler.DAGScheduler: Failed to cancel job group 6f26b37e-1743-45e1-ab42-d85c5b0c6ded. Cannot find active jobs for it.
[info] *** Test still running after 4 minutes, 58 seconds: suite name: RocksDBStateStoreIntegrationSuite, test name: SPARK-51097: Verify snapshot lag metrics are updated correctly with RocksDBStateStoreProvider (with changelog checkpointing).

After the change: ~21s

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests

Was this patch authored or co-authored using generative AI tooling?

No

…ng test times

anishshri-db · 2025-03-05T07:55:46Z

@HeartSaVioR - PTAL, thx !

[SPARK-51397] Fix maintenance pool shutdown handling issue causing lo…

3da93b9

…ng test times

anishshri-db changed the title ~~[SPARK-51397] Fix maintenance pool shutdown handling issue causing long test times~~ [SPARK-51397][SS] Fix maintenance pool shutdown handling issue causing long test times Mar 5, 2025

github-actions bot added SQL STRUCTURED STREAMING labels Mar 5, 2025

Fix issue

ab0449b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-51397][SS] Fix maintenance pool shutdown handling issue causing long test times #50168

[SPARK-51397][SS] Fix maintenance pool shutdown handling issue causing long test times #50168

anishshri-db commented Mar 5, 2025 •

edited

Loading

anishshri-db commented Mar 5, 2025

[SPARK-51397][SS] Fix maintenance pool shutdown handling issue causing long test times #50168

Are you sure you want to change the base?

[SPARK-51397][SS] Fix maintenance pool shutdown handling issue causing long test times #50168

Conversation

anishshri-db commented Mar 5, 2025 • edited Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

anishshri-db commented Mar 5, 2025

anishshri-db commented Mar 5, 2025 •

edited

Loading