[SPARK-51397][SS] Fix maintenance pool shutdown handling issue causing long test times #50168

Open
wants to merge 2 commits into
base: master

@anishshri-db anishshri-db commented Mar 5, 2025

What changes were proposed in this pull request?

Fix the shutdown handling of the state store maintenance thread pool, which was causing long test times.

Why are the changes needed?

Some of the snapshot lag verification tests were taking a long time because shutdown of the maintenance thread pool was getting stuck: the maintenance threads were blocked on a driver RPC to the coordinator, which could already have been destroyed.

Exception in thread "state-store-maintenance-thread-3" Exception in thread "state-store-maintenance-thread-0" java.lang.InterruptedException
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:248)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
        at org.apache.spark.util.SparkThreadUtils$.awaitResultNoSparkExceptionConversion(SparkThreadUtils.scala:60)
        at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:45)
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:519)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:107)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
        at org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:161)
        at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$verifyIfStoreInstanceActive$1(StateStore.scala:1293)
        at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$verifyIfStoreInstanceActive$1$adapted(StateStore.scala:1293)
        at scala.Option.map(Option.scala:230)

Before the change: > 6m

23:40:57.292 WARN org.apache.spark.scheduler.DAGScheduler: Failed to cancel job group 6f26b37e-1743-45e1-ab42-d85c5b0c6ded. Cannot find active jobs for it.
23:40:57.295 WARN org.apache.spark.scheduler.DAGScheduler: Failed to cancel job group 6f26b37e-1743-45e1-ab42-d85c5b0c6ded. Cannot find active jobs for it.
[info] *** Test still running after 4 minutes, 58 seconds: suite name: RocksDBStateStoreIntegrationSuite, test name: SPARK-51097: Verify snapshot lag metrics are updated correctly with RocksDBStateStoreProvider (with changelog checkpointing).

After the change: ~21s
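
For readers unfamiliar with this failure mode, here is a minimal, self-contained sketch of how a thread pool shutdown can stall when its workers are blocked on a synchronous call that never returns. It is not the code in this PR and does not use Spark's APIs: the class `MaintenancePoolShutdownSketch`, its methods, and the never-completed `CompletableFuture` (a stand-in for an `askSync` call to a coordinator that is gone) are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class MaintenancePoolShutdownSketch {

    // Starts `workers` tasks that block on a reply that never arrives.
    // Assumption: this stands in for a synchronous RPC to a destroyed coordinator.
    static ExecutorService startBlockedPool(int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CompletableFuture<Boolean> neverCompleted = new CompletableFuture<>();
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    neverCompleted.get(); // blocks until the thread is interrupted
                } catch (Exception e) {
                    // InterruptedException lands here; the worker simply exits.
                }
            });
        }
        return pool;
    }

    // Graceful shutdown: already-submitted tasks keep running, so with blocked
    // workers this only returns (with false) once the timeout has elapsed.
    static boolean gracefulShutdown(ExecutorService pool, long timeoutMs)
            throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Forced shutdown: interrupts the workers, which unblocks the pending wait.
    static boolean forcedShutdown(ExecutorService pool, long timeoutMs)
            throws InterruptedException {
        pool.shutdownNow();
        return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = startBlockedPool(2);
        System.out.println("graceful terminated: " + gracefulShutdown(pool, 200));
        System.out.println("forced terminated: " + forcedShutdown(pool, 2000));
    }
}
```

In this sketch the graceful path waits out its full timeout because the workers never finish on their own, while the forced path terminates once the workers are interrupted. The `InterruptedException` in the stack trace above is consistent with maintenance threads being interrupted while waiting on the RPC result, but the actual change made by this PR should be read from its diff.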

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@anishshri-db anishshri-db changed the title [SPARK-51397] Fix maintenance pool shutdown handling issue causing long test times [SPARK-51397][SS] Fix maintenance pool shutdown handling issue causing long test times Mar 5, 2025
@anishshri-db

@HeartSaVioR - PTAL, thx !
