[SPARK-51358] [SS] Introduce snapshot upload lag detection through StateStoreCoordinator #50123
base: master
Conversation
stateStoreSnapshotVersions.getOrElse(storeProviderId, SnapshotUploadEvent(-1, 0))
logWarning(
  s"State store falling behind $storeProviderId " +
  s"(current: $snapshotEvent, latest: $latestSnapshot)"
Maybe you can log the number of versions behind/time since last upload?
Also - if this is the log line we have to look for, maybe you can add a prefix that is easy to grep for.
Good idea, added StateStoreCoordinator Snapshot Lag as a prefix for now. Thanks!
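For reference, here is a rough, self-contained sketch of what a warning carrying both the greppable prefix and the lag details suggested above could look like. The `SnapshotUploadEvent` fields and the `logSnapshotLag` helper are simplified stand-ins for illustration, not the PR's actual code.

```scala
// Simplified stand-in for the coordinator's upload event (version + upload time).
case class SnapshotUploadEvent(version: Long, timestampMs: Long)

// Hypothetical helper: emits one greppable line with versions behind and time since upload.
def logSnapshotLag(
    storeProviderId: String,
    lastUpload: SnapshotUploadEvent,
    latestVersion: Long,
    currentTimeMs: Long): Unit = {
  val versionsBehind = latestVersion - lastUpload.version
  val millisSinceUpload = currentTimeMs - lastUpload.timestampMs
  // The fixed "StateStoreCoordinator Snapshot Lag" prefix makes the line easy to grep for.
  println(
    s"StateStoreCoordinator Snapshot Lag - state store falling behind $storeProviderId " +
    s"(versions behind: $versionsBehind, time since last upload: $millisSinceUpload ms)")
}
```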
.start()
inputData.addData(1, 2, 3)
query.processAllAvailable()
inputData.addData(1, 2, 3)
Why are you doing this twice rather than all together?
It's more that I need multiple query.processAllAvailable() calls to commit and progress to a new version, but I also didn't want to run batches with no data. I'll add a comment to make this clearer 👍
Great thanks!
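To illustrate the pattern being discussed, here is a simplified, self-contained sketch (not the suite's actual test) of why the addData/processAllAvailable pair is repeated: each pair commits one micro-batch and advances the state store version, and feeding data each time avoids running an empty batch. The query name and session setup are assumptions for the sketch.

```scala
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[2]").appName("snapshot-lag-sketch").getOrCreate()
import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext

val inputData = MemoryStream[Int]
val query = inputData.toDF()
  .groupBy("value").count()          // stateful aggregation, so each batch touches the state store
  .writeStream
  .format("memory")
  .queryName("snapshot_lag_sketch")
  .outputMode("complete")
  .start()

// First micro-batch: commits one state store version.
inputData.addData(1, 2, 3)
query.processAllAvailable()
// Second micro-batch: commits the next version; repeated on purpose so the
// query advances versions without ever processing an empty batch.
inputData.addData(1, 2, 3)
query.processAllAvailable()

query.stop()
spark.stop()
```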
@@ -38,9 +38,14 @@ import org.apache.spark.sql.types.StructType
import org.apache.spark.unsafe.Platform
import org.apache.spark.util.{NonFateSharingCache, Utils}

/** Trait representing the different events reported from RocksDB instance */
trait RocksDBEventListener {
Why RocksDBEventListener? Are we not adding this to HDFS later too?
RocksDB state stores are a bit special because the event is reported from within the state store's RocksDB instance, whereas HDFS-backed stores report straight from the state store itself. This trait would not be needed for the HDFS state stores.
Got it. Can you make the docstring a bit more descriptive?
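For illustration, one way the docstring could be made more descriptive, based on the explanation in this thread. The wording and the callback signature below are assumptions, not the PR's final code.

```scala
/**
 * Callback interface used by a RocksDB state store instance to report lifecycle
 * events (e.g. a successful snapshot upload) back to its provider, which then
 * forwards them to the StateStoreCoordinator via RPC. HDFS-backed state stores
 * report directly from the provider, so they do not need this trait.
 */
trait RocksDBEventListener {
  /** Hypothetical callback: invoked after a snapshot for `version` has been uploaded. */
  def reportSnapshotUploaded(version: Long): Unit
}
```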
What changes were proposed in this pull request?
SPARK-51358
This PR adds detection logic and logging for delays in snapshot uploads across all state store instances. The main snapshot upload reporting is done through RPC calls from RocksDB.scala to the StateStoreCoordinator, so that events do not depend on streaming query progress reports.
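To make the flow above concrete, here is a minimal, self-contained sketch of the idea on the coordinator side, using illustrative names (SnapshotLagTracker, reportSnapshotUploaded, findLaggingStores) rather than Spark's actual RPC endpoints or classes: executors report each uploaded snapshot version, and the driver periodically checks which providers have fallen too far behind.

```scala
import scala.collection.mutable

// Simplified stand-ins for Spark's provider id and upload event types.
case class StateStoreProviderId(operatorId: Long, partitionId: Int, storeName: String)
case class SnapshotUploadEvent(version: Long, timestampMs: Long)

class SnapshotLagTracker(maxVersionsBehind: Long) {
  private val latestUploads = mutable.Map[StateStoreProviderId, SnapshotUploadEvent]()

  // Conceptually invoked via RPC each time an executor finishes uploading a snapshot.
  def reportSnapshotUploaded(id: StateStoreProviderId, event: SnapshotUploadEvent): Unit =
    synchronized {
      if (latestUploads.get(id).forall(_.version < event.version)) {
        latestUploads(id) = event
      }
    }

  // Periodic check on the driver: flag providers whose last uploaded snapshot
  // is too many versions behind the most recently committed version.
  def findLaggingStores(latestCommittedVersion: Long): Seq[StateStoreProviderId] =
    synchronized {
      latestUploads.collect {
        case (id, event) if latestCommittedVersion - event.version > maxVersionsBehind => id
      }.toSeq
    }
}
```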
Why are the changes needed?
This enables observability through dashboards and alerts, helping us understand how frequently snapshot uploads lag in production.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Five new tests are added in StateStoreCoordinatorSuite, covering both join and non-join stateful queries. One of these tests verifies that the snapshot lag check is only performed when changelog checkpointing is enabled.
Was this patch authored or co-authored using generative AI tooling?
No