Register SLM run before snapshotting to save stats #110216
base: main
Conversation
Before snapshotting, store in the cluster state that an SLM snapshot is about to run. If the run fails, and storing the failure information in the cluster state also fails, then during the next snapshot run the fact that there was a failure can be inferred from the presence of an already registered run.

This PR still needs some work, but I'd like to get some feedback to make sure it's going in the right direction!
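To make the intended flow concrete, here is a minimal sketch of the idea. The helper names (`registerRun`, `deregisterRunAndRecordSuccess`, `deregisterRunAndRecordFailure`) are illustrative, not the actual methods in this PR:

```java
// Illustrative only: the registerRun/deregisterRun* helpers are hypothetical.
void maybeTakeSnapshot(SnapshotLifecyclePolicyMetadata policyMetadata) {
    String policyId = policyMetadata.getPolicy().getId();
    // 1. Record in the cluster state that a run is about to start.
    registerRun(policyId);
    try {
        takeSnapshot(policyMetadata);
        // 2a. On success, clear the registration along with the stats update.
        deregisterRunAndRecordSuccess(policyId);
    } catch (Exception e) {
        // 2b. On failure, clear the registration and record the failure.
        // If the master dies before this cluster state update is applied,
        // the registration survives, and the next run can infer the
        // unrecorded failure from its presence.
        deregisterRunAndRecordFailure(policyId, e);
    }
}
```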
```java
assertTrue(latch.await(1, TimeUnit.MINUTES));

// restart master so failure stat is lost
// TODO this relies on a race condition. The node restart must happen
// before stats are stored in cluster state, but this is not guaranteed.
```
There is a race condition here that I haven't figured out how to remove. If the failure is recorded in the cluster state (x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java, line 131 in ff7f00e):

```java
submitUnbatchedTask(
```

before the master is restarted, the failure stat will not be lost. I'd like to test this without relying on a race condition, but I'm not sure how to do so.
```java
SnapshotLifecyclePolicyMetadata.Builder newPolicyMetadata = SnapshotLifecyclePolicyMetadata.builder(policyMetadata);
if (preRegisteredRuns > numSnapshotsRunning) {
    final long unrecordedFailures = preRegisteredRuns - numSnapshotsRunning;
```
Counting the number of currently running snapshots and using this to update `preRegisteredRuns` worries me a bit. A simpler alternative would be the following: if there are concurrently running snapshots for the same policy, just don't update `invocationsSinceLastSuccess`. I think snapshots can only run concurrently due to a manual rather than a scheduled run, so it does not seem terrible to just bail out in this case and forgo the benefits of this PR. A sketch of that guard is below.
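A hedged sketch of the suggested guard, reusing the variable names from the hunk above; the early-return shape assumes this sits inside a cluster state update task:

```java
// Sketch only: if any snapshot for this policy is already running,
// skip the unrecorded-failure accounting instead of guessing at it.
if (numSnapshotsRunning > 0) {
    // Concurrent runs should only come from manual executions, so
    // leaving invocationsSinceLastSuccess untouched here is acceptable.
    return currentState;
}
```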
```java
    newPolicyMetadata.setInvocationsSinceLastSuccess(policyMetadata.getInvocationsSinceLastSuccess() + unrecordedFailures);

    // There are likely scenarios where inc/decrements to preRegisteredRuns can be lost, so lower bound to 1 before run.
    long newPreRegisteredRuns = Math.max(1, preRegisteredRuns + 1 - unrecordedFailures);
```
It seems there are likely situations where `preRegisteredRuns` and `numSnapshotsRunning` are inconsistent, and thus `unrecordedFailures` gets a wrong value. Setting `newPreRegisteredRuns` to a minimum value of 1 should mitigate this, but doesn't solve it.

One alternative to improve this would be to store a set of pre-registered snapshot IDs rather than a single count. Then remove all snapshot IDs that are currently running from the set, and use the size of the remaining set to increment `invocationsSinceLastSuccess`. A rough sketch follows.
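A rough sketch of the set-based bookkeeping; `getPreRegisteredSnapshots` and `currentlyRunningSnapshotIds` are assumed names, not existing accessors:

```java
// Assumed names throughout; this is not code from the PR.
Set<SnapshotId> leftovers = new HashSet<>(policyMetadata.getPreRegisteredSnapshots());
leftovers.removeAll(currentlyRunningSnapshotIds); // still in flight; not failures yet
long unrecordedFailures = leftovers.size();       // registered, finished, never recorded
newPolicyMetadata.setInvocationsSinceLastSuccess(
    policyMetadata.getInvocationsSinceLastSuccess() + unrecordedFailures
);
```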
The SLM health indicator relies on `policyMetadata.getInvocationsSinceLastSuccess` to determine whether the last several snapshots have failed.

If a snapshot fails and the master is shut down before `invocationsSinceLastSuccess` is updated, the fact that a failure occurred will be lost. To solve this, before snapshotting, register in the cluster state that an SLM snapshot is about to run. If the run fails, and storing the failure information in the cluster state also fails due to a master shutdown, the failure can still be detected during the next snapshot run, because the failed run will remain registered in the cluster state.
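For context, a hedged sketch of how a health check might consume that counter; the threshold name and reporting helper are assumptions, not the actual `SlmHealthIndicatorService` code:

```java
// Illustrative consumption of the counter; names are hypothetical.
long failuresSinceSuccess = policyMetadata.getInvocationsSinceLastSuccess();
if (failuresSinceSuccess >= failedInvocationsThreshold) {
    // Without the pre-registration fix, a failure lost to a master
    // shutdown would never reach this counter, masking a degraded policy.
    reportDegradedSlmPolicy(policyMetadata.getPolicy().getId(), failuresSinceSuccess);
}
```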