Register SLM run before snapshotting to save stats #110216
base: main
Conversation
Before snapshotting, store in the cluster state that an SLM snapshot is about to run. If the run fails, and storing the failure information in the cluster state also fails, then during the next snapshot run the fact that there was a failure can be inferred from the presence of an already registered run.

This PR still needs some work, but I'd like to get some feedback to make sure it's going in the right direction!
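To make the intended flow concrete, here is a minimal sketch of the idea. The helper names (`registerRun`, `deregisterRunAndRecordSuccess`, `deregisterRunAndRecordFailure`) are illustrative, not the actual methods in this PR:

```java
// Illustrative only: the registerRun/deregisterRun* helpers are hypothetical.
void maybeTakeSnapshot(SnapshotLifecyclePolicyMetadata policyMetadata) {
    String policyId = policyMetadata.getPolicy().getId();
    // 1. Record in the cluster state that a run is about to start.
    registerRun(policyId);
    try {
        takeSnapshot(policyMetadata);
        // 2a. On success, clear the registration along with the stats update.
        deregisterRunAndRecordSuccess(policyId);
    } catch (Exception e) {
        // 2b. On failure, clear the registration and record the failure.
        // If the master dies before this cluster state update is applied,
        // the registration survives, and the next run can infer the
        // unrecorded failure from its presence.
        deregisterRunAndRecordFailure(policyId, e);
    }
}
```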
```java
assertTrue(latch.await(1, TimeUnit.MINUTES));

// restart master so failure stat is lost
// TODO this relies on a race condition. The node restart must happen
// before stats are stored in cluster state, but this is not guaranteed.
```
There is a race condition here that I haven't figured out how to remove. If the failure is recorded in the cluster state (x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java, line 131 in ff7f00e):

```java
submitUnbatchedTask(
```

before the master is restarted, the failure stat will not be lost. I'd like to test this without relying on a race condition, but I'm not sure how to do so.
```java
SnapshotLifecyclePolicyMetadata.Builder newPolicyMetadata = SnapshotLifecyclePolicyMetadata.builder(policyMetadata);
if (preRegisteredRuns > numSnapshotsRunning) {
    final long unrecordedFailures = preRegisteredRuns - numSnapshotsRunning;
```
Counting the number of currently running snapshots and using this to update `preRegisteredRuns` worries me a bit. A simpler alternative would be the following: if there are concurrently running snapshots for the same policy, just don't update `invocationsSinceLastSuccess`. I think snapshots can only run concurrently due to a manual rather than a scheduled run, so it does not seem terrible to just bail out in this case and forgo the benefits of this PR. A sketch of that guard is below.
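A hedged sketch of the suggested guard, reusing the variable names from the hunk above; the early-return shape assumes this sits inside a cluster state update task:

```java
// Sketch only: if any snapshot for this policy is already running,
// skip the unrecorded-failure accounting instead of guessing at it.
if (numSnapshotsRunning > 0) {
    // Concurrent runs should only come from manual executions, so
    // leaving invocationsSinceLastSuccess untouched here is acceptable.
    return currentState;
}
```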
```java
    newPolicyMetadata.setInvocationsSinceLastSuccess(policyMetadata.getInvocationsSinceLastSuccess() + unrecordedFailures);

    // There are likely scenarios where inc/decrements to preRegisteredRuns can be lost, so lower bound to 1 before run.
    long newPreRegisteredRuns = Math.max(1, preRegisteredRuns + 1 - unrecordedFailures);
```
It seems there are likely situations where `preRegisteredRuns` and `numSnapshotsRunning` are inconsistent, and thus `unrecordedFailures` gets a wrong value. Setting `newPreRegisteredRuns` to a minimum value of 1 should mitigate this, but doesn't solve it.

One alternative to improve this would be to store a set of pre-registered snapshot IDs rather than a single count. Then remove all snapshot IDs that are currently running from the set, and use the size of the remaining set to increment `invocationsSinceLastSuccess`. A rough sketch follows.
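A rough sketch of the set-based bookkeeping; `getPreRegisteredSnapshots` and `currentlyRunningSnapshotIds` are assumed names, not existing accessors:

```java
// Assumed names throughout; this is not code from the PR.
Set<SnapshotId> leftovers = new HashSet<>(policyMetadata.getPreRegisteredSnapshots());
leftovers.removeAll(currentlyRunningSnapshotIds); // still in flight; not failures yet
long unrecordedFailures = leftovers.size();       // registered, finished, never recorded
newPolicyMetadata.setInvocationsSinceLastSuccess(
    policyMetadata.getInvocationsSinceLastSuccess() + unrecordedFailures
);
```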
The SLM health indicator relies on `policyMetadata.getInvocationsSinceLastSuccess` to determine whether the last several snapshots have failed.

If a snapshot fails and the master is shut down before `invocationsSinceLastSuccess` is updated, the fact that a failure occurred will be lost. To solve this, before snapshotting, register in the cluster state that an SLM snapshot is about to run. If the run fails, and storing the failure information in the cluster state also fails due to a master shutdown, the failure can still be detected during the next snapshot run, because the failed run will remain registered in the cluster state.
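For context, a hedged sketch of how a health check might consume that counter; the threshold name and reporting helper are assumptions, not the actual `SlmHealthIndicatorService` code:

```java
// Illustrative consumption of the counter; names are hypothetical.
long failuresSinceSuccess = policyMetadata.getInvocationsSinceLastSuccess();
if (failuresSinceSuccess >= failedInvocationsThreshold) {
    // Without the pre-registration fix, a failure lost to a master
    // shutdown would never reach this counter, masking a degraded policy.
    reportDegradedSlmPolicy(policyMetadata.getPolicy().getId(), failuresSinceSuccess);
}
```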