Register SLM run before snapshotting to save stats #110216

Open · wants to merge 1 commit into base: main

Conversation

parkertimmins (Contributor)

The SLM health indicator relies on policyMetadata.getInvocationsSinceLastSuccess to determine whether the last several snapshots have failed.

If a snapshot fails and the master is shut down before invocationsSinceLastSuccess is updated, the fact that a failure occurred is lost. To solve this, before snapshotting, register in the cluster state that an SLM snapshot is about to run. If the run fails, and storing the failure information in the cluster state also fails due to a master shutdown, the failure can still be detected during the next snapshot run, because the failed run will already be registered in the cluster state.

Before snapshotting, store in the cluster state that
an slm snapshot is about to be run. If the run fails,
and storing the failure information in the cluster
state also fails, during the next snapshot run the fact
that there was a failure can be inferred from the presence
of an already registered run.
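A minimal, self-contained sketch of this bookkeeping, written as a plain-Java model rather than the PR's actual SnapshotLifecyclePolicyMetadata and cluster-state code. The fields preRegisteredRuns and invocationsSinceLastSuccess mirror the ones discussed in the diff; the method names are illustrative only.

// Simplified model of the per-policy SLM stats. In the real change this state
// lives in the cluster state; here it is a plain object so the accounting is easy to follow.
class SlmPolicyStatsModel {
    long invocationsSinceLastSuccess = 0;
    long preRegisteredRuns = 0;

    // Called before a snapshot starts: account for lost failures, then register this run.
    void registerRunBeforeSnapshot(long numSnapshotsRunning) {
        if (preRegisteredRuns > numSnapshotsRunning) {
            // Runs that were registered but have neither a stored result nor an
            // in-flight snapshot must have failed without the failure being recorded.
            long unrecordedFailures = preRegisteredRuns - numSnapshotsRunning;
            invocationsSinceLastSuccess += unrecordedFailures;
        }
        preRegisteredRuns = numSnapshotsRunning + 1; // the run about to start is now registered
    }

    // Called when the snapshot result is successfully stored in the cluster state.
    void recordResult(boolean success) {
        preRegisteredRuns = Math.max(0, preRegisteredRuns - 1);
        if (success) {
            invocationsSinceLastSuccess = 0;
        } else {
            invocationsSinceLastSuccess++;
        }
    }
}

The comparison against numSnapshotsRunning matters because a registered run that is still in flight legitimately has no recorded result yet and must not be counted as a lost failure.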
parkertimmins (Contributor Author)

This PR still needs some work, but I'd like to get some feedback to make sure it's going in the right direction!

assertTrue(latch.await(1, TimeUnit.MINUTES));

// restart master so failure stat is lost
// TODO this relies on a race condition. The node restart must happen before stats are stored in cluster state, but this is not guaranteed.
parkertimmins (Contributor Author):

There is a race condition here that I haven't figured out how to remove. If the failure is recorded in the cluster state before the master is restarted, the failure stat will not be lost. I'd like to test this without relying on a race condition, but I'm not sure how to do so.


SnapshotLifecyclePolicyMetadata.Builder newPolicyMetadata = SnapshotLifecyclePolicyMetadata.builder(policyMetadata);
if (preRegisteredRuns > numSnapshotsRunning) {
final long unrecordedFailures = preRegisteredRuns - numSnapshotsRunning;
parkertimmins (Contributor Author):

Counting the number of currently running snapshots and using it to update preRegisteredRuns worries me a bit. A simpler alternative: if there are concurrently running snapshots for the same policy, just don't update invocationsSinceLastSuccess. I think snapshots can only run concurrently due to a manual rather than a scheduled run, so it does not seem terrible to bail out in this case and forgo the benefits of this PR.
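A sketch of that simpler alternative, again as a standalone model rather than the PR's code; the only change from the bookkeeping sketched earlier is the early return when other runs for the policy are in flight:

void registerRunBeforeSnapshot(long numSnapshotsRunning) {
    if (numSnapshotsRunning > 0) {
        // Concurrent runs for this policy (e.g. a manual run alongside a scheduled one):
        // skip the unrecorded-failure accounting rather than risk attributing an
        // in-flight run's missing result to a lost failure.
        preRegisteredRuns = numSnapshotsRunning + 1;
        return;
    }
    // Nothing is running, so every previously registered run failed without a recorded result.
    invocationsSinceLastSuccess += preRegisteredRuns;
    preRegisteredRuns = 1;
}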

newPolicyMetadata.setInvocationsSinceLastSuccess(policyMetadata.getInvocationsSinceLastSuccess() + unrecordedFailures);

// There are likely scenarios where increments/decrements to preRegisteredRuns can be lost, so lower bound to 1 before the run.
long newPreRegisteredRuns = Math.max(1, preRegisteredRuns + 1 - unrecordedFailures);
parkertimmins (Contributor Author):

It seems there are likely situations where preRegisteredRuns and numSnapshotsRunning are inconsistent, so unrecordedFailures gets the wrong value. Setting newPreRegisteredRuns to a minimum of 1 should mitigate this, but doesn't solve it.

One way to improve this would be to store a set of pre-registered snapshot ids rather than a single count: remove every snapshot id that is currently running from the set, then increment invocationsSinceLastSuccess by the size of the remaining set.
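A sketch of that set-based variant. The ids of in-flight snapshots are taken as a plain argument here, whereas the real implementation would have to read them from the cluster state, so treat this purely as an illustration of the accounting:

import java.util.HashSet;
import java.util.Set;

class SlmPolicyStatsSetModel {
    long invocationsSinceLastSuccess = 0;
    final Set<String> preRegisteredSnapshotIds = new HashSet<>();

    void registerRunBeforeSnapshot(String newSnapshotId, Set<String> runningSnapshotIds) {
        // Ids that were registered but are neither running nor had a result stored
        // correspond to failures that were never recorded.
        Set<String> unrecorded = new HashSet<>(preRegisteredSnapshotIds);
        unrecorded.removeAll(runningSnapshotIds);
        invocationsSinceLastSuccess += unrecorded.size();
        preRegisteredSnapshotIds.removeAll(unrecorded);
        preRegisteredSnapshotIds.add(newSnapshotId); // register the run about to start
    }

    void recordResult(String snapshotId, boolean success) {
        preRegisteredSnapshotIds.remove(snapshotId);
        if (success) {
            invocationsSinceLastSuccess = 0;
        } else {
            invocationsSinceLastSuccess++;
        }
    }
}

Tracking ids removes the need for a lower bound on the counter, since unrelated increments and decrements can no longer cancel incorrectly.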

@parkertimmins parkertimmins marked this pull request as ready for review June 28, 2024 14:06
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jun 28, 2024
Labels: needs:triage (Requires assignment of a team area label), v8.15.0