Add DR and Failure scenario handling for Pauseless Ingestion #14794
Pauseless Ingestion Failure Resolution
Please refer to PR #14741 for the happy path. This PR covers only the failure scenarios. Once that PR is merged, a cleaner diff covering only the failure handling will be visible.
To view only the diff covering failure scenarios in the meantime, refer to:
Summary
This PR provides ways to resolve the failure scenarios that can be encountered during pauseless ingestion. The detailed list of failure scenarios can be found here: link, along with the failure handling strategies: link
The following sequence diagrams summarize the failure scenarios and their resolution.
Failure Scenarios & Resolution Approaches
Failures encountered during the commit protocol can be categorized into two types: recoverable and unrecoverable failures.
Recoverable failures are those in which at least one of the servers retains the segment on disk.
Unrecoverable failures occur when none of the servers have the segment on disk.
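The distinction above can be sketched as a small classifier. This is only an illustration: `CommitFailureClassifier`, `FailureType`, and the server-to-boolean map are hypothetical names invented for this example and are not part of the Pinot codebase.

```java
import java.util.Map;

// Hypothetical sketch: none of these names come from the Pinot codebase.
final class CommitFailureClassifier {
    enum FailureType { RECOVERABLE, UNRECOVERABLE }

    // segmentOnDiskByServer maps a server id to whether that server still
    // holds the built segment on its local disk.
    static FailureType classify(Map<String, Boolean> segmentOnDiskByServer) {
        // Recoverable as soon as at least one server retains the segment.
        boolean anyReplicaOnDisk = segmentOnDiskByServer.values().stream()
            .anyMatch(Boolean::booleanValue);
        return anyReplicaOnDisk ? FailureType.RECOVERABLE : FailureType.UNRECOVERABLE;
    }
}
```

An empty map is treated as unrecoverable here, since no server is known to hold the segment.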
Recoverable Failures
Recoverable failures will be addressed through RealtimeSegmentValidationManager. This approach will handle scenarios such as upload failures and incomplete commit protocol executions.
The controller or a server can fail between any of the steps of the commit protocol listed below:
Request Type: COMMIT_START
Request Type: COMMIT_END_METADATA
4. Update Segment ZK metadata for the committing segment (seg__0__0):
- Change status to DONE.
- Update deepstore url.
- Any additional metadata.
The RealtimeSegmentValidationManager determines which step of the commit protocol failed and how it can be fixed. This is very similar to how commit protocol failures were handled before, with some minor changes.
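As a rough illustration of that kind of decision, the sketch below picks a repair action from two pieces of segment state: the ZK metadata status and the deep store URL. It is a deliberate simplification under stated assumptions. `CommitRepairPlanner`, `Status`, and `RepairAction` are hypothetical names, and the actual validation manager may inspect more state than this.

```java
// Hypothetical sketch; these types are NOT the Pinot API.
final class CommitRepairPlanner {
    enum Status { IN_PROGRESS, DONE }
    enum RepairAction { NONE, UPLOAD_TO_DEEP_STORE, FINISH_METADATA_UPDATE }

    // status: status recorded in the segment ZK metadata.
    // deepStoreUrl: deep store URL recorded in the metadata, or null/empty if absent.
    static RepairAction plan(Status status, String deepStoreUrl) {
        boolean hasUrl = deepStoreUrl != null && !deepStoreUrl.isEmpty();
        if (status != Status.DONE) {
            // The commit protocol stopped before the metadata update completed.
            return RepairAction.FINISH_METADATA_UPDATE;
        }
        // Metadata is DONE; a missing URL is treated as an upload failure.
        return hasUrl ? RepairAction.NONE : RepairAction.UPLOAD_TO_DEEP_STORE;
    }
}
```

Keeping the decision in a pure function like this makes each failure case easy to unit test without a running cluster.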
Unrecoverable Failures
These failures require ingesting the segment again from upstream, followed by build, upload and ZK metadata update.
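The ordering of those steps can be sketched as a pipeline that stops at the first step that fails. This is a minimal sketch only: `ReingestionPipeline`, `Step`, and the `executor` callback are hypothetical names made up for this example and do not reflect how the PR implements re-ingestion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; not the Pinot implementation.
final class ReingestionPipeline {
    // Declared in the order the steps must run.
    enum Step { REINGEST_FROM_UPSTREAM, BUILD_SEGMENT, UPLOAD_TO_DEEP_STORE, UPDATE_ZK_METADATA }

    // executor returns true if the given step succeeded.
    // Returns the steps that completed, in order, stopping at the first failure.
    static List<Step> run(Predicate<Step> executor) {
        List<Step> completed = new ArrayList<>();
        for (Step step : Step.values()) {
            if (!executor.test(step)) {
                break; // later steps depend on this one, so do not continue
            }
            completed.add(step);
        }
        return completed;
    }
}
```

Returning the completed steps lets a caller see where the recovery stopped and resume from that point.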