Add DR and Failure scenario handling for Pauseless Ingestion #14794

Open
wants to merge 10 commits into
base: pauseless_ingestion_without_failure_scenarios

KKcorps
Contributor

@KKcorps KKcorps commented Jan 10, 2025

Pauseless Ingestion Failure Resolution

Please refer to PR #14741 for the happy path. This PR covers only the failure scenarios. Once #14741 is merged, a cleaner diff containing only the failure handling will be visible.

To view only the diff covering the failure scenarios in the meantime, refer to:

Summary

This PR provides ways to resolve the failure scenarios that can occur during pauseless ingestion. The detailed list of failure scenarios can be found here: link, along with the failure handling strategies: link

The following sequence diagrams summarize the failure scenarios and their resolution.
[Image: Screenshot 2025-01-03 at 2 53 46 PM]
[Image: Screenshot 2025-01-03 at 2 54 45 PM]

Failure Scenarios & Resolution Approaches

Failures encountered during the commit protocol can be categorized into two types: recoverable and unrecoverable failures.

Recoverable failures are those in which at least one of the servers retains the segment on disk.

Unrecoverable failures occur when none of the servers have the segment on disk.
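
As a minimal sketch of this classification (the class, enum, and method names below are hypothetical and are not taken from the PR), the decision reduces to checking whether any server still reports the segment on disk:

```java
import java.util.Map;

// Illustrative only: all names here are assumptions, not Pinot's actual API.
final class CommitFailureClassifier {
  enum FailureType { RECOVERABLE, UNRECOVERABLE }

  // serverHasSegmentOnDisk maps a server instance name to whether that
  // server still holds the built segment on its local disk.
  static FailureType classify(Map<String, Boolean> serverHasSegmentOnDisk) {
    boolean anyCopyLeft =
        serverHasSegmentOnDisk.values().stream().anyMatch(Boolean::booleanValue);
    // At least one surviving copy makes the failure recoverable.
    return anyCopyLeft ? FailureType.RECOVERABLE : FailureType.UNRECOVERABLE;
  }

  public static void main(String[] args) {
    System.out.println(classify(Map.of("server1", false, "server2", true)));  // RECOVERABLE
    System.out.println(classify(Map.of("server1", false, "server2", false))); // UNRECOVERABLE
  }
}
```

How the controller actually learns whether a server has the segment on disk is outside the scope of this sketch.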

Recoverable Failures

Recoverable failures will be addressed through RealtimeSegmentValidationManager. This approach will handle scenarios such as upload failures and incomplete commit protocol executions.

The controller or server can fail between any two steps of the commit protocol listed below:

Request Type: COMMIT_START

  1. Update the Segment ZK metadata for the committing segment (seg__0__0)
    • Change status to COMMITTING
    • Set endOffset
  2. Create Segment ZK metadata for the new segment (seg__0__1) with status IN_PROGRESS
  3. Update the Ideal State for the:
    • Committing segment (seg__0__0) to ONLINE
    • New/ Consuming segment (seg__0__1) to CONSUMING

Request Type: COMMIT_END_METADATA

  4. Update the Segment ZK metadata for the committing segment (seg__0__0)
    • Change status to DONE
    • Update the deepstore URL
    • Any additional metadata
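
The four steps above can be sketched against an in-memory model. This is only an illustration of the ordering of state changes: the class, field, and method names are hypothetical, plain maps stand in for ZooKeeper and the Ideal State, and the deepstore URL is a made-up placeholder.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory model; none of these names come from the PR.
final class PauselessCommitSketch {
  enum Status { IN_PROGRESS, COMMITTING, DONE }

  static final class SegmentZkMetadata {
    Status status;
    Long endOffset;      // null until COMMIT_START sets it
    String deepStoreUrl; // null until COMMIT_END_METADATA sets it

    SegmentZkMetadata(Status status) {
      this.status = status;
    }
  }

  final Map<String, SegmentZkMetadata> zkMetadata = new HashMap<>();
  final Map<String, String> idealState = new HashMap<>(); // segment -> state

  // COMMIT_START: steps 1 to 3.
  void commitStart(String committing, String next, long endOffset) {
    SegmentZkMetadata md = zkMetadata.get(committing);
    md.status = Status.COMMITTING;                                   // step 1
    md.endOffset = endOffset;
    zkMetadata.put(next, new SegmentZkMetadata(Status.IN_PROGRESS)); // step 2
    idealState.put(committing, "ONLINE");                            // step 3
    idealState.put(next, "CONSUMING");
  }

  // COMMIT_END_METADATA: step 4.
  void commitEndMetadata(String committing, String deepStoreUrl) {
    SegmentZkMetadata md = zkMetadata.get(committing);
    md.status = Status.DONE;
    md.deepStoreUrl = deepStoreUrl;
  }

  public static void main(String[] args) {
    PauselessCommitSketch sketch = new PauselessCommitSketch();
    sketch.zkMetadata.put("seg__0__0", new SegmentZkMetadata(Status.IN_PROGRESS));
    sketch.idealState.put("seg__0__0", "CONSUMING");

    sketch.commitStart("seg__0__0", "seg__0__1", 1000L);
    // Placeholder URL, for illustration only.
    sketch.commitEndMetadata("seg__0__0", "deepstore://example/seg__0__0");

    System.out.println(sketch.zkMetadata.get("seg__0__0").status);
    System.out.println(sketch.idealState.get("seg__0__1"));
  }
}
```

In the real system each of these writes is a separate remote update, so a crash can leave the state anywhere between two numbered steps.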

The RealtimeSegmentValidationManager determines which step of the commit protocol failed and how it can be fixed. This is very similar to how commit protocol failures were handled before, with some minor changes.
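
One way to picture that diagnosis, under the assumption that the state left behind identifies the last completed step, is the sketch below. It is not the logic of RealtimeSegmentValidationManager; the enum values, method, and parameters are hypothetical and only mirror the numbered steps listed earlier.

```java
// Illustrative only: an assumed mapping from observed state to the step to resume.
final class CommitRepairSketch {
  enum Status { IN_PROGRESS, COMMITTING, DONE }

  enum RepairAction {
    NONE,                         // commit not started, or already finished
    CREATE_NEXT_SEGMENT_METADATA, // resume at step 2
    UPDATE_IDEAL_STATE,           // resume at step 3
    FINISH_COMMIT_END_METADATA    // resume at step 4
  }

  static RepairAction diagnose(
      Status committingSegmentStatus,
      boolean nextSegmentMetadataExists,
      boolean idealStateUpdated) {
    if (committingSegmentStatus != Status.COMMITTING) {
      // IN_PROGRESS: step 1 never ran. DONE: step 4 completed.
      return RepairAction.NONE;
    }
    if (!nextSegmentMetadataExists) {
      return RepairAction.CREATE_NEXT_SEGMENT_METADATA;
    }
    if (!idealStateUpdated) {
      return RepairAction.UPDATE_IDEAL_STATE;
    }
    // Steps 1 to 3 are visible but the segment is still COMMITTING.
    return RepairAction.FINISH_COMMIT_END_METADATA;
  }

  public static void main(String[] args) {
    System.out.println(diagnose(Status.COMMITTING, false, false)); // CREATE_NEXT_SEGMENT_METADATA
    System.out.println(diagnose(Status.COMMITTING, true, true));   // FINISH_COMMIT_END_METADATA
  }
}
```

The checks run in protocol order, so the first missing piece of state names the step to resume from.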

Unrecoverable Failures

These failures require ingesting the segment again from upstream, followed by build, upload and ZK metadata update.
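
The ordering of that repair can be sketched as a simple staged pipeline that stops at the first failing stage. The stage names and the `Stage` type are hypothetical, and each stage body here is a stub standing in for the real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative only: a generic ordered pipeline, not the PR's implementation.
final class UnrecoverableSegmentRepairSketch {
  // One named stage of the repair; the action returns true on success.
  static final class Stage {
    final String name;
    final BooleanSupplier action;

    Stage(String name, BooleanSupplier action) {
      this.name = name;
      this.action = action;
    }
  }

  // Runs the stages in order, stops at the first failure, and returns
  // the names of the stages that completed.
  static List<String> run(List<Stage> stages) {
    List<String> completed = new ArrayList<>();
    for (Stage stage : stages) {
      if (!stage.action.getAsBoolean()) {
        break;
      }
      completed.add(stage.name);
    }
    return completed;
  }

  public static void main(String[] args) {
    List<Stage> stages = List.of(
        new Stage("reingest", () -> true),          // consume the segment's rows again from upstream
        new Stage("build", () -> true),             // build the segment
        new Stage("upload", () -> true),            // upload the built segment
        new Stage("updateZkMetadata", () -> true)); // update the Segment ZK metadata
    System.out.println(run(stages)); // [reingest, build, upload, updateZkMetadata]
  }
}
```

Because later stages depend on the output of earlier ones, a failed stage leaves the remaining stages unexecuted and the repair can be retried from the start.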

@KKcorps KKcorps requested a review from Jackie-Jiang January 10, 2025 18:52
@KKcorps KKcorps added the feature, documentation, release-notes, ingestion, and real-time labels Jan 10, 2025