Resolve failures pauseless ingestion #3

Closed
9aman wants to merge 438 commits into master from resolve-failures-pauseless-ingestion

@9aman 9aman commented Jan 3, 2025

Pauseless Ingestion Failure Resolution

PR for happy path: apache#14741
Please refer to the above PR for more details before reviewing this PR that covers resolution for failure scenarios.

Summary

This PR provides ways to resolve the failure scenarios that can occur during pauseless ingestion. The detailed list of failure scenarios can be found here: link, and the failure handling strategies here: link

The following sequence diagrams summarize the failure scenarios and their resolution.
Screenshot 2025-01-03 at 2 53 46 PM
Screenshot 2025-01-03 at 2 54 45 PM

Failure Scenarios & Resolution Approaches

Failures encountered during the commit protocol can be categorized into two types: recoverable and unrecoverable failures.

Recoverable failures are those in which at least one of the servers retains the segment on disk.

Unrecoverable failures occur when none of the servers have the segment on disk.
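
As a minimal sketch of the distinction above (the names `FailureType` and `classify_failure` are hypothetical and not part of this PR), the classification depends only on whether any server still has the segment on disk:

```python
from enum import Enum
from typing import Iterable


class FailureType(Enum):
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


def classify_failure(servers_with_segment_on_disk: Iterable[str]) -> FailureType:
    """Classify a commit failure by how many servers retain the segment.

    Hypothetical helper: the input is the collection of servers that still
    hold the committing segment on local disk.
    """
    # At least one on-disk copy means the segment can still be uploaded.
    if any(True for _ in servers_with_segment_on_disk):
        return FailureType.RECOVERABLE
    # No copy anywhere: the data has to be ingested again from upstream.
    return FailureType.UNRECOVERABLE
```

For example, `classify_failure(["server-1"])` yields `FailureType.RECOVERABLE`, while an empty collection yields `FailureType.UNRECOVERABLE`.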

Recoverable Failures

Recoverable failures will be addressed through RealtimeSegmentValidationManager. This approach will handle scenarios such as upload failures and incomplete commit protocol executions.

The controller or server can run into issues in between any of the steps of the commit protocol as listed below:

Request Type: COMMIT_START

  1. Update the Segment ZK metadata for the committing segment (seg__0__0)
    • Change status to COMMITTING
    • Set endOffset
  2. Create Segment ZK metadata for the new segment (seg__0__1) with status IN_PROGRESS
  3. Update the Ideal State for the:
    • Committing segment (seg__0__0) to ONLINE
    • New/consuming segment (seg__0__1) to CONSUMING

Request Type: COMMIT_END_METADATA

  4. Update the Segment ZK metadata for the committing segment (seg__0__0)
    • Change status to DONE
    • Update the deep store URL
    • Set any additional metadata
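
The four steps above can be modelled in a few lines. This is only an illustrative sketch: ZK metadata and the Ideal State are represented as plain dictionaries, the Ideal State is collapsed to one state per segment (the real one tracks a state per replica), and the function and field names (`commit_start`, `commit_end_metadata`, `downloadUrl`) are assumptions rather than the controller's actual API.

```python
from typing import Dict, Optional


def commit_start(zk_metadata: Dict[str, dict], ideal_state: Dict[str, str],
                 committing: str, new_segment: str, end_offset: int) -> None:
    """COMMIT_START: steps 1 to 3 of the protocol."""
    # Step 1: mark the committing segment as COMMITTING and record its end offset.
    zk_metadata[committing]["status"] = "COMMITTING"
    zk_metadata[committing]["endOffset"] = end_offset
    # Step 2: create ZK metadata for the next segment.
    zk_metadata[new_segment] = {"status": "IN_PROGRESS"}
    # Step 3: update the Ideal State for both segments.
    ideal_state[committing] = "ONLINE"
    ideal_state[new_segment] = "CONSUMING"


def commit_end_metadata(zk_metadata: Dict[str, dict], committing: str,
                        download_url: str, extra: Optional[dict] = None) -> None:
    """COMMIT_END_METADATA: step 4 of the protocol."""
    meta = zk_metadata[committing]
    meta["status"] = "DONE"
    meta["downloadUrl"] = download_url  # hypothetical field name
    meta.update(extra or {})            # any additional metadata
```

A failure between any two of these writes leaves the dictionaries in a partially updated state, which is the situation the validation manager has to repair.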

The RealtimeSegmentValidationManager determines which step of the commit protocol failed and how it can be fixed. This is very similar to how commit protocol failures were handled before, with some minor changes.
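
To illustrate the idea (this is not the actual RealtimeSegmentValidationManager logic), the failed step can be inferred by inspecting the state that each step is supposed to leave behind. The sketch below assumes ZK metadata and the Ideal State are plain dictionaries with one state per segment; `next_incomplete_step` is a hypothetical name.

```python
from typing import Dict, Optional


def next_incomplete_step(zk_metadata: Dict[str, dict], ideal_state: Dict[str, str],
                         committing: str, new_segment: str) -> Optional[int]:
    """Return the first commit protocol step (1 to 4) not yet reflected in
    the stored state, or None if the commit is complete."""
    meta = zk_metadata.get(committing, {})
    status = meta.get("status")
    if status == "DONE":
        return None  # step 4 finished, nothing to repair
    if status != "COMMITTING" or "endOffset" not in meta:
        return 1     # committing segment metadata was never updated
    if new_segment not in zk_metadata:
        return 2     # metadata for the new segment is missing
    if (ideal_state.get(committing) != "ONLINE"
            or ideal_state.get(new_segment) != "CONSUMING"):
        return 3     # Ideal State update did not go through
    return 4         # only the COMMIT_END_METADATA update is outstanding
```

A periodic validation pass could call such a function for each committing segment and resume the protocol from the returned step.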

Unrecoverable Failures

These failures require re-ingesting the segment from upstream, then building it, uploading it, and updating its ZK metadata.
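
The order of operations can be sketched as below. Everything here is an assumption for illustration: `rebuild_lost_segment` is a hypothetical name, the four callables stand in for whatever the server and controller actually use, and the use of a start and end offset to bound the re-ingestion is assumed (the description above only mentions that endOffset is recorded at commit start).

```python
from typing import Any, Callable, List


def rebuild_lost_segment(start_offset: int, end_offset: int,
                         consume: Callable[[int, int], List[Any]],
                         build: Callable[[List[Any]], str],
                         upload: Callable[[str], str],
                         update_zk_metadata: Callable[[str], None]) -> str:
    """Re-create a segment that no server has on disk."""
    # 1. Ingest the same offset range again from the upstream stream.
    rows = consume(start_offset, end_offset)
    # 2. Build the segment from the re-ingested rows.
    segment_file = build(rows)
    # 3. Upload the built segment and obtain its download URL.
    download_url = upload(segment_file)
    # 4. Record the result in the segment's ZK metadata.
    update_zk_metadata(download_url)
    return download_url
```

Passing the steps in as callables keeps the sketch independent of any particular stream, segment builder, or deep store implementation.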

gortiz and others added 30 commits November 5, 2024 12:48
… order (apache#14391)

* order index types so servers can process them in a more deterministic order

* refine fwd_index_handler logs a bit for easy debug
…4340)

* Added support to perform task validations for plug-in tasks

* trigger build2
Fix NPE and execution when --help was specified. Remove code that was supposed to generate help message.
* [timeseries] Fix Server Selection Bug + Enforce Timeout

* pass requestId and brokerId to servers

* undo the 2 server quickstart change
…maConformingTransformer (apache#14351)

* Remove emitting null value fields during data transformation.

* Fix lint issues.

* Revise based on comments
deepthi912 and others added 23 commits December 26, 2024 18:27
Bumps software.amazon.awssdk:bom from 2.29.40 to 2.29.41.

…afety issue in the JDBC client (apache#14723)

Use DateTimeFormatter instead of SimpleDateFormat to resolve thread safety issue in the JDBC client
…14728)

Bumps [com.puppycrawl.tools:checkstyle](https://github.com/checkstyle/checkstyle) from 10.21.0 to 10.21.1.
- [Release notes](https://github.com/checkstyle/checkstyle/releases)
- [Commits](checkstyle/checkstyle@checkstyle-10.21.0...checkstyle-10.21.1)

Bumps software.amazon.awssdk:bom from 2.29.41 to 2.29.43.

1. Changing FSM
2. Changing the 3 steps performed during the commit protocol to update ZK and Ideal state
1. Changes in the commit protocol to start segment commit before the build
2. Changes in the BaseTableDataManager to ensure that the locally built segment is replaced by a downloaded one
   only when the CRC is present in the ZK Metadata
3. Changes in the download segment method to allow waited download in case of pauseless consumption
…segment commit end metadata call

Refactoring code for readability
… ingestion by moving it out of streamConfigMap
…auseless ingestion in RealtimeSegmentValidationManager
…d by RealtimeSegmentValidationManager to fix commit protocol failures
@9aman 9aman marked this pull request as ready for review January 3, 2025 09:53
@9aman 9aman changed the base branch from pauseless_ingestion_without_failure_scenarios to master January 3, 2025 09:56
@9aman 9aman closed this Jan 3, 2025
@9aman 9aman deleted the resolve-failures-pauseless-ingestion branch January 3, 2025 09:58
@9aman 9aman restored the resolve-failures-pauseless-ingestion branch January 3, 2025 09:58