[fix][broker]If there is a deadlock in the service, the probe should return a failure because the service may be unavailable #10

Open
wants to merge 183 commits into
base: master


@yyj8 yyj8 commented Nov 23, 2024

…return a failure because the service may be unavailable

Fixes #xyz

Main Issue: #xyz

PIP: #xyz

Motivation

In some scenarios the broker must recover from a deadlock automatically rather than through manual intervention. For example, when the service is deployed in a customer environment, we cannot manage it directly. If the service is deadlocked, the probe should return a failure because the service may be unavailable. The probe failure then triggers a node restart, which resolves the deadlock.

Modifications

Add deadlock detection to the probe. If a deadlock exists, print the thread stacks and return a service-unavailable exception.
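The following is a minimal sketch of how such a check could look, not the code in this PR. It uses the standard JDK `ThreadMXBean.findDeadlockedThreads()` call. The class name `DeadlockProbe`, the nested `ServiceUnavailableException`, and the dump interval parameter are hypothetical names chosen for illustration. The `lastPrintThreadDumpTimestamp` field mirrors the field named in a later commit message, but its use here is an assumption.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper for illustration; not the class added by this PR. */
final class DeadlockProbe {

    /** Hypothetical exception signalling that the probe should fail. */
    static final class ServiceUnavailableException extends RuntimeException {
        ServiceUnavailableException(String message) {
            super(message);
        }
    }

    private final ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
    private final long dumpIntervalMillis;      // assumed throttle interval
    private long lastPrintThreadDumpTimestamp;  // 0 means "never printed"

    DeadlockProbe(long dumpIntervalMillis) {
        this.dumpIntervalMillis = dumpIntervalMillis;
    }

    /** Returns the IDs of deadlocked threads, or an empty array if there are none. */
    long[] findDeadlockedThreadIds() {
        long[] ids = threadBean.findDeadlockedThreads(); // null when no deadlock exists
        return ids == null ? new long[0] : ids;
    }

    /** Throws if a deadlock exists; prints thread info at most once per interval. */
    synchronized void check() {
        long[] ids = findDeadlockedThreadIds();
        if (ids.length == 0) {
            return; // healthy: the probe can report success
        }
        long now = System.currentTimeMillis();
        if (now - lastPrintThreadDumpTimestamp >= dumpIntervalMillis) {
            lastPrintThreadDumpTimestamp = now;
            // Include locked monitors and synchronizers in the report.
            for (ThreadInfo info : threadBean.getThreadInfo(ids, true, true)) {
                if (info != null) {
                    System.err.print(info); // ThreadInfo.toString() truncates deep stacks
                }
            }
        }
        throw new ServiceUnavailableException(
                "Deadlock detected in " + ids.length + " thread(s)");
    }
}
```

A health endpoint could call `check()` and map the exception to an HTTP 503 response. Throttling the dump keeps a frequently polled probe from flooding the log while the deadlock persists.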

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

…return a failure because the service may be unavailable
yyj8 and others added 28 commits November 25, 2024 22:22
…return a failure because the service may be unavailable
…e should return a failure because the service may be unavailable
…e should return a failure because the service may be unavailable
…e should return a failure because the service may be unavailable
…e should return a failure because the service may be unavailable.
…e should return a failure because the service may be unavailable.
…e should return a failure because the service may be unavailable. Add lastPrintThreadDumpTimestamp field to control the interval time for printing complete thread stack information.
…e should return a failure because the service may be unavailable. Add unit testing code.
…e should return a failure because the service may be unavailable. Add unit testing code.
…e should return a failure because the service may be unavailable. Add unit testing code.
…e should return a failure because the service may be unavailable. Add unit testing code, shutdown deadlock thread.
…e should return a failure because the service may be unavailable. Add unit testing code, shutdown deadlock thread.
…e should return a failure because the service may be unavailable. Modify unit testing code, use org.mockito.Mockito replaces MockServletContext.
…hes (apache#23772)

### Motivation

**Background**
There is a mechanism that prevents the callback of ML data ledger creation from being handled more than once:
- Start a scheduled task to check whether the creation has timed out.
- When a callback is received:
  - Check whether the future (`@param ctx` of `BK.createAsync`) is already done.
  - If done: the creation timed out before it completed.
  - Otherwise: it is a real callback from BK.

**Issue:**
However, the timeout event calls the same callback as above, so the steps are as follows, which you can reproduce with the test `testCreateDataLedgerTimeout`:
- Start creating a data ledger
  - Call `BK.createAsync` 
- Timeout
  - Mark the future(`@param ctx` of `BK.createAsync`) as completed exceptionally. 
  - Trigger the callback related to ledger creation. 
    - Check whether the future(`@param ctx` of `BK.createAsync`) has been done or not.
    - If done: do nothing.
- Creation is completed.
  - Trigger the callback related to ledger creation. 
    - Check whether the future(`@param ctx` of `BK.createAsync`) has been done or not.
    - If done: do nothing.
- Issue: The callback for ledger creation will never be called.
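
The sequence above can be sketched with a plain `CompletableFuture`. This is an illustrative model under stated assumptions, not the managed-ledger code: the class `LedgerCreateGuardSketch` and all method names are hypothetical, and `guardedComplete` shows one possible guard rather than the change made in this commit.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the guard described above; not the actual ML code. */
final class LedgerCreateGuardSketch {

    /** Counts how often the creation result was actually delivered. */
    final AtomicInteger delivered = new AtomicInteger();

    /** Guard as described: do nothing whenever the ctx future is already done. */
    void buggyCallback(CompletableFuture<Long> ctx) {
        if (ctx.isDone()) {
            return; // assumes the other path already delivered the result
        }
        delivered.incrementAndGet();
    }

    /** Timeout path as described: completes ctx first, then reuses the same callback. */
    void buggyTimeout(CompletableFuture<Long> ctx) {
        ctx.completeExceptionally(new TimeoutException("create ledger timed out"));
        buggyCallback(ctx); // ctx is done by now, so nothing is delivered
    }

    /** One possible guard (an assumption): only the caller that completes ctx delivers. */
    void guardedComplete(CompletableFuture<Long> ctx, Long ledgerId, Throwable error) {
        boolean won = (error == null)
                ? ctx.complete(ledgerId)
                : ctx.completeExceptionally(error);
        if (won) {
            delivered.incrementAndGet();
        }
    }
}
```

With the first pair of methods, calling `buggyTimeout(ctx)` and then `buggyCallback(ctx)` leaves `delivered` at zero, which matches the issue described. With `guardedComplete`, whichever of the timeout or the real completion arrives first delivers exactly once, because `complete` and `completeExceptionally` return `true` only for the call that transitions the future.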

![Screenshot 2024-12-24 at 00 14 38](https://github.com/user-attachments/assets/44ed19d2-7238-45a4-9186-c127f6ed14f7)
![Screenshot 2024-12-24 at 00 14 08](https://github.com/user-attachments/assets/349f39ff-7e98-4a09-9af2-f80082339592)


### Modifications

Fix the issue

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: x
rdhabalia and others added 29 commits March 4, 2025 22:37
…sage metadata contains the same key but with different values (apache#23927)
@github-actions github-actions bot added the PIP label Mar 18, 2025