Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] refactor cursor read entry process to fix dead loop read issue of txn #22944

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

TakaHiR07
Copy link
Contributor

@TakaHiR07 TakaHiR07 commented Jun 19, 2024

Motivation

  1. FIx the issue [Bug][broker] cursor will read in dead loop when do tailing-read with enableTransaction #22943. This is issue is serious and actually cause txn unavailable.

This pr is similar to a previous pr #14286. Since previous pr is closed, I implement it in master branch and improve some logic and add some test.

  1. Besides, this pr would also related to another issue, which is also need to be improved. [improve][broker] cursor read entry would trigger readMoreEntry() one more time when addWaitingCursor and notify #23027

Modifications

  1. Add a field named maxReadPosition in ManagedLedgerImpl and if read op enable maxReadPosition, add check hasMoreEntriesByMaxReadPosition()
  2. Add a field named waitingCursorsByMaxReadPosition in ManagedLedgerImpl, when cursor has read op in wait state, we can put this read op in to this queue. If maxReadPosition updated, we will pool it and notify this read op.
  3. In topicTransaction buffer, when any updated maxReadposition op we should sync it to ManagedLedgerImpl maxReadPosition.

Currently :

  • If readPosition <= lastConfirmedEntry && readPosition <= maxReadPosition , read immediately
  • If readPosition <= lastConfirmedEntry && readPosition > maxReadPosition , wait by max read position
  • If readPosition > lastConfirmedEntry , wait by cursor

I make many comments in the code, which maybe the concerned point.

And this pr try to retain the same process as before if disable txn. Aiming to fix the issue after enable txn.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 19, 2024
@dao-jun
Copy link
Member

dao-jun commented Jun 19, 2024

The implementation seems could work, but I'm not sure is it a good idea that maintain maxReadPosition in ML layer, since it is a messaging layer isolation policy which is for the purpose of providing READ_COMMITTED semantic.

// because readPosition > maxReadPosition

// Skip deleted entries.
skipCondition = skipCondition == null ? this::isMessageDeleted : skipCondition.or(this::isMessageDeleted);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is better to simple the code since there are some code duplications.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have done.

}

if (isClosed()) {
callback.readEntriesFailed(new CursorAlreadyClosedException("Cursor was already closed"), ctx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to release ReadEntryOp here?

Copy link
Contributor Author

@TakaHiR07 TakaHiR07 Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. need to recycle the op first. But this is better fixed by another pr.

callback.readEntriesFailed(new NoMoreEntriesToReadException("Topic was terminated"), ctx);
}
} catch (Throwable t) {
callback.readEntriesFailed(new ManagedLedgerException(t), ctx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to release ReadEntryOp?

// At this point we registered for notification and still there were no more available
// entries by maxReadPosition.
// If the managed ledger was indeed terminated, we need to notify the cursor
callback.readEntriesFailed(new NoMoreEntriesToReadException("Topic was terminated"), ctx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to release ReadEntryOp?

@TakaHiR07 TakaHiR07 force-pushed the refactor_cursor_read_entry_process branch from 2760f43 to 5f68a58 Compare June 19, 2024 14:20
@TakaHiR07
Copy link
Contributor Author

TakaHiR07 commented Jun 20, 2024

The implementation seems could work, but I'm not sure is it a good idea that maintain maxReadPosition in ML layer, since it is a messaging layer isolation policy which is for the purpose of providing READ_COMMITTED semantic.

Firstly I also try to remove maxReadPosition in ML layer. However, if the process is hasMoreEntry() check -> maxReadPosition change -> add to waitingCursor, there is a risk that waitingCursor never notify. So ML layer need to know the maxReadPosition actual value.

@thetumbled
Copy link
Member

PTAL, thanks. @congbobo184 @liangyepianzhou

@TakaHiR07 TakaHiR07 force-pushed the refactor_cursor_read_entry_process branch 2 times, most recently from 0d0b50b to e3a93a3 Compare June 20, 2024 03:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants