
[fix][broker] Fix repeatedly acquired pending reads quota #23869

Merged

Conversation

poorbarcode
Contributor

Motivation

Background

  • The broker limits pending reads to Bookies when managedLedgerMaxReadsInFlightSize is enabled.
  • [managed-ledger] Do not send duplicate reads to BK/offloaders #17241 introduced a mechanism that merges BK read requests when possible; for example (see the sketch after this list):
    • Request-1: read entries 50~59
    • Request-2: read entries 50~69
    • Request-2 waits for Request-1 and only sends a real read for 60~69 after Request-1 finishes.
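
A minimal sketch of the merge idea, using hypothetical types for illustration (the real logic lives in PendingReadsManager); entry ranges are treated as inclusive:

```java
// Hypothetical illustration of the read-merge behaviour described above; not the
// actual PendingReadsManager code.
public class ReadMergeSketch {
    record Range(long first, long last) {}

    // Returns the part of 'wanted' that the in-flight read does not cover and that
    // must still be read from BookKeeper, or null if it is fully covered.
    static Range missingTail(Range inFlight, Range wanted) {
        if (wanted.last() <= inFlight.last()) {
            return null; // fully covered: just wait for the in-flight read to complete
        }
        return new Range(Math.max(wanted.first(), inFlight.last() + 1), wanted.last());
    }

    public static void main(String[] args) {
        Range request1 = new Range(50, 59);
        Range request2 = new Range(50, 69);
        // Request-2 attaches to Request-1 for 50~59 and only reads 60~69 itself.
        System.out.println(missingTail(request1, request2)); // Range[first=60, last=69]
    }
}
```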

Issue

  • Request-2 above acquires the pending reads quota repeatedly (see the sketch after this list):
    • It acquires quota for the range 50~70 when it is created.
    • It acquires quota again for another range, 61~70, after Request-1 is finished.
  • Expected: Request-2 acquires 20 quota.
  • Actual: Request-2 acquires 40 quota.
  • You can run testPreciseLimitation to reproduce the issue.
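
Conceptually, the double accounting looks like this (a hedged sketch with a hypothetical ReadQuota interface; the real accounting is done against managedLedgerMaxReadsInFlightSize in the range entry cache):

```java
// Hypothetical sketch of the bug; ReadQuota is an illustrative interface, not broker code.
public class DoubleAcquireSketch {
    interface ReadQuota {
        void acquire(long units);
        void release(long units);
    }

    // Before this fix: the same logical read accounts for its quota twice.
    static void request2Buggy(ReadQuota quota, long estimatedUnits) {
        quota.acquire(estimatedUnits);  // acquired when Request-2 is created
        // ... Request-1 completes, Request-2 now issues its residual read ...
        quota.acquire(estimatedUnits);  // acquired again for the residual read
        // Request-2 has now reserved twice the quota it actually needs (40 instead of the
        // expected 20 in the description above), so the in-flight limit fires too early.
    }
}
```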

Modifications

  • Fix the issue so that the pending reads quota is acquired only once per read request (see the sketch below).
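
One possible shape of the fix, shown as a hedged sketch (hypothetical names; the actual change is in RangeEntryCacheImpl/PendingReadsManager): the path that has already reserved quota tells the residual read path not to reserve it again.

```java
// Hypothetical sketch only; parameter and type names are illustrative, not the PR's API.
public class SingleAcquireSketch {
    interface ReadQuota {
        void acquire(long units);
        void release(long units);
    }

    // After the fix: quota is acquired at most once per logical read.
    static void issueRead(ReadQuota quota, long estimatedUnits, boolean quotaAlreadyAcquired) {
        if (!quotaAlreadyAcquired) {
            quota.acquire(estimatedUnits);
        }
        // ... send the read to BookKeeper; whoever acquired the quota releases it
        // once the entries have been handed to the caller ...
    }
}
```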

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added this to the 4.1.0 milestone Jan 20, 2025
@poorbarcode poorbarcode self-assigned this Jan 20, 2025
@poorbarcode poorbarcode changed the title [fix][broker]Fix repeated acquire pending reads quota [fix][broker]Fix repeatedly acquire pending reads quota Jan 20, 2025
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 20, 2025
@lhotari
Member

lhotari commented Jan 23, 2025

I agree that this issue exists; however, there are broader issues in the managedLedgerMaxReadsInFlightSizeInMB solution.

Sharing some context about the issues I have found and what I currently have in progress:

I have reported issues #23482, #23504, #23505 and #23506.
These issues are related to the broker cache and managedLedgerMaxReadsInFlightSizeInMB and dispatcherMaxReadSizeBytes limits.

I already have changes to address these issues in an experimental branch, pending the submission of individual PRs. Addressing the lack of caching for replay queue messages requires broader changes to the broker cache, and those will be covered by a new PIP.

Some earlier details were shared in this comment: #23524 (comment).

Due to issues #23482 and #23506, I don't think this PR alone would resolve the problems. Regarding broker OOMEs, issue #23504 would also need to be resolved. It's possible that dispatcherDispatchMessagesInSubscriptionThread=false and managedLedgerReadEntryTimeoutSeconds=15 (or some other reasonable value) could mitigate some of the issues.
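
For reference, the mitigation mentioned above expressed as broker.conf settings (the values are the ones suggested in this comment; adjust for your environment):

```properties
# Disable dispatching messages in a separate subscription thread (the default is true).
dispatcherDispatchMessagesInSubscriptionThread=false
# Timeout for reading entries from BookKeeper; 0 (the default) disables the timeout.
managedLedgerReadEntryTimeoutSeconds=15
```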

@poorbarcode
Contributor Author

@lhotari

I already have changes to address these issues in an experimental branch, pending the submission of individual PRs. Addressing the lack of caching for replay queue messages requires broader changes to the broker cache, and those will be covered by a new PIP.

Some earlier details were shared in this comment: #23524 (comment).

Due to issues #23482 and #23506, I don't think this PR alone would resolve the problems. Regarding broker OOMEs, issue #23504 would also need to be resolved. It's possible that dispatcherDispatchMessagesInSubscriptionThread=false and managedLedgerReadEntryTimeoutSeconds=15 (or some other reasonable value) could mitigate some of the issues.

Agreed; let's fix them one by one, which is easier to review, and we should add tests for each case.

@poorbarcode poorbarcode requested a review from lhotari January 23, 2025 08:30
@heesung-sn
Contributor

I think we should use AsyncTokenBucket or Guava's RateLimiter.tryAcquire for this rate limiter.

IMHO, the current logic, which requires the caller to release the acquired permit, is error-prone when the caller forgets to release it. Instead, I think we'd better use a token-bucket-based limiter, which can automatically refill the bucket.
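
For illustration, a minimal sketch of the Guava suggestion (the rate value is made up; this is not code from this PR, and as noted in the reply below a rate-based bucket may not fit the in-flight-size use case exactly):

```java
import com.google.common.util.concurrent.RateLimiter;

// Hypothetical sketch of the token-bucket idea: permits refill over time, so nothing
// has to be explicitly released by the caller.
public class ReadRateLimitSketch {
    // e.g. allow roughly 100 MB/s of reads; RateLimiter refills permits automatically.
    private final RateLimiter readLimiter = RateLimiter.create(100 * 1024 * 1024);

    boolean tryRead(int estimatedReadSizeBytes) {
        // Non-blocking: returns false immediately if the bucket lacks enough permits.
        return readLimiter.tryAcquire(estimatedReadSizeBytes);
    }
}
```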

@lhotari
Member

lhotari commented Jan 24, 2025

I think we should use AsyncTokenBucket or Guava's RateLimiter.tryAcquire for this rate limiter.

IMHO, the current logic, which requires the caller to release the acquired permit, is error-prone when the caller forgets to release it. Instead, I think we'd better use a token-bucket-based limiter, which can automatically refill the bucket.

@heesung-sn I agree, the current solution is problematic. However, a token bucket isn't optimal for this use case. I have a redesign of the managedLedgerMaxReadsInFlightSize solution in my experimental branch (example). There are a lot of different changes in the branch, so it might be hard to see what it's about. I'll extract it into a clean PR when the time comes.
First, #23892 and #23894 need reviews so that I can make further progress. Do you have a chance to review them?

Member

@lhotari lhotari left a comment

LGTM

@lhotari lhotari changed the title [fix][broker]Fix repeatedly acquire pending reads quota [fix][broker] Fix repeatedly acquire pending reads quota Jan 27, 2025
@lhotari lhotari changed the title [fix][broker] Fix repeatedly acquire pending reads quota [fix][broker] Fix repeatedly acquired pending reads quota Jan 27, 2025
@codecov-commenter

codecov-commenter commented Jan 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.24%. Comparing base (bbc6224) to head (977b33a).
Report is 871 commits behind head on master.

Additional details and impacted files


@@             Coverage Diff              @@
##             master   #23869      +/-   ##
============================================
+ Coverage     73.57%   74.24%   +0.66%     
+ Complexity    32624    32222     -402     
============================================
  Files          1877     1853      -24     
  Lines        139502   143618    +4116     
  Branches      15299    16310    +1011     
============================================
+ Hits         102638   106622    +3984     
+ Misses        28908    28610     -298     
- Partials       7956     8386     +430     
Flag Coverage Δ
inttests 26.78% <60.00%> (+2.19%) ⬆️
systests 23.16% <60.00%> (-1.16%) ⬇️
unittests 73.75% <100.00%> (+0.90%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...keeper/mledger/impl/cache/PendingReadsManager.java 86.80% <ø> (+0.14%) ⬆️
...keeper/mledger/impl/cache/RangeEntryCacheImpl.java 63.48% <100.00%> (+4.73%) ⬆️

... and 1025 files with indirect coverage changes

@lhotari lhotari merged commit 331a997 into apache:master Jan 27, 2025
64 of 67 checks passed
@lhotari
Member

lhotari commented Jan 27, 2025

I think we should use AsyncTokenBucket or Guava's RateLimiter.tryAcquire for this rate limiter.
IMHO, the current logic, which requires the caller to release the acquired permit, is error-prone when the caller forgets to release it. Instead, I think we'd better use a token-bucket-based limiter, which can automatically refill the bucket.

@heesung-sn I agree, the current solution is problematic. However, a token bucket isn't optimal for this use case. I have a redesign of the managedLedgerMaxReadsInFlightSize solution in my experimental branch (example). There are a lot of different changes in the branch, so it might be hard to see what it's about. I'll extract it into a clean PR when the time comes. First, #23892 and #23894 need reviews so that I can make further progress. Do you have a chance to review them?

@heesung-sn @poorbarcode To continue the improvements addressing issues with managedLedgerMaxReadsInFlightSize, I have created #23901, which addresses the problems I referred to earlier. Please review.

lhotari pushed a commit that referenced this pull request Jan 28, 2025
lhotari pushed a commit that referenced this pull request Jan 29, 2025
lhotari pushed a commit that referenced this pull request Jan 29, 2025
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jan 31, 2025
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 3, 2025