[fix] Make operations on `individualDeletedMessages` in lock scope #22966

dao-jun · 2024-06-24T14:13:54Z

Motivation

In #22908 we introduced ConcurrentRoaringBitSet, which is based on StampedLock and RoaringBitmap, to reduce the memory usage and GC pauses of BitSet.

However, ConcurrentRoaringBitSet has a concurrency issue.

It can throw an NPE when ConcurrentRoaringBitSet#get and ConcurrentRoaringBitSet#set are called from multiple threads while guarded by StampedLock. The situation is similar to #18388.

See:
RoaringBitmap#add
RoaringBitmap#get

Modifications

Remove ConcurrentBitSet
Rename ConcurrentOpenLongPairRangeSet to OpenLongPairRangeSet and mark it as NotThreadSafe.
Perform all operations on ManagedCursorImpl#individualDeletedMessages within the scope of the ReadWriteLock.
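
As a rough illustration of the last item, the sketch below guards a non-thread-safe structure with a single ReentrantReadWriteLock. The class `GuardedDeletedSet` and its methods are hypothetical, and `java.util.BitSet` merely stands in for OpenLongPairRangeSet; this is not the code from ManagedCursorImpl.

```java
import java.util.BitSet;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a cursor that owns a non-thread-safe set. */
public class GuardedDeletedSet {
    // Plays the role of the ReadWriteLock field in the owning class (assumption).
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // java.util.BitSet is not thread-safe, so every access goes through the lock.
    private final BitSet deleted = new BitSet();

    /** Mutation: requires the write lock. */
    public void markDeleted(int position) {
        lock.writeLock().lock();
        try {
            deleted.set(position);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Pure read: the read lock is enough and allows concurrent readers. */
    public boolean isDeleted(int position) {
        lock.readLock().lock();
        try {
            return deleted.get(position);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Another pure read, again under the read lock. */
    public int deletedCount() {
        lock.readLock().lock();
        try {
            return deleted.cardinality();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With the lock owned by the caller, the collection itself no longer needs any internal synchronization, which is why it can be marked as NotThreadSafe.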

lhotari

I think that we need to find another solution. ReadWriteLock adds a lot more overhead than StampedLock.

I wonder if it would be a viable option to catch exceptions and retry with a read lock if that happens?
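
To make the idea concrete, here is a minimal sketch of an optimistic read that falls back to a read lock when validation fails or when the unlocked read throws. `OptimisticBitSet` is a hypothetical class and `java.util.BitSet` stands in for the bitmap; it is not the ConcurrentRoaringBitSet implementation.

```java
import java.util.BitSet;
import java.util.concurrent.locks.StampedLock;

/** Hypothetical wrapper: optimistic read first, read lock as the fallback. */
public class OptimisticBitSet {
    private final StampedLock lock = new StampedLock();
    private final BitSet bits = new BitSet(); // not thread-safe on its own

    public void set(int index) {
        long stamp = lock.writeLock();
        try {
            bits.set(index);
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    public boolean get(int index) {
        long stamp = lock.tryOptimisticRead();
        if (stamp != 0L) { // 0 means a writer holds the lock right now
            try {
                // This read runs WITHOUT holding any lock, so it may observe
                // a half-updated structure and throw.
                boolean value = bits.get(index);
                if (lock.validate(stamp)) {
                    return value; // no write happened in between
                }
            } catch (RuntimeException e) {
                // Possibly caused by a concurrent write; retry under the read lock.
            }
        }
        stamp = lock.readLock();
        try {
            // A genuine error (e.g. a negative index) is rethrown from here.
            return bits.get(index);
        } finally {
            lock.unlockRead(stamp);
        }
    }
}
```

Whether catching exceptions this way is cheaper than always taking a lock would have to be measured.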

dao-jun · 2024-06-24T15:06:34Z

> I think that we need to find another solution. ReadWriteLock adds a lot more overhead than StampedLock.

Yes, but RoaringBitmap is not designed for concurrency at all. This PR is a quick fix, and we can make further improvements in the future.

dao-jun · 2024-06-24T15:12:07Z

> I wonder if it would be a viable option to catch exceptions and retry with a read lock if that happens?

Then we may catch a lot of exceptions when a broker is under high throughput. I'm not sure whether that cost is lower than the cost of the RWLock.

lhotari · 2024-06-24T15:23:44Z

> > I wonder if it would be a viable option to catch exceptions and retry with a read lock if that happens?
>
> Then we may catch a lot of exceptions when a broker is under high throughput. I'm not sure whether that cost is lower than the cost of the RWLock.

That's a valid concern. We should investigate the different choices and experiment.

lhotari · 2024-06-24T15:26:30Z

I think that we should revert the migration to RoaringBitSet in branch-3.0, branch-3.2 and branch-3.3 so that we don't need to rush with the solution.

lhotari · 2024-06-24T16:51:49Z

I reverted the changes in branch-3.0, branch-3.2 and branch-3.3. Here's the PR to revert the change in master branch: #22968 . It's better to have a fresh start with a proper fix that is validated so that it doesn't cause performance regressions and also addresses the concurrency issues. The concern about switching to ReadWriteLock is about it causing a performance regression. It's possible that it's not a valid concern, but let's validate that before applying the solution.

dao-jun · 2024-06-24T18:33:39Z

I ran a rough, non-rigorous test:

    @Test
    public void test() {
        long start = System.currentTimeMillis();
        // Each thread counts down once when its loop finishes.
        CountDownLatch latch = new CountDownLatch(2);
        ConcurrentRoaringBitSet bitSet = new ConcurrentRoaringBitSet();
        // Writer thread: sets the same bit 100 million times.
        new Thread(() -> {
            for (int i = 0; i < 100000000; i++) {
                bitSet.set(1);
            }
            latch.countDown();
        }).start();
        // Reader thread: reads the same bit 100 million times.
        new Thread(() -> {
            for (int i = 0; i < 100000000; i++) {
                bitSet.get(1);
            }
            latch.countDown();
        }).start();

        // Wait for both threads, then report the total wall-clock time.
        try {
            latch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("Time: " + (System.currentTimeMillis() - start));
    }

I started 2 threads that call the get/set methods on the ReadWriteLock-based and the StampedLock-based ConcurrentRoaringBitSet, each thread looping 100 million times.
For the ReadWriteLock-based ConcurrentRoaringBitSet, the total duration is around 9.5s.
For the StampedLock-based ConcurrentRoaringBitSet, the total duration is around 8.5s.

Maybe we don't need to worry about the performance regression?

dao-jun · 2024-06-24T18:40:59Z

When we do read-only operations, the StampedLock-based ConcurrentRoaringBitSet is faster than the ReadWriteLock-based one (about 5 times faster), but our usage of ConcurrentRoaringBitSet mixes reads and writes (about 1:1).

lhotari · 2024-06-24T19:02:14Z

> When we do read-only operations, the StampedLock-based ConcurrentRoaringBitSet is faster than the ReadWriteLock-based one (about 5 times faster), but our usage of ConcurrentRoaringBitSet mixes reads and writes (about 1:1).

In Pulsar we have https://github.com/apache/pulsar/tree/master/microbench module with JMH. I think JMH is better for comparisons. For Pulsar, the efficiency also matters so the comparison might not be that simple.

btw. In Pulsar ConcurrentOpenLongPairRangeSet is only used in RangeSetWrapper and the only usage of that is in ManagedCursorImpl for individualDeletedMessages. In many cases, the operations on individualDeletedMessages are already protected by the ReadWriteLock field lock in ManagedCursorImpl.
It might be better to make the lock usage consistent. We wouldn't need ConcurrentRoaringBitSet in the Pulsar code base in that case as long as we document that ConcurrentOpenLongPairRangeSet isn't really thread safe. The thread safe solution could use the old solution.

dao-jun · 2024-06-25T06:06:43Z

> btw. In Pulsar ConcurrentOpenLongPairRangeSet is only used in RangeSetWrapper and the only usage of that is in ManagedCursorImpl for individualDeletedMessages. In many cases, the operations on individualDeletedMessages are already protected by the ReadWriteLock field lock in ManagedCursorImpl.
> It might be better to make the lock usage consistent. We wouldn't need ConcurrentRoaringBitSet in the Pulsar code base in that case as long as we document that ConcurrentOpenLongPairRangeSet isn't really thread safe. The thread safe solution could use the old solution.

It makes sense, I addressed this, PTAL

lhotari · 2024-06-25T07:32:04Z

> It makes sense, I addressed this, PTAL

@dao-jun Looks good, I'll soon review in more detail. Please update the PR title and description so that it describes the motivation and modifications of this PR more accurately.

lhotari

Please use write lock for individualDeletedMessages.resetDirtyKeys(); call in buildIndividualDeletedMessageRanges method.

lhotari · 2024-06-26T07:07:01Z

Since the previous change #22908 was rolled back by #22968, please rebase the changes.

lhotari

> Rename ConcurrentOpenLongPairRangeSet to OpenLongPairRangeSet and mark it as NotThreadSafe.

I guess this change and the switch to use RoaringBitSet (in version 1.1.0) was lost in rebasing?

lhotari · 2024-06-26T07:46:10Z

> Please use write lock for individualDeletedMessages.resetDirtyKeys(); call in buildIndividualDeletedMessageRanges method.

This is actually a real bug in the current implementation and needs to be fixed even if we wouldn't switch to use RoaringBitMap's RoaringBitSet.

lhotari · 2024-06-26T07:48:01Z

> > Rename ConcurrentOpenLongPairRangeSet to OpenLongPairRangeSet and mark it as NotThreadSafe.
>
> I guess this change and the switch to use RoaringBitSet (in version 1.1.0) was lost in rebasing?

One possibility would be to complete this PR by switching to the non-thread-safe version of ConcurrentOpenLongPairRangeSet using an ordinary BitSet, and then switch to RoaringBitSet in a follow-up PR.

It's possible that using StampedLock in ConcurrentBitSet results in similar problems as we had with StampedLock in ConcurrentRoaringBitSet.

By looking at the code of BitSet, it seems that assertions in this method could fail in ConcurrentBitSet:

    private void checkInvariants() {
        assert(wordsInUse == 0 || words[wordsInUse - 1] != 0);
        assert(wordsInUse >= 0 && wordsInUse <= words.length);
        assert(wordsInUse == words.length || words[wordsInUse] == 0);
    }

However the problems are hidden since assertions aren't commonly enabled in production.

dao-jun · 2024-06-26T08:01:03Z

> > Please use write lock for individualDeletedMessages.resetDirtyKeys(); call in buildIndividualDeletedMessageRanges method.
>
> This is actually a real bug in the current implementation and needs to be fixed even if we wouldn't switch to use RoaringBitMap's RoaringBitSet.

Yes, individualDeletedMessages.resetDirtyKeys() is a WRITE operation, but the current code only acquires a READ lock for it.
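
The point can be sketched as follows. A read lock admits several threads at once, so a mutating call made under it can race with other readers. ReentrantReadWriteLock also cannot upgrade a read lock to a write lock, so the mutation has to sit in a section that takes the write lock from the start. `DirtyKeyTracker` and its methods are hypothetical names chosen for illustration, with a `TreeSet` standing in for the dirty-key state; this is not the ManagedCursorImpl code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical tracker whose reset step mutates state and so needs the write lock. */
public class DirtyKeyTracker {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final TreeSet<Long> dirtyKeys = new TreeSet<>(); // not thread-safe

    public void markDirty(long key) {
        lock.writeLock().lock();
        try {
            dirtyKeys.add(key);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /**
     * Copies the dirty keys and clears them. The clear is a write, so the
     * whole method holds the write lock instead of a read lock.
     */
    public List<Long> snapshotAndReset() {
        lock.writeLock().lock();
        try {
            List<Long> snapshot = new ArrayList<>(dirtyKeys);
            dirtyKeys.clear();
            return snapshot;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```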

codecov-commenter · 2024-06-26T09:47:26Z

Codecov Report

Attention: Patch coverage is 91.48936% with 4 lines in your changes missing coverage. Please review.

Project coverage is 73.43%. Comparing base (bbc6224) to head (66b228c).
Report is 424 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #22966      +/-   ##
============================================
- Coverage     73.57%   73.43%   -0.15%     
- Complexity    32624    33219     +595     
============================================
  Files          1877     1903      +26     
  Lines        139502   142680    +3178     
  Branches      15299    15574     +275     
============================================
+ Hits         102638   104771    +2133     
- Misses        28908    29891     +983     
- Partials       7956     8018      +62

| Flag | Coverage Δ | |
| --- | --- | --- |
| inttests | `27.79% <38.29%> (+3.21%)` | ⬆️ |
| systests | `24.76% <36.17%> (+0.44%)` | ⬆️ |
| unittests | `72.46% <91.48%> (-0.39%)` | ⬇️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ | |
| --- | --- | --- |
| ...apache/bookkeeper/mledger/ManagedLedgerConfig.java | `96.38% <ø> (+0.08%)` | ⬆️ |
| .../common/util/collections/OpenLongPairRangeSet.java | `90.00% <100.00%> (ø)` | |
| ...pache/bookkeeper/mledger/impl/RangeSetWrapper.java | `94.33% <80.00%> (ø)` | |
| ...che/bookkeeper/mledger/impl/ManagedCursorImpl.java | `80.00% <92.68%> (+0.70%)` | ⬆️ |

... and 469 files with indirect coverage changes

lhotari · 2024-06-26T15:04:15Z

LGTM, good work @dao-jun

Fix CurrentRoaringBitSet concurrency issue.

462c140

dao-jun added the release/blocker, ready-to-test, release/3.3.1, release/3.0.6 and release/3.2.4 labels Jun 24, 2024

dao-jun added this to the 3.4.0 milestone Jun 24, 2024

dao-jun requested a review from lhotari June 24, 2024 14:13

dao-jun self-assigned this Jun 24, 2024

github-actions bot added the doc-label-missing label Jun 24, 2024

apache deleted a comment from github-actions bot Jun 24, 2024

github-actions bot added the doc-not-needed label and removed the doc-label-missing label Jun 24, 2024

lhotari requested changes Jun 24, 2024

This was referenced Jun 24, 2024

[improve][broker] Optimize ConcurrentOpenLongPairRangeSet by RoaringBitmap #22908

Merged

[revert] "[improve][broker] Optimize ConcurrentOpenLongPairRangeSet by RoaringBitmap (#22908)" #22968

Merged

Fix CurrentRoaringBitSet concurrency issue.

e6f5f35

Fix CurrentRoaringBitSet concurrency issue.

c363fe9

dao-jun changed the title ~~[fix] Fix CurrentRoaringBitSet concurrency issue.~~ [fix] Make operations of individualDeletedMessages thread-safe Jun 25, 2024

dao-jun changed the title ~~[fix] Make operations of individualDeletedMessages thread-safe~~ [fix] Make operations on individualDeletedMessages in lock scope Jun 25, 2024

dao-jun added 2 commits June 25, 2024 19:18

fix checkstyle

0015162

fix code

8e741d9

lhotari requested changes Jun 26, 2024

dao-jun added 3 commits June 26, 2024 15:15

fix review comment.

2ec7952

Merge branch 'refs/heads/master' into fix/ConcurrentRoaringBitSet_con…

5010d4f

…currency_issue # Conflicts: # pulsar-common/src/main/java/org/apache/pulsar/common/util/collections/OpenLongPairRangeSet.java

merge master

16480eb

lhotari reviewed Jun 26, 2024

merge master

af9bbc5

merge master

66b228c

lhotari approved these changes Jun 26, 2024

lhotari reviewed Jun 26, 2024

pulsar-common/src/main/java/org/apache/pulsar/common/util/collections/OpenLongPairRangeSet.java

dao-jun merged commit dbbb6b6 into apache:master Jul 3, 2024
51 checks passed

dao-jun removed the release/blocker label Jul 3, 2024

dao-jun deleted the fix/ConcurrentRoaringBitSet_concurrency_issue branch July 3, 2024 13:13

lhotari pushed a commit that referenced this pull request Jul 5, 2024

[fix] Make operations on individualDeletedMessages in lock scope (#…

e01e90f

…22966) (cherry picked from commit dbbb6b6)

lhotari added the cherry-picked/branch-3.0 label Jul 5, 2024

lhotari added the cherry-picked/branch-3.2 label Jul 5, 2024

lhotari pushed a commit that referenced this pull request Jul 5, 2024

[fix] Make operations on individualDeletedMessages in lock scope (#…

2e99b70

…22966) (cherry picked from commit dbbb6b6)

lhotari added the cherry-picked/branch-3.3 label Jul 5, 2024

lhotari mentioned this pull request Jul 5, 2024

[improve][broker] Use RoaringBitmap in tracking individual acks to reduce memory usage #23006

Merged

nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jul 10, 2024

[fix] Make operations on individualDeletedMessages in lock scope (a…

2d1fcf9

…pache#22966) (cherry picked from commit dbbb6b6) (cherry picked from commit e01e90f)

srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jul 15, 2024

[fix] Make operations on individualDeletedMessages in lock scope (a…

1db5939

…pache#22966) (cherry picked from commit dbbb6b6) (cherry picked from commit e01e90f)
