BUG: Fix for DocumentsWriter concurrency (fixes #935, closes #886) #940

NightOwl888 · 2024-05-23T12:36:46Z

You've read the Contributor Guide and Code of Conduct.
You've included unit or integration tests for your change, where applicable.
You've included inline docs for your change, where applicable.
There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Summary of the changes (Less than 80 chars)

Lucene.Net.Support.Threading.ReentrantLock: Fixed the implementation so it prioritizes new threads obtaining the lock over waiting threads.

Fixes #935. Closes #886.

Description

This has been a known issue for some time. However, because DocumentsWriter has worked reliably on a single thread, most users did not worry about it until it was reported in #935.

Two issues were causing the test failures, as explained in #935 (comment). The second turned out to be much harder to address, even though the most recent solution is very simple and lightweight. Instead of calling UninterruptableMonitor.Enter() in the Lock() method, we call UninterruptableMonitor.TryEnter() in a loop that executes Thread.Yield() between attempts. This allows other threads to acquire the lock even though there are waiting threads.
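For readers who want to see the shape of that loop, here is a minimal sketch. It is not the code from this PR: `YieldingLock` is a hypothetical name, and it uses the BCL's `System.Threading.Monitor` as a stand-in for Lucene.NET's `UninterruptableMonitor`, so any interrupt handling is left out.

```csharp
using System;
using System.Threading;

// Hypothetical illustration only; not the actual Lucene.NET ReentrantLock.
// System.Threading.Monitor stands in for UninterruptableMonitor (assumption).
public sealed class YieldingLock
{
    private readonly object syncRoot = new object();

    // Blocks until the lock is acquired. Instead of parking in the monitor's
    // wait queue via Monitor.Enter(), retry a non-blocking TryEnter() and
    // yield the time slice between attempts.
    public void Lock()
    {
        while (!Monitor.TryEnter(syncRoot))
        {
            Thread.Yield();
        }
    }

    // Single non-blocking attempt. Returns false if another thread holds the lock.
    public bool TryLock() => Monitor.TryEnter(syncRoot);

    // Releases one level of ownership. Monitor is reentrant, so each
    // Lock()/successful TryLock() needs a matching Unlock().
    public void Unlock() => Monitor.Exit(syncRoot);
}
```

Because threads calling Lock() never sit in the monitor's wait queue, a TryLock() caller competes with them on equal terms, which is roughly the "barging" behavior described above. The trade-off is that waiting threads spin (politely) rather than sleep.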

Granted, while this approach seems to pass the tests reliably, the implementation may be a bit naïve. It doesn't peg my CPU and seems to run well on Azure DevOps, but I am not sure whether it is the appropriate solution for production scenarios. Suggestions on how to improve it are welcome. Note that there were 2 prior attempts:

6f2e129 - Uses a timeout to ensure the lock is acquired, but the thread still has to wait until the queue schedules it before it runs.
89b01e6 - Uses ManualResetEventSlim and a Queue<T> to manage the waiting threads. This is a more complete implementation and even passed many of the Apache Harmony tests, but it comes at a pretty steep performance cost. Maybe there is a way to improve this, though.
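As a rough picture of what the timeout-based attempt looks like in isolation, the sketch below wraps `Monitor.TryEnter(object, TimeSpan)`. The class name `TimedLock` is hypothetical and the code is not taken from commit 6f2e129; it only illustrates the general idea of a TryLock() overload that waits for a bounded time.

```csharp
using System;
using System.Threading;

// Hypothetical illustration only; not the code from commit 6f2e129.
public sealed class TimedLock
{
    private readonly object syncRoot = new object();

    // Waits up to 'timeout' for the lock. While waiting, the thread sits in
    // the monitor's queue, so it still depends on the runtime scheduling it.
    public bool TryLock(TimeSpan timeout) => Monitor.TryEnter(syncRoot, timeout);

    // Releases the lock. Must be called by the owning thread.
    public void Unlock() => Monitor.Exit(syncRoot);
}
```

A caller has to pick a timeout value up front, and any value is a guess about how long the queue will take to drain, which may be one reason this attempt was not kept.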

…ntLock.tryLock() method barges to the front of the queue instead of returning false like Monitor.TryEnter(). Use Monitor.Enter(object, ref bool) instead, which always returns true. We get locks in a different order, but I am not sure whether that matters. Fixes apache#935. Closes apache#886.

…lush(): Use ReentrantLock.Lock() instead of ReentrantLock.TryLock() because Java relies on "barging" behavior instead of returning false when the current thread isn't next in the queue. We cannot do that, but we can wait for the lock to become available instead.

…n to use unfair locking similar to how it was done in Java. We track the queue and use ManualResetEventSlim to control entry into the lock for queued tasks. Ported some of the ReentrantLock tests from Apache Harmony.

…leMonitor.TryEnter() instead of UninterruptableMonitor.Enter() so we can control what happens while the thread waits. We simply call Thread.Yield() to allow TryLock() to proceed before any waiting threads. Commented tests that depend on IsLocked because the property was removed.

pc-LMS · 2024-05-28T16:47:09Z

@jeme @rclabo Hello, I am reaching out to see if you have some time to review the request. I am hoping to begin using this fix if all goes well :)

rclabo · 2024-05-28T17:53:22Z

I'll try to free up some time so I can look through this towards the end of the week. I'm sure I'd learn a lot in the process, so looking forward to it.

rclabo · 2024-06-07T21:17:17Z

This week, I made good headway toward setting up an environment where I can benchmark the current master vs this PR to see if the proposed improvements do, in fact, speed up indexing through support for DocumentsWriter concurrency.

For benchmarking I'm using an approach similar to the code provided in #935 to see the level of impact this PR has on the performance issue reported there.

I will work on this more next week and provide the benchmark results once I have them. Then I will review the code in light of those benchmarks.

That's my update for this week.

rclabo · 2024-06-14T21:37:51Z

src/Lucene.Net.Tests/Support/Threading/JSR166TestCase.cs

Many method names in this file are camelCase. They should be changed to PascalCase as is the norm in the .NET world.

On further review, this file appears to be a port of a file from OpenJDK. That codebase is licensed under GNU GPL 2, which is not compatible with the Apache License that appears in this file's header. The file needs to be removed from the PR.

rclabo · 2024-06-14T22:00:05Z

I have reviewed each of the individual commits for this PR, but since many commits overlap and change the same areas of code I still want to review the aggregate effect on each changed file by comparing it to the Java Lucene 4.8.1 file. I plan to do that on Monday.

I have also run performance tests on this PR vs the Master prior to the PR and the performance appears to be approximately the same.

pc-LMS · 2024-06-17T14:08:45Z

> I have reviewed each of the individual commits for this PR, but since many commits overlap and change the same areas of code I still want to review the aggregate effect on each changed file by comparing it to the Java Lucene 4.8.1 file. I plan to do that on Monday.
>
> I have also run performance tests on this PR vs the Master prior to the PR and the performance appears to be approximately the same.

We also ran a test and did not see any performance improvement: CPU remained at the same level and was not elevated in any way. @NightOwl888

rclabo · 2024-06-17T17:33:10Z

src/Lucene.Net.Tests/Support/Threading/JSR166TestCase.cs

On further review, this file appears to be a port of a file from OpenJDK. That codebase is licensed under GNU GPL 2, which is not compatible with the Apache License that appears in this file's header. The file needs to be removed from the PR.

rclabo · 2024-06-17T17:43:08Z

src/Lucene.Net.Tests/Support/Threading/ReentrantLockTest.cs

This file appears to be a port of a file from OpenJDK. That codebase is licensed under GNU GPL 2, which is not compatible with the Apache License that appears in this file's header. The file needs to be removed from the PR. We should create our own ReentrantLockTest class that does not incorporate external code.

rclabo · 2024-06-17T18:03:59Z

src/Lucene.Net/Index/DocumentsWriterFlushControl.cs

Line 505 contains "flushQueue.Count > 0 &&", which is not found in the corresponding Java Lucene 4.8.1 code. This, however, is a pre-existing deviation rather than a change submitted in this PR. Still, it may be worth investigating why this change was introduced and whether it is still needed.

rclabo · 2024-06-17T18:11:28Z

src/Lucene.Net/Support/Threading/ReentrantLock.cs

Overall, these changes to ReentrantLock seem like an elegant way to get close to parity with how the ReentrantLock class works in Java (as far as I understand it, based on a bunch of googling). So, in general, this feels like a great step forward. It is, however, imperative that the code be original and not incorporate any OpenJDK code or code from any non-Apache source. Please ensure that is the case.

rclabo · 2024-06-17T18:12:57Z

src/Lucene.Net/Support/Threading/UninterruptableMonitor.cs

These methods should ideally all have doc comments explaining what they do.

rclabo · 2024-06-18T15:57:37Z

@pc-LMS

> We also ran a test and did not see any performance improvements - CPU remained at the same level and was not elevated in any way.

Today I made a private branch off of this PR and hacked in some concurrent logging to memory deep inside the indexing code, at the DocumentsWriterPerThread (DWPT) level. Specifically, I logged to memory each time a thread entered the documentWriterPerThread.UpdateDocument(...) method and each time a thread left that method. Through this code I was able to unequivocally verify that, with this PR, multiple threads are able to call documentWriterPerThread.UpdateDocument(...) in parallel.
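A sketch of what such an in-memory probe could look like is below. This is not the code from the private branch: `ConcurrencyProbe` and its members are hypothetical names, and the idea is simply to call `Enter()` on the way into the method under observation and `Exit()` on the way out (ideally in a `finally`), then inspect the recorded events and the high-water mark afterwards.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper for illustration; not part of Lucene.NET or this PR.
public static class ConcurrencyProbe
{
    // In-memory event log: (managed thread id, true = entered / false = left, timestamp).
    private static readonly ConcurrentQueue<(int ThreadId, bool Entered, long Ticks)> events =
        new ConcurrentQueue<(int ThreadId, bool Entered, long Ticks)>();

    private static int active;     // threads currently inside the probed region
    private static int maxActive;  // highest value 'active' has reached

    public static void Enter()
    {
        int now = Interlocked.Increment(ref active);

        // Raise the high-water mark with a compare-and-swap loop so that
        // concurrent callers cannot overwrite a larger value with a smaller one.
        int seen;
        while (now > (seen = Volatile.Read(ref maxActive)))
        {
            if (Interlocked.CompareExchange(ref maxActive, now, seen) == seen)
                break;
        }

        events.Enqueue((Environment.CurrentManagedThreadId, true, Stopwatch.GetTimestamp()));
    }

    public static void Exit()
    {
        Interlocked.Decrement(ref active);
        events.Enqueue((Environment.CurrentManagedThreadId, false, Stopwatch.GetTimestamp()));
    }

    // A value greater than 1 means at least two threads were inside at the same time.
    public static int MaxConcurrent => Volatile.Read(ref maxActive);

    public static int EventCount => events.Count;
}
```

Logging to a concurrent in-memory queue rather than to a file or console keeps the probe itself from serializing the threads it is trying to observe.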

In my observations during performance testing, I could never get my processor above 45% utilization, and then only for brief spikes. That kinda makes sense to me, given that indexing is pretty clearly an I/O-bound operation, not a CPU-bound one.

Just for fun, I created a parallel threading test where I had up to 20 threads reading text files in one format and writing them in another format, via an input and output stream per thread. My local processor has 20 cores. This second setup doesn't use Lucene.NET in any way, but I wanted to see whether, outside of Lucene.NET, I could get high CPU utilization when using lots of threads doing heavy I/O work. The answer, as I suspected, is no. My processor never got above 35% utilization, and even that only for brief spikes.

NightOwl888 added 21 commits May 15, 2024 14:00

Lucene.Net.Index.DocumentsWriterFlushControl: Reverted changes from 9…

878261f

…63e10c that made the documents writer non-concurrent (but threadsafe)

BUG: Lucene.Net.Index.DocumentsWriterFlushControl::AddFlushableState(…

23898ad

…): The DocumentsWriterThreadPool.Reset() method must be called within the context of a lock because there is a possible race condition with other callers of DocumentsWriterThreadPool.Reset(). See apache#935.

Lucene.Net.Index.TestRollingUpdates::TestUpdateSameDoc(): Added [Repe…

e3501a3

…at(1000)] to try to reproduce on Azure DevOps (to be reverted).

Lucene.Net.Index.DocumentsWriterPerThreadPool: Removed MinContendedTh…

3abbb73

…readState() method because it has no callers. This simplifies the design of ReentrantLock, since we don't need to artificially keep track of "queued threads".

Lucene.Net.Support: Added aggressive inlining for ReentrantLock and U…

0884f4d

…ninterruptableMonitor on cascaded methods.

run-tests-on-os.yml: Increase blame hang timeout so we can run a long…

bbd6726

…er test (to be reverted)

Lucene.Net.Support.Threading.ReentrantLock: Use TryEnter() instead of…

37a0521

… Enter(). Enter() causes deadlocks in other tests, so need to localize this change to DocumentsWriterFlushControl.

Lucene.Net.Support.Threading.ReeentrantLock(): Added an overload of T…

f33b243

…ryLock() that accepts a timeout (TimeSpan)

Lucene.Net.Index.DocumentsWriterFlushControl: Use timeouts to allow s…

85d8023

…ome time for threads to reach the beginning of the wait queue. In Java, they are automatically put at the beginning of the queue, but since we cannot do that in .NET, we wait a little bit.

Lucene.Net.Index.DocumentsWriterFlushControl: Base the number of mill…

61bf4ac

…iseconds to wait on whether the current process is 64 or 32 bit.

Lucene.Net.Index::DocumentsWriter: Added a constant TryLockTimeoutMil…

6f2e129

…liseconds that is used by callers of ReentrantLock.TryLock() to set the default value.

Lucene.Net.Support: Added QueueExtensions class to polyfill the missi…

d18d9b7

…ng TryDequeue() and TryPeek() methods on netstandard2.0 and .NET Framework

SWEEP: Lucene.Net.Index: Removed timeouts for ReentrantLock.TryLock().

8423edc

Revert "Lucene.Net.Support: Added QueueExtensions class to polyfill t…

3074564

…he missing TryDequeue() and TryPeek() methods on netstandard2.0 and .NET Framework" This reverts commit e5a65e9cd8bbf996fb599ff76e8d6b9f90babe4b.

Lucene.Net.csproj: Removed dependency on Microsoft.Extensions.ObjectPool

e788218

Revert "run-tests-on-os.yml: Increase blame hang timeout so we can ru…

53b83c2

…n a longer test (to be reverted)" This reverts commit b30e4abb576c8bfde3337a92de927a10747f88ae.

Revert "Lucene.Net.Index.TestRollingUpdates::TestUpdateSameDoc(): Add…

5f7c1e9

…ed [Repeat(1000)] to try to reproduce on Azure DevOps (to be reverted)." This reverts commit d8fca410dafd1bf5529e8200034e1e2e5be83f07.

NightOwl888 requested review from jeme, paulirwin and rclabo May 23, 2024 12:42

NightOwl888 mentioned this pull request May 23, 2024

Poor multi-threaded indexing performance #935

Open

1 task

paulirwin reviewed May 23, 2024

src/Lucene.Net.Tests/Support/Threading/ReentrantLockTest.cs

Lucene.Net.Support.Threading.ReentrantLockTest::TestUnlock_IllegalMon…

4e1dcc9

…itorStateException() Removed unused exception variable and added a comment to indicate success so we don't have to suppress warnings

rclabo reviewed Jun 14, 2024

rclabo requested changes Jun 17, 2024
