Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix for DocumentsWriter concurrency (fixes #935, closes #886) #940

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

NightOwl888
Copy link
Contributor

@NightOwl888 NightOwl888 commented May 23, 2024

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Summary of the changes (Less than 80 chars)

Lucene.Net.Support.Threading.ReentrantLock: Fixed the implementation so it prioritizes new threads obtaining the lock over waiting threads.

Fixes #935. Closes #886.

Description

This has been a known issue for some time, however as we have had the DocumentsWriter working reliably on a single thread most users have not worried about it until it was reported in #935.

There were 2 issues that were causing test failures explained in #935 (comment). The second issue turned out to be much more involved to work out how to address even though the most recent solution is actually very simple and lightweight. Instead of calling UninterruptableMonitor.Enter() in the Lock() method, we call UninterruptableMonitor.TryEnter() in a loop that executes Thread.Yield(). This allows other threads to acquire the lock even though there are waiting threads.

Granted, while this approach seems to reliably pass the tests, it may be a bit naïve of an implementation. While it doesn't peg my CPU and seems to run well on Azure DevOps, I am not sure whether this is the appropriate solution for production scenarios. Suggestions are welcome as to how to improve this. Do note there were 2 prior attempts:

  • 6f2e129 - Uses a timeout to ensure the lock is acquired, but it still has to wait until the queue schedules it until it runs.
  • 89b01e6 - Uses ManualResetEventSlim and a Queue<T> to manage the waiting threads. This is a more complete implementation and even passed many of the Apache Harmony tests, but it comes at a pretty steep performance cost. Maybe there is a way to improve this, though.

…63e10c that made the documents writer non-concurrent (but threadsafe)
…): The DocumentsWriterThreadPool.Reset() method must be called within the context of a lock because there is a possible race condition with other callers of DocumentsWriterThreadPool.Reset(). See apache#935.
…ntLock.tryLock() method barges to the front of the queue instead of returning false like Monitor.TryEnter(). Use Monitor.Enter(object, ref bool) instead, which always returns true. We get locks in a different order, but I am not sure whether that matters. Fixes apache#935. Closes apache#886.
…at(1000)] to try to reproduce on Azure DevOps (to be reverted).
…readState() method because it has no callers. This simplifies the design of ReentrantLock, since we don't need to artificially keep track of "queued threads".
… Enter(). Enter() causes deadlocks in other tests, so need to localize this change to DocumentsWriterFlushControl.
…lush(): Use ReentrantLock.Lock() instead of ReentrantLock.TryLock() because Java relies on "barging" behavior instead of returning false when the current thread isn't next in the queue. We cannot do that, but we can wait for the lock to become available instead.
…ome time for threads to reach the beginning of the wait queue. In Java, they are automatically put at the beginning of the queue, but since we cannot do that in .NET, we wait a little bit.
…iseconds to wait on whether the current process is 64 or 32 bit.
…liseconds that is used by callers of ReentrantLock.TryLock() to set the default value.
…ng TryDequeue() and TryPeek() methods on netstandard2.0 and .NET Framework
…n to use unfair locking similar to how it was done in Java. We track the queue and use ManualResetEventSlim to control entry into the lock for queued tasks. Ported some of the ReentrantLock tests from Apache Harmony.
…he missing TryDequeue() and TryPeek() methods on netstandard2.0 and .NET Framework"

This reverts commit e5a65e9cd8bbf996fb599ff76e8d6b9f90babe4b.
…leMonitor.TryEnter() instead of UninterruptableMonitor.Enter() so we can control what happens while the thread waits. We simply call Thread.Yield() to allow TryLock() to proceed before any waiting threads. Commented tests that depend on IsLocked because the property was removed.
…n a longer test (to be reverted)"

This reverts commit b30e4abb576c8bfde3337a92de927a10747f88ae.
…ed [Repeat(1000)] to try to reproduce on Azure DevOps (to be reverted)."

This reverts commit d8fca410dafd1bf5529e8200034e1e2e5be83f07.
…itorStateException() Removed unused exception variable and added a comment to indicate success so we don't have to suppress warnings
@pc-LMS
Copy link

pc-LMS commented May 28, 2024

@jeme @rclabo Hello, I am reaching out to see if you have some time to review the request. I am hoping to begin using this fix if all goes well :)

@rclabo
Copy link
Contributor

rclabo commented May 28, 2024

I'll try to free up some time so I can look through this towards the end of the week. I'm sure I'd learn a lot in the process, so looking forward to it.

@rclabo
Copy link
Contributor

rclabo commented Jun 7, 2024

This week, I made good headway toward setting up an environment where I can benchmark the current master vs this PR to see if the proposed improvements do, in fact, speed up indexing through support for DocumentsWriter concurrency.

For benchmarking I'm using an approach similar to the code provided in #935 to see the level of impact this PR has on the performance issue reported there.

I will work on this more this next week and provide the results of the benchmarks once I have them. Then I will review the code in light of those benchmarks.

That's my update for this week.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many method names in this file are camelCase. They should be changed to PascalCase as is the norm in the .NET world.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further review this file appears that it may be a port of a file from OpenJdk. That codebase is GNU GPL 2 licensed which is not compatible with the Apache License that appears in this file's header. The file needs to be removed from the PR.

@rclabo
Copy link
Contributor

rclabo commented Jun 14, 2024

I have reviewed each of the individual commits for this PR, but since many commits overlap and change the same areas of code I still want to review the aggregate effect on each changed file by comparing it to the Java Lucene 4.8.1 file. I plan to do that on Monday.

I have also run performance tests on this PR vs the Master prior to the PR and the performance appears to be approximately the same.

@pc-LMS
Copy link

pc-LMS commented Jun 17, 2024

I have reviewed each of the individual commits for this PR, but since many commits overlap and change the same areas of code I still want to review the aggregate effect on each changed file by comparing it to the Java Lucene 4.8.1 file. I plan to do that on Monday.

I have also run performance tests on this PR vs the Master prior to the PR and the performance appears to be approximately the same.

We also ran a test and did not see any performance improvements - CPU remained at the same level and was not elevated in anyway. @NightOwl888

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further review this file appears that it may be a port of a file from OpenJdk. That codebase is GNU GPL 2 licensed which is not compatible with the Apache License that appears in this file's header. The file needs to be removed from the PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file appears it may be a port of a file from OpenJdk. That codebase is GNU GPL 2 licensed which is not compatible with the Apache License that appears in this file's header. The file needs to be removed from the PR. We should create our own ReentrantLockTest class that does not incorporate external code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 505 contains "flushQueue.Count > 0 &&", which is not found in the corresponding Java Lucene 4.8.1 code. This, however, is a pre-existing deviation rather than a change submitted in this PR. Still, it may be worth investigating why this change was introduced and whether it is still needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes to ReentrantLock overall seem like an elegant solution to get close to parity to how the reentrantLock class works in Java (as far as I understand it based on a bunch of googling). So, in general, this feels like a great step forward. It is, however, imperative that the code be original and not incorporate any OpenJDK code or code from any non-Apache source. Please ensure that is the case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These method should ideally all contain doc comments to explain what they do.

@rclabo
Copy link
Contributor

rclabo commented Jun 18, 2024

@pc-LMS

We also ran a test and did not see any performance improvements - CPU remained at the same level and was not elevated in anyway.

Today I made a private branch off of of this PR and I hacked in some concurrent logging to memory deep inside the indexing code at the dwpt level. Specifically I logged to memory each time a thread entered the documentWriterPerThread.UpdateDocument(...) method and each time a thread left that method. Through this code I was able to unequivocally verify that with this PR multiple threads are able to call documentWriterPerThread.UpdateDocument(...) in parallel.

In my observations during performance testing, I could never get my processor above 45% utilization, and then only for brief spikes, but that kinda makes sense to me given the fact that indexing is pretty clearly an IO bound operation not a CPU bound operation.

Just for fun I created a parallel threading test where I had up to 20 threads reading text files in one format and writing them in another format via an input and output stream per thread. My local processor has 20 cores. This 2nd setup doesn't use Lucene.NET in any way but I wanted to see if outside of Lucene.NET I could get high utilization of my CPU when using lots of threads doing heavy IO work. And the answer, as I suspected, is no. My processor never got even above 35% utilization and that just for brief spikes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Poor multi-threaded indexing performance Support for Ordering of Indexing with SeqNo
4 participants