forked from couchbase/kv_engine
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
MB-35458 [SR]: Move SyncWrite completion to bg DurabilityCompletionTask
[[Re-apply after fixing error in DurabilityCompletionTask::run (skipping last vBucket).]] Change how SyncWrites which are Resolved and awaiting Completion are handled, by moving the final VBucket::commit() / abort() into a background task - DurabilityCompletionTask. +Background+ There are two reasons for making this change: a) Performance - specifically latency of front-end worker threads. By moving completion into a background task, we reduce the amount of work done on the thread which actually detected the SyncWrite was resolved - typically the front-end DCP threads when a DCP_SEQNO_ACK is processed. Given that we SEQNO_ACK at the end of Snapshot, A single SEQNO_ACK could result in committing multiple SyncWrites. Committing one SyncWrite is similar to a normal front-end Set operation, so there is potentially a non-trivial amount of work needed to be done when completing SyncWrites, which could tie up the front-end thread (causing other Connections to have to wait) for a noticable amount of time. b) Simplification of lock management. Doing completion in a background task simplifies lock management, for example we avoid lock inversions with earlier locks acquired during dcpSeqnoAck when attemping to later call notifySeqnoAvailable when this was done on the original thread. +Problem+ While (a) was the first reason identified for making this change (see MB-33092), (b) is the reason this change is being made now. During testing the following lock-order-inversion was seen: WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) Cycle in lock order graph: Stream::streamMutex => StreamContainer::rwlock => Stream::streamMutex The crux of the issue is the processing of DCP_SEQNO_ACKNOWLEDGED messages by the DcpProducer - this acquires the Stream::streamMutex before calling VBucket::seqnoAcknowledged(), however that function currently results in VBucket::commit() being called to synchronously complete the SyncWrite; which in turn must nodify all connected replica that a new seqno is available, requiring StreamContainer::rwlock to be acquired: Mutex StreamContainer::rwlock acquired here while holding mutex Stream::streamMutex in thread T15: ... couchbase#6 StreamContainer<std::shared_ptr<Stream> >::rlock() #7 DcpProducer::notifySeqnoAvailable(Vbid, unsigned long) ... #13 VBucket::commit(...) #14 ActiveDurabilityMonitor::commit(...) #15 ActiveDurabilityMonitor::processCompletedSyncWriteQueue() #16 ActiveDurabilityMonitor::seqnoAckReceived(...) #17 VBucket::seqnoAcknowledged(...) #18 ActiveStream::seqnoAck(...) #19 DcpProducer::seqno_acknowledged(...) ... Mutex Stream::streamMutex previously acquired by the same thread here: ... couchbase#3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) couchbase#4 ActiveStream::seqnoAck(...) couchbase#5 DcpProducer::seqno_acknowledged(...) ... This conflicts with the ordering seen when sending items out on the DCP connection - inside DcpProducer::step() where the StreamContainer::rwlock is acquired first, then ActiveStream::mutex acquired later: Mutex Stream::streamMutex acquired here while holding mutex StreamContainer::rwlock in thread T15: ... couchbase#3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) couchbase#4 ActiveStream::next() couchbase#5 DcpProducer::getNextItem() couchbase#6 DcpProducer::step(dcp_message_producers*) ... Mutex StreamContainer::rwlock previously acquired by the same thread here: #0 pthread_rwlock_rdlock <null> (libtsan.so.0+0x00000002c98b) ... couchbase#4 std::shared_lock<cb::RWLock>::shared_lock(cb::RWLock&) couchbase#5 StreamContainer<>::ResumableIterationHandle::ResumableIterationHandle() couchbase#6 StreamContainer<>::startResumable() #7 DcpProducer::getNextItem() #8 DcpProducer::step(dcp_message_producers*) ... +Solution+ The processing of resolved SyncWrites moved into a new background task. Instead of immediately processing them within ActiveDM::seqnoAckReceived(), that function notifies the new NonIO DurabilityCompletionTask that there are SyncWrites waiting for completion. DurabilityCompletionTask maintains a bool per vBucket indicating if there are SyncWrites for that vBucket pending completion. When the task is run, for each flag which is true it calls VBucket::processResolvedSyncWrites() for the associated VBucket. +Implementation Notes+ Currently there is just a single DurabilityCompletionTask (per Bucket), this was chosen as 1 task per vBucket (i.e. 1024 per Bucket) would be inefficient for our current background task scheduler (both in terms of latency to schedule each task for only one vBucket's worth of work, and in terms of managing that many tasks in the future queue). However, that does _potentially_ mean there's fewer resources (threads) available to complete SyncWrites on - previously that work could be done concurrently on all frontend threads (~O(num_cpus). Now the same work only has 1 thread available to run on (there's only a single DurabilityCompletionTask). _If_ this becomes a bottleneck we could look at increasing the number of DurabilityCompletionTask - e.g. sharding all vBuckets across multiple tasks like flusher / bgfetcher. Change-Id: I33ecfa78b03b4d2120b5d05f54984b24ce038fd8 Reviewed-on: http://review.couchbase.org/113749 Reviewed-by: Ben Huddleston <[email protected]> Tested-by: Build Bot <[email protected]>
- Loading branch information
Showing
35 changed files
with
457 additions
and
75 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
/* -*- Mode: C++; tab-width: 4; c-basic-offset: 4; indent-tabs-mode: nil -*- */ | ||
/* | ||
* Copyright 2019 Couchbase, Inc | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
#include "durability_completion_task.h" | ||
|
||
#include "ep_engine.h" | ||
#include "executorpool.h" | ||
#include "vbucket.h" | ||
|
||
#include <climits> | ||
|
||
using namespace std::chrono_literals; | ||
|
||
DurabilityCompletionTask::DurabilityCompletionTask( | ||
EventuallyPersistentEngine& engine) | ||
: GlobalTask(&engine, TaskId::DurabilityCompletionTask), | ||
pendingVBs(engine.getConfiguration().getMaxVbuckets()), | ||
vbid(0) { | ||
for (auto& vb : pendingVBs) { | ||
vb.store(false); | ||
} | ||
} | ||
|
||
bool DurabilityCompletionTask::run() { | ||
if (engine->getEpStats().isShutdown) { | ||
return false; | ||
} | ||
|
||
// Start by putting ourselves back to sleep once run() completes. | ||
// If a new VB is notified (or a VB is re-notified after it is processed in | ||
// the loop below) then that will cause the task to be re-awoken. | ||
snooze(INT_MAX); | ||
// Clear the wakeUpScheduled flag - that allows notifySyncWritesToComplete() | ||
// to wake up (re-schedule) this task if new vBuckets have SyncWrites which | ||
// need completing. | ||
wakeUpScheduled.store(false); | ||
|
||
const auto startTime = std::chrono::steady_clock::now(); | ||
|
||
// Loop for each vBucket, starting from where we previously left off. | ||
// For each vbucket, if the pending flag is set then clear it, and process | ||
// its resolved SyncWrites. | ||
for (size_t count = 0; count < pendingVBs.size(); | ||
count++, vbid = (vbid + 1) % pendingVBs.size()) { | ||
if (pendingVBs[vbid].exchange(false)) { | ||
engine->getVBucket(Vbid(vbid))->processResolvedSyncWrites(); | ||
} | ||
// Yield back to scheduler if we have exceeded the maximum runtime | ||
// for a single execution. | ||
auto runtime = std::chrono::steady_clock::now() - startTime; | ||
if (runtime > maxChunkDuration) { | ||
wakeUp(); | ||
break; | ||
} | ||
} | ||
|
||
return true; | ||
} | ||
|
||
void DurabilityCompletionTask::notifySyncWritesToComplete(Vbid vbid) { | ||
bool expected = false; | ||
if (!pendingVBs[vbid.get()].compare_exchange_strong(expected, true)) { | ||
// This VBucket transitioned from false -> true - wake ourselves up so | ||
// we can start to process the SyncWrites. | ||
expected = false; | ||
|
||
// Performance: Only wake up the task once (and don't repeatedly try to | ||
// wake if it's already scheduled to wake) - ExecutorPool::wake() isn't | ||
// super cheap so avoid it if already pending. | ||
if (wakeUpScheduled.compare_exchange_strong(expected, true)) { | ||
ExecutorPool::get()->wake(getId()); | ||
} | ||
} | ||
} | ||
|
||
const std::chrono::steady_clock::duration | ||
DurabilityCompletionTask::maxChunkDuration = 25ms; |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
/* -*- Mode: C++; tab-width: 4; c-basic-offset: 4; indent-tabs-mode: nil -*- */ | ||
/* | ||
* Copyright 2019 Couchbase, Inc | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
#pragma once | ||
|
||
#include "globaltask.h" | ||
#include <memcached/vbucket.h> | ||
|
||
/* | ||
* This task is used to complete (commit or abort) all SyncWrites which have | ||
* been resolved by each vbucket's ActiveDM. | ||
* | ||
* This is done in a separate task to reduce the amount of work done on | ||
* the thread which actually detected the SyncWrite was resolved - typically | ||
* the front-end DCP threads when a DCP_SEQNO_ACK is processed. | ||
* Given that we SEQNO_ACK at the end of Snapshot, A single SEQNO_ACK could | ||
* result in committing multiple SyncWrites, and Committing one SyncWrite is | ||
* similar to a normal front-end Set operation, we want to move this to a | ||
* background task. | ||
* | ||
* Additionally, by doing this in a background task it simplifies lock | ||
* management, for example we avoid lock inversions with earlier locks acquired | ||
* during dcpSeqnoAck when attemping to later call notifySeqnoAvailable when | ||
* this was done on the original thread. | ||
*/ | ||
class DurabilityCompletionTask : public GlobalTask { | ||
public: | ||
DurabilityCompletionTask(EventuallyPersistentEngine& engine); | ||
|
||
bool run() override; | ||
|
||
std::string getDescription() override { | ||
return "DurabilityCompletionTask"; | ||
} | ||
|
||
std::chrono::microseconds maxExpectedDuration() override { | ||
// Task shouldn't run much longer than maxChunkDuration; given we yield | ||
// after that duration - however _could_ exceed a bit given we check | ||
// the duration on each vBucket. As such add a 2x margin of error. | ||
return std::chrono::duration_cast<std::chrono::microseconds>( | ||
2 * maxChunkDuration); | ||
} | ||
|
||
/** | ||
* Notifies the task that the given vBucket has SyncWrite(s) ready to | ||
* be completed. | ||
* If the given vBucket isn't already pending, then will wake up the task | ||
* for it to run. | ||
*/ | ||
void notifySyncWritesToComplete(Vbid vbid); | ||
|
||
private: | ||
/** | ||
* A flag for each (possible) Vbid, set to true if there are SyncWrites | ||
* which need to be resolved. | ||
*/ | ||
std::vector<std::atomic_bool> pendingVBs; | ||
|
||
/// The index of the vBucket to check for resolved SyncWrites next | ||
/// in run(). | ||
/// Kept as member variable (and not just local) so we resume from the | ||
/// vBucket we left off from, to ensure fair scheduling. | ||
int vbid; | ||
|
||
/** | ||
* Flag which is used to check if a wakeup has already been scheduled for | ||
* this task. | ||
*/ | ||
std::atomic<bool> wakeUpScheduled{false}; | ||
|
||
/// Maximum duration this task should execute for before yielding back to | ||
/// the ExecutorPool (to allow other tasks to run). | ||
static const std::chrono::steady_clock::duration maxChunkDuration; | ||
}; |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.