dcpdrain: Increase default NOOP interval from 1 to 60s
dcpdrain currently sets the DCP noop-interval to 1s, so the producer
will send NOOP requests to dcpdrain every second, and dcpdrain needs to
correctly handle each request and send a response. When connecting to
clusters with high latency between client and server nodes, it can take
more than 1 second to complete setting up the DCP connection and enter
the main event loop. This means the server node may start to send DCP
NOOP requests before the DCP connection is set up, and crucially before
dcpdrain's event loop is ready to process them. This results in dcpdrain
crashing, as it receives a DCP NOOP request when it is expecting a
DCP_CONTROL response:

    Process 43094 launched: '/Users/dave/repos/couchbase/server/source/build/kv_engine/dcpdrain' (arm64)
    Using DCP flow control with buffer size: 13421772
    Set DCP control message: set_priority=high
    Set DCP control message: supports_cursor_dropping_vulcan=true
    Set DCP control message: supports_hifi_MFU=true
    Set DCP control message: send_stream_end_on_client_close_stream=true
    Set DCP control message: enable_expiry_opcode=true
    Set DCP control message: set_noop_interval=1
    Set DCP control message: enable_noop=true
    Set DCP control message: enable_out_of_order_snapshots=true
    2023-05-03T12:11:28.705431+01:00 CRITICAL *** Fatal error encountered during exception handling ***
    2023-05-03T12:11:28.708626+01:00 CRITICAL Caught unhandled std::exception-derived exception. what(): Header::getResponse(): Header is not a response

    Target 0: (dcpdrain) stopped.
    (lldb) bt
    * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
      * frame #0: 0x00000001c24a2d98 libsystem_kernel.dylib` __pthread_kill  + 8
        frame #1: 0x00000001c24d7ee0 libsystem_pthread.dylib` pthread_kill  + 288
        frame #2: 0x00000001c2412340 libsystem_c.dylib` abort  + 168
        frame #3: 0x00000001c2492b08 libc++abi.dylib` abort_message  + 132
        frame #4: 0x00000001c2482938 libc++abi.dylib` demangling_terminate_handler()  + 312
        frame #5: 0x00000001c2378330 libobjc.A.dylib` _objc_terminate()  + 160
        frame #6: 0x000000010008ef30 dcpdrain` backtrace_terminate_handler()  + 752 at terminate_handler.cc:88
        frame #7: 0x00000001c2491ea4 libc++abi.dylib` std::__terminate(void (*)())  + 20
        frame #8: 0x00000001c2494c1c libc++abi.dylib` __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*)  + 36
        frame #9: 0x00000001c2494bc8 libc++abi.dylib` __cxa_throw  + 140
        frame #10: 0x000000010002a77c dcpdrain` BinprotResponse::getTracingData() const [inlined] cb::mcbp::Header::getResponse(this=0x00006000002044a0) const  + 48 at header.h:134
        frame #11: 0x000000010002a74c dcpdrain` BinprotResponse::getTracingData() const [inlined] BinprotResponse::getResponse(this=<unavailable>) const  at client_mcbp_commands.cc:487
        frame #12: 0x000000010002a74c dcpdrain` BinprotResponse::getTracingData(this=0x000000016fdfef90) const  + 188 at client_mcbp_commands.cc:373
        frame #13: 0x000000010002a638 dcpdrain` MemcachedConnection::recvResponse(this=0x0000000101604080, response=0x000000016fdfef90, opcode=<unavailable>, readTimeout=<unavailable>)  + 84 at client_connection.cc:1043
        ...
        frame #21: 0x0000000100038f40 dcpdrain` MemcachedConnection::backoff_execute(..., context="DCP_CONTROL", ...)  + 100 at client_connection.cc:2016
        frame #22: 0x000000010002bab4 dcpdrain` MemcachedConnection::execute(this=0x0000000101604080, command=0x000000016fdfefb0, readTimeout=(__rep_ = 0))  + 168 at client_connection.cc:1998
        frame #23: 0x000000010000d688 dcpdrain` main  + 280 at dcpdrain.cc:451
        frame #24: 0x000000010000d570 dcpdrain` main(argc=<unavailable>, argv=<unavailable>)  + 8488 at dcpdrain.cc:929
        frame #25: 0x00000001005d508c dyld` start  + 520
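
A simplified model of this failure may help. The sketch below is hypothetical and does not use kv_engine's types: `FrameHeader`, `expectResponse()` and the constants are our own, with magic and opcode values taken from our understanding of the memcached binary protocol (0x80 client request, 0x81 client response, 0x5c DCP_NOOP, 0x5e DCP_CONTROL). It shows how a strict in-order read, which insists that the next frame is the response to the request just sent, rejects a NOOP request that arrives first.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Assumed values from the memcached binary protocol; not copied from kv_engine.
constexpr std::uint8_t kMagicClientRequest = 0x80;
constexpr std::uint8_t kMagicClientResponse = 0x81;
constexpr std::uint8_t kOpcodeDcpNoop = 0x5c;
constexpr std::uint8_t kOpcodeDcpControl = 0x5e;

// Hypothetical, heavily simplified header (a real MCBP header is 24 bytes).
struct FrameHeader {
    std::uint8_t magic;
    std::uint8_t opcode;
};

bool isResponse(const FrameHeader& h) {
    return h.magic == kMagicClientResponse;
}

// Strict in-order read: the next frame must be the response to the request
// just sent, otherwise throw (compare "Header is not a response" in the log).
FrameHeader expectResponse(const FrameHeader& next, std::uint8_t expectedOpcode) {
    if (!isResponse(next)) {
        throw std::logic_error("Header is not a response");
    }
    if (next.opcode != expectedOpcode) {
        throw std::logic_error("unexpected response opcode");
    }
    return next;
}
```

A DCP_CONTROL response passes this check, while a DCP_NOOP request arriving in its place throws, which is the shape of the crash above.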

Ideally dcpdrain should be robust to receiving DCP NOOP messages while
setting up the control flags, but that's not simple, as we use common
code in MemcachedConnection which sends a request and expects the
response (of type DCP_CONTROL) in order.
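
One way to get that robustness, sketched under stated assumptions: while waiting for a DCP_CONTROL response, answer any DCP_NOOP request that arrives and keep reading. `FakeConnection`, `Frame` and `awaitResponseHandlingNoops()` are hypothetical stand-ins, not the MemcachedConnection API, and the magic/opcode values are assumed from the memcached binary protocol.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <stdexcept>
#include <vector>

// Assumed protocol values (memcached binary protocol).
constexpr std::uint8_t kMagicClientRequest = 0x80;
constexpr std::uint8_t kMagicClientResponse = 0x81;
constexpr std::uint8_t kOpcodeDcpNoop = 0x5c;
constexpr std::uint8_t kOpcodeDcpControl = 0x5e;

struct Frame {
    std::uint8_t magic;
    std::uint8_t opcode;
};

// Hypothetical connection: frames the server sent us, and frames we wrote back.
struct FakeConnection {
    std::deque<Frame> inbound;
    std::vector<Frame> outbound;

    Frame read() {
        if (inbound.empty()) {
            throw std::runtime_error("connection closed");
        }
        Frame f = inbound.front();
        inbound.pop_front();
        return f;
    }

    void write(const Frame& f) { outbound.push_back(f); }
};

// Wait for the response to `opcode`, acknowledging any DCP_NOOP requests
// that arrive first instead of treating them as a protocol error.
Frame awaitResponseHandlingNoops(FakeConnection& conn, std::uint8_t opcode) {
    for (;;) {
        Frame f = conn.read();
        if (f.magic == kMagicClientRequest && f.opcode == kOpcodeDcpNoop) {
            // A real reply would also echo the request's opaque field.
            conn.write({kMagicClientResponse, kOpcodeDcpNoop});
            continue;
        }
        if (f.magic == kMagicClientResponse && f.opcode == opcode) {
            return f;
        }
        throw std::logic_error("unexpected frame during handshake");
    }
}
```

The difficulty described above is that such a loop would have to live inside the shared request/response path in MemcachedConnection, which other callers also rely on.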

To work around this problem, simply increase the default DCP noop
interval from 1 to 60 seconds; 60s /should/ be sufficient to complete
the handshake...
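
A back-of-the-envelope check of why 60s leaves far more headroom than 1s. This is a rough model, not measured behaviour: it assumes each DCP control message costs one network round trip and that the whole handshake must finish within one NOOP interval. The log above shows eight control messages; the 150 ms round-trip time is a hypothetical figure for a high-latency link, and `estimatedHandshake()` / `handshakeBeatsNoop()` are illustrative helpers.

```cpp
#include <cassert>
#include <chrono>

using namespace std::chrono;

// Rough model (assumption): one round trip per DCP control message.
constexpr milliseconds estimatedHandshake(int controlMessages, milliseconds rtt) {
    return controlMessages * rtt;
}

// Safe only if the handshake completes before the first NOOP is due.
constexpr bool handshakeBeatsNoop(int controlMessages, milliseconds rtt,
                                  seconds noopInterval) {
    return estimatedHandshake(controlMessages, rtt) < noopInterval;
}
```

Under this model, eight control messages at 150 ms take about 1.2 s, already exceeding a 1 s interval, while a 60 s interval tolerates round trips of just under 7.5 s each.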

Change-Id: I0f846956d6499ea54d74f781cb14d7982387c9f4
Reviewed-on: https://review.couchbase.org/c/kv_engine/+/190418
Tested-by: Build Bot <[email protected]>
Reviewed-by: Trond Norbye <[email protected]>
daverigby authored and trondn committed May 4, 2023
1 parent a2c2054 commit 961a3a3
Showing 1 changed file with 1 addition and 1 deletion.
programs/dcpdrain/dcpdrain.cc

    @@ -902,7 +902,7 @@ int main(int argc, char** argv) {
                     {"supports_hifi_MFU", "true"},
                     {"send_stream_end_on_client_close_stream", "true"},
                     {"enable_expiry_opcode", "true"},
    -                {"set_noop_interval", "1"},
    +                {"set_noop_interval", "60"},
                     {"enable_noop", "true"}};
             }
