Khepri: timeouts when one of the nodes stops responding #10753

mkuratczyk · 2024-03-15T10:08:28Z

Describe the bug

During chaos tests where one of the VMs/nodes is suddenly restarted, timeouts like this occur:

   crasher:
     initial call: rabbit_prequeue:init/1
     pid: <0.1007.0>
     registered_name: []
     exception exit: {{badrecord,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      '[email protected]'}}}},
                      [{dict,map_dict,2,[{file,"dict.erl"},{line,467}]},
                       {rabbit_amqqueue,internal_delete,3,
                           [{file,"rabbit_amqqueue.erl"},{line,1805}]},
                       {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',
                           7,
                           [{file,"rabbit_amqqueue_process.erl"},{line,332}]},
                       {rabbit_amqqueue_process,terminate_shutdown,2,
                           [{file,"rabbit_amqqueue_process.erl"},{line,362}]},
                       {gen_server2,terminate,3,
                           [{file,"gen_server2.erl"},{line,1158}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1048}]},
                       {proc_lib,wake_up,3,
                           [{file,"proc_lib.erl"},{line,251}]}]}

   crasher:
     initial call: rabbit_channel:init/1
     pid: <0.90831.0>
     registered_name: []
     exception exit: {{case_clause,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      '[email protected]'}}}},
                      [{rabbit_channel,binding_action,10,
                           [{file,"rabbit_channel.erl"},{line,1825}]},
                       {rabbit_channel,handle_method,3,
                           [{file,"rabbit_channel.erl"},{line,1614}]},
                       {rabbit_channel,handle_cast,2,
                           [{file,"rabbit_channel.erl"},{line,631}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1056}]},
                       {proc_lib,init_p_do_apply,3,
                           [{file,"proc_lib.erl"},{line,241}]}]}
       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)

Timeouts are of course expected when machines disappear, but we need to think through these scenarios and decide how to handle them. Either way, we probably should not log stack traces like these.
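Both crashes come from code that receives `{error, {timeout, {rabbitmq_metadata, Node}}}` where it expected a success value, so the process exits with `badrecord` or `case_clause`. As a rough illustration only, the sketch below shows one way a caller could match the timeout tuple explicitly and log a single warning instead of crashing. The module name `timeout_sketch`, the function `handle_store_result/1`, and the return values `retry_later` and `failed` are hypothetical and are not part of RabbitMQ or Khepri; only the shape of the error term is taken from the reports above.

```erlang
-module(timeout_sketch).
-export([handle_store_result/1]).

%% Hypothetical helper (not RabbitMQ code): classifies the result of a
%% metadata store operation so that a timeout is handled explicitly
%% instead of surfacing as a case_clause or badrecord crash.
-spec handle_store_result(ok | {error, term()}) ->
          ok | {retry_later, term()} | {failed, term()}.
handle_store_result(ok) ->
    ok;
handle_store_result({error, {timeout, ServerId}}) ->
    %% Same shape as the error term in the crash reports:
    %% {error, {timeout, {rabbitmq_metadata, Node}}}.
    logger:warning("metadata store operation timed out: ~tp", [ServerId]),
    {retry_later, ServerId};
handle_store_result({error, Reason}) ->
    %% Any other error is passed back to the caller unchanged.
    {failed, Reason}.
```

Whether the caller should retry, return an error to the client, or do something else is exactly the open question in this issue; the sketch only shows that the timeout can be matched without a stack trace.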

Reproduction steps

It was a chaos test with a workload, including queue deletions and random restarts.

Expected behavior

?

Additional context

No response

mkuratczyk added the bug label Mar 15, 2024

the-mikedavis self-assigned this Mar 26, 2024

the-mikedavis mentioned this issue Apr 3, 2024

Handle database timeouts from Khepri minority #10915

Closed

the-mikedavis mentioned this issue Aug 22, 2024

Return errors when a queue record fails to be removed #12082

Merged

the-mikedavis closed this as completed in #12082 Aug 27, 2024

mergify bot mentioned this issue Aug 27, 2024

Return errors when a queue record fails to be removed (backport #12082) #12130

Merged

michaelklishin added this to the 4.0.0 milestone Aug 27, 2024