fix: the repl thread may join itself when the server is stopped #2815
base: unstable
@fukua95 sure, thank you.
@@ -853,6 +840,19 @@ void Server::cron() {
CleanupExitedSlaves();
recordInstantaneousMetrics();
}

assert(stop_);
It's weird to move this code from `Stop()` to `cron()`. I suspect the issue was caused by the server being stopped before replication was started.
Yes, it's weird. But I couldn't find a better place to implement the logic, so I read the Redis code and found that Redis also implements this logic in its `serverCron()`.
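For reference, the general shape of the `serverCron()`-style approach can be sketched as follows. This is a minimal illustration, not the kvrocks or Redis source: the names `g_stop_requested`, `HandleTerminate`, `StopServer`, `CronTick`, and `InstallTerminateHandler` are hypothetical. The signal handler only records the request, and the periodic cron tick performs the actual teardown on its own thread.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// All names below are hypothetical stand-ins for illustration.
std::atomic<bool> g_stop_requested{false};
int g_stop_calls = 0;

// Signal handler: only records the request. No joins, locks, or logging here.
void HandleTerminate(int /*signum*/) { g_stop_requested.store(true); }

// Stand-in for the real teardown (stopping replication, workers, and so on).
void StopServer() {
  assert(g_stop_requested.load());  // only reached after a stop was requested
  ++g_stop_calls;
}

// One iteration of the periodic loop. Returns true once the shutdown ran,
// so the caller can leave its loop.
bool CronTick() {
  if (!g_stop_requested.load()) return false;
  StopServer();
  return true;
}

void InstallTerminateHandler() { std::signal(SIGTERM, HandleTerminate); }
```

With this shape, the thread that happens to run the handler never executes the teardown itself, so it cannot end up joining its own thread from inside the handler.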
> I suspect the issue was caused by the server being stopped before the replication was started.

I think that before replication starts, `ReplicationThread::stop_flag_` is true, so `ReplicationThread::Stop()` will not call `join()`.
And according to the log:
I20250226 11:50:45.351392 140030983013952 main.cc:50] Signal Terminated (15) received, stopping the server
W20250226 11:50:45.351526 140030983013952 replication.cc:371] Replication thread operation failed: thread #140030983013952 cannot be `join`ed: Resource deadlock avoided
I20250226 11:50:45.351537 140030983013952 replication.cc:373] [replication] Stopped
Thread `3952` (the repl thread) receives the signal -> calls `Server::Stop()` -> joins itself.
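The self-join failure mode can be reproduced in isolation. The sketch below is a standalone illustration with a hypothetical helper name, `JoinSelfErrorCode`, and is not kvrocks code. It assumes a standard library that reports a self-join as an error, as glibc with libstdc++ does, rather than blocking. The C++ standard lists `resource_deadlock_would_occur` as the error condition for `join()` when the calling thread is the thread being joined, which corresponds to the `EDEADLK` text in the log above.

```cpp
#include <cassert>
#include <future>
#include <system_error>
#include <thread>

// Hypothetical helper: returns the error code produced when a thread calls
// join() on its own std::thread handle.
std::error_code JoinSelfErrorCode() {
  std::thread worker;
  std::promise<void> handle_ready;
  std::promise<void> attempt_done;
  std::future<void> ready = handle_ready.get_future();
  std::future<void> done = attempt_done.get_future();
  std::error_code observed;

  worker = std::thread([&] {
    ready.wait();  // wait until `worker` has been assigned by the parent
    try {
      worker.join();  // the thread tries to join itself
    } catch (const std::system_error& e) {
      observed = e.code();
    }
    attempt_done.set_value();
  });

  handle_ready.set_value();
  done.wait();    // do not touch `worker` while the thread is still using it
  worker.join();  // the legitimate join, from a different thread
  return observed;
}
```

Calling `JoinSelfErrorCode()` and comparing the result against `std::errc::resource_deadlock_would_occur` shows whether the platform reports the self-join the same way as the log.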
> 3952 (the repl thread) receive the signal -> call Server::Stop() -> join itself.

The repl thread is not the one that receives the termination signal and stops the server.
> The repl thread won't receive the terminated signal and stop the server.

What leads you to this conclusion?
> I20250226 11:50:45.351392 140030983013952 main.cc:50] Signal Terminated (15) received, stopping the server

Only the main thread handles the signal. After it receives the termination signal, `server->Stop()` is called, which then stops the replication thread.
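Whether only the main thread handles the signal depends on the signal masks: POSIX allows a process-directed signal to be delivered to any thread that does not have it blocked. This thread does not show how kvrocks configures its masks, so the following is only a sketch of one common way to guarantee a single owner, with hypothetical helper names. The idea is to block `SIGTERM` before spawning threads (new threads inherit the mask) and to consume the signal synchronously with `sigwait` on the owning thread.

```cpp
#include <cassert>
#include <future>
#include <thread>

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

// Hypothetical helpers for illustration; not taken from kvrocks.

// Blocks SIGTERM in the calling thread. Threads created afterwards inherit it.
void BlockTerminateInThisThread() {
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGTERM);
  int rc = pthread_sigmask(SIG_BLOCK, &set, nullptr);
  assert(rc == 0);
  (void)rc;
}

// Waits on the calling thread until SIGTERM is pending; returns the signal.
int WaitForTerminate() {
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGTERM);
  int signum = 0;
  int rc = sigwait(&set, &signum);
  assert(rc == 0);
  (void)rc;
  return signum;
}

// Demo: a worker is alive while SIGTERM arrives, but only the caller sees it.
int DemoSignalOwnedByCaller() {
  BlockTerminateInThisThread();
  std::promise<void> release;
  std::future<void> released = release.get_future();
  std::thread worker([&] { released.wait(); });  // inherits the blocked mask

  kill(getpid(), SIGTERM);          // process-directed signal
  int signum = WaitForTerminate();  // consumed here, never in the worker

  release.set_value();
  worker.join();
  return signum;
}
```

With this setup, no asynchronous handler runs in any thread, so a worker such as the replication thread cannot be the one that starts the shutdown.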
> I'm wondering if it's possible to add a test case for this?

I tested it, but I couldn't reproduce the bug. In a multi-threaded process, any thread may be interrupted to handle the signal. When the repl thread is the one interrupted, it runs the signal handler, and the handler joins the repl thread, which causes the deadlock.
From the log, we know that xxx3952 is the repl thread ID, and that it received the signal -> call
ref: #2806
Hi @git-hulk, could you review the PR when you have free time?