Skip to content

Conversation

cxljs
Copy link
Contributor

@cxljs cxljs commented Mar 5, 2025

ref: #2806

Hi @git-hulk , can you review the pr if you have free time?

@git-hulk
Copy link
Member

git-hulk commented Mar 5, 2025

@fukua95 sure, thank you.

@@ -853,6 +840,19 @@ void Server::cron() {
CleanupExitedSlaves();
recordInstantaneousMetrics();
}

assert(stop_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's weird to move those codes from Stop() to cron(). I suspect the issue was caused by the server being stopped before the replication was started.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's weird. But I don't find a better place to impl the logic, so I read the Redis code and find Redis is also impl the logic in its serverCron().

Copy link
Contributor Author

@cxljs cxljs Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect the issue was caused by the server being stopped before the replication was started.

I think before the replication was started, the ReplicationThread::stop_flag_ is true, so ReplicationThread::Stop() will not call the join().

And according to the log:

I20250226 11:50:45.351392 140030983013952 main.cc:50] Signal Terminated (15) received, stopping the server
W20250226 11:50:45.351526 140030983013952 replication.cc:371] Replication thread operation failed: thread #140030983013952 cannot be `join`ed: Resource deadlock avoided
I20250226 11:50:45.351537 140030983013952 replication.cc:373] [replication] Stopped

3952 (the repl thread) receive the signal -> call Server::Stop() -> join itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3952 (the repl thread) receive the signal -> call Server::Stop() -> join itself.

The repl thread won't receive the terminated signal and stop the server.

Copy link
Contributor Author

@cxljs cxljs Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repl thread won't receive the terminated signal and stop the server.

Why do you have this conclusion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I20250226 11:50:45.351392 140030983013952 main.cc:50] Signal Terminated (15) received, stopping the server

Only the main thread will handle the signal. After receiving the terminated signal, the server->Stop() will be called, and it will then stop the replication thread afterward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@PragmaTwice PragmaTwice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it's possible to add a test case for this?

@cxljs
Copy link
Contributor Author

cxljs commented Mar 6, 2025

I'm wondering if it's possible to add a test case for this?

I test it, but I can't reproduce the bug.

In a multi-thread process, every thread have the chance to be interrupted by derive. When the repl thread be interrupted, it will implement the signal handle function, and the handle function will join the repl thread, which cause deadlock.

I20250226 11:50:45.351392 140030983013952 main.cc:50] Signal Terminated (15) received, stopping the server
W20250226 11:50:45.351526 140030983013952 replication.cc:371] Replication thread operation failed: thread #140030983013952 cannot be `join`ed: Resource deadlock avoided
I20250226 11:50:45.351537 140030983013952 replication.cc:373] [replication] Stopped

From the log, we know the xxx3952 is the repl thread ID, and it received the signal -> call Server::Stop() -> wants to join itself (xxx3952).

Copy link

sonarqubecloud bot commented Mar 6, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants