Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OF-2921: Prevent deadlock by not broadcasting synchronously #2635

Merged
merged 3 commits into from
Jan 4, 2025

Conversation

guusdk
Copy link
Member

@guusdk guusdk commented Dec 17, 2024

When a client drops offline, a presence update is broadcast on behalf of that client. We've observed deadlocks occurring when this happens (as the broadcast itself, especially when it occurs in context in of MUC rooms, will acquire mutexes).

There appears to be little reason for this broadcast to happen synchronous to the process of session termination process. To reduce the likelihood of deadlocks, this commit makes the broadcast happen in an asynchronous process.

When a client drops offline, a presence update is broadcast on behalf of that client. We've observed deadlocks occurring when this happens (as the broadcast itself, especially when it occurs in context in of MUC rooms, will acquire mutexes).

There appears to be little reason for this broadcast to happen synchronous to the process of session termination process. To reduce the likelihood of deadlocks, this commit makes the broadcast happen in an asynchronous process.
@guusdk guusdk requested review from dwd and Fishbowler December 17, 2024 15:41
guusdk added 2 commits January 2, 2025 15:40
When stream management is enabled, deliver and record stanzas under a mutex. If it's not enabled, don't acquire the lock to reduce lock contention.
When a websocket error is raised while holding another mutex (eg: when processing a MUC stanza) then having _another_ mutex introduces a deadlock risk. This has been observed to happen in the wild.

In this commit, the mutex held when processing a websocket error is removed.
@guusdk guusdk added the backport 4.9 on merge, GHA will generate a PR with these changes against 4.9 branch label Jan 2, 2025
@akrherz
Copy link
Member

akrherz commented Jan 4, 2025

IIRC, this was done synchronously due to issues we observed with remote MUC users in open_chat not seeing the shutdown when Ignite's Openfire was restarted.

@guusdk
Copy link
Member Author

guusdk commented Jan 4, 2025

I don't quite recall that issue, but given that this introduces deadlocks (which causes the system to become unusable and throw out-of-memory exceptions, the cure might be worse than the ailment in this case.

@akrherz
Copy link
Member

akrherz commented Jan 4, 2025

Agreed, I think this was https://igniterealtime.atlassian.net/browse/OF-1688

@akrherz akrherz merged commit cdece9b into igniterealtime:main Jan 4, 2025
17 checks passed
Copy link

github-actions bot commented Jan 4, 2025

Successfully created backport PR for 4.9:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 4.9 on merge, GHA will generate a PR with these changes against 4.9 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants