## Explanation
### Current State
The WebSocket service experienced a critical issue when the backend
server accepted connections but immediately disconnected afterward. I
was only able to reproduce this behavior locally, not against the
production endpoint.
### The Problem
When the server accepts the WebSocket connection but then disconnects
rapidly:
- The client would immediately attempt to reconnect (fast reconnection)
- This created a connection thrashing loop: connect → disconnect →
reconnect → disconnect
- Multiple code paths could trigger reconnection attempts independently
- The `scheduleConnect` method could be called multiple times during
rapid disconnect cycles
- Multiple overlapping reconnection attempts could be scheduled
simultaneously
- All disconnected clients would reconnect at exactly the same time,
creating a "thundering herd" problem
- This led to unpredictable behavior and duplicate connection attempts,
and it worsened the reconnection storm
### The Solution
This PR fixes the issue with three key improvements:
#### 1. **Centralize Reconnection via `scheduleConnect`**
All reconnection logic now flows through a single entry point:
- All code paths that need to trigger reconnection now use
`scheduleConnect`
- Eliminates scattered reconnection logic across the codebase
- Provides a single point of control for reconnection behavior
- Makes the reconnection strategy consistent and predictable
- Easier to maintain, test, and debug reconnection logic
#### 2. **Idempotent `scheduleConnect`**
Makes `scheduleConnect` idempotent to ensure multiple calls don't create
duplicate or overlapping connection attempts:
- Checks if a reconnection is already scheduled before creating a new
one
- Ensures only one reconnection timer is active at any given time
- Eliminates race conditions from concurrent connection attempts
- Maintains clear, single-threaded reconnection state
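
The idempotency guard described above can be sketched as follows. This is a minimal illustration, not the actual `BackendWebSocketService` code: the `ReconnectScheduler` class, its fields, and the `cancel`/`getAttempts` helpers are hypothetical names chosen for the example. The only idea taken from this PR is that a second call is a no-op while a reconnect timer is already pending.

```typescript
// Hypothetical sketch: class, field, and method names are illustrative and
// are not taken from the BackendWebSocketService source.
class ReconnectScheduler {
  // Handle of the pending reconnect timer, or null when nothing is scheduled.
  private timer: ReturnType<typeof setTimeout> | null = null;

  // Number of reconnects that were actually scheduled (duplicates excluded).
  private attempts = 0;

  constructor(
    private readonly connect: () => void,
    private readonly delayMs: number,
  ) {}

  /** Schedules one reconnect; returns false if one is already pending. */
  scheduleConnect(): boolean {
    if (this.timer !== null) {
      // A reconnect is already scheduled: do not create a second timer
      // and do not bump the attempt counter.
      return false;
    }
    this.attempts += 1;
    this.timer = setTimeout(() => {
      // Clear the handle first so that a failure inside connect() can
      // schedule the next attempt.
      this.timer = null;
      this.connect();
    }, this.delayMs);
    return true;
  }

  /** Cancels the pending reconnect, if any. */
  cancel(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
  }

  getAttempts(): number {
    return this.attempts;
  }
}
```

With this shape, every close or error handler can call `scheduleConnect()` freely, and at most one timer exists per client no matter how many handlers fire during a rapid disconnect cycle.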
#### 3. **Jitter on Reconnection**
Adds randomized jitter to reconnection timing to prevent thundering herd
problems:
- Randomizes the reconnection delay to spread out reconnection attempts
across clients
- When the server disconnects many clients simultaneously, they won't
all reconnect at the exact same moment
- Reduces load spikes on the server during recovery from incidents
- Improves overall system stability and reliability
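
To make the jitter idea concrete, here is a minimal sketch of a jittered exponential backoff delay. It is an assumption-laden illustration rather than the calculation this PR uses: the `jitteredDelay` function, its parameters, and the choice of the "equal jitter" variant (half of the delay fixed, half randomized) are all hypothetical.

```typescript
// Hypothetical sketch: function name, parameters, and the "equal jitter"
// variant are assumptions for illustration, not the PR's actual formula.
function jitteredDelay(
  attempt: number, // 0-based reconnect attempt number
  baseMs: number, // delay before jitter for attempt 0
  maxMs: number, // upper bound on the un-jittered delay
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  // Exponential growth, capped at maxMs.
  const exponential = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  // Keep half of the delay fixed and randomize the other half, so the
  // result lies in [exponential / 2, exponential).
  const half = exponential / 2;
  return half + random() * half;
}
```

Because each client draws its own random value, clients that were disconnected at the same instant end up with different delays, which spreads their reconnects over a window instead of a single moment.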
**Combined Effect:**
With all three fixes, when the server accepts connections but
disconnects immediately:
- All reconnection attempts flow through a single, centralized method
(consistency)
- Only one reconnection attempt will be scheduled per client
(idempotency)
- Reconnection attempts are staggered across the client population
(jitter)
- The reconnection logic follows an exponential backoff strategy with
jitter
- No duplicate timers or overlapping connection attempts
- Predictable, controlled reconnection behavior that respects server
health
- Server experiences gradual reconnection load rather than synchronized
spikes
### Technical Details
The fix centralizes all reconnection logic through `scheduleConnect`,
makes it idempotent by guarding against duplicate timer creation, and
adds jitter to the reconnection delay calculation to randomize timing
across clients. This prevents both the individual reconnection storm
(from rapid connect/disconnect cycles) and the collective thundering
herd problem (from synchronized reconnections), while ensuring
consistent behavior across the entire service.
## References
- Related to `BackendWebSocketService` in `@metamask/core-backend`
- Fixes # [ADD ISSUE NUMBER HERE]
## Checklist
- [x] I've updated the test suite for new or updated code as appropriate
- [x] I've updated documentation (JSDoc, Markdown, etc.) for new or
updated code as appropriate
- [x] I've communicated my changes to consumers by [updating changelogs for packages I've changed](https://github.com/MetaMask/core/tree/main/docs/contributing.md#updating-changelogs),
highlighting breaking changes as necessary
- [ ] I've prepared draft pull requests for clients and consumer
packages to resolve any breaking changes
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Introduces a controlled `forceReconnection()` flow, idempotent/jittered backoff scheduling, and stability timer to prevent reconnect thrashing; updates AccountActivityService to use it and expands tests.
>
> - **BackendWebSocketService**
>   - **New API**: `forceReconnection()` messenger action and method for controlled disconnect-then-reconnect.
>   - **Reconnection**: Idempotent scheduler using `ExponentialBackoff` with jitter; `connect()` no-ops if a reconnect is scheduled; stable-connection timer (10s) resets attempts/backoff.
>   - **Behavior changes**: `disconnect()` now synchronous and resets attempts; improved error handling and timer cleanup; avoids duplicate timers and races.
>   - **Actions/Types**: Expose `forceReconnection`; update method action types union.
> - **AccountActivityService**
>   - Switch to `BackendWebSocketService:forceReconnection` for recovery; remove direct `disconnect`/`connect` sequence.
>   - Update allowed messenger actions accordingly.
> - **Tests**
>   - Add/adjust tests for new reconnection semantics, stable timer, idempotent scheduling, and force reconnection flows; minor test cleanup tweaks.
> - **Changelog**
>   - Document new API, reconnection improvements, fixes for race conditions/memory leaks, and breaking messenger action change for AccountActivityService.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 373a4f6. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
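
The stable-connection timer mentioned in the summary can be sketched roughly as below. This is a hypothetical illustration under stated assumptions, not the service's code: `StabilityTracker`, `Schedule`, `realSchedule`, and the method names are invented for the example. The only behavior taken from this PR is that the attempt counter resets only after the connection has stayed open for 10 seconds, so a server that accepts and then immediately closes the socket does not reset the backoff.

```typescript
// Hypothetical sketch: all names here are illustrative assumptions.
type Cancel = () => void;
type Schedule = (callback: () => void, delayMs: number) => Cancel;

// Default scheduler backed by setTimeout; tests can inject a fake instead.
const realSchedule: Schedule = (callback, delayMs) => {
  const handle = setTimeout(callback, delayMs);
  return () => clearTimeout(handle);
};

class StabilityTracker {
  private attempts = 0;

  // Cancels the pending stability timer, or null when none is running.
  private cancelStableTimer: Cancel | null = null;

  constructor(
    private readonly stableAfterMs: number = 10000,
    private readonly schedule: Schedule = realSchedule,
  ) {}

  /** Call whenever a reconnect attempt is scheduled. */
  recordReconnectAttempt(): void {
    this.attempts += 1;
  }

  /** Call when the socket opens: start waiting for the connection to prove stable. */
  onOpen(): void {
    this.clearStableTimer();
    this.cancelStableTimer = this.schedule(() => {
      // The connection survived the stability window: reset the backoff.
      this.cancelStableTimer = null;
      this.attempts = 0;
    }, this.stableAfterMs);
  }

  /** Call when the socket closes: an early close keeps the attempt count. */
  onClose(): void {
    this.clearStableTimer();
  }

  getAttempts(): number {
    return this.attempts;
  }

  private clearStableTimer(): void {
    if (this.cancelStableTimer !== null) {
      this.cancelStableTimer();
      this.cancelStableTimer = null;
    }
  }
}
```

Resetting on the `open` event alone would let an accept-then-close server keep every client at the minimum delay; deferring the reset until the window elapses lets the backoff keep growing during such an incident.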
`packages/core-backend/CHANGELOG.md` (27 additions & 0 deletions)

@@ -7,9 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Add `forceReconnection()` method to `BackendWebSocketService` for controlled subscription state cleanup ([#6861](https://github.com/MetaMask/core/pull/6861))
- Performs a controlled disconnect-then-reconnect sequence with exponential backoff
- Useful for recovering from subscription/unsubscription issues and cleaning up orphaned subscriptions
- Connection must stay stable for 10 seconds before resetting reconnect attempts
- Prevents issues when server accepts connection then immediately closes it

### Changed

- Bump `@metamask/base-controller` from `^8.4.1` to `^8.4.2` ([#6917](https://github.com/MetaMask/core/pull/6917))
- Update `AccountActivityService` to use new `forceReconnection()` method instead of manually calling disconnect/connect ([#6861](https://github.com/MetaMask/core/pull/6861))
- Remove redundant schedule calls from error paths
- Update `BackendWebSocketService.disconnect()` to reset reconnect attempts counter ([#6861](https://github.com/MetaMask/core/pull/6861))
- Update `BackendWebSocketService.disconnect()` return type from `Promise<void>` to `void` ([#6861](https://github.com/MetaMask/core/pull/6861))
- Improve logging throughout `BackendWebSocketService` for better debugging ([#6861](https://github.com/MetaMask/core/pull/6861))

### Fixed

- Fix potential race condition in `BackendWebSocketService.connect()` that could bypass exponential backoff when reconnect is already scheduled ([#6861](https://github.com/MetaMask/core/pull/6861))
- Fix memory leak from orphaned timers when multiple reconnects are scheduled ([#6861](https://github.com/MetaMask/core/pull/6861))
- Fix issue where reconnect attempts counter could grow unnecessarily with duplicate scheduled reconnects ([#6861](https://github.com/MetaMask/core/pull/6861))