Skip to content

Commit 8974527

Browse files
authored
fix(core-backend): reconnection logic (#6861)
## Explanation ### Current State The WebSocket service experienced a critical issue when the backend server accepted connections but immediately disconnected afterward. I was only able to reproduce this behavior locally, not in prod endpoint. ### The Problem When the server accepts the WebSocket connection but then disconnects rapidly: - The client would immediately attempt to reconnect (fast reconnection) - This created a connection thrashing loop: connect → disconnect → reconnect → disconnect - Multiple code paths could trigger reconnection attempts independently - The `scheduleConnect` method could be called multiple times during rapid disconnect cycles - Multiple overlapping reconnection attempts could be scheduled simultaneously - All disconnected clients would reconnect at exactly the same time, creating a "thundering herd" problem - This led to unpredictable behavior, duplicate connection attempts, and worsened the reconnection storm ### The Solution This PR fixes the issue with three key improvements: #### 1. **Centralize Reconnection via `scheduleConnect`** All reconnection logic now flows through a single entry point: - All code paths that need to trigger reconnection now use `scheduleConnect` - Eliminates scattered reconnection logic across the codebase - Provides a single point of control for reconnection behavior - Makes the reconnection strategy consistent and predictable - Easier to maintain, test, and debug reconnection logic #### 2. **Idempotent `scheduleConnect`** Makes `scheduleConnect` idempotent to ensure multiple calls don't create duplicate or overlapping connection attempts: - Checks if a reconnection is already scheduled before creating a new one - Ensures only one reconnection timer is active at any given time - Eliminates race conditions from concurrent connection attempts - Maintains clear, single-threaded reconnection state #### 3. **Jitter on Reconnection** Adds randomized jitter to reconnection timing to prevent thundering herd problems: - Randomizes the reconnection delay to spread out reconnection attempts across clients - When the server disconnects many clients simultaneously, they won't all reconnect at the exact same moment - Reduces load spikes on the server during recovery from incidents - Improves overall system stability and reliability **Combined Effect:** With all three fixes, when the server accepts connections but disconnects immediately: - All reconnection attempts flow through a single, centralized method (consistency) - Only one reconnection attempt will be scheduled per client (idempotency) - Reconnection attempts are staggered across the client population (jitter) - The reconnection logic follows proper backoff strategy with randomization - No duplicate timers or overlapping connection attempts - Predictable, controlled reconnection behavior that respects server health - Server experiences gradual reconnection load rather than synchronized spikes ### Technical Details The fix centralizes all reconnection logic through `scheduleConnect`, makes it idempotent by guarding against duplicate timer creation, and adds jitter to the reconnection delay calculation to randomize timing across clients. This prevents both the individual reconnection storm (from rapid connect/disconnect cycles) and the collective thundering herd problem (from synchronized reconnections), while ensuring consistent behavior across the entire service. ## References - Related to `BackendWebSocketService` in `@metamask/core-backend` - Fixes # [ADD ISSUE NUMBER HERE] ## Checklist - [x] I've updated the test suite for new or updated code as appropriate - [x] I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate - [x] I've communicated my changes to consumers by [updating changelogs for packages I've changed](https://github.com/MetaMask/core/tree/main/docs/contributing.md#updating-changelogs), highlighting breaking changes as necessary - [ ] I've prepared draft pull requests for clients and consumer packages to resolve any breaking changes <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Introduces a controlled `forceReconnection()` flow, idempotent/jittered backoff scheduling, and stability timer to prevent reconnect thrashing; updates AccountActivityService to use it and expands tests. > > - **BackendWebSocketService** > - **New API**: `forceReconnection()` messenger action and method for controlled disconnect-then-reconnect. > - **Reconnection**: Idempotent scheduler using `ExponentialBackoff` with jitter; `connect()` no-ops if a reconnect is scheduled; stable-connection timer (10s) resets attempts/backoff. > - **Behavior changes**: `disconnect()` now synchronous and resets attempts; improved error handling and timer cleanup; avoids duplicate timers and races. > - **Actions/Types**: Expose `forceReconnection`; update method action types union. > - **AccountActivityService** > - Switch to `BackendWebSocketService:forceReconnection` for recovery; remove direct `disconnect`/`connect` sequence. > - Update allowed messenger actions accordingly. > - **Tests** > - Add/adjust tests for new reconnection semantics, stable timer, idempotent scheduling, and force reconnection flows; minor test cleanup tweaks. > - **Changelog** > - Document new API, reconnection improvements, fixes for race conditions/memory leaks, and breaking messenger action change for AccountActivityService. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 373a4f6. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
1 parent 3923250 commit 8974527

File tree

6 files changed

+455
-158
lines changed

6 files changed

+455
-158
lines changed

packages/core-backend/CHANGELOG.md

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,9 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
### Added
11+
12+
- Add `forceReconnection()` method to `BackendWebSocketService` for controlled subscription state cleanup ([#6861](https://github.com/MetaMask/core/pull/6861))
13+
- Performs a controlled disconnect-then-reconnect sequence with exponential backoff
14+
- Useful for recovering from subscription/unsubscription issues and cleaning up orphaned subscriptions
15+
- Add `BackendWebSocketService:forceReconnection` messenger action
16+
- Add stable connection timer to prevent rapid reconnection loops ([#6861](https://github.com/MetaMask/core/pull/6861))
17+
- Connection must stay stable for 10 seconds before resetting reconnect attempts
18+
- Prevents issues when server accepts connection then immediately closes it
19+
1020
### Changed
1121

1222
- Bump `@metamask/base-controller` from `^8.4.1` to `^8.4.2` ([#6917](https://github.com/MetaMask/core/pull/6917))
23+
- Update `AccountActivityService` to use new `forceReconnection()` method instead of manually calling disconnect/connect ([#6861](https://github.com/MetaMask/core/pull/6861))
24+
- **BREAKING:** Update allowed actions for `AccountActivityService` messenger: remove `BackendWebSocketService:disconnect`, add `BackendWebSocketService:forceConnect` ([#6861](https://github.com/MetaMask/core/pull/6861))
25+
- Improve reconnection scheduling in `BackendWebSocketService` to be idempotent ([#6861](https://github.com/MetaMask/core/pull/6861))
26+
- Prevents duplicate reconnection timers and inflated attempt counters
27+
- Scheduler checks if reconnect is already scheduled before creating new timer
28+
- Improve error handling in `BackendWebSocketService.connect()` ([#6861](https://github.com/MetaMask/core/pull/6861))
29+
- Always schedule reconnect on connection failure (exponential backoff prevents aggressive retries)
30+
- Remove redundant schedule calls from error paths
31+
- Update `BackendWebSocketService.disconnect()` to reset reconnect attempts counter ([#6861](https://github.com/MetaMask/core/pull/6861))
32+
- Update `BackendWebSocketService.disconnect()` return type from `Promise<void>` to `void` ([#6861](https://github.com/MetaMask/core/pull/6861))
33+
- Improve logging throughout `BackendWebSocketService` for better debugging ([#6861](https://github.com/MetaMask/core/pull/6861))
34+
35+
### Fixed
36+
37+
- Fix potential race condition in `BackendWebSocketService.connect()` that could bypass exponential backoff when reconnect is already scheduled ([#6861](https://github.com/MetaMask/core/pull/6861))
38+
- Fix memory leak from orphaned timers when multiple reconnects are scheduled ([#6861](https://github.com/MetaMask/core/pull/6861))
39+
- Fix issue where reconnect attempts counter could grow unnecessarily with duplicate scheduled reconnects ([#6861](https://github.com/MetaMask/core/pull/6861))
1340

1441
## [2.1.0]
1542

packages/core-backend/src/AccountActivityService.test.ts

Lines changed: 13 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -71,7 +71,7 @@ const getMessenger = () => {
7171
// Create mock action handlers
7272
const mockGetSelectedAccount = jest.fn();
7373
const mockConnect = jest.fn();
74-
const mockDisconnect = jest.fn();
74+
const mockForceReconnection = jest.fn();
7575
const mockSubscribe = jest.fn();
7676
const mockChannelHasSubscription = jest.fn();
7777
const mockGetSubscriptionsByChannel = jest.fn();
@@ -89,8 +89,8 @@ const getMessenger = () => {
8989
mockConnect,
9090
);
9191
rootMessenger.registerActionHandler(
92-
'BackendWebSocketService:disconnect',
93-
mockDisconnect,
92+
'BackendWebSocketService:forceReconnection',
93+
mockForceReconnection,
9494
);
9595
rootMessenger.registerActionHandler(
9696
'BackendWebSocketService:subscribe',
@@ -123,7 +123,7 @@ const getMessenger = () => {
123123
mocks: {
124124
getSelectedAccount: mockGetSelectedAccount,
125125
connect: mockConnect,
126-
disconnect: mockDisconnect,
126+
forceReconnection: mockForceReconnection,
127127
subscribe: mockSubscribe,
128128
channelHasSubscription: mockChannelHasSubscription,
129129
getSubscriptionsByChannel: mockGetSubscriptionsByChannel,
@@ -222,7 +222,7 @@ type WithServiceCallback<ReturnValue> = (payload: {
222222
mocks: {
223223
getSelectedAccount: jest.Mock;
224224
connect: jest.Mock;
225-
disconnect: jest.Mock;
225+
forceReconnection: jest.Mock;
226226
subscribe: jest.Mock;
227227
channelHasSubscription: jest.Mock;
228228
getSubscriptionsByChannel: jest.Mock;
@@ -464,28 +464,22 @@ describe('AccountActivityService', () => {
464464
);
465465
});
466466

467-
it('should handle disconnect failures during force reconnection by logging error and continuing gracefully', async () => {
467+
it('should handle subscription failure by calling forceReconnection', async () => {
468468
await withService(async ({ service, mocks }) => {
469-
// Mock disconnect to fail - this prevents the reconnect step from executing
470-
mocks.disconnect.mockRejectedValue(
471-
new Error('Disconnect failed during force reconnection'),
472-
);
473-
474-
// Trigger scenario that causes force reconnection by making subscribe fail
469+
// Mock subscribe to fail
475470
mocks.subscribe.mockRejectedValue(new Error('Subscription failed'));
476471

477-
// Should handle both subscription failure and disconnect failure gracefully - should not throw
472+
// Should handle subscription failure gracefully - should not throw
478473
const result = await service.subscribe({ address: '0x123abc' });
479474
expect(result).toBeUndefined();
480475

481476
// Verify the subscription was attempted
482477
expect(mocks.subscribe).toHaveBeenCalledTimes(1);
483478

484-
// Verify disconnect was attempted (but failed, preventing reconnection)
485-
expect(mocks.disconnect).toHaveBeenCalledTimes(1);
479+
// Verify forceReconnection was called (lines 289-290)
480+
expect(mocks.forceReconnection).toHaveBeenCalledTimes(1);
486481

487-
// Connect is only called once at the start because disconnect failed,
488-
// so the reconnect step never executes (it's in the same try-catch block)
482+
// Connect is only called once at the start
489483
expect(mocks.connect).toHaveBeenCalledTimes(1);
490484
});
491485
});
@@ -536,14 +530,8 @@ describe('AccountActivityService', () => {
536530
// unsubscribe catches errors and forces reconnection instead of throwing
537531
await service.unsubscribe(mockSubscription);
538532

539-
// Should have attempted to force reconnection with exact sequence
540-
expect(mocks.disconnect).toHaveBeenCalledTimes(1);
541-
expect(mocks.connect).toHaveBeenCalledTimes(1);
542-
543-
// Verify disconnect was called before connect
544-
const disconnectOrder = mocks.disconnect.mock.invocationCallOrder[0];
545-
const connectOrder = mocks.connect.mock.invocationCallOrder[0];
546-
expect(disconnectOrder).toBeLessThan(connectOrder);
533+
// Should have attempted to force reconnection
534+
expect(mocks.forceReconnection).toHaveBeenCalledTimes(1);
547535
},
548536
);
549537
});

packages/core-backend/src/AccountActivityService.ts

Lines changed: 5 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -80,7 +80,7 @@ export type AccountActivityServiceActions = AccountActivityServiceMethodActions;
8080
export const ACCOUNT_ACTIVITY_SERVICE_ALLOWED_ACTIONS = [
8181
'AccountsController:getSelectedAccount',
8282
'BackendWebSocketService:connect',
83-
'BackendWebSocketService:disconnect',
83+
'BackendWebSocketService:forceReconnection',
8484
'BackendWebSocketService:subscribe',
8585
'BackendWebSocketService:getConnectionInfo',
8686
'BackendWebSocketService:channelHasSubscription',
@@ -559,16 +559,11 @@ export class AccountActivityService {
559559
* Force WebSocket reconnection to clean up subscription state
560560
*/
561561
async #forceReconnection(): Promise<void> {
562-
try {
563-
log('Forcing WebSocket reconnection to clean up subscription state');
564-
565-
// All subscriptions will be cleaned up automatically on WebSocket disconnect
562+
log('Forcing WebSocket reconnection to clean up subscription state');
566563

567-
await this.#messenger.call('BackendWebSocketService:disconnect');
568-
await this.#messenger.call('BackendWebSocketService:connect');
569-
} catch (error) {
570-
log('Failed to force WebSocket reconnection', { error });
571-
}
564+
// Use the dedicated forceReconnection method which performs a controlled
565+
// disconnect-then-connect sequence to clean up subscription state
566+
await this.#messenger.call('BackendWebSocketService:forceReconnection');
572567
}
573568

574569
// =============================================================================

packages/core-backend/src/BackendWebSocketService-method-action-types.ts

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,27 @@ export type BackendWebSocketServiceDisconnectAction = {
2525
handler: BackendWebSocketService['disconnect'];
2626
};
2727

28+
/**
29+
* Forces a WebSocket reconnection to clean up subscription state
30+
*
31+
* This method is useful when subscription state may be out of sync and needs to be reset.
32+
* It performs a controlled disconnect-then-reconnect sequence:
33+
* - Disconnects cleanly to trigger subscription cleanup
34+
* - Schedules reconnection with exponential backoff to prevent rapid loops
35+
* - All subscriptions will be cleaned up automatically on disconnect
36+
*
37+
* Use cases:
38+
* - Recovering from subscription/unsubscription issues
39+
* - Cleaning up orphaned subscriptions
40+
* - Forcing a fresh subscription state
41+
*
42+
* @returns Promise that resolves when disconnection is complete (reconnection is scheduled)
43+
*/
44+
export type BackendWebSocketServiceForceReconnectionAction = {
45+
type: `BackendWebSocketService:forceReconnection`;
46+
handler: BackendWebSocketService['forceReconnection'];
47+
};
48+
2849
/**
2950
* Sends a message through the WebSocket
3051
*
@@ -159,6 +180,7 @@ export type BackendWebSocketServiceSubscribeAction = {
159180
export type BackendWebSocketServiceMethodActions =
160181
| BackendWebSocketServiceConnectAction
161182
| BackendWebSocketServiceDisconnectAction
183+
| BackendWebSocketServiceForceReconnectionAction
162184
| BackendWebSocketServiceSendMessageAction
163185
| BackendWebSocketServiceSendRequestAction
164186
| BackendWebSocketServiceGetConnectionInfoAction

0 commit comments

Comments
 (0)