fix(core-backend): reconnection logic (#6861)

Kriys94 · web-flow · commit 89745278be4d · 2025-10-22T17:23:17.000+02:00
## Explanation

### Current State

The WebSocket service experienced a critical issue when the backend server accepted connections but immediately disconnected afterward. I was only able to reproduce this behavior locally, not against the production endpoint.

### The Problem

When the server accepts the WebSocket connection but then disconnects rapidly:

- The client would immediately attempt to reconnect (fast reconnection)
- This created a connection thrashing loop: connect → disconnect → reconnect → disconnect
- Multiple code paths could trigger reconnection attempts independently
- The `scheduleConnect` method could be called multiple times during rapid disconnect cycles
- Multiple overlapping reconnection attempts could be scheduled simultaneously
- All disconnected clients would reconnect at exactly the same time, creating a "thundering herd" problem
- This led to unpredictable behavior and duplicate connection attempts, and worsened the reconnection storm

### The Solution

This PR fixes the issue with three key improvements:

#### 1. **Centralize Reconnection via `scheduleConnect`**

All reconnection logic now flows through a single entry point:

- All code paths that need to trigger reconnection now use `scheduleConnect`
- Eliminates scattered reconnection logic across the codebase
- Provides a single point of control for reconnection behavior
- Makes the reconnection strategy consistent and predictable
- Makes reconnection logic easier to maintain, test, and debug

#### 2. **Idempotent `scheduleConnect`**

Makes `scheduleConnect` idempotent to ensure multiple calls don't create duplicate or overlapping connection attempts:

- Checks if a reconnection is already scheduled before creating a new one
- Ensures only one reconnection timer is active at any given time
- Eliminates race conditions from concurrent connection attempts
- Maintains clear, single-threaded reconnection state

#### 3. **Jitter on Reconnection**

Adds randomized jitter to reconnection timing to prevent thundering herd problems:

- Randomizes the reconnection delay to spread out reconnection attempts across clients
- When the server disconnects many clients simultaneously, they won't all reconnect at the exact same moment
- Reduces load spikes on the server during recovery from incidents
- Improves overall system stability and reliability

**Combined Effect:** With all three fixes, when the server accepts connections but disconnects immediately:

- All reconnection attempts flow through a single, centralized method (consistency)
- Only one reconnection attempt will be scheduled per client (idempotency)
- Reconnection attempts are staggered across the client population (jitter)
- The reconnection logic follows a proper backoff strategy with randomization
- No duplicate timers or overlapping connection attempts
- Predictable, controlled reconnection behavior that respects server health
- Server experiences gradual reconnection load rather than synchronized spikes

### Technical Details

The fix centralizes all reconnection logic through `scheduleConnect`, makes it idempotent by guarding against duplicate timer creation, and adds jitter to the reconnection delay calculation to randomize timing across clients. This prevents both the individual reconnection storm (from rapid connect/disconnect cycles) and the collective thundering herd problem (from synchronized reconnections), while ensuring consistent behavior across the entire service.

## References

- Related to `BackendWebSocketService` in `@metamask/core-backend`
- Fixes # [ADD ISSUE NUMBER HERE]

## Checklist

- [x] I've updated the test suite for new or updated code as appropriate
- [x] I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate
- [x] I've communicated my changes to consumers by [updating changelogs for packages I've changed](https://github.com/MetaMask/core/tree/main/docs/contributing.md#updating-changelogs), highlighting breaking changes as necessary
- [ ] I've prepared draft pull requests for clients and consumer packages to resolve any breaking changes

---

> [!NOTE]
> Introduces a controlled `forceReconnection()` flow, idempotent/jittered backoff scheduling, and stability timer to prevent reconnect thrashing; updates AccountActivityService to use it and expands tests.
>
> - **BackendWebSocketService**
>   - **New API**: `forceReconnection()` messenger action and method for controlled disconnect-then-reconnect.
>   - **Reconnection**: Idempotent scheduler using `ExponentialBackoff` with jitter; `connect()` no-ops if a reconnect is scheduled; stable-connection timer (10s) resets attempts/backoff.
>   - **Behavior changes**: `disconnect()` now synchronous and resets attempts; improved error handling and timer cleanup; avoids duplicate timers and races.
>   - **Actions/Types**: Expose `forceReconnection`; update method action types union.
> - **AccountActivityService**
>   - Switch to `BackendWebSocketService:forceReconnection` for recovery; remove direct `disconnect`/`connect` sequence.
>   - Update allowed messenger actions accordingly.
> - **Tests**
>   - Add/adjust tests for new reconnection semantics, stable timer, idempotent scheduling, and force reconnection flows; minor test cleanup tweaks.
> - **Changelog**
>   - Document new API, reconnection improvements, fixes for race conditions/memory leaks, and breaking messenger action change for AccountActivityService.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 373a4f6. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
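
The three improvements in the PR description (single entry point, idempotent scheduling, jittered delay) can be illustrated with a minimal sketch. This is not the code from `BackendWebSocketService`: the `ReconnectScheduler` class, its option names, and the "full jitter" formula (a uniform draw below an exponentially growing ceiling) are assumptions chosen for illustration. The summary note only states that the service uses an `ExponentialBackoff` helper with jitter, without giving its parameters.

```typescript
// Illustrative sketch only: ReconnectScheduler and its options are hypothetical
// names, not the actual BackendWebSocketService implementation.

type ReconnectOptions = {
  baseDelayMs: number; // delay ceiling for the first retry
  maxDelayMs: number; // upper bound for any retry delay
  random?: () => number; // injectable randomness source in [0, 1)
};

// Exponential backoff with "full jitter" (an assumed strategy): the ceiling
// doubles per attempt, capped at maxDelayMs, and the delay is drawn
// uniformly below that ceiling.
function computeBackoffDelay(
  attempt: number,
  { baseDelayMs, maxDelayMs, random = Math.random }: ReconnectOptions,
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

class ReconnectScheduler {
  private timer: ReturnType<typeof setTimeout> | null = null;

  private attempts = 0;

  private readonly connect: () => void;

  private readonly options: ReconnectOptions;

  constructor(connect: () => void, options: ReconnectOptions) {
    this.connect = connect;
    this.options = options;
  }

  // Single entry point for every code path that wants a reconnect.
  // Returns false and does nothing when a reconnect is already pending,
  // so repeated calls never stack timers or inflate the attempt counter.
  scheduleConnect(): boolean {
    if (this.timer !== null) {
      return false;
    }
    const delayMs = computeBackoffDelay(this.attempts, this.options);
    this.attempts += 1;
    this.timer = setTimeout(() => {
      this.timer = null;
      this.connect();
    }, delayMs);
    return true;
  }

  // Cancels any pending reconnect and resets the backoff counter.
  reset(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    this.attempts = 0;
  }

  getAttempts(): number {
    return this.attempts;
  }
}
```

Because the guard sits inside `scheduleConnect` itself, callers such as a close handler and an error handler can both request a reconnect during the same disconnect cycle without coordinating with each other.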
diff --git a/packages/core-backend/CHANGELOG.md b/packages/core-backend/CHANGELOG.md
@@ -7,9 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+
+- Add `forceReconnection()` method to `BackendWebSocketService` for controlled subscription state cleanup ([#6861](https://github.com/MetaMask/core/pull/6861))
+  - Performs a controlled disconnect-then-reconnect sequence with exponential backoff
+  - Useful for recovering from subscription/unsubscription issues and cleaning up orphaned subscriptions
+  - Add `BackendWebSocketService:forceReconnection` messenger action
+- Add stable connection timer to prevent rapid reconnection loops ([#6861](https://github.com/MetaMask/core/pull/6861))
+  - Connection must stay stable for 10 seconds before resetting reconnect attempts
+  - Prevents issues when server accepts connection then immediately closes it
+
 ### Changed
 
 - Bump `@metamask/base-controller` from `^8.4.1` to `^8.4.2` ([#6917](https://github.com/MetaMask/core/pull/6917))
+- Update `AccountActivityService` to use new `forceReconnection()` method instead of manually calling disconnect/connect ([#6861](https://github.com/MetaMask/core/pull/6861))
+- **BREAKING:** Update allowed actions for `AccountActivityService` messenger: remove `BackendWebSocketService:disconnect`, add `BackendWebSocketService:forceReconnection` ([#6861](https://github.com/MetaMask/core/pull/6861))
+- Improve reconnection scheduling in `BackendWebSocketService` to be idempotent ([#6861](https://github.com/MetaMask/core/pull/6861))
+  - Prevents duplicate reconnection timers and inflated attempt counters
+  - Scheduler checks if reconnect is already scheduled before creating new timer
+- Improve error handling in `BackendWebSocketService.connect()` ([#6861](https://github.com/MetaMask/core/pull/6861))
+  - Always schedule reconnect on connection failure (exponential backoff prevents aggressive retries)
+  - Remove redundant schedule calls from error paths
+- Update `BackendWebSocketService.disconnect()` to reset reconnect attempts counter ([#6861](https://github.com/MetaMask/core/pull/6861))
+- Update `BackendWebSocketService.disconnect()` return type from `Promise<void>` to `void` ([#6861](https://github.com/MetaMask/core/pull/6861))
+- Improve logging throughout `BackendWebSocketService` for better debugging ([#6861](https://github.com/MetaMask/core/pull/6861))
+
+### Fixed
+
+- Fix potential race condition in `BackendWebSocketService.connect()` that could bypass exponential backoff when reconnect is already scheduled ([#6861](https://github.com/MetaMask/core/pull/6861))
+- Fix memory leak from orphaned timers when multiple reconnects are scheduled ([#6861](https://github.com/MetaMask/core/pull/6861))
+- Fix issue where reconnect attempts counter could grow unnecessarily with duplicate scheduled reconnects ([#6861](https://github.com/MetaMask/core/pull/6861))
 
 ## [2.1.0]
 
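
The stable connection timer listed under "Added" in the changelog can be pictured with the following minimal sketch. The `StableConnectionTracker` class and the injectable `Timers` type are hypothetical; only the 10-second threshold comes from the changelog. The point is that a successful open does not reset the attempt counter by itself: the counter is cleared only if the connection survives the full stability window.

```typescript
// Illustrative sketch only: StableConnectionTracker and Timers are hypothetical
// names. The 10 s threshold is the value stated in the changelog.
const STABLE_CONNECTION_MS = 10000;

type Timers = {
  set: (callback: () => void, ms: number) => unknown;
  clear: (handle: unknown) => void;
};

const realTimers: Timers = {
  set: (callback, ms) => setTimeout(callback, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

class StableConnectionTracker {
  private reconnectAttempts = 0;

  private stableTimer: unknown = null;

  private readonly timers: Timers;

  constructor(timers: Timers = realTimers) {
    this.timers = timers;
  }

  // Called each time a reconnect attempt is made.
  recordAttempt(): void {
    this.reconnectAttempts += 1;
  }

  // Called when the socket opens. The counter is NOT reset yet; a timer is
  // started and the reset happens only if the connection stays up.
  onOpen(): void {
    this.clearStableTimer();
    this.stableTimer = this.timers.set(() => {
      this.stableTimer = null;
      this.reconnectAttempts = 0;
    }, STABLE_CONNECTION_MS);
  }

  // Called when the socket closes. An early close cancels the pending reset,
  // so the backoff keeps growing across accept-then-close cycles.
  onClose(): void {
    this.clearStableTimer();
  }

  getAttempts(): number {
    return this.reconnectAttempts;
  }

  private clearStableTimer(): void {
    if (this.stableTimer !== null) {
      this.timers.clear(this.stableTimer);
      this.stableTimer = null;
    }
  }
}
```

Injecting the timer functions is a design choice of this sketch so the window can be simulated in tests without waiting; the real service may structure this differently.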
diff --git a/packages/core-backend/src/AccountActivityService.test.ts b/packages/core-backend/src/AccountActivityService.test.ts
@@ -71,7 +71,7 @@ const getMessenger = () => {
   // Create mock action handlers
   const mockGetSelectedAccount = jest.fn();
   const mockConnect = jest.fn();
-  const mockDisconnect = jest.fn();
+  const mockForceReconnection = jest.fn();
   const mockSubscribe = jest.fn();
   const mockChannelHasSubscription = jest.fn();
   const mockGetSubscriptionsByChannel = jest.fn();
@@ -89,8 +89,8 @@ const getMessenger = () => {
     mockConnect,
   );
   rootMessenger.registerActionHandler(
-    'BackendWebSocketService:disconnect',
-    mockDisconnect,
+    'BackendWebSocketService:forceReconnection',
+    mockForceReconnection,
   );
   rootMessenger.registerActionHandler(
     'BackendWebSocketService:subscribe',
@@ -123,7 +123,7 @@ const getMessenger = () => {
     mocks: {
       getSelectedAccount: mockGetSelectedAccount,
       connect: mockConnect,
-      disconnect: mockDisconnect,
+      forceReconnection: mockForceReconnection,
       subscribe: mockSubscribe,
       channelHasSubscription: mockChannelHasSubscription,
       getSubscriptionsByChannel: mockGetSubscriptionsByChannel,
@@ -222,7 +222,7 @@ type WithServiceCallback<ReturnValue> = (payload: {
   mocks: {
     getSelectedAccount: jest.Mock;
     connect: jest.Mock;
-    disconnect: jest.Mock;
+    forceReconnection: jest.Mock;
     subscribe: jest.Mock;
     channelHasSubscription: jest.Mock;
     getSubscriptionsByChannel: jest.Mock;
@@ -464,28 +464,22 @@ describe('AccountActivityService', () => {
       );
     });
 
-    it('should handle disconnect failures during force reconnection by logging error and continuing gracefully', async () => {
+    it('should handle subscription failure by calling forceReconnection', async () => {
       await withService(async ({ service, mocks }) => {
-        // Mock disconnect to fail - this prevents the reconnect step from executing
-        mocks.disconnect.mockRejectedValue(
-          new Error('Disconnect failed during force reconnection'),
-        );
-
-        // Trigger scenario that causes force reconnection by making subscribe fail
+        // Mock subscribe to fail
         mocks.subscribe.mockRejectedValue(new Error('Subscription failed'));
 
-        // Should handle both subscription failure and disconnect failure gracefully - should not throw
+        // Should handle subscription failure gracefully - should not throw
         const result = await service.subscribe({ address: '0x123abc' });
         expect(result).toBeUndefined();
 
         // Verify the subscription was attempted
         expect(mocks.subscribe).toHaveBeenCalledTimes(1);
 
-        // Verify disconnect was attempted (but failed, preventing reconnection)
-        expect(mocks.disconnect).toHaveBeenCalledTimes(1);
+        // Verify forceReconnection was called (lines 289-290)
+        expect(mocks.forceReconnection).toHaveBeenCalledTimes(1);
 
-        // Connect is only called once at the start because disconnect failed,
-        // so the reconnect step never executes (it's in the same try-catch block)
+        // Connect is only called once at the start
         expect(mocks.connect).toHaveBeenCalledTimes(1);
       });
     });
@@ -536,14 +530,8 @@ describe('AccountActivityService', () => {
           // unsubscribe catches errors and forces reconnection instead of throwing
           await service.unsubscribe(mockSubscription);
 
-          // Should have attempted to force reconnection with exact sequence
-          expect(mocks.disconnect).toHaveBeenCalledTimes(1);
-          expect(mocks.connect).toHaveBeenCalledTimes(1);
-
-          // Verify disconnect was called before connect
-          const disconnectOrder = mocks.disconnect.mock.invocationCallOrder[0];
-          const connectOrder = mocks.connect.mock.invocationCallOrder[0];
-          expect(disconnectOrder).toBeLessThan(connectOrder);
+          // Should have attempted to force reconnection
+          expect(mocks.forceReconnection).toHaveBeenCalledTimes(1);
         },
       );
     });
diff --git a/packages/core-backend/src/AccountActivityService.ts b/packages/core-backend/src/AccountActivityService.ts
@@ -80,7 +80,7 @@ export type AccountActivityServiceActions = AccountActivityServiceMethodActions;
 export const ACCOUNT_ACTIVITY_SERVICE_ALLOWED_ACTIONS = [
   'AccountsController:getSelectedAccount',
   'BackendWebSocketService:connect',
-  'BackendWebSocketService:disconnect',
+  'BackendWebSocketService:forceReconnection',
   'BackendWebSocketService:subscribe',
   'BackendWebSocketService:getConnectionInfo',
   'BackendWebSocketService:channelHasSubscription',
@@ -559,16 +559,11 @@ export class AccountActivityService {
    * Force WebSocket reconnection to clean up subscription state
    */
   async #forceReconnection(): Promise<void> {
-    try {
-      log('Forcing WebSocket reconnection to clean up subscription state');
-
-      // All subscriptions will be cleaned up automatically on WebSocket disconnect
+    log('Forcing WebSocket reconnection to clean up subscription state');
 
-      await this.#messenger.call('BackendWebSocketService:disconnect');
-      await this.#messenger.call('BackendWebSocketService:connect');
-    } catch (error) {
-      log('Failed to force WebSocket reconnection', { error });
-    }
+    // Use the dedicated forceReconnection method which performs a controlled
+    // disconnect-then-connect sequence to clean up subscription state
+    await this.#messenger.call('BackendWebSocketService:forceReconnection');
   }
 
   // =============================================================================
diff --git a/packages/core-backend/src/BackendWebSocketService-method-action-types.ts b/packages/core-backend/src/BackendWebSocketService-method-action-types.ts
@@ -25,6 +25,27 @@ export type BackendWebSocketServiceDisconnectAction = {
   handler: BackendWebSocketService['disconnect'];
 };
 
+/**
+ * Forces a WebSocket reconnection to clean up subscription state
+ *
+ * This method is useful when subscription state may be out of sync and needs to be reset.
+ * It performs a controlled disconnect-then-reconnect sequence:
+ * - Disconnects cleanly to trigger subscription cleanup
+ * - Schedules reconnection with exponential backoff to prevent rapid loops
+ * - All subscriptions will be cleaned up automatically on disconnect
+ *
+ * Use cases:
+ * - Recovering from subscription/unsubscription issues
+ * - Cleaning up orphaned subscriptions
+ * - Forcing a fresh subscription state
+ *
+ * @returns Promise that resolves when disconnection is complete (reconnection is scheduled)
+ */
+export type BackendWebSocketServiceForceReconnectionAction = {
+  type: `BackendWebSocketService:forceReconnection`;
+  handler: BackendWebSocketService['forceReconnection'];
+};
+
 /**
  * Sends a message through the WebSocket
  *
@@ -159,6 +180,7 @@ export type BackendWebSocketServiceSubscribeAction = {
 export type BackendWebSocketServiceMethodActions =
   | BackendWebSocketServiceConnectAction
   | BackendWebSocketServiceDisconnectAction
+  | BackendWebSocketServiceForceReconnectionAction
   | BackendWebSocketServiceSendMessageAction
   | BackendWebSocketServiceSendRequestAction
   | BackendWebSocketServiceGetConnectionInfoAction
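
The JSDoc for the new `forceReconnection` action describes a controlled disconnect followed by a scheduled reconnect. A minimal sketch of that flow, assuming a hypothetical `SketchSocketService` class with an injected `schedule` function standing in for the backoff timer, might look like this. None of these names come from `BackendWebSocketService`; only the ordering (synchronous disconnect, subscription cleanup, then a single scheduled reconnect) follows the description in the diff.

```typescript
// Illustrative sketch only: SketchSocketService and its members are hypothetical
// and merely mirror the behaviour described in the forceReconnection JSDoc.
type ConnectionState = 'connected' | 'disconnected';

class SketchSocketService {
  private state: ConnectionState = 'connected';

  private readonly subscriptions = new Set<string>();

  private reconnectScheduled = false;

  private readonly schedule: (run: () => void) => void;

  // `schedule` stands in for a backoff timer; it decides when `run` executes.
  constructor(schedule: (run: () => void) => void) {
    this.schedule = schedule;
  }

  subscribe(channel: string): void {
    this.subscriptions.add(channel);
  }

  // Controlled disconnect-then-reconnect. The promise resolves once the
  // disconnect is done; the reconnect itself happens later via the scheduler.
  async forceReconnection(): Promise<void> {
    this.disconnect();
    this.scheduleConnect();
  }

  // Synchronous disconnect that also drops all subscription state.
  disconnect(): void {
    this.subscriptions.clear();
    this.state = 'disconnected';
  }

  getInfo(): { state: ConnectionState; subscriptionCount: number } {
    return { state: this.state, subscriptionCount: this.subscriptions.size };
  }

  // Idempotent: a second request while one is pending is ignored.
  private scheduleConnect(): void {
    if (this.reconnectScheduled) {
      return;
    }
    this.reconnectScheduled = true;
    this.schedule(() => {
      this.reconnectScheduled = false;
      this.state = 'connected';
    });
  }
}
```

A consumer such as `AccountActivityService` then needs only one call, as in the diff above (`this.#messenger.call('BackendWebSocketService:forceReconnection')`), instead of sequencing `disconnect` and `connect` itself.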
diff --git a/packages/core-backend/src/BackendWebSocketService.test.ts b/packages/core-backend/src/BackendWebSocketService.test.ts
diff --git a/packages/core-backend/src/BackendWebSocketService.ts b/packages/core-backend/src/BackendWebSocketService.ts