Revamp NetworkController RPC endpoint events #7166

mcmire · 2025-11-15T00:10:14Z

Explanation

In a future commit we will introduce changes to network-controller so that it will keep track of the status of each network as requests are made. This commit paves the way for this to happen by redefining the existing RPC endpoint-related events that NetworkController produces.

Currently, when requests are made through the network clients that NetworkController exposes, three events are published:

NetworkController:rpcEndpointDegraded
- Published when enough successive retriable errors are encountered while making a request to an RPC endpoint that the maximum number of retries is reached.
NetworkController:rpcEndpointUnavailable
- Published when enough successive errors are encountered while making a request to an RPC endpoint that the underlying circuit breaks.
NetworkController:rpcEndpointRequestRetried
- Published when a request is retried (mainly used for testing).

Note that in the context of the RPC failover feature, an "RPC endpoint" can encompass multiple URLs, so the above events can fire for any of those URLs.

While these events are useful for reporting metrics on RPC endpoints, in order to update the status of a network effectively, we need events that are less granular and are guaranteed not to fire multiple times in a row. We also need a new event.

Now the list of events looks like this:

NetworkController:rpcEndpointDegraded
- The same as before.
NetworkController:rpcEndpointUnavailable
- The same as before.
NetworkController:rpcEndpointRetried
- Renamed from NetworkController:rpcEndpointRequestRetried.
NetworkController:rpcEndpointChainDegraded
- Similar to NetworkController:rpcEndpointDegraded, but won't be published again if the RPC endpoint is already in a degraded state.
NetworkController:rpcEndpointChainUnavailable
- Published when all of the circuits underlying all of the URLs for an RPC endpoint have broken (none of the URLs are available). Won't be published again if the RPC endpoint is already in an unavailable state.
NetworkController:rpcEndpointChainAvailable
- A new event. Published the first time a successful request is made to one of the URLs for an RPC endpoint, or the first time a request succeeds after the endpoint has been in a degraded or unavailable state.
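
To illustrate how a consumer might use the chain-level events to keep track of network status, here is a minimal sketch. It models events as plain objects instead of going through the real messenger, and `NetworkStatus`, `ChainEvent`, `STATUS_BY_EVENT`, and `applyChainEvent` are hypothetical names. The only payload field it relies on is `networkClientId`, and the IDs used are example values.

```typescript
// Hypothetical status values; the real controller may use different ones.
type NetworkStatus = 'available' | 'degraded' | 'unavailable';

type ChainEventType =
  | 'NetworkController:rpcEndpointChainAvailable'
  | 'NetworkController:rpcEndpointChainDegraded'
  | 'NetworkController:rpcEndpointChainUnavailable';

// Assumed payload shape: only `networkClientId` is relied upon here.
interface ChainEvent {
  type: ChainEventType;
  networkClientId: string;
}

const STATUS_BY_EVENT: Record<ChainEventType, NetworkStatus> = {
  'NetworkController:rpcEndpointChainAvailable': 'available',
  'NetworkController:rpcEndpointChainDegraded': 'degraded',
  'NetworkController:rpcEndpointChainUnavailable': 'unavailable',
};

// Returns a new status map with the affected network updated.
function applyChainEvent(
  statuses: Record<string, NetworkStatus>,
  event: ChainEvent,
): Record<string, NetworkStatus> {
  return { ...statuses, [event.networkClientId]: STATUS_BY_EVENT[event.type] };
}

// Example: replay a short sequence of events (example network client IDs).
const chainEvents: ChainEvent[] = [
  { type: 'NetworkController:rpcEndpointChainAvailable', networkClientId: 'mainnet' },
  { type: 'NetworkController:rpcEndpointChainUnavailable', networkClientId: 'sepolia' },
  { type: 'NetworkController:rpcEndpointChainDegraded', networkClientId: 'mainnet' },
];
const finalStatuses = chainEvents.reduce<Record<string, NetworkStatus>>(
  applyChainEvent,
  {},
);
```

Because the chain-level events are described as non-repeating, each one can be treated as a state transition, which is why a plain lookup table is enough here.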

Going a bit deeper, in order to make the changes above, it was necessary to rewrite the core logic responsible for diverting traffic to failovers from RpcService to RpcServiceChain, which was a more natural fit, anyway. This also meant that we could simplify RpcService, as well as its tests.

References

Progresses https://consensyssoftware.atlassian.net/browse/WPC-99.

Checklist

I've updated the test suite for new or updated code as appropriate
I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate
I've communicated my changes to consumers by updating changelogs for packages I've changed, highlighting breaking changes as necessary
I've prepared draft pull requests for clients and consumer packages to resolve any breaking changes
- This will come in a later PR.

Note

Introduce chain-level RPC endpoint events, rework RpcServiceChain, rename the retry event, update event payloads, and refactor failover logic with extensive test updates.

Events & Payloads (BREAKING):
- Add NetworkController:rpcEndpointChainAvailable and chain-level events rpcEndpointChainDegraded/rpcEndpointChainUnavailable with non-repeating semantics in src/NetworkController.ts.
- Rename NetworkController:rpcEndpointRequestRetried to NetworkController:rpcEndpointRetried.
- Update payloads: add networkClientId, add primaryEndpointUrl to per-endpoint events; chain-level events omit primaryEndpointUrl.
Failover Architecture:
- Rework RpcServiceChain (src/rpc-service/rpc-service-chain.ts) to manage primary/failover endpoints, circuit states, and emit chain/service events.
- Simplify RpcService (src/rpc-service/rpc-service.ts): remove embedded failover, add onAvailable, resetPolicy, getCircuitState, improved error handling/logging.
- Wire into client creation: create-network-client.ts builds RpcServiceChain and publishes updated events; create-auto-managed-network-client.ts now passes networkClientId to createNetworkClient.
Exports & Types:
- Update exports in src/index.ts; adjust RpcServiceRequestable event listener types and remove { isolated: true } from onBreak data.
Tests:
- Add comprehensive tests for new chain/service events and behaviors; update helpers/mocks to support dynamic responses.
Changelog & Deps:
- Document breaking changes and new events in CHANGELOG.md.
- Add cockatiel to devDependencies.

Written by Cursor Bugbot for commit 916a0e2. This will update automatically on new commits.

In a future commit we will introduce changes to `network-controller` so that it will keep track of the status of each network as requests are made. These updates to `createServicePolicy` assist with that. See the changelog for a list of changes to the `ServicePolicy` API. Besides the changes listed there, the tests for `createServicePolicy` have been refactored slightly so that they are easier to maintain in the future.

In a future commit we will introduce changes to `network-controller` so that it will keep track of the status of each network as requests are made. This commit paves the way for this to happen by redefining the existing RPC endpoint-related events that NetworkController produces.

Currently, when requests are made through the network clients that NetworkController exposes, three events are published:

- `NetworkController:rpcEndpointDegraded` - Published when enough successive retriable errors are encountered while making a request to an RPC endpoint that the maximum number of retries is reached.
- `NetworkController:rpcEndpointUnavailable` - Published when enough successive errors are encountered while making a request to an RPC endpoint that the underlying circuit breaks.
- `NetworkController:rpcEndpointRequestRetried` - Published when a request is retried (mainly used for testing).

It's important to note that in the context of the RPC failover feature, an "RPC endpoint" can actually encompass multiple URLs, so the above events actually fire for any URL.

While these events are useful for reporting metrics on RPC endpoints, in order to effectively be able to update the status of a network, we need events that are less granular and are guaranteed not to fire multiple times in a row. We also need a new event.

Now the list of events looks like this:

- `NetworkController:rpcEndpointInstanceDegraded` - The same as `NetworkController:rpcEndpointDegraded` before.
- `NetworkController:rpcEndpointInstanceUnavailable` - The same as `NetworkController:rpcEndpointUnavailable` before.
- `NetworkController:rpcEndpointInstanceRetried` - Renamed from `NetworkController:rpcEndpointRequestRetried`.
- `NetworkController:rpcEndpointDegraded` - Similar to `NetworkController:rpcEndpointInstanceDegraded`, but won't be published again if the RPC endpoint is already in a degraded state.
- `NetworkController:rpcEndpointUnavailable` - Published when all of the circuits underlying all of the URLs for an RPC endpoint have broken (none of the URLs are available). Won't be published again if the RPC endpoint is already in an unavailable state.
- `NetworkController:rpcEndpointAvailable` - A new event. Published the first time a successful request is made to one of the URLs for an RPC endpoint, or following a degraded or unavailable status.

mcmire · 2025-11-17T19:17:32Z

packages/network-controller/src/rpc-service/rpc-service-chain.ts

  ): Promise<JsonRpcResponse<Result | null>> {
-    return this.#services[0].request(jsonRpcRequest, fetchOptions);
-  }
+    // Start with the primary (first) service and switch to failovers as the


Prior to these changes, each RpcService object could have an optional failoverService property. This class, RpcServiceChain, would then build a chain (really, a linked list) of services. In order to make a request, RpcServiceChain would call request on the first service in the chain, and RpcService would decide whether it needed to call the next service in the chain, etc.

While this model is easy to understand, I needed access to certain data along the way, and it seemed easier to use a loop rather than a linked list. Anyway, I figured it really should be the responsibility of RpcServiceChain to manage how requests are sent across the chain.
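
As a rough illustration of the loop-based approach, here is a simplified sketch. It is not the actual `RpcServiceChain` code: `Service`, `makeService`, and `ServiceChain` are hypothetical stand-ins, requests are synchronous for brevity (a real service would return a Promise), and the rule "fail over only if the circuit broke" is an assumption made for the example.

```typescript
type CircuitState = 'closed' | 'open';

// Hypothetical, simplified stand-in for an RPC service.
interface Service {
  endpointUrl: string;
  getCircuitState(): CircuitState;
  // Synchronous for brevity; a real service would return a Promise.
  request(method: string): string;
}

// Builds a fake service. An unhealthy service breaks its circuit on the
// first failure, which keeps the example short.
function makeService(endpointUrl: string, healthy: boolean): Service {
  let circuitState: CircuitState = 'closed';
  return {
    endpointUrl,
    getCircuitState: () => circuitState,
    request(method: string): string {
      if (!healthy) {
        circuitState = 'open';
        throw new Error(`${endpointUrl} is down`);
      }
      return `${method} handled by ${endpointUrl}`;
    },
  };
}

class ServiceChain {
  private readonly services: Service[];

  constructor(services: Service[]) {
    this.services = services;
  }

  request(method: string): string {
    let lastError: unknown = new Error('All endpoints are unavailable');
    // Start with the primary (first) service and move on to failovers.
    for (const service of this.services) {
      if (service.getCircuitState() === 'open') {
        continue; // This URL is already known to be broken.
      }
      try {
        return service.request(method);
      } catch (error) {
        lastError = error;
        if (service.getCircuitState() !== 'open') {
          // The circuit is intact, so surface the error instead of
          // diverting traffic to the next service.
          throw error;
        }
      }
    }
    throw lastError;
  }
}
```

With a loop, the chain can see every service's state in one place, which is the kind of access that is harder to get when each service decides on its own whether to call the next one.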

mcmire · 2025-11-17T19:18:23Z

packages/network-controller/src/rpc-service/rpc-service-requestable.ts

  onRetry(
-    listener: AddToCockatielEventData<
-      Parameters<ServicePolicy['onRetry']>[0],
+    listener: CockatielEventToEventListenerWithData<


This should be doing the same thing as the previous code. I just thought that AddToCockatielEventData<Parameters<...>[0]> was a bit ugly (and I added some more self-descriptive utility types, so this is one of them).

mcmire · 2025-11-17T19:21:24Z

packages/network-controller/src/create-auto-managed-network-client.ts

 export function createAutoManagedNetworkClient<
  Configuration extends NetworkClientConfiguration,
 >({
+  networkClientId,


We now expose the network client ID in the rpcEndpoint* messenger events, so we need to receive it and pass it down to createNetworkClient.

mcmire · 2025-11-17T19:25:20Z

packages/network-controller/tests/network-client/rpc-failover.ts

      });
    });

-    it('publishes the NetworkController:rpcEndpointUnavailable event when the failover occurs', async () => {


These tests got moved to src/create-network-client-tests/rpc-endpoint-events.test.ts as I realized they didn't belong here. It's true that this test file is concerned with the RPC failover feature (which is required to test the rpcEndpoint* events), but all of the tests in the tests/network-client directory really exercise the middleware stack that createNetworkClient builds, and in so doing loop over all of the RPC methods that our internal provider handles specially. We don't need to go to all that trouble to test the rpcEndpoint* events; we can just use an arbitrary RPC method. (I think eventually I will rename tests/network-client to tests/internal-provider-api or something like that, but that's a PR for another time.)

mcmire · 2025-11-17T19:30:00Z

packages/network-controller/src/rpc-service/shared.ts

+/**
+ * Obtains the event data type from a Cockatiel event or event listener type.
+ */
+export type ExtractCockatielEventData<CockatielEventOrEventListener> =


The Cockatiel types are a bit awkward to work with (especially since event listeners whose event payloads are empty are typed as Event<void> which is rather strange). These utilities just make them a bit easier to work with.
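
To give a sense of what such utilities can look like, here is a minimal sketch. It does not reproduce the types in shared.ts: `EventLike` is a local stand-in that assumes a Cockatiel event is a function that registers a listener and returns a disposable, and `ExtractEventData`, `ListenerWithData`, `OnRetry`, and `OnReset` are hypothetical names.

```typescript
// Local stand-in for the assumed shape of a Cockatiel event: a function
// that registers a listener and returns a disposable.
type EventLike<Data> = (listener: (data: Data) => void) => { dispose(): void };

// Pulls the payload type out of either an event or a bare listener.
type ExtractEventData<EventOrListener> =
  EventOrListener extends EventLike<infer Data>
    ? Data
    : EventOrListener extends (data: infer Data) => void
      ? Data
      : never;

// Builds a listener type whose payload is the original payload plus extra
// fields. An empty (void) payload is replaced by the extra fields alone.
type ListenerWithData<EventOrListener, Extra> = (
  data: [ExtractEventData<EventOrListener>] extends [void]
    ? Extra
    : ExtractEventData<EventOrListener> & Extra,
) => void;

// Hypothetical events: one with a payload and one without.
type OnRetry = EventLike<{ attempt: number }>;
type OnReset = EventLike<void>;

const listenerLog: string[] = [];

const onRetryListener: ListenerWithData<OnRetry, { endpointUrl: string }> = (
  data,
) => {
  listenerLog.push(`${data.endpointUrl} attempt ${data.attempt}`);
};

const onResetListener: ListenerWithData<OnReset, { endpointUrl: string }> = (
  data,
) => {
  listenerLog.push(`${data.endpointUrl} reset`);
};
```

Handling the `void` case separately avoids intersecting `void` with an object type, which is where the awkwardness tends to show up.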

mcmire · 2025-11-17T19:33:34Z

packages/network-controller/CHANGELOG.md

  - This ought to be unobservable, but we mark it as breaking out of an abundance of caution.
+- **BREAKING:** Split up and update payload data for `NetworkController:rpcEndpoint{Degraded,Unavailable}` ([#7166](https://github.com/MetaMask/core/pull/7166))
+  - The existing events are now called `NetworkController:rpcEndpointInstance{Degraded,Unavailable}` and retain their present behavior.
+  - `NetworkController:rpcEndpoint{Degraded,Unavailable}` do still exist, but they are now designed to represent the entire RPC endpoint and are guaranteed to not be published multiple times in a row. In particular, `NetworkController:rpcEndpointUnavailable` is published only after trying all of the designated URLs for a particular RPC endpoint and the underlying circuit for the last URL breaks, not as each primary's or failover's circuit breaks.


In particular, NetworkController:rpcEndpointUnavailable is published only after trying all of the designated URLs for a particular RPC endpoint and the underlying circuit for the last URL breaks, not as each primary's or failover's circuit breaks.

This change is, I suppose, a bit controversial, and the first version of this was slightly different before I landed on this approach. But I think it makes sense? Basically, if we can reach the network somehow, don't broadcast that it's unavailable until we're absolutely sure. That does mean that the NetworkController:rpcEndpointUnavailable event may never be published for Infura networks if the failover does its job, but I think that's precisely the intent.
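
The "don't broadcast until we're sure" rule, together with the guarantee that an event is not published multiple times in a row, can be sketched roughly as follows. This is an illustration under assumptions rather than the code in this PR: `ChainStatusTracker` and its method names are hypothetical, and the circuit states are passed in as a plain array.

```typescript
type ChainStatus = 'unknown' | 'available' | 'degraded' | 'unavailable';
type CircuitState = 'closed' | 'open';

// Hypothetical helper that records which chain-level statuses would be
// published, in order.
class ChainStatusTracker {
  private status: ChainStatus = 'unknown';

  readonly published: ChainStatus[] = [];

  // Publishes only when the status actually changes.
  private transitionTo(next: ChainStatus): void {
    if (this.status === next) {
      return;
    }
    this.status = next;
    this.published.push(next);
  }

  // Called whenever the circuit for one URL breaks. The chain only becomes
  // unavailable once the circuits for all URLs are open.
  onServiceBreak(circuitStates: CircuitState[]): void {
    const allOpen =
      circuitStates.length > 0 &&
      circuitStates.every((state) => state === 'open');
    if (allOpen) {
      this.transitionTo('unavailable');
    }
  }

  // Called when a URL is found to be degraded.
  onServiceDegraded(): void {
    this.transitionTo('degraded');
  }

  // Called when a request to any URL succeeds.
  onServiceAvailable(): void {
    this.transitionTo('available');
  }
}
```

In this sketch, a primary whose circuit breaks while the failover is still healthy produces no chain-level "unavailable" event at all, which mirrors the Infura scenario described above.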

packages/network-controller/CHANGELOG.md

…oller-rpc-endpoint-events

mcmire · 2025-11-17T21:26:13Z

~~Adding no-changelog because there is a minor change to createServicePolicy which is non-user-facing.~~ I've cherry-picked the createServicePolicy to the other PR, so this no longer applies.

…oller-rpc-endpoint-events

## Explanation

In a future commit we will introduce changes to `network-controller` so that it will keep track of the status of each network as requests are made. These updates to `createServicePolicy` assist with that. See the changelog for more. Besides this, the tests for `createServicePolicy` have been refactored slightly so that they are easier to maintain in the future.

## References

Progresses https://consensyssoftware.atlassian.net/browse/WPC-99. You can see how these changes will be used in the next PR: #7166

## Checklist

- [x] I've updated the test suite for new or updated code as appropriate
- [x] I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate
- [x] I've communicated my changes to consumers by [updating changelogs for packages I've changed](https://github.com/MetaMask/core/tree/main/docs/contributing.md#updating-changelogs), highlighting breaking changes as necessary
- [x] I've prepared draft pull requests for clients and consumer packages to resolve any breaking changes

---

> [!NOTE]
> Adds `getCircuitState`, `onAvailable`, and `reset` to `ServicePolicy`, exports Cockatiel types, and updates logic/tests to support availability tracking and circuit state introspection.
>
> - **controller-utils**:
>   - **ServicePolicy API**:
>     - Add `getCircuitState()` to expose underlying circuit state.
>     - Add `onAvailable` event for first success and post-recovery success.
>     - Add `reset()` to close the circuit and reset breaker counters.
>   - **Behavior/Internals**:
>     - Track availability status and emit `onAvailable`/`onDegraded` appropriately.
>     - Update `onBreak` to mark unavailable; wire `ConsecutiveBreaker` for reset.
>   - **Exports**:
>     - Export `CockatielEventEmitter` and `CockatielFailureReason`; re-export via `index`.
>   - **Tests**:
>     - Expand/refactor tests to cover `onAvailable`, `getCircuitState`, `reset`, and timing cases; update export snapshot.
>   - **Docs**:
>     - Update `CHANGELOG.md` with new methods and exports.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit e597d0b. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

Remove primaryEndpointUrl field from NetworkController chain-level events:
- NetworkController:rpcEndpointChainUnavailable
- NetworkController:rpcEndpointChainDegraded
- NetworkController:rpcEndpointChainAvailable

Chain-level events are designed to represent the overall status of an endpoint chain, not individual endpoints. The primaryEndpointUrl field is redundant since consumers can derive endpoint information from the networkClientId using getNetworkClientById() or getNetworkConfigurationByNetworkClientId().

Individual endpoint events (rpcEndpointUnavailable, rpcEndpointDegraded, rpcEndpointRetried) retain the primaryEndpointUrl field, as it's useful for comparing endpointUrl to primaryEndpointUrl to determine whether the affected endpoint is a primary or a failover.

Updated event type definitions, event publishing logic, test assertions, and changelog to reflect these changes.
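
The comparison mentioned for individual endpoint events might look like the sketch below. The `EndpointEventPayload` shape is an assumption that includes only the three fields named in this PR, and `getEndpointRole` and the URLs are hypothetical.

```typescript
// Assumed subset of the per-endpoint event payload.
interface EndpointEventPayload {
  networkClientId: string;
  endpointUrl: string;
  primaryEndpointUrl: string;
}

// An endpoint is the primary when its URL matches the primary URL;
// any other URL in the chain is treated as a failover.
function getEndpointRole(
  payload: EndpointEventPayload,
): 'primary' | 'failover' {
  return payload.endpointUrl === payload.primaryEndpointUrl
    ? 'primary'
    : 'failover';
}

// Example payload with made-up URLs.
const exampleRole = getEndpointRole({
  networkClientId: 'mainnet',
  endpointUrl: 'https://failover.example',
  primaryEndpointUrl: 'https://primary.example',
});
```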

packages/network-controller/src/create-network-client.ts

packages/network-controller/CHANGELOG.md

packages/network-controller/src/rpc-service/rpc-service-chain.ts

Add the same undefined check that exists in onBreak to ensure type safety and prevent publishing events with undefined error values.

Use CockatielFailureReason type instead of generic object type for better type safety and clarity.

Capture the chain status before calling service.request() to prevent spurious onBreak emissions. The onDegraded handler can fire synchronously during service.request() and change the status from Unavailable to Degraded before the catch block checks it, causing incorrect onBreak events when recovery attempts fail.

packages/network-controller/src/rpc-service/rpc-service-chain.test.ts

Revert the previous fix that captured previousStatus before the request. Checking the current status (this.#status) is correct because it accounts for status changes that may occur during the request from other services in the chain. The original check prevents duplicate onBreak emissions when the chain is already Unavailable.

packages/network-controller/src/rpc-service/rpc-service-chain.test.ts

packages/network-controller/src/NetworkController.ts

The test 'calls onAvailable when a service becomes degraded by responding slowly, and then recovers' was not actually simulating a slow response, so it was only testing initial availability, not recovery from degraded state.

Changes:
- Add clock.tick(DEFAULT_DEGRADED_THRESHOLD + 1) to first mock to simulate slow response
- Add onDegraded listener to verify degradation actually occurred
- Add assertions to verify both onDegraded and onAvailable are called
- Add assertion to verify call order (degradation before recovery)

packages/network-controller/src/rpc-service/rpc-service-chain.ts

packages/network-controller/src/create-network-client.ts

packages/network-controller/src/rpc-service/rpc-service-chain.ts

…vents

- Remove primaryEndpointUrl from event type definitions for onBreak, onDegraded, and onAvailable
- Remove primaryEndpointUrl from event emissions in RpcServiceChain
- Update event listener type signatures to not include primaryEndpointUrl
- Update all test expectations to remove primaryEndpointUrl from assertions
- Update create-network-client.ts to remove primaryEndpointUrl from event handlers
- Note: onService* methods still include primaryEndpointUrl as they were not changed

packages/network-controller/src/rpc-service/rpc-service-chain.test.ts

- Remove endpointUrl from onBreak, onDegraded, and onAvailable events in RpcServiceChain
- Update type definitions to exclude endpointUrl using ExcludeCockatielEventData
- Update event emissions to exclude endpointUrl from chain-level events
- Update NetworkController event types to remove endpointUrl from chain-level events (rpcEndpointChainDegraded, rpcEndpointChainAvailable, rpcEndpointChainUnavailable)
- Update event handlers in create-network-client.ts to not destructure endpointUrl
- Update all test assertions to remove endpointUrl from chain-level event expectations
- Remove unused rpcUrl parameters from test functions
- Align all chain-level events to not include endpointUrl (consistent with unavailable event)

- Change tertiaryEndpointUrl from 'https://second.endpoint' to 'https://third.endpoint'

Gudahtt

LGTM!

mcmire added 6 commits November 14, 2025 14:45

Fix tests

6a3cff1

Add more tests

c08f398

No need for getLastInnerFailureReason

5e0e3e1

Fix an issue with onAvailable

e2eba7a

Reduce the diff

246b2b5

mcmire mentioned this pull request Nov 17, 2025

Extend createServicePolicy to support live network status #7164

Merged

4 tasks

mcmire force-pushed the update-network-controller-rpc-endpoint-events branch from 2f9688e to 2c7678c Compare November 17, 2025 16:55

Fix tests

199bb79

mcmire force-pushed the update-network-controller-rpc-endpoint-events branch from 2c7678c to 4c58933 Compare November 17, 2025 17:55

mcmire added 3 commits November 17, 2025 11:58

Use a quasi-enum for the availability status

ff6d832

Fix test

fa66813

mcmire force-pushed the update-network-controller-rpc-endpoint-events branch from 4c58933 to 9d090e9 Compare November 17, 2025 19:12

mcmire commented Nov 17, 2025

Remove this comment

0da865b

Gudahtt reviewed Nov 17, 2025

packages/network-controller/CHANGELOG.md (outdated)

mcmire added 4 commits November 17, 2025 14:03

Add 'degraded' status

b3909af

Use similar terminology as in createServicePolicy

6b628d7

Merge branch 'update-create-service-policy' into update-network-contr…

4a3985a

…oller-rpc-endpoint-events

Adjust createServicePolicy as well

2d38446

mcmire added the no-changelog label Nov 17, 2025

mcmire added 4 commits November 17, 2025 23:15

Adjust createServicePolicy as well

3d8da80

Merge branch 'update-create-service-policy' into update-network-contr…

7860897

…oller-rpc-endpoint-events

Update some of the terminology

f67839a

Update more of the terminology

110cb0b

mcmire removed the no-changelog label Nov 18, 2025

RpcEndpointUnvailable -> RpcEndpointUnavailable

b16597a

docs: add JSDoc for getError helper in createNetworkClient

7b9c736

Add comprehensive documentation for the getError function that extracts errors from Cockatiel's FailureReason type in circuit breaker event handlers. Documents both possible shapes of the FailureReason object.

cryptodev-2s requested a review from Gudahtt November 25, 2025 18:07

cryptodev-2s force-pushed the update-network-controller-rpc-endpoint-events branch from 9a78c8f to 81a987c Compare November 25, 2025 22:54

cryptodev-2s force-pushed the update-network-controller-rpc-endpoint-events branch from 81a987c to 2c35648 Compare November 25, 2025 23:00

cursor bot reviewed Nov 25, 2025

packages/network-controller/src/create-network-client.ts

cryptodev-2s enabled auto-merge November 25, 2025 23:07

Gudahtt reviewed Nov 26, 2025

packages/network-controller/CHANGELOG.md (outdated)

docs: clarify event payload changes in network-controller changelog

30a9486