Contributor

@mcmire mcmire commented Nov 15, 2025

Explanation

In a future commit we will introduce changes to network-controller so that it will keep track of the status of each network as requests are made. This commit paves the way for this to happen by redefining the existing RPC endpoint-related events that NetworkController produces.

Currently, when requests are made through the network clients that NetworkController exposes, three events are published:

  • NetworkController:rpcEndpointDegraded
    • Published when enough successive retriable errors are encountered while making a request to an RPC endpoint that the maximum number of retries is reached.
  • NetworkController:rpcEndpointUnavailable
    • Published when enough successive errors are encountered while making a request to an RPC endpoint that the underlying circuit breaks.
  • NetworkController:rpcEndpointRequestRetried
    • Published when a request is retried (mainly used for testing).

It's important to note that in the context of the RPC failover feature, an "RPC endpoint" can encompass multiple URLs (a primary and its failovers), so the above events fire for each individual URL.

While these events are useful for reporting metrics on RPC endpoints, to update the status of a network effectively we need events that are less granular and are guaranteed not to fire multiple times in a row. We also need a new event.

Now the list of events looks like this:

  • NetworkController:rpcEndpointDegraded
    • The same as before.
  • NetworkController:rpcEndpointUnavailable
    • The same as before.
  • NetworkController:rpcEndpointRetried
    • Renamed from NetworkController:rpcEndpointRequestRetried.
  • NetworkController:rpcEndpointChainDegraded
    • Similar to NetworkController:rpcEndpointDegraded, but won't be published again if the RPC endpoint is already in a degraded state.
  • NetworkController:rpcEndpointChainUnavailable
    • Published when all of the circuits underlying all of the URLs for an RPC endpoint have broken (none of the URLs are available). Won't be published again if the RPC endpoint is already in an unavailable state.
  • NetworkController:rpcEndpointChainAvailable
    • A new event. Published the first time a successful request is made to one of the URLs for an RPC endpoint, or following a degraded or unavailable status.
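
As a rough illustration of the "won't be published again" guarantee, the sketch below keeps one status per RPC endpoint chain and records a chain-level event only when that status changes. `ChainStatusTracker`, `KnownChainStatus` and the `published` log are hypothetical names for this example, not code from this PR; only the event names come from the list above.

```typescript
// Hypothetical sketch: every name except the event names is an assumption.
type KnownChainStatus = 'available' | 'degraded' | 'unavailable';

const CHAIN_EVENT_NAMES: Record<KnownChainStatus, string> = {
  available: 'NetworkController:rpcEndpointChainAvailable',
  degraded: 'NetworkController:rpcEndpointChainDegraded',
  unavailable: 'NetworkController:rpcEndpointChainUnavailable',
};

class ChainStatusTracker {
  // 'unknown' until the first status is reported.
  private currentStatus: KnownChainStatus | 'unknown' = 'unknown';

  // Stand-in for a messenger: records the events that would be published.
  readonly published: string[] = [];

  update(nextStatus: KnownChainStatus): void {
    if (nextStatus === this.currentStatus) {
      return; // Same status as before, so nothing is published.
    }
    this.currentStatus = nextStatus;
    this.published.push(CHAIN_EVENT_NAMES[nextStatus]);
  }
}
```

Reporting the same status twice in a row leaves `published` unchanged, which is the property a network-status consumer would rely on.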

Going a bit deeper, in order to make the changes above, it was necessary to move the core logic responsible for diverting traffic to failovers out of RpcService and rewrite it in RpcServiceChain, which was a more natural fit anyway. This also meant that we could simplify RpcService, as well as its tests.

References

Progresses https://consensyssoftware.atlassian.net/browse/WPC-99.

Checklist

  • I've updated the test suite for new or updated code as appropriate
  • I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate
  • I've communicated my changes to consumers by updating changelogs for packages I've changed, highlighting breaking changes as necessary
  • I've prepared draft pull requests for clients and consumer packages to resolve any breaking changes
    • This will come in a later PR.

Note

Introduce chain-level RPC endpoint events and a new RpcServiceChain, rename the retry event, update event payloads, and refactor failover logic with extensive test updates.

  • Events & Payloads (BREAKING):
    • Add NetworkController:rpcEndpointChainAvailable and chain-level events rpcEndpointChainDegraded/rpcEndpointChainUnavailable with non-repeating semantics in src/NetworkController.ts.
    • Rename NetworkController:rpcEndpointRequestRetried to NetworkController:rpcEndpointRetried.
    • Update payloads: add networkClientId, add primaryEndpointUrl to per-endpoint events; chain-level events omit primaryEndpointUrl.
  • Failover Architecture:
    • Introduce RpcServiceChain (src/rpc-service/rpc-service-chain.ts) to manage primary/failover endpoints, circuit states, and emit chain/service events.
    • Simplify RpcService (src/rpc-service/rpc-service.ts): remove embedded failover, add onAvailable, resetPolicy, getCircuitState, improved error handling/logging.
    • Wire into client creation: create-network-client.ts builds RpcServiceChain and publishes updated events; create-auto-managed-network-client.ts now passes networkClientId to createNetworkClient.
  • Exports & Types:
    • Update exports in src/index.ts; adjust RpcServiceRequestable event listener types and remove { isolated: true } from onBreak data.
  • Tests:
    • Add comprehensive tests for new chain/service events and behaviors; update helpers/mocks to support dynamic responses.
  • Changelog & Deps:
    • Document breaking changes and new events in CHANGELOG.md.
    • Add cockatiel to devDependencies.

Written by Cursor Bugbot for commit 916a0e2. This will update automatically on new commits.

In a future commit we will introduce changes to `network-controller` so
that it will keep track of the status of each network as requests are
made. These updates to `createServicePolicy` assist with that.

See the changelog for a list of changes to the `ServicePolicy` API.

Besides the changes listed there, the tests for `createServicePolicy`
have been refactored slightly so that they are easier to maintain in the
future.
@mcmire mcmire force-pushed the update-network-controller-rpc-endpoint-events branch from 2f9688e to 2c7678c Compare November 17, 2025 16:55
@mcmire mcmire force-pushed the update-network-controller-rpc-endpoint-events branch from 2c7678c to 4c58933 Compare November 17, 2025 17:55
In a future commit we will introduce changes to `network-controller` so
that it will keep track of the status of each network as requests are
made. This commit paves the way for this to happen by redefining the
existing RPC endpoint-related events that NetworkController produces.

Currently, when requests are made through the network clients that
NetworkController exposes, three events are published:

- `NetworkController:rpcEndpointDegraded`
  - Published when enough successive retriable errors are encountered
    while making a request to an RPC endpoint that the maximum number of
    retries is reached.
- `NetworkController:rpcEndpointUnavailable`
  - Published when enough successive errors are encountered while making
    a request to an RPC endpoint that the underlying circuit breaks.
- `NetworkController:rpcEndpointRequestRetried`
  - Published when a request is retried (mainly used for testing).

It's important to note that in the context of the RPC failover feature,
an "RPC endpoint" can actually encompass multiple URLs, so the above
events actually fire for any URL.

While these events are useful for reporting metrics on RPC endpoints, in
order to effectively be able to update the status of a network, we need
events that are less granular and are guaranteed not to fire multiple
times in a row. We also need a new event.

Now the list of events looks like this:

- `NetworkController:rpcEndpointInstanceDegraded`
  - The same as `NetworkController:rpcEndpointDegraded` before.
- `NetworkController:rpcEndpointInstanceUnavailable`
  - The same as `NetworkController:rpcEndpointUnavailable` before.
- `NetworkController:rpcEndpointInstanceRetried`
  - Renamed from `NetworkController:rpcEndpointRequestRetried`.
- `NetworkController:rpcEndpointDegraded`
  - Similar to `NetworkController:rpcEndpointInstanceDegraded`, but
    won't be published again if the RPC endpoint is already in a
    degraded state.
- `NetworkController:rpcEndpointUnavailable`
  - Published when all of the circuits underlying all of the URLs for an
    RPC endpoint have broken (none of the URLs are available). Won't be
    published again if the RPC endpoint is already in an unavailable
    state.
- `NetworkController:rpcEndpointAvailable`
  - A new event. Published the first time a successful request is made
    to one of the URLs for an RPC endpoint, or following a degraded or
    unavailable status.
@mcmire mcmire force-pushed the update-network-controller-rpc-endpoint-events branch from 4c58933 to 9d090e9 Compare November 17, 2025 19:12
): Promise<JsonRpcResponse<Result | null>> {
return this.#services[0].request(jsonRpcRequest, fetchOptions);
}
// Start with the primary (first) service and switch to failovers as the
Contributor Author

@mcmire mcmire Nov 17, 2025

Prior to these changes, each RpcService object could have an optional failoverService property. This class, RpcServiceChain, would then build a chain (really, a linked list) of services. In order to make a request, RpcServiceChain would call request on the first service in the chain, and RpcService would decide whether it needed to call the next service in the chain, etc.

While this model is easy to understand, I needed access to certain data along the way, and it seemed easier to use a loop rather than a linked list. Anyway, I figured it really should be the responsibility of RpcServiceChain to manage how requests are sent across the chain.
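
To make the loop-based model concrete, here is a minimal sketch of a chain that walks its services in order. It is deliberately simplified: `SketchService` and `requestThroughChain` are hypothetical names, requests are synchronous for brevity, and the loop fails over on any thrown error rather than waiting for a circuit to break as the real class does.

```typescript
// Hypothetical sketch: synchronous, and it fails over on any thrown error
// instead of consulting retry or circuit state.
interface SketchService {
  endpointUrl: string;
  request(method: string): string;
}

function requestThroughChain(
  services: SketchService[],
  method: string,
): { endpointUrl: string; result: string; attempted: string[] } {
  // Data gathered along the way stays in scope for the whole loop.
  const attempted: string[] = [];
  let lastError: unknown = new Error('The chain has no services');

  // Start with the primary (first) service and move on to failovers in order.
  for (const service of services) {
    attempted.push(service.endpointUrl);
    try {
      const result = service.request(method);
      return { endpointUrl: service.endpointUrl, result, attempted };
    } catch (error) {
      lastError = error;
    }
  }

  // Every service failed, so surface the most recent error.
  throw lastError;
}
```

Because the chain owns the loop, bookkeeping such as `attempted` lives in one place instead of being threaded through each service's link to the next.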

onRetry(
listener: AddToCockatielEventData<
Parameters<ServicePolicy['onRetry']>[0],
listener: CockatielEventToEventListenerWithData<
Contributor Author

@mcmire mcmire Nov 17, 2025

This should be doing the same thing as the previous code. I just thought that AddToCockatielEventData<Parameters<...>[0], ...> was a bit ugly, so I added some more self-descriptive utility types, and this is one of them.

export function createAutoManagedNetworkClient<
Configuration extends NetworkClientConfiguration,
>({
networkClientId,
Contributor Author

We now expose the network client ID in the rpcEndpoint* messenger events, so we need to receive it and pass it down to createNetworkClient.

});
});

it('publishes the NetworkController:rpcEndpointUnavailable event when the failover occurs', async () => {
Contributor Author

These tests were moved to src/create-network-client-tests/rpc-endpoint-events.test.ts, as I realized they didn't belong here. It's true that this test file is concerned with the RPC failover feature (which is required to test the rpcEndpoint* events), but all of the tests in the tests/network-client directory really exercise the middleware stack that createNetworkClient builds, and in so doing loop over all of the RPC methods that our internal provider handles specially. We don't need to go to all that trouble to test the rpcEndpoint* events; we can just use an arbitrary RPC method. (I think eventually I will rename tests/network-client to tests/internal-provider-api or something like that, but that's a PR for another time.)

/**
* Obtains the event data type from a Cockatiel event or event listener type.
*/
export type ExtractCockatielEventData<CockatielEventOrEventListener> =
Contributor Author

The Cockatiel types are a bit awkward to work with (especially since event listeners whose event payloads are empty are typed as Event<void> which is rather strange). These utilities just make them a bit easier to work with.
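
For readers unfamiliar with this kind of helper, the sketch below shows the general idea using conditional types. `EventLike` and `DisposableLike` are local stand-ins assumed for this example (they are not imported from Cockatiel), and `ExtractEventData`, `ListenerFor` and `createEmitter` are hypothetical names, not the utilities added in this PR.

```typescript
// Local stand-ins for this sketch; assumptions, not imports from Cockatiel.
type DisposableLike = { dispose(): void };
type EventLike<Data> = (listener: (data: Data) => void) => DisposableLike;

// Pull the payload type out of an event type.
type ExtractEventData<E> = E extends EventLike<infer Data> ? Data : never;

// Turn an event type into the listener type it accepts.
type ListenerFor<E> = (data: ExtractEventData<E>) => void;

// Minimal emitter so the types can be exercised at runtime.
function createEmitter<Data>(): {
  event: EventLike<Data>;
  emit: (data: Data) => void;
} {
  const listeners: ((data: Data) => void)[] = [];
  return {
    event: (listener) => {
      listeners.push(listener);
      return {
        dispose: () => {
          const index = listeners.indexOf(listener);
          if (index !== -1) {
            listeners.splice(index, 1);
          }
        },
      };
    },
    emit: (data) => {
      // Copy first so a listener that disposes itself does not skip others.
      listeners.slice().forEach((listener) => listener(data));
    },
  };
}

const breakEmitter = createEmitter<{ error: Error }>();
const recordedMessages: string[] = [];

// The listener's parameter type is derived from the event type itself.
const onBreakListener: ListenerFor<typeof breakEmitter.event> = ({ error }) => {
  recordedMessages.push(error.message);
};
const subscription = breakEmitter.event(onBreakListener);
```

Deriving the listener type from the event keeps the two in sync if the payload shape changes later.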

- This ought to be unobservable, but we mark it as breaking out of an abundance of caution.
- **BREAKING:** Split up and update payload data for `NetworkController:rpcEndpoint{Degraded,Unavailable}` ([#7166](https://github.com/MetaMask/core/pull/7166))
- The existing events are now called `NetworkController:rpcEndpointInstance{Degraded,Unavailable}` and retain their present behavior.
- `NetworkController:rpcEndpoint{Degraded,Unavailable}` do still exist, but they are now designed to represent the entire RPC endpoint and are guaranteed to not be published multiple times in a row. In particular, `NetworkController:rpcEndpointUnavailable` is published only after trying all of the designated URLs for a particular RPC endpoint and the underlying circuit for the last URL breaks, not as each primary's or failover's circuit breaks.
Contributor Author

In particular, NetworkController:rpcEndpointUnavailable is published only after trying all of the designated URLs for a particular RPC endpoint and the underlying circuit for the last URL breaks, not as each primary's or failover's circuit breaks.

This change is, I suppose, a bit controversial, and the first version of this was slightly different before I landed on this approach. But I think it makes sense? Basically, if we can reach the network somehow, we shouldn't broadcast that it's unavailable until we're absolutely sure. That does mean that the NetworkController:rpcEndpointUnavailable event may never be published for Infura networks if the failover does its job, but I think that's precisely the intent.
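
Put as code, the rule is roughly the following. This is only a sketch under assumptions: `SketchCircuitState` is a local stand-in for whatever circuit states the services report, and `isChainUnavailable` is a hypothetical helper, not a function in this PR.

```typescript
// Hypothetical sketch: these state names are local stand-ins.
type SketchCircuitState = 'closed' | 'half-open' | 'open';

// The chain counts as unavailable only when every URL's circuit is open.
function isChainUnavailable(circuitStates: SketchCircuitState[]): boolean {
  return (
    circuitStates.length > 0 &&
    circuitStates.every((circuitState) => circuitState === 'open')
  );
}
```

With a primary whose circuit is open and a failover whose circuit is still closed, `isChainUnavailable(['open', 'closed'])` is `false`, so no unavailable event would be published for the endpoint as a whole.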

Contributor Author

mcmire commented Nov 17, 2025

Adding no-changelog because there is a minor change to createServicePolicy which is non-user-facing. I've cherry-picked the createServicePolicy changes to the other PR, so this no longer applies.

github-merge-queue bot pushed a commit that referenced this pull request Nov 18, 2025
## Explanation

In a future commit we will introduce changes to `network-controller` so
that it will keep track of the status of each network as requests are
made. These updates to `createServicePolicy` assist with that. See the
changelog for more.

Besides this, the tests for `createServicePolicy` have been refactored
slightly so that they are easier to maintain in the future.

## References

Progresses https://consensyssoftware.atlassian.net/browse/WPC-99.

You can see how these changes will be used in the next PR:
#7166

## Checklist

- [x] I've updated the test suite for new or updated code as appropriate
- [x] I've updated documentation (JSDoc, Markdown, etc.) for new or
updated code as appropriate
- [x] I've communicated my changes to consumers by [updating changelogs
for packages I've
changed](https://github.com/MetaMask/core/tree/main/docs/contributing.md#updating-changelogs),
highlighting breaking changes as necessary
- [x] I've prepared draft pull requests for clients and consumer
packages to resolve any breaking changes

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Adds `getCircuitState`, `onAvailable`, and `reset` to `ServicePolicy`, exports Cockatiel types, and updates logic/tests to support availability tracking and circuit state introspection.
>
> - **controller-utils**:
>   - **ServicePolicy API**:
>     - Add `getCircuitState()` to expose underlying circuit state.
>     - Add `onAvailable` event for first success and post-recovery success.
>     - Add `reset()` to close the circuit and reset breaker counters.
>   - **Behavior/Internals**:
>     - Track availability status and emit `onAvailable`/`onDegraded` appropriately.
>     - Update `onBreak` to mark unavailable; wire `ConsecutiveBreaker` for reset.
>   - **Exports**:
>     - Export `CockatielEventEmitter` and `CockatielFailureReason`; re-export via `index`.
>   - **Tests**:
>     - Expand/refactor tests to cover `onAvailable`, `getCircuitState`, `reset`, and timing cases; update export snapshot.
>   - **Docs**:
>     - Update `CHANGELOG.md` with new methods and exports.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit e597d0b. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Add comprehensive documentation for the getError function that extracts
errors from Cockatiel's FailureReason type in circuit breaker event handlers.
Documents both possible shapes of the FailureReason object.
@cryptodev-2s cryptodev-2s requested a review from Gudahtt November 25, 2025 18:07
@cryptodev-2s cryptodev-2s force-pushed the update-network-controller-rpc-endpoint-events branch from 9a78c8f to 81a987c Compare November 25, 2025 22:54
Remove primaryEndpointUrl field from NetworkController chain-level events:
- NetworkController:rpcEndpointChainUnavailable
- NetworkController:rpcEndpointChainDegraded
- NetworkController:rpcEndpointChainAvailable

Chain-level events are designed to represent the overall status of an
endpoint chain, not individual endpoints. The primaryEndpointUrl field
is redundant since consumers can derive endpoint information from the
networkClientId using getNetworkClientById() or
getNetworkConfigurationByNetworkClientId().

Individual endpoint events (rpcEndpointUnavailable, rpcEndpointDegraded,
rpcEndpointRetried) retain the primaryEndpointUrl field, as it's useful
for comparing endpointUrl to primaryEndpointUrl to determine whether the
affected endpoint is a primary or a failover.
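
A consumer-side check along these lines might look like the sketch below. The payload type is an assumption built only from the field names mentioned in this PR, and `isFailoverEndpoint` is a hypothetical helper.

```typescript
// Assumed payload shape for this sketch, based on the fields named above.
type EndpointEventPayload = {
  networkClientId: string;
  endpointUrl: string;
  primaryEndpointUrl: string;
};

// An endpoint is a failover when its URL differs from the primary's URL.
function isFailoverEndpoint(payload: EndpointEventPayload): boolean {
  return payload.endpointUrl !== payload.primaryEndpointUrl;
}
```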

Updated event type definitions, event publishing logic, test assertions,
and changelog to reflect these changes.
@cryptodev-2s cryptodev-2s force-pushed the update-network-controller-rpc-endpoint-events branch from 81a987c to 2c35648 Compare November 25, 2025 23:00
Add the same undefined check that exists in onBreak to ensure type safety
and prevent publishing events with undefined error values.
Use CockatielFailureReason type instead of generic object type for better
type safety and clarity.
@cryptodev-2s cryptodev-2s requested a review from Gudahtt November 26, 2025 16:32
Capture the chain status before calling service.request() to prevent
spurious onBreak emissions. The onDegraded handler can fire synchronously
during service.request() and change the status from Unavailable to Degraded
before the catch block checks it, causing incorrect onBreak events when
recovery attempts fail.
Revert the previous fix that captured previousStatus before the request.
Checking the current status (this.#status) is correct because it accounts
for status changes that may occur during the request from other services
in the chain. The original check prevents duplicate onBreak emissions
when the chain is already Unavailable.
The test 'calls onAvailable when a service becomes degraded by responding
slowly, and then recovers' was not actually simulating a slow response,
so it was only testing initial availability, not recovery from degraded state.

Changes:
- Add clock.tick(DEFAULT_DEGRADED_THRESHOLD + 1) to first mock to simulate slow response
- Add onDegraded listener to verify degradation actually occurred
- Add assertions to verify both onDegraded and onAvailable are called
- Add assertion to verify call order (degradation before recovery)
…vents

- Remove primaryEndpointUrl from event type definitions for onBreak, onDegraded, and onAvailable
- Remove primaryEndpointUrl from event emissions in RpcServiceChain
- Update event listener type signatures to not include primaryEndpointUrl
- Update all test expectations to remove primaryEndpointUrl from assertions
- Update create-network-client.ts to remove primaryEndpointUrl from event handlers
- Note: onService* methods still include primaryEndpointUrl as they were not changed
- Remove endpointUrl from onBreak, onDegraded, and onAvailable events in RpcServiceChain
- Update type definitions to exclude endpointUrl using ExcludeCockatielEventData
- Update event emissions to exclude endpointUrl from chain-level events
- Update NetworkController event types to remove endpointUrl from chain-level events (rpcEndpointChainDegraded, rpcEndpointChainAvailable, rpcEndpointChainUnavailable)
- Update event handlers in create-network-client.ts to not destructure endpointUrl
- Update all test assertions to remove endpointUrl from chain-level event expectations
- Remove unused rpcUrl parameters from test functions
- Align all chain-level events to not include endpointUrl (consistent with unavailable event)
@cryptodev-2s cryptodev-2s requested a review from Gudahtt November 26, 2025 23:05
Member

@Gudahtt Gudahtt left a comment

LGTM!

@cryptodev-2s cryptodev-2s added this pull request to the merge queue Nov 27, 2025
Merged via the queue into main with commit 87dc9fa Nov 27, 2025
277 checks passed
@cryptodev-2s cryptodev-2s deleted the update-network-controller-rpc-endpoint-events branch November 27, 2025 14:35