Contributor

@Fredi-raspall Fredi-raspall commented Dec 3, 2025

Please find the HA proposal in this PR.

  • I think the approach is reasonable
  • It requires changes in fabric and gateway, but those are small.
  • I've tried to minimize API changes
  • I'm adding @edipascale since he may help testing this out and an extra pair of eyes is always welcome

github-actions bot commented Dec 3, 2025

🚀 Temp artifacts published: v0-aa384c12f 🚀

**Note:**

- The _preference_ within an affinity group is just one way of ranking gateways. The value itself may be irrelevant. An alternative encoding may just enumerate the gateways in ascending/descending order, like
Contributor

If we go for the affinity approach, I think this is preferable: the other option is more error-prone and forces us to handle the case where the user assigns the same priority to different gateways in the same group.

With this approach:

* Gateways, which know about VPCs and peerings, are responsible for annotating routes with the right priority.
* Leaves use the annotated communities to alter the selection of routes. This can be done with route-maps that assign a weight or local-preference to routes depending on the communities.
Contributor

Can we not use the gateways' VTEPs, RDs, or something of the sort to distinguish between the routes advertised by the gateways and set the appropriate metric on the fabric leaves? This way the gateways do not have to do anything and all the config lives on the leaves.

Contributor Author

That would require leaves to have extra information about the peerings that go over the gateway. I don't think the existing attributes would help us. Within the same VRF, a leaf may have routes pointing to distinct gateways, even for the same destination VPC, depending on the peerings. I would not use the VTEP IP, for instance, as this would require leaves to know (in their configuration) which gateways exist, which makes the solution much more rigid. With communities, gateways may come and go, and leaves need not know about them. They just check the communities to decide which routes to prefer, regardless of which gateway advertises them.

* Leaves use the annotated communities to alter the selection of routes. This can be done with route-maps that assign a weight or local-preference to routes depending on the communities.
* This solution requires gateways and fabric nodes (leaves) to agree a priori on the communities and their meaning. On the leaves, the additional configuration is minimal and does not change with the number of VPCs or peerings. It may depend on the number of gateways, though. However, if we assume that a leaf will never have more than k routes to the same prefix (a redundancy of k gateways), then this configuration is completely static.
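
As a rough illustration of why the leaf-side configuration can stay static, the sketch below builds the community-to-local-preference table for a redundancy of k gateways. The `FABRIC:P<n>` community names follow the examples in this proposal; the helper name `build_leaf_policy` and the concrete local-preference values are assumptions made purely for illustration.

```python
def build_leaf_policy(k: int, base_local_pref: int = 100) -> dict[str, int]:
    """Map each preference community to a local-preference value.

    Hypothetical helper: FABRIC:P1 gets the highest local-preference and
    FABRIC:Pk the lowest. The table depends only on k, not on the number
    of VPCs, peerings, or on gateway identities.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        f"FABRIC:P{rank}": base_local_pref + 10 * (k - rank + 1)
        for rank in range(1, k + 1)
    }


policy = build_leaf_policy(3)
for community, local_pref in policy.items():
    print(f"{community} -> local-preference {local_pref}")
```

On a real leaf this table would presumably translate into one route-map entry per community, so k entries in total.
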


Contributor

Maybe I missed something, but did we discuss how we actually implement HA, as in, redirecting routes from a higher priority/affinity gateway to a lower one when the former fails? And how do we detect failures?

Contributor Author

Oh, sorry. I thought this was clear, but yeah, it's not mentioned.
My assumption is that we detect failures when the BGP sessions go down. This can happen on node failures (gw or spines) or link failures. Routes do not need to be redirected: if a BGP session is gone, so are the routes it carried. So, if a leaf had in its BGP RIB for some VRF something like:

192.168.90.0/24 via gw1 (highest prio)
192.168.90.0/24 via gw2 (second highest prio -- not installed)
192.168.90.0/24 via gw3 (third highest prio -- not installed)

and gw1 failed (or became unreachable due to link or spine failures), then its RIB would become:

192.168.90.0/24 via gw1 (highest prio) -- GONE
192.168.90.0/24 via gw2 (second highest prio -- installed)
192.168.90.0/24 via gw3 (third highest prio -- not installed)

Of course, BGP timers play a role, and we can discuss whether to use BFD to speed up failure detection, but correctness should be guaranteed by the way BGP works.
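
To make the failover behaviour concrete, here is a toy model of the leaf RIB from the example above. It is only a sketch: `Route`, `LeafRib`, and the numeric `rank` field are made-up names, and a failure is modelled as a single `session_down(gateway)` event, which glosses over the fact that the withdrawal actually reaches the leaf through the spines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    gateway: str
    rank: int  # 1 = highest priority (assumed encoding)


class LeafRib:
    """Toy per-VRF RIB: keeps every received route, installs the best one."""

    def __init__(self) -> None:
        self.routes: set = set()

    def receive(self, route: Route) -> None:
        self.routes.add(route)

    def session_down(self, gateway: str) -> None:
        # Losing the session withdraws every route learned through it.
        self.routes = {r for r in self.routes if r.gateway != gateway}

    def installed(self, prefix: str) -> Optional[Route]:
        # The installed route is the remaining one with the best rank.
        candidates = [r for r in self.routes if r.prefix == prefix]
        return min(candidates, key=lambda r: r.rank, default=None)


rib = LeafRib()
for gw, rank in (("gw1", 1), ("gw2", 2), ("gw3", 3)):
    rib.receive(Route("192.168.90.0/24", gw, rank))

print(rib.installed("192.168.90.0/24").gateway)  # gw1
rib.session_down("gw1")
print(rib.installed("192.168.90.0/24").gateway)  # gw2
```

No explicit redirection step appears anywhere: removing the routes of the failed gateway is enough for the next-best route to be selected.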

## BGP Mechanics for HA
When multiple gateways exist and advertise VPC prefixes, we need to make sure that **leaves select only one path** (over a single gateway) for each prefix exposed by all VPC peers. Since the fabric uses solely eBGP, the simplest option would be to let gateways advertise peering prefixes with distinct metrics (MED attribute). While this could work in some cases, it is problematic. MED is an optional, non-transitive attribute, meaning that it is not preserved across transit ASes. Therefore, the MED advertised by a GW may not be propagated by the spines it is connected to. Preserving the MED would require non-edge transit nodes to propagate it, which would in turn require those nodes to be aware of VPCs and peerings. This is clearly undesirable. Lastly, even if that were possible, care would be needed to select metrics that are valid for any topology.

An alternative to the MED is the following. Instead of advertising prefixes with distinct metrics (depending on whether they should be primary or backup for a given prefix), gateways may label routes with pre-defined communities, each of which indicates a preference. For instance, in a setup with 3 gateways, leaves would normally get the same route from all 3 gateways. Depending on the peerings, gateways would label their advertisements with one community out of the set FABRIC:P1, FABRIC:P2, FABRIC:P3. Leaves would then use a filter to assign the required preference based on those communities. For example, a route with community FABRIC:P1 would win over a route with community FABRIC:P2 or FABRIC:P3. So, for the same prefix and VRF, a leaf would get 3 routes (one per gateway), each with a distinct community.
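
The gateway side of this scheme could look roughly like the sketch below, where each gateway derives its community from its position in a per-peering preference order. The function `community_for`, the peering names, and the idea of passing the order as a list are all hypothetical; the proposal does not fix how the order is encoded.

```python
def community_for(gateway: str, preference_order: list[str]) -> str:
    """Return the community a gateway attaches to the routes of one peering.

    `preference_order` lists the gateways serving the peering, most
    preferred first (hypothetical input).
    """
    rank = preference_order.index(gateway) + 1  # ValueError if not listed
    return f"FABRIC:P{rank}"


# Two hypothetical peerings whose preferred gateway differs.
peerings = {
    "vpc1-vpc2": ["gw1", "gw2", "gw3"],
    "vpc1-vpc3": ["gw2", "gw3", "gw1"],
}

for name, order in peerings.items():
    tags = {gw: community_for(gw, order) for gw in order}
    print(name, tags)
```

Because every gateway in a peering holds a distinct position in the order, each one ends up with a distinct community, which is what the leaf filter relies on.
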
Contributor

We have a third option, which is to inject dummy ASes in the path (AS-path prepending) so that BGP prefers the routes with the shorter path. That would not require changes to the fabric.

Contributor Author

Yes, that may work. I think AS prepending is a bit old-fashioned. Using communities to be able to apply some policy accordingly is a widely used practice.
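
For comparison, the prepending option raised in this thread can be sketched as follows. The ASNs are arbitrary private values chosen for illustration, and the selection only looks at AS-path length, which assumes that attributes evaluated earlier in BGP best-path selection (such as local preference) are equal across the candidate routes.

```python
def prepend(as_path: list[int], own_asn: int, times: int) -> list[int]:
    """Return the AS path with `own_asn` prepended `times` extra times."""
    return [own_asn] * times + as_path


def shortest_as_path(paths: dict[str, list[int]]) -> str:
    """Pick the gateway whose advertisement carries the shortest AS path."""
    return min(paths, key=lambda gw: len(paths[gw]))


SPINE_ASN = 65100  # hypothetical
gateways = {"gw1": (65001, 0), "gw2": (65002, 1), "gw3": (65003, 2)}

# Paths as a leaf would see them: the spine ASN followed by the
# (possibly prepended) gateway ASN.
paths = {
    gw: [SPINE_ASN] + prepend([asn], asn, extra)
    for gw, (asn, extra) in gateways.items()
}

print(shortest_as_path(paths))  # gw1
del paths["gw1"]                # gw1's route is withdrawn
print(shortest_as_path(paths))  # gw2
```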

github-actions bot commented Dec 4, 2025

🚀 Temp artifacts published: v0-d49196519 🚀

github-actions bot commented Dec 4, 2025

🚀 Temp artifacts published: v0-89dcbf53b 🚀

4 participants