HA design doc / proposal #252
| ``` | ||
| **Note:** | ||
|
|
||
| - The _preference_ within an affinity group is just one way of ranking gateways. The value itself may be irrelevant. An alternative encoding may just enumerate the gateways in ascending/descending order, like |
If we go for the affinity approach, I think this is preferable, as the other option is more error-prone and forces us to handle the case where the user assigns the same priority to different gateways in the same group.
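To make the duplicate-priority concern concrete, here is a minimal Python sketch comparing the two encodings. The function names and the gateway names (`gw1`, `gw2`, `gw3`) are hypothetical and not part of the proposal.

```python
from collections import Counter


def duplicate_preferences(group):
    """Return preference values assigned to more than one gateway.

    `group` maps gateway name -> explicit preference (hypothetical encoding).
    """
    counts = Counter(group.values())
    return sorted(value for value, n in counts.items() if n > 1)


def ranks_from_order(gateways):
    """Derive ranks from list position; rank 1 is the most preferred gateway."""
    if len(set(gateways)) != len(gateways):
        raise ValueError("a gateway appears twice in the group")
    return {gw: rank for rank, gw in enumerate(gateways, start=1)}


# Explicit preferences allow a tie that must be detected and rejected.
explicit = {"gw1": 10, "gw2": 10, "gw3": 5}
ties = duplicate_preferences(explicit)

# With an ordered list, the position is the preference, so ties cannot occur.
ordered = ["gw2", "gw1", "gw3"]
ranks = ranks_from_order(ordered)
```

With the ordered encoding, the only invalid input left is a gateway listed twice, which is simpler to validate than arbitrary preference values.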
| With this approach: | ||
|
|
||
| * Gateways, which know about VPCs and peerings, are responsible for correctly annotating routes with the right priority. | ||
| * Leaves use the annotated communities to alter the selection of routes. This can be done with route-maps that assign a weight or local-preference to routes depending on the communities. |
Can we not use the gateways' VTEPs / RDs or something of the sort to distinguish between the routes advertised by the gateways and set the appropriate metric on the fabric leaves? This way the gateways do not have to do anything and the config is all on the leaves.
That would require leaves to have extra information about the peerings that go over the gateway. I don't think the existing attributes would help us. On the same VRF, a leaf may have routes pointing to distinct gateways, even for the same destination VPC, depending on the peerings. I would not use the VTEP IP, for instance, as this would require leaves to know which gateways exist (in the configuration), which makes the solution much more rigid. With communities, gateways may come and go, and leaves need not know about them. They just check the communities to know which routes to prefer, regardless of which gateway advertises them.
| * This solution requires an a priori agreement of the communities and their meaning by the gateways and fabric nodes (leaves). In the case of leaves, the additional configuration is minimal and does not change depending on the number of VPCs or peerings. It may depend on the number of gateways, though. However, if we assume that a leaf will never have more than k routes to the same prefix (a redundancy of k gateways), then such a configuration is completely static. | ||
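The leaf-side behaviour described in the bullets above can be sketched in Python as follows. This is only a model of what a route-map would do, not device configuration: the local-preference values, the `Route` type, and the gateway names are assumptions made for illustration. Only the community names FABRIC:P1 to FABRIC:P3 come from the proposal.

```python
from dataclasses import dataclass

# Assumed mapping, agreed a priori by gateways and leaves. The numeric
# local-preference values are hypothetical; only their ordering matters.
COMMUNITY_TO_LOCAL_PREF = {
    "FABRIC:P1": 300,
    "FABRIC:P2": 200,
    "FABRIC:P3": 100,
}
DEFAULT_LOCAL_PREF = 50  # routes without a known community rank last


@dataclass(frozen=True)
class Route:
    prefix: str
    gateway: str            # advertising gateway; never used for selection
    communities: frozenset


def local_pref(route):
    """Map the route's communities to a local-preference (highest match wins)."""
    matches = [COMMUNITY_TO_LOCAL_PREF[c] for c in route.communities
               if c in COMMUNITY_TO_LOCAL_PREF]
    return max(matches, default=DEFAULT_LOCAL_PREF)


def best_route(routes):
    """Select the route with the highest local-preference, as BGP would."""
    return max(routes, key=local_pref)


routes = [
    Route("192.168.90.0/24", "gw1", frozenset({"FABRIC:P2"})),
    Route("192.168.90.0/24", "gw2", frozenset({"FABRIC:P1"})),
    Route("192.168.90.0/24", "gw3", frozenset({"FABRIC:P3"})),
]
winner = best_route(routes)
```

Note that the selection logic never inspects `gateway`: the mapping depends only on the k agreed communities, which is why the leaf configuration can stay static as gateways, VPCs, and peerings change.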
|
|
||
|
|
Maybe I missed something, but did we discuss how we actually implement HA, as in, redirecting routes from a higher priority/affinity gateway to a lower one when the former fails? And how do we detect failures?
Oh, sorry. I thought this was clear, but yeah, it's not mentioned.
My assumption is that we detect failures when the BGP sessions go down. This can happen on node failures (gateway or spine) or link failures. Routes do not need to be redirected: if a BGP session is gone, so are the routes it carried. So, if a leaf had in its BGP RIB for some VRF something like:
192.168.90.0/24 via gw1 (highest prio)
192.168.90.0/24 via gw2 (second highest prio -- not installed)
192.168.90.0/24 via gw3 (third highest prio -- not installed)
and gw1 failed (or became unreachable due to link or spine failures), then its RIB would become:
192.168.90.0/24 via gw1 (highest prio) -- GONE
192.168.90.0/24 via gw2 (second highest prio -- installed)
192.168.90.0/24 via gw3 (third highest prio -- not installed)
Of course, BGP timers may play a role, and we can discuss whether to use BFD to expedite the process, but correctness should be guaranteed by the way BGP works.
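The failover sequence above can be modelled with a small Python sketch. The `LeafRib` class and its methods are hypothetical, and the model is simplified: it treats the loss of a gateway's routes as a single withdrawal event and ignores timers, BFD, and the spines in between.

```python
class LeafRib:
    """Toy per-VRF RIB: keeps every path and installs only the best one."""

    def __init__(self):
        # prefix -> {gateway: priority}; a lower number is more preferred
        self._paths = {}

    def learn(self, prefix, gateway, priority):
        self._paths.setdefault(prefix, {})[gateway] = priority

    def withdraw_from(self, gateway):
        """Drop all paths via a gateway, e.g. after its BGP session goes down."""
        for paths in self._paths.values():
            paths.pop(gateway, None)

    def installed(self, prefix):
        """Return the gateway of the installed path, or None if none is left."""
        paths = self._paths.get(prefix, {})
        if not paths:
            return None
        return min(paths, key=paths.get)


rib = LeafRib()
rib.learn("192.168.90.0/24", "gw1", 1)
rib.learn("192.168.90.0/24", "gw2", 2)
rib.learn("192.168.90.0/24", "gw3", 3)

before = rib.installed("192.168.90.0/24")
rib.withdraw_from("gw1")          # gw1 fails or becomes unreachable
after = rib.installed("192.168.90.0/24")
```

No explicit redirection step exists in the model: removing the gw1 path is enough for the next-best path to be selected, which mirrors the argument made in the comment.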
| ## BGP Mechanics for HA | ||
| When multiple gateways exist and advertise VPC prefixes, we need to make sure that **leaves select only one path** (over a single gateway) for each prefix exposed by all VPC peers. Since the fabric uses solely eBGP, the simplest option could be to let gateways advertise peering prefixes with distinct metrics (MED attribute). While this could work in some cases, it is problematic. MED is an optional, non-transitive attribute, meaning that it is not preserved across transit ASes. Therefore, the MED advertised by a GW may not be propagated by the spines it is connected to. Preserving the MED would require non-edge transit nodes to propagate it, which would in turn require those nodes to be aware of VPCs and peerings. This is clearly undesirable. Lastly, even if that were possible, care would need to be taken to select metrics that are valid for any topology. | ||
|
|
||
| An alternative to the MED is the following. Instead of advertising prefixes with distinct metrics (depending on whether they should be primary or backup for a given prefix), gateways may label routes with pre-defined communities, each of which indicates a preference. For instance, in a setup with 3 gateways, leaves would normally get the same route from all 3 gateways. Gateways would, depending on the peerings, label their advertisements with one community out of the set FABRIC:P1, FABRIC:P2, FABRIC:P3. Leaves would then use a filter to assign the required preference based on those communities. For example, a route with community FABRIC:P1 would win over a route with community FABRIC:P2 or FABRIC:P3. So, for the same prefix and VRF, a given leaf node would get 3 routes (one per gateway), each with a distinct community. |
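On the gateway side, the labelling step could look roughly like the Python sketch below. It assumes each peering comes with an ordered list of gateways (most preferred first); the peering names, the `community_for` helper, and the rankings are hypothetical. Only the community names are taken from the text.

```python
# Agreed set of communities for a redundancy of k = 3 gateways.
COMMUNITIES = ("FABRIC:P1", "FABRIC:P2", "FABRIC:P3")


def community_for(gateway, ranked_gateways):
    """Return the community a gateway attaches to routes of one peering.

    `ranked_gateways` lists gateways from most to least preferred.
    """
    rank = ranked_gateways.index(gateway)  # ValueError if gateway not ranked
    if rank >= len(COMMUNITIES):
        raise ValueError("more gateways than agreed communities")
    return COMMUNITIES[rank]


# Hypothetical per-peering rankings: the same gateway can be primary for
# one peering and backup for another.
rankings = {
    "vpc-a<->vpc-b": ["gw1", "gw2", "gw3"],
    "vpc-a<->vpc-c": ["gw3", "gw1", "gw2"],
}

labels = {
    peering: {gw: community_for(gw, order) for gw in order}
    for peering, order in rankings.items()
}
```

Because the rank is per peering, the same gateway may tag some prefixes with FABRIC:P1 and others with FABRIC:P2, and the leaves need no knowledge of why.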
We have a third option which is to inject dummy ASes in the path to make BGP prefer the shorter path routes. That would not require changes to the fabric.
Yes, that may work. I think AS prepending is a bit old-fashioned. Using communities to be able to apply some policy accordingly is a widely used practice.
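For comparison, the AS-prepending alternative can be sketched in Python as below. The ASNs are hypothetical private values, and the model assumes all gateways sit behind the same spine and that earlier best-path steps (weight, local-preference) tie, so that AS-path length decides.

```python
SPINE_ASN = 65100  # hypothetical


def prepend(as_path, own_asn, times):
    """Return a new AS path with `own_asn` prepended `times` extra times."""
    return [own_asn] * times + list(as_path)


def shortest_as_path(paths):
    """Pick the gateway whose advertised AS path is shortest."""
    return min(paths, key=lambda gw: len(paths[gw]))


# Each gateway originates the prefix with its own ASN; backups add dummies.
advertised = {
    "gw1": [65001],                       # primary: no prepending
    "gw2": prepend([65002], 65002, 1),    # first backup: one extra hop
    "gw3": prepend([65003], 65003, 2),    # second backup: two extra hops
}

# The spine adds its own ASN when propagating the routes to the leaf.
at_leaf = {gw: [SPINE_ASN] + path for gw, path in advertised.items()}
preferred = shortest_as_path(at_leaf)
```

This needs no extra policy on the leaves, but the preference is encoded indirectly in path length rather than as an explicit tag that policy can match on.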
aa384c1 to d491965
Signed-off-by: Fredi Raspall <[email protected]>
d491965 to 89dcbf5
Please find the HA proposal in this PR.