This repo documents my steps to deploy a Calico cluster integrated with ACI. I use Terraform to deploy the ACI configuration and spin up the required virtual machines.
The Calico cluster communicates with the ACI fabric via an external L3Out. To simplify the configuration and support virtual machine mobility, the design uses the floating L3Out feature.
The floating L3Out feature enables you to configure an L3Out without specifying logical interfaces. The feature saves you from having to configure multiple L3Out logical interfaces to maintain routing when virtual machines (VMs) move from one host to another. Floating L3Out is supported for VMware vSphere Distributed Switch (VDS) VMM domains starting from ACI 4.2(1) and for physical domains starting from ACI 5.0(1).
To keep the design as flexible as possible and avoid dictating the virtualisation technology, the physical domain approach is used even if the virtualisation environment is based on VMware. This is particularly convenient because it allows the user to mix different virtualisation technologies and bare metal servers at the same time.
For more details on Floating L3 Out refer to the Cisco ACI Floating L3Out documentation.
Terminology refresher:
- Anchor Node: The leaf nodes on which the routing peering is formed. There is no requirement on the number of leaf switches acting as anchor leaf nodes. As of ACI 5.1(3) an ACI leaf can have up to 400 BGP sessions.
- Non-anchor Node: The non-anchor leaf node does not create any routing sessions for L3Out peering. It acts as a passthrough between the anchor node and the L3Out router. A non-anchor leaf node has the floating IP address and can have a floating secondary IP, if needed. In a VMware vDS VMM domain, the floating IP address is deployed only when the virtual router is connected to the leaf node. In a physical domain, the floating IP address is deployed if the leaf port uses an AEP that has an L3Out domain associated with the floating L3Out. The floating IP address is the common IP address for non-anchor leaf nodes. It is used to locate the router virtual machine (VM) through the data path if it moves behind any non-anchor leaf node.
- Floating IP: A common internal IP that non-anchor leaf nodes use to communicate with the anchor leaf nodes.
WARNING: This needs to be updated pending the decision between the two-anchor-node and the all-anchor-node design, due to the tromboning and scaling issues.
The design choices for the floating L3Out are as follows:
- Physical Domain: The floating L3Out VLAN will be deployed on all the ports that are associated with the physical domain. Be careful if you choose to re-use a physical domain, as you might end up with the floating L3Out VLAN deployed on ports that are not connected to your Calico nodes.
- Two Anchor Nodes: A single border leaf can handle up to 400 dynamic routing adjacencies. This allows us to deploy up to 400 Calico nodes per pair of anchor nodes. If more than 400 nodes are required, we will instantiate a new pair of anchor nodes to spread the load.
- Non-Anchor nodes: This depends only on the rack layout and VM/BM spread.
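The scaling rule above can be sketched as a small calculation. This is only an illustration, assuming the 400-session figure applies to each anchor leaf and that every Calico node peers with both leaves of its anchor pair; the function name `anchor_pairs_needed` is hypothetical and not part of this repo.

```python
import math

# Per-leaf BGP session limit quoted above for ACI 5.1(3).
MAX_BGP_SESSIONS_PER_LEAF = 400


def anchor_pairs_needed(calico_nodes: int,
                        sessions_per_leaf: int = MAX_BGP_SESSIONS_PER_LEAF) -> int:
    """Return how many anchor-node pairs are needed for a given node count.

    Assumption: each Calico node opens one BGP session to each leaf of its
    anchor pair, so one pair serves at most `sessions_per_leaf` nodes.
    """
    if calico_nodes < 0:
        raise ValueError("calico_nodes must not be negative")
    if sessions_per_leaf <= 0:
        raise ValueError("sessions_per_leaf must be positive")
    return math.ceil(calico_nodes / sessions_per_leaf)


if __name__ == "__main__":
    for nodes in (50, 400, 401, 1000):
        print(nodes, "nodes ->", anchor_pairs_needed(nodes), "anchor pair(s)")
```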
The eBGP design configures every Calico node with a dedicated AS number and peers it with the two ACI anchor nodes. The following optimizations are already implemented:
- BGP Timers set to 1s/3s to match the Calico Config
- Graceful Restart Helper
- Configure the AS-path relax policy so that ECMP paths through more than one node can be installed
- Increase Max eBGP ECMP Path to 64 (from 16). 64 is the current maximum on ACI
- Configure default-export policy to advertise the POD subnets back to the nodes
- BGP control plane protection:
  - BGP password authentication
  - Ability to set a limit on the number of prefixes received from the nodes
  - Set the maximum AS limit to 1: no Calico node should send more than one AS in its path.
- Subnet import filtering: Only the expected subnets (POD, Node and Services) are accepted by ACI
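The "one AS number per node, two anchor peers" layout can be sketched as a simple planning helper. This is a minimal illustration under stated assumptions: the AS numbers are taken from the 2-byte private range (64512 to 65534), the anchor addresses are documentation placeholders, and `NodePeering` and `plan_peerings` are hypothetical names, not something the Terraform or Ansible code in this repo defines.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumption: node AS numbers come from the 2-byte private AS range.
PRIVATE_AS_START = 64512
PRIVATE_AS_END = 65534


@dataclass(frozen=True)
class NodePeering:
    node: str               # Calico node name
    local_as: int           # dedicated AS number for this node
    peers: Tuple[str, str]  # the two ACI anchor node peer addresses


def plan_peerings(nodes: Sequence[str],
                  anchor_ips: Sequence[str],
                  first_as: int = PRIVATE_AS_START) -> List[NodePeering]:
    """Give every node its own AS number and the same two anchor peers."""
    if len(anchor_ips) != 2:
        raise ValueError("exactly two anchor node addresses are expected")
    if len(set(nodes)) != len(nodes):
        raise ValueError("node names must be unique")
    plan = []
    for offset, node in enumerate(nodes):
        local_as = first_as + offset
        if local_as > PRIVATE_AS_END:
            raise ValueError("ran out of private AS numbers")
        plan.append(NodePeering(node, local_as, (anchor_ips[0], anchor_ips[1])))
    return plan


if __name__ == "__main__":
    # Placeholder node names and anchor addresses, for illustration only.
    for entry in plan_peerings(["master-1", "master-2", "worker-1"],
                               ["192.0.2.1", "192.0.2.2"]):
        print(entry)
```

Because each node has its own AS, the AS-path relax policy listed above is what lets ACI treat routes for the same prefix from different nodes as ECMP candidates.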
The cluster is composed of 3 masters and N workers. Control plane redundancy is ensured by deploying HAProxy and Keepalived.
A few add-ons are also installed on the cluster:
- Helm
- Nginx Ingress
- kubectl bash completion
- kubernetes dashboard
- metric server: the default config is modified to add the `--kubelet-insecure-tls` flag, since all the certificates are self-signed
- Guestbook demo application exposed via ingress. Access via: http://ingress_ip/ (this is not ideal, it is just for demo purposes)
All the configuration required to spin up a cluster is done in the Terraform configuration file. Some of the parameters are then used to generate the Ansible inventory file and Ansible variables. See the Variables Documentation.
- L3Out ECMP is used to load balance traffic to the services running in the cluster: every node that has a POD for an exposed service will advertise a /32 host route for the service IP. Currently ACI does not support resilient hashing for L3Out ECMP. This means that if the number of ECMP paths changes (scaling a deployment up or down can cause this, as can a node failure), flows can potentially be re-hashed to a different node, resulting in connection resets. There is currently an open feature request to support resilient hashing for L3Out ECMP: US9273
- Due to CSCvx73502 the `bgp policy timers` mapping into the `BGP policy` can't be deleted by the `terraform destroy` command, resulting in Terraform failing to complete the destroy operation. You can work around this by executing the `terraform state rm aci_rest.bgp_pol_timers` command before invoking `terraform destroy`.
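The ECMP re-hashing limitation described in the first item above can be illustrated with a toy model. This is a sketch only: it assumes a plain "hash modulo number of paths" selection, which is not a description of how ACI actually hashes flows, and the flow strings and node names are made up.

```python
import hashlib
from typing import List, Sequence


def pick_next_hop(flow: str, next_hops: Sequence[str]) -> str:
    """Pick a next hop with a non-resilient 'hash modulo path count' scheme."""
    digest = hashlib.sha256(flow.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


def moved_flows(flows: Sequence[str],
                before: Sequence[str],
                after: Sequence[str]) -> List[str]:
    """Return flows that change next hop although their old hop still exists."""
    moved = []
    for flow in flows:
        old = pick_next_hop(flow, before)
        new = pick_next_hop(flow, after)
        if old in after and old != new:
            moved.append(flow)
    return moved


if __name__ == "__main__":
    # Hypothetical client flows towards one service IP.
    flows = [f"198.51.100.{i % 250}:{10000 + i}->192.0.2.10:80" for i in range(1000)]
    before = ["node-1", "node-2", "node-3", "node-4"]
    after = ["node-1", "node-2", "node-3"]  # node-4 removed (scale down or failure)
    moved = moved_flows(flows, before, after)
    print(f"{len(moved)} of {len(flows)} flows moved even though their node survived")
```

With resilient hashing, only the flows that were pinned to the removed node would be redistributed; in this toy model, flows on surviving nodes can move too, which is what leads to the connection resets.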