The main deployment workflow consists of two steps:
- Configuration preparation for the stack components:
  - aci-exporter
  - Prometheus and Alertmanager
  - Grafana
  - Syslog-ng
  - Promtail
- Deployment using Helm
- Familiarity with Kubernetes: This installation guide is intended to assist with the setup of the ACI Monitoring stack and assumes prior familiarity with Kubernetes; it is not designed to provide instruction on Kubernetes itself.
- A Kubernetes Cluster: Currently the stack has been tested on:
  - Upstream Kubernetes 1.30.x
  - Minikube
  - k3s
- Persistent Volumes: A total of 10Gi should be plenty for a small/demo environment. Many storage provisioners support volume expansion, so it should be easy to increase this post-installation; see the sketch after this list.
- Ability to expose services for:
  - Access to the Grafana, Prometheus and Alertmanager dashboards: this is ideally achieved via an Ingress Controller.
  - (Optional) Wildcard DNS entries for the ingress controller domain.
  - Syslog ingestion from ACI: since syslog can be sent via UDP or TCP, these services must be exposed directly via either a NodePort or a LoadBalancer service type.
- Cluster Compute Resources: This stack has been tested against a 500-node ACI fabric, where it consumed roughly 8GB of RAM; CPU didn't seem to play a major role and any modern CPU should suffice.
- 1 Dedicated Namespace per instance: One instance can monitor at least 500 switches. This is not strictly required, but it is suggested to keep the Helm configuration simple so the default K8s service names can be re-used; see the Config Preparation section for more details.
- Helm: This stack is distributed as a Helm chart and relies on 3rd-party Helm charts as well.
- Connectivity from your Kubernetes Cluster to ACI, either over Out Of Band or In Band.
The ACI Monitoring Stack is a combination of several charts. If you are familiar with Helm you are aware of the struggle to propagate dynamic values to sub-charts; for example, it is not possible to pass the name of a service to a sub-chart in a dynamic way.
In order to simplify the user experience, the chart comes with a few pre-configured parameters that are populated in the configurations of the various sub-charts. For example, the aci-exporter Service Name is pre-configured as aci-exporter-svc and this value is then passed to Prometheus as the service discovery URL.
All these values can be customized; if you need to, refer to the Values file.
Note: This is the first Helm chart camrossi created, and he is sure it can be improved. If you have suggestions they are extremely welcome! :)
The aci-exporter is the bridge between your Cisco ACI environment and the Prometheus monitoring ecosystem. For it to work it needs to know:
- fabrics: A list of fabrics and how to connect to the APICs. This requires a Read-Only Admin user.
- service_discovery: Select whether devices are reachable via Out Of Band (oobMgmtAddr) or In Band (inbMgmtAddr).
Note: The switches are auto-discovered.
This is done by setting the following Values in Helm:
aci_exporter:
  # Profiles for different fabrics
  fabrics:
    fab1:
      username: <username>
      password: <password>
      apic:
        - https://IP1
        - https://IP2
        - https://IP3
      # service_discovery oobMgmtAddr|inbMgmtAddr
      service_discovery: oobMgmtAddr
    fab2:
      username: <username>
      password: <password>
      apic:
        - https://IP1
        - https://IP2
        - https://IP3
      # service_discovery oobMgmtAddr|inbMgmtAddr
      service_discovery: inbMgmtAddr
Prometheus is installed via its own Chart; the options you need to set are:
- The ingress config and the baseURL: these are most likely the same URLs, used to access prometheus and alertmanager.
- Persistent Volume capacity.
- (Optional) retentionSize: this is only needed if you want to limit the retention by size. Keep in mind that if you run out of disk space Prometheus WILL stop working.
- (Optional) alertmanager route: this is used to send notifications via Mail/Webex etc.; the complete syntax is available Here.
Below is an example:
prometheus:
  server:
    ingress:
      enabled: true
      ingressClassName: "traefik"
      hosts:
        - aci-exporter-prom.apps.c1.cam.ciscolabs.com
    baseURL: "http://aci-exporter-prom.apps.c1.cam.ciscolabs.com"
    retentionSize: 5GB
    persistentVolume:
      accessModes: ["ReadWriteOnce"]
      size: 5Gi
  alertmanager:
    baseURL: "http://aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com"
    ingress:
      enabled: true
      ingressClassName: "traefik"
      hosts:
        - host: aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com
          paths:
            - path: /
              pathType: ImplementationSpecific
    config:
      route:
        group_by: ['alertname']
        group_interval: 30s
        repeat_interval: 30s
        group_wait: 30s
        receiver: 'webex'
      receivers:
        - name: webex
          webex_configs:
            - send_resolved: false
              api_url: "https://webexapis.com/v1/messages"
              room_id: "<room_id>"
              http_config:
                authorization:
                  credentials: "<credentials>"
If you use Webex, here are some config steps for you!
Grafana is installed via its own Chart; the main options you need to set are:
- The ingress config: the external URL which can access Grafana.
- Persistent Volume capacity.
- (Optional) adminPassword: if not set it will be auto-generated and can be found in the grafana secret.
- (Optional) viewers_can_edit: this allows users with a view-only role to modify the dashboards and access Explore to execute queries against Prometheus and Loki. However, the user will not be able to save any changes.
- (Optional) deploymentStrategy: if the Grafana Persistent Volume is of type ReadWriteOnce, rolling updates will get stuck as the new pod cannot start before the old one releases the PVC. Setting deploymentStrategy.type to Recreate destroys the original pod before starting the new one.
Below is an example:
grafana:
  grafana.ini:
    users:
      viewers_can_edit: "True"
  adminPassword: <adminPassword>
  deploymentStrategy:
    type: Recreate
  ingress:
    ingressClassName: "traefik"
    enabled: true
    hosts:
      - aci-exporter-grafana.apps.c1.cam.ciscolabs.com
  persistence:
    enabled: true
    size: 2Gi
The syslog config is the most complicated part as it relies on 3 components (promtail, loki and syslog-ng), each with its own individual config. Furthermore, there are two issues we need to overcome:
- The syslog messages don't contain the ACI fabric name: to be able to distinguish the messages of one fabric from another, the only solution is to use dedicated external services with a unique IP:Port pair per fabric.
- Until ACI 6.1 we need syslog-ng between ACI and Promtail to convert from RFC 3164 to RFC 5424. Note: Promtail 3.1.0 adds support for RFC 3164, however this DOES NOT work for Cisco switches and syslog-ng is still required: syslog-ng's syslog-parser has extensive logic to handle all the complexities (and inconsistencies) of RFC 3164 messages.
Loki is deployed with the Simple Scalable profile and is composed of a backend, read and write deployment, each with a replica count of 3.
The backend and write deployments require persistent volumes. This chart is pre-configured to allocate 2Gi volumes for each deployment (a total of 6 PVCs will be created):
- 3 x data-loki-backend-X
- 3 x data-loki-write-X
The PVC size can be easily changed if required, as shown below.
Loki also requires an Object Store. This chart is pre-configured to deploy minio. Note: Currently the Loki chart deploys a very old version of minio and there is a PR open to address this already.
Loki also supports chunks-cache via memcached. The default config allocates 8G of memory; I have decreased this to 1G by default.
If you want to change any of these parameters check the loki section in the Values file.
Assuming the default parameters are acceptable, the only required config for Loki is to set rulerConfig.external_url to point to the Grafana ingress URL:
loki:
  loki:
    rulerConfig:
      external_url: http://aci-exporter-grafana.apps.c1.cam.ciscolabs.com
These two components are tightly coupled together:
- Syslog-ng translates logs from RFC 3164 to RFC 5424 and forwards them to Promtail.
- Promtail ingests logs in RFC 5424 format and forwards them to Loki.
Promtail is pre-configured with:
- Deployment mode with 1 replica.
- Loki push gateway URL: loki-gateway. This is the Loki gateway K8s service name.
- Auto-generated scrapeConfigs that map a fabric to an IP:Port pair.
These settings can be easily changed if required; check the Promtail section in the Values file for more details.
Syslog-ng is pre-configured with:
- Deployment mode with 1 replica.
If you are happy with my defaults, the only configs required are setting the extraPorts for Promtail and the services for Syslog-ng. You will need one entry per fabric and the ports need to "match"; see the diagram below for a visual representation.
Note: Syslog-ng is only needed for ACI < 6.1.
Below is a diagram of our goal for an ACI 6.1 fabric and an ACI 5.2 one.
flowchart-elk LR
    subgraph K8s Cluster
        subgraph Promtail
            PT1513["TCP:1513 label:fab1"]
            PT1514["TCP:1514 label:fab2"]
        end
        subgraph Syslog-ng
            SL["UDP:1514"]
        end
        F1SVC["LoadBalancerIP TCP:1513"]
        F2SVC["LoadBalancerIP UDP:1514"]
        F1SVC --> PT1513
        F2SVC --> SL
    end
    subgraph ACI
        ACI61["ACI Fab1 Ver. 6.1"] --> F1SVC
        ACI52["ACI Fab2 Ver. 5.2"] --> F2SVC
    end
    SL --> PT1514
The above architecture can be achieved with the following config:
- name: This sets the fabric label for the logs received by Loki.
- containerPort: The port the container listens on. This maps a log stream to a fabric.
- service.type: I would suggest setting this to either NodePort or LoadBalancer. Regardless, the allocated IP MUST be reachable by all the fabric nodes.
- service.port: The port the LoadBalancer service is listening on; this will be the port you set in the ACI syslog config.
- service.nodePort: The port the NodePort service is listening on; this will be the port you set in the ACI syslog config.
promtail:
  extraPorts:
    fab1:
      name: fab1
      containerPort: 1513
      service:
        type: LoadBalancer
        port: 1513
    fab2:
      name: fab2
      containerPort: 1516
      service:
        type: ClusterIP

syslog:
  services:
    fab2:
      name: fab2
      containerPort: 1516
      protocol: UDP
      service:
        type: LoadBalancer
        port: 1516
If you need a reminder on how to configure ACI syslog, take a look Here.
Here you can see an example config for 4 fabrics.
Once the configuration file is generated, i.e. aci-mon-stack-config.yaml, Helm can be used to deploy the stack:
helm repo add aci-monitoring-stack https://datacenter.github.io/aci-monitoring-stack
helm repo update
helm -n aci-mon-stack upgrade --install --create-namespace aci-mon-stack aci-monitoring-stack/aci-monitoring-stack -f aci-mon-stack-config.yaml