NET-11126: bootstrap: Add cluster_manager.enable_deferred_cluster_creation #644

natemollica-nm · 2024-09-25T23:45:52Z

NET-11126: Excessive Upstream Cluster CPU/Memory Overhead

Configure consul-dataplane sidecar proxies and gateways to set the cluster manager's enable_deferred_cluster_creation option to true by default. This limits upstream cluster initialization overhead when a large number of upstream clusters are available or broadcast between peered clusters.

Context:

- Kubernetes/OpenShift clusters host the consul-k8s service mesh.
- Service redundancy and failover are configured via the peering connection using exported services.
- As consul-k8s clusters grow in mesh service count and in the number of exported services, the number of clusters that each consul-dataplane sidecar proxy must initialize at startup also grows.
  - This causes CPU and memory spikes at startup. Where Kubernetes/OpenShift resource quota limits apply, these spikes can introduce unwanted latency and delays, and potentially application container restarts, before the sidecar is fully online.

Improvements:

- Lower dataplane sidecar proxy CPU/memory utilization at startup
- Reduced startup times for sidecar proxies
- Reduced resource consumption at startup, which encourages the use of resource quotas and minimizes costs

Example of upstream cluster counts initialized during startup for a production-scale cluster with peering service exports:

thread_local_cluster_manager.worker_0.clusters_inflated: 218
thread_local_cluster_manager.worker_1.clusters_inflated: 218
thread_local_cluster_manager.worker_2.clusters_inflated: 218
thread_local_cluster_manager.worker_3.clusters_inflated: 218

The sidecar container has to process and register all 218 clusters within its Envoy configuration before coming fully online, whether or not the clusters are required for normal operation.
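To compare these counters before and after the change, the per-worker values can be pulled out of a stats dump. The following is a minimal sketch, assuming the stats are available as plain text in the `name: value` form shown above; the function name `inflated_per_worker` is hypothetical and not part of consul-dataplane or Envoy.

```python
import re

# Matches lines such as
# "thread_local_cluster_manager.worker_0.clusters_inflated: 218"
STAT_RE = re.compile(
    r"^thread_local_cluster_manager\.(worker_\d+)\.clusters_inflated:\s*(\d+)$"
)


def inflated_per_worker(stats_text):
    """Return {worker_name: clusters_inflated} parsed from a stats text dump."""
    counts = {}
    for line in stats_text.splitlines():
        match = STAT_RE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


# Sample input copied from the stats listed in this PR description.
SAMPLE = """\
thread_local_cluster_manager.worker_0.clusters_inflated: 218
thread_local_cluster_manager.worker_1.clusters_inflated: 218
thread_local_cluster_manager.worker_2.clusters_inflated: 218
thread_local_cluster_manager.worker_3.clusters_inflated: 218
"""

counts = inflated_per_worker(SAMPLE)
total = sum(counts.values())
print(counts)
print("total inflated across workers:", total)
```

With deferred creation enabled, the expectation is that each worker inflates only the clusters it actually receives requests for, so these per-worker values should drop for proxies with many inactive upstreams.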

Consul Dataplane Logs

{"@timestamp":"2024-09-16T13:29:26.535638Z+00:00","@module":"envoy.main","@level":"info","@message":"starting main dispatch loop","thread":31}
{"@timestamp":"2024-09-16T13:29:26.837042Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: add 140 cluster(s), remove 0 cluster(s)","thread":31}
{"@timestamp":"2024-09-16T13:30:45.735353Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: added/updated 140 cluster(s), skipped 0 unmodified cluster(s)","thread":31}
{"@timestamp":"2024-09-16T13:30:45.735391Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cm init: initializing secondary clusters","thread":31}
{"@timestamp":"2024-09-16T13:30:45.739702Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
---- (message repeats for a variable period, from seconds to minutes; cut for brevity) ----
{"@timestamp":"2024-09-16T13:30:45.740925Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740932Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740941Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740949Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740958Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cm init: all clusters initialized","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740968Z+00:00","@module":"envoy.main","@level":"info","@message":"all clusters initialized. initializing init manager","thread":31}
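The cost of cluster initialization is visible in the gap between the `cds: add` and `cds: added/updated` messages. As a rough sketch, the gap can be measured from the JSON log lines like this; the helper names `parse_ts` and `seconds_between` are hypothetical, and the timestamp handling assumes the `...Z+00:00` suffix seen in the sample logs.

```python
import json
from datetime import datetime


def parse_ts(raw):
    """Parse a log timestamp, keeping only the part before the trailing 'Z'."""
    return datetime.strptime(raw.split("Z", 1)[0], "%Y-%m-%dT%H:%M:%S.%f")


def seconds_between(log_lines, start_prefix, end_prefix):
    """Seconds from the first message starting with start_prefix to the next
    message starting with end_prefix, or None if either is missing."""
    start = None
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as the "message repeats" marker
        if not isinstance(entry, dict):
            continue
        message = entry.get("@message", "")
        if start is None and message.startswith(start_prefix):
            start = parse_ts(entry["@timestamp"])
        elif start is not None and message.startswith(end_prefix):
            return (parse_ts(entry["@timestamp"]) - start).total_seconds()
    return None


# Two lines copied from the log excerpt in this PR description.
SAMPLE_LOGS = [
    '{"@timestamp":"2024-09-16T13:29:26.837042Z+00:00","@module":"envoy.upstream",'
    '"@level":"info","@message":"cds: add 140 cluster(s), remove 0 cluster(s)","thread":31}',
    '{"@timestamp":"2024-09-16T13:30:45.735353Z+00:00","@module":"envoy.upstream",'
    '"@level":"info","@message":"cds: added/updated 140 cluster(s), skipped 0 unmodified cluster(s)","thread":31}',
]

# The trailing space in "cds: add " keeps it from matching "cds: added/updated".
elapsed = seconds_between(SAMPLE_LOGS, "cds: add ", "cds: added/updated")
print("seconds spent adding clusters:", elapsed)
```

For the excerpt above this reports roughly 79 seconds between receiving the 140 clusters and finishing the add/update step.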

This PR adds the following entry to the bootstrap template so that all dataplane proxy instances limit cluster initialization at startup and reduce CPU/memory overhead.

enable_deferred_cluster_creation
(bool) Whether the ClusterManager will create clusters on the worker threads inline during requests. This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread.

  "cluster_manager": {
    "enable_deferred_cluster_creation": true
  }
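One way to confirm that a rendered bootstrap carries the new entry is to load the JSON and check the flag. This is a minimal sketch, not the test used in this PR: the function name `deferred_creation_enabled` and the `node` values in the sample are hypothetical, and only the `cluster_manager` entry comes from the change itself.

```python
import json


def deferred_creation_enabled(bootstrap_json):
    """Return True only if cluster_manager.enable_deferred_cluster_creation is true."""
    config = json.loads(bootstrap_json)
    cluster_manager = config.get("cluster_manager") or {}
    return cluster_manager.get("enable_deferred_cluster_creation") is True


# Hypothetical, heavily trimmed bootstrap; the "node" values are placeholders.
SAMPLE_BOOTSTRAP = """
{
  "node": {"id": "example-sidecar", "cluster": "example-service"},
  "cluster_manager": {
    "enable_deferred_cluster_creation": true
  }
}
"""

print(deferred_creation_enabled(SAMPLE_BOOTSTRAP))
print(deferred_creation_enabled('{"node": {"id": "example-sidecar"}}'))
```

The strict `is True` comparison is deliberate: a missing key, `null`, or a string such as `"true"` is treated as not enabled.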

Commit a2b9ff3: bootstrap: Add cluster_manager.enable_deferred_cluster_creation: true… to bootstrap template
