NET-11126: bootstrap: Add cluster_manager.enable_deferred_cluster_creation #644

Draft
wants to merge 1 commit into
base: main
@natemollica-nm natemollica-nm commented Sep 25, 2024

NET-11126: Excessive Upstream Cluster CPU/Memory Overhead

Configure consul-dataplane sidecar proxies and gateways to set the cluster manager's enable_deferred_cluster_creation option to true by default. This limits upstream cluster initialization overhead when a large number of upstream clusters are available or are broadcast between peered clusters.


Context:

  • Kubernetes/OpenShift clusters hosting consul-k8s service-mesh
  • Service redundancy and failover are configured over the peering connection using exported services.
  • As consul-k8s clusters grow in mesh service count and the number of exported services increases, the number of known clusters that each consul-dataplane sidecar proxy must initialize at startup also increases.
    • This causes CPU and memory spikes at startup. Where Kubernetes/OpenShift resource quota limits apply, these spikes introduce unwanted latency and delays, and potentially application container restarts before the sidecar is fully online.

Improvements:

  • Lower CPU/memory utilization at dataplane sidecar proxy startup
  • Reduced startup times for sidecar proxies
  • Reduced resource consumption at startup, which encourages the use of resource quotas and minimizes costs

Example of upstream cluster counts initialized during startup for a production-scale cluster with peering service exports:

thread_local_cluster_manager.worker_0.clusters_inflated: 218
thread_local_cluster_manager.worker_1.clusters_inflated: 218
thread_local_cluster_manager.worker_2.clusters_inflated: 218
thread_local_cluster_manager.worker_3.clusters_inflated: 218

The sidecar container has to process and register all 218 clusters within its Envoy configuration before coming fully online, whether or not the clusters are required for normal operation.
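The per-worker counts above can be tallied with a short script. This is a minimal sketch, assuming the stats arrive as plain `name: value` text lines in the format shown above; the function name `inflated_clusters_per_worker` is hypothetical and not part of consul-dataplane or Envoy.

```python
import re

# Sample input copied from the stats shown above.
SAMPLE_STATS = """\
thread_local_cluster_manager.worker_0.clusters_inflated: 218
thread_local_cluster_manager.worker_1.clusters_inflated: 218
thread_local_cluster_manager.worker_2.clusters_inflated: 218
thread_local_cluster_manager.worker_3.clusters_inflated: 218
"""

# Matches one per-worker clusters_inflated line and captures worker name and count.
_INFLATED = re.compile(
    r"^thread_local_cluster_manager\.(worker_\d+)\.clusters_inflated:\s*(\d+)$"
)


def inflated_clusters_per_worker(stats_text):
    """Return {worker_name: inflated_cluster_count} parsed from stats text."""
    counts = {}
    for line in stats_text.splitlines():
        match = _INFLATED.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = inflated_clusters_per_worker(SAMPLE_STATS)
total = sum(counts.values())
print(counts)
print("total thread-local cluster instances:", total)
```

With four workers each inflating all 218 clusters, the proxy holds 872 thread-local cluster instances, which is the cost that deferred creation aims to avoid for clusters that never receive traffic.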


Consul Dataplane Logs

{"@timestamp":"2024-09-16T13:29:26.535638Z+00:00","@module":"envoy.main","@level":"info","@message":"starting main dispatch loop","thread":31}
{"@timestamp":"2024-09-16T13:29:26.837042Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: add 140 cluster(s), remove 0 cluster(s)","thread":31}
{"@timestamp":"2024-09-16T13:30:45.735353Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: added/updated 140 cluster(s), skipped 0 unmodified cluster(s)","thread":31}
{"@timestamp":"2024-09-16T13:30:45.735391Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cm init: initializing secondary clusters","thread":31}
{"@timestamp":"2024-09-16T13:30:45.739702Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
---- (message repeats for a variable duration, from seconds to minutes; cut for brevity) ----
{"@timestamp":"2024-09-16T13:30:45.740925Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740932Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740941Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740949Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740958Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cm init: all clusters initialized","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740968Z+00:00","@module":"envoy.main","@level":"info","@message":"all clusters initialized. initializing init manager","thread":31}
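To quantify the startup cost visible in these logs, the time between the `cds: add` and `cds: added/updated` messages can be computed directly. This is a rough sketch, assuming one JSON object per log line and timestamps ending in `Z+00:00` as shown above; the helper names `parse_ts` and `cds_duration_seconds` are hypothetical.

```python
import json
from datetime import datetime

# The two relevant log lines, copied from the output above.
LOG_LINES = [
    '{"@timestamp":"2024-09-16T13:29:26.837042Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: add 140 cluster(s), remove 0 cluster(s)","thread":31}',
    '{"@timestamp":"2024-09-16T13:30:45.735353Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: added/updated 140 cluster(s), skipped 0 unmodified cluster(s)","thread":31}',
]


def parse_ts(raw):
    """Parse a timestamp such as 2024-09-16T13:29:26.837042Z+00:00.

    The trailing "Z+00:00" is dropped; both timestamps are treated as UTC.
    """
    return datetime.strptime(raw.split("Z", 1)[0], "%Y-%m-%dT%H:%M:%S.%f")


def cds_duration_seconds(lines):
    """Seconds between the first 'cds: add ' and 'cds: added/updated' messages."""
    start = end = None
    for line in lines:
        entry = json.loads(line)
        message = entry["@message"]
        if start is None and message.startswith("cds: add "):
            start = parse_ts(entry["@timestamp"])
        elif end is None and message.startswith("cds: added/updated"):
            end = parse_ts(entry["@timestamp"])
    if start is None or end is None:
        return None  # one of the two markers is missing
    return (end - start).total_seconds()


duration = cds_duration_seconds(LOG_LINES)
print(f"CDS cluster add took {duration:.3f} s")
```

For the sample above, adding 140 clusters took roughly 79 seconds before secondary cluster initialization could begin.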

This PR simply adds the following entry to the bootstrap template so that all dataplane proxy instances limit the cluster initialization process and reduce CPU/memory overhead.

enable_deferred_cluster_creation
(bool) Whether the ClusterManager will create clusters on the worker threads inline during requests. This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread.

  "cluster_manager": {
    "enable_deferred_cluster_creation": true
  }
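One way to confirm the setting took effect is to inspect a rendered bootstrap document. The sketch below is illustrative only: the `RENDERED_BOOTSTRAP` content (including the `node` entry) is a made-up minimal example, not the actual consul-dataplane template output, and `deferred_cluster_creation_enabled` is a hypothetical helper.

```python
import json

# Hypothetical, minimal rendered bootstrap used purely for illustration.
RENDERED_BOOTSTRAP = """
{
  "node": {"id": "example-sidecar"},
  "cluster_manager": {
    "enable_deferred_cluster_creation": true
  }
}
"""


def deferred_cluster_creation_enabled(bootstrap_json):
    """Return True only if the bootstrap JSON explicitly enables the option."""
    bootstrap = json.loads(bootstrap_json)
    cluster_manager = bootstrap.get("cluster_manager", {})
    # A missing key is treated as disabled.
    return cluster_manager.get("enable_deferred_cluster_creation", False) is True


enabled = deferred_cluster_creation_enabled(RENDERED_BOOTSTRAP)
print("deferred cluster creation enabled:", enabled)
```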
