Backport of fix: use Envoy's default for validate_clusters to fix breaking routes when some backend clusters don't exist into release/1.19.x #21621

hc-github-team-consul-core (Collaborator)

Backport

This PR is auto-generated from #21587 to be assessed for backporting due to the inclusion of the label backport/1.19.

The below text is copied from the body of the original PR.


Description

The validate_clusters option in Envoy's route configuration says:

"An optional boolean that specifies whether the clusters that the route table refers to will be validated by the cluster manager. If set to true and a route refers to a non-existent cluster, the route table will not load. If set to false and a route refers to a non-existent cluster, the route table will load and the router filter will return a 404 if the route is selected at runtime. This setting defaults to true if the route table is statically defined via the route_config option. This setting default to false if the route table is loaded dynamically via the rds option. Users may wish to override the default behavior in certain cases (for example when using CDS with a static route table)."

We load the route table dynamically via RDS, but override the default and explicitly set validate_clusters to true. This means that when a cluster a route is supposed to point to doesn't exist, the route can fail to route to any of its backends. This case can be triggered if you have a router -> resolver where the resolver has backends on different peers or WAN-federated backends, and you add a route to a backend that doesn't exist. The non-existent backend causes the existing backends to fail. I was not able to trigger this case in a single-cluster setup, but it can be triggered with a peered backend.

Because the traffic doesn't just blackhole but instead returns a 503, Envoy's default seems to be the desired behavior, rather than making all other routing paths within that route fail due to a missing cluster. This is similar to the conclusion reached in the Jira ticket.

This PR removes the code that overrides the default value of this validate_clusters option.
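
To make the difference concrete, here is a minimal toy model in Go of the behavior described above. It is not Envoy or Consul code: `RouteTable`, `Source`, `Load`, and `Serve` are hypothetical names invented for this sketch, and the status codes simply follow what the description reports (a 503 for a route whose cluster is missing). It only illustrates how an explicit `true` rejects the whole table, while the RDS default lets the healthy routes keep working.

```go
package main

import "fmt"

// Source records how the route table reached the proxy (toy type, not an Envoy type).
type Source int

const (
	Static Source = iota // defined inline via route_config
	RDS                  // delivered dynamically via rds
)

// RouteTable is a hypothetical, simplified stand-in for a route configuration.
type RouteTable struct {
	Source           Source
	ValidateClusters *bool             // nil means "use the default for Source"
	Routes           map[string]string // path prefix -> cluster name
}

// effectiveValidate applies the documented defaults: true for static, false for RDS.
func effectiveValidate(rt RouteTable) bool {
	if rt.ValidateClusters != nil {
		return *rt.ValidateClusters
	}
	return rt.Source == Static
}

// Load rejects the whole table when validation is on and any route names an unknown cluster.
func Load(rt RouteTable, known map[string]bool) error {
	if !effectiveValidate(rt) {
		return nil
	}
	for prefix, cluster := range rt.Routes {
		if !known[cluster] {
			return fmt.Errorf("route %q refers to unknown cluster %q", prefix, cluster)
		}
	}
	return nil
}

// Serve returns the status for a request against an already-loaded table.
func Serve(rt RouteTable, known map[string]bool, prefix string) int {
	cluster, ok := rt.Routes[prefix]
	if !ok {
		return 404 // no matching route in this toy model
	}
	if !known[cluster] {
		return 503 // status the PR description reports for a missing cluster
	}
	return 200
}

func main() {
	known := map[string]bool{"backend-a": true}
	routes := map[string]string{"/a": "backend-a", "/missing": "backend-gone"}

	yes := true
	overridden := RouteTable{Source: RDS, ValidateClusters: &yes, Routes: routes}
	defaulted := RouteTable{Source: RDS, Routes: routes}

	// Explicit true: the table fails to load, so even "/a" is unavailable.
	fmt.Println("override load error:", Load(overridden, known))

	// RDS default: the table loads; only the route to the missing cluster fails.
	fmt.Println("default load error:", Load(defaulted, known))
	fmt.Println("/a:", Serve(defaulted, known, "/a"), "/missing:", Serve(defaulted, known, "/missing"))
}
```

In this sketch, leaving `ValidateClusters` unset on an RDS-delivered table corresponds to what the PR does by removing the override.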

Testing & Reproduction steps

Links

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

Overview of commits


Auto approved Consul Bot automated PR

@github-actions bot added the type/docs, theme/api, and theme/envoy/xds labels Aug 20, 2024
@ndhanushkodi merged commit b996f99 into release/1.19.x Aug 20, 2024
104 checks passed
@ndhanushkodi deleted the backport/nd/net-10435-cluster-validation/overly-warm-thrush branch August 20, 2024 06:05