Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Envoy does not use previously sent RouteConfiguration when initial_fetch_timeout value is changed inside Rds config #32283

Closed
jparklab opened this issue Feb 8, 2024 · 9 comments · Fixed by #37180
Labels
area/configuration area/xds beginner Good starter issues! bug no stalebot Disables stalebot from closing an issue

Comments

@jparklab
Copy link
Contributor

jparklab commented Feb 8, 2024

Title: Envoy does not use previously sent RouteConfiguration when initial_fetch_timeout value is changed inside Rds config

Description:

Based on Resource warming section on the envoy documentation, envoy is expected to use the previously sent RouteConfiguration while warming up a Listener and management does not need to send the RouteConfiguration if there is no change. However, when a field inside Rds field in the Listener, including initial_fetch_timeout field, is changed in a Listener, Envoy does not use the previously sent RouteConfiguration and wait for management server for the RouteConfiguration.

This can cause Envoy to time out while waiting for the RouteConfiguration, and finishes Listener warming without the RouteConfiguration. Once a Listener is warmed up without RouteCofiguration, Envoy responds to requests to the route with 404(NR) responses until it is restarted or the RouteConfiguration is updated and management server sends the updated RouteConfiguration to Envoy.

This happens because Envoy does not use existing_provider in https://github.com/envoyproxy/envoy/blob/v1.26.6/source/common/rds/route_config_provider_manager.cc#L82 if the hash value of rds configuration changes which prevents Envoy from using previously sent RouteConfiguration

Can we update Envoy to use existing_provider when initial_fetch_timeout value is changed? Although we do not need to change it often, we sometimes need to change the value, and we want to avoid restarting envoy proxies whenever we need to update initial_fetch_timeout value.

Repro steps:

This can be reproduced by running an envoy proxy that uses ADS to fetch configurations from a management server, and change initial_fetch_timeout value in ConfigSource in a listener. I have a simple management server to reproduce the issue, and can provide it if helps.

@jparklab jparklab added bug triage Issue requires triage labels Feb 8, 2024
@ramaraochavali
Copy link
Contributor

Curious, what is the reason for dynamically changing init_fetch_timeout ?

@jparklab
Copy link
Contributor Author

jparklab commented Feb 9, 2024

Curious, what is the reason for dynamically changing init_fetch_timeout ?

We have multiple envoy proxies as edge proxies connected to the same management server receiving configurations for more than a few thousands of services, and we see a large number of fetch timeouts when we restart the management server since all of the envoy proxies reconnects. We want to adjust init_fetch_timeout to avoid fetch timeouts (we are also considering rearchitecturing, however, that's a longer term goal for us).

The parameter will be updated when we release the change to the management server, and envoy proxies will get the updated value dynamically when it fetches updated listener configurations

@ravenblackx ravenblackx added area/xds area/configuration and removed triage Issue requires triage labels Feb 9, 2024
@ravenblackx
Copy link
Contributor

@alyssawilk I think (as codeowner on router)

@alyssawilk
Copy link
Contributor

I suspect this is more of an RDS issue so tagging @adisuissa for thoughts

@adisuissa
Copy link
Contributor

Yes, it seems that the identifier should be the unique resource name (+ what config-server used to serve it).
The fix will be creating a unique-ID given the proto instead of hashing the entire proto.

Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 13, 2024
@adisuissa adisuissa added no stalebot Disables stalebot from closing an issue beginner Good starter issues! and removed stale stalebot believes this issue/PR has not been touched recently labels Mar 13, 2024
@srivatsav1998
Copy link

Hi, I am interested in working on this issue. Please let me know if this is still available.

I am new to this repository and would appreciate your sharing resources around the issue and starting pointers.

@iamkishan98
Copy link

Hi @jparklab, I am new to open source, want to work on good first issue types. I would love to work on this issue, if still available.

@cpakulski
Copy link
Contributor

cpakulski commented Dec 31, 2024

Fix backported to:
1.32: #37749
1.31: #37844
1.30: #37846

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/configuration area/xds beginner Good starter issues! bug no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants