@anuragagarwal561994 anuragagarwal561994 commented Nov 2, 2025

What type of PR is this?

What this PR does / why we need it:
This PR adds a new load balancer type: client-side weighted round robin. This load balancing extension has been available since Envoy 1.32.

https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/load_balancing_policies/client_side_weighted_round_robin/v3/client_side_weighted_round_robin.proto

Which issue(s) this PR fixes:

Fixes #7305

Release Notes: Yes/No

@anuragagarwal561994
Author

@jukie I have added the implementation and also tested it on my local setup.

PS: the repo is very easy to contribute to; everything just works with the docs given on the site :)

@anuragagarwal561994
Author

Also, I wanted to know: should I include slow start in client WRR?

I have submitted the proposal in grpc-xds (grpc/proposal#498), and I have also got the proto updated in Envoy.

It is not implemented yet, but I will try to pick it up this month if time allows.

@anuragagarwal561994
Author

anuragagarwal561994 commented Nov 2, 2025

Also, I am unsure how to test this end to end, so for now I have just included an AI-generated e2e test suite.

The challenge is that we need multiple replicas, each responding with a specific header containing rps and cpu_utilisation; the traffic is then distributed according to the calculated weight (rps / cpu).

I don't know what the current e2e tests allow, or whether this type of test case is feasible to write.
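
For reference, the weight calculation described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `endpointLoad` type and `weight` function are hypothetical names, and the way the error penalty enters the denominator (`cpu + penalty * eps/rps`) is assumed from the gRPC weighted round robin design rather than taken from this PR.

```go
package main

import "fmt"

// endpointLoad is a hypothetical container for the metrics one backend reports.
type endpointLoad struct {
	RPS            float64 // requests per second
	EPS            float64 // errors per second
	CPUUtilization float64 // fraction of CPU in use, e.g. 0.5 for 50%
}

// weight returns rps / (cpu + penalty * eps/rps).
// It returns 0 when the report cannot produce a usable weight.
// Assumption: the penalty term is added to the utilization in the denominator.
func weight(l endpointLoad, penalty float64) float64 {
	if l.RPS <= 0 || l.CPUUtilization <= 0 {
		return 0
	}
	denom := l.CPUUtilization + penalty*(l.EPS/l.RPS)
	return l.RPS / denom
}

func main() {
	healthy := endpointLoad{RPS: 1000, EPS: 0, CPUUtilization: 0.5}
	flaky := endpointLoad{RPS: 1000, EPS: 100, CPUUtilization: 0.5}

	// A larger penalty lowers the weight of the endpoint that returns errors,
	// while the error-free endpoint is unaffected.
	for _, p := range []float64{0, 0.5, 1.0, 1.5} {
		fmt.Printf("penalty=%.1f healthy=%.0f flaky=%.0f\n",
			p, weight(healthy, p), weight(flaky, p))
	}
}
```

An e2e test along these lines would need each replica to report different values and then assert that the observed request split roughly follows the ratio of the computed weights.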

Comment on lines 174 to 168
// The multiplier used to adjust endpoint weights with the error rate calculated as eps/qps.
// Must be non-negative. Default is 1.0.
// +kubebuilder:validation:Minimum=0
// +optional
ErrorUtilizationPenalty *float32 `json:"errorUtilizationPenalty,omitempty"`
Contributor

We try to avoid floats in the API layer. Could you adjust this to something like uint32?

Author

I can make that change, but won't it make this configuration more aggressive? This setting would usually take small values such as 1, 2, 5, or 10, since it is a multiplier for how aggressively we react to errors.

From a practical point of view, I would say it mostly stays in the range 0 to 2. A uint32 value would restrict how the user can use this configuration.

Contributor

By that I meant adjusting it to a percentage-based int such that 100 = 1.0, or something else reasonable.

Author

Do you have any other example that I can refer to? Even as a percentage, this is a value that can go above 100.

As I understand the configuration value: if my rps is 1000 and my application is very sensitive to errors, I would usually reduce the traffic by a factor of 1.2 or 1.5, so I am not sure that writing it as 120% or 150% would be more understandable.

If errors are bound to happen in my application, maybe because the traffic comes from an external source, then I would keep it lower, around 0.3 to 0.5, so that only when errors cross this boundary would I say there is an issue with the system.

Contributor

Preconnect Policy is one example where float is used for the Envoy proto but Envoy Gateway uses an int -

PerEndpointPercent *uint32 `json:"perEndpointPercent,omitempty"`

I wasn't proposing 100 as the maximum allowed value but was using that as an example of what the default would translate to.
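
A minimal sketch of the percentage-based approach suggested here, assuming 100 maps to the proto default of 1.0 and values above 100 are allowed. The function name `penaltyFromPercent` is hypothetical and is not the code in this PR.

```go
package main

import "fmt"

// penaltyFromPercent converts a percentage-based API value (100 == 1.0) into
// the float multiplier the Envoy proto expects.
// A nil pointer means the field was unset, so the default of 1.0 is returned.
func penaltyFromPercent(percent *uint32) float32 {
	if percent == nil {
		return 1.0
	}
	return float32(*percent) / 100
}

func main() {
	// 30 and 150 correspond to the 0.3 and 1.5 multipliers discussed in this thread.
	for _, v := range []uint32{0, 30, 100, 150} {
		v := v
		fmt.Printf("%d%% -> %.2f\n", v, penaltyFromPercent(&v))
	}
	fmt.Printf("unset -> %.2f\n", penaltyFromPercent(nil))
}
```

Because uint32 cannot be negative, the non-negative constraint on the multiplier holds by construction, and fractional multipliers such as 0.3 remain expressible as 30.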

@codecov

codecov bot commented Nov 4, 2025

Codecov Report

❌ Patch coverage is 70.68966% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.34%. Comparing base (1e295b6) to head (f926f90).

Files with missing lines | Patch % | Lines
internal/xds/translator/cluster.go | 55.26% | 16 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7407      +/-   ##
==========================================
- Coverage   72.36%   72.34%   -0.02%     
==========================================
  Files         231      231              
  Lines       34042    34100      +58     
==========================================
+ Hits        24633    24669      +36     
- Misses       7633     7655      +22     
  Partials     1776     1776              

@anuragagarwal561994
Author

anuragagarwal561994 commented Nov 4, 2025

I have also started implementing slow_start_config and locality LB config for WRR; if possible, I would like to include them in the gateway implementation as well.

envoyproxy/envoy#41841

@jukie
Contributor

jukie commented Nov 13, 2025

We wait to add features here until they've made it into a full Envoy release. The flow would be to get this LB support added for 1.7, and if your Envoy changes get merged, we can add that support to Gateway in 1.8.

Let's keep the scope of this PR to what's currently available and we can always include additional features in a follow-up. Are you able to make the suggested changes or can you join the contributors call next week to discuss?

@anuragagarwal561994
Author

Sure. I had paused because the other changes were also approved, but I understand and will make the changes as suggested. I will try to complete them today or tomorrow, @jukie.

@anuragagarwal561994
Author

@jukie I have made the requested changes.

…Gateway CRDs, ensuring configurable parameters and validation rules are integrated. Includes e2e test for validation.

Signed-off-by: anurag.ag <[email protected]>
…eway CRDs and related configurations. Update associated test data and documentation.

Signed-off-by: anurag.ag <[email protected]>
…entSideWeightedRoundRobin configuration, update affected tests and CRDs.

Signed-off-by: anurag.ag <[email protected]>
…cross Gateway CRDs, configuration files, and related tests. Adjust documentation to reflect percentage-based representation.

Signed-off-by: anurag.ag <[email protected]>
Development

Successfully merging this pull request may close these issues.

Support for Client Load Balancing Weighted Round Robin