Add Envoy as an alternative Sky Serve load balancer implementation #4256
base: master
Conversation
load_balancer = SkyServeLoadBalancer(controller_url=controller_addr,
                                     load_balancer_port=load_balancer_port)
load_balancer.run()

class EnvoyLoadBalancer(SkyServeLoadBalancer):
My thoughts:
- This is great, thanks for putting this together! Having a robust load balancer instead of our own hand-rolled Python LB will help with larger loads.
- Regarding the eventual plan to have a full-fledged data plane running on Envoy: will this LB serve as the head/controller/master node (if there's anything like that in Envoy)?
As discussed out of band. IMO the plan should be something like this:
- Make Envoy an option in the Sky Controller.
- Create an option for users to boot "regional load balancers" in different locations in support of multi-region load balancing.
In the very long term we can consider more service-mesh-ish things like running Envoys on replicas, but we can hold off on that until we have more experience with Envoy and the value of the features it would enable justifies the engineering effort. My guess is that could be quite a long time from now.
In terms of how this fits in: my hope is that if/when we have these "regional load balancers", they would use this same code. I.e., most of the work would be allowing this controller load balancer code to run off the controller as well.
Thanks @ejj for adding this exciting feature! Left some discussions ;)
sky/serve/load_balancer.py (outdated)
# handed traffic by the above listener.
cluster = {
    'name': 'cluster',
    'connect_timeout': '0.25s',
AI workloads can easily last for 10+ seconds. Will this interrupt such a long session?
So this timeout governs how long Envoy will wait for the TCP connection to be established. That has to fully complete before the initial HTTP request kicking off the AI workload begins, so it doesn't bound the request itself. I don't see any particular reason AI workloads would need something different here, but I'm also not sure why I set it. I may remove it and just go with the default.
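If the setting is dropped, the cluster definition would simply omit the key. A minimal sketch in the same dict-style config shown above; the `type` and `load_assignment` fields are illustrative placeholders following Envoy's cluster config shape, not the PR's actual values:

```python
# Sketch: the cluster definition with connect_timeout omitted, so Envoy
# falls back to its built-in default TCP connect timeout. This timeout
# only covers connection establishment, not request/stream duration.
cluster = {
    'name': 'cluster',
    # No 'connect_timeout' key here; Envoy applies its default.
    'type': 'STRICT_DNS',  # illustrative; the PR may use a different type
    'load_assignment': {
        'cluster_name': 'cluster',
        'endpoints': [],  # populated with replica addresses at sync time
    },
}
```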
I see. Then it should be fine. Is there any timeout for the whole session length? In our Python LB we need to manually set this to a larger value because the default value for the session is something like 10s, which is not sufficient for some AI workloads. It would be great to double-check whether Envoy needs this setting as well ;)
skypilot/sky/serve/load_balancer.py, line 148 in 3bfc29e:
    timeout=constants.LB_STREAM_TIMEOUT)
https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts
Looks like by default the idle timeout (i.e., how long the stream can go with no activity) is an hour, and the default timeout for active connections is unlimited, at least for HTTP.
The timeouts for the other things seem reasonable as well.
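For long-running AI requests, the per-route request timeout is the one most likely to matter, since Envoy bounds individual requests at the route level even when the connection-level timeout is unlimited. A hedged sketch in the same dict-style config the PR uses; the route shape follows Envoy's route configuration and all values here are illustrative:

```python
# Sketch: disabling Envoy's per-route upstream request timeout so that
# long AI inference requests are not cut off mid-stream. '0s' disables
# the request timeout; idle_timeout still bounds fully idle streams.
route = {
    'match': {'prefix': '/'},
    'route': {
        'cluster': 'cluster',
        'timeout': '0s',          # 0 disables the per-request timeout
        'idle_timeout': '3600s',  # optional; mirrors the HTTP default
    },
}
```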
Force-pushed from 2d1c804 to 5e75fa1.
PR update. Some major changes since the original:
Given these updates, and the fact that it's off by default, I think it's reasonable to review/merge this and continue to make enhancements in future PRs.
Force-pushed from 93f9114 to 5a9aa2f.
Thanks for adding this @ejj! This is awesome. It mostly looks good to me ;) Left some discussions.
@@ -156,6 +157,18 @@ def up(
        controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER,
        task_resources=task.resources)

    # Check that the Envoy load balancer isn't being used on an unsupported
    # cloud.
    lb_type = task_config.get('service', {}).get('load_balancer_type', None)
I'm wondering if we want a load_balancer field to host both the policy and type fields.
Could we do this as a follow-up PR? Since we've already got the load_balancer_policy field, I think that would make more sense.
That said, if you prefer, I can tack it on to this one.
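For concreteness, the nested field being discussed might look something like this. Purely a schema sketch; neither the field names nor the values are settled in this PR:

```yaml
service:
  readiness_probe: /health   # illustrative surrounding fields
  load_balancer:             # hypothetical nested field
    type: envoy              # or: python
    policy: round_robin
```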
I think the more PRs we split this into, the more backward compatibility we'll need to tackle in the future. But cc @Michaelvll for some input here.
sky/serve/core.py (outdated)
for resource in controller_resources:
    if resource.cloud is None:
        continue
pls also make sure k8s will not be automatically selected for controller
see new commit. not 100% sure it's the right approach
It would be great to add some unit tests for this ;)
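A unit test for the cloud restriction could look roughly like this. The helper name, its signature, and the supported-cloud set are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical validation helper plus tests for the controller-cloud check.
SUPPORTED_ENVOY_CLOUDS = {'aws', 'gcp', 'azure'}  # assumed, not the PR's list

def validate_lb_cloud(lb_type: str, cloud: str) -> None:
    """Reject the Envoy LB on clouds it has not been tested on."""
    if lb_type == 'envoy' and cloud.lower() not in SUPPORTED_ENVOY_CLOUDS:
        raise ValueError(
            f'Envoy load balancer is not supported on cloud {cloud!r}.')

def test_envoy_rejected_on_k8s():
    try:
        validate_lb_cloud('envoy', 'kubernetes')
    except ValueError:
        pass
    else:
        raise AssertionError('expected ValueError for envoy on k8s')

def test_python_lb_unrestricted():
    validate_lb_cloud('python', 'kubernetes')  # should not raise
```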
import threading
-from typing import Dict, Optional, Union
+from typing import Dict, List, Optional, Union
from urllib.parse import urlparse
nit: import modules instead of class/function
https://google.github.io/styleguide/pyguide.html#224-decision
fixed in next version
@@ -64,6 +65,13 @@ def __init__(
            raise ValueError(
                f'Unknown load balancing policy: {load_balancing_policy}. '
                f'Available policies: {list(serve.LB_POLICIES.keys())}')

        if (load_balancer_type is not None and
As mentioned before, let's check that, if using the Envoy LB:
- the LB policy is not specified;
- autoscaling is not enabled (i.e., min_replicas == max_replicas).
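The two checks above could be sketched like this; the function name and error messages are illustrative, not the PR's actual code:

```python
# Sketch of upfront validation for current Envoy LB limitations: no custom
# policy and no autoscaling (fixed replica count) until those are supported.
def check_envoy_constraints(lb_type, lb_policy, min_replicas, max_replicas):
    if lb_type != 'envoy':
        return
    if lb_policy is not None:
        raise ValueError('load_balancing_policy is not yet supported with '
                         'the Envoy load balancer.')
    if min_replicas != max_replicas:
        raise ValueError('Autoscaling (min_replicas != max_replicas) is not '
                         'yet supported with the Envoy load balancer.')
```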
Actually, considering the only policy is round robin, and we are about to add least-load to our Python LB, can we support those two policies at least?
Mind if I do this in a follow up PR?
Doing this in a follow-up PR LGTM, but please make sure there are proper errors for unsupported features :))
@@ -5,6 +5,7 @@ service:
  initial_delay_seconds: 180
  replica_policy:
    min_replicas: 3
  load_balancer_type: python
Do we really need this? Are we defaulting to the Python LB now?
Felt like reasonable future-proofing, but not a big deal. Removed it.
Co-authored-by: Tian Xia <[email protected]>
@ejj Thanks for fixing this! A quick bump on those hidden conversations: did you miss them? Also, please resolve the merge conflict when you get time ;) Thanks!
This PR adds Envoy as an alternative to the Python load balancer for Sky Serve.
It's an early draft PR; there's a big comment listing a number of things that need to be implemented before Envoy is made the default, or before the PR is merged.
Tested (run the relevant ones):
bash format.sh
Did some manual testing, booted/stopped some services, performance testing with fortio.
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh