Add Envoy as an alternative Sky Serve load balancer implementation #4256
base: master
Conversation
load_balancer = SkyServeLoadBalancer(controller_url=controller_addr,
                                     load_balancer_port=load_balancer_port)
load_balancer.run()

class EnvoyLoadBalancer(SkyServeLoadBalancer):
My thoughts:
- This is great, thanks for putting this together! Having a robust load balancer instead of our own hand-rolled Python LB will help with larger loads.
- Regarding the eventual plan to have a full-fledged data plane running on Envoy: will this LB serve as the head/controller/master node (if there's anything like that in Envoy)?
As discussed out of band. IMO the plan should be something like this:
- Make Envoy an option in the Sky Controller.
- Create an option for users to boot "regional load balancers" in different locations in support of multi-region load balancing.
In the very long term we can consider more service-mesh-ish things like running Envoys on replicas, but we can hold off on that until we have more experience with Envoy and the value of the features it would enable justifies the engineering effort. My guess is that could be quite a long time from now.
In terms of how this fits in: my hope is that if/when we have these "regional load balancers", they would use this same code. I.e., most of the work would be allowing this controller load balancer code to run off the controller as well.
Thanks @ejj for adding this exciting feature! Left some discussions ;)
sky/serve/load_balancer.py (outdated)
# handed traffic by the above listener.
cluster = {
    'name': 'cluster',
    'connect_timeout': '0.25s',
AI workloads can easily last for 10+ seconds. Will this interrupt such a long session?
So this timeout governs how long Envoy will wait for the TCP connection to be established. That has to fully complete before the initial HTTP request kicking off the AI workload begins, so it doesn't bound the request itself. I don't see any particular reason AI workloads would need something different here, but I'm also not sure why I set it. I may remove it and just go with the default.
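If the setting is dropped, the cluster definition would simply omit the key. A minimal sketch in the same dict-style config shown above; the `type` and `load_assignment` fields are illustrative placeholders following Envoy's cluster config shape, not the PR's actual values:

```python
# Sketch: the cluster definition with connect_timeout omitted, so Envoy
# falls back to its built-in default TCP connect timeout. This timeout
# only covers connection establishment, not request/stream duration.
cluster = {
    'name': 'cluster',
    # No 'connect_timeout' key here; Envoy applies its default.
    'type': 'STRICT_DNS',  # illustrative; the PR may use a different type
    'load_assignment': {
        'cluster_name': 'cluster',
        'endpoints': [],  # populated with replica addresses at sync time
    },
}
```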
I see. Then it should be fine. Is there any timeout for the whole session length? In our Python LB we need to manually set this to a larger value because the default value for the session is something like 10s, which is not sufficient for some AI workloads. It would be great to double-check whether Envoy needs this setting as well ;)
skypilot/sky/serve/load_balancer.py, line 148 in 3bfc29e:
    timeout=constants.LB_STREAM_TIMEOUT)
https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts
Looks like by default the idle timeout (i.e., how long the stream can go with no activity) is an hour, and the default timeout for active connections is unlimited, at least for HTTP.
The timeouts for the other things seem reasonable as well.
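For long-running AI requests, the per-route request timeout is the one most likely to matter, since Envoy bounds individual requests at the route level even when the connection-level timeout is unlimited. A hedged sketch in the same dict-style config the PR uses; the route shape follows Envoy's route configuration and all values here are illustrative:

```python
# Sketch: disabling Envoy's per-route upstream request timeout so that
# long AI inference requests are not cut off mid-stream. '0s' disables
# the request timeout; idle_timeout still bounds fully idle streams.
route = {
    'match': {'prefix': '/'},
    'route': {
        'cluster': 'cluster',
        'timeout': '0s',          # 0 disables the per-request timeout
        'idle_timeout': '3600s',  # optional; mirrors the HTTP default
    },
}
```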
Force-pushed from 2d1c804 to 5e75fa1.
PR update. Some major changes since the original:
Given these updates, and the fact that it's off by default, I think it's reasonable to review/merge this and continue to make enhancements in future PRs.
Force-pushed from 93f9114 to 5a9aa2f.
Thanks for adding this @ejj! This is awesome. It mostly looks good to me ;) Left some discussions.
@@ -156,6 +157,18 @@ def up(
        controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER,
        task_resources=task.resources)

    # Check that the Envoy load balancer isn't being used on an unsupported
    # cloud.
    lb_type = task_config.get('service', {}).get('load_balancer_type', None)
I'm wondering if we want a load_balancer field to host both the policy and type fields.
Could we do this as a follow-up PR? Since we've already got the load_balancer_policy field, I think that would make more sense.
That said, if you prefer, I can tack it on to this one.
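For concreteness, the nested field being discussed might look something like this. Purely a schema sketch; neither the field names nor the values are settled in this PR:

```yaml
service:
  readiness_probe: /health   # illustrative surrounding fields
  load_balancer:             # hypothetical nested field
    type: envoy              # or: python
    policy: round_robin
```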
I think the more PRs we split this into, the more backward compatibility we'll need to tackle in the future. But cc @Michaelvll for some input here.
sky/serve/core.py (outdated)
for resource in controller_resources:
    if resource.cloud is None:
        continue
pls also make sure k8s will not be automatically selected for controller
see new commit. not 100% sure it's the right approach
It would be great to add some unit tests for this ;)
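A unit test for the cloud restriction could look roughly like this. The helper name, its signature, and the supported-cloud set are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical validation helper plus tests for the controller-cloud check.
SUPPORTED_ENVOY_CLOUDS = {'aws', 'gcp', 'azure'}  # assumed, not the PR's list

def validate_lb_cloud(lb_type: str, cloud: str) -> None:
    """Reject the Envoy LB on clouds it has not been tested on."""
    if lb_type == 'envoy' and cloud.lower() not in SUPPORTED_ENVOY_CLOUDS:
        raise ValueError(
            f'Envoy load balancer is not supported on cloud {cloud!r}.')

def test_envoy_rejected_on_k8s():
    try:
        validate_lb_cloud('envoy', 'kubernetes')
    except ValueError:
        pass
    else:
        raise AssertionError('expected ValueError for envoy on k8s')

def test_python_lb_unrestricted():
    validate_lb_cloud('python', 'kubernetes')  # should not raise
```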
import threading
-from typing import Dict, Optional, Union
+from typing import Dict, List, Optional, Union
from urllib.parse import urlparse
nit: import modules instead of class/function
https://google.github.io/styleguide/pyguide.html#224-decision
fixed in next version
@@ -64,6 +65,13 @@ def __init__(
            raise ValueError(
                f'Unknown load balancing policy: {load_balancing_policy}. '
                f'Available policies: {list(serve.LB_POLICIES.keys())}')

        if (load_balancer_type is not None and
As mentioned before, let's check that, if using the Envoy LB:
- the LB policy is not specified;
- autoscaling is not enabled (i.e., min_replicas == max_replicas).
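The two checks above could be sketched like this; the function name and error messages are illustrative, not the PR's actual code:

```python
# Sketch of upfront validation for current Envoy LB limitations: no custom
# policy and no autoscaling (fixed replica count) until those are supported.
def check_envoy_constraints(lb_type, lb_policy, min_replicas, max_replicas):
    if lb_type != 'envoy':
        return
    if lb_policy is not None:
        raise ValueError('load_balancing_policy is not yet supported with '
                         'the Envoy load balancer.')
    if min_replicas != max_replicas:
        raise ValueError('Autoscaling (min_replicas != max_replicas) is not '
                         'yet supported with the Envoy load balancer.')
```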
Actually, considering the only policy is round robin, and we are about to add least-load to our Python LB, can we support those two policies at least?
Mind if I do this in a follow up PR?
Doing this in a follow-up PR LGTM, but please make sure there are proper errors for unsupported features :))
@@ -5,6 +5,7 @@ service:
  initial_delay_seconds: 180
  replica_policy:
    min_replicas: 3
  load_balancer_type: python
Do we really need this? Are we defaulting to the Python LB now?
Felt like reasonable future-proofing, but not a big deal. Removed it.
Co-authored-by: Tian Xia <[email protected]>
@ejj Thanks for fixing this! A quick bump on those hidden conversations: did you miss them? Also, please resolve the merge conflict when you get time ;) Thanks!
This PR adds Envoy as an alternative to the Python load balancer for Sky Serve.
It's an early draft PR; there's a big comment listing a number of things that need to be implemented before Envoy is made the default, or before the PR is merged.
Tested (run the relevant ones):
bash format.sh
Did some manual testing, booted/stopped some services, performance testing with fortio.
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh