Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Envoy as an alternative Sky Serve load balancer implementation #4256

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

ejj
Copy link

@ejj ejj commented Nov 4, 2024

This PR adds Envoy as an alternative to the Python load balancer for Sky Serve.
It's an early draft PR, there's a big comment containing a number of things that need to be implemented before Envoy is made default, or before the PR is merged.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    Did some manual testing, booted/stopped some services, performance testing with fortio.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@ejj ejj marked this pull request as draft November 4, 2024 04:50
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
load_balancer = SkyServeLoadBalancer(controller_url=controller_addr,
load_balancer_port=load_balancer_port)
load_balancer.run()
class EnvoyLoadBalancer(SkyServeLoadBalancer):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts:

  • This is great, thanks for putting this together! Having a robust load balancer instead of our own hand-rolled python lb will help with larger loads.
  • Regarding the eventual plan to have a full fledged data plane running on envoy - will this lb serve as the head/controller/master node (if there's anything like that in envoy)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed out of band. IMO the plan should be something like this:

  1. Make Envoy an option in the Sky Controller.
  2. Create an option for users to boot "regional load balancers" in different locations in support of multi-region load balancing.

In the very long term we can consider more service-mesh-ish things like running Envoys on replicas, but we can hold off on that until we have more experience with Envoy and the value of the features something like that would enable justify the engineering effort. My guess is that could be quite a long time from now.

In terms of how this fits in. My hope would be that when we/if we have these "regional load balancers", they would utilize this same code. I.e. most of the work would be allowing this controller load balancer code to run off the controller as well.

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ejj for adding this exciting feature! Left some discussions ;)

sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
# handed traffic by the above listener.
cluster = {
'name': 'cluster',
'connect_timeout': '0.25s',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI workloads would easily lasts for 10+ seconds. Will this interrupt such a long session?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this timeout governed how long envoy will wait for the TCP connection to be established. This needs to fully complete before the initial HTTP request kicking off the AI workload begins. I don't see any particular reason AI workloads would need something different here, but I'm also not sure why I set it. I may remove it and just go with the default

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then it should be fine. Is there any timeout for the whole session length? In our python LB we need to manually set this to a larger value cuz the default value for the session is sth like 10s, which is not sufficient for some AI workload. It would be great to double check does envoy need this setting as well ;)

timeout=constants.LB_STREAM_TIMEOUT)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts

Looks like by default the idle timeout (i.e. how long you can have nothing on the stream) is an hour, and the default timeout for active connections is unlimited. At least for http.

Timeouts for the other stuff seems reasonable as well

sky/serve/load_balancer.py Show resolved Hide resolved
@ejj ejj force-pushed the ejj-envoy branch 2 times, most recently from 2d1c804 to 5e75fa1 Compare December 3, 2024 01:15
@ejj
Copy link
Author

ejj commented Dec 3, 2024

PR Update. Some major changes since the original:

  • Envoy now dynamically reads configuration from a file. This means the restarts are no longer necessary
  • Load balancer type is choosable from the serve specification. Default is still Python.
  • Will through an error if there's an attempt to boot envoy on kubernetes
  • Smoke test for envoy added.

Given these updates and the fact that it's default off, I think it's reasonable to review/merge this and continue to make enhancements with future PRs.

@ejj ejj force-pushed the ejj-envoy branch 4 times, most recently from 93f9114 to 5a9aa2f Compare December 3, 2024 04:00
@ejj ejj marked this pull request as ready for review December 3, 2024 04:10
@ejj ejj requested review from cblmemo and romilbhardwaj December 3, 2024 04:16
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this @ejj ! This is awesome. It mostly looks good to me ;) Left some discussions.

sky/clouds/cloud.py Outdated Show resolved Hide resolved
sky/serve/constants.py Outdated Show resolved Hide resolved
@@ -156,6 +157,18 @@ def up(
controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER,
task_resources=task.resources)

# Check that the Envoy load balancer isn't being used on an unsupported
# cloud.
lb_type = task_config.get('service', {}).get('load_balancer_type', None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im wondering if we want a load_balancer field to host both the policy and type field

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we do this as a follow up PR? Since we've already got the load_balancer_policy field I think it would make more sense.

That said if you prefer I can tack it on to this one

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the more PR we separate it to, the more backward compatibility we need to tackle in the future. But cc @Michaelvll for some inputs here

Comment on lines 164 to 166
for resource in controller_resources:
if resource.cloud is None:
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls also make sure k8s will not be automatically selected for controller

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see new commit. not 100% sure it's the right approach

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be great to add some unittest for this ;)

import threading
from typing import Dict, Optional, Union
from typing import Dict, List, Optional, Union
from urllib.parse import urlparse
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: import modules instead of class/function

https://google.github.io/styleguide/pyguide.html#224-decision

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in next version

sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
@@ -64,6 +65,13 @@ def __init__(
raise ValueError(
f'Unknown load balancing policy: {load_balancing_policy}. '
f'Available policies: {list(serve.LB_POLICIES.keys())}')

if (load_balancer_type is not None and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned before, lets check, if using elb:

  • lb policy is not specified;
  • autoscaling is not enabled (e.g. min_replicas == max_replica).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, considering the only policy is rrb and we are about to add least load to our plb, can we support those two policies at least?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind if I do this in a follow up PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this in a follow up PR lgtm, but pls make sure there are proper error for not supported features :))

@@ -5,6 +5,7 @@ service:
initial_delay_seconds: 180
replica_policy:
min_replicas: 3
load_balancer_type: python
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this? are we defaulting to plb now?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

felt like reasonable future proofing, but not a big deal. Removed it

@cblmemo
Copy link
Collaborator

cblmemo commented Dec 21, 2024

@ejj Thanks for fixing this! A quick bump on those hidden conversations - did you missed it?

image

Also, pls resolve merge conflict when you got time ;) Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants