Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] HTTPS Support #3380

Merged
merged 27 commits into from
Jan 20, 2025
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
56f2c04
init
cblmemo Jun 5, 2024
229231e
Merge remote-tracking branch 'origin/master' into serve-https-example…
cblmemo Sep 5, 2024
292fc73
disable update ssl
cblmemo Sep 5, 2024
5dd1c0b
Merge remote-tracking branch 'origin/master' into serve-https-example…
cblmemo Sep 16, 2024
d8336d4
address comment
cblmemo Sep 16, 2024
dcf6d36
fix
cblmemo Sep 16, 2024
8c01625
fix linter
cblmemo Sep 16, 2024
e2b6af8
remove key & cert file; add smoke test prototype
cblmemo Sep 17, 2024
cca6c02
add byo ssl key
cblmemo Sep 17, 2024
f8686ed
format
cblmemo Sep 17, 2024
f0613bd
comments
cblmemo Sep 17, 2024
80ccacb
smoke test passed
cblmemo Sep 17, 2024
4243a73
check keyfile and certfile with schema; chedck if the file exists
cblmemo Sep 17, 2024
112e06c
nit
cblmemo Sep 17, 2024
385a658
add column TLS_ENCRYPTED in service table; add schema in sky serve st…
cblmemo Sep 17, 2024
95e3ea9
fix smoke test
cblmemo Sep 17, 2024
ad6768d
fix smoke test
cblmemo Sep 18, 2024
bfd1741
Merge remote-tracking branch 'origin/master' into serve-https-example…
cblmemo Sep 19, 2024
f239e68
Merge remote-tracking branch 'origin/master' into serve-https-example…
cblmemo Oct 24, 2024
4262353
apply suggestions from code review
cblmemo Oct 24, 2024
846e072
use env vars in the test & example
cblmemo Oct 24, 2024
ba2d038
rename test file
cblmemo Oct 24, 2024
30ef361
Merge remote-tracking branch 'origin/master' into serve-https-example…
cblmemo Oct 28, 2024
37e8242
Merge remote-tracking branch 'origin/master' into serve-https-example…
cblmemo Jan 14, 2025
c57dd77
Merge remote-tracking branch 'origin/master' into serve-https-example…
cblmemo Jan 17, 2025
1c7c76b
print output for https test
cblmemo Jan 19, 2025
735ce29
upd messages
cblmemo Jan 19, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions examples/serve/https/cert.pem
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
-----BEGIN CERTIFICATE-----
MIIDbTCCAlWgAwIBAgIUOPkOX89x3GWlrI9mOsi/6kIZ4kQwDQYJKoZIhvcNAQEL
BQAwRTELMAkGA1UEBhMCVVMxEzARBgNVBAgMClNvbWUtU3RhdGUxITAfBgNVBAoM
GEludGVybmV0IFdpZGdpdHMgUHR5IEx0ZDAgFw0yNDA2MDUwMDMxMjNaGA8yMTI0
MDUxMjAwMzEyM1owRTELMAkGA1UEBhMCVVMxEzARBgNVBAgMClNvbWUtU3RhdGUx
ITAfBgNVBAoMGEludGVybmV0IFdpZGdpdHMgUHR5IEx0ZDCCASIwDQYJKoZIhvcN
AQEBBQADggEPADCCAQoCggEBAN3hLSyXnU89evgCXxpGJ59/mgbuKZZZvmo2LDRt
OGyd/BD3JvjaY0KRzblT8PajLkXUkp9dbQIlhspjVIzMvp8yUxNVMOlPqnmIFLkY
jP4EhaNt9GCBaPnYH7QPCMVU5agxyrZfOj5GKpMbcBZUbnzxNqO+8cAv/hWuIMhw
w8eh6C4LPJRWwOhDGN+Q19bOgGFgOzMQXVVxP+9OChldNL6SlsqVtUg80fJS1lkJ
gtzg/c0MyU1C/QAco93X47bNqz9S98xByl4wE2Si5C2nhkj2W+O/QvCXGCFHPWzY
+KkV6Bkh/Z26d+5IdLz0fKj8DPXGssdUmoTPBYwf5LhBDC0CAwEAAaNTMFEwHQYD
VR0OBBYEFPR9eczYOyaPmGLVS+TDI3UKcj9yMB8GA1UdIwQYMBaAFPR9eczYOyaP
mGLVS+TDI3UKcj9yMA8GA1UdEwEB/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEB
ANh2jVJLWvqdDPi9YowVGfUFKpGbeoeyiShIQQmQ69bJKO8sZPEor6kJZwlbU0OQ
95jHGnoK7caoV9OFYGvkXxTX9im2QfsX9sWl2lcMzGXZEM9qFnJRGDZl6nEcLTQP
2ijnq0pOjNkfdajcmP2/E2x1y5eHP8QeQzTpDaHMuCpNxAw+849J2uM2LNh4HZQa
sxiJs7t4qun0ZVqDxdlzbXlj53Xkab3xwoa1s1svRIHiNeVO3eHq2gYkrzvxoK6o
J9Lw6CdrOzo0bKxC6mB5uNHT9RRzePOpwn5NRgePpLGjq/VdZhtRdROoSjNN7wx9
+bo9qmPp2EcRP6t6BidFWuY=
-----END CERTIFICATE-----
cblmemo marked this conversation as resolved.
Show resolved Hide resolved
28 changes: 28 additions & 0 deletions examples/serve/https/key.pem
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
-----BEGIN PRIVATE KEY-----
MIIEvAIBADANBgkqhkiG9w0BAQEFAASCBKYwggSiAgEAAoIBAQDd4S0sl51PPXr4
Al8aRieff5oG7imWWb5qNiw0bThsnfwQ9yb42mNCkc25U/D2oy5F1JKfXW0CJYbK
Y1SMzL6fMlMTVTDpT6p5iBS5GIz+BIWjbfRggWj52B+0DwjFVOWoMcq2Xzo+RiqT
G3AWVG588TajvvHAL/4VriDIcMPHoeguCzyUVsDoQxjfkNfWzoBhYDszEF1VcT/v
TgoZXTS+kpbKlbVIPNHyUtZZCYLc4P3NDMlNQv0AHKPd1+O2zas/UvfMQcpeMBNk
ouQtp4ZI9lvjv0LwlxghRz1s2PipFegZIf2dunfuSHS89Hyo/Az1xrLHVJqEzwWM
H+S4QQwtAgMBAAECgf8/ETkpdDYKOwnZ0Gt9z50T5cDdduT9s4FxN9e2YPRvUGVc
pz4ttOwgB/7O668Zcl9IpiZHpAwJFhSxDmdlDgV9bUqcfMc7jmFpNeKLpMeTWR4K
lSBpeQLHtCQ/wF5yBjFHqyRdinaM2w3L38Pqt19R1c6heN573uONIo0JUGlTtQdF
dPhUKBWqRzJOLOwpDDSh5sjhTscxmfxGtKOyC+RPGq4592ElOMQy6yxb3AWXaLTw
p6EHDlnLK7hBgwlJ6SrL0mJhf5NZ3nkAvCoMug9q2+H08enzMhdiBGGoVPDskMS3
REG/EjC765wE/XFeWE5zSJVKlQzfynnjsNST1AECgYEA8yVjmKwji7sjRaEUrlhs
/q88g+9VNgXxDp0rL+IRu/OcEZibA9dY6KdLy46u1bMfyWoThrQfh9v97UAxT0zV
LCB6mRu3URieKOeim7VE3RwuIlrA2Oo9/1cnbpOSkCqNAL9JODdEi4kq47w/X/Gq
+vT4iDusUbGmzGQWovyUaC0CgYEA6Zv6dskat0VCGm+AUdlshm95MCSfZ9uZKhYy
KukpX40jCSDGEvDgu6rIrwJZ3o1W8B9ragP5TnQcOix+W44N+2UvwtN0ZwOKvv/y
i/AvIPZnOEx4Q+rHX0JgrBDhcNramxtBVorJXjLtAvQeP20n886dXyKARdRjaWm7
JQtBtAECgYEA1ERp5JkzFwOy0VDE+0fbL6TQUeYZH+akAbwYPd9A2DLep0Xw3rOb
vNCAjR7tZ0bPk0j17v1FIZhe4EbQRYiv+awJG61kpnm1acR+4yynB9lYNUnBXh1x
Ln9pv1E5/H4JRwO36knln2OKe/KV6S7Ts+81IcnmsBNFqN4gHP4gmakCgYAXUbio
zt5Z2RIGLYczpG6O5OXGBoHbDjK13s1XNzsoDKCosprTHTRxx1SNE8EA1D8PbTN9
u7PRPwnqTLK4VO/UXn9sBUujVNkuw8Hkci9iLPaUyh1fOjp4qdmYk37NkysrPIdD
Kkt7mb3LA69ZSX4pekB/VKRVVCQwS5ug0QTYAQKBgQCzR6NX7FxRJJRNNr+GS8sw
P0fWl/UyL159jmZiscl+N+AFFUJtT+izWGc+r+WSh81WucQ1YBFCLDFoPQMqgT06
IekHRzdyMfaODy/FVbJ02ePU6I9MQFZ79hno4h0dYE3qDqBs/CgvhK9szCO/CB06
QMf1lQZu4wuHTPG5i9j+Rw==
-----END PRIVATE KEY-----
22 changes: 22 additions & 0 deletions examples/serve/https/service.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# SkyServe YAML to run an HTTPS server.
#
# Usage:
# sky serve up -n https examples/serve/https/service.yaml
# The endpoint will be printed in the console. You
# could also check the endpoint by running:
# sky serve status --endpoint https

service:
readiness_probe: /
replicas: 1
ssl:
cblmemo marked this conversation as resolved.
Show resolved Hide resolved
# Generated by running:
# openssl req -x509 -newkey rsa:2048 -days 36500 -nodes
cblmemo marked this conversation as resolved.
Show resolved Hide resolved
keyfile: examples/serve/https/key.pem
certfile: examples/serve/https/cert.pem

resources:
ports: 8080
cpus: 2+

run: python3 -m http.server 8080
22 changes: 21 additions & 1 deletion sky/serve/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -146,10 +146,21 @@ def up(
controller_resources = controller_utils.get_controller_resources(
controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER,
task_resources=task.resources)

remote_ssl_keyfile = (
serve_utils.generate_remote_ssl_keyfile_name(service_name))
remote_ssl_certfile = (
serve_utils.generate_remote_ssl_certfile_name(service_name))

service_spec = task.service
# Already validated in _validate_service_task
assert service_spec is not None, 'Service section not found.'
vars_to_fill = {
'remote_task_yaml_path': remote_tmp_task_yaml_path,
'local_task_yaml_path': service_file.name,
'remote_ssl_keyfile': remote_ssl_keyfile,
'remote_ssl_certfile': remote_ssl_certfile,
'local_ssl_keyfile': service_spec.ssl_keyfile,
'local_ssl_certfile': service_spec.ssl_certfile,
'service_name': service_name,
'controller_log_file': controller_log_file,
'remote_user_config_path': remote_config_yaml_path,
Expand Down Expand Up @@ -313,6 +324,15 @@ def update(
service_name: Name of the service.
"""
_validate_service_task(task)

assert task.service is not None
if (task.service.ssl_keyfile is not None or
task.service.ssl_certfile is not None):
logger.warning('Updating SSL keyfile and certfile is not supported. '
'Any updates to the keyfile and certfile will not take '
'effect. To update SSL keyfile and certfile, please '
'tear down the service and spin up a new one.')

handle = backend_utils.is_controller_accessible(
controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER,
stopped_message=
Expand Down
31 changes: 25 additions & 6 deletions sky/serve/load_balancer.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
import asyncio
import logging
import threading
from typing import Dict, Union
from typing import Dict, Optional, Union

import aiohttp
import fastapi
Expand All @@ -27,7 +27,9 @@ class SkyServeLoadBalancer:
policy.
"""

def __init__(self, controller_url: str, load_balancer_port: int) -> None:
def __init__(self, controller_url: str, load_balancer_port: int,
ssl_keyfile: Optional[str],
ssl_certfile: Optional[str]) -> None:
cblmemo marked this conversation as resolved.
Show resolved Hide resolved
"""Initialize the load balancer.

Args:
Expand All @@ -41,6 +43,8 @@ def __init__(self, controller_url: str, load_balancer_port: int) -> None:
lb_policies.RoundRobinPolicy())
self._request_aggregator: serve_utils.RequestsAggregator = (
serve_utils.RequestTimestamp())
self._ssl_keyfile: Optional[str] = ssl_keyfile
self._ssl_certfile: Optional[str] = ssl_certfile
# TODO(tian): httpx.Client has a resource limit of 100 max connections
# for each client. We should wait for feedback on the best max
# connections.
Expand Down Expand Up @@ -217,15 +221,30 @@ async def startup():
# Register controller synchronization task
asyncio.create_task(self._sync_with_controller())

ssl_kwargs = {} if self._ssl_keyfile is None else {
'ssl_keyfile': self._ssl_keyfile,
'ssl_certfile': self._ssl_certfile,
}

schema = 'https' if self._ssl_keyfile is not None else 'http'

logger.info('SkyServe Load Balancer started on '
f'http://0.0.0.0:{self._load_balancer_port}')
f'{schema}://0.0.0.0:{self._load_balancer_port}')

uvicorn.run(self._app, host='0.0.0.0', port=self._load_balancer_port)
uvicorn.run(self._app,
host='0.0.0.0',
port=self._load_balancer_port,
**ssl_kwargs)


def run_load_balancer(controller_addr: str, load_balancer_port: int):
def run_load_balancer(controller_addr: str,
load_balancer_port: int,
ssl_keyfile: Optional[str] = None,
ssl_certfile: Optional[str] = None):
load_balancer = SkyServeLoadBalancer(controller_url=controller_addr,
load_balancer_port=load_balancer_port)
load_balancer_port=load_balancer_port,
ssl_keyfile=ssl_keyfile,
ssl_certfile=ssl_certfile)
load_balancer.run()


Expand Down
12 changes: 12 additions & 0 deletions sky/serve/serve_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -237,6 +237,18 @@ def generate_replica_log_file_name(service_name: str, replica_id: int) -> str:
return os.path.join(dir_name, f'replica_{replica_id}.log')


def generate_remote_ssl_keyfile_name(service_name: str) -> str:
dir_name = generate_remote_service_dir_name(service_name)
# Don't expand here since it is used for remote machine.
return os.path.join(dir_name, 'ssl_keyfile')


def generate_remote_ssl_certfile_name(service_name: str) -> str:
dir_name = generate_remote_service_dir_name(service_name)
# Don't expand here since it is used for remote machine.
return os.path.join(dir_name, 'ssl_certfile')


def generate_replica_cluster_name(service_name: str, replica_id: int) -> str:
return f'{service_name}-{replica_id}'

Expand Down
17 changes: 13 additions & 4 deletions sky/serve/service.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
import shutil
import time
import traceback
from typing import Dict, List
from typing import Dict, List, Optional

import filelock

Expand Down Expand Up @@ -128,7 +128,8 @@ def _cleanup(service_name: str) -> bool:
return failed


def _start(service_name: str, tmp_task_yaml: str, job_id: int):
def _start(service_name: str, tmp_task_yaml: str, job_id: int,
ssl_keyfile: Optional[str], ssl_certfile: Optional[str]):
"""Starts the service."""
# Generate ssh key pair to avoid race condition when multiple sky.launch
# are executed at the same time.
Expand Down Expand Up @@ -224,7 +225,8 @@ def _get_host():
target=ux_utils.RedirectOutputForProcess(
load_balancer.run_load_balancer,
load_balancer_log_file).run,
args=(controller_addr, load_balancer_port))
args=(controller_addr, load_balancer_port, ssl_keyfile,
ssl_certfile))
load_balancer_process.start()
serve_state.set_service_load_balancer_port(service_name,
load_balancer_port)
Expand Down Expand Up @@ -273,8 +275,15 @@ def _get_host():
required=True,
type=int,
help='Job id for the service job.')
parser.add_argument('--ssl-keyfile',
type=str,
help='Path to the SSL key file')
parser.add_argument('--ssl-certfile',
type=str,
help='Path to the SSL certificate file')
args = parser.parse_args()
# We start process with 'spawn', because 'fork' could result in weird
# behaviors; 'spawn' is also cross-platform.
multiprocessing.set_start_method('spawn', force=True)
_start(args.service_name, args.task_yaml, args.job_id)
_start(args.service_name, args.task_yaml, args.job_id, args.ssl_keyfile,
args.ssl_certfile)
32 changes: 32 additions & 0 deletions sky/serve/service_spec.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@ def __init__(
max_replicas: Optional[int] = None,
target_qps_per_replica: Optional[float] = None,
post_data: Optional[Dict[str, Any]] = None,
ssl_keyfile: Optional[str] = None,
ssl_certfile: Optional[str] = None,
readiness_headers: Optional[Dict[str, str]] = None,
dynamic_ondemand_fallback: Optional[bool] = None,
base_ondemand_fallback_replicas: Optional[int] = None,
Expand Down Expand Up @@ -77,13 +79,22 @@ def __init__(
'Currently, SkyServe will cleanup failed replicas'
'and auto restart it to keep the service running.')

if ssl_keyfile is not None and ssl_certfile is None:
with ux_utils.print_exception_no_traceback():
raise ValueError('SSL certfile is required if keyfile is set.')
if ssl_certfile is not None and ssl_keyfile is None:
with ux_utils.print_exception_no_traceback():
raise ValueError('SSL keyfile is required if certfile is set.')

self._readiness_path: str = readiness_path
self._initial_delay_seconds: int = initial_delay_seconds
self._readiness_timeout_seconds: int = readiness_timeout_seconds
self._min_replicas: int = min_replicas
self._max_replicas: Optional[int] = max_replicas
self._target_qps_per_replica: Optional[float] = target_qps_per_replica
self._post_data: Optional[Dict[str, Any]] = post_data
self._ssl_keyfile: Optional[str] = ssl_keyfile
self._ssl_certfile: Optional[str] = ssl_certfile
self._readiness_headers: Optional[Dict[str, str]] = readiness_headers
self._dynamic_ondemand_fallback: Optional[
bool] = dynamic_ondemand_fallback
Expand Down Expand Up @@ -178,6 +189,11 @@ def from_yaml_config(config: Dict[str, Any]) -> 'SkyServiceSpec':
service_config['dynamic_ondemand_fallback'] = policy_section.get(
'dynamic_ondemand_fallback', None)

ssl_section = config.get('ssl', None)
if ssl_section is not None:
service_config['ssl_keyfile'] = ssl_section.get('keyfile', None)
service_config['ssl_certfile'] = ssl_section.get('certfile', None)

return SkyServiceSpec(**service_config)

@staticmethod
Expand Down Expand Up @@ -233,6 +249,8 @@ def add_if_not_none(section, key, value, no_empty: bool = False):
self.upscale_delay_seconds)
add_if_not_none('replica_policy', 'downscale_delay_seconds',
self.downscale_delay_seconds)
add_if_not_none('ssl', 'keyfile', self.ssl_keyfile)
add_if_not_none('ssl', 'certfile', self.ssl_certfile)
return config

def probe_str(self):
Expand Down Expand Up @@ -277,12 +295,18 @@ def autoscaling_policy_str(self):
f'replica{max_plural} (target QPS per replica: '
f'{self.target_qps_per_replica})')

def ssl_str(self):
if self.ssl_keyfile is None and self.ssl_certfile is None:
return 'No SSL Enabled'
return f'Keyfile: {self.ssl_keyfile}, Certfile: {self.ssl_certfile}'

def __repr__(self) -> str:
return textwrap.dedent(f"""\
Readiness probe method: {self.probe_str()}
Readiness initial delay seconds: {self.initial_delay_seconds}
Readiness probe timeout seconds: {self.readiness_timeout_seconds}
Replica autoscaling policy: {self.autoscaling_policy_str()}
SSL Certificates: {self.ssl_str()}
Spot Policy: {self.spot_policy_str()}
""")

Expand Down Expand Up @@ -315,6 +339,14 @@ def target_qps_per_replica(self) -> Optional[float]:
def post_data(self) -> Optional[Dict[str, Any]]:
return self._post_data

@property
def ssl_keyfile(self) -> Optional[str]:
return self._ssl_keyfile

@property
def ssl_certfile(self) -> Optional[str]:
return self._ssl_certfile

@property
def readiness_headers(self) -> Optional[Dict[str, str]]:
return self._readiness_headers
Expand Down
8 changes: 8 additions & 0 deletions sky/templates/sky-serve-controller.yaml.j2
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,10 @@ file_mounts:
{%- for remote_catalog_path, local_catalog_path in modified_catalogs.items() %}
{{remote_catalog_path}}: {{local_catalog_path}}
{%- endfor %}
{%- if local_ssl_keyfile is not none %}
{{remote_ssl_keyfile}}: {{local_ssl_keyfile}}
{{remote_ssl_certfile}}: {{local_ssl_certfile}}
{%- endif %}

run: |
# Activate the Python environment, so that cloud SDKs can be found in the
Expand All @@ -37,6 +41,10 @@ run: |
--service-name {{service_name}} \
--task-yaml {{remote_task_yaml_path}} \
--job-id $SKYPILOT_INTERNAL_JOB_ID \
{%- if local_ssl_keyfile is not none %}
--ssl-keyfile {{remote_ssl_keyfile}} \
--ssl-certfile {{remote_ssl_certfile}} \
{%- endif %}
>> {{controller_log_file}} 2>&1

envs:
Expand Down
12 changes: 12 additions & 0 deletions sky/utils/schemas.py
Original file line number Diff line number Diff line change
Expand Up @@ -375,6 +375,18 @@ def get_service_schema():
'replicas': {
'type': 'integer',
},
'ssl': {
'type': 'object',
'additionalProperties': False,
'properties': {
'keyfile': {
'type': 'string',
},
'certfile': {
'type': 'string',
},
},
},
}
}

Expand Down
Loading