[DigitalOcean] droplet integration #3832

Merged
merged 85 commits into from
Jan 2, 2025
Changes from 84 commits
Commits
96c5e81
init digital ocean droplet integration
asaiacai Aug 14, 2024
fa8a6bb
abbreviate cloud name
asaiacai Aug 20, 2024
cc8384e
switch to pydo
asaiacai Aug 20, 2024
80b5941
adjust polling logic and mount block storage to instance
asaiacai Aug 31, 2024
49c411b
merge
asaiacai Aug 31, 2024
a741f11
filter by paginated
asaiacai Aug 31, 2024
8702acd
lint
asaiacai Sep 2, 2024
1de819c
sky launch, start, stop functional
asaiacai Sep 2, 2024
fafc71d
fix credential file mounts, autodown works now
asaiacai Sep 2, 2024
e50126c
set gpu droplet image
asaiacai Sep 3, 2024
8532fcf
cleanup
asaiacai Sep 3, 2024
13628ad
remove more tests
asaiacai Sep 3, 2024
34d1916
atomically destroy instance and block storage simulatenously
asaiacai Sep 3, 2024
5eab8f9
install docker
asaiacai Sep 3, 2024
c992161
disable spot test
asaiacai Sep 3, 2024
a868b1a
fix ip address bug for multinode
asaiacai Sep 6, 2024
d4f7794
lint
asaiacai Sep 6, 2024
30ead7b
patch ssh from job/serve controller
asaiacai Sep 6, 2024
6791a7d
switch to EA slugs
asaiacai Sep 6, 2024
af8e5e9
do adaptor
asaiacai Sep 10, 2024
3a31a0a
lint
asaiacai Sep 11, 2024
ce900ed
Update sky/clouds/do.py
asaiacai Sep 17, 2024
391fea1
Update sky/clouds/do.py
asaiacai Sep 17, 2024
1703b40
comment template
asaiacai Sep 17, 2024
66f0314
comment patch
asaiacai Sep 17, 2024
817f3b3
add h100 test case
asaiacai Sep 17, 2024
5d8368c
comment on instance name length
asaiacai Sep 17, 2024
74856df
Update sky/clouds/do.py
asaiacai Sep 18, 2024
cbbb36b
Update sky/clouds/service_catalog/do_catalog.py
asaiacai Sep 18, 2024
ee98000
comment on max node char len
asaiacai Sep 23, 2024
d6da5e8
comment on weird azure import
asaiacai Sep 23, 2024
79aac0a
comment acc price is included in instance price
asaiacai Sep 23, 2024
71c9f9a
fix return type
asaiacai Sep 23, 2024
4fc8fe8
switch with do_utils
asaiacai Sep 23, 2024
113d24d
remove broad except
asaiacai Sep 23, 2024
1e0f9ec
Update sky/provision/do/instance.py
asaiacai Sep 23, 2024
4ab385b
Update sky/provision/do/instance.py
asaiacai Sep 23, 2024
daa7446
remove azure
asaiacai Sep 23, 2024
0d71031
comment on non_terminated_only
asaiacai Sep 23, 2024
dd8c238
add open port debug message
asaiacai Sep 29, 2024
cf7947b
wrap start instance api
asaiacai Sep 29, 2024
56163c0
use f-string
asaiacai Sep 29, 2024
0d39425
wrap stop
asaiacai Sep 29, 2024
0f8a53b
wrap instance down
asaiacai Sep 29, 2024
2881508
assert credentials and check against all contexts
asaiacai Sep 29, 2024
ae76a80
assert client is None
asaiacai Sep 29, 2024
8056bc8
remove pending instances during instance restart
asaiacai Sep 29, 2024
9bdf9df
wrap rename
asaiacai Sep 29, 2024
6cccf6a
rename ssh key var
asaiacai Oct 4, 2024
901ed4e
fix tags
asaiacai Oct 4, 2024
7d57980
add tags for block device
asaiacai Oct 4, 2024
e8d1782
f strings for errors
asaiacai Oct 4, 2024
2e51c59
support image ids
asaiacai Oct 21, 2024
b5fe945
update do tests
Oct 24, 2024
6565fff
only store head instance id
Oct 24, 2024
c6a4583
Merge branch 'skypilot-org:master' into droplet
asaiacai Oct 25, 2024
fde2bc2
rename image slugs
Oct 25, 2024
baf5b48
Merge branch 'droplet' of https://github.com/asaiacai/skypilot into d…
Oct 25, 2024
ff87fe7
add digital ocean alias
Oct 25, 2024
c49c330
wait for docker to be available
Oct 25, 2024
c857fe9
Merge branch 'skypilot-org:master' into droplet
asaiacai Oct 25, 2024
40b2134
update requirements and tests
Oct 25, 2024
65bfc03
increase docker timeout
Oct 25, 2024
812f747
lint
Oct 26, 2024
031777a
Merge branch 'skypilot-org:master' into droplet
asaiacai Nov 4, 2024
3132acb
merge
asaiacai Dec 19, 2024
279dc93
move tests
asaiacai Dec 19, 2024
3b093b0
lint
asaiacai Dec 19, 2024
3023938
patch test
asaiacai Dec 19, 2024
b98b189
lint
asaiacai Dec 19, 2024
335b646
typo fix
asaiacai Dec 19, 2024
61e0ae9
fix typo
asaiacai Dec 19, 2024
b33edad
patch tests
asaiacai Dec 19, 2024
e0e4701
fix tests
asaiacai Dec 19, 2024
72cd751
no_mark spot test
asaiacai Dec 19, 2024
4bb73ce
handle 2cpu serve tests
asaiacai Dec 19, 2024
18e4e8c
lint
asaiacai Dec 19, 2024
97f8c4e
lint
asaiacai Dec 19, 2024
cd31370
use logger.debug
asaiacai Dec 19, 2024
717c75e
fix none cred path
asaiacai Dec 19, 2024
3d47236
lint
asaiacai Dec 19, 2024
3d5f073
handle get_cred path
asaiacai Dec 19, 2024
396d782
pylint
asaiacai Dec 19, 2024
24d3c4d
patch for DO test_optimizer_dryruns.py
asaiacai Dec 31, 2024
3bf98c8
revert optimizer dryrun
asaiacai Dec 31, 2024
1 change: 1 addition & 0 deletions docs/source/getting-started/installation.rst
@@ -59,6 +59,7 @@ Install SkyPilot using pip:
pip install "skypilot-nightly[runpod]"
pip install "skypilot-nightly[fluidstack]"
pip install "skypilot-nightly[paperspace]"
pip install "skypilot-nightly[do]"
pip install "skypilot-nightly[cudo]"
pip install "skypilot-nightly[ibm]"
pip install "skypilot-nightly[scp]"
20 changes: 20 additions & 0 deletions sky/adaptors/do.py
@@ -0,0 +1,20 @@
"""Digital Ocean cloud adaptors"""

# pylint: disable=import-outside-toplevel

from sky.adaptors import common

_IMPORT_ERROR_MESSAGE = ('Failed to import dependencies for DO. '
                         'Try pip install "skypilot[do]"')
pydo = common.LazyImport('pydo', import_error_message=_IMPORT_ERROR_MESSAGE)
azure = common.LazyImport('azure', import_error_message=_IMPORT_ERROR_MESSAGE)
_LAZY_MODULES = (pydo, azure)


# `pydo` reuses Azure's exception classes (`azure.core.exceptions`). See:
# https://github.com/digitalocean/pydo/blob/7b01498d99eb0d3a772366b642e5fab3d6fc6aa2/examples/poc_droplets_volumes_sshkeys.py#L6
@common.load_lazy_modules(modules=_LAZY_MODULES)
def exceptions():
    """Azure exceptions."""
    from azure.core import exceptions as azure_exceptions
    return azure_exceptions
1 change: 1 addition & 0 deletions sky/backends/backend_utils.py
@@ -1000,6 +1000,7 @@ def _add_auth_to_cluster_config(cloud: clouds.Cloud, cluster_config_file: str):
clouds.Cudo,
clouds.Paperspace,
clouds.Azure,
clouds.DO,
)):
config = auth.configure_ssh_info(config)
elif isinstance(cloud, clouds.GCP):
1 change: 1 addition & 0 deletions sky/backends/cloud_vm_ray_backend.py
@@ -178,6 +178,7 @@ def _get_cluster_config_template(cloud):
clouds.SCP: 'scp-ray.yml.j2',
clouds.OCI: 'oci-ray.yml.j2',
clouds.Paperspace: 'paperspace-ray.yml.j2',
clouds.DO: 'do-ray.yml.j2',
clouds.RunPod: 'runpod-ray.yml.j2',
clouds.Kubernetes: 'kubernetes-ray.yml.j2',
clouds.Vsphere: 'vsphere-ray.yml.j2',
2 changes: 2 additions & 0 deletions sky/clouds/__init__.py
@@ -15,6 +15,7 @@
from sky.clouds.aws import AWS
from sky.clouds.azure import Azure
from sky.clouds.cudo import Cudo
from sky.clouds.do import DO
from sky.clouds.fluidstack import Fluidstack
from sky.clouds.gcp import GCP
from sky.clouds.ibm import IBM
@@ -34,6 +35,7 @@
'Cudo',
'GCP',
'Lambda',
'DO',
'Paperspace',
'SCP',
'RunPod',
303 changes: 303 additions & 0 deletions sky/clouds/do.py
@@ -0,0 +1,303 @@
""" Digital Ocean Cloud. """

import json
import typing
from typing import Dict, Iterator, List, Optional, Tuple, Union

from sky import clouds
from sky.adaptors import do
from sky.clouds import service_catalog
from sky.provision.do import utils as do_utils
from sky.utils import resources_utils

if typing.TYPE_CHECKING:
    from sky import resources as resources_lib

_CREDENTIAL_FILE = 'config.yaml'


@clouds.CLOUD_REGISTRY.register(aliases=['digitalocean'])
class DO(clouds.Cloud):
    """Digital Ocean Cloud"""

    _REPR = 'DO'
    _CLOUD_UNSUPPORTED_FEATURES = {
        clouds.CloudImplementationFeatures.CLONE_DISK_FROM_CLUSTER:
            'Migrating '
            f'disk is not supported in {_REPR}.',
        clouds.CloudImplementationFeatures.SPOT_INSTANCE:
            'Spot instances are '
            f'not supported in {_REPR}.',
        clouds.CloudImplementationFeatures.CUSTOM_DISK_TIER:
            'Custom disk tiers '
            f'are not supported in {_REPR}.',
    }
    # DO limits node names to at most 255 characters:
    # https://docs.digitalocean.com/reference/api/api-reference/#operation/droplets_create
    # 255 - 8 = 247 characters, since
    # our provisioner appends an additional `-worker`.
    _MAX_CLUSTER_NAME_LEN_LIMIT = 247
    _regions: List[clouds.Region] = []

    # Using the latest SkyPilot provisioner API to provision and check status.
    PROVISIONER_VERSION = clouds.ProvisionerVersion.SKYPILOT
    STATUS_VERSION = clouds.StatusVersion.SKYPILOT

    @classmethod
    def _unsupported_features_for_resources(
        cls, resources: 'resources_lib.Resources'
    ) -> Dict[clouds.CloudImplementationFeatures, str]:
        """The features not supported based on the resources provided.

        This method is used by check_features_are_supported() to check if the
        cloud implementation supports all the requested features.

        Returns:
            A dict of {feature: reason} for the features not supported by the
            cloud implementation.
        """
        del resources  # unused
        return cls._CLOUD_UNSUPPORTED_FEATURES

    @classmethod
    def _max_cluster_name_length(cls) -> Optional[int]:
        return cls._MAX_CLUSTER_NAME_LEN_LIMIT

    @classmethod
    def regions_with_offering(
        cls,
        instance_type: str,
        accelerators: Optional[Dict[str, int]],
        use_spot: bool,
        region: Optional[str],
        zone: Optional[str],
    ) -> List[clouds.Region]:
        assert zone is None, 'DO does not support zones.'
        del accelerators, zone  # unused
        if use_spot:
            return []
        regions = service_catalog.get_region_zones_for_instance_type(
            instance_type, use_spot, 'DO')
        if region is not None:
            regions = [r for r in regions if r.name == region]
        return regions

    @classmethod
    def get_vcpus_mem_from_instance_type(
        cls,
        instance_type: str,
    ) -> Tuple[Optional[float], Optional[float]]:
        return service_catalog.get_vcpus_mem_from_instance_type(instance_type,
                                                                clouds='DO')

    @classmethod
    def zones_provision_loop(
        cls,
        *,
        region: str,
        num_nodes: int,
        instance_type: str,
        accelerators: Optional[Dict[str, int]] = None,
        use_spot: bool = False,
    ) -> Iterator[None]:
        del num_nodes  # unused
        regions = cls.regions_with_offering(instance_type,
                                            accelerators,
                                            use_spot,
                                            region=region,
                                            zone=None)
        for r in regions:
            assert r.zones is None, r
            yield r.zones

    def instance_type_to_hourly_cost(
        self,
        instance_type: str,
        use_spot: bool,
        region: Optional[str] = None,
        zone: Optional[str] = None,
    ) -> float:
        return service_catalog.get_hourly_cost(
            instance_type,
            use_spot=use_spot,
            region=region,
            zone=zone,
            clouds='DO',
        )

    def accelerators_to_hourly_cost(
        self,
        accelerators: Dict[str, int],
        use_spot: bool,
        region: Optional[str] = None,
        zone: Optional[str] = None,
    ) -> float:
        """Returns the hourly cost of the accelerators, in dollars/hour."""
        # The accelerator price is included in the instance price.
        del accelerators, use_spot, region, zone  # unused
        return 0.0

    def get_egress_cost(self, num_gigabytes: float) -> float:
        return 0.0

    def __repr__(self):
        return self._REPR

    @classmethod
    def get_default_instance_type(
        cls,
        cpus: Optional[str] = None,
        memory: Optional[str] = None,
        disk_tier: Optional[resources_utils.DiskTier] = None,
    ) -> Optional[str]:
        """Returns the default instance type for DO."""
        return service_catalog.get_default_instance_type(cpus=cpus,
                                                         memory=memory,
                                                         disk_tier=disk_tier,
                                                         clouds='DO')

    @classmethod
    def get_accelerators_from_instance_type(
            cls, instance_type: str) -> Optional[Dict[str, Union[int, float]]]:
        return service_catalog.get_accelerators_from_instance_type(
            instance_type, clouds='DO')

    @classmethod
    def get_zone_shell_cmd(cls) -> Optional[str]:
        return None

    def make_deploy_resources_variables(
            self,
            resources: 'resources_lib.Resources',
            cluster_name: resources_utils.ClusterName,
            region: 'clouds.Region',
            zones: Optional[List['clouds.Zone']],
            num_nodes: int,
            dryrun: bool = False) -> Dict[str, Optional[str]]:
        del zones, dryrun, cluster_name

        r = resources
        acc_dict = self.get_accelerators_from_instance_type(r.instance_type)
        if acc_dict is not None:
            custom_resources = json.dumps(acc_dict, separators=(',', ':'))
        else:
            custom_resources = None
        image_id = None
        if (resources.image_id is not None and
                resources.extract_docker_image() is None):
            if None in resources.image_id:
                image_id = resources.image_id[None]
            else:
                assert region.name in resources.image_id
                image_id = resources.image_id[region.name]
        return {
            'instance_type': resources.instance_type,
            'custom_resources': custom_resources,
            'region': region.name,
            **({
                'image_id': image_id
            } if image_id else {})
        }

    def _get_feasible_launchable_resources(
        self, resources: 'resources_lib.Resources'
    ) -> resources_utils.FeasibleResources:
        """Returns a list of feasible resources for the given resources."""
        if resources.use_spot:
            # TODO: Add hints to all return values in this method to help
            # users understand why the resources are not launchable.
            return resources_utils.FeasibleResources([], [], None)
        if resources.instance_type is not None:
            assert resources.is_launchable(), resources
            resources = resources.copy(accelerators=None)
            return resources_utils.FeasibleResources([resources], [], None)

        def _make(instance_list):
            resource_list = []
            for instance_type in instance_list:
                r = resources.copy(
                    cloud=DO(),
                    instance_type=instance_type,
                    accelerators=None,
                    cpus=None,
                )
                resource_list.append(r)
            return resource_list

        # Currently, handle a filter on accelerators only.
        accelerators = resources.accelerators
        if accelerators is None:
            # Return a default instance type
            default_instance_type = DO.get_default_instance_type(
                cpus=resources.cpus,
                memory=resources.memory,
                disk_tier=resources.disk_tier)
            return resources_utils.FeasibleResources(
                _make([default_instance_type]), [], None)

        assert len(accelerators) == 1, resources
        acc, acc_count = list(accelerators.items())[0]
        (instance_list, fuzzy_candidate_list) = (
            service_catalog.get_instance_type_for_accelerator(
                acc,
                acc_count,
                use_spot=resources.use_spot,
                cpus=resources.cpus,
                memory=resources.memory,
                region=resources.region,
                zone=resources.zone,
                clouds='DO',
            ))
        if instance_list is None:
            return resources_utils.FeasibleResources([], fuzzy_candidate_list,
                                                     None)
        return resources_utils.FeasibleResources(_make(instance_list),
                                                 fuzzy_candidate_list, None)

    @classmethod
    def check_credentials(cls) -> Tuple[bool, Optional[str]]:
        """Verify that the user has valid credentials for DO."""
        try:
            # Attempt an API request that lists droplets.
            do_utils.client().droplets.list()
        except do.exceptions().HttpResponseError as err:
            return False, str(err)
        except do_utils.DigitalOceanError as err:
            return False, str(err)

        return True, None

    def get_credential_file_mounts(self) -> Dict[str, str]:
        try:
            do_utils.client()
            return {
                f'~/.config/doctl/{_CREDENTIAL_FILE}': do_utils.CREDENTIALS_PATH
            }
        except do_utils.DigitalOceanError:
            return {}

    @classmethod
    def get_current_user_identity(cls) -> Optional[List[str]]:
        # NOTE: used for very advanced SkyPilot functionality
        # Can implement later if desired
        return None

    @classmethod
    def get_image_size(cls, image_id: str, region: Optional[str]) -> float:
        del region  # unused
        try:
            response = do_utils.client().images.get(image_id=image_id)
            return response['image']['size_gigabytes']
        except do.exceptions().HttpResponseError as err:
            # `response` is unbound if the request itself raised, so report
            # the requested `image_id` instead.
            raise do_utils.DigitalOceanError(
                'HTTP error while retrieving size of '
                f'image_id {image_id}: {err.error.message}') from err
        except KeyError as err:
            raise do_utils.DigitalOceanError(
                f'No image_id `{image_id}` found') from err

    def instance_type_exists(self, instance_type: str) -> bool:
        return service_catalog.instance_type_exists(instance_type, 'DO')

    def validate_region_zone(self, region: Optional[str], zone: Optional[str]):
        return service_catalog.validate_region_zone(region, zone, clouds='DO')
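Note on the image selection in `make_deploy_resources_variables`: the lookup can be read as a small standalone function, sketched below under some assumptions. `resolve_image_id` is a hypothetical helper, not part of this PR; it raises `ValueError` where the PR uses `assert`, and it leaves out the PR's check that skips the lookup when a Docker image is requested. The image and region names in the usage lines are made up for illustration.

```python
from typing import Dict, Optional


def resolve_image_id(image_id: Optional[Dict[Optional[str], str]],
                     region: str) -> Optional[str]:
    """Pick the image for `region` from a {region_or_None: image} mapping.

    A `None` key means "use this image in every region" and takes
    precedence over region-specific entries, mirroring the order of the
    checks in the diff above.
    """
    if image_id is None:
        # No custom image requested; the caller omits the `image_id` key.
        return None
    if None in image_id:
        return image_id[None]
    if region not in image_id:
        # The PR asserts here; an explicit error is used in this sketch.
        raise ValueError(f'No image specified for region {region!r}')
    return image_id[region]


# Region-agnostic image: the same image is returned for any region.
any_region = resolve_image_id({None: 'example-image'}, 'nyc3')
# Region-specific image: only the matching region's entry is used.
per_region = resolve_image_id({'nyc3': 'image-a', 'sfo3': 'image-b'}, 'sfo3')
```

A region missing from a region-specific mapping raises in this sketch, which corresponds to the `assert region.name in resources.image_id` line in the PR.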
2 changes: 1 addition & 1 deletion sky/clouds/service_catalog/constants.py
@@ -4,4 +4,4 @@
CATALOG_DIR = '~/.sky/catalogs'
ALL_CLOUDS = ('aws', 'azure', 'gcp', 'ibm', 'lambda', 'scp', 'oci',
'kubernetes', 'runpod', 'vsphere', 'cudo', 'fluidstack',
- 'paperspace')
+ 'paperspace', 'do')