
[k8s] Support in-cluster and kubeconfig auth simultaneously#4188

Merged
romilbhardwaj merged 12 commits into master from k8s_auth_incluster_and_kubeconfig on Nov 25, 2024

Conversation

@romilbhardwaj
Collaborator

@romilbhardwaj romilbhardwaj commented Oct 26, 2024

Adds support for using both in-cluster auth and kubeconfig-based auth simultaneously. This is useful when SkyPilot is running inside a Kubernetes cluster and may need to access an external cluster (e.g., the jobs controller or API server).

Builds on the in-cluster context naming from #4136. Do not merge before #4136.

TODO:

  • Test if SSH works (in-cluster kubectl may require explicitly specifying kubeconfig path)

@romilbhardwaj romilbhardwaj removed this from the v0.7 milestone Oct 29, 2024
@romilbhardwaj romilbhardwaj marked this pull request as ready for review November 17, 2024 01:33
@romilbhardwaj
Collaborator Author

This should be ready for review now.

If allowed_contexts is set, we now upload the local kubeconfig since the kubeconfig would be required to authenticate with other clusters in the context list.

Tested on a GKE cluster + `sky local up` cluster, with the Sky jobs controller running on the `sky local up` cluster. Note that the GKE cluster requires token-based auth (generated with generate_kubeconfig.sh, see #4379).

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @romilbhardwaj! It mostly looks good to me. Left a few questions.

Comment thread sky/authentication.py Outdated
Comment thread sky/backends/backend_utils.py Outdated
Comment on lines +712 to +713
# If using Kubernetes and using allowed_contexts, we need to upload
# credentials for all contexts.
Collaborator

What is this needed for? Is this for the controller on Kubernetes?

Collaborator Author

Yes, it's for when the controller is on Kubernetes. Added a comment.

Comment thread sky/templates/kubernetes-ray.yml.j2 Outdated
Comment on lines +387 to +393
{% if k8s_env_vars is not none %}
env:
{% for key, value in k8s_env_vars.items() %}
- name: {{ key }}
value: {{ value }}
{% endfor %}
{% endif %}
Collaborator

If a user specifies `env` in the spec, will this get overwritten?

Collaborator Author

@romilbhardwaj romilbhardwaj Nov 20, 2024

No, since it's a dict it will get merged:

def merge_dicts(source: Dict[Any, Any], destination: Dict[Any, Any]):
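A minimal recursive merge in the spirit of the `merge_dicts` signature quoted above; this is an illustrative sketch, and SkyPilot's actual implementation may handle more cases (e.g., merging lists):

```python
from typing import Any, Dict


def merge_dicts(source: Dict[Any, Any], destination: Dict[Any, Any]) -> None:
    """Recursively merge source into destination, in place.

    Nested dicts are merged key-by-key, so user-specified env entries and
    injected ones coexist instead of one dict overwriting the other.
    """
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(destination.get(key), dict):
            merge_dicts(value, destination[key])
        else:
            destination[key] = value
```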

Comment on lines +770 to +773
# If context is None, it means we are using in-cluster auth. In this
# case, pass --kubeconfig /dev/null to avoid using any kubeconfig file.
if self.context is None:
kubectl_args += ['--kubeconfig', '/dev/null']
Collaborator

It seems we have many places setting the context to a specific value for in-cluster auth. Could that cause us to get "in-cluster" or other context names instead of None?

Collaborator Author

With this PR, it should be consistent now: context will always be set to the current context, except when running in-cluster (in which case context will be None and will need to be handled when passed to kubectl).
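The convention described above can be sketched as follows (the helper name is illustrative, not SkyPilot's actual code):

```python
from typing import List, Optional


def kubectl_auth_args(context: Optional[str]) -> List[str]:
    """Build kubectl auth args under the PR's convention.

    context is None only when running in-cluster; in that case, point
    --kubeconfig at /dev/null so kubectl falls back to the pod's service
    account instead of any mounted kubeconfig. Otherwise, select the
    named kubeconfig context.
    """
    if context is None:
        return ['--kubeconfig', '/dev/null']
    return ['--context', context]
```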

Collaborator Author

Added a comment to document this.

…o k8s_auth_incluster_and_kubeconfig

# Conflicts:
#	sky/backends/backend_utils.py
#	sky/clouds/kubernetes.py
Collaborator

@Michaelvll Michaelvll left a comment

LGTM. Thanks!

@romilbhardwaj
Collaborator Author

Manual tests with multiple GKE k8s clusters pass, merging now.

@romilbhardwaj romilbhardwaj added this pull request to the merge queue Nov 25, 2024
Merged via the queue into master with commit 76e20b6 Nov 25, 2024
@romilbhardwaj romilbhardwaj deleted the k8s_auth_incluster_and_kubeconfig branch November 25, 2024 05:37
Michaelvll added a commit that referenced this pull request Dec 9, 2024
* [perf] optimizations for sky jobs launch (#4341)

* cache AWS get_user_identities

With SSO enabled (and maybe without?) this takes about a second. We already use
an lru_cache for Azure, do the same here.

* skip optimization for sky jobs launch --yes

The only reason we call optimize for jobs_launch is to give a preview of the
resources we expect to use, and give the user an opportunity to back out if it's
not what they expect. If you use --yes or -y, you don't have a chance to back
out and you're probably running from a script, where you don't care.
Optimization can take ~2 seconds, so just skip it.

* update logging

* address PR comments
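The identity caching described in the first bullet above can be sketched with `functools.lru_cache`; the function name and return value here are stand-ins, not SkyPilot's actual API:

```python
import functools


@functools.lru_cache(maxsize=1)
def get_user_identities() -> tuple:
    # Stand-in for the ~1s AWS identity lookup (e.g., an STS
    # get-caller-identity call with SSO enabled); with the cache,
    # repeated calls within one process return instantly.
    return ('arn:aws:iam::123456789012:user/example',)
```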

* [ux] cache cluster status of autostop or spot clusters for 2s (#4332)

* add status_updated_at to DB

* don't refresh autostop/spot cluster if it's recently been refreshed

* update locking mechanism for status check to early exit

* address PR comments

* add warning about cluster status lock timeout
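The 2s status cache described above could look roughly like this (field and helper names are assumed, not the actual schema):

```python
import time
from typing import Optional

_STATUS_CACHE_TTL_SECONDS = 2.0


def needs_refresh(status_updated_at: float,
                  now: Optional[float] = None) -> bool:
    """Return True if the cached cluster status is older than the TTL.

    status_updated_at is the epoch timestamp recorded in the DB when the
    status was last refreshed; within the TTL, the cached value is reused.
    """
    if now is None:
        now = time.time()
    return (now - status_updated_at) > _STATUS_CACHE_TTL_SECONDS
```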

* [k8s] fix managed job issue on k8s (#4357)

Signed-off-by: nkwangleiGIT <nkwanglei@126.com>

* [Core] Add `NO_UPLOAD` for `remote_identity` (#4307)

* Add skip flag to remote_identity

* Rename to NO_UPLOAD

* Fixes

* lint

* comments

* Add comments

* lint

* Add Lambda's GH200 instance type (#4377)

Add GH200 instance type

* [FluidStack] Fix provisioning and add new gpu types (#4359)

[FluidStack] Fix provisioning and add new gpu types

    * Add new `provisioning` status to fix failed deployments

    * Add H100 SXM5 GPU mapping

* [ux] display human-readable name for controller (#4376)

* [k8s] Handle apt update log not existing (#4381)

do not panic if file does not exist, it may be written soon
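A minimal sketch of tolerating a not-yet-written log file, as the fix above describes (the function name and path handling are illustrative):

```python
import os


def read_log_if_exists(path: str) -> str:
    """Return the log contents, or '' if the file hasn't been written yet."""
    if not os.path.exists(path):
        return ''  # The log may simply not be written yet; don't raise.
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()
```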

* Support event-based smoke tests instead of sleep-time-based ones to reduce flakiness and speed up tests (#4284)

* event based smoke test

* more event based smoke test

* more test cases

* more test cases with managed jobs

* bug fix

* bump up seconds

* merge master and resolve conflict

* restore sleep for fail test case

* [UX] user-friendly message shown if Kubernetes is not enabled. (#4336)

try except

* [Jobs] Disable deduplication for logs (#4388)

Disable dedup

* [OCI] set zone in the ProvisionRecord (#4383)

* fix: Add zone to the ProvisionRecord

* fix

* [Examples] Specify version for vllm cuz vllm v0.6.4.post1 has issue (#4391)

* [OCI] Specify vllm version because the latest vllm v0.6.4.post1 has issue

* version for vllm-flash-attn

* [docs] Specify compartment for OCI resources. (#4384)

* [docs] Specify compartment for OCI resources.

* Add link to compartment definition page

* [k8s] Improve multi-node provisioning time (nimbus) (#4393)

* Tracking k8s events with timeline

* Remove SSH wait

* Parallelize pod creation and status check

* Parallelize labelling, add docs on optimizing base image, bump default provision timeout

* More parallelization, batching and optimizations

* lint

* correctness

* Fix double launch bug

* fix num threads

* Add fd limit warning

* [k8s] Move setup and ray start to pod args to make them async (#4389)

* move scripts to args

* Avoid ray setup

* fix

* Add checks for ray healthiness

* remove bc installation

* wait for healthy

* add todo

* fix

* fix

* format

* format

* remove unnecessary logging

* print out error setup

* Add comment

* clean up the logging

* style

* Fixes for ubuntu images

* format

* remove unused comments

* Optimize ray start

* add comments

* Add comments

* Fix comments and logging

* missing end_epoch

* Add logging

* Longer timeout and trigger ray start

* Fixes for the ray port and AWS credential setup

* Update netcat-openbsd, comments

* _NUM_THREADS rename

* add num_nodes to calculate timeout

* lint

* revert

* use uv for pip install and for venv creation (#4394)

* use uv for pip install and for venv creation

uv is a tool that can replace pip and venv (and some other stuff we're not using,
I think). It's written in Rust and in testing is significantly faster for many
operations, especially things like `pip list` or `pip install skypilot` when
skypilot or all its dependencies are already installed.

* add comment to SKY_PIP_CMD

* sudo handling for ray

* Add comment in dockerfile

* fix pod checks

* lint

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>

* [Core] Skip worker ray start for multinode (#4390)

* Optimize ray start

* add comments

* update logging

* remove `uv` from runtime setup due to azure installation issue (#4401)

* [k8s] Skip listing all pods to speed up optimizer (#4398)

* Reduce API calls

* lint

* [k8s] Nimbus backward compatibility (#4400)

* Add nimbus backward compatibility

* add uv backcompat

* add uv backcompat

* add uv backcompat

* lint

* merge

* merge

* [Storage] Call `sync_file_mounts` when either rsync or storage file_mounts are specified  (#4317)

do file mounts if storage is specified

* [k8s] Support in-cluster and kubeconfig auth simultaneously (#4188)

* per-context SA + incluster auth fixes

* lint

* Support both incluster and kubeconfig

* wip

* Ignore kubeconfig when context is not specified, add su, mounting kubeconfig

* lint

* comments

* fix merge issues

* lint

* Fix Spot instance on Azure (#4408)

* [UX] Allow disabling ports in CLI (#4378)

[UX] Allow disabling ports

* [AWS] Get rid of credential files if `remote_identity: SERVICE_ACCOUNT` specified (#4395)

* syntax

* minor

* Fix OD instance on Azure (#4411)

* [UX] Remove K80 and M60 from common GPU list (#4382)

* Remove K80 and M60 from GPU list

* Fix kubernetes instance type with space

* comments

* format

* format

* remove mi25

* Event based smoke tests -- managed jobs (#4386)

* event based smoke test

* more event based smoke test

* more test cases

* more test cases with managed jobs

* bug fix

* bump up seconds

* merge master and resolve conflict

* more test case

* support test_managed_jobs_pipeline_failed_setup

* support test_managed_jobs_recovery_aws

* managed job status

* bug fix

* test managed job cancel

* test_managed_jobs_storage

* more test cases

* resolve pr comment

* private member function

* bug fix

* interface change

* bug fix

* bug fix

* raise error on empty status

* [k8s] Fix in-cluster auth namespace fetching (#4420)

* Fix incluster auth namespace fetching

* Fixes

* [k8s] Update comparison page image (#4415)

Update image

* Add a pre commit config to help format before pushing (#4258)

* pre commit config

* yapf version

* fix

* mypy check all files

* skip smoke_test.py

* add doc

* better format

* newline format

* sync with format.sh

* comment fix

* fix the pylint hook for pre-commit (#4422)

* fix the pylint hook

* remove default arg

* change name

* limit pylint files

* [k8s] Fix resources.image_id backward compatibility (#4425)

* Fix back compat

* Fix back compat for image_id + regions

* lint

* comments

* [Tests] Move tests to uv to speed up the dependency installation by >10x (#4424)

* correct cache for pypi

* Add doc cache and test cache

* Add examples folder

* fix policy path

* use uv for pylint

* Fix azure cli

* disable cache

* use venv

* set venv

* source instead

* rename doc build

* Move to uv

* Fix azure cli

* Add -e

* Update .github/workflows/format.yml

Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>

* Update .github/workflows/mypy.yml

Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>

* Update .github/workflows/pylint.yml

Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>

* Update .github/workflows/pytest.yml

Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>

* Update .github/workflows/test-doc-build.yml

Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>

* fix pytest yml

* Add merge group

---------

Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>

* fix db

* fix launch

* remove transaction id

* format

* format

* format

* test doc build

* doc build

* update readme for test kubernetes example (#4426)

* update readme

* fetch version from gcloud

* rename var to GKE_VERSION

* subnetwork also use REGION

* format

* fix types

* fix

* format

* fix types

* [k8s] Fix `show-gpus` availability map when nvidia drivers are not installed (#4429)

* Fix availability map

* Fix availability map

* fix types

* avoid catching ValueError during failover (#4432)

* avoid catching ValueError during failover

If the cloud api raises ValueError or a subclass of ValueError during instance
termination, we will assume the cluster was downed. Fix this by introducing a
new exception ClusterDoesNotExist that we can catch instead of the more general
ValueError.

* add unit test

* lint
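The fix above can be sketched as follows (class and function names are illustrative, not SkyPilot's actual code):

```python
class ClusterDoesNotExist(ValueError):
    """Raised when the cluster being operated on no longer exists."""


def terminate_cluster(cloud_api, cluster_name: str) -> bool:
    """Terminate a cluster; return False if it was already gone.

    Catching only ClusterDoesNotExist (rather than bare ValueError) means
    unrelated ValueErrors raised by the cloud API during failover still
    propagate instead of being silently treated as "cluster was downed".
    """
    try:
        cloud_api.terminate_instances(cluster_name)
    except ClusterDoesNotExist:
        return False
    return True
```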

* [Core] Execute setup when `--detach-setup` and no `run` section (#4430)

* Execute setup when --detach-setup and no run section

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* add comments

* Fix types

* format

* minor

* Add test for detach setup only

---------

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* wait for cleanup

* [Jobs] Allow logs for finished jobs and add `sky jobs logs --refresh` for restarting the jobs controller (#4380)

* Stream logs for finished jobs

* Allow stream logs for finished jobs

* Read files after the indicator lines

* Add refresh for `sky jobs logs`

* fix log message

* address comments

* Add smoke test

* fix smoke

* fix jobs queue smoke test

* fix storage

* fix merge issue

* fix merge issue

* Fix merging issue

* format

---------

Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>
Co-authored-by: Lei <nkwanglei@126.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Co-authored-by: Cody Brownstein <105375373+cbrownstein-lambda@users.noreply.github.com>
Co-authored-by: mjibril <mjibril@users.noreply.github.com>
Co-authored-by: zpoint <zguo@covariant.ai>
Co-authored-by: Hysun He <hysunhe@foxmail.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: zpoint <zp0int@qq.com>