Add the option to terminate pending kubernetes kernels if they have events preventing them from starting #1357

Open
OrenZ1 opened this issue Dec 27, 2023 · 5 comments


@OrenZ1

OrenZ1 commented Dec 27, 2023

Problem

I am facing a problem when using JEG on Kubernetes.
I have set the kernel launch timeout to 5 minutes (because I am using large images) and set MAX_KERNELS_PER_USER to 2 to prevent kernel spamming.
When a user submits a request to launch a kernel, it is started in a remote pod. Sometimes the pod remains stuck in Pending, e.g. due to a lack of resources at that moment. In this case, the user can’t submit a new kernel (with lower resource demands) and has to wait 5 minutes for the timeout to take effect before using another kernel. I even thought about setting up a service that watches pending kernel pods and, if they have events that prevent them from starting, sends a DELETE request to the gateway to kill the kernel. The problem is that while kernels are pending, the gateway can’t receive DELETE requests for them.
In addition, JEG is not aware of actions performed on the Kubernetes cluster, so I can’t delete the pods using the Kubernetes API, because JEG would still wait for the timeout for this kernel.
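
To make the "watcher" idea above concrete, here is a minimal sketch of the check such a service might run on each kernel pod. It assumes the pod is available as a plain dict shaped like the Kubernetes Pod JSON; the function name `blocking_reason` and the set of waiting reasons are illustrative choices, not anything JEG provides.

```python
from typing import Optional

# Container waiting reasons that usually mean the pod will not start on its own.
# This set is an illustrative assumption, not an exhaustive list.
BLOCKING_WAIT_REASONS = {"ImagePullBackOff", "ErrImagePull", "InvalidImageName"}


def blocking_reason(pod: dict) -> Optional[str]:
    """Return why a Pending pod looks stuck, or None if nothing blocks it."""
    status = pod.get("status") or {}
    if status.get("phase") != "Pending":
        return None

    # The scheduler reports an unschedulable pod as a PodScheduled=False condition.
    for cond in status.get("conditions") or []:
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            return "Unschedulable"

    # A scheduled pod can still be Pending while a container waits on its image.
    for container in status.get("containerStatuses") or []:
        waiting = (container.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in BLOCKING_WAIT_REASONS:
            return waiting["reason"]

    return None
```

A watcher would call this for every pending kernel pod and act only when it returns a reason, which is the point where the DELETE request described above would be needed.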

Proposed Solution

For starters, I would expect JEG to be aware of the Kubernetes cluster it is running on, so that when kernel pods are deleted, it stops polling them.
For the other issue I’ve stated, I can see two possible solutions:
The first one (and in my opinion, the easier one) is to allow DELETE requests for kernels that are pending.
The second one is to make JEG configurable so that it kills pending kernels on its own when they have events (or certain events). But this seems a bit trickier to think through properly.
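
For the first option, the request an external service would send could look like the sketch below. It only builds the request, using the standard Jupyter kernels REST path `/api/kernels/{kernel_id}`; the helper name, the example URL, and the `Authorization: token ...` header are assumptions about the deployment. As described above, the gateway does not currently act on this request while the kernel is still pending.

```python
import urllib.request


def build_kernel_delete_request(
    gateway_url: str, kernel_id: str, token: str = ""
) -> urllib.request.Request:
    """Build (but do not send) a DELETE request for one kernel."""
    url = f"{gateway_url.rstrip('/')}/api/kernels/{kernel_id}"
    request = urllib.request.Request(url, method="DELETE")
    if token:
        # Assumption: the gateway is configured with a token-based auth header.
        request.add_header("Authorization", f"token {token}")
    return request


# Sending it would be: urllib.request.urlopen(request, timeout=10)
```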

@OrenZ1
Author

OrenZ1 commented Jan 25, 2024

If I can get an update about this request, that would be great.
I will be happy to contribute and add this option, so if you can state the relevant files, I can try to implement this and contribute :)

@kevin-bates
Member

Hi @OrenZ1 - I apologize for the delay. Unfortunately, I'm unable to spend much time on EG (and Jupyter in general) these days.

I think this would be a great addition. Ideally, if we can determine that a Pending state is going to remain pending until the prescribed (and long) timeout, it would be better to abort. The location where we can detect this during the startup sequence is in the KubernetesProcessProxy, and the status loop where we could add more intelligence would be here.
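
As a rough illustration of what "more intelligence" in a status loop could mean, here is a minimal sketch of a startup wait that aborts early. It is not the actual KubernetesProcessProxy code: `wait_until_running`, `KernelStartupAborted`, and the `poll` callback (which returns `None` if the pod is gone, otherwise a `(phase, reason)` pair) are all hypothetical names.

```python
import time
from typing import Callable, Optional, Tuple


class KernelStartupAborted(RuntimeError):
    """Raised when waiting longer for the kernel pod cannot succeed."""


def wait_until_running(
    poll: Callable[[], Optional[Tuple[str, Optional[str]]]],
    launch_timeout: float,
    fatal_reasons: frozenset = frozenset({"Unschedulable"}),
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until the pod is Running, or raise as soon as startup is hopeless."""
    deadline = clock() + launch_timeout
    while True:
        observed = poll()
        if observed is None:
            # The pod was deleted out from under us, so stop polling at once.
            raise KernelStartupAborted("kernel pod no longer exists")
        phase, reason = observed
        if phase == "Running":
            return
        if phase == "Failed":
            raise KernelStartupAborted("kernel pod failed during startup")
        if reason in fatal_reasons:
            raise KernelStartupAborted(f"kernel pod is blocked: {reason}")
        if clock() >= deadline:
            raise TimeoutError("kernel did not start within the launch timeout")
        sleep(poll_interval)
```

Injecting `clock` and `sleep` is only a convenience for testing the loop without real delays.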

I hope you find that helpful but imagine you've probably poked around a bit already so let me know if this isn't what you were looking for.

Thank you for your interest and helping out!

@lresende
Member

There are multiple ways you can go about this:

  • Configure kernel image pullers to avoid delays in downloading images and reduce the startup timeouts
  • Configure culling kernels to avoid kernels wasting resources
  • If this is related to Spark, enable dynamic allocation to help reduce idle usage of resources

Also, having what @kevin-bates proposes above would not only help your use case but also fix a file-handlers leak that I have seen in the past.

@OrenZ1
Author

OrenZ1 commented Feb 29, 2024

Hi! Sorry for the delay, but I managed to make a PR for the first thing we've discussed here!
For now, the PR covers the case where the kernel pod dies while still in startup: EG raises a matching exception to the user, so there is no need to wait for the timeout.

I am still trying to think of a way to handle kernels that are stuck in the Pending state. I hope to make a separate PR for that soon too :)
#1370

@OrenZ1
Author

OrenZ1 commented Jun 2, 2024

Just created a new PR that makes it possible to configure different timeouts for different events that occur during startup, including a "0 seconds" timeout, which means the startup terminates immediately after such an event occurs.
#1383
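
The per-event timeout idea can be sketched as a small lookup. This is only an illustration of the concept, not the configuration format or code from the PR: the mapping `EVENT_TIMEOUTS`, its values, and the function `should_abort` are hypothetical.

```python
from typing import Mapping

# Hypothetical per-event limits in seconds. 0 means "abort as soon as it is seen".
EVENT_TIMEOUTS: Mapping[str, float] = {
    "FailedScheduling": 30.0,
    "FailedMount": 0.0,
}


def should_abort(
    event_reason: str,
    seconds_since_first_seen: float,
    timeouts: Mapping[str, float] = EVENT_TIMEOUTS,
) -> bool:
    """Decide whether a startup event has persisted past its configured limit."""
    limit = timeouts.get(event_reason)
    if limit is None:
        # No rule for this event, so only the overall launch timeout applies.
        return False
    return seconds_since_first_seen >= limit
```

With these example values, a `FailedMount` event ends the startup immediately, while `FailedScheduling` is tolerated for 30 seconds in case resources free up.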
