[Bug] Sidecar mode shouldn't restart head pod when head pod is deleted #4141
Conversation
Signed-off-by: 400Ping <[email protected]>
@400Ping, the change should be made in the raycluster_controller. We need to make the raycluster_controller not recreate the head pod if the cluster belongs to a RayJob, so that we can avoid races where the raycluster_controller recreates the head before the rayjob_controller checks it.
ok, thanks
Signed-off-by: 400Ping <[email protected]>
originatedFrom := utils.GetCRDType(instance.Labels[utils.RayOriginatedFromCRDLabelKey])
if originatedFrom == utils.RayJobCRD {
	logger.Info(
		"reconcilePods: Found 0 head Pods for a RayJob-managed RayCluster; skipping head creation to let RayJob controller handle the failure",
	)
	return nil
}
Won't this cause no head pod to be created at all? We still need to create the first head pod. I think you can check the RayClusterProvisioned condition to decide whether to create one or not.
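Here is a minimal sketch of that suggestion, assuming the RayClusterProvisioned condition type from the KubeRay v1 API and the existing utils label helpers; the function name shouldSkipHeadRecreation is illustrative, not the repo's actual API:

import (
	"k8s.io/apimachinery/pkg/api/meta"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	"github.com/ray-project/kuberay/ray-operator/controllers/ray/utils"
)

// shouldSkipHeadRecreation reports whether reconcilePods should skip
// creating a head Pod: only for RayJob-managed clusters whose head was
// already provisioned once. A brand-new cluster (condition not yet true)
// still gets its first head Pod.
func shouldSkipHeadRecreation(instance *rayv1.RayCluster) bool {
	originatedFrom := utils.GetCRDType(instance.Labels[utils.RayOriginatedFromCRDLabelKey])
	provisioned := meta.IsStatusConditionTrue(instance.Status.Conditions, string(rayv1.RayClusterProvisioned))
	return originatedFrom == utils.RayJobCRD && provisioned
}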
I think this is related to the flaky test in CI now.
https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/11427#019a2849-e063-47f4-8aef-9143855d8976
Let's try to fix this; it's super important.
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]> Signed-off-by: Ping <[email protected]>
Sorry, I am a bit busy taking midterms this week, so I am a bit slow to respond.
Good luck on your midterm exam, and thank you!
I tested this on my laptop, and @machichima did too.
Reproduction steps:
- Create a RayJob:
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sidecar-mode-flaky
spec:
  # In SidecarMode, the KubeRay operator injects a container into the Ray head Pod to submit the Ray job and tail logs.
  # This will avoid inter-Pod communication, which may cause network issues. For example, some users face WebSocket hangs.
  # For more details, see https://github.com/ray-project/kuberay/issues/3928#issuecomment-3187164736.
  submissionMode: "SidecarMode"
  entrypoint: python -c "import time; time.sleep(60)"
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  rayClusterSpec:
    rayVersion: '2.46.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            - name: code-sample
              configMap:
                name: ray-job-code-sample
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
- Delete the head Pod once the head container and the submitter container are both running.
- Check the RayJob CR's status: both the job status and the job deployment status should be Failed.
- You might wonder whether we should also delete the worker Pods. The answer is no: from the RayCluster we can't tell whether the RayJob used a cluster selector. One thing we can do, though, is use the deletion policy to delete the whole RayCluster if needed.
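To make the expected failure concrete, here is a hedged sketch of the rayjob_controller side, assuming controller-runtime and the KubeRay label constants; the method name checkHeadPodGone and the exact status transition are illustrations, not the repo's actual implementation:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	"github.com/ray-project/kuberay/ray-operator/controllers/ray/utils"
)

// checkHeadPodGone marks the RayJob as failed when the head Pod of its
// (already provisioned) RayCluster has disappeared, instead of waiting
// for a recreation that, with this PR, no longer happens.
func (r *RayJobReconciler) checkHeadPodGone(ctx context.Context, rayJob *rayv1.RayJob, cluster *rayv1.RayCluster) error {
	headPods := corev1.PodList{}
	if err := r.List(ctx, &headPods,
		client.InNamespace(cluster.Namespace),
		client.MatchingLabels{
			utils.RayClusterLabelKey:  cluster.Name,
			utils.RayNodeTypeLabelKey: string(rayv1.HeadNode),
		}); err != nil {
		return err
	}
	if len(headPods.Items) == 0 {
		rayJob.Status.JobStatus = rayv1.JobStatusFailed
		rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
		return r.Status().Update(ctx, rayJob)
	}
	return nil
}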
cc @rueian @andrewsykim for merge, thank you!
@400Ping, the RayJob_fails_when_head_Pod_is_deleted_when_job_is_running test is failing.
Signed-off-by: Future-Outlier <[email protected]>
[Bug] Sidecar mode shouldn't restart head pod when head pod is deleted (ray-project#4141)

* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted
  Signed-off-by: 400Ping <[email protected]>
* [Fix] Fix e2e error
  Signed-off-by: 400Ping <[email protected]>
* [Fix] fix according to rueian's comment
  Signed-off-by: 400Ping <[email protected]>
* [Chore] fix ci error
  Signed-off-by: 400Ping <[email protected]>
* Update ray-operator/controllers/ray/raycluster_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
  Signed-off-by: Ping <[email protected]>
* Update ray-operator/controllers/ray/rayjob_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
  Signed-off-by: Ping <[email protected]>
* update
  Signed-off-by: Future-Outlier <[email protected]>
* update
  Signed-off-by: Future-Outlier <[email protected]>
* Trigger CI
  Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Ping <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Rueian <[email protected]>
[Bug] Sidecar mode shouldn't restart head pod when head pod is deleted (#4141) (#4156)

* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted
* [Fix] Fix e2e error
* [Fix] fix according to rueian's comment
* [Chore] fix ci error
* Update ray-operator/controllers/ray/raycluster_controller.go
* Update ray-operator/controllers/ray/rayjob_controller.go
* update
* update
* Trigger CI

---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Ping <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: Ping <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>

Why are these changes needed?
When using sidecar mode, the head pod should not be recreated after it is deleted. The RayJob should be marked as Failed.
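For a quick manual check of this behavior, here is a hedged sketch (not the repo's e2e test); the default namespace, the label selector ray.io/node-type=head, and the two-minute timeout are assumptions for illustration:

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Connect using the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	opts := metav1.ListOptions{LabelSelector: "ray.io/node-type=head"}

	// Find and delete the single head Pod of the test RayJob's cluster.
	pods, err := cs.CoreV1().Pods("default").List(ctx, opts)
	if err != nil {
		panic(err)
	}
	if len(pods.Items) != 1 {
		panic(fmt.Sprintf("expected exactly one head pod, got %d", len(pods.Items)))
	}
	if err := cs.CoreV1().Pods("default").Delete(ctx, pods.Items[0].Name, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}

	// With the fix, no *new* head Pod should appear; a terminating copy
	// of the old one (DeletionTimestamp set) may linger for a while.
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		pods, err = cs.CoreV1().Pods("default").List(ctx, opts)
		if err == nil {
			for _, p := range pods.Items {
				if p.DeletionTimestamp == nil {
					panic("head pod was recreated: " + p.Name)
				}
			}
		}
		time.Sleep(5 * time.Second)
	}
	fmt.Println("head pod stayed deleted; now check that the RayJob status is Failed")
}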
Related issue number
Closes #4130
Checks