
Conversation

@400Ping (Contributor) commented Oct 24, 2025

Why are these changes needed?

When using sidecar mode, the head pod should not be recreated after it is deleted. The RayJob should be marked as Failed.

Related issue number

Closes #4130

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@400Ping 400Ping marked this pull request as draft October 24, 2025 16:51
@400Ping 400Ping marked this pull request as ready for review October 24, 2025 17:35
Signed-off-by: 400Ping <[email protected]>
@rueian (Collaborator) commented Oct 24, 2025

@400Ping, the change should be made in the raycluster_controller. We need to make the raycluster_controller not recreate the head pod if the cluster belongs to a RayJob, so that we can avoid races where the raycluster_controller recreates the head before the rayjob_controller checks it.

@400Ping 400Ping marked this pull request as draft October 25, 2025 00:36
@400Ping (Contributor, Author) commented Oct 25, 2025

> @400Ping, the change should be made in the raycluster_controller. We need to make the raycluster_controller not recreate the head pod if the cluster belongs to a RayJob, so that we can avoid races where the raycluster_controller recreates the head before the rayjob_controller checks it.

ok, thanks

originatedFrom := utils.GetCRDType(instance.Labels[utils.RayOriginatedFromCRDLabelKey])
if originatedFrom == utils.RayJobCRD {
	logger.Info(
		"reconcilePods: Found 0 head Pods for a RayJob-managed RayCluster; skipping head creation to let RayJob controller handle the failure",
	)
	// ...
}
A collaborator commented on this code in raycluster_controller.go:

Won't this cause no head pod to be created at all? We still need to create the first head pod. I think you can check the RayClusterProvisioned condition to decide whether to create one or not.
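
A minimal sketch of that suggestion, for illustration only (it assumes the rayv1 RayClusterProvisioned condition, k8s.io/apimachinery/pkg/api/meta, and the utils helpers from the snippet above are in scope, and that reconcilePods returns only an error):

originatedFrom := utils.GetCRDType(instance.Labels[utils.RayOriginatedFromCRDLabelKey])
// Only skip recreation if the cluster was already provisioned once: the very first
// head Pod still gets created, but a head Pod that disappears afterwards does not.
alreadyProvisioned := meta.IsStatusConditionTrue(instance.Status.Conditions, string(rayv1.RayClusterProvisioned))
if originatedFrom == utils.RayJobCRD && alreadyProvisioned {
	logger.Info("reconcilePods: skipping head Pod recreation for a RayJob-managed RayCluster; the RayJob controller will mark the job as Failed")
	return nil
}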

@Future-Outlier (Member) left a comment:

let's try to fix this, super important.

400Ping and others added 2 commits October 29, 2025 10:25
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>
@400Ping (Contributor, Author) commented Oct 29, 2025

Sorry, I am a bit busy taking midterms this week, so I am a bit slow to respond.

@400Ping 400Ping marked this pull request as ready for review October 29, 2025 02:27
@Future-Outlier (Member) commented:

> Sorry, I am a bit busy taking midterms this week, so I am a bit slow to respond.

Good luck on your midterm exam, thank you!

@Future-Outlier (Member) left a comment:

I tested this on my local laptop, and @machichima did too.


Reproduction steps:

  1. Create a RayJob:
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sidecar-mode-flaky
spec:
  # In SidecarMode, the KubeRay operator injects a container into the Ray head Pod to submit the Ray job and tail logs.
  # This will avoid inter-Pod communication, which may cause network issues. For example, some users face WebSocket hangs.
  # For more details, see https://github.com/ray-project/kuberay/issues/3928#issuecomment-3187164736.
  submissionMode: "SidecarMode"
  entrypoint: python -c  "import time; time.sleep(60)"
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  rayClusterSpec:
    rayVersion: '2.46.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.46.0
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "200m"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
          volumes:
          - name: code-sample
            configMap:
              name: ray-job-code-sample
              items:
              - key: sample_code.py
                path: sample_code.py
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.46.0
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "200m"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
  2. Delete the head Pod once the head container and the submitter container are both running.

  3. Check the RayJob CR's status (both statuses should be Failed; see the sketch after this list).

  4. You might wonder whether we should also delete the worker Pods. The answer is no, since from the RayCluster we can't tell whether the RayJob uses a cluster selector.
    One thing we can do is use the deletion policy to delete the whole RayCluster if needed.
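
For context on step 3: the expected end state is that the RayJob controller marks the job as failed once the head Pod of a sidecar-mode job is gone. A rough, illustrative sketch of that kind of status transition (not the merged code; getHeadPod is a hypothetical helper, and the real logic lives in rayjob_controller.go):

// Sketch only: getHeadPod is a hypothetical helper; rayv1 and controller-runtime
// (ctrl) are assumed to be imported as in the KubeRay controllers.
headPod, err := getHeadPod(ctx, r.Client, rayClusterInstance)
if err != nil {
	return ctrl.Result{}, err
}
if headPod == nil {
	// The head Pod (and the sidecar submitter in it) is gone, so the job can
	// never finish; surface this as a terminal failure on the RayJob.
	rayJobInstance.Status.JobStatus = rayv1.JobStatusFailed
	rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
	if err := r.Status().Update(ctx, rayJobInstance); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}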

@Future-Outlier (Member) commented:

cc @rueian @andrewsykim for merge, thank you!

@rueian (Collaborator) commented Oct 29, 2025

@400Ping, the RayJob_fails_when_head_Pod_is_deleted_when_job_is_running test is failing.

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
@Future-Outlier (Member) left a comment:

The latest change works in my local environment.
I downloaded the log from Buildkite and found that the K8s job mode e2e test failed.
Based on the log, I think this is a temporary fix.
However, I did find that I can refactor the RayJob controller's logic better; I will open a follow-up PR soon.


cc @rueian @andrewsykim

Signed-off-by: Future-Outlier <[email protected]>
@rueian rueian merged commit c21938d into ray-project:master Oct 30, 2025
32 of 38 checks passed
rueian pushed a commit to rueian/kuberay that referenced this pull request Oct 30, 2025
ray-project#4141)

* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted

Signed-off-by: 400Ping <[email protected]>

* [Fix] Fix e2e error

Signed-off-by: 400Ping <[email protected]>

* [Fix] fix according to rueian's comment

Signed-off-by: 400Ping <[email protected]>

* [Chore] fix ci error

Signed-off-by: 400Ping <[email protected]>

* Update ray-operator/controllers/ray/raycluster_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>

* Update ray-operator/controllers/ray/rayjob_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* Trigger CI

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Ping <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Rueian <[email protected]>
andrewsykim pushed a commit that referenced this pull request Oct 30, 2025
#4141) (#4156)

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Ping <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: Ping <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>