
Conversation

@win5923 (Collaborator) commented Aug 18, 2025

Why are these changes needed?

  • RayJob Volcano support: Adds Volcano scheduler support for the RayJob CRD (see the label sketch below).
  • Gang scheduling: Ensures that the Ray pods and the submitter pod are scheduled together as a unit, preventing partial scheduling issues.
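
A RayJob is routed to Volcano and to a queue with the same labels already used for RayCluster; the snippet below is taken from the HTTPMode sample shown further down:

metadata:
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue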

E2E

  1. Deploy the KubeRay operator with the Volcano batch scheduler enabled:
./ray-operator/bin/manager -leader-election-namespace default -use-kubernetes-proxy -batch-scheduler=volcano
  2. Create a RayJob with a head node (1 CPU + 2Gi of RAM), two workers (1 CPU + 1Gi of RAM each) and one submitter pod (0.5 CPU + 200Mi of RAM), for a total of 3500m CPU and 4296Mi of RAM:
kubectl apply -f ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml
  3. Add an additional RayJob with the same configuration but a different name:
sed 's/rayjob-sample-0/rayjob-sample-1/' ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml | kubectl apply -f-
  4. All pods of the new RayJob remain stuck in Pending, since the queue cannot fit a second full gang (inspect them with the commands below).
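
The PodGroups and the Queue created for the two RayJobs can then be inspected (these are the same objects dumped below):

kubectl get podgroup ray-rayjob-sample-0-pg -o yaml
kubectl get podgroup ray-rayjob-sample-1-pg -o yaml
kubectl get queue kuberay-test-queue -o yaml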

PodGroup

  • ray-rayjob-sample-0-pg:
$ k get podgroup ray-rayjob-sample-0-pg  -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2025-09-25T15:16:14Z"
  generation: 3
  name: ray-rayjob-sample-0-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: RayJob
    name: rayjob-sample-0
    uid: e7652cc7-7593-4bd1-8ab1-bc043e62d7e5
  resourceVersion: "8779"
  uid: 84247ace-fcb5-4bce-9e18-b33e3769b941
spec:
  minMember: 3
  minResources:
    cpu: 3500m
    memory: 4296Mi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2025-09-25T15:16:15Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 6ccaf1db-e4f6-4cfa-ad71-f3abf039e03c
    type: Scheduled
  phase: Running
  running: 1
  • ray-rayjob-sample-1-pg:
$ k get podgroup ray-rayjob-sample-1-pg  -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2025-09-25T15:17:54Z"
  generation: 2
  name: ray-rayjob-sample-1-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: RayJob
    name: rayjob-sample-1
    uid: 3a98a4fe-19a5-4f36-9ba3-ebd252c5a267
  resourceVersion: "9080"
  uid: 0dde7617-6fa7-4867-97f6-2deb965170a1
spec:
  minMember: 3
  minResources:
    cpu: 3500m
    memory: 4296Mi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2025-09-25T15:17:55Z"
    message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
      3 minAvailable; Pending: 3 Unschedulable'
    reason: NotEnoughResources
    status: "True"
    transitionID: cfb01bbc-c53b-42e4-9b02-8e56c46b8e6c
    type: Unschedulable
  phase: Pending

Queue

$ k get queue kuberay-test-queue -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"kuberay-test-queue"},"spec":{"capability":{"cpu":4,"memory":"6Gi"},"weight":1}}
  creationTimestamp: "2025-09-25T15:17:54Z"
  generation: 2
  name: kuberay-test-queue
  resourceVersion: "9089"
  uid: 2690f4ca-aa29-4812-aa21-3d0228dfa271
spec:
  capability:
    cpu: 4
    memory: 6Gi
  parent: root
  reclaimable: true
  weight: 1
status:
  allocated:
    cpu: "3"
    memory: 4Gi
    pods: "3"
  reservation: {}
  state: Open
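
For reference, the Queue used in this test, reconstructed from the kubectl.kubernetes.io/last-applied-configuration annotation above, is simply:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi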

Testing RayJob HTTPMode

In HTTPMode the job is submitted to the Ray dashboard over HTTP and no submitter pod is created, so the PodGroup below only covers the head and worker pods (3 CPU / 4Gi, rather than the 3500m / 4296Mi seen above when a submitter pod is included).

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-2
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  submissionMode: HTTPMode
$ k get podgroup ray-rayjob-sample-2-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2025-09-25T15:20:15Z"
  generation: 2
  name: ray-rayjob-sample-2-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: RayJob
    name: rayjob-sample-2
    uid: cc971b5a-347f-4e12-bc24-e6a65710f8d8
  resourceVersion: "9342"
  uid: e87b6399-74fa-4b42-9d88-3413c5b84865
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2025-09-25T15:20:16Z"
    message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
      3 minAvailable; Pending: 3 Unschedulable'
    reason: NotEnoughResources
    status: "True"
    transitionID: c5c9e5c2-1bd4-4b0e-b405-78331ea6caf1
    type: Unschedulable
  phase: Pending

Related issue number

Closes #1580

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@win5923 win5923 changed the title from "[POC] RayJob Volcano Integration" to "RayJob Volcano Integration" Aug 18, 2025
@win5923 win5923 force-pushed the rayjob-volcano branch 8 times, most recently from bc3811c to d591688 on September 11, 2025 16:10
@win5923 win5923 marked this pull request as ready for review September 11, 2025 16:23
@win5923 win5923 marked this pull request as draft September 22, 2025 13:59
@win5923 win5923 force-pushed the rayjob-volcano branch 7 times, most recently from 26af624 to ace94b2 on September 23, 2025 15:53
@win5923 win5923 marked this pull request as ready for review September 23, 2025 15:54
@win5923 win5923 force-pushed the rayjob-volcano branch 3 times, most recently from c10c53e to 9cd200c on September 23, 2025 16:37
)

const (
PodGroupName = "podgroups.scheduling.volcano.sh"
@win5923 (Collaborator, Author) commented:

This variable is unused and can be removed, I think.

@troychiu (Collaborator) left a comment:

I am wondering if we should separate this PR into two. One for interface migration and one for RayJob Volcano Integration. This makes both review and rollback easier IMO.

@win5923 (Collaborator, Author) commented Sep 24, 2025:

I am wondering if we should separate this PR into two. One for interface migration and one for RayJob Volcano Integration. This makes both review and rollback easier IMO.

Got it! I will change this PR to focus on the RayJob Volcano integration, and will create a separate PR for the interface migration.
Thanks.

@win5923 win5923 force-pushed the rayjob-volcano branch 3 times, most recently from d921269 to dfeed19 on September 24, 2025 14:35
@win5923 (Collaborator, Author) commented Sep 24, 2025:

Created an issue for the interface migration:
#4097

}

// handleRayJob calculates the PodGroup MinMember and MinResources for a RayJob
// The submitter pod is intentionally excluded from MinMember calculation.
A collaborator commented:

Do you mind elaborating on why we want to exclude the submitter pod? Also, if we exclude the submitter pod from resource calculation, what is the main goal of this RayJob integration?

@win5923 (Collaborator, Author) replied:

The submitter pod needs to wait until all RayCluster pods are ready before it is created. However, this leads to a minMember mismatch in the PodGroup, causing the RayCluster to remain in a Pending state.
As a temporary workaround, I’ve configured the RayJob to use the same PodGroup settings as the RayCluster.

@owenowenisme (Collaborator) commented Sep 25, 2025:

So, currently gang scheduling does not consider the submitter pod? (If I'm right, it doesn't account for the submitter pod's resources, according to the PodGroup spec below.)

k describe podgroup ray-rayjob-sample-1-pg
Name:         ray-rayjob-sample-1-pg
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Spec:
  Min Member:  3
  Min Resources:
    Cpu:     3
    Memory:  4Gi
  Queue:     kuberay-test-queue

I think that even with the minMember mismatch we could still calculate the correct minResources, so that gang scheduling takes the submitter's resource requirements into account.

In other words, would minMember = worker num + 1 (head) and minResources = worker resources + head resources + submitter resources solve the problem?

@win5923 (Collaborator, Author) replied Sep 25, 2025:

I think that even with the minMember mismatch we could still calculate the correct minResources, so that gang scheduling takes the submitter's resource requirements into account.

In other words, would minMember = worker num + 1 (head) and minResources = worker resources + head resources + submitter resources solve the problem?

Sure, thanks for your suggestion! Updated in a96d3a4.

$ k get podgroup ray-rayjob-sample-0-pg  -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2025-09-25T15:16:14Z"
  generation: 3
  name: ray-rayjob-sample-0-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: RayJob
    name: rayjob-sample-0
    uid: e7652cc7-7593-4bd1-8ab1-bc043e62d7e5
  resourceVersion: "8779"
  uid: 84247ace-fcb5-4bce-9e18-b33e3769b941
spec:
  minMember: 3
  minResources:
    cpu: 3500m
    memory: 4296Mi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2025-09-25T15:16:15Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 6ccaf1db-e4f6-4cfa-ad71-f3abf039e03c
    type: Scheduled
  phase: Running
  running: 1
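
For readers following the thread, here is a minimal Go sketch of the convention settled on above (not the actual KubeRay implementation; the function and helper names are hypothetical): minMember counts only the head and worker pods, while minResources additionally reserves the submitter pod's resources.

package volcanosketch

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// sumResources adds up one named resource across several pod-level ResourceLists.
func sumResources(lists []corev1.ResourceList, name corev1.ResourceName) resource.Quantity {
    total := resource.Quantity{}
    for _, l := range lists {
        if q, ok := l[name]; ok {
            total.Add(q)
        }
    }
    return total
}

// podGroupSpecForRayJob returns (minMember, minResources): the submitter pod is
// excluded from minMember (it is only created once the RayCluster is ready), but
// its resources are still reserved in minResources.
func podGroupSpecForRayJob(head corev1.ResourceList, workers []corev1.ResourceList, submitter corev1.ResourceList) (int32, corev1.ResourceList) {
    minMember := int32(1 + len(workers)) // head + workers; submitter excluded

    all := append([]corev1.ResourceList{head, submitter}, workers...)
    return minMember, corev1.ResourceList{
        corev1.ResourceCPU:    sumResources(all, corev1.ResourceCPU),
        corev1.ResourceMemory: sumResources(all, corev1.ResourceMemory),
    }
}

With the sample sizes used in this PR, this yields minMember: 3 and minResources of 3500m CPU / 4296Mi, matching the PodGroup output above.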

}

return v.syncPodGroup(ctx, app, minMember, totalResource)
totalResourceList := []corev1.ResourceList{{}}
A collaborator asked:

Out of curiosity, why is there an empty ResourceList?

@win5923 (Collaborator, Author) replied:

It is passed to sumResourceList(list []corev1.ResourceList) corev1.ResourceList to calculate the total required resources.

The collaborator followed up:

However, an empty ResourceList has no name or quantity entries, so sumResourceList just skips the inner loop and moves on to the next element in the slice.

@win5923 (Collaborator, Author) replied Sep 26, 2025:

Oh, you are right! I was just following the existing code in util.go:

func CalculateDesiredResources(cluster *rayv1.RayCluster) corev1.ResourceList {
    desiredResourcesList := []corev1.ResourceList{{}}
    headPodResource := CalculatePodResource(cluster.Spec.HeadGroupSpec.Template.Spec)
    desiredResourcesList = append(desiredResourcesList, headPodResource)
    for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
        if nodeGroup.Suspend != nil && *nodeGroup.Suspend {
            continue
        }
        podResource := CalculatePodResource(nodeGroup.Template.Spec)
        calculateReplicaResource(&podResource, nodeGroup.NumOfHosts)
        for i := int32(0); i < *nodeGroup.Replicas; i++ {
            desiredResourcesList = append(desiredResourcesList, podResource)
        }
    }
    return sumResourceList(desiredResourcesList)
}

func CalculateMinResources(cluster *rayv1.RayCluster) corev1.ResourceList {
    minResourcesList := []corev1.ResourceList{{}}
    headPodResource := CalculatePodResource(cluster.Spec.HeadGroupSpec.Template.Spec)
    minResourcesList = append(minResourcesList, headPodResource)
    for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
        podResource := CalculatePodResource(nodeGroup.Template.Spec)
        calculateReplicaResource(&podResource, nodeGroup.NumOfHosts)
        for i := int32(0); i < *nodeGroup.MinReplicas; i++ {
            minResourcesList = append(minResourcesList, podResource)
        }
    }
    return sumResourceList(minResourcesList)
}

I think we can open a follow-up PR to clean up these redundant empty ResourceList initializations.

Successfully merging this pull request may close these issues:

  • [Bug] RayJob Volcano integration