Unique node selector and toleration per replica #223

Open · 2 of 3 tasks
ahg-g opened this issue Sep 14, 2024 · 4 comments

Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@ahg-g (Contributor) commented Sep 14, 2024

What would you like to be added:

Allow injecting a unique nodeSelector and toleration for each LWS replica, to trigger the cluster autoscaler to create a dedicated placement group for each replica.

In the API, the user sets the key they would like to use; the value will be the name of the replica (the leader pod name):

ReplicaUniqueNodeSelector: 
   - compact-placement-group

The result is a nodeSelector injected as follows:
compact-placement-group: <lws-leader-name>

Similarly for tolerations:

ReplicaUniqueToleration:
      - key: compact-placement-group
        effect: NoSchedule

The result is a toleration injected into the pods of each group as follows:

      - key: compact-placement-group
        operator: Equal
        value: <lws-leader-name>
        effect: NoSchedule
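
To make the intended behavior concrete, here is a minimal Go sketch of the injection step, assuming hypothetical names (ReplicaUniqueScheduling and injectReplicaUniqueScheduling are illustrative only, not the actual LWS API):

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// ReplicaUniqueScheduling mirrors the fields proposed above; the exact names
// and their placement under leaderWorkerTemplate are assumptions.
type ReplicaUniqueScheduling struct {
    // Keys of the nodeSelector entries to inject; the value is always the leader pod name.
    NodeSelectorKeys []string
    // Tolerations to inject; the value is set to the leader pod name.
    Tolerations []corev1.Toleration
}

// injectReplicaUniqueScheduling adds the per-replica nodeSelector entries and
// tolerations to a pod spec, using the leader pod name as the value.
func injectReplicaUniqueScheduling(spec *corev1.PodSpec, cfg ReplicaUniqueScheduling, leaderPodName string) {
    if spec.NodeSelector == nil {
        spec.NodeSelector = map[string]string{}
    }
    for _, key := range cfg.NodeSelectorKeys {
        spec.NodeSelector[key] = leaderPodName
    }
    for _, t := range cfg.Tolerations {
        spec.Tolerations = append(spec.Tolerations, corev1.Toleration{
            Key:      t.Key,
            Operator: corev1.TolerationOpEqual,
            Value:    leaderPodName,
            Effect:   t.Effect,
        })
    }
}

func main() {
    cfg := ReplicaUniqueScheduling{
        NodeSelectorKeys: []string{"compact-placement-group"},
        Tolerations: []corev1.Toleration{
            {Key: "compact-placement-group", Effect: corev1.TaintEffectNoSchedule},
        },
    }
    var spec corev1.PodSpec
    injectReplicaUniqueScheduling(&spec, cfg, "leaderworkerset-multi-template-0")
    fmt.Printf("nodeSelector: %v\ntolerations: %+v\n", spec.NodeSelector, spec.Tolerations)
}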

Why is this needed:
To force the cluster autoscaler to create a node group per replica, which can be necessary to obtain compactly placed nodes (for example, on the same rack) for better network performance, and can improve multi-host GPU inference.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@ahg-g added the kind/feature label on Sep 14, 2024
@googs1025 (Member) commented:

I'm willing to give it a try. :)
/assign

Would it be better to provide a Google Doc first?

@googs1025 (Member) commented Sep 17, 2024

Sorry, I don't quite understand how compact-placement-group is defined.
Does compact-placement-group mean the name of a leader pod or a user-defined field name?

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-multi-template
spec:
  replicas: 3
  leaderWorkerTemplate:
    ReplicaSpecificNodeSelector: compact-placement-group
    ReplicaSpecificToleration:
      - key: compact-placement-group
    leaderTemplate:
      spec:
        containers:
          - name: nginx2
...

@ahg-g (Contributor, Author) commented Sep 18, 2024

compact-placement-group is a string that the user sets; we use it as the key of a nodeSelector entry whose value is the leader pod name.

So the snippet you have for the API is correct; the outcome is that we inject a nodeSelector for each group as follows:

nodeSelector:
  compact-placement-group: <leader-pod-name>

@googs1025 (Member) commented:

Thanks for the explanation.
When you have time, please check whether this is what we want. @ahg-g

Update:

Currently I envision injecting the values of these fields into the StatefulSets, as shown in the following example:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-multi-template
spec:
  replicas: 3
  leaderWorkerTemplate:
    replicaSpecificNodeSelector:
      - test
    replicaSpecificToleration:
      - key: test
        effect: NoSchedule
    leaderTemplate:
      spec:
        containers:
        - name: nginx2
          image: nginx:1.14.2
          resources:
            limits:
              cpu: "100m"
            requests:
              cpu: "50m"
          ports:
          - containerPort: 8080
    size: 4
    workerTemplate:
      spec:
        containers:
        - name: nginx
          image: nginx:1.14.2
          resources:
            limits:
              cpu: "100m"
            requests:
              cpu: "50m"
          ports:
          - containerPort: 8080

The Node-Selectors and Tolerations fields will be injected into the pod template of each group's StatefulSet.
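
A rough sketch of how that injection might look in the controller, assuming a hypothetical helper (applyReplicaUniqueScheduling and its signature are illustrative, not the actual LWS code):

package controller

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
)

// applyReplicaUniqueScheduling sets the per-replica nodeSelector and toleration
// values on the pod template of the StatefulSet built for a given group index.
// The leader pod name follows the <lws-name>-<group-index> pattern seen in the
// output below.
func applyReplicaUniqueScheduling(sts *appsv1.StatefulSet, lwsName string, groupIndex int, selectorKeys []string, tolerations []corev1.Toleration) {
    leaderPodName := fmt.Sprintf("%s-%d", lwsName, groupIndex)
    podSpec := &sts.Spec.Template.Spec
    if podSpec.NodeSelector == nil {
        podSpec.NodeSelector = map[string]string{}
    }
    for _, key := range selectorKeys {
        // e.g. test: leaderworkerset-multi-template-0
        podSpec.NodeSelector[key] = leaderPodName
    }
    for _, t := range tolerations {
        podSpec.Tolerations = append(podSpec.Tolerations, corev1.Toleration{
            Key:      t.Key,
            Operator: corev1.TolerationOpEqual,
            Value:    leaderPodName,
            Effect:   t.Effect,
        })
    }
}

The resulting StatefulSet then carries the injected Node-Selectors and Tolerations: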

root@VM-0-16-ubuntu:/home/ubuntu# kubectl get sts
NAME                               READY   AGE
leaderworkerset-multi-template     3/3     145m
leaderworkerset-multi-template-0   3/3     145m
leaderworkerset-multi-template-1   3/3     145m
leaderworkerset-multi-template-2   0/3     145m
root@VM-0-16-ubuntu:/home/ubuntu# kubectl describe sts leaderworkerset-multi-template-0
Name:               leaderworkerset-multi-template-0
Namespace:          default
CreationTimestamp:  Tue, 22 Oct 2024 13:33:17 +0800
Selector:           leaderworkerset.sigs.k8s.io/group-index=0,leaderworkerset.sigs.k8s.io/group-key=689ce1b52864f5b6433d403de39845ba1ab94b07,leaderworkerset.sigs.k8s.io/name=leaderworkerset-multi-template
Labels:             leaderworkerset.sigs.k8s.io/group-index=0
                    leaderworkerset.sigs.k8s.io/group-key=689ce1b52864f5b6433d403de39845ba1ab94b07
                    leaderworkerset.sigs.k8s.io/name=leaderworkerset-multi-template
                    leaderworkerset.sigs.k8s.io/template-revision-hash=0f4e30acf40aef19cbbc2456d8652c7fc6d62705
Annotations:        <none>
Replicas:           3 desired | 3 total
Update Strategy:    RollingUpdate
  Partition:        0
Pods Status:        3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       leaderworkerset.sigs.k8s.io/group-index=0
                leaderworkerset.sigs.k8s.io/group-key=689ce1b52864f5b6433d403de39845ba1ab94b07
                leaderworkerset.sigs.k8s.io/name=leaderworkerset-multi-template
                leaderworkerset.sigs.k8s.io/template-revision-hash=0f4e30acf40aef19cbbc2456d8652c7fc6d62705
  Annotations:  leaderworkerset.sigs.k8s.io/leader-name: leaderworkerset-multi-template-0
                leaderworkerset.sigs.k8s.io/size: 4
  Containers:
   nginx:
    Image:      nginx:1.14.2
    Port:       8080/TCP
    Host Port:  0/TCP
    Limits:
      cpu:  100m
    Requests:
      cpu:         50m
    Environment:   <none>
    Mounts:        <none>
  Volumes:         <none>
  Node-Selectors:  test=leaderworkerset-multi-template-0
  Tolerations:     test=leaderworkerset-multi-template-0:NoSchedule
Volume Claims:     <none>
Events:            <none>

When no node carries the matching label (test=<leader-pod-name> in this example), the pods of that StatefulSet will not be scheduled:

root@VM-0-16-ubuntu:/home/ubuntu# kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
leaderworkerset-multi-template-0     1/1     Running   0          152m
leaderworkerset-multi-template-0-1   1/1     Running   0          152m
leaderworkerset-multi-template-0-2   1/1     Running   0          152m
leaderworkerset-multi-template-0-3   1/1     Running   0          152m
leaderworkerset-multi-template-1     1/1     Running   0          152m
leaderworkerset-multi-template-1-1   1/1     Running   0          152m
leaderworkerset-multi-template-1-2   1/1     Running   0          152m
leaderworkerset-multi-template-1-3   1/1     Running   0          152m
leaderworkerset-multi-template-2     1/1     Running   0          152m
leaderworkerset-multi-template-2-1   0/1     Pending   0          152m
leaderworkerset-multi-template-2-2   0/1     Pending   0          152m
leaderworkerset-multi-template-2-3   0/1     Pending   0          152m
