GKE Autopilot OutOfcpu #2527
-
Hi, I'm running ARC on GKE Autopilot. I mentioned this in another discussion, actions/runner-container-hooks#50 (comment):

> The runner pod and the workflow pod are forced onto the exact same node? It seems likely that node won't have enough resources. GKE Autopilot launches small nodes and relies on its autoscaler to add more nodes when necessary. Does the design of https://github.com/actions/actions-runner-controller require that they be on the same node? If so, what method can you use in Kubernetes to be sure the nodes have plenty of extra space, so that the pods can be collocated and there is room to do so? Is it possible to always be certain the runner pod and the workflow pod can fit on the same host node?
Replies: 6 comments
-
I believe this is currently the case, yes.
I would have a look at the CPU that the runner actually uses and set the CPU request on the runner container accordingly.
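For anyone hitting OutOfcpu here, a minimal sketch of what that could look like, assuming the gha-runner-scale-set Helm chart's `template` override (the request values are placeholders to tune against your own jobs):

```yaml
# values.yaml sketch for the gha-runner-scale-set chart.
# The request sizes below are examples only; measure your runner's
# real usage and adjust. On Autopilot, limits are set equal to requests.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
```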
-
Hey 👋 The reason we use the same node is to share the `work` volume mount in ARC, which is an `EmptyDir` volume. Basically, to work with the default ARC configuration without many modifications, we needed the job pod to land on the same node so it can share the kubelet's `EmptyDir` mount. You can get around this problem by specifying a volume for the `work` volume mount that is not an empty directory, but the hook would still pick the same node to run the job pod on. We are currently implementing an extension that will allow you to pass a podSpec and a container spec that can override and extend this default behaviour and make the hook more flexible.
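As a concrete illustration of the workaround mentioned above, here is a sketch that backs the `work` directory with a dynamically provisioned PVC instead of an `EmptyDir`, assuming the gha-runner-scale-set chart's `containerMode.kubernetesModeWorkVolumeClaim` setting (storage class and size are placeholders):

```yaml
# values.yaml sketch: use a PVC for the runner's work directory
# instead of the default EmptyDir. Field names assume the
# gha-runner-scale-set chart; adjust for your chart version.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard-rwo"   # placeholder; any RWO class works
    resources:
      requests:
        storage: 10Gi
```

Note that with a ReadWriteOnce claim the job pod still has to run on the node where the volume is attached, which matches the caveat above that the hook still picks the same node.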
-
If a CI job needs 8GB of memory, for example, and that request is set on the workflow pod, it seems likely that the "same node" (on a fresh GKE Autopilot cluster) will not have an extra 8GB available, so the job is going to fail. It's better to let the cluster autoscaler launch new nodes and use those, which depends on not being tied to the same node. I don't think the hack of having the runner pod request more memory will fix it: if the runner pod requests 16GB, it will be placed on a node that has at least 16GB, but all of that memory is assigned to the runner pod. When a workflow pod comes along with its own request, that amount is counted as additional, and if the additional amount isn't available on the node, scheduling fails. "Autopilot automatically sets resource limits equal to requests."
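To make the accounting concrete, here is a hypothetical sketch (pod names and sizes invented for illustration) of why over-provisioning the runner pod doesn't leave room for the job pod:

```yaml
# Hypothetical node with 16Gi allocatable memory.
# The runner pod already reserves all of it:
apiVersion: v1
kind: Pod
metadata:
  name: runner-example              # hypothetical name
spec:
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      resources:
        requests:
          memory: 16Gi              # Autopilot sets the limit to 16Gi too
---
# The workflow pod's request is counted on top of the runner's, so the
# scheduler now needs 16Gi + 8Gi = 24Gi on that node and scheduling fails.
apiVersion: v1
kind: Pod
metadata:
  name: workflow-job-example        # hypothetical name
spec:
  containers:
    - name: job
      image: ubuntu:22.04
      resources:
        requests:
          memory: 8Gi
```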
-
It would be great if a tutorial were added to the docs: a quickstart scenario, but specifically for running on GKE Autopilot.
-
@sdarwin Hi! I'm in the midst of configuring self-hosted runners on GCP. Is it true that ARC can run with GKE Autopilot? Right now I think I have three options that I'm considering (one of them being GCE managed instance groups with Spot VMs), and I'm not sure which is best.
-
A year has passed, so to review the situation from before:

Docker-in-Docker (dind) mode would not work on Autopilot because it requires privileged containers.

Kubernetes container mode would not work on Autopilot because it collocates the runner pod and the workflow pod onto the same node, and it's almost guaranteed the node won't have sufficient CPU. When using Kubernetes, it's important to let the native scheduler place pods where there is space, not force them onto the same small instance, which is likely to be maxed out.

Has the issue been fixed recently? GKE Autopilot is the mode of operation Google recommends to customers and a popular way to run Kubernetes, yet the term "autopilot" doesn't appear anywhere in https://github.com/actions/actions-runner-controller/ . Could documentation be added, perhaps in an FAQ section?
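For reference, this is roughly what the privileged requirement looks like in a dind-style runner pod; a minimal, hypothetical sketch rather than ARC's actual generated manifest, and it's the `privileged: true` part that Autopilot rejects:

```yaml
# Hypothetical dind-style runner pod sketch; not ARC's generated manifest.
apiVersion: v1
kind: Pod
metadata:
  name: runner-dind-example         # hypothetical name
spec:
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
    - name: dind
      image: docker:dind
      securityContext:
        privileged: true            # needed by Docker-in-Docker, rejected by Autopilot
```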