KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK #2324
/hold
Pull Request Test Coverage Report for Build 12281612841 (Coveralls)
# Create the TrainJob.
try:
    self.custom_api.create_namespaced_custom_object(
Should we check the status of this API call, i.e. whether it returns 200 or not?
I think that if this API call fails, it throws an exception:
training-operator/sdk_v2/kubeflow/training/api/training_client.py
Lines 205 to 209 in e9ff7d1
except Exception:
    raise RuntimeError(
        f"Failed to create {constants.TRAINJOB_KIND}: {self.namespace}/{train_job_name}"
    )
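To make the reply above concrete: the Kubernetes Python client raises an exception for non-2xx responses instead of returning a status code, so the `except` block is where a failure surfaces. Below is a minimal sketch of how such a wrapper could keep the HTTP status visible. `FakeApiError` and `fake_create` are hypothetical stand-ins for the real client so the snippet runs without a cluster; it is not the SDK's actual code.

```python
class FakeApiError(Exception):
    """Hypothetical stand-in for the client's API exception; carries an HTTP status."""

    def __init__(self, status: int, reason: str):
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def fake_create(name: str) -> dict:
    """Hypothetical stand-in for create_namespaced_custom_object."""
    if name == "duplicate":
        raise FakeApiError(409, "Conflict")
    return {"metadata": {"name": name}}


def create_train_job(namespace: str, name: str) -> str:
    try:
        fake_create(name)
    except FakeApiError as e:
        # Chain the original error so the status code is not lost.
        raise RuntimeError(
            f"Failed to create TrainJob: {namespace}/{name} (HTTP {e.status})"
        ) from e
    return name
```

Chaining with `from e` keeps the original error available as `__cause__`, which a caller can inspect if the message alone is not enough.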
PHASE_PRE_TRAINING = "pre-training"

# The value indicates that runtime can be used for the model pre-training.
PHASE_PRE_TRAINING = "post-training"
I guess it should be PHASE_POST_TRAINING.
Great catch!
nit: please change the comment as well, @andreyvelich:
"The value indicates that runtime can be used for the model pre-training."
Thanks!
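For reference, here is a sketch of what the pair of constants might look like once both the name and the comment are corrected. The names follow the suggestion in this thread; the final code in the PR may differ.

```python
# The value indicates that runtime can be used for the model pre-training.
PHASE_PRE_TRAINING = "pre-training"

# The value indicates that runtime can be used for the model post-training.
PHASE_POST_TRAINING = "post-training"
```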
@dataclass
class HuggingFaceDatasetConfig:
    storage_uri: str
    access_token: Optional[str] = None


@dataclass
# Configuration for the HuggingFace model provider.
class HuggingFaceModelInputConfig:
    storage_uri: str
    access_token: Optional[str] = None
Do we plan to move HuggingFaceDatasetConfig and HuggingFaceModelInputConfig here in this PR?
I have the same question. We have defined dataset- and model-related classes in pkg/initializer_v2.
training-operator/pkg/initializer_v2/dataset/config.py
Lines 5 to 9 in 94dee0e
# TODO (andreyvelich): This should be moved under Training V2 SDK.
@dataclass
class HuggingFaceDatasetConfig:
    storage_uri: str
    access_token: Optional[str] = None
training-operator/pkg/initializer_v2/model/config.py
Lines 5 to 9 in 94dee0e
# TODO (andreyvelich): This should be moved under Training V2 SDK.
@dataclass
class HuggingFaceModelInputConfig:
    storage_uri: str
    access_token: Optional[str] = None
I guess it would be better to move these .py files from pkg/initializer_v2 to the SDK directory when generating the SDK with setup.py. For example: https://github.com/kubeflow/katib/blob/2b41ae62ab3905984e02123218351a703c03bf56/sdk/python/v1beta1/setup.py#L33-L39
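A rough sketch of the copy-at-build-time idea follows. `copy_initializer_configs` is a hypothetical helper that a `setup.py` could call before packaging, and the directory layout is illustrative only. The demo runs against a temporary directory rather than the real repository.

```python
import shutil
import tempfile
from pathlib import Path


def copy_initializer_configs(src_root: Path, dst_root: Path) -> list:
    """Copy every <subdir>/config.py under src_root into dst_root.

    Hypothetical helper: names and layout are illustrative only.
    """
    copied = []
    for src in sorted(src_root.glob("*/config.py")):
        dst = dst_root / src.parent.name / src.name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        copied.append(dst)
    return copied


# Demo against a throwaway directory tree instead of the real repository.
with tempfile.TemporaryDirectory() as tmp:
    src_root = Path(tmp) / "pkg" / "initializer_v2"
    for sub in ("dataset", "model"):
        (src_root / sub).mkdir(parents=True)
        (src_root / sub / "config.py").write_text("# config\n")
    copied = copy_initializer_configs(src_root, Path(tmp) / "sdk" / "initializer")
    copied_names = [p.parent.name for p in copied]
```

Copying at build time keeps a single source of truth for the dataclasses, at the cost of the SDK package depending on the repository layout when it is built.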
Yes, I will move them soon.
@seanlaii @Electronic-Waste I've completed this.
Basically LGTM. I'm willing to help if needed:)
    return train_job_name

def list_jobs(self) -> List[types.TrainJob]:
Just a thought: do we want to pass the runtimeRef name as an optional argument here? It might be a good idea to be able to get TrainJobs for a particular runtime.
Sure, good idea.
I added this parameter @saileshd1402
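As an illustration of the suggestion, here is a minimal sketch of filtering by an optional runtime reference. The `TrainJob` shape and the `runtime_ref` parameter name are assumptions made for this example; the parameter actually added in the PR may be named and implemented differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrainJob:
    # Simplified, hypothetical shape; the SDK's real type has more fields.
    name: str
    runtime_ref: str


def list_jobs(jobs: List[TrainJob], runtime_ref: Optional[str] = None) -> List[TrainJob]:
    """Return all jobs, or only those that reference the given runtime."""
    if runtime_ref is None:
        return list(jobs)
    return [j for j in jobs if j.runtime_ref == runtime_ref]
```

Keeping the argument optional preserves the existing behavior for callers that want every TrainJob in the namespace.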
I think this PR should be ready for review, to provide users with initial support for the Python SDK.
Force-pushed from ee57831 to 2abf580.
/hold cancel
config.load_incluster_config()

k8s_client = client.ApiClient(client_configuration)
self.custom_api = client.CustomObjectsApi(k8s_client)
Do these clients include retries on failure? Maybe we could have a wrapper client that adds retries.
Yes, I think it is a good idea.
@akshaychitneni Can we have a follow-up PR to implement it?
Could you please create an issue to track it?
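One possible shape for such a wrapper is sketched below. `call_with_retries` is a hypothetical helper, and the choice of retryable exceptions and the exponential backoff are assumptions for illustration, not something decided in this thread.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

A client method could then be wrapped as `call_with_retries(lambda: api.some_call(...))`, where `api.some_call` is a placeholder. Only idempotent calls are safe to retry blindly, so a real wrapper would likely need a per-method policy.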
@andreyvelich Lots of work! I left a few comments for you.
I think we should co-design configs for Pre-Training and Fine-Tuning in the Trainer parameter.
# The label key to identify the device (e.g. GPU, TPU) that is used for training.
# TODO: Potentially, we should get this data from the Node selectors.
DEVICE_KEY = "training.kubeflow.org/device"
Here, what exactly does the "device" mean?
Computing Device? Persistent Storage device? Ephemeral Storage device like memory and extended persistent memory? NIC device?
I think, the main goal of this is to show Data Scientists which device will be used by the Training Runtime to perform model training.
This will help them to understand the accelerator (e.g. V100, H100) on which their model training will be run.
In that case, should we just name it "accelerator"?
"Device" sounds odd, since you seem to aim to support only accelerators (including FPGAs) and do not plan to support any other special devices.
It's a good point. However, I was exploring how to make it closer to ML frameworks, which call this API "device":
- PyTorch: https://pytorch.org/docs/stable/generated/torch.cuda.device.html
- MLX: https://ml-explore.github.io/mlx/build/html/python/_autosummary/mlx.core.set_default_device.html#mlx.core.set_default_device
Thus, I was thinking that "device" is closer to data scientists' expectations when they want to assign their ML code to a specific accelerator. WDYT @tenzen-y?
For PyTorch, it seems that the device indicates only a GPU, since the lib name is "cuda.device".
For MLX, I'm not sure which devices can be specified because I cannot find any documentation for supported devices at first glance.
As a result, I would still recommend using "accelerator".
> For MLX, I'm not sure which devices can be specified because I cannot find any documentation for supported devices at first glance
In MLX you can specify CPU or GPU (e.g. Metal). For example: mx.set_default_device(mx.cpu) or mx.set_default_device(mx.gpu): https://github.com/ml-explore/mlx-examples/blob/main/mnist/main.py#L103
On EKS and GKE, the label that indicates the GPU type on the node is named "accelerator", e.g. k8s.amazonaws.com/accelerator: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#special-note-on-gpu-instances
However, how can we deal with CPU-based runtimes? We don't call a CPU an "accelerator", do we?
Or maybe in the future we can support other devices/accelerators.
@tenzen-y Can we create a tracking issue to discuss this further?
> However, how can we deal with CPU-based runtimes? We don't call a CPU an "accelerator", do we?
> Or maybe in the future we can support other devices/accelerators.
IIUC, this device field corresponds to Kubernetes scheduling directives like nodeSelector. So, I'm not sure why we need to provide this knob for CPU workloads.
@tenzen-y The main goal of the device and device count parameters is to show Data Scientists which hardware resources they can utilize while using the existing Runtimes. With that, they can appropriately configure their training code to use those devices. For example, using the nccl backend in PyTorch if GPUs are available, or using bfloat16 if the GPU supports it.
As you can see, the device and device count are part of the Runtime and Component classes that Data Scientists can get:
training-operator/sdk_v2/kubeflow/training/types/types.py
Lines 24 to 39 in e9c22b2
class Runtime:
    name: str
    phase: str
    device: str
    device_count: str


# Representation for the TrainJob component.
@dataclass
class Component:
    name: str
    status: str
    device: str
    device_count: str
    pod_name: str
Does it make sense?
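To illustrate how a Data Scientist might act on these fields, here is a small sketch that picks a `torch.distributed` backend name from the runtime's device. `pick_backend` is a hypothetical helper, the `"gpu"` value is assumed from the Component output shown in this thread, and falling back to `gloo` for non-GPU runtimes is a choice made for this example.

```python
from dataclasses import dataclass


@dataclass
class Runtime:
    name: str
    phase: str
    device: str
    device_count: str


def pick_backend(runtime: Runtime) -> str:
    """Choose a torch.distributed backend name from the runtime's device."""
    # Assumption: "gpu" is the value used for GPU runtimes.
    return "nccl" if runtime.device.lower() == "gpu" else "gloo"
```

The returned string could then be passed to `torch.distributed.init_process_group(backend=...)` in the training function.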
pod_name = None
# Get Initializer or Trainer Pod name.
for c in self.get_job(name).components:
What if there are multiple Pods for the same role, like the initializer? How can we identify the Pod name?
We can imagine a situation where Pods are recreated (while the old ones still exist) based on the restart, recreate, success, and failure policies.
Hmm, it is a good point.
I think the current loop returns the following:
Component(name='initializer', status='Failed', device='cpu', device_count='10', pod_name='xe37e16669f7-initializer-0-0-gaf51')
Component(name='initializer', status='Succeeded', device='cpu', device_count='10', pod_name='xe37e16669f7-initializer-0-0-hl4wv')
Component(name='trainer-node-0', status='Succeeded', device='gpu', device_count='4', pod_name='xe37e16669f7-trainer-node-0-0-ktbt8')
I think we should discuss whether that experience looks good to our users or whether we should change the output.
WDYT @tenzen-y?
IIRC, the Katib UI had a similar problem, which we addressed previously.
We might be able to learn something from that.
I see. I guess that function can return logs from the Failed Pod even if we have another Succeeded Pod, since the podList order is alphabetical. Is that understanding correct, @tenzen-y?
What do you think we should return from the get_job() API in that case?
> I see. I guess that function can return logs from the Failed Pod even if we have another Succeeded Pod, since the podList order is alphabetical. Is that understanding correct, @tenzen-y?
I guess that the current log collection implementation obtains the logs randomly. So we sometimes fetch logs from Failed Pods even when replacement Running or Succeeded Pods exist, don't we?
@tenzen-y As you can see, right now we fetch logs only from the Trainer component if follow=True:
training-operator/sdk_v2/kubeflow/training/api/training_client.py
Lines 368 to 377 in e9c22b2
if follow and component == constants.JOB_TRAINER_NODE:
    log_streams = []
    log_streams.append(
        watch.Watch().stream(
            self.core_api.read_namespaced_pod_log,
            name=pod_name,
            namespace=self.namespace,
            container=constants.CONTAINER_TRAINER,
        )
    )
Otherwise, we just return logs from all TrainJob components:
training-operator/sdk_v2/kubeflow/training/api/training_client.py
Lines 410 to 432 in e9c22b2
if component == constants.JOB_INITIALIZER:
    logs_dict[constants.CONTAINER_DATASET_INITIALIZER] = (
        self.core_api.read_namespaced_pod_log(
            name=pod_name,
            namespace=self.namespace,
            container=constants.CONTAINER_DATASET_INITIALIZER,
        )
    )
    logs_dict[constants.CONTAINER_MODEL_INITIALIZER] = (
        self.core_api.read_namespaced_pod_log(
            name=pod_name,
            namespace=self.namespace,
            container=constants.CONTAINER_MODEL_INITIALIZER,
        )
    )
else:
    logs_dict[component + "-" + str(node_index)] = (
        self.core_api.read_namespaced_pod_log(
            name=pod_name,
            namespace=self.namespace,
            container=constants.CONTAINER_TRAINER,
        )
    )
Additionally, by default the Job controller will not create additional Pods if it fails. It will try to re-create the existing Pod until it reaches the backoffLimit.
I am happy to revisit this API in follow-up PRs if we want to improve it.
WDYT @tenzen-y?
@Electronic-Waste @akshaychitneni @shravan-achar @droctothorpe @deepanker13 @seanlaii @astefanutti Any thoughts/ideas on the above topic ?
> Additionally, by default the Job controller will not create additional Pods if it fails
The failure behavior differs depending on the Pod restart policy and the batch Job success and failure policies. So, we can easily reproduce unexpected log-fetching situations.
I can easily imagine v2 SDK users creating issues to report that they see unexpected logs.
So, I believe it would be better to address this issue before the first v2 release.
If you can work on the issue in another PR before the first v2 release, I'm OK with this for now.
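One way the ambiguity discussed in this thread could be resolved is to pick a single Pod per component deterministically. The sketch below prefers a healthier status and, within the same status, the most recently created Pod. The ranking, the `PodInfo` type, and `pick_pod` are all hypothetical; this is one candidate policy, not what the SDK does.

```python
from dataclasses import dataclass
from typing import List, Optional

# Lower rank wins. Hypothetical policy, not the SDK's behavior.
STATUS_RANK = {"Running": 0, "Succeeded": 1, "Pending": 2, "Failed": 3}


@dataclass
class PodInfo:
    component: str
    status: str
    pod_name: str
    created_at: float  # creation timestamp, seconds since the epoch


def pick_pod(pods: List[PodInfo], component: str) -> Optional[PodInfo]:
    """Pick one Pod for a component: best status first, then newest."""
    candidates = [p for p in pods if p.component == component]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda p: (STATUS_RANK.get(p.status, len(STATUS_RANK)), -p.created_at),
    )
```

With the two initializer Pods listed earlier in this thread, this policy would select the Succeeded one regardless of the order in which the Pods are returned.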
Thank you!
Force-pushed from cc3be21 to e5736c9.
Force-pushed from 6033a72 to dd3a7a6.
Signed-off-by: Andrey Velichkevich <[email protected]>
Use kube-openapi generator to remove unnecessary models from apimachinery Signed-off-by: Andrey Velichkevich <[email protected]>
Force-pushed from e9c22b2 to 8e8243c.
Thank you for this great contribution!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: tenzen-y
Part of: #2216
This is the initial implementation of the Kubeflow Training V2 SDK.
We are still designing the APIs, so these functions are not final.
These changes allow us to showcase the Kubeflow Training V2 demo at KubeCon 2024.
We discussed with @tenzen-y that we can add a specific label to our runtimes to define whether a runtime can be used for pre-training or post-training:
/assign @kubeflow/wg-training-leads @varshaprasad96 @akshaychitneni @deepanker13 @helenxie-bit @Electronic-Waste @saileshd1402 @kannon92 @kuizhiqing @shravan-achar @seanlaii