Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK #2324

Merged

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Nov 9, 2024

Part of: #2216

This is initial implementation for the Kubeflow Training V2 SDK.

We are still designing the APIs, so these functions are not final.
These changes allow us to showcase the Kubeflow Training V2 demo at KubeCon 2024.

We discusses with @tenzen-y that we can add specific label to our runtimes to define whether runtime can be used for pre-training or post-training:

training.kubeflow.org/phase: pre-training
training.kubeflow.org/phase: post-training

/assign @kubeflow/wg-training-leads @varshaprasad96 @akshaychitneni @deepanker13 @helenxie-bit @Electronic-Waste @saileshd1402 @kannon92 @kuizhiqing @shravan-achar @seanlaii

@andreyvelich
Copy link
Member Author

/hold

@coveralls
Copy link

coveralls commented Nov 9, 2024

Pull Request Test Coverage Report for Build 12281612841

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12241759597: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls


# Create the TrainJob.
try:
self.custom_api.create_namespaced_custom_object(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check the status of this api, whether this returns 200 or not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, if this API fails it throws an Exception

except Exception:
raise RuntimeError(
f"Failed to create {constants.TRAINJOB_KIND}: {self.namespace}/{train_job_name}"
)

PHASE_PRE_TRAINING = "pre-training"

# The value indicates that runtime can be used for the model pre-training.
PHASE_PRE_TRAINING = "post-training"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be PHASE_POST_TRAINING I guess

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

Copy link
Contributor

@deepanker13 deepanker13 Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please change the comment as well @andreyvelich
The value indicates that runtime can be used for the model pre-training.
Thanks !

Comment on lines +63 to +98
@dataclass
class HuggingFaceDatasetConfig:
storage_uri: str
access_token: Optional[str] = None


@dataclass
# Configuration for the HuggingFace model provider.
class HuggingFaceModelInputConfig:
storage_uri: str
access_token: Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we plan to move HuggingFaceDatasetConfig and HuggingFaceModelConfig to here in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question. We have defined dataset and model related classes in pkg/initializerz_v2.

# TODO (andreyvelich): This should be moved under Training V2 SDK.
@dataclass
class HuggingFaceDatasetConfig:
storage_uri: str
access_token: Optional[str] = None

# TODO (andreyvelich): This should be moved under Training V2 SDK.
@dataclass
class HuggingFaceModelInputConfig:
storage_uri: str
access_token: Optional[str] = None

I guess it would be better if we move these py files in pkg/initializer_v2 to the SDK directory when generating SDK using setup.py. For example: https://github.com/kubeflow/katib/blob/2b41ae62ab3905984e02123218351a703c03bf56/sdk/python/v1beta1/setup.py#L33-L39

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will move them soon.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seanlaii @Electronic-Waste I've completed this.

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM. I'm willing to help if needed:)

Comment on lines +63 to +98
@dataclass
class HuggingFaceDatasetConfig:
storage_uri: str
access_token: Optional[str] = None


@dataclass
# Configuration for the HuggingFace model provider.
class HuggingFaceModelInputConfig:
storage_uri: str
access_token: Optional[str] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question. We have defined dataset and model related classes in pkg/initializerz_v2.

# TODO (andreyvelich): This should be moved under Training V2 SDK.
@dataclass
class HuggingFaceDatasetConfig:
storage_uri: str
access_token: Optional[str] = None

# TODO (andreyvelich): This should be moved under Training V2 SDK.
@dataclass
class HuggingFaceModelInputConfig:
storage_uri: str
access_token: Optional[str] = None

I guess it would be better if we move these py files in pkg/initializer_v2 to the SDK directory when generating SDK using setup.py. For example: https://github.com/kubeflow/katib/blob/2b41ae62ab3905984e02123218351a703c03bf56/sdk/python/v1beta1/setup.py#L33-L39


return train_job_name

def list_jobs(self) -> List[types.TrainJob]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought, do we want to pass runtimeRef name as optional argument here? might be a good idea if we can get train jobs for a particular runtime?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, good idea.
I added this parameter @saileshd1402

@andreyvelich
Copy link
Member Author

andreyvelich commented Nov 15, 2024

I think, this PR should be ready to review to give users provide initial support for Python SDK.
We've already given demo for these APIs at KubeCon 2024 NA, will share the recordings when it is ready.

@andreyvelich andreyvelich force-pushed the issue-2216-initial-sdk-api branch from ee57831 to 2abf580 Compare November 18, 2024 12:58
@andreyvelich
Copy link
Member Author

/hold cancel

config.load_incluster_config()

k8s_client = client.ApiClient(client_configuration)
self.custom_api = client.CustomObjectsApi(k8s_client)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does these client include retries on any failure? May be we could have a wrapper client that would add retries

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is a good idea.
@akshaychitneni Can we have a followup PR to implement it ?
Please can you create an issue to track it ?

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Lots of work! I left a few comments for you.

I think we should co-design configs for Pre-Training and Fine-Tuning in the Trainer parameter.

sdk_v2/kubeflow/training/types/types.py Show resolved Hide resolved
hack/python-sdk-v2/gen-sdk.sh Outdated Show resolved Hide resolved

# The label key to identify the device (e.g. GPU, TPU) that is used for training.
# TODO: Potentially, we should get this data from the Node selectors.
DEVICE_KEY = "training.kubeflow.org/device"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, what exactly does the "device" mean?
Computing Device? Persistent Storage device? Ephemeral Storage device like memory and extended persistent memory? NIC device?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, the main goal of this is to show Data Scientists which device will be used by the Training Runtime to perform model training.

This will help them to understand the accelerator (e.g. V100, H100) on which their model training will be run.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, should we name it just an accelerator?
The device sounds weird since you seem to aim support only accelerator including FPGA, and you do not have any plan to support any other special devices.

Copy link
Member Author

@andreyvelich andreyvelich Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point. However, I was just exploring how make it more close to ML frameworks, and they call their API as "device".

Thus, I was thinking that device is closer to data scientists expectation when they want to assign their ML code to the specific accelerator. WDYT @tenzen-y ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PyTorch, it seems that the device indicates only GPU since the lib name is "cuda.device".
For MLX, I'm not sure which devices can be specified because I can not find any documentation for supported devices at first glance.

As a result, I still would recommend using the accelerator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For MLX, I'm not sure which devices can be specified because I can not find any documentation for supported devices at first glance

In MLX you can specify CPU or GPU (e.g. Metal). For example: mx.set_default_device(mx.cpu) or mx.set_default_device(mx.gpu): https://github.com/ml-explore/mlx-examples/blob/main/mnist/main.py#L103

On EKS and GKE the label that indicates GPU type on the node is named accelerator k8s.amazonaws.com/accelerator: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#special-note-on-gpu-instances

However, how we can deal with CPU-based runtimes, e.g. we don't call CPU an "accelerator", isn't ?
Or maybe in the future we can support other devices/accelerators.

@tenzen-y Can we create a tracking issue to discuss this further ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, how we can deal with CPU-based runtimes, e.g. we don't call CPU an "accelerator", isn't ?
Or maybe in the future we can support other devices/accelerators.

IIUC, this device field corresponds to the Kubernetes scheduling directives like nodeSelector. So, I'm not sure the reason why we need to provide the knob for CPU workloads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y The main goal of this device and device count parameter is to show Data Scientists which hardware resources they can utilize while using the existing Runtimes. With that, they can appropriately configure their training code to use those devices. For example, using nccl backend in PyTorch if GPUs are available or using bfloat16 if GPU supports it.

As you can see, the device and device count is part of Runtime and Component classes that Data Scientists can get:

class Runtime:
name: str
phase: str
device: str
device_count: str
# Representation for the TrainJob component.
@dataclass
class Component:
name: str
status: str
device: str
device_count: str
pod_name: str

Does it make sense ?

sdk_v2/kubeflow/training/api/training_client.py Outdated Show resolved Hide resolved

pod_name = None
# Get Initializer or Trainer Pod name.
for c in self.get_job(name).components:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there are multiple Pods for the same role like initializer? How can we identify the Pod name?
We can imagine the situation where the Pods are recreate (but the old one still exist) based on restart, recreate, success, failure policies.

Copy link
Member Author

@andreyvelich andreyvelich Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it is a good point.
I think, in the current loop it return the following:

Component(name='initializer', status='Failed', device='cpu', device_count='10', pod_name='xe37e16669f7-initializer-0-0-gaf51')
Component(name='initializer', status='Succeeded', device='cpu', device_count='10', pod_name='xe37e16669f7-initializer-0-0-hl4wv')
Component(name='trainer-node-0', status='Succeeded', device='gpu', device_count='4', pod_name='xe37e16669f7-trainer-node-0-0-ktbt8')

I think, we should discuss if that experience looks good to our users or we should change the output.

WDYT @tenzen-y ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the katib UI had similar problem, and we addressed that, previously.
We might be able to learn something from that.

https://github.com/kubeflow/katib/blob/2b41ae62ab3905984e02123218351a703c03bf56/pkg/ui/v1beta1/backend.go#L720

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I guess, that function can return logs from the Failed Pod even if we have another Succeeded Pod since the podList order is alphabetical. Is that understanding correct, @tenzen-y ?
What do you think we should return with get_job() API in that case ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I guess, that function can return logs from the Failed Pod even if we have another Succeeded Pod since the podList order is alphabetical. Is that understanding correct, @tenzen-y ?

I guess that this current log collection implementation obtains the logs randomly. So, we sometimes fetch the logs from Failed Pods even if the replaced Running or Succeeded Pods, isn't it?

Copy link
Member Author

@andreyvelich andreyvelich Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y As you can see right now we fetch logs only from the Trainer component, if follow=True:

if follow and component == constants.JOB_TRAINER_NODE:
log_streams = []
log_streams.append(
watch.Watch().stream(
self.core_api.read_namespaced_pod_log,
name=pod_name,
namespace=self.namespace,
container=constants.CONTAINER_TRAINER,
)
)

Otherwise, we just return logs from all TrainJob components:

if component == constants.JOB_INITIALIZER:
logs_dict[constants.CONTAINER_DATASET_INITIALIZER] = (
self.core_api.read_namespaced_pod_log(
name=pod_name,
namespace=self.namespace,
container=constants.CONTAINER_DATASET_INITIALIZER,
)
)
logs_dict[constants.CONTAINER_MODEL_INITIALIZER] = (
self.core_api.read_namespaced_pod_log(
name=pod_name,
namespace=self.namespace,
container=constants.CONTAINER_MODEL_INITIALIZER,
)
)
else:
logs_dict[component + "-" + str(node_index)] = (
self.core_api.read_namespaced_pod_log(
name=pod_name,
namespace=self.namespace,
container=constants.CONTAINER_TRAINER,
)
)

Additionally, by default the Job controller will not create additional Pods if it fails. It will try to re-create the existing Pod until it reaches BackOffLimits.

I am happy to revisit this API in the following PRs if we want to improve it.
WDYT @tenzen-y ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, by default the Job controller will not create additional Pods if it fails

The behavior of the failure is different based on Pod restart Policy, batch job success, and failure policies. So, we can easily reproduce the unexpected log fetching situations.

I can easily imagine the situations where the v2 SDK users create issues to report unexpected logs seeing.
So, I believe that addressing the issue would be better before the first v2 release.

If you can work on the issue in another PR before we release the first v2 release, I'm ok for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, created this @tenzen-y: #2348

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@andreyvelich andreyvelich force-pushed the issue-2216-initial-sdk-api branch from cc3be21 to e5736c9 Compare November 27, 2024 16:17
@andreyvelich andreyvelich force-pushed the issue-2216-initial-sdk-api branch 2 times, most recently from 6033a72 to dd3a7a6 Compare November 28, 2024 21:58
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Use kube-openapi generator to remove unnecessary models from apimachinery

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the issue-2216-initial-sdk-api branch from e9c22b2 to 8e8243c Compare December 10, 2024 12:44
Signed-off-by: Andrey Velichkevich <[email protected]>
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this great contribution!
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit ea01481 into kubeflow:master Dec 13, 2024
52 checks passed
@andreyvelich andreyvelich deleted the issue-2216-initial-sdk-api branch December 13, 2024 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants