
[GSoC] Add e2e test for tune api with LLM hyperparameter optimization #2420

Open
wants to merge 63 commits into base: master

Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.
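
For reference, the new e2e test exercises a tune() call of roughly the following shape (a minimal sketch based on the Kubeflow fine-tuning tutorial that the test reuses; the exact model, dataset, search space, and trial counts used by the test may differ):

import kubeflow.katib as katib
from kubeflow.katib import KatibClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from peft import LoraConfig
import transformers

katib_client = KatibClient(namespace="default")

katib_client.tune(
    name="tune-example-llm-optimization",
    # External model imported from Hugging Face Hub (illustrative choice).
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google/bert_uncased_L-2_H-128_A-2",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # External dataset imported from Hugging Face Hub (illustrative choice).
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Trainer parameters; hyperparameters to optimize are defined with katib.search.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="results",
            save_strategy="no",
            learning_rate=katib.search.double(min=1e-05, max=5e-05),
            num_train_epochs=1,
        ),
        lora_config=LoraConfig(
            r=katib.search.int(min=8, max=32),
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    objective_metric_name="train_loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=1,
    parallel_trial_count=1,
)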

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: helenxie-bit <[email protected]>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@helenxie-bit
Contributor Author

/area gsoc

@helenxie-bit
Contributor Author

Ref: #2339

@helenxie-bit helenxie-bit changed the title [GSoC] Add e2e test for tune api with LLM hyperparameter optimization [WIP] Add e2e test for tune api with LLM hyperparameter optimization Sep 3, 2024
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 3, 2024
Signed-off-by: helenxie-bit <[email protected]>
@google-oss-prow google-oss-prow bot requested review from a team and Electronic-Waste January 25, 2025 01:02

@helenxie-bit: GitHub didn't allow me to request PR reviews from the following users: mahdikhashan.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

This PR is ready for review. Please have a look when you have time :)

/cc @kubeflow/wg-automl-leads @Electronic-Waste @mahdikhashan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@andreyvelich andreyvelich left a comment

Thank you for doing this @helenxie-bit!
Just small comments.

logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

# Use the test case from fine-tuning API tutorial.
# https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
Member

Should we link an updated guide for Katib LLM Optimization?

Contributor Author

Since the Katib LLM Optimization guide is still under review, should I link to the file in its current state for now?

Additionally, the example in the Katib LLM Optimization guide uses a different model and dataset compared to this one. The guide uses the LLaMa model, which requires access tokens. I’ve already applied for the access token and am awaiting approval. Once I receive it, I will test the example to see if it works.

Contributor Author

I tried running the above example, but I ran into some unexpected errors in the storage_initializer container, and the model couldn't be downloaded successfully. It seems like the model used in this example might require different versions of transformers or other libraries. I'll look into it, but it might take some time to resolve.

If we aim to include this in Katib 0.18-rc.0 this week, we might need to stick with the current example. Otherwise, I’ll work on fixing it before RC.1.

Member

I think it is fine to include it in RC.1 since it is a bug fix.

Member

We can keep the URL for the Kubeflow Training docs for now.

@mahdikhashan
Member

I started reviewing this PR.

Member

@mahdikhashan mahdikhashan left a comment

I couldn't run the external test on a Mac M1 with 8 GB RAM, on a K8s cluster created with k3d.

INFO:root:---------------------------------------------------------------
INFO:root:E2E is failed for Experiment created by tune: default/tune-example-2
INFO:root:---------------------------------------------------------------
INFO:root:---------------------------------------------------------------
DEBUG:kubernetes.client.rest:response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}

Traceback (most recent call last):
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 183, in <module>
    raise e
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 177, in <module>
    run_e2e_experiment_create_by_tune_with_llm_optimization(katib_client, exp_name_llm_optimization, exp_namespace)
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 81, in run_e2e_experiment_create_by_tune_with_llm_optimization
    katib_client.tune(
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 605, in tune
    lora_config = utils.get_trial_substitutions_from_trainer(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 213, in get_trial_substitutions_from_trainer
    parameters = json.dumps(parameters.__dict__, cls=SetEncoder)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 143, in default
    return json.JSONEncoder.default(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type LoraRuntimeConfig is not JSON serializable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1223, in delete_experiment
    self.custom_api.delete_namespaced_custom_object(
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 911, in delete_namespaced_custom_object
    return self.delete_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 1038, in delete_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 415, in request
    return self.rest_client.DELETE(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 270, in DELETE
    return self.request("DELETE", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '2b6ee8e1-d8e8-4ec1-9fe4-bcea39264f1a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '7a07626f-55e1-4b48-b5f1-87b4cd8b517f', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a02fdbf6-92c2-4dc6-b5f8-416b965fc7f7', 'Date': 'Wed, 29 Jan 2025 12:14:45 GMT', 'Content-Length': '246'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 188, in <module>
    katib_client.delete_experiment(exp_name_llm_optimization, exp_namespace)
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1236, in delete_experiment
    raise RuntimeError(f"Failed to delete Katib Experiment: {namespace}/{name}")
RuntimeError: Failed to delete Katib Experiment: default/tune-example-2
NAME                                 READY   STATUS    RESTARTS   AGE
katib-controller-754877f9f-zvscj     1/1     Running   0          20m
katib-db-manager-64d9c694dd-m9k4h    1/1     Running   0          20m
katib-mysql-74f9795f8b-6h55q         1/1     Running   0          20m
katib-ui-65698b4896-glq9p            1/1     Running   0          20m
training-operator-7dc56b6448-28r69   1/1     Running   0          22m

HuggingFaceDatasetParams,
HuggingFaceModelParams,
HuggingFaceTrainerParams,
)
Member

I would suggest importing each e2e test's specific requirements inside its function, for example:

# Test for Experiment created with external models and datasets.
def run_e2e_experiment_create_by_tune_with_llm_optimization(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
):
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceDatasetParams,
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )
    import transformers
    from peft import LoraConfig

    # Create Katib Experiment and wait until it is finished.
    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

This way, the scope of each test is more clearly defined. WDYT?

Contributor Author

Sorry for the late reply. That makes sense; I've already updated it.

Member

thank you for your time.

logging.info("---------------------------------------------------------------")
katib_client.delete_experiment(exp_name_custom_objective, exp_namespace)

try:
Member

@mahdikhashan mahdikhashan Jan 29, 2025

I would suggest iterating over a simple data structure, like the unit tests do, for example:

test_tune_data = [
    (
        "tune_with_custom_objective",
        run_e2e_experiment_create_by_tune_with_custom_objective,
    ),
]

WDYT?
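
For illustration, a table like that could then drive the runs in one loop (a hypothetical sketch; it assumes the katib_client, exp_namespace, and the two run_e2e_* functions already defined in the script):

test_tune_data = [
    ("tune-example-1", run_e2e_experiment_create_by_tune_with_custom_objective),
    ("tune-example-2", run_e2e_experiment_create_by_tune_with_llm_optimization),
]

for exp_name, run_test in test_tune_data:
    try:
        # Each test creates its own Experiment and waits for it to finish.
        run_test(katib_client, exp_name, exp_namespace)
        logging.info("E2E passed for Experiment created by tune: {}/{}".format(exp_namespace, exp_name))
    except Exception as e:
        logging.info("E2E is failed for Experiment created by tune: {}/{}".format(exp_namespace, exp_name))
        raise e
    finally:
        # Clean up the Experiment whether the test passed or failed.
        katib_client.delete_experiment(exp_name, exp_namespace)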

Contributor Author

Since there are only three test functions in this e2e test, I think we can stick to the original way. WDYT?

Member

@mahdikhashan mahdikhashan Mar 19, 2025

Yes, let's keep it as it is for now; I'll create an issue to follow up on this improvement later. Thank you for your time.

I created an issue for it here; maybe someone can contribute to Katib later. (#2532)

@@ -79,18 +156,33 @@ def objective(parameters):
client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})

# Test with run_e2e_experiment_create_by_tune
exp_name = "tune-example"
exp_name_custom_objective = "tune-example-1"
exp_name_llm_optimization = "tune-example-2"
Member

I would suggest a more meaningful name for the test. While I was looking at the test results, it was not easy for me to tell the difference between tune-example-1 and tune-example-2.

How about tune-for-an-objective-function and tune-for-external-model? WDYT? (Feel free to offer better names; these were spontaneous ideas.)


Contributor Author

Thank you for the suggestions! I renamed the e2e test for the LLM optimization API to "tune-example-llm-optimization"; hopefully that is clearer.

Member

thank you for your time.


# Print the Experiment and Suggestion.
logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
Member

I would suggest using a prettifier to format the result on test success or failure here, for example using pprint. WDYT?

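For instance, a minimal sketch of that (assuming the existing katib_client, exp_name, and exp_namespace variables in the script):

from pprint import pformat

# Print the Experiment and Suggestion with readable indentation.
logging.debug(pformat(katib_client.get_experiment(exp_name, exp_namespace)))
logging.debug(pformat(katib_client.get_suggestion(exp_name, exp_namespace)))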

Contributor Author

Thank you for the suggestion! Already updated.

Member

thank you for your time.

@mahdikhashan
Member

mahdikhashan commented Jan 29, 2025

It seems that the reason for the test failure on my machine is:

TypeError: Object of type LoraRuntimeConfig is not JSON serializable

My Python version is 3.12.7.
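
For context, a minimal repro of this serialization failure, assuming a recent peft release in which LoraConfig carries a LoraRuntimeConfig instance in its __dict__ (the pinned peft 0.3.0 does not have that field):

import json

from peft import LoraConfig

lora_config = LoraConfig(r=8)
# Roughly mirrors what katib_client.tune() does when it serializes the trial substitutions;
# on newer peft this raises: TypeError: Object of type LoraRuntimeConfig is not JSON serializable.
json.dumps(lora_config.__dict__)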

@andreyvelich
Member

andreyvelich commented Feb 3, 2025

Hi @helenxie-bit, we have time until this Wednesday to merge this PR before we cut Katib RC.0.
Do you have enough time to finish it?

@andreyvelich andreyvelich added this to the v0.18 milestone Feb 3, 2025
@helenxie-bit
Contributor Author

Hi @helenxie-bit, we have time until this Wednesday to merge this PR before we cut Katib RC.0. Do you have enough time to finish it?

Sorry I'm a bit swamped with other stuff right now, so merging this by Wednesday might be tough. Could we include it in Katib RC.1 instead? I'll do my best to wrap it up as soon as I can.

@andreyvelich
Member

Sure, no problem, we can merge it after the first RC

@andreyvelich
Member

Hi @helenxie-bit, we want to cut Katib 0.18 soon. Do you have time to finish this PR?

@helenxie-bit
Contributor Author

@andreyvelich Sorry for the delay! I'm occupied with other things for the next couple of days and will work on this after that. When are we going to cut Katib 0.18?

@andreyvelich
Member

We would like to release it in the next 2 weeks.

@andreyvelich
Member

Hi @helenxie-bit, we are planning to cut the Katib release this week.
Do you think you can finish this PR?

Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit
Contributor Author

helenxie-bit commented Mar 18, 2025


Hi @helenxie-bit, we are planning to cut the Katib release this week. Do you think you can finish this PR?

@andreyvelich Thank you for following up! I'm working on this, but the e2e test failed due to some problem inside the trainer. Here is the error message:

I0318 22:47:28.491627     308 main.go:396] Trial Name: tune-example-llm-optimization-mkfm67k9
I0318 22:47:33.946979     308 main.go:139] 2025-03-18T22:47:33Z INFO     Starting HuggingFace LLM Trainer
I0318 22:47:33.950305     308 main.go:139] /usr/local/lib/python3.10/dist-packages/accelerate/state.py:313: UserWarning: OMP_NUM_THREADS/MKL_NUM_THREADS unset, we set it at 8 to improve oob performance.
I0318 22:47:33.950324     308 main.go:139]   warnings.warn(
I0318 22:47:33.952095     308 main.go:139] /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1317: UserWarning: For MPI backend, world_size (1) and rank (0) are ignored since they are assigned by the MPI runtime.
I0318 22:47:33.952106     308 main.go:139]   warnings.warn(
I0318 22:47:34.003708     308 main.go:139] /usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1815: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
I0318 22:47:34.003725     308 main.go:139]   warnings.warn(
I0318 22:47:34.005569     308 main.go:139] 2025-03-18T22:47:34Z INFO     Setup model and tokenizer
I0318 22:47:34.006007     308 main.go:139] /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
I0318 22:47:34.006018     308 main.go:139]   warnings.warn(
I0318 22:47:35.597752     308 main.go:139] [rank0]: Traceback (most recent call last):
I0318 22:47:35.597801     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
I0318 22:47:35.597818     308 main.go:139] [rank0]:     resolved_file = hf_hub_download(
I0318 22:47:35.597822     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
I0318 22:47:35.597834     308 main.go:139] [rank0]:     return fn(*args, **kwargs)
I0318 22:47:35.597842     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 862, in hf_hub_download
I0318 22:47:35.597856     308 main.go:139] [rank0]:     return _hf_hub_download_to_cache_dir(
I0318 22:47:35.597863     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 969, in _hf_hub_download_to_cache_dir
I0318 22:47:35.597875     308 main.go:139] [rank0]:     _raise_on_head_call_error(head_call_error, force_download, local_files_only)
I0318 22:47:35.597882     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1477, in _raise_on_head_call_error
I0318 22:47:35.597893     308 main.go:139] [rank0]:     raise LocalEntryNotFoundError(
I0318 22:47:35.597898     308 main.go:139] [rank0]: huggingface_hub.errors.LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.

I checked the logs of the master pod, and it only has two containers: pytorch and metrics-logger-and-collector. It seems the storage-initializer container was not created.

command: kubectl logs tune-example-llm-optimization-mkfm67k9-master-0 -n default
                           
Defaulted container "pytorch" out of: pytorch, metrics-logger-and-collector

I'm not sure if it has something to do with the Training Operator update. Do you have any ideas?

By the way, I've installed the Training Operator control plane v1.8.1. I tried to install the latest Training Operator control plane by running kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=master"; however, it shows the following error. I'm not sure if it has something to do with the storage-initializer error:

error: evalsymlink failure on '/private/var/folders/l3/jvrwplzx77z55jbbtyh6nxbw0000gn/T/kustomize-2661365325/manifests/overlays/standalone' : lstat /private/var/folders/l3/jvrwplzx77z55jbbtyh6nxbw0000gn/T/kustomize-2661365325/manifests/overlays/standalone: no such file or directory

@helenxie-bit
Contributor Author

It seems that the reason for the test failure on my machine is:

TypeError: Object of type LoraRuntimeConfig is not JSON serializable

My Python version is 3.12.7.

@mahdikhashan Hmmm, that's strange. It seems the problem is that the type should be LoraConfig instead of LoraRuntimeConfig. Can you check the versions of peft and transformers on your machine? The correct versions should be 0.3.0 and 4.38.0, respectively.
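
For example, one quick way to check (assuming both packages are importable in the environment that runs the test):

import peft
import transformers

# The LLM optimization path of tune() expects the pinned versions mentioned above.
print(peft.__version__, transformers.__version__)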

@mahdikhashan
Member

It seems that the reason for the test failure on my machine is:

TypeError: Object of type LoraRuntimeConfig is not JSON serializable

My Python version is 3.12.7.

@mahdikhashan Hmmm, that's strange. It seems the problem is that the type should be LoraConfig instead of LoraRuntimeConfig. Can you check the versions of peft and transformers on your machine? The correct versions should be 0.3.0 and 4.38.0, respectively.

Yes, I'll do so and share the full testing environment so we can then work on it.
