
[GSoC] Add e2e test for tune api with LLM hyperparameter optimization #2420

Open
wants to merge 63 commits into base: master

Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.
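
For reference, the new e2e test exercises a tune() call of roughly the following shape (a minimal sketch based on the Kubeflow fine-tuning tutorial that the test reuses; the exact model, dataset, search space, and trial counts used by the test may differ):

import kubeflow.katib as katib
from kubeflow.katib import KatibClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from peft import LoraConfig
import transformers

katib_client = KatibClient(namespace="default")

katib_client.tune(
    name="tune-example-llm-optimization",
    # External model imported from Hugging Face Hub (illustrative choice).
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google/bert_uncased_L-2_H-128_A-2",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # External dataset imported from Hugging Face Hub (illustrative choice).
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Trainer parameters; hyperparameters to optimize are defined with katib.search.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="results",
            save_strategy="no",
            learning_rate=katib.search.double(min=1e-05, max=5e-05),
            num_train_epochs=1,
        ),
        lora_config=LoraConfig(
            r=katib.search.int(min=8, max=32),
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    objective_metric_name="train_loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=1,
    parallel_trial_count=1,
)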

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: helenxie-bit <[email protected]>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@helenxie-bit
Contributor Author

/area gsoc

@helenxie-bit
Contributor Author

Ref: #2339

@helenxie-bit helenxie-bit changed the title [GSoC] Add e2e test for tune api with LLM hyperparameter optimization [WIP] Add e2e test for tune api with LLM hyperparameter optimization Sep 3, 2024
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 3, 2024
Signed-off-by: helenxie-bit <[email protected]>
@google-oss-prow google-oss-prow bot requested review from a team and Electronic-Waste January 25, 2025 01:02

@helenxie-bit: GitHub didn't allow me to request PR reviews from the following users: mahdikhashan.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

This PR is ready for review. Please have a look when you have time :)

/cc @kubeflow/wg-automl-leads @Electronic-Waste @mahdikhashan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@andreyvelich andreyvelich left a comment

Thank you for doing this @helenxie-bit!
Just small comments.

logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

# Use the test case from fine-tuning API tutorial.
# https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
Member

Should we link an updated guide for Katib LLM Optimization?

Contributor Author

Since the Katib LLM Optimization guide is still under review, should I link to the file in its current state for now?

Additionally, the example in the Katib LLM Optimization guide uses a different model and dataset compared to this one. The guide uses the LLaMa model, which requires access tokens. I’ve already applied for the access token and am awaiting approval. Once I receive it, I will test the example to see if it works.

Contributor Author

I tried running the above example, but I ran into some unexpected errors in the storage_initializer container, and the model couldn't be downloaded successfully. It seems like the model used in this example might require different versions of transformers or other libraries. I'll look into it, but it might take some time to resolve.

If we aim to include this in Katib 0.18-rc.0 this week, we might need to stick with the current example. Otherwise, I’ll work on fixing it before RC.1.

Member

I think it is fine to include it in RC.1 since it is a bug fix.

Member

We can keep the URL for the Kubeflow Training docs for now.

@mahdikhashan
Member

I started reviewing this PR.

Member

@mahdikhashan mahdikhashan left a comment

I couldn't run the external test on a Mac M1 with 8 GB RAM, on a K8s cluster created with k3d.

INFO:root:---------------------------------------------------------------
INFO:root:E2E is failed for Experiment created by tune: default/tune-example-2
INFO:root:---------------------------------------------------------------
INFO:root:---------------------------------------------------------------
DEBUG:kubernetes.client.rest:response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}

Traceback (most recent call last):
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 183, in <module>
    raise e
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 177, in <module>
    run_e2e_experiment_create_by_tune_with_llm_optimization(katib_client, exp_name_llm_optimization, exp_namespace)
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 81, in run_e2e_experiment_create_by_tune_with_llm_optimization
    katib_client.tune(
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 605, in tune
    lora_config = utils.get_trial_substitutions_from_trainer(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 213, in get_trial_substitutions_from_trainer
    parameters = json.dumps(parameters.__dict__, cls=SetEncoder)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 143, in default
    return json.JSONEncoder.default(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type LoraRuntimeConfig is not JSON serializable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1223, in delete_experiment
    self.custom_api.delete_namespaced_custom_object(
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 911, in delete_namespaced_custom_object
    return self.delete_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 1038, in delete_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 415, in request
    return self.rest_client.DELETE(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 270, in DELETE
    return self.request("DELETE", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '2b6ee8e1-d8e8-4ec1-9fe4-bcea39264f1a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '7a07626f-55e1-4b48-b5f1-87b4cd8b517f', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a02fdbf6-92c2-4dc6-b5f8-416b965fc7f7', 'Date': 'Wed, 29 Jan 2025 12:14:45 GMT', 'Content-Length': '246'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 188, in <module>
    katib_client.delete_experiment(exp_name_llm_optimization, exp_namespace)
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1236, in delete_experiment
    raise RuntimeError(f"Failed to delete Katib Experiment: {namespace}/{name}")
RuntimeError: Failed to delete Katib Experiment: default/tune-example-2
NAME                                 READY   STATUS    RESTARTS   AGE
katib-controller-754877f9f-zvscj     1/1     Running   0          20m
katib-db-manager-64d9c694dd-m9k4h    1/1     Running   0          20m
katib-mysql-74f9795f8b-6h55q         1/1     Running   0          20m
katib-ui-65698b4896-glq9p            1/1     Running   0          20m
training-operator-7dc56b6448-28r69   1/1     Running   0          22m

HuggingFaceDatasetParams,
HuggingFaceModelParams,
HuggingFaceTrainerParams,
)
Member

I would suggest importing each e2e test's specific requirements inside its function, for example:

# Test for Experiment created with external models and datasets.
def run_e2e_experiment_create_by_tune_with_llm_optimization(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
):
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceDatasetParams,
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )
    import transformers
    from peft import LoraConfig

    # Create Katib Experiment and wait until it is finished.
    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

This way, the scope of each test is more clearly defined. WDYT?

Contributor Author

Sorry for the late reply. That makes sense; I've already updated it.

Member

thank you for your time.

logging.info("---------------------------------------------------------------")
katib_client.delete_experiment(exp_name_custom_objective, exp_namespace)

try:
Member

@mahdikhashan mahdikhashan Jan 29, 2025

I would suggest iterating over a simple data structure, like the unit tests do, for example:

test_tune_data = [
    (
        "tune_with_custom_objective",
        run_e2e_experiment_create_by_tune_with_custom_objective,
    ),
]

WDYT?
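
For illustration, a table like that could then drive the runs in one loop (a hypothetical sketch; it assumes the katib_client, exp_namespace, and the two run_e2e_* functions already defined in the script):

test_tune_data = [
    ("tune-example-1", run_e2e_experiment_create_by_tune_with_custom_objective),
    ("tune-example-2", run_e2e_experiment_create_by_tune_with_llm_optimization),
]

for exp_name, run_test in test_tune_data:
    try:
        # Each test creates its own Experiment and waits for it to finish.
        run_test(katib_client, exp_name, exp_namespace)
        logging.info("E2E passed for Experiment created by tune: {}/{}".format(exp_namespace, exp_name))
    except Exception as e:
        logging.info("E2E is failed for Experiment created by tune: {}/{}".format(exp_namespace, exp_name))
        raise e
    finally:
        # Clean up the Experiment whether the test passed or failed.
        katib_client.delete_experiment(exp_name, exp_namespace)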

Contributor Author

Since there are only three test functions in this e2e test, I think we can stick to the original way. WDYT?

Member

@mahdikhashan mahdikhashan Mar 19, 2025

Yes, let's keep it as it is for now; I'll create an issue to follow up on this improvement later. Thank you for your time.

I created an issue for it here; maybe someone can contribute to Katib later. (#2532)

@@ -79,18 +156,33 @@ def objective(parameters):
client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})

# Test with run_e2e_experiment_create_by_tune
exp_name = "tune-example"
exp_name_custom_objective = "tune-example-1"
exp_name_llm_optimization = "tune-example-2"
Member

I would suggest a more meaningful name for the test. While I was looking at the test results, it was not easy for me to tell the difference between tune-example-1 and tune-example-2.

How about tune-for-an-objective-function and tune-for-external-model? WDYT? (Feel free to offer better names; these were spontaneous ideas.)


Contributor Author

Thank you for the suggestions! I renamed the e2e test for the LLM optimization API to "tune-example-llm-optimization"; hopefully that is clearer.

Member

thank you for your time.


# Print the Experiment and Suggestion.
logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
Member

I would suggest using a prettifier to format the result on test success or failure here, for example using pprint. WDYT?

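For instance, a minimal sketch of that (assuming the existing katib_client, exp_name, and exp_namespace variables in the script):

from pprint import pformat

# Print the Experiment and Suggestion with readable indentation.
logging.debug(pformat(katib_client.get_experiment(exp_name, exp_namespace)))
logging.debug(pformat(katib_client.get_suggestion(exp_name, exp_namespace)))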

Contributor Author

Thank you for the suggestion! Already updated.

Member

thank you for your time.

@mahdikhashan
Member

mahdikhashan commented Jan 29, 2025

It seems that the reason for the test failure on my machine is:

TypeError: Object of type LoraRuntimeConfig is not JSON serializable

My Python version is 3.12.7.
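
For context, a minimal repro of this serialization failure, assuming a recent peft release in which LoraConfig carries a LoraRuntimeConfig instance in its __dict__ (the pinned peft 0.3.0 does not have that field):

import json

from peft import LoraConfig

lora_config = LoraConfig(r=8)
# Roughly mirrors what katib_client.tune() does when it serializes the trial substitutions;
# on newer peft this raises: TypeError: Object of type LoraRuntimeConfig is not JSON serializable.
json.dumps(lora_config.__dict__)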

@andreyvelich
Member

andreyvelich commented Feb 3, 2025

Hi @helenxie-bit, we have time until this Wednesday to merge this PR before we cut Katib RC.0.
Do you have enough time to finish it?

@andreyvelich andreyvelich added this to the v0.18 milestone Feb 3, 2025
@helenxie-bit
Contributor Author

Hi @helenxie-bit, we have time until this Wednesday to merge this PR before we cut Katib RC.0. Do you have enough time to finish it?

Sorry I'm a bit swamped with other stuff right now, so merging this by Wednesday might be tough. Could we include it in Katib RC.1 instead? I'll do my best to wrap it up as soon as I can.

@andreyvelich
Member

Sure, no problem, we can merge it after the first RC

@andreyvelich
Member

Hi @helenxie-bit, we want to cut Katib 0.18 soon. Do you have time to finish this PR?

@helenxie-bit
Contributor Author

@andreyvelich Sorry for the delay! I'm occupied with other things for the next couple of days and will work on this after that. When are we going to cut Katib 0.18?

@andreyvelich
Member

We would like to release it in the next 2 weeks.

@andreyvelich
Member

Hi @helenxie-bit, we are planning to cut the Katib release this week.
Do you think you can finish this PR?

Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit
Contributor Author

helenxie-bit commented Mar 18, 2025


Hi @helenxie-bit, we are planning to cut the Katib release this week. Do you think you can finish this PR?

@andreyvelich Thank you for following up! I'm working on this, but the e2e test failed due to some problem inside the trainer. Here is the error message:

I0318 22:47:28.491627     308 main.go:396] Trial Name: tune-example-llm-optimization-mkfm67k9
I0318 22:47:33.946979     308 main.go:139] 2025-03-18T22:47:33Z INFO     Starting HuggingFace LLM Trainer
I0318 22:47:33.950305     308 main.go:139] /usr/local/lib/python3.10/dist-packages/accelerate/state.py:313: UserWarning: OMP_NUM_THREADS/MKL_NUM_THREADS unset, we set it at 8 to improve oob performance.
I0318 22:47:33.950324     308 main.go:139]   warnings.warn(
I0318 22:47:33.952095     308 main.go:139] /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1317: UserWarning: For MPI backend, world_size (1) and rank (0) are ignored since they are assigned by the MPI runtime.
I0318 22:47:33.952106     308 main.go:139]   warnings.warn(
I0318 22:47:34.003708     308 main.go:139] /usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1815: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
I0318 22:47:34.003725     308 main.go:139]   warnings.warn(
I0318 22:47:34.005569     308 main.go:139] 2025-03-18T22:47:34Z INFO     Setup model and tokenizer
I0318 22:47:34.006007     308 main.go:139] /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
I0318 22:47:34.006018     308 main.go:139]   warnings.warn(
I0318 22:47:35.597752     308 main.go:139] [rank0]: Traceback (most recent call last):
I0318 22:47:35.597801     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
I0318 22:47:35.597818     308 main.go:139] [rank0]:     resolved_file = hf_hub_download(
I0318 22:47:35.597822     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
I0318 22:47:35.597834     308 main.go:139] [rank0]:     return fn(*args, **kwargs)
I0318 22:47:35.597842     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 862, in hf_hub_download
I0318 22:47:35.597856     308 main.go:139] [rank0]:     return _hf_hub_download_to_cache_dir(
I0318 22:47:35.597863     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 969, in _hf_hub_download_to_cache_dir
I0318 22:47:35.597875     308 main.go:139] [rank0]:     _raise_on_head_call_error(head_call_error, force_download, local_files_only)
I0318 22:47:35.597882     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1477, in _raise_on_head_call_error
I0318 22:47:35.597893     308 main.go:139] [rank0]:     raise LocalEntryNotFoundError(
I0318 22:47:35.597898     308 main.go:139] [rank0]: huggingface_hub.errors.LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.

I checked the logs of the master pod, and it only has two containers: pytorch and metrics-logger-and-collector. It seems the storage-initializer container was not created.

command: kubectl logs tune-example-llm-optimization-mkfm67k9-master-0 -n default
                           
Defaulted container "pytorch" out of: pytorch, metrics-logger-and-collector

I'm not sure if it has something to do with the Training Operator update. Do you have any ideas?

By the way, I've installed the Training Operator control plane v1.8.1. I tried to install the latest Training Operator control plane by running kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=master"; however, it shows the following error. I'm not sure if it has something to do with the storage-initializer error:

error: evalsymlink failure on '/private/var/folders/l3/jvrwplzx77z55jbbtyh6nxbw0000gn/T/kustomize-2661365325/manifests/overlays/standalone' : lstat /private/var/folders/l3/jvrwplzx77z55jbbtyh6nxbw0000gn/T/kustomize-2661365325/manifests/overlays/standalone: no such file or directory

@helenxie-bit
Contributor Author

It seems that the reason for the test failure on my machine is:

TypeError: Object of type LoraRuntimeConfig is not JSON serializable

My Python version is 3.12.7.

@mahdikhashan Hmmm, that's strange. It seems the problem is that the type should be LoraConfig instead of LoraRuntimeConfig. Can you check the versions of peft and transformers on your machine? The correct versions should be 0.3.0 and 4.38.0, respectively.
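
For example, one quick way to check (assuming both packages are importable in the environment that runs the test):

import peft
import transformers

# The LLM optimization path of tune() expects the pinned versions mentioned above.
print(peft.__version__, transformers.__version__)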

@mahdikhashan
Member

It seems that the reason for the test failure on my machine is:

TypeError: Object of type LoraRuntimeConfig is not JSON serializable

My Python version is 3.12.7.

@mahdikhashan Hmmm, that's strange. It seems the problem is that the type should be LoraConfig instead of LoraRuntimeConfig. Can you check the versions of peft and transformers on your machine? The correct versions should be 0.3.0 and 4.38.0, respectively.

Yes, I'll do so and share the full testing environment so we can then work on it.
