Add e2e test for train API (#2199)

* add e2e test for train API Signed-off-by: helenxie-bit <[email protected]> * fix peft import error Signed-off-by: helenxie-bit <[email protected]> * update settings of the job Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix error detection Signed-off-by: helenxie-bit <[email protected]> * resolve conflict Signed-off-by: helenxie-bit <[email protected]> * resolve conflict Signed-off-by: helenxie-bit <[email protected]> * resolve conflict Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix NoneType error Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * test bug Signed-off-by: helenxie-bit <[email protected]> * find bug Signed-off-by: helenxie-bit <[email protected]> * find bug Signed-off-by: helenxie-bit <[email protected]> * find bug Signed-off-by: helenxie-bit <[email protected]> * add storage_config Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * reduce pvc size Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * use gpu Signed-off-by: helenxie-bit <[email protected]> * use gpu Signed-off-by: helenxie-bit <[email protected]> * use gpu Signed-off-by: helenxie-bit <[email protected]> * fix 'set_device' error Signed-off-by: helenxie-bit <[email protected]> * add timeout error Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit 
<[email protected]> * fix typo Signed-off-by: helenxie-bit <[email protected]> * update e2e test for train api Signed-off-by: helenxie-bit <[email protected]> * add num_labels Signed-off-by: helenxie-bit <[email protected]> * update pip install Signed-off-by: helenxie-bit <[email protected]> * check disk space Signed-off-by: helenxie-bit <[email protected]> * change sequence of e2e tests Signed-off-by: helenxie-bit <[email protected]> * add clean-up after each e2e test of pytorchjob Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function-add check disk Signed-off-by: helenxie-bit <[email protected]> * check docker volumes Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * check docker directory Signed-off-by: helenxie-bit <[email protected]> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <[email protected]> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <[email protected]> * update pip install Signed-off-by: helenxie-bit <[email protected]> * change the value of 'clean_pod_policy' Signed-off-by: helenxie-bit <[email protected]> * change the value of 'update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * check docker volumes Signed-off-by: helenxie-bit <[email protected]> * check docker volumes Signed-off-by: helenxie-bit <[email protected]> * stop the controller and restart it again to clean up Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: 
helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * separate e2e test for train api Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix parameter of namespace Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * reduce resources Signed-off-by: helenxie-bit <[email protected]> * separate e2e test for train API Signed-off-by: helenxie-bit <[email protected]> * remove go setup Signed-off-by: helenxie-bit <[email protected]> * adjust the version of k8s Signed-off-by: helenxie-bit <[email protected]> * move test file to new place Signed-off-by: helenxie-bit <[email protected]> * fix typos Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * update install packages Signed-off-by: helenxie-bit <[email protected]> * build and verify images of storage-intializer and trainer Signed-off-by: helenxie-bit <[email protected]> * fix image build error Signed-off-by: helenxie-bit <[email protected]> * fix image build error Signed-off-by: helenxie-bit <[email protected]> * check disk space Signed-off-by: helenxie-bit <[email protected]> * make 'setup-storage-initializer-and-trainer' executable Signed-off-by: helenxie-bit <[email protected]> * separate step of loading images Signed-off-by: helenxie-bit <[email protected]> * check disk space after loading image Signed-off-by: helenxie-bit <[email protected]> * clean up and check disk space Signed-off-by: helenxie-bit <[email protected]> * prune docker build cache Signed-off-by: helenxie-bit <[email protected]> * prune docker build cache Signed-off-by: helenxie-bit <[email protected]> * adjust sequence of building and loading images Signed-off-by: helenxie-bit <[email protected]> * move working directory Signed-off-by: helenxie-bit <[email protected]> * delete moving working directory Signed-off-by: 
helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * use 'docker system prune' Signed-off-by: helenxie-bit <[email protected]> * make the format of the commands to be consistent Signed-off-by: helenxie-bit <[email protected]> * update base image Signed-off-by: helenxie-bit <[email protected]> * update base image Signed-off-by: helenxie-bit <[email protected]> * update base image Signed-off-by: helenxie-bit <[email protected]> * delete unnecessary space clear and check code Signed-off-by: helenxie-bit <[email protected]> * merge e2e test for train api into integration tests Signed-off-by: helenxie-bit <[email protected]> * check for timeout error Signed-off-by: helenxie-bit <[email protected]> * fix name of trainer image Signed-off-by: helenxie-bit <[email protected]> * fix env of building storage initializer image Signed-off-by: helenxie-bit <[email protected]> * clean format Signed-off-by: helenxie-bit <[email protected]> * skip e2e test for train API when use scheduling Signed-off-by: helenxie-bit <[email protected]> * Update name of fileholder Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Hezhi (Helen) Xie <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * separate e2e test for train API Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * move test script Signed-off-by: helenxie-bit <[email protected]> * update path to test script Signed-off-by: helenxie-bit <[email protected]> * update path to test script Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * update kubernetes version Signed-off-by: helenxie-bit <[email protected]> * update kubernetes version Signed-off-by: helenxie-bit <[email protected]> * rerun tests 
Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * adjust kubernetes version to 1.30.6 Signed-off-by: helenxie-bit <[email protected]> * adjust kubernetes version to 1.31.4 Signed-off-by: helenxie-bit <[email protected]> --------- Signed-off-by: helenxie-bit <[email protected]> Signed-off-by: Hezhi (Helen) Xie <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
kubeflow · Dec 22, 2024 · d7f69e8
1 parent 69094e1
commit d7f69e8
Showing 6 changed files with 224 additions and 1 deletion.
diff --git a/.github/workflows/e2e-test-train-api.yaml b/.github/workflows/e2e-test-train-api.yaml
@@ -0,0 +1,61 @@
+name: E2E Test with train API
+on:
+  - pull_request
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  e2e-test:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        kubernetes-version: ["v1.31.4"]
+        python-version: ["3.9", "3.10", "3.11"]
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup E2E Tests
+        uses: ./.github/workflows/setup-e2e-test
+        with:
+          kubernetes-version: ${{ matrix.kubernetes-version }}
+          python-version: ${{ matrix.python-version }}
+
+      - name: Build trainer
+        run: |
+          ./scripts/gha/build-trainer.sh
+        env:
+          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test
+
+      - name: Load trainer
+        run: |
+          kind load docker-image ${{ env.TRAINER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
+        env:
+          KIND_CLUSTER: training-operator-cluster
+          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test
+
+      - name: Build storage initializer
+        run: |
+          ./scripts/gha/build-storage-initializer.sh
+        env:
+          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test
+          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test
+
+      - name: Load storage initializer
+        run: |
+          kind load docker-image ${{ env.STORAGE_INITIALIZER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
+        env:
+          KIND_CLUSTER: training-operator-cluster
+          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test
+
+      - name: Run tests
+        run: |
+          pip install pytest
+          python3 -m pip install -e "sdk/python[huggingface]"
+          pytest -s sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py --log-cli-level=debug
+        env:
+          STORAGE_INITIALIZER_IMAGE: kubeflowtraining/storage-initializer:test
+          TRAINER_TRANSFORMER_IMAGE_DEFAULT: kubeflowtraining/trainer:test
diff --git a/.github/workflows/integration-tests.yaml b/.github/workflows/integration-tests.yaml
@@ -68,7 +68,7 @@ jobs:
       - name: Run tests
         run: |
           pip install pytest
-          python3 -m pip install -e sdk/python; pytest -s sdk/python/test --log-cli-level=debug --namespace=default
+          python3 -m pip install -e sdk/python; pytest -s sdk/python/test/e2e --log-cli-level=debug --namespace=default
         env:
           GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
 

diff --git a/scripts/gha/build-storage-initializer.sh b/scripts/gha/build-storage-initializer.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+# Copyright 2024 The Kubeflow Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script builds the Kubeflow storage initializer image.
+
+
+set -o errexit
+set -o nounset
+set -o pipefail
+
+docker build sdk/python/kubeflow/storage_initializer -t "${STORAGE_INITIALIZER_CI_IMAGE}" -f sdk/python/kubeflow/storage_initializer/Dockerfile
diff --git a/scripts/gha/build-trainer.sh b/scripts/gha/build-trainer.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+# Copyright 2024 The Kubeflow Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script builds the Kubeflow trainer image (CPU variant).
+
+
+set -o errexit
+set -o nounset
+set -o pipefail
+
+docker build sdk/python/kubeflow/trainer -t "${TRAINER_CI_IMAGE}" -f sdk/python/kubeflow/trainer/Dockerfile.cpu
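The workflow pairs each `docker build` with a `kind load docker-image` step for the same image tag. The following is a minimal Python sketch of that pairing, not part of this commit: the helper name `build_and_load_commands` is hypothetical, and it only composes the argument lists taken from the diff above without running them.

```python
import shlex


def build_and_load_commands(image, context, dockerfile, cluster):
    """Compose the build and load commands for one CI image.

    Hypothetical helper for illustration; the flags mirror the commands
    that appear in the workflow and build scripts of this commit.
    """
    build = ["docker", "build", context, "-t", image, "-f", dockerfile]
    load = ["kind", "load", "docker-image", image, "--name", cluster]
    return build, load


if __name__ == "__main__":
    build, load = build_and_load_commands(
        image="kubeflowtraining/trainer:test",
        context="sdk/python/kubeflow/trainer",
        dockerfile="sdk/python/kubeflow/trainer/Dockerfile.cpu",
        cluster="training-operator-cluster",
    )
    # Print shell-safe command lines instead of executing them.
    print(shlex.join(build))
    print(shlex.join(load))
```

Passing the image tag once to a single helper keeps the built tag and the loaded tag from drifting apart, which the workflow otherwise has to guarantee by repeating the `env` values in each step.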
diff --git a/sdk/python/kubeflow/trainer/Dockerfile.cpu b/sdk/python/kubeflow/trainer/Dockerfile.cpu
@@ -0,0 +1,18 @@
+# Use an official Python runtime as a parent image
+FROM python:3.11
+
+# Set the working directory in the container
+WORKDIR /app
+
+# Copy the requirements.txt file into the container
+COPY requirements.txt /app/requirements.txt
+
+# Install PyTorch first, then the packages listed in requirements.txt
+RUN pip install --no-cache-dir torch==2.5.1
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy the Python package and its source code into the container
+COPY . /app
+
+# Run hf_llm_training.py with torchrun when the container launches
+ENTRYPOINT ["torchrun", "hf_llm_training.py"]
diff --git a/sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py b/sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py
@@ -0,0 +1,96 @@
+# Copyright 2024 kubeflow.org.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+
+import transformers
+from kubeflow.storage_initializer.hugging_face import (
+    HuggingFaceDatasetParams,
+    HuggingFaceModelParams,
+    HuggingFaceTrainerParams,
+)
+from kubeflow.training import TrainingClient, constants
+from peft import LoraConfig
+
+import test.e2e.utils as utils
+
+logging.basicConfig(format="%(message)s")
+logging.getLogger("kubeflow.training.api.training_client").setLevel(logging.DEBUG)
+
+TRAINING_CLIENT = TrainingClient(job_kind=constants.PYTORCHJOB_KIND)
+
+
+def test_sdk_e2e_create_from_train_api(job_namespace="default"):
+    JOB_NAME = "pytorchjob-from-train-api"
+
+    # Use test case from fine-tuning API tutorial.
+    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
+    TRAINING_CLIENT.train(
+        name=JOB_NAME,
+        namespace=job_namespace,
+        # BERT model URI and type of Transformer to train it.
+        model_provider_parameters=HuggingFaceModelParams(
+            model_uri="hf://google-bert/bert-base-cased",
+            transformer_type=transformers.AutoModelForSequenceClassification,
+            num_labels=5,
+        ),
+        # To save test time, use only the first 8 samples of the Yelp dataset.
+        dataset_provider_parameters=HuggingFaceDatasetParams(
+            repo_id="yelp_review_full",
+            split="train[:8]",
+        ),
+        # Specify HuggingFace Trainer parameters.
+        trainer_parameters=HuggingFaceTrainerParams(
+            training_parameters=transformers.TrainingArguments(
+                output_dir="test_trainer",
+                save_strategy="no",
+                evaluation_strategy="no",
+                do_eval=False,
+                disable_tqdm=True,
+                log_level="info",
+                num_train_epochs=1,
+            ),
+            # Set LoRA config to reduce number of trainable parameters.
+            lora_config=LoraConfig(
+                r=8,
+                lora_alpha=8,
+                lora_dropout=0.1,
+                bias="none",
+            ),
+        ),
+        num_workers=1,
+        num_procs_per_worker=1,
+        resources_per_worker={
+            "gpu": 0,
+            "cpu": 2,
+            "memory": "10G",
+        },
+        storage_config={
+            "size": "10Gi",
+            "access_modes": ["ReadWriteOnce"],
+        },
+    )
+
+    logging.info(f"List of created {TRAINING_CLIENT.job_kind}s")
+    logging.info(TRAINING_CLIENT.list_jobs(job_namespace))
+
+    try:
+        utils.verify_job_e2e(TRAINING_CLIENT, JOB_NAME, job_namespace, wait_timeout=900)
+    except Exception as e:
+        utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
+        TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
+        raise Exception(f"PyTorchJob create from API E2E fails. Exception: {e}") from e
+
+    utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
+    TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
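The test above prints job results and deletes the job on both the success path and the failure path. One way to express that without repeating the two calls is `try`/`finally`. This is a minimal sketch under stated assumptions, not the commit's code: `FakeClient` and `run_with_cleanup` are hypothetical names, and `verify` and `report` stand in for `utils.verify_job_e2e` and `utils.print_job_results`.

```python
class FakeClient:
    """Hypothetical stand-in for TrainingClient; it only records calls."""

    def __init__(self):
        self.calls = []

    def delete_job(self, name, namespace):
        self.calls.append(("delete_job", name, namespace))


def run_with_cleanup(client, name, namespace, verify, report):
    """Verify a job, then always report results and delete the job."""
    try:
        verify(client, name, namespace)
    except Exception as e:
        # Chain the original exception so its traceback is preserved.
        raise RuntimeError(
            f"PyTorchJob create from API E2E fails. Exception: {e}"
        ) from e
    finally:
        # Runs on success and on failure, before the error propagates.
        report(client, name, namespace)
        client.delete_job(name, namespace)
```

With this shape, a failing `verify` still triggers the report and the delete before the wrapped error reaches pytest, and the original exception remains available as `__cause__`.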