Add e2e test for train API (#2199)
* add e2e test for train API

Signed-off-by: helenxie-bit <[email protected]>

* fix peft import error

Signed-off-by: helenxie-bit <[email protected]>

* update settings of the job

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix error detection

Signed-off-by: helenxie-bit <[email protected]>

* resolve conflict

Signed-off-by: helenxie-bit <[email protected]>

* resolve conflict

Signed-off-by: helenxie-bit <[email protected]>

* resolve conflict

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix NoneType error

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* test bug

Signed-off-by: helenxie-bit <[email protected]>

* find bug

Signed-off-by: helenxie-bit <[email protected]>

* find bug

Signed-off-by: helenxie-bit <[email protected]>

* find bug

Signed-off-by: helenxie-bit <[email protected]>

* add storage_config

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* reduce pvc size

Signed-off-by: helenxie-bit <[email protected]>

* set storage_config

Signed-off-by: helenxie-bit <[email protected]>

* set storage_config

Signed-off-by: helenxie-bit <[email protected]>

* set storage_config

Signed-off-by: helenxie-bit <[email protected]>

* set storage_config

Signed-off-by: helenxie-bit <[email protected]>

* use gpu

Signed-off-by: helenxie-bit <[email protected]>

* use gpu

Signed-off-by: helenxie-bit <[email protected]>

* use gpu

Signed-off-by: helenxie-bit <[email protected]>

* fix 'set_device' error

Signed-off-by: helenxie-bit <[email protected]>

* add timeout error

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix typo

Signed-off-by: helenxie-bit <[email protected]>

* update e2e test for train api

Signed-off-by: helenxie-bit <[email protected]>

* add num_labels

Signed-off-by: helenxie-bit <[email protected]>

* update pip install

Signed-off-by: helenxie-bit <[email protected]>

* check disk space

Signed-off-by: helenxie-bit <[email protected]>

* change sequence of e2e tests

Signed-off-by: helenxie-bit <[email protected]>

* add clean-up after each e2e test of pytorchjob

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function-add check disk

Signed-off-by: helenxie-bit <[email protected]>

* check docker volumes

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* check docker directory

Signed-off-by: helenxie-bit <[email protected]>

* update pip install and 'num_workers'

Signed-off-by: helenxie-bit <[email protected]>

* update pip install and 'num_workers'

Signed-off-by: helenxie-bit <[email protected]>

* update pip install

Signed-off-by: helenxie-bit <[email protected]>

* change the value of 'clean_pod_policy'

Signed-off-by: helenxie-bit <[email protected]>

* change the value of 'update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* check docker volumes

Signed-off-by: helenxie-bit <[email protected]>

* check docker volumes

Signed-off-by: helenxie-bit <[email protected]>

* stop the controller and restart it again to clean up

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* update cleanup function

Signed-off-by: helenxie-bit <[email protected]>

* separate e2e test for train api

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix parameter of namespace

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* reduce resources

Signed-off-by: helenxie-bit <[email protected]>

* separate e2e test for train API

Signed-off-by: helenxie-bit <[email protected]>

* remove go setup

Signed-off-by: helenxie-bit <[email protected]>

* adjust the version of k8s

Signed-off-by: helenxie-bit <[email protected]>

* move test file to new place

Signed-off-by: helenxie-bit <[email protected]>

* fix typos

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* update install packages

Signed-off-by: helenxie-bit <[email protected]>

* build and verify images of storage-initializer and trainer

Signed-off-by: helenxie-bit <[email protected]>

* fix image build error

Signed-off-by: helenxie-bit <[email protected]>

* fix image build error

Signed-off-by: helenxie-bit <[email protected]>

* check disk space

Signed-off-by: helenxie-bit <[email protected]>

* make 'setup-storage-initializer-and-trainer' executable

Signed-off-by: helenxie-bit <[email protected]>

* separate step of loading images

Signed-off-by: helenxie-bit <[email protected]>

* check disk space after loading image

Signed-off-by: helenxie-bit <[email protected]>

* clean up and check disk space

Signed-off-by: helenxie-bit <[email protected]>

* prune docker build cache

Signed-off-by: helenxie-bit <[email protected]>

* prune docker build cache

Signed-off-by: helenxie-bit <[email protected]>

* adjust sequence of building and loading images

Signed-off-by: helenxie-bit <[email protected]>

* move working directory

Signed-off-by: helenxie-bit <[email protected]>

* delete moving working directory

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* use 'docker system prune'

Signed-off-by: helenxie-bit <[email protected]>

* make the format of the commands to be consistent

Signed-off-by: helenxie-bit <[email protected]>

* update base image

Signed-off-by: helenxie-bit <[email protected]>

* update base image

Signed-off-by: helenxie-bit <[email protected]>

* update base image

Signed-off-by: helenxie-bit <[email protected]>

* delete unnecessary space clear and check code

Signed-off-by: helenxie-bit <[email protected]>

* merge e2e test for train api into integration tests

Signed-off-by: helenxie-bit <[email protected]>

* check for timeout error

Signed-off-by: helenxie-bit <[email protected]>

* fix name of trainer image

Signed-off-by: helenxie-bit <[email protected]>

* fix env of building storage initializer image

Signed-off-by: helenxie-bit <[email protected]>

* clean format

Signed-off-by: helenxie-bit <[email protected]>

* skip e2e test for train API when use scheduling

Signed-off-by: helenxie-bit <[email protected]>

* Update name of fileholder

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Hezhi (Helen) Xie <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* separate e2e test for train API

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* move test script

Signed-off-by: helenxie-bit <[email protected]>

* update path to test script

Signed-off-by: helenxie-bit <[email protected]>

* update path to test script

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* update kubernetes version

Signed-off-by: helenxie-bit <[email protected]>

* update kubernetes version

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* adjust kubernetes version to 1.30.6

Signed-off-by: helenxie-bit <[email protected]>

* adjust kubernetes version to 1.31.4

Signed-off-by: helenxie-bit <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: Hezhi (Helen) Xie <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
helenxie-bit and andreyvelich authored Dec 22, 2024
1 parent 69094e1 commit d7f69e8
Showing 6 changed files with 224 additions and 1 deletion.
61 changes: 61 additions & 0 deletions .github/workflows/e2e-test-train-api.yaml
@@ -0,0 +1,61 @@
name: E2E Test with train API
on:
  - pull_request

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.31.4"]
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup E2E Tests
        uses: ./.github/workflows/setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}
          python-version: ${{ matrix.python-version }}

      - name: Build trainer
        run: |
          ./scripts/gha/build-trainer.sh
        env:
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Load trainer
        run: |
          kind load docker-image ${{ env.TRAINER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
        env:
          KIND_CLUSTER: training-operator-cluster
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Build storage initializer
        run: |
          ./scripts/gha/build-storage-initializer.sh
        env:
          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Load storage initializer
        run: |
          kind load docker-image ${{ env.STORAGE_INITIALIZER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
        env:
          KIND_CLUSTER: training-operator-cluster
          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test

      - name: Run tests
        run: |
          pip install pytest
          python3 -m pip install -e sdk/python[huggingface]
          pytest -s sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py --log-cli-level=debug
        env:
          STORAGE_INITIALIZER_IMAGE: kubeflowtraining/storage-initializer:test
          TRAINER_TRANSFORMER_IMAGE_DEFAULT: kubeflowtraining/trainer:test
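The `strategy.matrix` in the workflow above produces one job per combination of `kubernetes-version` and `python-version`. As a rough illustration only (this is not how GitHub Actions is implemented, and `expand_matrix` is a hypothetical helper, not part of this commit), the expansion can be sketched in Python:

```python
from itertools import product


def expand_matrix(matrix):
    """Return one dict per combination of matrix values (hypothetical helper)."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


# Values copied from the workflow above.
MATRIX = {
    "kubernetes-version": ["v1.31.4"],
    "python-version": ["3.9", "3.10", "3.11"],
}

jobs = expand_matrix(MATRIX)
for job in jobs:
    print(job)
```

With one Kubernetes version and three Python versions this gives three jobs. Because `fail-fast` is `false`, a failure in one of them does not cancel the others.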
2 changes: 1 addition & 1 deletion .github/workflows/integration-tests.yaml
@@ -68,7 +68,7 @@ jobs:
       - name: Run tests
         run: |
           pip install pytest
-          python3 -m pip install -e sdk/python; pytest -s sdk/python/test --log-cli-level=debug --namespace=default
+          python3 -m pip install -e sdk/python; pytest -s sdk/python/test/e2e --log-cli-level=debug --namespace=default
         env:
           GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}

24 changes: 24 additions & 0 deletions scripts/gha/build-storage-initializer.sh
@@ -0,0 +1,24 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script builds the Kubeflow storage initializer image.


set -o errexit
set -o nounset
set -o pipefail

docker build sdk/python/kubeflow/storage_initializer -t "${STORAGE_INITIALIZER_CI_IMAGE}" -f sdk/python/kubeflow/storage_initializer/Dockerfile
24 changes: 24 additions & 0 deletions scripts/gha/build-trainer.sh
@@ -0,0 +1,24 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script builds the Kubeflow trainer image (CPU variant).


set -o errexit
set -o nounset
set -o pipefail

docker build sdk/python/kubeflow/trainer -t "${TRAINER_CI_IMAGE}" -f sdk/python/kubeflow/trainer/Dockerfile.cpu
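Both build scripts read the image tag from an environment variable and, because of `set -o nounset`, abort when that variable is unset. A minimal Python sketch of the same idea, assuming a hypothetical helper `docker_build_argv` (not part of the repository), assembles the equivalent command line without running it:

```python
import os


def docker_build_argv(context, dockerfile, tag_env, env=None):
    """Build the `docker build` argv; raise if the tag variable is unset."""
    env = os.environ if env is None else env
    if tag_env not in env:
        # Mirrors the effect of `set -o nounset` in the shell scripts.
        raise KeyError(f"{tag_env} is not set")
    return ["docker", "build", context, "-t", env[tag_env], "-f", dockerfile]


# Paths and tag copied from build-trainer.sh and the workflow file.
argv = docker_build_argv(
    "sdk/python/kubeflow/trainer",
    "sdk/python/kubeflow/trainer/Dockerfile.cpu",
    "TRAINER_CI_IMAGE",
    env={"TRAINER_CI_IMAGE": "kubeflowtraining/trainer:test"},
)
print(" ".join(argv))
```

Passing the arguments as a list avoids the word-splitting concerns that unquoted shell variables raise.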
18 changes: 18 additions & 0 deletions sdk/python/kubeflow/trainer/Dockerfile.cpu
@@ -0,0 +1,18 @@
# Use an official Python runtime as a parent image
FROM python:3.11

# Set the working directory in the container
WORKDIR /app

# Copy the requirements.txt file into the container
COPY requirements.txt /app/requirements.txt

# Install PyTorch first, then the packages listed in requirements.txt
RUN pip install --no-cache-dir torch==2.5.1
RUN pip install --no-cache-dir -r requirements.txt

# Copy the Python package and its source code into the container
COPY . /app

# Run hf_llm_training.py with torchrun when the container launches
ENTRYPOINT ["torchrun", "hf_llm_training.py"]
96 changes: 96 additions & 0 deletions sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py
@@ -0,0 +1,96 @@
# Copyright 2024 kubeflow.org.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging

import transformers
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from kubeflow.training import TrainingClient, constants
from peft import LoraConfig

import test.e2e.utils as utils

logging.basicConfig(format="%(message)s")
logging.getLogger("kubeflow.training.api.training_client").setLevel(logging.DEBUG)

TRAINING_CLIENT = TrainingClient(job_kind=constants.PYTORCHJOB_KIND)


def test_sdk_e2e_create_from_train_api(job_namespace="default"):
    JOB_NAME = "pytorchjob-from-train-api"

    # Use test case from fine-tuning API tutorial.
    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
    TRAINING_CLIENT.train(
        name=JOB_NAME,
        namespace=job_namespace,
        # BERT model URI and type of Transformer to train it.
        model_provider_parameters=HuggingFaceModelParams(
            model_uri="hf://google-bert/bert-base-cased",
            transformer_type=transformers.AutoModelForSequenceClassification,
            num_labels=5,
        ),
        # In order to save test time, use 8 samples from Yelp dataset.
        dataset_provider_parameters=HuggingFaceDatasetParams(
            repo_id="yelp_review_full",
            split="train[:8]",
        ),
        # Specify HuggingFace Trainer parameters.
        trainer_parameters=HuggingFaceTrainerParams(
            training_parameters=transformers.TrainingArguments(
                output_dir="test_trainer",
                save_strategy="no",
                evaluation_strategy="no",
                do_eval=False,
                disable_tqdm=True,
                log_level="info",
                num_train_epochs=1,
            ),
            # Set LoRA config to reduce number of trainable parameters.
            lora_config=LoraConfig(
                r=8,
                lora_alpha=8,
                lora_dropout=0.1,
                bias="none",
            ),
        ),
        num_workers=1,
        num_procs_per_worker=1,
        resources_per_worker={
            "gpu": 0,
            "cpu": 2,
            "memory": "10G",
        },
        storage_config={
            "size": "10Gi",
            "access_modes": ["ReadWriteOnce"],
        },
    )

    logging.info(f"List of created {TRAINING_CLIENT.job_kind}s")
    logging.info(TRAINING_CLIENT.list_jobs(job_namespace))

    try:
        utils.verify_job_e2e(TRAINING_CLIENT, JOB_NAME, job_namespace, wait_timeout=900)
    except Exception as e:
        utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
        TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
        raise Exception(f"PyTorchJob create from API E2E fails. Exception: {e}")

    utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
    TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
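
The test above deletes the job on both the success path and the failure path by repeating the cleanup call. One alternative, sketched here with a stand-in `FakeClient` rather than the real `TrainingClient` (every name in this block is hypothetical and not part of the commit), is `try`/`finally` with exception chaining:

```python
class FakeClient:
    """Stand-in for a job client; records deletions (hypothetical)."""

    def __init__(self):
        self.deleted = []

    def delete_job(self, name, namespace):
        self.deleted.append((name, namespace))


def run_with_cleanup(client, name, namespace, verify):
    """Run verify(); always delete the job, and chain the original error."""
    try:
        verify()
    except Exception as e:
        # `from e` keeps the original traceback attached to the new error.
        raise RuntimeError(f"E2E check for {name} failed: {e}") from e
    finally:
        # Runs whether verify() succeeded or raised.
        client.delete_job(name, namespace)


def failing_verify():
    raise TimeoutError("job did not finish in time")


client = FakeClient()
try:
    run_with_cleanup(client, "demo-job", "default", failing_verify)
except RuntimeError as err:
    print(err, "| cause:", type(err.__cause__).__name__)
print(client.deleted)
```

The `finally` clause keeps the cleanup in one place, so a new early return or a new exception type cannot skip the deletion.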
