[fine_tuning] Add Cluster scale up for Ilab testing #595

Merged: kpouget merged 16 commits into openshift-psap:main from instruct-scaleup on Nov 25, 2024

@kpouget kpouget commented Nov 22, 2024

No description provided.

openshift-ci bot commented Nov 22, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from kpouget. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kpouget kpouget force-pushed the instruct-scaleup branch 2 times, most recently from 2e0df88 to d92d55a Compare November 22, 2024 10:20

kpouget commented Nov 22, 2024

/test rhoai-light fine_tuning

topsail-bot bot commented Nov 22, 2024

Jenkins Job #1694

🔴 Test of 'rhoai test test_ci' failed after 00 hours 05 minutes 11 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: ilab
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: ilab

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 313, in _run_test_and_visualize
    prepare_finetuning.scale_down()
AttributeError: module 'prepare_finetuning' has no attribute 'scale_down'



[Test ran on the internal Perflab CI]

@kpouget kpouget force-pushed the instruct-scaleup branch 2 times, most recently from 999af36 to f7454ac Compare November 22, 2024 10:41

topsail-bot bot commented Nov 22, 2024

Jenkins Job #1695

🔴 Test of 'rhoai test test_ci' failed after 00 hours 04 minutes 21 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: ilab
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: ilab

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 313, in _run_test_and_visualize
    prepare_finetuning.cluster_scale_down()
TypeError: cluster_scale_down() missing 1 required positional argument: 'to_zero'



[Test ran on the internal Perflab CI]
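
The `TypeError` above is what Python raises when a function with a required positional parameter is called with no arguments. A minimal sketch of one way to make such a call valid, giving `to_zero` a default value, is shown below. The function bodies and the returned replica counts are hypothetical; only the name `cluster_scale_down` and the parameter `to_zero` come from the traceback, and this is not necessarily the fix that was pushed to the branch.

```python
def cluster_scale_down_strict(to_zero):
    """Hypothetical stand-in: `to_zero` is required, as in the failing run."""
    return 0 if to_zero else 1


def cluster_scale_down(to_zero=False):
    """Hypothetical variant: `to_zero` is optional and defaults to False."""
    # The returned replica counts are illustrative, not the project's logic.
    return 0 if to_zero else 1


# Calling the strict variant without arguments reproduces the error class.
try:
    cluster_scale_down_strict()
except TypeError as exc:
    error_message = str(exc)

# The variant with a default accepts both call styles.
default_replicas = cluster_scale_down()
zero_replicas = cluster_scale_down(to_zero=True)
```

The alternative is to leave the signature unchanged and pass `to_zero` explicitly at the call site in `_run_test_and_visualize`.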

@kpouget kpouget force-pushed the instruct-scaleup branch 2 times, most recently from 880a8ef to 7721432 Compare November 22, 2024 11:11

topsail-bot bot commented Nov 22, 2024

Jenkins Job #1696

🔴 Test of 'rhoai test test_ci' failed after 00 hours 00 minutes 04 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: fms
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: fms

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 264, in _run_test_and_visualize
    raise RuntimeError(msg)
RuntimeError: FMS or Quality or Ray or Ilab testing must be enabled. Found 0 enabled. Cannot proceed.



[Test ran on the internal Perflab CI]
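
The `RuntimeError` above comes from a guard that counts the enabled test modes before anything runs. A minimal sketch of such a guard follows. The function name, the flag dictionary, and the rule that exactly one mode must be enabled are assumptions; the log only shows the message text and the zero-enabled case.

```python
def select_test_mode(flags):
    """Return the name of the single enabled test mode.

    `flags` maps a mode name to a boolean. Requiring exactly one enabled
    mode is an assumption; the log only shows the zero-enabled case.
    """
    enabled = [name for name, is_on in flags.items() if is_on]
    if len(enabled) != 1:
        names = " or ".join(flags)
        raise RuntimeError(
            f"{names} testing must be enabled. "
            f"Found {len(enabled)} enabled. Cannot proceed."
        )
    return enabled[0]


# Hypothetical flag values, not taken from the project's configuration.
flags = {"FMS": False, "Quality": False, "Ray": False, "Ilab": True}
selected = select_test_mode(flags)

# With every flag off, the guard raises the same message as the log.
try:
    select_test_mode(dict.fromkeys(flags, False))
except RuntimeError as exc:
    message = str(exc)
```

In this run the `fms` positional argument apparently did not switch any mode on, so the guard stopped the job after four seconds.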

topsail-bot bot commented Nov 22, 2024

Jenkins Job #1697

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 04 minutes 39 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: fms
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: fms

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

kpouget commented Nov 22, 2024

/test rhoai-light fine_tuning ilab gpu

topsail-bot bot commented Nov 22, 2024

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 16 minutes 34 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: ilab gpu
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light
PR_POSITIONAL_ARG_1: ilab
PR_POSITIONAL_ARG_2: gpu

Failure indicator: Empty. (See run.log)

kpouget commented Nov 23, 2024

/test rhoai-light fine_tuning ilab

topsail-bot bot commented Nov 23, 2024

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 06 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: ilab
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light
PR_POSITIONAL_ARG_1: ilab

Failure indicator:

/logs/artifacts/000__cluster__wait_fully_awake/FAILURE | [000__cluster__wait_fully_awake] ./run_toolbox.py cluster wait_fully_awake --> 2


kpouget commented Nov 23, 2024

/test rhoai-light fine_tuning ilab

topsail-bot bot commented Nov 23, 2024

🔴 Test of 'rhoai test test_ci' failed after 00 hours 23 minutes 19 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: ilab
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light
PR_POSITIONAL_ARG_1: ilab

Failure indicator:

/logs/artifacts/004__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1} --> 2
/logs/artifacts/004__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/004__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 143, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]
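
The traceback above ends in `subprocess.run`, which raises `CalledProcessError` when it is called with `check=True` and the child process exits with a non-zero status. A minimal sketch of that behaviour follows. The `run_step` helper is hypothetical, and it launches the Python interpreter directly instead of the shell command string shown in the log, so it only illustrates how exit status 2 surfaces as an exception.

```python
import subprocess
import sys


def run_step(argv):
    """Run a command and return (ok, exit_status). Hypothetical helper."""
    try:
        subprocess.run(argv, check=True)
    except subprocess.CalledProcessError as exc:
        # exc.returncode holds the child's non-zero exit status.
        return False, exc.returncode
    return True, 0


# A child process that exits with status 2, like the failing toolbox call.
failing = [sys.executable, "-c", "import sys; sys.exit(2)"]
ok, status = run_step(failing)

# A child process that exits normally.
passing_result = run_step([sys.executable, "-c", "pass"])
```

The exception only reports that the command failed; the actual cause has to be read from the artifacts of the `run_fine_tuning_job` step.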

kpouget commented Nov 24, 2024

/test rhoai-light fine_tuning ilab

topsail-bot bot commented Nov 24, 2024

🔴 Test of 'rhoai test test_ci' failed after 00 hours 31 minutes 35 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: ilab
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light
PR_POSITIONAL_ARG_1: ilab

Failure indicator:

/logs/artifacts/004__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1} --> 2
/logs/artifacts/004__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/004__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 143, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

kpouget commented Nov 24, 2024

/test rhoai-light fine_tuning ilab

topsail-bot bot commented Nov 24, 2024

🔴 Test of 'rhoai test test_ci' failed after 00 hours 30 minutes 55 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: ilab
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light
PR_POSITIONAL_ARG_1: ilab

Failure indicator:

/logs/artifacts/004__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1} --> 2
/logs/artifacts/004__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/004__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 143, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

kpouget commented Nov 24, 2024

/test rhoai-light fine_tuning

topsail-bot bot commented Nov 24, 2024

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 13 minutes 56 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: ''
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light

Failure indicator:

/logs/artifacts/002__prepare2/001__prepare_data/000__cluster__preload_image_fine-tuning-image/FAILURE | [000__cluster__preload_image_fine-tuning-image] ./run_toolbox.py from_config cluster preload_image --prefix=sutest --extra={} --> 2
/logs/artifacts/002__prepare2/001__prepare_data/002__cluster__preload_image_fine-tuning-image/FAILURE | [002__cluster__preload_image_fine-tuning-image] ./run_toolbox.py from_config cluster preload_image --prefix=sutest --extra={} --> 2
/logs/artifacts/002__prepare2/001__prepare_data/003__cluster__preload_image_fine-tuning-image/FAILURE | [003__cluster__preload_image_fine-tuning-image] ./run_toolbox.py from_config cluster preload_image --prefix=sutest --extra={} --> 2


kpouget commented Nov 24, 2024

/test rhoai-light fine_tuning

topsail-bot bot commented Nov 24, 2024

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 36 minutes 10 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: ''
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light

kpouget commented Nov 24, 2024

/test rhoai-light fine_tuning ilab

kpouget commented Nov 24, 2024

/test rhoai-light fine_tuning ilab

topsail-bot bot commented Nov 24, 2024

🔴 Test of 'rhoai test test_ci' failed after 00 hours 05 minutes 11 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: ilab
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light
PR_POSITIONAL_ARG_1: ilab

Failure indicator:

/logs/artifacts/004__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1} --> 2
/logs/artifacts/004__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/004__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 143, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

kpouget commented Nov 25, 2024

/test rhoai-light fine_tuning ilab

topsail-bot bot commented Nov 25, 2024

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 03 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: ilab
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light
PR_POSITIONAL_ARG_1: ilab

Failure indicator: Empty. (See run.log)

openshift-ci bot commented Nov 25, 2024

@kpouget: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/rhoai-light | db4e8b1 | link | true | `/test rhoai-light` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

kpouget commented Nov 25, 2024

/test rhoai-light fine_tuning ilab

topsail-bot bot commented Nov 25, 2024

🔴 Test of 'rhoai test test_ci' failed after 00 hours 14 minutes 35 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: ilab
PR_POSITIONAL_ARG_0: fine_tuning-rhoai-light
PR_POSITIONAL_ARG_1: ilab

Failure indicator:

/logs/artifacts/004__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1} --> 2
/logs/artifacts/004__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/004__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'dataset_name': 'ilab_skills_data.jsonl', 'dataset_replication': 1, 'gpu': 1, 'model_name': 'granite-3b-code-instruct', 'name': 'ilab', 'pod_count': 1}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 143, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

kpouget commented Nov 25, 2024

merging, I don't know why the g5.xlarge node keeps failing :/

  Warning  NodeNotReady            5m14s  node-controller          Node is not ready

@kpouget kpouget merged commit d22cc7b into openshift-psap:main Nov 25, 2024
6 of 8 checks passed
@kpouget kpouget deleted the instruct-scaleup branch November 25, 2024 10:50