ENG-12182: Moves the dw installation info to the installation guide #495

Merged: 11 commits, Oct 16, 2024
5 changes: 2 additions & 3 deletions assemblies/configuring-distributed-workloads.adoc
@@ -7,13 +7,12 @@ ifdef::context[:parent-context: {context}]

[role='_abstract']
ifdef::self-managed,cloud-service[]
To configure the distributed workloads feature for your data scientists to use in {productname-short}, you must enable several components in the {productname-long} {install-package}, create the required Kueue resources, and optionally configure the CodeFlare Operator.
To configure the distributed workloads feature for your data scientists to use in {productname-short}, you must create the required Kueue resources and, optionally, configure the CodeFlare Operator.
endif::[]
ifdef::upstream[]
To configure the distributed workloads feature for your data scientists to use in {productname-short}, you must enable several components in the {productname-long} Operator, create the required Kueue resources, and optionally configure the CodeFlare Operator.
To configure the distributed workloads feature for your data scientists to use in {productname-short}, you must create the required Kueue resources and, optionally, configure the CodeFlare Operator.
endif::[]
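
For orientation, the required Kueue resources are typically a `ResourceFlavor`, a `ClusterQueue`, and a `LocalQueue`. The following is a minimal sketch only; the resource names, namespace, and quota values are illustrative, and it assumes the `kueue.x-k8s.io/v1beta1` API:

[source,yaml]
----
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor            # illustrative name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue             # illustrative name
spec:
  namespaceSelector: {}           # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 9           # illustrative quota
      - name: "memory"
        nominalQuota: 36Gi        # illustrative quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: my-project           # illustrative data science project namespace
  name: local-queue
spec:
  clusterQueue: cluster-queue
----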

include::modules/configuring-the-distributed-workloads-components.adoc[leveloffset=+1]
include::modules/configuring-quota-management-for-distributed-workloads.adoc[leveloffset=+1]
include::modules/configuring-the-codeflare-operator.adoc[leveloffset=+1]

2 changes: 2 additions & 0 deletions assemblies/installing-odh-v2.adoc
@@ -38,6 +38,8 @@ include::modules/installing-the-odh-operator-v2.adoc[leveloffset=+1]

include::modules/installing-odh-components.adoc[leveloffset=+1]

include::modules/installing-the-distributed-workloads-components.adoc[leveloffset=+1]

include::modules/accessing-the-odh-dashboard.adoc[leveloffset=+1]

ifdef::parent-context[:context: {parent-context}]
@@ -21,13 +21,28 @@ ifdef::cloud-service[]
* You have downloaded and installed the OpenShift command-line interface (CLI). See link:https://docs.openshift.com/dedicated/cli_reference/openshift_cli/getting-started-cli.html#installing-openshift-cli[Installing the OpenShift CLI] (Red Hat OpenShift Dedicated) or link:https://docs.openshift.com/rosa/cli_reference/openshift_cli/getting-started-cli.html#installing-openshift-cli[Installing the OpenShift CLI] (Red Hat OpenShift Service on AWS).
endif::[]

ifndef::upstream[]
* You have enabled the required distributed workloads components as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/configuring-distributed-workloads_distributed-workloads#configuring-the-distributed-workloads-components_distributed-workloads[Configuring the distributed workloads components].
endif::[]

ifdef::upstream[]
* You have enabled the required distributed workloads components as described in link:{odhdocshome}/working-with-distributed-workloads/#configuring-the-distributed-workloads-components_distributed-workloads[Configuring the distributed workloads components].
* You have installed the required distributed workloads components as described in link:{odhdocshome}/installing-open-data-hub/#installing-distributed-workload-components_installv2[Installing the distributed workloads components].
endif::[]

ifdef::self-managed[]

ifndef::disconnected[]
* You have installed the required distributed workloads components as described in link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}/installing-and-deploying-openshift-ai_install#installing-distributed-workload-components_component-install[Installing the distributed workloads components].
endif::[]

ifdef::disconnected[]
* You have installed the required distributed workloads components as described in link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}_in_a_disconnected_environment/installing-and-deploying-openshift-ai_install#installing-distributed-workload-components_component-install[Installing the distributed workloads components].
endif::[]

endif::[]

ifdef::cloud-service[]
* You have installed the required distributed workloads components as described in link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}/installing-and-deploying-openshift-ai_install#installing-distributed-workload-components_component-install[Installing the distributed workloads components].
endif::[]


ifndef::upstream[]
* You have created a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the *Standard Data Science* notebook. For information about how to create a project, see link:{rhoaidocshome}{default-format-url}/working_on_data_science_projects/using-data-science-projects_projects#creating-a-data-science-project_projects[Creating a data science project].
endif::[]
23 changes: 19 additions & 4 deletions modules/configuring-the-codeflare-operator.adoc
@@ -14,13 +14,28 @@ ifdef::cloud-service[]
* You have logged in to OpenShift with the `cluster-admin` role.
endif::[]

ifndef::upstream[]
* You have enabled the required distributed workloads components as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/configuring-distributed-workloads_distributed-workloads#configuring-the-distributed-workloads-components_distributed-workloads[Configuring the distributed workloads components].
endif::[]

ifdef::upstream[]
* You have enabled the required distributed workloads components as described in link:{odhdocshome}/working-with-distributed-workloads/#configuring-the-distributed-workloads-components_distributed-workloads[Configuring the distributed workloads components].
* You have installed the required distributed workloads components as described in link:{odhdocshome}/installing-open-data-hub/#installing-distributed-workload-components_installv2[Installing the distributed workloads components].
endif::[]

ifdef::self-managed[]

ifndef::disconnected[]
* You have installed the required distributed workloads components as described in link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}/installing-and-deploying-openshift-ai_install#installing-distributed-workload-components_component-install[Installing the distributed workloads components].
endif::[]

ifdef::disconnected[]
* You have installed the required distributed workloads components as described in link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}_in_a_disconnected_environment/installing-and-deploying-openshift-ai_install#installing-distributed-workload-components_component-install[Installing the distributed workloads components].
endif::[]

endif::[]

ifdef::cloud-service[]
* You have installed the required distributed workloads components as described in link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}/installing-and-deploying-openshift-ai_install#installing-distributed-workload-components_component-install[Installing the distributed workloads components].
endif::[]



.Procedure
ifdef::upstream,self-managed[]
4 changes: 2 additions & 2 deletions modules/configuring-the-training-job.adoc
@@ -17,10 +17,10 @@ ifdef::cloud-service[]
endif::[]

ifndef::upstream[]
* You have access to a data science cluster that is configured to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/configuring-distributed-workloads_distributed-workloads[Configuring distributed workloads].
* You have access to a data science cluster that is configured to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/managing_distributed_workloads[Managing distributed workloads].
endif::[]
ifdef::upstream[]
* You have access to a data science cluster that is configured to run distributed workloads as described in link:{odhdocshome}/working-with-distributed-workloads/#configuring-distributed-workloads_distributed-workloads[Configuring distributed workloads].
* You have access to a data science cluster that is configured to run distributed workloads as described in link:{odhdocshome}/managing-openshift-ai/#managing_distributed_workloads[Managing distributed workloads].
endif::[]

ifndef::upstream[]
@@ -10,10 +10,10 @@ If you do not want to run distributed workloads from notebooks, you can skip this

.Prerequisites
ifndef::upstream[]
* You can access a data science cluster that is configured to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/configuring-distributed-workloads_distributed-workloads[Configuring distributed workloads].
* You can access a data science cluster that is configured to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/managing_distributed_workloads[Managing distributed workloads].
endif::[]
ifdef::upstream[]
* You can access a data science cluster that is configured to run distributed workloads as described in link:{odhdocshome}/working-with-distributed-workloads/#configuring-distributed-workloads_distributed-workloads[Configuring distributed workloads].
* You can access a data science cluster that is configured to run distributed workloads as described in link:{odhdocshome}/managing-openshift-ai/#managing_distributed_workloads[Managing distributed workloads].
endif::[]

ifndef::upstream[]
@@ -1,20 +1,19 @@
:_module-type: PROCEDURE

[id="configuring-the-distributed-workloads-components_{context}"]
= Configuring the distributed workloads components
[id="installing-the-distributed-workloads-components_{context}"]
= Installing the distributed workloads components

[role='_abstract']
To configure the distributed workloads feature for your data scientists to use in {productname-short}, you must enable several components.
To use the distributed workloads feature in {productname-short}, you must install several components.

.Prerequisites
ifdef::upstream,self-managed[]
* You have logged in to {openshift-platform} with the `cluster-admin` role.
* You have logged in to {openshift-platform} with the `cluster-admin` role and you can access the data science cluster.
endif::[]
ifdef::cloud-service[]
* You have logged in to OpenShift with the `cluster-admin` role.
* You have logged in to OpenShift with the `cluster-admin` role and you can access the data science cluster.
endif::[]

* You have access to the data science cluster.
* You have installed {productname-long}.

ifdef::cloud-service[]
@@ -27,27 +26,6 @@ ifdef::upstream[]
* You have sufficient resources. In addition to the minimum {productname-short} resources described in link:{odhdocshome}/installing-open-data-hub/#installing-the-odh-operator-v2_installv2[Installing the {productname-short} Operator version 2], you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.
endif::[]

* You have access to a Ray cluster image. For information about how to create a Ray cluster, see the link:https://docs.ray.io/en/latest/cluster/getting-started.html[Ray Clusters documentation].
+
ifdef::self-managed[]
[NOTE]
====
Mutual Transport Layer Security (mTLS) is enabled by default in the CodeFlare component in {productname-short}.
{productname-short} {vernum} does not support the `submissionMode=K8sJobMode` setting in the Ray job specification, so the KubeRay Operator cannot create a submitter Kubernetes Job to submit the Ray job.
Instead, users must configure the Ray job specification to set `submissionMode=HTTPMode` only, so that the KubeRay Operator sends a request to the RayCluster to create a Ray job.
====
endif::[]
ifdef::cloud-service[]
[NOTE]
====
Mutual Transport Layer Security (mTLS) is enabled by default in the CodeFlare component in {productname-short}.
{productname-short} does not support the `submissionMode=K8sJobMode` setting in the Ray job specification, so the KubeRay Operator cannot create a submitter Kubernetes Job to submit the Ray job.
Instead, users must configure the Ray job specification to set `submissionMode=HTTPMode` only, so that the KubeRay Operator sends a request to the RayCluster to create a Ray job.
====
endif::[]
* You have access to the data sets and models that the distributed workload uses.
* You have access to the Python dependencies for the distributed workload.

ifndef::upstream[]
* You have removed any previously installed instances of the CodeFlare Operator, as described in the Knowledgebase solution link:https://access.redhat.com/solutions/7043796[How to migrate from a separately installed CodeFlare Operator in your data science cluster].
endif::[]
@@ -159,10 +137,14 @@ endif::[]
. Click the default instance name (for example, *default-dsc*) to open the instance details page.
. Click the *YAML* tab to show the instance specifications.
. Enable the required distributed workloads components.
In the `spec:components` section, set the `managementState` field correctly for the required components.
The `trainingoperator` component is required only if you want to use the Kubeflow Training Operator to tune models.
The list of required components depends on whether the distributed workload is run from a pipeline or notebook or both, as shown in the following table.
In the `spec:components` section, set the `managementState` field correctly for the required components:
+
* If you want to use the CodeFlare framework to tune models, enable the `codeflare`, `kueue`, and `ray` components.
* If you want to use the Kubeflow Training Operator to tune models, enable the `kueue` and `trainingoperator` components.
* The list of required components depends on whether the distributed workload is run from a pipeline, from a notebook, or from both, as shown in the following table and in the sketch after the verification steps.
+
.Components required for distributed workloads
[cols="38,18,18,26"]
|===
@@ -230,3 +212,12 @@ In each case, check the status as follows:
+
When the status of the *codeflare-operator-manager-_<pod-id>_*, *kuberay-operator-_<pod-id>_*, and *kueue-controller-manager-_<pod-id>_* pods is *Running*, the pods are ready to use.
.. To see more information about each pod, click the pod name to open the pod details page, and then click the *Logs* tab.
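
For reference, the following minimal sketch shows the `spec:components` entries described in the earlier step. The instance name and the choice to enable all four components are illustrative; set `managementState` to `Managed` only for the components that your workloads require:

[source,yaml]
----
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc               # illustrative instance name
spec:
  components:
    codeflare:
      managementState: Managed    # CodeFlare framework
    kueue:
      managementState: Managed    # required for quota management in all cases
    ray:
      managementState: Managed    # Ray-based distributed workloads
    trainingoperator:
      managementState: Managed    # only if tuning models with the Kubeflow Training Operator
----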

.Next step
ifdef::upstream[]
Configure the distributed workloads feature as described in link:{odhdocshome}/managing-openshift-ai/#managing_distributed_workloads[Managing distributed workloads].
endif::[]
ifndef::upstream[]
Configure the distributed workloads feature as described in link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/managing_distributed_workloads[Managing distributed workloads].
endif::[]

4 changes: 2 additions & 2 deletions modules/monitoring-the-training-job.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,10 @@ The example training job in this section is based on the IBM and Hugging Face tu
.Prerequisites

ifndef::upstream[]
* You have access to a data science cluster that is configured to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/configuring-distributed-workloads_distributed-workloads[Configuring distributed workloads].
* You have access to a data science cluster that is configured to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/managing_distributed_workloads[Managing distributed workloads].
endif::[]
ifdef::upstream[]
* You have access to a data science cluster that is configured to run distributed workloads as described in link:{odhdocshome}/working-with-distributed-workloads/#configuring-distributed-workloads_distributed-workloads[Configuring distributed workloads].
* You have access to a data science cluster that is configured to run distributed workloads as described in link:{odhdocshome}/managing-openshift-ai/#managing_distributed_workloads[Managing distributed workloads].
endif::[]

ifndef::upstream[]
@@ -29,7 +29,7 @@ endif::[]


.Procedure
. Configure the disconnected data science cluster to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/configuring-distributed-workloads_distributed-workloads[Configuring distributed workloads].
. Configure the disconnected data science cluster to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/managing_distributed_workloads[Managing distributed workloads].
. In the `ClusterConfiguration` section of the notebook or pipeline, ensure that the `image` value specifies a Ray cluster image that you can access from the disconnected environment, as shown in the sketch after this list:
* Notebooks use the Ray cluster image to create a Ray cluster when running the notebook.
* Pipelines use the Ray cluster image to create a Ray cluster during the pipeline run.
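
As an illustration only, the `image` value from `ClusterConfiguration` ends up in the generated `RayCluster` resource. The fragment below assumes the KubeRay `ray.io/v1` API; the cluster name, registry host, and tag are hypothetical, and the only requirement is that the image is pullable from inside the disconnected network:

[source,yaml]
----
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raytest                   # illustrative name
spec:
  headGroupSpec:
    template:
      spec:
        containers:
        - name: ray-head
          # Hypothetical local mirror registry reachable from the
          # disconnected environment:
          image: mirror-registry.example.com/rayproject/ray:2.9.0
----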
@@ -8,45 +8,12 @@ To run a distributed workload from a pipeline, you must first update the pipelin

.Prerequisites
ifndef::upstream[]
* You can access a data science cluster that is configured to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/configuring-distributed-workloads_distributed-workloads[Configuring distributed workloads].
* You can access a data science cluster that is configured to run distributed workloads as described in link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/managing_distributed_workloads[Managing distributed workloads].
endif::[]
ifdef::upstream[]
* You can access a data science cluster that is configured to run distributed workloads as described in link:{odhdocshome}/working-with-distributed-workloads/#configuring-distributed-workloads_distributed-workloads[Configuring distributed workloads].
* You can access a data science cluster that is configured to run distributed workloads as described in link:{odhdocshome}/managing-openshift-ai/#managing_distributed_workloads[Managing distributed workloads].
endif::[]

ifndef::upstream[]
* Your cluster administrator has created the required Kueue resources as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/configuring-distributed-workloads_distributed-workloads#configuring-quota-management-for-distributed-workloads_distributed-workloads[Configuring quota management for distributed workloads].
endif::[]
ifdef::upstream[]
* Your cluster administrator has created the required Kueue resources as described in link:{odhdocshome}/working-with-distributed-workloads/#configuring-quota-management-for-distributed-workloads_distributed-workloads[Configuring quota management for distributed workloads].
endif::[]

ifndef::upstream[]
* Optional: Your cluster administrator has defined a _default_ local queue for the Ray cluster by creating a `LocalQueue` resource and adding the following annotation to the configuration details for that `LocalQueue` resource, as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/configuring-distributed-workloads_distributed-workloads#configuring-quota-management-for-distributed-workloads_distributed-workloads[Configuring quota management for distributed workloads]:
+
[source,bash]
----
"kueue.x-k8s.io/default-queue": "true"
----
+
[NOTE]
====
If your cluster administrator does not define a default local queue, you must specify a local queue in each pipeline.
====
endif::[]
ifdef::upstream[]
* Optional: Your cluster administrator has defined a _default_ local queue for the Ray cluster by creating a `LocalQueue` resource and adding the following annotation to the configuration details for that `LocalQueue` resource, as described in link:{odhdocshome}/working-with-distributed-workloads/#configuring-quota-management-for-distributed-workloads_distributed-workloads[Configuring quota management for distributed workloads]:
+
[source,bash]
----
"kueue.x-k8s.io/default-queue": "true"
----
+
[NOTE]
====
If your cluster administrator does not define a default local queue, you must specify a local queue in each pipeline.
====
endif::[]

* You can access the following software from your data science cluster:
** A Ray cluster image that is compatible with your hardware architecture