From 0c3468788d698e77119ae79401fb8caad95a75f7 Mon Sep 17 00:00:00 2001 From: skirui-source Date: Mon, 27 Mar 2023 14:46:57 -0700 Subject: [PATCH 01/13] create new pr --- .../rapids-azureml-hpo/notebook.ipynb | 648 ++++++++++++++++++ 1 file changed, 648 insertions(+) create mode 100644 source/examples/rapids-azureml-hpo/notebook.ipynb diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb new file mode 100644 index 00000000..b93a514b --- /dev/null +++ b/source/examples/rapids-azureml-hpo/notebook.ipynb @@ -0,0 +1,648 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Train and hyperparameter tune with RAPIDS" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Create an Azure ML Workspace and setup environmnet on local computer following the steps in [Azure README.md](https://github.com/rapidsai/cloud-ml-examples/blob/main/azure/README.md)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# !pip install azure-ai-ml" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check Azure ML SDK version\n", + "\n", + "!pip show azure-ai-ml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialize workspace" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Load and initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the prerequisites step." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.ml import MLClient\n", + "from azure.identity import DefaultAzureCredential\n", + "\n", + "# Enter details of your Azure Machine Learning workspace\n", + "subscription_id = ''\n", + "resource_group = ''\n", + "workspace = ''\n", + "datastore_name = ''\n", + "path_on_datastore 'airline_20000000.parquet'\n", + "\n", + "# connect to the workspace\n", + "ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Create a FileDataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this example, we will use 20 million rows (samples) of the [airline dataset](http://kt.ijs.si/elena_ikonomovska/data.html). The [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py) below references parquet files that have been uploaded to a public [Azure Blob storage](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-overview), you can download to your local computer or mount the files to your AML compute." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# from azureml.fsspec import AzureMachineLearningFileSystem\n", + "\n", + "# fs = AzureMachineLearningFileSystem(\"azureml://subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapids-deployment-doc/workspaces/skirui-azureml-rapids/datastores/workspaceartifactstore/paths/airline_20000000.parquet\")\n", + "# fs.ls()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# long-form Datastore uri format:\n", + "uri = f\"azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.fsspec import AzureMachineLearningFileSystem\n", + "\n", + "fs = AzureMachineLearningFileSystem(uri)\n", + "fs.ls()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create AML compute" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this notebook, we will use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training using a dynamically scalable pool of compute resources.\n", + "\n", + "This notebook will use 10 nodes for hyperparameter optimization, you can modify `max_instances` based on available quota in the desired region. 
Similar to other Azure services, there are limits on AmlCompute, this [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`size` describes the virtual machine type and size that will be used in the cluster. RAPIDS requires NVIDIA Pascal or newer architecture, you will need to specify compute targets from one of `NC_v2`, `NC_v3`, `ND` or `ND_v2` [GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu); these are VMs that are provisioned with P40 and V100 GPUs. \n", + "\n", + "Let's create an `AmlCompute` cluster of `Standard_NC12s_v3` GPU VMs:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "ename": "ModuleNotFoundError", + "evalue": "No module named 'azure'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[1], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mazure\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mai\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mml\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mentities\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m AmlCompute\n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m# specify aml compute name. 
# choose a name for your cluster\u001b[39;00m\n\u001b[1;32m 4\u001b[0m gpu_compute_target \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mgpu-cluster\u001b[39m\u001b[38;5;124m\"\u001b[39m\n", + "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'azure'" + ] + } + ], + "source": [ + "from azure.ai.ml.entities import AmlCompute\n", + "\n", + "# specify aml compute name. # choose a name for your cluster\n", + "gpu_compute_target = \"gpu-cluster\"\n", + "\n", + "# check if desired amlcompute exists, if not create new\n", + "try:\n", + " ml_client.compute.get(gpu_compute_target)\n", + "except Exception:\n", + " print(\"Creating a new gpu compute target...\")\n", + "\n", + " gpu_compute = AmlCompute(\n", + " name=gpu_compute_target,\n", + " size=\"STANDARD_NC12S_V3\",\n", + " min_instances=0,\n", + " max_instances=4,\n", + " )\n", + " ml_client.compute.begin_create_or_update(compute).result()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prepare training script" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a project directory that contains your code, including the training script and additional files / dependencies. In this example, the training script is provided:\n", + "
\n", + "`train_rapids.py` - entry script for RAPIDS Estimator that includes loading dataset into cuDF data frame, training with Random Forest and inference using cuML." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "project_folder = \"./train_rapids\" # create folder in same dir\n", + "os.makedirs(project_folder, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will log some metrics by using the `Run` object within the training script:\n", + "\n", + "```python\n", + "from azureml.core.run import Run\n", + "run = Run.get_context()\n", + "```\n", + " \n", + "We will also log the parameters and highest accuracy the model achieves:\n", + "\n", + "```python\n", + "run.log('Accuracy', np.float(accuracy))\n", + "```\n", + "\n", + "These run metrics will become particularly important when we begin hyperparameter tuning our model in the 'Tune model hyperparameters' section." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copy the training script `train_rapids.py` into your project directory:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "notebook_path = os.path.realpath(\n", + " \"__file__\" + \"/../../code\"\n", + ") # dir containing the training scrips\n", + "rapids_script = os.path.join(notebook_path, \"train_rapids.py\")\n", + "azure_script = os.path.join(notebook_path, \"rapids_csp_azure.py\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "shutil.copy(rapids_script, project_folder)\n", + "shutil.copy(azure_script, project_folder)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "notebook_path" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train model on the remote compute" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that you have your data and training script prepared, you are ready to train on your remote compute." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create experiment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Use Custom Docker Image" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll be using a custom docker image to setup the environment. This is available in [rapidsai/rapidsai-cloud-ml on DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-cloud-ml/tags?page=1&ordering=last_updated). 
This image contains all necessary packages to run the example on Azure." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "my_env = Environment(\n", + " image=\"\", # base image to use\n", + " name=\"\", # name of the model\n", + " description=\"Rapids v23.02 docker container\",\n", + ")\n", + "\n", + "ml_client.environments.create_or_update(my_env) # register the environment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.core import Environment\n", + "from azureml.core.runconfig import DockerConfiguration\n", + "\n", + "environment_name = \"rapids_hpo\"\n", + "env = Environment(environment_name)\n", + "\n", + "# enable docker\n", + "docker_config = DockerConfiguration(use_docker=True)\n", + "\n", + "# rapids-cloud-ml image is available in Docker Hub\n", + "env.docker.base_image = \"rapidsai/rapidsai-cloud-ml:latest\"\n", + "\n", + "# use rapids environment in the container, don't build a new conda environment\n", + "env.python.user_managed_dependencies = True" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare RAPIDS Training Script " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will use the [ScriptRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.scriptrunconfig?view=azure-ml-py) class to submit our job, [Estimators](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-migrate-from-estimators-to-scriptrunconfig) have been deprecated. 
\n", + "\n", + "`arguments` is a dictionary of command-line arguments to pass to the training script." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# airline_ds=fs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# from azureml.core import ScriptRunConfig\n", + "\n", + "# arguments = [\n", + "# '--data_dir', airline_ds,\n", + "# '--n_bins', 32,\n", + "# '--compute', 'single-GPU', # set to multi-GPU for algorithms via Dask\n", + "# '--cv_folds', 5,\n", + "# ]\n", + "\n", + "# src = ScriptRunConfig(source_directory=project_folder,\n", + "# arguments=arguments,\n", + "# compute_target=gpu_cluster,\n", + "# script='train_rapids.py',\n", + "# environment=env, #docker is the environment\n", + "# docker_runtime_config=docker_config)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tune model hyperparameters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can optimize our model's hyperparameters and improve the accuracy using Azure Machine Learning's hyperparameter tuning capabilities." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Start a hyperparameter sweep" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's define the hyperparameter space to sweep over. We will tune `n_estimators`, `max_depth` and `max_features` parameters. In this example we will use random sampling to try different configuration sets of hyperparameters and maximize `Accuracy`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.train.hyperdrive.runconfig import HyperDriveConfig\n", + "from azureml.train.hyperdrive.sampling import RandomParameterSampling\n", + "from azureml.train.hyperdrive.run import PrimaryMetricGoal\n", + "from azureml.train.hyperdrive.parameter_expressions import choice, loguniform, uniform\n", + "\n", + "param_sampling = RandomParameterSampling(\n", + " {\n", + " \"--n_estimators\": choice(range(50, 500)),\n", + " \"--max_depth\": choice(range(5, 19)),\n", + " \"--max_features\": uniform(0.2, 1.0),\n", + " }\n", + ")\n", + "\n", + "hyperdrive_run_config = HyperDriveConfig(\n", + " run_config=src,\n", + " hyperparameter_sampling=param_sampling,\n", + " primary_metric_name=\"Accuracy\",\n", + " primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,\n", + " max_total_runs=10,\n", + " max_concurrent_runs=5,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This will launch the RAPIDS training script with parameters that were specified in the cell above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# start the HyperDrive run\n", + "run = Experiment(ws, experiment_name).submit(hyperdrive_run_config)\n", + "run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Monitor HyperDrive runs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Monitor and view the progress of the machine learning training run with a [Jupyter widget](https://docs.microsoft.com/en-us/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py).The widget is asynchronous and provides live updates every 10-15 seconds until the job completes." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.widgets import RunDetails\n", + "\n", + "RunDetails(run).show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "outputPrepend" + ] + }, + "outputs": [], + "source": [ + "run.wait_for_completion(show_output=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# run.cancel()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Find and register best model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "best_run = hyperdrive_run.get_best_run_by_primary_metric()\n", + "print(best_run.get_details()[\"runDefinition\"][\"arguments\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "List the model files uploaded during the run:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(best_run.get_file_names())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Register the folder (and all files in it) as a model named `train-rapids` under the workspace for deployment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# model = best_run.register_model(model_name='train-rapids', model_path='outputs/model-rapids.joblib')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Delete cluster" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# gpu_cluster.delete()" + ] + } + ], + "metadata": { + "kernel_info": { + "name": "rapids" + }, + "kernelspec": { + "display_name": "rapids", + "language": "python", + "name": "rapids" + }, + "language_info": { + "codemirror_mode": { + 
"name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.10" + }, + "microsoft": { + "ms_spell_check": { + "ms_spell_check_language": "en" + } + }, + "nteract": { + "version": "nteract-front-end@1.0.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 9db7309e78bf7f7fd4dc30daf59a2681df25e384 Mon Sep 17 00:00:00 2001 From: skirui-source Date: Mon, 3 Apr 2023 23:50:44 -0700 Subject: [PATCH 02/13] minor changes --- .../rapids-azureml-hpo/notebook.ipynb | 447 +++++++----------- 1 file changed, 184 insertions(+), 263 deletions(-) diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb index b93a514b..2d2c92c6 100644 --- a/source/examples/rapids-azureml-hpo/notebook.ipynb +++ b/source/examples/rapids-azureml-hpo/notebook.ipynb @@ -18,27 +18,36 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "- Create an Azure ML Workspace and setup environmnet on local computer following the steps in [Azure README.md](https://github.com/rapidsai/cloud-ml-examples/blob/main/azure/README.md)" + "Create an Azure ML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) and setup environmnet on local computer following the steps in [??????] 
or run in Compute Instance\n" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# !pip install azure-ai-ml" - ] - }, - { - "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Name: azure-ai-ml\n", + "Version: 1.2.0\n", + "Summary: Microsoft Azure Machine Learning Client Library for Python\n", + "Home-page: https://github.com/Azure/azure-sdk-for-python\n", + "Author: Microsoft Corporation\n", + "Author-email: azuresdkengsysadmins@microsoft.com\n", + "License: MIT License\n", + "Location: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages\n", + "Requires: azure-common, azure-core, azure-mgmt-core, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, opencensus-ext-azure, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions\n", + "Required-by: \n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], "source": [ - "# check Azure ML SDK version\n", + "# verify Azure ML SDK version\n", "\n", - "!pip show azure-ai-ml" + "%pip show azure-ai-ml" ] }, { @@ -52,27 +61,44 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Load and initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the prerequisites step." + "Initialize `MLClient` class to handle the workspace you created in the prerequisites step. 
`MLClient.from_config(credential, path)`\n", + "creates a workspace object from the details stored in `config.json`" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Workspace name: rapids-aml-cluster\n", + "Subscription id: fc4f4a6b-4041-4b1c-8249-854d68edcf62\n", + "Resource group: rapidsai-deployment\n" + ] + } + ], "source": [ "from azure.ai.ml import MLClient\n", "from azure.identity import DefaultAzureCredential\n", "\n", - "# Enter details of your Azure Machine Learning workspace\n", - "subscription_id = ''\n", - "resource_group = ''\n", - "workspace = ''\n", - "datastore_name = ''\n", - "path_on_datastore 'airline_20000000.parquet'\n", "\n", - "# connect to the workspace\n", - "ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)" + "# Get a handle to the workspace\n", + "ml_client = MLClient(\n", + " credential=DefaultAzureCredential(),\n", + " subscription_id=\"fc4f4a6b-4041-4b1c-8249-854d68edcf62\",\n", + " resource_group_name=\"rapidsai-deployment\",\n", + " workspace_name=\"rapids-aml-cluster\",\n", + ")\n", + "\n", + "print(\n", + " \"Workspace name: \" + ml_client.workspace_name,\n", + " \"Subscription id: \" + ml_client.subscription_id,\n", + " \"Resource group: \" + ml_client.resource_group_name,\n", + " sep=\"\\n\",\n", + ")" ] }, { @@ -81,48 +107,38 @@ "tags": [] }, "source": [ - "## Create a FileDataset" + "## Access data from Datastore URI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In this example, we will use 20 million rows (samples) of the [airline dataset](http://kt.ijs.si/elena_ikonomovska/data.html). 
The [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py) below references parquet files that have been uploaded to a public [Azure Blob storage](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-overview), you can download to your local computer or mount the files to your AML compute." + "In this example, we will use 20 million rows (samples) of the airline dataset. The [datastore uri](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls#access-data-from-a-datastore-uri-like-a-filesystem-preview) below references a data storage location (path) containing the parquet files, you can download to your local computer or mount the files to your AML compute.\n" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "data uri: \n", + " azureml://subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster/datastores/workspaceartifactstore/paths/airline_20000000.parquet\n" + ] + } + ], "source": [ - "# from azureml.fsspec import AzureMachineLearningFileSystem\n", + "datastore_name = \"workspaceartifactstore\"\n", + "dataset = \"airline_20000000.parquet\"\n", "\n", - "# fs = AzureMachineLearningFileSystem(\"azureml://subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapids-deployment-doc/workspaces/skirui-azureml-rapids/datastores/workspaceartifactstore/paths/airline_20000000.parquet\")\n", - "# fs.ls()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# long-form Datastore uri format:\n", - "uri = f\"azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}\"" - ] - }, - { - "cell_type": "code", 
- "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.fsspec import AzureMachineLearningFileSystem\n", + "# Datastore uri format:\n", + "data_uri = f\"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}\"\n", "\n", - "fs = AzureMachineLearningFileSystem(uri)\n", - "fs.ls()" + "print(\"data uri:\", \"\\n\", data_uri)" ] }, { @@ -138,102 +154,75 @@ "source": [ "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this notebook, we will use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training using a dynamically scalable pool of compute resources.\n", "\n", - "This notebook will use 10 nodes for hyperparameter optimization, you can modify `max_instances` based on available quota in the desired region. Similar to other Azure services, there are limits on AmlCompute, this [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota." + "This notebook will use 10 nodes for hyperparameter optimization, you can modify `max_instances` based on available quota in the desired region. Similar to other Azure ML services, there are limits on AmlCompute, this [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "`size` describes the virtual machine type and size that will be used in the cluster. 
RAPIDS requires NVIDIA Pascal or newer architecture, you will need to specify compute targets from one of `NC_v2`, `NC_v3`, `ND` or `ND_v2` [GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu); these are VMs that are provisioned with P40 and V100 GPUs. \n", + "`size` describes the virtual machine type and size that will be used in the cluster. RAPIDS requires NVIDIA Pascal or newer architecture, so \n", + "you will need to select compute targets from one of the \n", + "[GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provisioned with P40 and V100 GPUs : `NC_v2`, `NC_v3`, `ND` or `ND_v2` \n", "\n", "Let's create an `AmlCompute` cluster of `Standard_NC12s_v3` GPU VMs:" ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 4, "metadata": {}, "outputs": [ { - "ename": "ModuleNotFoundError", - "evalue": "No module named 'azure'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[1], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mazure\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mai\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mml\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mentities\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m AmlCompute\n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m# specify aml compute name. 
# choose a name for your cluster\u001b[39;00m\n\u001b[1;32m 4\u001b[0m gpu_compute_target \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mgpu-cluster\u001b[39m\u001b[38;5;124m\"\u001b[39m\n", - "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'azure'" + "name": "stdout", + "output_type": "stream", + "text": [ + "found compute target. Will use gpu-cluster\n" ] } ], "source": [ "from azure.ai.ml.entities import AmlCompute\n", "\n", - "# specify aml compute name. # choose a name for your cluster\n", + "# specify aml compute name.\n", "gpu_compute_target = \"gpu-cluster\"\n", "\n", - "# check if desired amlcompute exists, if not create new\n", "try:\n", - " ml_client.compute.get(gpu_compute_target)\n", - "except Exception:\n", + " # let's see if the compute target already exists\n", + " gpu_target = ml_client.compute.get(gpu_compute_target)\n", + " print(f\"found compute target. Will use {gpu_compute_target}\")\n", + "except:\n", " print(\"Creating a new gpu compute target...\")\n", "\n", - " gpu_compute = AmlCompute(\n", - " name=gpu_compute_target,\n", + " gpu_target = AmlCompute(\n", + " name=\"gpu-cluster\",\n", + " type=\"amlcompute\",\n", " size=\"STANDARD_NC12S_V3\",\n", - " min_instances=0,\n", - " max_instances=4,\n", + " max_instances=5,\n", + " idle_time_before_scale_down=300,\n", " )\n", - " ml_client.compute.begin_create_or_update(compute).result()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prepare training script" + " ml_client.compute.begin_create_or_update(gpu_target).result()\n", + "\n", + " print(\n", + " f\"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}\"\n", + " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Create a project directory that contains your code, including the training script and additional files / dependencies. In this example, the training script is provided:\n", - "
\n", - "`train_rapids.py` - entry script for RAPIDS Estimator that includes loading dataset into cuDF data frame, training with Random Forest and inference using cuML." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", "\n", - "project_folder = \"./train_rapids\" # create folder in same dir\n", - "os.makedirs(project_folder, exist_ok=True)" + "## Prepare training script" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We will log some metrics by using the `Run` object within the training script:\n", - "\n", - "```python\n", - "from azureml.core.run import Run\n", - "run = Run.get_context()\n", - "```\n", - " \n", - "We will also log the parameters and highest accuracy the model achieves:\n", - "\n", - "```python\n", - "run.log('Accuracy', np.float(accuracy))\n", - "```\n", + "Create a project directory with your code to run on the remote resource. This includes the training script and additional files your training script depends on. In this example, the training script is provided:\n", "\n", - "These run metrics will become particularly important when we begin hyperparameter tuning our model in the 'Tune model hyperparameters' section." + "`train_rapids.py`- entry script for RAPIDS Environment, includes loading dataset into cuDF dataframe, training with Random Forest and inference using cuML." 
] }, { @@ -249,11 +238,10 @@ "metadata": {}, "outputs": [], "source": [ - "notebook_path = os.path.realpath(\n", - " \"__file__\" + \"/../../code\"\n", - ") # dir containing the training scrips\n", - "rapids_script = os.path.join(notebook_path, \"train_rapids.py\")\n", - "azure_script = os.path.join(notebook_path, \"rapids_csp_azure.py\")" + "import os\n", + "\n", + "project_folder = \"./train_rapids\" # create folder in same dir\n", + "os.makedirs(project_folder, exist_ok=True)" ] }, { @@ -264,19 +252,16 @@ "source": [ "import shutil\n", "\n", + "\n", + "notebook_path = os.path.realpath(\"__file__\" + \"/../../code\")\n", + "rapids_script = os.path.join(notebook_path, \"train_rapids.py\")\n", + "azure_script = os.path.join(notebook_path, \"rapids_csp_azure.py\")\n", + "\n", + "\n", "shutil.copy(rapids_script, project_folder)\n", "shutil.copy(azure_script, project_folder)" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "notebook_path" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -288,21 +273,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now that you have your data and training script prepared, you are ready to train on your remote compute." + "Now that you have your data and training script prepared, you are ready to train on your remote compute:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Create experiment" + "### Create experiment\n", + "\n", + "Track all the runs in your workspace" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace." 
+ "experiment_name = \"test_rapids_gpu_cluster\"" ] }, { @@ -316,81 +305,39 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We'll be using a custom docker image to setup the environment. This is available in [rapidsai/rapidsai-cloud-ml on DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-cloud-ml/tags?page=1&ordering=last_updated). This image contains all necessary packages to run the example on Azure." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "my_env = Environment(\n", - " image=\"\", # base image to use\n", - " name=\"\", # name of the model\n", - " description=\"Rapids v23.02 docker container\",\n", - ")\n", - "\n", - "ml_client.environments.create_or_update(my_env) # register the environment" + "We'll be using a [custom](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image) RAPIDS docker image to setup the environment. This is available in [rapidsai/rapidsai repo](https://hub.docker.com/r/rapidsai/rapidsai/) on DockerHub." 
] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "from azureml.core import Environment\n", - "from azureml.core.runconfig import DockerConfiguration\n", - "\n", - "environment_name = \"rapids_hpo\"\n", - "env = Environment(environment_name)\n", - "\n", - "# enable docker\n", - "docker_config = DockerConfiguration(use_docker=True)\n", + "from azure.ai.ml.entities import Environment, BuildContext\n", "\n", - "# rapids-cloud-ml image is available in Docker Hub\n", - "env.docker.base_image = \"rapidsai/rapidsai-cloud-ml:latest\"\n", + "env_docker_image = Environment(\n", + " build=BuildContext(path=\"./docker\"),\n", + " name=\"rapids-docker-image-2302\",\n", + " description=\"Rapids v23.02 Environment\",\n", + ")\n", "\n", - "# use rapids environment in the container, don't build a new conda environment\n", - "env.python.user_managed_dependencies = True" + "ml_client.environments.create_or_update(env_docker_image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Prepare RAPIDS Training Script " + "### Submit the training job " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We will use the [ScriptRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.scriptrunconfig?view=azure-ml-py) class to submit our job, [Estimators](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-migrate-from-estimators-to-scriptrunconfig) have been deprecated. \n", - "\n", - "`arguments` is a dictionary of command-line arguments to pass to the training script." + "We will configure and run a training job using the`command`class. 
The [command](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml?view=azure-python#azure-ai-ml-command) can be used to run standalone jobs or as a function inside pipelines.\n", + "`inputs` is a dictionary of command-line arguments to pass to the training script.\n" ] }, { @@ -399,30 +346,33 @@ "metadata": {}, "outputs": [], "source": [ - "# airline_ds=fs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# from azureml.core import ScriptRunConfig\n", + "from azure.ai.ml import command, Input\n", "\n", - "# arguments = [\n", - "# '--data_dir', airline_ds,\n", - "# '--n_bins', 32,\n", - "# '--compute', 'single-GPU', # set to multi-GPU for algorithms via Dask\n", - "# '--cv_folds', 5,\n", - "# ]\n", "\n", - "# src = ScriptRunConfig(source_directory=project_folder,\n", - "# arguments=arguments,\n", - "# compute_target=gpu_cluster,\n", - "# script='train_rapids.py',\n", - "# environment=env, #docker is the environment\n", - "# docker_runtime_config=docker_config)" + "command_job = command(\n", + " environment=\"rapids-docker-image-2302:2\",\n", + " experiment_name=experiment_name,\n", + " code=project_folder,\n", + " command=\"python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} --compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}}\\\n", + " --n_estimators ${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} --max_features ${{inputs.max_features}}\",\n", + " inputs={\n", + " \"data_dir\": Input(type=\"uri_file\", path=data_uri),\n", + " \"n_bins\": 32,\n", + " \"compute\": \"single-GPU\", # multi-GPU for algorithms via Dask\n", + " \"cv_folds\": 5,\n", + " \"n_estimators\": 50,\n", + " \"max_depth\": 10,\n", + " \"max_features\": 1.0,\n", + " },\n", + " compute=\"gpu-cluster\",\n", + ")\n", + "\n", + "\n", + "# submit the command\n", + "returned_job = ml_client.jobs.create_or_update(command_job)\n", + "\n", + "# get a URL for the status of the 
job\n", + "returned_job.studio_url" ] }, { @@ -459,27 +409,24 @@ "metadata": {}, "outputs": [], "source": [ - "from azureml.train.hyperdrive.runconfig import HyperDriveConfig\n", - "from azureml.train.hyperdrive.sampling import RandomParameterSampling\n", - "from azureml.train.hyperdrive.run import PrimaryMetricGoal\n", - "from azureml.train.hyperdrive.parameter_expressions import choice, loguniform, uniform\n", - "\n", - "param_sampling = RandomParameterSampling(\n", - " {\n", - " \"--n_estimators\": choice(range(50, 500)),\n", - " \"--max_depth\": choice(range(5, 19)),\n", - " \"--max_features\": uniform(0.2, 1.0),\n", - " }\n", + "from azure.ai.ml.sweep import Choice, Uniform, MedianStoppingPolicy\n", + "\n", + "command_job_for_sweep = command_job(\n", + " n_estimators=Choice(values=range(50, 500)),\n", + " max_depth=Choice(values=range(5, 19)),\n", + " max_features=Uniform(min_value=0.2, max_value=1.0),\n", ")\n", "\n", - "hyperdrive_run_config = HyperDriveConfig(\n", - " run_config=src,\n", - " hyperparameter_sampling=param_sampling,\n", - " primary_metric_name=\"Accuracy\",\n", - " primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,\n", - " max_total_runs=10,\n", - " max_concurrent_runs=5,\n", - ")" + "# apply the sweep parameter to obtain the sweep_job\n", + "sweep_job = command_job_for_sweep.sweep(\n", + " compute=\"gpu-cluster\",\n", + " sampling_algorithm=\"random\",\n", + " primary_metric=\"Accuracy\",\n", + " goal=\"Maximize\",\n", + ")\n", + "\n", + "# define the limits for this sweep\n", + "sweep_job.set_limits(max_total_trials=10, max_concurrent_trials=5, timeout=300)" ] }, { @@ -495,23 +442,25 @@ "metadata": {}, "outputs": [], "source": [ - "# start the HyperDrive run\n", - "run = Experiment(ws, experiment_name).submit(hyperdrive_run_config)\n", - "run" + "# submit the hpo job\n", + "returned_sweep_job = ml_client.create_or_update(sweep_job)\n", + "\n", + "# get a URL for the status of the job\n", + "returned_sweep_job.studio_url" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ - "## Monitor HyperDrive runs" + "## Monitor SweepJobs runs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Monitor and view the progress of the machine learning training run with a [Jupyter widget](https://docs.microsoft.com/en-us/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py).The widget is asynchronous and provides live updates every 10-15 seconds until the job completes." + "Monitor and view the progress of the machine learning training run with Mlflow" ] }, { @@ -519,33 +468,7 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "from azureml.widgets import RunDetails\n", - "\n", - "RunDetails(run).show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "outputPrepend" - ] - }, - "outputs": [], - "source": [ - "run.wait_for_completion(show_output=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# run.cancel()" - ] + "source": [] }, { "cell_type": "markdown", @@ -560,8 +483,10 @@ "metadata": {}, "outputs": [], "source": [ - "best_run = hyperdrive_run.get_best_run_by_primary_metric()\n", - "print(best_run.get_details()[\"runDefinition\"][\"arguments\"])" + "# Download best trial model output\n", + "\n", + "best_sweep = ml_client.jobs.download(returned_sweep_job.name, output_name=\"best_model\")\n", + "print(best_sweep)" ] }, { @@ -576,9 +501,7 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "print(best_run.get_file_names())" - ] + "source": [] }, { "cell_type": "markdown", @@ -592,9 +515,7 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "# model = best_run.register_model(model_name='train-rapids', model_path='outputs/model-rapids.joblib')" - ] + "source": [] }, { "cell_type": "markdown", @@ -609,7 +530,7 @@ "metadata": {}, "outputs": [], "source": [ - "# gpu_cluster.delete()" + "# gpu_target.delete()" ] 
} ], @@ -618,9 +539,9 @@ "name": "rapids" }, "kernelspec": { - "display_name": "rapids", + "display_name": "Python 3.10 - SDK v2", "language": "python", - "name": "rapids" + "name": "python310-sdkv2" }, "language_info": { "codemirror_mode": { @@ -632,7 +553,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.10" + "version": "3.10.9" }, "microsoft": { "ms_spell_check": { From 256e0ead98add41784b07666c28d20dd68bcea71 Mon Sep 17 00:00:00 2001 From: skirui-source Date: Mon, 26 Jun 2023 22:37:15 -0700 Subject: [PATCH 03/13] delete old notebook --- .../rapids-azureml-hpo/notebook.ipynb | 569 ------------------ 1 file changed, 569 deletions(-) delete mode 100644 source/examples/rapids-azureml-hpo/notebook.ipynb diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb deleted file mode 100644 index 2d2c92c6..00000000 --- a/source/examples/rapids-azureml-hpo/notebook.ipynb +++ /dev/null @@ -1,569 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Train and hyperparameter tune with RAPIDS" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create an Azure ML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) and setup environmnet on local computer following the steps in [??????] 
or run in Compute Instance\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Name: azure-ai-ml\n", - "Version: 1.2.0\n", - "Summary: Microsoft Azure Machine Learning Client Library for Python\n", - "Home-page: https://github.com/Azure/azure-sdk-for-python\n", - "Author: Microsoft Corporation\n", - "Author-email: azuresdkengsysadmins@microsoft.com\n", - "License: MIT License\n", - "Location: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages\n", - "Requires: azure-common, azure-core, azure-mgmt-core, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, opencensus-ext-azure, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions\n", - "Required-by: \n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "# verify Azure ML SDK version\n", - "\n", - "%pip show azure-ai-ml" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Initialize workspace" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Initialize `MLClient` class to handle the workspace you created in the prerequisites step. 
`MLClient.from_config(credential, path)`\n", - "creates a workspace object from the details stored in `config.json`" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Workspace name: rapids-aml-cluster\n", - "Subscription id: fc4f4a6b-4041-4b1c-8249-854d68edcf62\n", - "Resource group: rapidsai-deployment\n" - ] - } - ], - "source": [ - "from azure.ai.ml import MLClient\n", - "from azure.identity import DefaultAzureCredential\n", - "\n", - "\n", - "# Get a handle to the workspace\n", - "ml_client = MLClient(\n", - " credential=DefaultAzureCredential(),\n", - " subscription_id=\"fc4f4a6b-4041-4b1c-8249-854d68edcf62\",\n", - " resource_group_name=\"rapidsai-deployment\",\n", - " workspace_name=\"rapids-aml-cluster\",\n", - ")\n", - "\n", - "print(\n", - " \"Workspace name: \" + ml_client.workspace_name,\n", - " \"Subscription id: \" + ml_client.subscription_id,\n", - " \"Resource group: \" + ml_client.resource_group_name,\n", - " sep=\"\\n\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "## Access data from Datastore URI" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this example, we will use 20 million rows (samples) of the airline dataset. 
The [datastore uri](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls#access-data-from-a-datastore-uri-like-a-filesystem-preview) below references a data storage location (path) containing the parquet files, you can download to your local computer or mount the files to your AML compute.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "data uri: \n", - " azureml://subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster/datastores/workspaceartifactstore/paths/airline_20000000.parquet\n" - ] - } - ], - "source": [ - "datastore_name = \"workspaceartifactstore\"\n", - "dataset = \"airline_20000000.parquet\"\n", - "\n", - "# Datastore uri format:\n", - "data_uri = f\"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}\"\n", - "\n", - "print(\"data uri:\", \"\\n\", data_uri)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create AML compute" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this notebook, we will use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training using a dynamically scalable pool of compute resources.\n", - "\n", - "This notebook will use 10 nodes for hyperparameter optimization, you can modify `max_instances` based on available quota in the desired region. 
Similar to other Azure ML services, there are limits on AmlCompute, this [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`size` describes the virtual machine type and size that will be used in the cluster. RAPIDS requires NVIDIA Pascal or newer architecture, so \n", - "you will need to select compute targets from one of the \n", - "[GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provisioned with P40 and V100 GPUs : `NC_v2`, `NC_v3`, `ND` or `ND_v2` \n", - "\n", - "Let's create an `AmlCompute` cluster of `Standard_NC12s_v3` GPU VMs:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "found compute target. Will use gpu-cluster\n" - ] - } - ], - "source": [ - "from azure.ai.ml.entities import AmlCompute\n", - "\n", - "# specify aml compute name.\n", - "gpu_compute_target = \"gpu-cluster\"\n", - "\n", - "try:\n", - " # let's see if the compute target already exists\n", - " gpu_target = ml_client.compute.get(gpu_compute_target)\n", - " print(f\"found compute target. 
Will use {gpu_compute_target}\")\n", - "except:\n", - " print(\"Creating a new gpu compute target...\")\n", - "\n", - " gpu_target = AmlCompute(\n", - " name=\"gpu-cluster\",\n", - " type=\"amlcompute\",\n", - " size=\"STANDARD_NC12S_V3\",\n", - " max_instances=5,\n", - " idle_time_before_scale_down=300,\n", - " )\n", - " ml_client.compute.begin_create_or_update(gpu_target).result()\n", - "\n", - " print(\n", - " f\"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}\"\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "## Prepare training script" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a project directory with your code to run on the remote resource. This includes the training script and additional files your training script depends on. In this example, the training script is provided:\n", - "\n", - "`train_rapids.py`- entry script for RAPIDS Environment, includes loading dataset into cuDF dataframe, training with Random Forest and inference using cuML." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Copy the training script `train_rapids.py` into your project directory:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "project_folder = \"./train_rapids\" # create folder in same dir\n", - "os.makedirs(project_folder, exist_ok=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import shutil\n", - "\n", - "\n", - "notebook_path = os.path.realpath(\"__file__\" + \"/../../code\")\n", - "rapids_script = os.path.join(notebook_path, \"train_rapids.py\")\n", - "azure_script = os.path.join(notebook_path, \"rapids_csp_azure.py\")\n", - "\n", - "\n", - "shutil.copy(rapids_script, project_folder)\n", - "shutil.copy(azure_script, project_folder)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train model on the remote compute" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that you have your data and training script prepared, you are ready to train on your remote compute:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create experiment\n", - "\n", - "Track all the runs in your workspace" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "experiment_name = \"test_rapids_gpu_cluster\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Use Custom Docker Image" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We'll be using a [custom](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image) RAPIDS docker image to setup the environment. This is available in [rapidsai/rapidsai repo](https://hub.docker.com/r/rapidsai/rapidsai/) on DockerHub." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azure.ai.ml.entities import Environment, BuildContext\n", - "\n", - "env_docker_image = Environment(\n", - " build=BuildContext(path=\"./docker\"),\n", - " name=\"rapids-docker-image-2302\",\n", - " description=\"Rapids v23.02 Environment\",\n", - ")\n", - "\n", - "ml_client.environments.create_or_update(env_docker_image)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Submit the training job " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will configure and run a training job using the`command`class. The [command](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml?view=azure-python#azure-ai-ml-command) can be used to run standalone jobs or as a function inside pipelines.\n", - "`inputs` is a dictionary of command-line arguments to pass to the training script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azure.ai.ml import command, Input\n", - "\n", - "\n", - "command_job = command(\n", - " environment=\"rapids-docker-image-2302:2\",\n", - " experiment_name=experiment_name,\n", - " code=project_folder,\n", - " command=\"python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} --compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}}\\\n", - " --n_estimators ${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} --max_features ${{inputs.max_features}}\",\n", - " inputs={\n", - " \"data_dir\": Input(type=\"uri_file\", path=data_uri),\n", - " \"n_bins\": 32,\n", - " \"compute\": \"single-GPU\", # multi-GPU for algorithms via Dask\n", - " \"cv_folds\": 5,\n", - " \"n_estimators\": 50,\n", - " \"max_depth\": 10,\n", - " \"max_features\": 1.0,\n", - " },\n", - " compute=\"gpu-cluster\",\n", - ")\n", - "\n", - "\n", - "# submit the command\n", - "returned_job = 
ml_client.jobs.create_or_update(command_job)\n", - "\n", - "# get a URL for the status of the job\n", - "returned_job.studio_url" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Tune model hyperparameters" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can optimize our model's hyperparameters and improve the accuracy using Azure Machine Learning's hyperparameter tuning capabilities." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Start a hyperparameter sweep" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's define the hyperparameter space to sweep over. We will tune `n_estimators`, `max_depth` and `max_features` parameters. In this example we will use random sampling to try different configuration sets of hyperparameters and maximize `Accuracy`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azure.ai.ml.sweep import Choice, Uniform, MedianStoppingPolicy\n", - "\n", - "command_job_for_sweep = command_job(\n", - " n_estimators=Choice(values=range(50, 500)),\n", - " max_depth=Choice(values=range(5, 19)),\n", - " max_features=Uniform(min_value=0.2, max_value=1.0),\n", - ")\n", - "\n", - "# apply the sweep parameter to obtain the sweep_job\n", - "sweep_job = command_job_for_sweep.sweep(\n", - " compute=\"gpu-cluster\",\n", - " sampling_algorithm=\"random\",\n", - " primary_metric=\"Accuracy\",\n", - " goal=\"Maximize\",\n", - ")\n", - "\n", - "# define the limits for this sweep\n", - "sweep_job.set_limits(max_total_trials=10, max_concurrent_trials=5, timeout=300)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This will launch the RAPIDS training script with parameters that were specified in the cell above." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# submit the hpo job\n", - "returned_sweep_job = ml_client.create_or_update(sweep_job)\n", - "\n", - "# get a URL for the status of the job\n", - "returned_sweep_job.studio_url" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Monitor SweepJobs runs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Monitor and view the progress of the machine learning training run with Mlflow" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Find and register best model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Download best trial model output\n", - "\n", - "best_sweep = ml_client.jobs.download(returned_sweep_job.name, output_name=\"best_model\")\n", - "print(best_sweep)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "List the model files uploaded during the run:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Register the folder (and all files in it) as a model named `train-rapids` under the workspace for deployment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Delete cluster" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# gpu_target.delete()" - ] - } - ], - "metadata": { - "kernel_info": { - "name": "rapids" - }, - "kernelspec": { - "display_name": "Python 3.10 - SDK v2", - "language": "python", - "name": "python310-sdkv2" - }, - 
"language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.9" - }, - "microsoft": { - "ms_spell_check": { - "ms_spell_check_language": "en" - } - }, - "nteract": { - "version": "nteract-front-end@1.0.0" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} From 9b506003da8681da94a5349aaefcf502948e067b Mon Sep 17 00:00:00 2001 From: skirui-source Date: Thu, 6 Jul 2023 17:39:39 -0700 Subject: [PATCH 04/13] add notebook with cell tags in title. goof to go --- .../rapids-azureml-hpo/notebook.ipynb | 538 ++++++++++++++++++ 1 file changed, 538 insertions(+) create mode 100644 source/examples/rapids-azureml-hpo/notebook.ipynb diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb new file mode 100644 index 00000000..8f154c78 --- /dev/null +++ b/source/examples/rapids-azureml-hpo/notebook.ipynb @@ -0,0 +1,538 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "workflows/hpo", + "cloud/azure/ml" + ] + }, + "source": [ + "# Train and hyperparameter tune with RAPIDS" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create an Azure ML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) and setup environment on local computer or Azure ML Compute Instance, following these [instructions](https://docs.rapids.ai/deployment/stable/cloud/azure/azureml/#azure-ml-compute-instance).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# verify Azure ML SDK version\n", + "\n", + "%pip show azure-ai-ml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "## Initialize workspace" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Initialize the `MLClient` class to get a handle to the workspace you created in the prerequisites step. \n", + "\n", + "You can manually provide the workspace details or call `MLClient.from_config(credential, path)`\n", + "to create the client from the details stored in `config.json`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.ml import MLClient\n", + "from azure.identity import DefaultAzureCredential\n", + "\n", + "\n", + "# Get a handle to the workspace\n", + "ml_client = MLClient(\n", + " credential=DefaultAzureCredential(),\n", + " subscription_id=\"fc4f4a6b-4041-4b1c-8249-854d68edcf62\",\n", + " resource_group_name=\"rapidsai-deployment\",\n", + " workspace_name=\"rapids-aml-cluster\",\n", + ")\n", + "\n", + "print(\n", + " \"Workspace name: \" + ml_client.workspace_name,\n", + " \"Subscription id: \" + ml_client.subscription_id,\n", + " \"Resource group: \" + ml_client.resource_group_name,\n", + " sep=\"\\n\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Access data from Datastore URI" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this example, we will use 20 million rows of the airline dataset. 
The [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls#access-data-from-a-datastore-uri-like-a-filesystem-preview) below references a data storage location (path) containing the parquet files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "datastore_name = \"workspaceartifactstore\"\n", + "dataset = \"airline_20000000.parquet\"\n", + "\n", + "# Datastore uri format:\n", + "data_uri = f\"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}\"\n", + "\n", + "print(\"data uri:\", \"\\n\", data_uri)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create AML compute" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You will need to create an Azure ML managed compute target ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for training your model.\n", + "\n", + "The cluster defined below scales up to 5 nodes (`max_instances=5`) for hyperparameter optimization; you can modify `max_instances` based on available quota in the desired region. Similar to other Azure ML services, there are limits on AmlCompute; this [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`size` describes the virtual machine type and size that will be used in the cluster. 
RAPIDS requires NVIDIA Pascal or newer architecture, so \n", + "you will need to select compute targets from one of the \n", + "[GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provisioned with P40, P100 or V100 GPUs: `NC_v2`, `NC_v3`, `ND` or `ND_v2` \n", + "\n", + "Let's create an `AmlCompute` cluster of `Standard_NC12s_v3` GPU VMs:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.ml.entities import AmlCompute\n", + "\n", + "# specify aml compute name.\n", + "gpu_compute_target = \"rapids-cluster\"\n", + "\n", + "try:\n", + " # let's see if the compute target already exists\n", + " gpu_target = ml_client.compute.get(gpu_compute_target)\n", + " print(f\"found compute target. Will use {gpu_compute_target}\")\n", + "except Exception:\n", + " print(\"Creating a new gpu compute target...\")\n", + "\n", + " gpu_target = AmlCompute(\n", + " name=gpu_compute_target,\n", + " type=\"amlcompute\",\n", + " size=\"STANDARD_NC12S_V3\",\n", + " max_instances=5,\n", + " idle_time_before_scale_down=300,\n", + " )\n", + " ml_client.compute.begin_create_or_update(gpu_target).result()\n", + "\n", + " print(\n", + " f\"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Prepare training script" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "library/cuml" + ] + }, + "source": [ + "Create a project directory with your code to run on the remote resource. This includes the training script and additional files your training script depends on. In this example, the training script is provided:\n", + "\n", + "`train_rapids.py` - entry script for the RAPIDS environment; it loads the dataset into a cuDF DataFrame, trains a Random Forest, and runs inference using cuML." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "project_folder = \"./train_rapids\" # create folder in same dir\n", + "os.makedirs(project_folder, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will log some parameters and metrics including highest accuracy, using mlflow within the training script:\n", + "\n", + "```python\n", + "import mlflow\n", + "\n", + "mlflow.log_metric('Accuracy', float(global_best_test_accuracy))\n", + "```\n", + "\n", + "These run metrics will become particularly important when we begin hyperparameter tuning our model in the 'Tune model hyperparameters' section." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copy the training script `train_rapids.py` into your project directory:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "\n", + "notebook_path = os.path.realpath(\n", + " \"__file__\" + \"/../../code\"\n", + ") # dir containing the training scrips\n", + "rapids_script = os.path.join(notebook_path, \"train_rapids.py\")\n", + "azure_script = os.path.join(notebook_path, \"rapids_csp_azure.py\")\n", + "\n", + "\n", + "shutil.copy(rapids_script, project_folder)\n", + "shutil.copy(azure_script, project_folder)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Train model on the remote compute" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that you have your data and training script prepared, you are ready to train on your remote compute:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create experiment\n", + "\n", + "Track all the runs in your workspace" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + 
"experiment_name = \"test_rapids_gpu_cluster\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup Environment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll be using a [custom](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image) RAPIDS docker image to setup the environment. This is available in [rapidsai/rapidsai repo](https://hub.docker.com/r/rapidsai/rapidsai/) on DockerHub." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# RUN THIS CODE ONCE TO SETUP ENVIRONMENT\n", + "from azure.ai.ml.entities import Environment, BuildContext\n", + "\n", + "env_docker_image = Environment(\n", + " build=BuildContext(path=\"./docker\"),\n", + " name=\"rapids-mlflow\",\n", + " description=\"RAPIDS environment with azureml-mlflow\",\n", + ")\n", + "\n", + "ml_client.environments.create_or_update(env_docker_image)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Submit the training job " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will configure and run a training job using the`command`class. 
The [command](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml?view=azure-python#azure-ai-ml-command) can be used to run standalone jobs or as a function inside pipelines.\n", + "`inputs` is a dictionary of command-line arguments to pass to the training script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "library/randomforest", + "library/cudf" + ] + }, + "outputs": [], + "source": [ + "from azure.ai.ml import command, Input\n", + "\n", + "\n", + "command_job = command(\n", + " environment=\"rapids-mlflow:1\",\n", + " experiment_name=experiment_name,\n", + " code=project_folder,\n", + " command=\"python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} --compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}}\\\n", + " --n_estimators ${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} --max_features ${{inputs.max_features}}\",\n", + " inputs={\n", + " \"data_dir\": Input(type=\"uri_file\", path=data_uri),\n", + " \"n_bins\": 32,\n", + " \"compute\": \"single-GPU\", # multi-GPU for algorithms via Dask\n", + " \"cv_folds\": 5,\n", + " \"n_estimators\": 100,\n", + " \"max_depth\": 6,\n", + " \"max_features\": 0.3,\n", + " },\n", + " compute=\"rapids-cluster\",\n", + ")\n", + "\n", + "\n", + "# submit the command\n", + "returned_job = ml_client.jobs.create_or_update(command_job)\n", + "\n", + "# get a URL for the status of the job\n", + "returned_job.studio_url" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tune model hyperparameters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can optimize our model's hyperparameters and improve the accuracy using Azure Machine Learning's hyperparameter tuning capabilities." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Start a hyperparameter sweep" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's define the hyperparameter space to sweep over. We will tune `n_estimators`, `max_depth` and `max_features` parameters. In this example we will use random sampling to try different configuration sets of hyperparameters and maximize `Accuracy`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.ml.sweep import Choice, Uniform, MedianStoppingPolicy\n", + "\n", + "command_job_for_sweep = command_job(\n", + " n_estimators=Choice(values=range(50, 500)),\n", + " max_depth=Choice(values=range(5, 19)),\n", + " max_features=Uniform(min_value=0.2, max_value=1.0),\n", + ")\n", + "\n", + "# apply sweep parameter to obtain the sweep_job\n", + "sweep_job = command_job_for_sweep.sweep(\n", + " compute=\"rapids-cluster\",\n", + " sampling_algorithm=\"random\",\n", + " primary_metric=\"Accuracy\",\n", + " goal=\"Maximize\",\n", + ")\n", + "\n", + "\n", + "# Define the limits for this sweep\n", + "sweep_job.set_limits(\n", + " max_total_trials=5, max_concurrent_trials=2, timeout=18000, trial_timeout=3600\n", + ")\n", + "\n", + "\n", + "# Specify your experiment details\n", + "sweep_job.display_name = \"RF-rapids-sweep-job\"\n", + "sweep_job.description = \"Run RAPIDS hyperparameter sweep job\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This will launch the RAPIDS training script with parameters that were specified in the cell above." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# submit the hpo job\n", + "returned_sweep_job = ml_client.create_or_update(sweep_job)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Monitor SweepJobs runs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "aml_url = returned_sweep_job.studio_url\n", + "\n", + "print(\"Monitor your job at\", aml_url)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Find and register best model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Download the best trial model output" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ml_client.jobs.download(returned_sweep_job.name, output_name=\"model\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Delete cluster" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ml_client.compute.begin_delete(gpu_compute_target).wait()" + ] + } + ], + "metadata": { + "kernel_info": { + "name": "rapids" + }, + "kernelspec": { + "display_name": "rapids", + "language": "python", + "name": "rapids" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + }, + "microsoft": { + "ms_spell_check": { + "ms_spell_check_language": "en" + } + }, + "nteract": { + "version": "nteract-front-end@1.0.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 862168a7487a2aef7db6bdf7c8a895c2984c1f01 Mon Sep 17 00:00:00 2001 From: skirui-source Date: Thu, 6 Jul 2023 17:43:11 -0700 Subject: [PATCH 05/13] fixed cell tags --- 
source/examples/rapids-azureml-hpo/notebook.ipynb | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb index 8f154c78..d9e42bf4 100644 --- a/source/examples/rapids-azureml-hpo/notebook.ipynb +++ b/source/examples/rapids-azureml-hpo/notebook.ipynb @@ -5,7 +5,10 @@ "metadata": { "tags": [ "workflows/hpo", - "cloud/azure/ml" + "cloud/azure/ml", + "library/cudf", + "library/cuml", + "library/randomforest" ] }, "source": [ From 6462e8b4e61182b3fa360885b10f58dab480aa45 Mon Sep 17 00:00:00 2001 From: skirui-source Date: Thu, 6 Jul 2023 17:48:14 -0700 Subject: [PATCH 06/13] add notebook to examples toctree --- source/examples/index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/source/examples/index.md b/source/examples/index.md index f92b927b..852bdc2d 100644 --- a/source/examples/index.md +++ b/source/examples/index.md @@ -13,4 +13,5 @@ rapids-sagemaker-hpo/notebook rapids-ec2-mnmg/notebook rapids-autoscaling-multi-tenant-kubernetes/notebook xgboost-randomforest-gpu-hpo-dask/notebook +rapids-azureml-hpo/notebook ``` From a3634910ee140128689d531d4dd0d400991f8bbd Mon Sep 17 00:00:00 2001 From: Jacob Tomlinson Date: Mon, 10 Jul 2023 12:06:52 +0100 Subject: [PATCH 07/13] Fix regex --- extensions/rapids_notebook_files.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extensions/rapids_notebook_files.py b/extensions/rapids_notebook_files.py index 5d867258..eef81d94 100644 --- a/extensions/rapids_notebook_files.py +++ b/extensions/rapids_notebook_files.py @@ -28,7 +28,7 @@ def walk_files(app, dir, outdir): with open(str(outdir / page.name), "w") as writer: writer.write( re.sub( - r"\{\{.*?\}\}", + r"(? 
Date: Wed, 12 Jul 2023 02:23:25 -0700 Subject: [PATCH 08/13] added relevant files, add docref --- source/examples/rapids-azureml-hpo/Dockerfile | 10 + .../rapids-azureml-hpo/notebook.ipynb | 97 ++-- .../rapids-azureml-hpo/rapids_csp_azure.py | 501 ++++++++++++++++++ .../rapids-azureml-hpo/train_rapids.py | 175 ++++++ 4 files changed, 719 insertions(+), 64 deletions(-) create mode 100644 source/examples/rapids-azureml-hpo/Dockerfile create mode 100644 source/examples/rapids-azureml-hpo/rapids_csp_azure.py create mode 100644 source/examples/rapids-azureml-hpo/train_rapids.py diff --git a/source/examples/rapids-azureml-hpo/Dockerfile b/source/examples/rapids-azureml-hpo/Dockerfile new file mode 100644 index 00000000..bb90d5a1 --- /dev/null +++ b/source/examples/rapids-azureml-hpo/Dockerfile @@ -0,0 +1,10 @@ +# Use rapids base image v23.02 with the necessary dependencies +FROM rapidsai/rapidsai:23.02-cuda11.8-runtime-ubuntu22.04-py3.10 + +# Update package information and install required packages +RUN apt-get update && \ + apt-get install -y --no-install-recommends build-essential fuse && \ + rm -rf /var/lib/apt/lists/* + +# Activate rapids conda environment +RUN /bin/bash -c "source activate rapids && pip install azureml-mlflow azureml-dataprep" diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb index d9e42bf4..7dc55365 100644 --- a/source/examples/rapids-azureml-hpo/notebook.ipynb +++ b/source/examples/rapids-azureml-hpo/notebook.ipynb @@ -19,14 +19,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Prerequisites" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create an Azure ML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) and setup environment on local computer or Azure ML Compute Instance, following these 
[instructions](https://docs.rapids.ai/deployment/stable/cloud/azure/azureml/#azure-ml-compute-instance).\n" + "Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune. In this notebook, we will show how to speed up hyperparameter optimization by running multiple training jobs in parallel on Azure Machine Learning.\n", + "\n", + "# Prerequisites\n", + "\n", + "````{docref} /cloud/azure/azureml\n", + "Create an Azure ML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) then follow instructions in [Microsoft Azure Machine Learning](../../cloud/azure/azureml) to launch an Azure ML Compute instance with RAPIDS.\n", + "\n", + "\n", + "Once your instance is running and you have access to Jupyter, save this notebook and run through the cells.\n", + "\n", + "````" ] }, { @@ -44,7 +47,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Initialize workspace" + "# Initialize workspace" ] }, { @@ -89,7 +92,7 @@ "tags": [] }, "source": [ - "## Access data from Datastore URI" + "# Access data from Datastore URI" ] }, { @@ -118,14 +121,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Create AML compute" + "# Create AML compute" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "You will need to create an Azure ML managed compute target ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for training your model.\n", + "You will need to create an Azure ML managed compute target ([AmlCompute](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?view=azureml-api-2&tabs=python)) to serve as the environment for training your model.\n", "\n", "This notebook will use 10 nodes for hyperparameter optimization, you can modify `max_instances` 
based on available quota in the desired region. Similar to other Azure ML services, there are limits on AmlCompute, this [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota." ] @@ -178,7 +181,7 @@ "metadata": {}, "source": [ "\n", - "## Prepare training script" + "# Prepare training script" ] }, { @@ -189,23 +192,11 @@ ] }, "source": [ - "Create a project directory with your code to run on the remote resource. This includes the training script and additional files your training script depends on. In this example, the training script is provided:\n", + "Make sure current directory contains your code to run on the remote resource. This includes the training script and all its dependencies files. In this example, the training script is provided:\n", "\n", "`train_rapids.py`- entry script for RAPIDS Environment, includes loading dataset into cuDF dataframe, training with Random Forest and inference using cuML." ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "project_folder = \"./train_rapids\" # create folder in same dir\n", - "os.makedirs(project_folder, exist_ok=True)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -221,31 +212,14 @@ "These run metrics will become particularly important when we begin hyperparameter tuning our model in the 'Tune model hyperparameters' section." 
] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Copy the training script `train_rapids.py` into your project directory:" - ] - }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "import shutil\n", - "\n", - "\n", - "notebook_path = os.path.realpath(\n", - " \"__file__\" + \"/../../code\"\n", - ") # dir containing the training scrips\n", - "rapids_script = os.path.join(notebook_path, \"train_rapids.py\")\n", - "azure_script = os.path.join(notebook_path, \"rapids_csp_azure.py\")\n", - "\n", - "\n", - "shutil.copy(rapids_script, project_folder)\n", - "shutil.copy(azure_script, project_folder)" + "rapids_script = \"./train_rapids.py\"\n", + "azure_script = \"./rapids_csp_azure.py\"" ] }, { @@ -254,21 +228,14 @@ "tags": [] }, "source": [ - "## Train model on the remote compute" + "# Train Model on remote compute" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now that you have your data and training script prepared, you are ready to train on your remote compute:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create experiment\n", + "## Create experiment\n", "\n", "Track all the runs in your workspace" ] @@ -286,14 +253,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Setup Environment" + "## Setup Environment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We'll be using a [custom](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image) RAPIDS docker image to setup the environment. This is available in [rapidsai/rapidsai repo](https://hub.docker.com/r/rapidsai/rapidsai/) on DockerHub." 
+ "We'll be using a custom RAPIDS docker image to [set up the environment](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image). 
This is available in `rapidsai/rapidsai` repo on [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai/).\n", + "\n", + "Make sure you have the correct path to the docker build context as `os.getcwd()`," ] }, { @@ -306,7 +275,7 @@ "from azure.ai.ml.entities import Environment, BuildContext\n", "\n", "env_docker_image = Environment(\n", - " build=BuildContext(path=\"./docker\"),\n", + " build=BuildContext(path=os.getcwd()),\n", " name=\"rapids-mlflow\",\n", " description=\"RAPIDS environment with azureml-mlflow\",\n", ")\n", @@ -318,7 +287,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Submit the training job " + "## Submit the training job " ] }, { @@ -373,7 +342,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Tune model hyperparameters" + "# Tune model hyperparameters" ] }, { @@ -387,7 +356,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Start a hyperparameter sweep" + "## Start a hyperparameter sweep" ] }, { @@ -452,7 +421,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Monitor SweepJobs runs" + "## Monitor SweepJobs runs" ] }, { @@ -493,7 +462,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Delete cluster" + "# Delete cluster" ] }, { @@ -511,9 +480,9 @@ "name": "rapids" }, "kernelspec": { - "display_name": "rapids", + "display_name": "rapids-23.06", "language": "python", - "name": "rapids" + "name": "rapids-23.06" }, "language_info": { "codemirror_mode": { diff --git a/source/examples/rapids-azureml-hpo/rapids_csp_azure.py b/source/examples/rapids-azureml-hpo/rapids_csp_azure.py new file mode 100644 index 00000000..2a32a92a --- /dev/null +++ b/source/examples/rapids-azureml-hpo/rapids_csp_azure.py @@ -0,0 +1,501 @@ +# +# Copyright (c) 2019-2021, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import json +import logging +import pprint +import time + +import cudf +import cuml +import dask +import dask_cudf +import numpy as np +import pandas as pd +import sklearn +import xgboost +from cuml.dask.common import utils as dask_utils +from cuml.metrics.accuracy import accuracy_score +from cuml.model_selection import train_test_split as cuml_train_test_split +from dask.distributed import Client +from dask_cuda import LocalCUDACluster +from dask_ml.model_selection import train_test_split as dask_train_test_split +from sklearn.model_selection import train_test_split as sklearn_train_test_split + +default_azureml_paths = { + "train_script": "./train_script", + "train_data": "./data_airline", + "output": "./output", +} + + +class RapidsCloudML: + def __init__( + self, + cloud_type="Azure", + model_type="RandomForest", + data_type="Parquet", + compute_type="single-GPU", + verbose_estimator=False, + CSP_paths=default_azureml_paths, + ): + self.CSP_paths = CSP_paths + self.cloud_type = cloud_type + self.model_type = model_type + self.data_type = data_type + self.compute_type = compute_type + self.verbose_estimator = verbose_estimator + self.log_to_file( + f"\n> RapidsCloudML\n\tCompute, Data , Model, Cloud types {self.compute_type, self.data_type, self.model_type, self.cloud_type}" + ) + + # Setting up client for multi-GPU option + if "multi" in self.compute_type: + self.log_to_file("\n\tMulti-GPU selected") + # This will use all GPUs on the local host by default + cluster = LocalCUDACluster(threads_per_worker=1) + self.client = Client(cluster) + + # 
Query the client for all connected workers + self.workers = self.client.has_what().keys() + self.n_workers = len(self.workers) + self.log_to_file(f"\n\tClient information {self.client}") + + def load_hyperparams(self, model_name="XGBoost"): + """ + Selecting model parameters based on the model we select for execution. + Checks if there is a config file present in the path self.CSP_paths['hyperparams'] with + the parameters for the experiment. If not present, it returns the default parameters. + + Parameters + ---------- + model_name : string + Selects which model to set the parameters for. Takes either 'XGBoost' or 'RandomForest'. + + Returns + ---------- + model_params : dict + Loaded model parameters (dict) + """ + + self.log_to_file("\n> Loading Hyperparameters") + + # Default parameters of the models + if self.model_type == "XGBoost": + # https://xgboost.readthedocs.io/en/latest/parameter.html + model_params = { + "max_depth": 6, + "num_boost_round": 100, + "learning_rate": 0.3, + "gamma": 0.0, + "lambda": 1.0, + "alpha": 0.0, + "objective": "binary:logistic", + "random_state": 0, + } + + elif self.model_type == "RandomForest": + # https://docs.rapids.ai/api/cuml/stable/ -> cuml.ensemble.RandomForestClassifier + model_params = { + "n_estimators": 10, + "max_depth": 10, + "n_bins": 16, + "max_features": 1.0, + "seed": 0, + } + + hyperparameters = {} + try: + with open(self.CSP_paths["hyperparams"]) as file_handle: + hyperparameters = json.load(file_handle) + for key, value in hyperparameters.items(): + model_params[key] = value + pprint.pprint(model_params) + return model_params + + except Exception as error: + self.log_to_file(str(error)) + return model_params + + def load_data( + self, filename="dataset.orc", col_labels=None, y_label="ArrDelayBinary" + ): + """ + Loading the data into the object from the filename and based on the columns that we are + interested in. Also, generates y_label from 'ArrDelay' column to convert this into a binary + classification problem. 
+ + Parameters + ---------- + filename : string + the path of the dataset to be loaded + + col_labels : list of strings + The input columns that we are interested in. None selects all the columns + + y_label : string + The column to perform the prediction task in. + + Returns + ---------- + dataset : dataframe (Pandas, cudf or dask-cudf) + Ingested dataset in the format of a dataframe + + col_labels : list of strings + The input columns selected + + y_label : string + The generated y_label name for binary classification + + duration : float + The time it took to execute the function + """ + target_filename = filename + self.log_to_file(f"\n> Loading dataset from {target_filename}") + + with PerfTimer() as ingestion_timer: + if "CPU" in self.compute_type: + # CPU Reading options + self.log_to_file("\n\tCPU read") + + if self.data_type == "ORC": + with open(target_filename, mode="rb") as file: + dataset = pyarrow_orc.ORCFile(file).read().to_pandas() + elif self.data_type == "CSV": + dataset = pd.read_csv(target_filename, names=col_labels) + + elif self.data_type == "Parquet": + if "single" in self.compute_type: + dataset = pd.read_parquet(target_filename) + + elif "multi" in self.compute_type: + self.log_to_file("\n\tReading using dask dataframe") + dataset = dask.dataframe.read_parquet( + target_filename, columns=col_labels + ) + + elif "GPU" in self.compute_type: + # GPU Reading Option + + self.log_to_file("\n\tGPU read") + if self.data_type == "ORC": + dataset = cudf.read_orc(target_filename) + + elif self.data_type == "CSV": + dataset = cudf.read_csv(target_filename, names=col_labels) + + elif self.data_type == "Parquet": + if "single" in self.compute_type: + dataset = cudf.read_parquet(target_filename) + + elif "multi" in self.compute_type: + self.log_to_file("\n\tReading using dask_cudf") + dataset = dask_cudf.read_parquet( + target_filename, columns=col_labels + ) + + # cast all columns to float32 + for col in dataset.columns: + dataset[col] = 
dataset[col].astype(np.float32) # needed for random forest + + # Adding y_label column if it is not present + if y_label not in dataset.columns: + dataset[y_label] = 1.0 * (dataset["ArrDelay"] > 10) + + dataset[y_label] = dataset[y_label].astype(np.int32) # Needed for cuml RF + + dataset = dataset.fillna(0.0) # Filling the null values. Needed for dask-cudf + + self.log_to_file(f"\n\tIngestion completed in {ingestion_timer.duration}") + self.log_to_file( + f"\n\tDataset descriptors: {dataset.shape}\n\t{dataset.dtypes}" + ) + return dataset, col_labels, y_label, ingestion_timer.duration + + def split_data( + self, dataset, y_label, train_size=0.8, random_state=0, shuffle=True + ): + """ + Splitting data into train and test split, has appropriate imports for different compute modes. + CPU compute - Uses sklearn, we manually filter y_label column in the split call + GPU Compute - Single GPU uses cuml and multi GPU uses dask, both split y_label internally. + + Parameters + ---------- + dataset : dataframe + The dataframe on which we wish to perform the split + y_label : string + The name of the column (not the series itself) + train_size : float + The size for the split. Takes values between 0 to 1. + random_state : int + Useful for running reproducible splits. + shuffle : binary + Specifies if the data must be shuffled before splitting. + + Returns + ---------- + X_train : dataframe + The data to be used for training. Has same type as input dataset. + X_test : dataframe + The data to be used for testing. Has same type as input dataset. + y_train : dataframe + The label to be used for training. Has same type as input dataset. + y_test : dataframe + The label to be used for testing. Has same type as input dataset. 
+ duration : float + The time it took to perform the split + """ + self.log_to_file("\n> Splitting train and test data") + time.perf_counter() + + with PerfTimer() as split_timer: + if "CPU" in self.compute_type: + X_train, X_test, y_train, y_test = sklearn_train_test_split( + dataset.loc[:, dataset.columns != y_label], + dataset[y_label], + train_size=train_size, + shuffle=shuffle, + random_state=random_state, + ) + + elif "GPU" in self.compute_type: + if "single" in self.compute_type: + X_train, X_test, y_train, y_test = cuml_train_test_split( + X=dataset, + y=y_label, + train_size=train_size, + shuffle=shuffle, + random_state=random_state, + ) + elif "multi" in self.compute_type: + X_train, X_test, y_train, y_test = dask_train_test_split( + dataset, + y_label, + train_size=train_size, + shuffle=False, # shuffle not available for dask_cudf yet + random_state=random_state, + ) + + self.log_to_file(f"\n\tX_train shape and type{X_train.shape} {type(X_train)}") + self.log_to_file(f"\n\tSplit completed in {split_timer.duration}") + return X_train, X_test, y_train, y_test, split_timer.duration + + def train_model(self, X_train, y_train, model_params): + """ + Trains a model with the model_params specified by calling fit_xgboost or + fit_random_forest depending on the model_type. + + Parameters + ---------- + X_train : dataframe + The data for training + y_train : dataframe + The label to be used for training. 
+ model_params : dict + The model params to use for this training + Returns + ---------- + trained_model : The object of the trained model either of XGBoost or RandomForest + + training_time : float + The time it took to train the model + """ + self.log_to_file(f"\n> Training {self.model_type} estimator w/ hyper-params") + training_time = 0 + + try: + if self.model_type == "XGBoost": + trained_model, training_time = self.fit_xgboost( + X_train, y_train, model_params + ) + elif self.model_type == "RandomForest": + trained_model, training_time = self.fit_random_forest( + X_train, y_train, model_params + ) + except Exception as error: + self.log_to_file("\n\n!error during model training: " + str(error)) + self.log_to_file(f"\n\tFinished training in {training_time:.4f} s") + return trained_model, training_time + + def fit_xgboost(self, X_train, y_train, model_params): + """ + Trains a XGBoost model on X_train and y_train with model_params + + Parameters and Objects returned are same as trained_model + """ + if "GPU" in self.compute_type: + model_params.update({"tree_method": "gpu_hist"}) + else: + model_params.update({"tree_method": "hist"}) + + with PerfTimer() as train_timer: + if "single" in self.compute_type: + train_DMatrix = xgboost.DMatrix(data=X_train, label=y_train) + trained_model = xgboost.train( + dtrain=train_DMatrix, + params=model_params, + num_boost_round=model_params["num_boost_round"], + ) + elif "multi" in self.compute_type: + self.log_to_file("\n\tTraining multi-GPU XGBoost") + train_DMatrix = xgboost.dask.DaskDMatrix( + self.client, data=X_train, label=y_train + ) + trained_model = xgboost.dask.train( + self.client, + dtrain=train_DMatrix, + params=model_params, + num_boost_round=model_params["num_boost_round"], + ) + return trained_model, train_timer.duration + + def fit_random_forest(self, X_train, y_train, model_params): + """ + Trains a RandomForest model on X_train and y_train with model_params. 
+ Depending on compute_type, estimators from appropriate packages are used. + CPU - sklearn + Single-GPU - cuml + multi_gpu - cuml.dask + + Parameters and Objects returned are same as trained_model + """ + if "CPU" in self.compute_type: + rf_model = sklearn.ensemble.RandomForestClassifier( + n_estimators=model_params["n_estimators"], + max_depth=model_params["max_depth"], + max_features=model_params["max_features"], + n_jobs=int(self.n_workers), + verbose=self.verbose_estimator, + ) + elif "GPU" in self.compute_type: + if "single" in self.compute_type: + rf_model = cuml.ensemble.RandomForestClassifier( + n_estimators=model_params["n_estimators"], + max_depth=model_params["max_depth"], + n_bins=model_params["n_bins"], + max_features=model_params["max_features"], + verbose=self.verbose_estimator, + ) + elif "multi" in self.compute_type: + self.log_to_file("\n\tFitting multi-GPU daskRF") + X_train, y_train = dask_utils.persist_across_workers( + self.client, + [X_train.fillna(0.0), y_train.fillna(0.0)], + workers=self.workers, + ) + rf_model = cuml.dask.ensemble.RandomForestClassifier( + n_estimators=model_params["n_estimators"], + max_depth=model_params["max_depth"], + n_bins=model_params["n_bins"], + max_features=model_params["max_features"], + verbose=self.verbose_estimator, + ) + with PerfTimer() as train_timer: + try: + trained_model = rf_model.fit(X_train, y_train) + except Exception as error: + self.log_to_file("\n\n! Error during fit " + str(error)) + return trained_model, train_timer.duration + + def evaluate_test_perf(self, trained_model, X_test, y_test, threshold=0.5): + """ + Evaluates the model performance on the inference set. For XGBoost we need + to generate a DMatrix and then we can evaluate the model. + For Random Forest, in single GPU case, we can just call .score function. + And multi-GPU Random Forest needs to predict on the model and then compute + the accuracy score. 
+ + Parameters + ---------- + trained_model : The object of the trained model either of XGBoost or RandomForest + X_test : dataframe + The data for testing + y_test : dataframe + The label to be used for testing. + Returns + ---------- + test_accuracy : float + The accuracy achieved on test set + duration : float + The time it took to evaluate the model + """ + self.log_to_file("\n> Inferencing on test set") + test_accuracy = None + with PerfTimer() as inference_timer: + try: + if self.model_type == "XGBoost": + if "multi" in self.compute_type: + test_DMatrix = xgboost.dask.DaskDMatrix( + self.client, data=X_test, label=y_test + ) + xgb_pred = xgboost.dask.predict( + self.client, trained_model, test_DMatrix + ).compute() + xgb_pred = (xgb_pred > threshold) * 1.0 + test_accuracy = accuracy_score(y_test.compute(), xgb_pred) + elif "single" in self.compute_type: + test_DMatrix = xgboost.DMatrix(data=X_test, label=y_test) + xgb_pred = trained_model.predict(test_DMatrix) + xgb_pred = (xgb_pred > threshold) * 1.0 + test_accuracy = accuracy_score(y_test, xgb_pred) + + elif self.model_type == "RandomForest": + if "multi" in self.compute_type: + cuml_pred = trained_model.predict(X_test).compute() + self.log_to_file("\n\tPrediction complete") + test_accuracy = accuracy_score( + y_test.compute(), cuml_pred, convert_dtype=True + ) + elif "single" in self.compute_type: + test_accuracy = trained_model.score( + X_test, y_test.astype("int32") + ) + + except Exception as error: + self.log_to_file("\n\n!error during inference: " + str(error)) + + self.log_to_file(f"\n\tFinished inference in {inference_timer.duration:.4f} s") + self.log_to_file(f"\n\tTest-accuracy: {test_accuracy}") + return test_accuracy, inference_timer.duration + + def set_up_logging(self): + """ + Function to set up logging for the object. 
+ """ + logging_path = self.CSP_paths["output"] + "/log.txt" + logging.basicConfig(filename=logging_path, level=logging.INFO) + + def log_to_file(self, text): + """ + Logs the text that comes in as input. + """ + logging.info(text) + print(text) + + +# perf_counter = highest available timer resolution +class PerfTimer: + def __init__(self): + self.start = None + self.duration = None + + def __enter__(self): + self.start = time.perf_counter() + return self + + def __exit__(self, *args): + self.duration = time.perf_counter() - self.start diff --git a/source/examples/rapids-azureml-hpo/train_rapids.py b/source/examples/rapids-azureml-hpo/train_rapids.py new file mode 100644 index 00000000..63ce4f5f --- /dev/null +++ b/source/examples/rapids-azureml-hpo/train_rapids.py @@ -0,0 +1,175 @@ +# +# Copyright (c) 2019-2021, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +import argparse +import os + +import cudf +import cuml +import mlflow +import numpy as np +from rapids_csp_azure import PerfTimer, RapidsCloudML + + +def main(): + parser = argparse.ArgumentParser() + + parser.add_argument("--data_dir", type=str, help="location of data") + parser.add_argument( + "--n_estimators", type=int, default=100, help="Number of trees in RF" + ) + parser.add_argument( + "--max_depth", type=int, default=16, help="Max depth of each tree" + ) + parser.add_argument( + "--n_bins", + type=int, + default=8, + help="Number of bins used in split point calculation", + ) + parser.add_argument( + "--max_features", + type=float, + default=1.0, + help="Number of features for best split", + ) + parser.add_argument( + "--compute", + type=str, + default="single-GPU", + help="set to multi-GPU for algorithms via dask", + ) + parser.add_argument( + "--cv_folds", type=int, default=5, help="Number of CV fold splits" + ) + + args = parser.parse_args() + data_dir = args.data_dir + compute = args.compute + cv_folds = args.cv_folds + + n_estimators = args.n_estimators + mlflow.log_param("n_estimators", int(args.n_estimators)) + max_depth = args.max_depth + mlflow.log_param("max_depth", int(args.max_depth)) + n_bins = args.n_bins + mlflow.log_param("n_bins", int(args.n_bins)) + max_features = args.max_features + mlflow.log_param("max_features", str(args.max_features)) + + print("\n---->>>> cuDF version <<<<----\n", cudf.__version__) + print("\n---->>>> cuML version <<<<----\n", cuml.__version__) + + azure_ml = RapidsCloudML( + cloud_type="Azure", + model_type="RandomForest", + data_type="Parquet", + compute_type=compute, + ) + print(args.compute) + + if compute == "single-GPU": + dataset, _, y_label, _ = azure_ml.load_data(filename=data_dir) + else: + # use parquet files from 'https://airlinedataset.blob.core.windows.net/airline-10years' for multi-GPU training + dataset, _, y_label, _ = azure_ml.load_data( + filename=os.path.join(data_dir,
"part*.parquet"), + col_labels=[ + "Flight_Number_Reporting_Airline", + "Year", + "Quarter", + "Month", + "DayOfWeek", + "DOT_ID_Reporting_Airline", + "OriginCityMarketID", + "DestCityMarketID", + "DepTime", + "DepDelay", + "DepDel15", + "ArrDel15", + "ArrDelay", + "AirTime", + "Distance", + ], + y_label="ArrDel15", + ) + + X = dataset[dataset.columns.difference(["ArrDelay", y_label])] + y = dataset[y_label] + del dataset + + print("\n---->>>> Training using GPUs <<<<----\n") + + # ---------------------------------------------------------------------------------------------------- + # cross-validation folds + # ---------------------------------------------------------------------------------------------------- + accuracy_per_fold = [] + train_time_per_fold = [] + infer_time_per_fold = [] + trained_model = [] + global_best_test_accuracy = 0 + + model_params = { + "n_estimators": n_estimators, + "max_depth": max_depth, + "max_features": max_features, + "n_bins": n_bins, + } + + # optional cross-validation w/ model_params['n_train_folds'] > 1 + for i_train_fold in range(cv_folds): + print(f"\n CV fold { i_train_fold } of { cv_folds }\n") + + # split data + X_train, X_test, y_train, y_test, _ = azure_ml.split_data( + X, y, random_state=i_train_fold + ) + # train model + trained_model, training_time = azure_ml.train_model( + X_train, y_train, model_params + ) + + train_time_per_fold.append(round(training_time, 4)) + + # evaluate perf + test_accuracy, infer_time = azure_ml.evaluate_test_perf( + trained_model, X_test, y_test + ) + accuracy_per_fold.append(round(test_accuracy, 4)) + infer_time_per_fold.append(round(infer_time, 4)) + + # update best model [ assumes maximization of perf metric ] + if test_accuracy > global_best_test_accuracy: + global_best_test_accuracy = test_accuracy + + mlflow.log_metric( + "Total training inference time", np.float(training_time + infer_time) + ) + mlflow.log_metric("Accuracy", np.float(global_best_test_accuracy)) + print("\n Accuracy :", 
global_best_test_accuracy) + print("\n accuracy per fold :", accuracy_per_fold) + print("\n train-time per fold :", train_time_per_fold) + print("\n train-time all folds :", sum(train_time_per_fold)) + print("\n infer-time per fold :", infer_time_per_fold) + print("\n infer-time all folds :", sum(infer_time_per_fold)) + + +if __name__ == "__main__": + with PerfTimer() as total_script_time: + main() + print(f"Total runtime: {total_script_time.duration:.2f}") + mlflow.log_metric("Total runtime", float(total_script_time.duration)) + print("\n Exiting script") From 588e36d9619e9ef893591769f224d4fd2aec3047 Mon Sep 17 00:00:00 2001 From: skirui-source Date: Fri, 14 Jul 2023 10:14:52 -0700 Subject: [PATCH 09/13] updated the notebook intro --- .../rapids-azureml-hpo/notebook.ipynb | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb index 7dc55365..48e039a1 100644 --- a/source/examples/rapids-azureml-hpo/notebook.ipynb +++ b/source/examples/rapids-azureml-hpo/notebook.ipynb @@ -19,14 +19,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune. In this notebook, we will show how to speed up hyperparameter optimization by running multiple training jobs in parallel on [NVIDIA DGX Cloud](https://www.nvidia.com/en-us/data-center/dgx-cloud/).\n", + "Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune.
\n", "\n", + "In this notebook, we will show how to speed up hyperparameter optimization by running multiple training jobs in parallel on the [Azure Machine Learning (AzureML)](https://azure.microsoft.com/en-us/products/machine-learning) service.\n", "# Prerequisites\n", "\n", "````{docref} /cloud/azure/azureml\n", "Create an Azure ML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) then follow instructions in [Microsoft Azure Machine Learning](../../cloud/azure/azureml) to launch an Azure ML Compute instance with RAPIDS.\n", "\n", - "\n", "Once your instance is running and you have access to Jupyter save this notebook and run through the cells.\n", "\n", "````" @@ -54,7 +54,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Initialize`MLClient` class to handle the workspace you created in the prerequisites step. \n", + "Initialize the `MLClient` [class](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.mlclient?view=azure-python) to handle the workspace you created in the prerequisites step.
\n", "\n", "You can manually provide the workspace details or call `MLClient.from_config(credential, path)`\n", "to create a workspace object from the details stored in `config.json`" @@ -246,7 +246,7 @@ "metadata": {}, "outputs": [], "source": [ - "experiment_name = \"test_rapids_gpu_cluster\"" + "experiment_name = \"test_rapids_aml_cluster\"" ] }, { @@ -276,7 +276,7 @@ "\n", "env_docker_image = Environment(\n", " build=BuildContext(path=os.getcwd()),\n", - " name=\"rapids-mlflow\",\n", + " name=\"test-rapids-mlflow\",\n", " description=\"RAPIDS environment with azureml-mlflow\",\n", ")\n", "\n", @@ -313,9 +313,9 @@ "\n", "\n", "command_job = command(\n", - " environment=\"rapids-mlflow:1\",\n", + " environment=\"test-rapids-mlflow:1\",\n", " experiment_name=experiment_name,\n", - " code=project_folder,\n", + " code=os.getcwd(),\n", " command=\"python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} --compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}}\\\n", " --n_estimators ${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} --max_features ${{inputs.max_features}}\",\n", " inputs={\n", @@ -480,9 +480,9 @@ "name": "rapids" }, "kernelspec": { - "display_name": "rapids-23.06", + "display_name": "rapids", "language": "python", - "name": "rapids-23.06" + "name": "rapids" }, "language_info": { "codemirror_mode": { @@ -494,7 +494,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" + "version": "3.10.11" }, "microsoft": { "ms_spell_check": { From ccf85d24c17011474cef45d12f5e53f0e388ac0b Mon Sep 17 00:00:00 2001 From: skirui-source Date: Fri, 14 Jul 2023 12:10:22 -0700 Subject: [PATCH 10/13] fix pre-commit style issue --- source/examples/rapids-azureml-hpo/rapids_csp_azure.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/source/examples/rapids-azureml-hpo/rapids_csp_azure.py b/source/examples/rapids-azureml-hpo/rapids_csp_azure.py index 
2a32a92a..e4f53026 100644 --- a/source/examples/rapids-azureml-hpo/rapids_csp_azure.py +++ b/source/examples/rapids-azureml-hpo/rapids_csp_azure.py @@ -59,7 +59,8 @@ def __init__( self.compute_type = compute_type self.verbose_estimator = verbose_estimator self.log_to_file( - f"\n> RapidsCloudML\n\tCompute, Data , Model, Cloud types {self.compute_type, self.data_type, self.model_type, self.cloud_type}" + f"\n> RapidsCloudML\n\tCompute, Data, Model, Cloud types " + f"{self.compute_type, self.data_type, self.model_type, self.cloud_type}" ) # Setting up client for multi-GPU option From 8ade533418dc2ccf383d34b4fa7027f04455c22b Mon Sep 17 00:00:00 2001 From: skirui-source Date: Mon, 17 Jul 2023 10:43:19 -0700 Subject: [PATCH 11/13] need to fix failing pre-commit --- source/examples/rapids-azureml-hpo/notebook.ipynb | 10 +++++----- source/examples/rapids-azureml-hpo/rapids_csp_azure.py | 4 ++-- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb index 48e039a1..53e54e1f 100644 --- a/source/examples/rapids-azureml-hpo/notebook.ipynb +++ b/source/examples/rapids-azureml-hpo/notebook.ipynb @@ -316,8 +316,6 @@ " environment=\"test-rapids-mlflow:1\",\n", " experiment_name=experiment_name,\n", " code=os.getcwd(),\n", - " command=\"python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} --compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}}\\\n", - " --n_estimators ${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} --max_features ${{inputs.max_features}}\",\n", " inputs={\n", " \"data_dir\": Input(type=\"uri_file\", path=data_uri),\n", " \"n_bins\": 32,\n", @@ -327,6 +325,8 @@ " \"max_depth\": 6,\n", " \"max_features\": 0.3,\n", " },\n", + " command=\"python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} --compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}}\\\n", + " --n_estimators 
${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} --max_features ${{inputs.max_features}}\",\n", " compute=\"rapids-cluster\",\n", ")\n", "\n", @@ -480,9 +480,9 @@ "name": "rapids" }, "kernelspec": { - "display_name": "rapids", + "display_name": "rapids-23.06", "language": "python", - "name": "rapids" + "name": "rapids-23.06" }, "language_info": { "codemirror_mode": { @@ -494,7 +494,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.12" }, "microsoft": { "ms_spell_check": { diff --git a/source/examples/rapids-azureml-hpo/rapids_csp_azure.py b/source/examples/rapids-azureml-hpo/rapids_csp_azure.py index e4f53026..32982edb 100644 --- a/source/examples/rapids-azureml-hpo/rapids_csp_azure.py +++ b/source/examples/rapids-azureml-hpo/rapids_csp_azure.py @@ -13,7 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. # - +# import json import logging import pprint @@ -185,7 +185,7 @@ def load_data( elif "multi" in self.compute_type: self.log_to_file("\n\tReading using dask dataframe") dataset = dask.dataframe.read_parquet( - target_filename, columns=columns + target_filename, columns=col_labels ) elif "GPU" in self.compute_type: From ce4ccaa68b8a781a3963c53c6a8cfdbbdcc06092 Mon Sep 17 00:00:00 2001 From: skirui-source Date: Mon, 17 Jul 2023 11:41:10 -0700 Subject: [PATCH 12/13] all hooks passing now.ready for review --- source/examples/rapids-azureml-hpo/rapids_csp_azure.py | 1 + 1 file changed, 1 insertion(+) diff --git a/source/examples/rapids-azureml-hpo/rapids_csp_azure.py b/source/examples/rapids-azureml-hpo/rapids_csp_azure.py index 32982edb..ea7724ea 100644 --- a/source/examples/rapids-azureml-hpo/rapids_csp_azure.py +++ b/source/examples/rapids-azureml-hpo/rapids_csp_azure.py @@ -25,6 +25,7 @@ import dask_cudf import numpy as np import pandas as pd +import pyarrow.orc as pyarrow_orc import sklearn import xgboost from 
cuml.dask.common import utils as dask_utils From 398360268c928b34ece567a0cff70d8cb430f961 Mon Sep 17 00:00:00 2001 From: skirui-source Date: Mon, 17 Jul 2023 14:31:46 -0700 Subject: [PATCH 13/13] uploaded notebook with cell outputs --- .../rapids-azureml-hpo/notebook.ipynb | 159 ++++++++++++++---- 1 file changed, 129 insertions(+), 30 deletions(-) diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb index 53e54e1f..4cb4321f 100644 --- a/source/examples/rapids-azureml-hpo/notebook.ipynb +++ b/source/examples/rapids-azureml-hpo/notebook.ipynb @@ -34,9 +34,27 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Name: azure-ai-ml\n", + "Version: 1.8.0\n", + "Summary: Microsoft Azure Machine Learning Client Library for Python\n", + "Home-page: https://github.com/Azure/azure-sdk-for-python\n", + "Author: Microsoft Corporation\n", + "Author-email: azuresdkengsysadmins@microsoft.com\n", + "License: MIT License\n", + "Location: /anaconda/envs/rapids/lib/python3.10/site-packages\n", + "Requires: azure-common, azure-core, azure-mgmt-core, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, opencensus-ext-azure, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions\n", + "Required-by: \n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], "source": [ "# verify Azure ML SDK version\n", "\n", @@ -62,9 +80,19 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Workspace name: rapids-aml-cluster\n", + "Subscription id: fc4f4a6b-4041-4b1c-8249-854d68edcf62\n", + "Resource group: 
rapidsai-deployment\n" + ] + } + ], "source": [ "from azure.ai.ml import MLClient\n", "from azure.identity import DefaultAzureCredential\n", @@ -104,9 +132,18 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "data uri: \n", + " azureml://subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster/datastores/workspaceartifactstore/paths/airline_20000000.parquet\n" + ] + } + ], "source": [ "datastore_name = \"workspaceartifactstore\"\n", "dataset = \"airline_20000000.parquet\"\n", @@ -146,9 +183,17 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "found compute target. Will use rapids-cluster\n" + ] + } + ], "source": [ "from azure.ai.ml.entities import AmlCompute\n", "\n", @@ -214,7 +259,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -242,7 +287,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -267,16 +312,36 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[32mUploading code (0.33 MBs): 100%|██████████| 325450/325450 [00:00<00:00, 2363322.62it/s]\n", + "\u001b[39m\n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "Environment({'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'rapids-mlflow', 'description': 'RAPIDS environment with azureml-mlflow', 'tags': {}, 'properties': {}, 
'print_as_yaml': True, 'id': '/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourceGroups/rapidsai-deployment/providers/Microsoft.MachineLearningServices/workspaces/rapids-aml-cluster/environments/rapids-mlflow/versions/10', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/skirui1/code', 'creation_context': , 'serialize': , 'version': '10', 'latest_version': None, 'conda_file': None, 'image': None, 'build': , 'inference_config': None, 'os_type': 'Linux', 'arm_type': 'environment_version', 'conda_file_path': None, 'path': None, 'datastore': None, 'upload_hash': None, 'translated_conda_file': None})" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# RUN THIS CODE ONCE TO SETUP ENVIRONMENT\n", "from azure.ai.ml.entities import Environment, BuildContext\n", "\n", "env_docker_image = Environment(\n", " build=BuildContext(path=os.getcwd()),\n", - " name=\"test-rapids-mlflow\",\n", + " name=\"rapids-mlflow\",\n", " description=\"RAPIDS environment with azureml-mlflow\",\n", ")\n", "\n", @@ -300,20 +365,46 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": { "tags": [ "library/randomforest", "library/cudf" ] }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.\n", + "Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.\n", + "Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.\n", + "Class IntellectualPropertySchema: This is an experimental class, and may change at any time. 
Please see https://aka.ms/azuremlexperimental for more information.\n", + "Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.\n", + "Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.\n", + "\u001b[32mUploading code (0.33 MBs): 100%|██████████| 327210/327210 [00:00<00:00, 1802654.05it/s]\n", + "\u001b[39m\n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "'https://ml.azure.com/runs/zen_eye_lm7dcp68jz?wsid=/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster&tid=43083d15-7273-40c1-b7db-39efd9ccc17a'" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "from azure.ai.ml import command, Input\n", "\n", "\n", "command_job = command(\n", - " environment=\"test-rapids-mlflow:1\",\n", + " environment=\"rapids-mlflow:1\",\n", " experiment_name=experiment_name,\n", " code=os.getcwd(),\n", " inputs={\n", @@ -368,7 +459,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ @@ -391,7 +482,7 @@ "\n", "# Define the limits for this sweep\n", "sweep_job.set_limits(\n", - " max_total_trials=5, max_concurrent_trials=2, timeout=18000, trial_timeout=3600\n", + " max_total_trials=10, max_concurrent_trials=2, timeout=18000, trial_timeout=3600\n", ")\n", "\n", "\n", @@ -409,7 +500,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -426,9 +517,17 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Monitor your job at 
https://ml.azure.com/runs/eager_turtle_r7fs2xzcty?wsid=/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster&tid=43083d15-7273-40c1-b7db-39efd9ccc17a\n" + ] + } + ], "source": [ "aml_url = returned_sweep_job.studio_url\n", "\n", @@ -451,7 +550,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ @@ -467,11 +566,11 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "metadata": {}, "outputs": [], "source": [ - "ml_client.compute.begin_delete(gpu_compute_target.name).wait()" + "ml_client.compute.begin_delete(gpu_compute_target).wait()" ] } ],