diff --git a/source/examples/rapids-azureml-hpo/notebook.ipynb b/source/examples/rapids-azureml-hpo/notebook.ipynb new file mode 100644 index 00000000..8f154c78 --- /dev/null +++ b/source/examples/rapids-azureml-hpo/notebook.ipynb @@ -0,0 +1,538 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "workflows/hpo", + "cloud/azure/ml" + ] + }, + "source": [ + "# Train and hyperparameter tune with RAPIDS" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create an Azure ML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) and set up an environment on your local computer or an Azure ML Compute Instance, following these [instructions](https://docs.rapids.ai/deployment/stable/cloud/azure/azureml/#azure-ml-compute-instance).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# verify Azure ML SDK version\n", + "\n", + "%pip show azure-ai-ml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialize workspace" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Initialize an `MLClient` object to get a handle to the workspace you created in the prerequisites step.
\n", + "\n", + "You can either provide the workspace details manually or call `MLClient.from_config(credential, path)`\n", + "to create the client from the details stored in `config.json`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.ml import MLClient\n", + "from azure.identity import DefaultAzureCredential\n", + "\n", + "\n", + "# Get a handle to the workspace\n", + "ml_client = MLClient(\n", + " credential=DefaultAzureCredential(),\n", + " subscription_id=\"fc4f4a6b-4041-4b1c-8249-854d68edcf62\",\n", + " resource_group_name=\"rapidsai-deployment\",\n", + " workspace_name=\"rapids-aml-cluster\",\n", + ")\n", + "\n", + "print(\n", + " \"Workspace name: \" + ml_client.workspace_name,\n", + " \"Subscription id: \" + ml_client.subscription_id,\n", + " \"Resource group: \" + ml_client.resource_group_name,\n", + " sep=\"\\n\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Access data from Datastore URI" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this example, we will use 20 million rows of the airline dataset.
The [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls#access-data-from-a-datastore-uri-like-a-filesystem-preview) below references the storage location (path) containing the Parquet files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "datastore_name = \"workspaceartifactstore\"\n", + "dataset = \"airline_20000000.parquet\"\n", + "\n", + "# Datastore URI format:\n", + "data_uri = f\"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}\"\n", + "\n", + "print(\"data uri:\", \"\\n\", data_uri)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create AML compute" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You will need to create an Azure ML managed compute target ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for training your model.\n", + "\n", + "This notebook uses up to 5 nodes (`max_instances=5`) for hyperparameter optimization; you can modify `max_instances` based on the available quota in your region. As with other Azure ML services, AmlCompute has limits; this [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) describes the default limits and how to request more quota." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`size` describes the virtual machine type and size used for the cluster nodes.
RAPIDS requires an NVIDIA Pascal or newer GPU architecture, so\n", + "you will need to select compute targets from one of the\n", + "[GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provisioned with P100, P40 or V100 GPUs: `NC_v2`, `NC_v3`, `ND` or `ND_v2`.\n", + "\n", + "Let's create an `AmlCompute` cluster of `Standard_NC12s_v3` GPU VMs:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.ml.entities import AmlCompute\n", + "\n", + "# specify the AML compute name\n", + "gpu_compute_target = \"rapids-cluster\"\n", + "\n", + "try:\n", + " # let's see if the compute target already exists\n", + " gpu_target = ml_client.compute.get(gpu_compute_target)\n", + " print(f\"Found compute target. Will use {gpu_compute_target}\")\n", + "except Exception:\n", + " print(\"Creating a new GPU compute target...\")\n", + "\n", + " gpu_target = AmlCompute(\n", + " name=gpu_compute_target,\n", + " type=\"amlcompute\",\n", + " size=\"STANDARD_NC12S_V3\",\n", + " max_instances=5,\n", + " idle_time_before_scale_down=300,\n", + " )\n", + " ml_client.compute.begin_create_or_update(gpu_target).result()\n", + "\n", + " print(\n", + " f\"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Prepare training script" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "library/cuml" + ] + }, + "source": [ + "Create a project directory containing the code to run on the remote resource: the training script and any additional files it depends on. In this example, the training script is provided:\n", + "\n", + "`train_rapids.py`: entry script for the RAPIDS environment. It loads the dataset into a cuDF DataFrame, trains a Random Forest model and runs inference using cuML."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "project_folder = \"./train_rapids\" # create folder in same dir\n", + "os.makedirs(project_folder, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Within the training script, we log some parameters and metrics, including the highest accuracy, using MLflow:\n", + "\n", + "```python\n", + "import mlflow\n", + "\n", + "mlflow.log_metric('Accuracy', float(global_best_test_accuracy))\n", + "```\n", + "\n", + "These run metrics become especially important in the 'Tune model hyperparameters' section, where the sweep uses `Accuracy` as its primary metric." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copy the training scripts `train_rapids.py` and `rapids_csp_azure.py` into your project directory:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "\n", + "notebook_path = os.path.realpath(\n", + " \"__file__\" + \"/../../code\"\n", + ") # dir containing the training scripts (../code relative to the working directory)\n", + "rapids_script = os.path.join(notebook_path, \"train_rapids.py\")\n", + "azure_script = os.path.join(notebook_path, \"rapids_csp_azure.py\")\n", + "\n", + "\n", + "shutil.copy(rapids_script, project_folder)\n", + "shutil.copy(azure_script, project_folder)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Train model on the remote compute" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that you have your data and training script prepared, you are ready to train on your remote compute." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create experiment\n", + "\n", + "Track all the runs in your workspace under a named experiment:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ +
"experiment_name = \"test_rapids_gpu_cluster\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set up environment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll use a [custom](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image) RAPIDS Docker image, built from the Dockerfile in the `./docker` directory, to set up the environment. RAPIDS images are available in the [rapidsai/rapidsai repo](https://hub.docker.com/r/rapidsai/rapidsai/) on Docker Hub." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# RUN THIS CODE ONCE TO SET UP THE ENVIRONMENT\n", + "from azure.ai.ml.entities import Environment, BuildContext\n", + "\n", + "env_docker_image = Environment(\n", + " build=BuildContext(path=\"./docker\"),\n", + " name=\"rapids-mlflow\",\n", + " description=\"RAPIDS environment with azureml-mlflow\",\n", + ")\n", + "\n", + "ml_client.environments.create_or_update(env_docker_image)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Submit the training job" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will configure and run a training job using the `command` function.
The [command](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml?view=azure-python#azure-ai-ml-command) function builds a job that can run standalone or as a step inside a pipeline.\n", + "`inputs` is a dictionary of values that fill the `${{inputs.<name>}}` placeholders in the command string and are passed to the training script as command-line arguments.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "library/randomforest", + "library/cudf" + ] + }, + "outputs": [], + "source": [ + "from azure.ai.ml import command, Input\n", + "\n", + "\n", + "command_job = command(\n", + " environment=\"rapids-mlflow:1\",\n", + " experiment_name=experiment_name,\n", + " code=project_folder,\n", + " command=\"python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} --compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}}\\\n", + " --n_estimators ${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} --max_features ${{inputs.max_features}}\",\n", + " inputs={\n", + " \"data_dir\": Input(type=\"uri_file\", path=data_uri),\n", + " \"n_bins\": 32,\n", + " \"compute\": \"single-GPU\", # multi-GPU for algorithms via Dask\n", + " \"cv_folds\": 5,\n", + " \"n_estimators\": 100,\n", + " \"max_depth\": 6,\n", + " \"max_features\": 0.3,\n", + " },\n", + " compute=gpu_compute_target,\n", + ")\n", + "\n", + "\n", + "# submit the command\n", + "returned_job = ml_client.jobs.create_or_update(command_job)\n", + "\n", + "# get a URL for the status of the job\n", + "returned_job.studio_url" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tune model hyperparameters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can optimize our model's hyperparameters and improve its accuracy using Azure Machine Learning's hyperparameter tuning capabilities."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Start a hyperparameter sweep" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's define the hyperparameter space to sweep over. We will tune the `n_estimators`, `max_depth` and `max_features` parameters, using random sampling to try different hyperparameter configurations and maximize the `Accuracy` metric logged by the training script." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.ml.sweep import Choice, Uniform\n", + "\n", + "command_job_for_sweep = command_job(\n", + " n_estimators=Choice(values=range(50, 500)),\n", + " max_depth=Choice(values=range(5, 19)),\n", + " max_features=Uniform(min_value=0.2, max_value=1.0),\n", + ")\n", + "\n", + "# apply the sweep settings to obtain the sweep_job\n", + "sweep_job = command_job_for_sweep.sweep(\n", + " compute=gpu_compute_target,\n", + " sampling_algorithm=\"random\",\n", + " primary_metric=\"Accuracy\",\n", + " goal=\"Maximize\",\n", + ")\n", + "\n", + "\n", + "# Define the limits for this sweep (timeouts are in seconds)\n", + "sweep_job.set_limits(\n", + " max_total_trials=5, max_concurrent_trials=2, timeout=18000, trial_timeout=3600\n", + ")\n", + "\n", + "\n", + "# Specify your experiment details\n", + "sweep_job.display_name = \"RF-rapids-sweep-job\"\n", + "sweep_job.description = \"Run RAPIDS hyperparameter sweep job\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Submitting the sweep job launches the RAPIDS training script once per trial, with hyperparameter values sampled from the search space defined in the cell above."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# submit the HPO sweep job\n", + "returned_sweep_job = ml_client.create_or_update(sweep_job)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Monitor the sweep job" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "aml_url = returned_sweep_job.studio_url\n", + "\n", + "print(\"Monitor your job at\", aml_url)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download the best model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once the sweep job has completed, download the `model` output of the best trial:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ml_client.jobs.download(returned_sweep_job.name, output_name=\"model\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Delete cluster" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ml_client.compute.begin_delete(gpu_compute_target).wait()" + ] + } + ], + "metadata": { + "kernel_info": { + "name": "rapids" + }, + "kernelspec": { + "display_name": "rapids", + "language": "python", + "name": "rapids" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + }, + "microsoft": { + "ms_spell_check": { + "ms_spell_check_language": "en" + } + }, + "nteract": { + "version": "nteract-front-end@1.0.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}