From f48c1f976eee34d114c44e958ac805ef08627845 Mon Sep 17 00:00:00 2001 From: skirui-source Date: Tue, 18 Jul 2023 09:28:51 -0700 Subject: [PATCH] add more details / description --- .../notebook.ipynb | 121 +++++++++++------- 1 file changed, 73 insertions(+), 48 deletions(-) diff --git a/source/examples/xgboost-rf-gpu-cpu-benchmark/notebook.ipynb b/source/examples/xgboost-rf-gpu-cpu-benchmark/notebook.ipynb index df7fe7a9..c3abae86 100644 --- a/source/examples/xgboost-rf-gpu-cpu-benchmark/notebook.ipynb +++ b/source/examples/xgboost-rf-gpu-cpu-benchmark/notebook.ipynb @@ -53,7 +53,7 @@ "source": [ " **ML Workflow** \n", "\n", - "In order to work with RAPIDS container, the entrypoint logic should parse arguments, load and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyperparameter setting.\n", + "In order to work with the RAPIDS container, the entrypoint logic should parse arguments, load, preprocess, and split the data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyperparameter setting.\n", "\n", "Let's have a step-by-step look at each stage of the ML workflow:" ] }, @@ -69,12 +69,11 @@ "\n", "We host the demo dataset in public S3 demo buckets in both the `us-east-1` and `us-west-2` regions. To optimize performance, we recommend that you access the S3 bucket in the same region as your EC2 instance to reduce network latency and data transfer costs.
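As a quick sketch of what region-matched access can look like — the bucket name and object key below are illustrative placeholders, not the real demo bucket layout:

```python
# Sketch only: "example-demo-bucket" and the object key are hypothetical
# placeholders, not the actual public demo bucket.
def dataset_url(region: str) -> str:
    # The demo data is mirrored in us-east-1 and us-west-2; pick the mirror
    # that matches your EC2 instance's region to minimize latency and cost.
    if region not in ("us-east-1", "us-west-2"):
        raise ValueError(f"demo data is not mirrored in {region}")
    return f"https://example-demo-bucket.s3.{region}.amazonaws.com/airline/3_year.parquet"
```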
\n", "\n", - "For this demo, we are using the 3_year dataset, which includes the following features to mention a few:\n", + "For this demo, we are using the **`3_year`** dataset, which includes features such as the following:\n", "\n", - "* Locations and distance ( Origin, Dest, Distance )\n", - "* Airline / carrier ( Reporting_Airline )\n", - "* Scheduled departure and arrival times ( CRSDepTime and CRSArrTime )\n", - "* Actual departure and arrival times ( DpTime and ArrTime )\n", + "* Date and distance ( Year, Month, Distance )\n", + "* Airline / carrier ( Flight_Number_Reporting_Airline )\n", + "* Actual departure and arrival times ( DepTime and ArrTime )\n", "* Difference between scheduled & actual times ( ArrDelay and DepDelay )\n", "* Binary encoded version of late, aka our target variable ( ArrDelay15 )\n", "\n" ] }, @@ -124,6 +123,35 @@ "```\n" ] }, { "cell_type": "markdown", "id": "3b1759de-98af-4628-a79b-a236a2dee5a2", "metadata": {}, "source": [ " Dask Cluster " ] }, { "cell_type": "markdown", "id": "533be0b1-0d5e-46b3-9ff1-dd71751fe68f", "metadata": {}, "source": [ "To maximize efficiency, we launch a Dask `LocalCluster` for CPU mode or a `LocalCUDACluster` that utilizes GPUs for distributed computing. We then connect a Dask `Client` to submit and manage computations on the cluster.
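The pairing between mode, cluster type, and dataframe backend can be sketched as follows — the helper name and returned labels are illustrative, not taken from `hpo.py` (`dask_cudf` is the cuDF-backed distributed dataframe package commonly paired with GPU clusters):

```python
def select_stack(mode: str) -> tuple:
    """Illustrative helper: map the benchmark mode to the cluster type and
    dataframe backend described above. hpo.py constructs these directly."""
    if mode == "gpu":
        # LocalCUDACluster starts one worker per visible GPU.
        return ("LocalCUDACluster", "dask_cudf")
    if mode == "cpu":
        # LocalCluster runs CPU workers; pair it with dask.dataframe.
        return ("LocalCluster", "dask.dataframe")
    raise ValueError(f"unknown mode: {mode!r}")
```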
\n", "\n", "We can then ingest the data and \"persist\" it in memory using Dask as follows:\n", "\n", "```python\n", "if args.mode == \"gpu\":\n", "    cluster = LocalCUDACluster()\n", "else:  # mode == \"cpu\"\n", "    cluster = LocalCluster(n_workers=os.cpu_count())\n", "\n", "with Client(cluster) as client:\n", "    dataset = ingest_data(mode=args.mode)\n", "    client.persist(dataset)\n", "```\n" ] }, { "cell_type": "markdown", "id": "feef53b4-d7e0-43e2-b85b-372bd2d882f7", "source": [ " Search Range \n", - "One of the most important choices when running HPO is to choose the bounds of the hyperparameter search process. In this notebook, we leverage the power of `Optuna`, a widely used Python library for hyperparameter optimization as such:\n", + "One of the most important choices when running HPO is to choose the bounds of the hyperparameter search process. In this notebook, we leverage the power of `Optuna`, a widely used Python library for hyperparameter optimization.\n", "\n", + "Here are the quick steps to get started with Optuna:\n", "\n", - "1) Define the Objective Function. Represent the model training and evaluation process, takes hyperparameters as inputs and returns a metric to optimize (e.g., accuracy, loss). Refer to `train_xgboost()` and `train_randomforest()` in `hpo.py`\n", "\n", - "2. Specify the search space for your hyperparameters using \n", + "1. Define the Objective Function, which represents the model training and evaluation process. It takes hyperparameters as inputs and returns a metric to optimize (e.g., accuracy in our case). Refer to `train_xgboost()` and `train_randomforest()` in `hpo.py`.\n", + "\n", + "2. Specify the search space using the `Trial` object's methods to define the hyperparameters and their corresponding value ranges or distributions.
For example:\n", "\n", "```python\n", "\"max_depth\": trial.suggest_int(\"max_depth\", 4, 8),\n", "\"max_features\": trial.suggest_float(\"max_features\", 0.1, 1.0),\n", "\"learning_rate\": trial.suggest_float(\"learning_rate\", 0.001, 0.1, log=True),\n", "\"min_samples_split\": trial.suggest_int(\"min_samples_split\", 2, 1000, log=True),\n", "```\n", "\n", "3. Create an Optuna study object to keep track of trials and their corresponding hyperparameter configurations and evaluation metrics.\n", "\n", @@ -146,7 +183,7 @@ "    )\n", "```\n", "\n", - "4. Select an optimization algorithm to determine how Optuna explores and exploits the search space to find optimal configurations. As shown in the code above, \n", + "4. Select an optimization algorithm to determine how Optuna explores and exploits the search space to find optimal configurations. For instance, the `RandomSampler` is an algorithm provided by the Optuna library that samples hyperparameter configurations randomly from the search space.\n", "\n", "5. Run the optimization by calling Optuna's `optimize()` function on the study object. You can specify the number of trials or the number of parallel jobs to run.\n", "\n", @@ -160,35 +197,6 @@ "```" ] }, { "cell_type": "markdown", "id": "3b1759de-98af-4628-a79b-a236a2dee5a2", "metadata": {}, "source": [ " Dask Cluster " ] }, { "cell_type": "markdown", "id": "533be0b1-0d5e-46b3-9ff1-dd71751fe68f", "metadata": {}, "source": [ "To maximize on efficiency, we launch a Dask `LocalCluster` for cpu or `LocalCUDACluster` that utilizes GPUs for distributed computing. Then connect a Dask Client to submit and manage computations on the cluster.
\n", "\n", "We can then ingest the data, and \"persist\" it in memory using dask as follows:\n", "\n", "```python\n", "if args.mode == \"gpu\":\n", "    cluster = LocalCUDACluster()\n", "else:  # mode == \"cpu\"\n", "    cluster = LocalCluster(n_workers=os.cpu_count())\n", "\n", "with Client(cluster) as client:\n", "    dataset = ingest_data(mode=args.mode)\n", "    client.persist(dataset)\n", "```\n" ] }, { "cell_type": "markdown", "id": "a89edfea-ca14-4d26-94c6-0ef8eaf02d77", @@ -196,9 +204,7 @@ "source": [ " **Build RAPIDS Container** \n", "\n", - "Now that we have a fundamental understanding of our workflow process, we can test the code in custom docker container. \n", - "\n", - "Starting with latest rapids docker image, we only need to install `optuna` as the container comes with most necessary packages" + "Now that we have a fundamental understanding of our workflow process, we can test the code. Starting with the latest RAPIDS docker image, we only need to install `optuna`, as the container comes with most of the necessary packages." ] }, @@ -218,6 +224,7 @@ "metadata": {}, "outputs": [], "source": [ "# make sure you have the correct CUDA toolkit version to build the latest RAPIDS container\n", "!nvidia-smi" ] }, @@ -236,7 +243,7 @@ "id": "a825e0de-a1a1-4c2b-82bb-b33dfc494fd1", "metadata": {}, "source": [ - "Be sure to build and tag appropriately" + "The build step will be dominated by the download of the RAPIDS image (base layer). If it has already been downloaded, the build will take less than 1 minute." ] }, @@ -264,9 +271,9 @@ "id": "baca52e2-09e7-42f3-bc98-5ee38f9e274f", "metadata": {}, "source": [ - "a tedius lengthy process, use a tool like tmux to handle SSH disconnection, avoid hpo runs interruptions \n", + "Executing benchmark tests can be an arduous and time-consuming procedure that may extend over multiple days.\n", "\n", - "also while running the container, be sure to expose all gpus (why?) and jupyter lab via ports ..."
+ "By using a tool like [tmux](https://www.redhat.com/sysadmin/introduction-tmux-linux), you can maintain active terminal sessions, ensuring that your tasks continue running even if the SSH connection is interrupted. This allows you to resume your work seamlessly, without losing any progress or requiring you to restart the entire process." ] }, @@ -276,7 +283,21 @@ "metadata": {}, "outputs": [], "source": [ - "# !tmux" + "# start a tmux session using this command\n", + "\n", + "!tmux" ] }, { "cell_type": "markdown", "id": "77df8ce3-39b8-41d9-a538-ae404be25b45", "metadata": {}, "source": [ "When starting the container, be sure to pass the `--gpus all` flag to make all available GPUs on the host machine accessible within the Docker environment. \n", "\n", "Use the `-v` (or `--volume`) option to mount the working directory from the host machine into the Docker container. This makes data or directories on the host machine accessible within the container, and any changes made to the mounted files or directories are reflected in both the host and container environments.\n", "\n", "Optionally, expose Jupyter Lab via ports 8786-8888." ] }, { @@ -298,9 +319,13 @@ "source": [ " **Run HPO** \n", "\n", - "Navigate to the host directory inside the container and run the python script with the following command : \n", + "Navigate to the host directory inside the container and run the Python training script with the following command:\n", + "\n", + "```bash\n", + "python ./hpo.py --model-type \"XGBoost\" --mode \"gpu\" > xgboost_gpu.txt 2>&1\n", + "```\n", "\n", - " python ./hpo.py --model-type \"XGBoost\" --mode \"gpu\" > xgboost_gpu.txt 2>&1\n" + "The command above will run XGBoost HPO jobs on the GPU and write the benchmark results to a text file. You can run the same benchmark for RandomForest by simply changing the `--model-type` argument, and change `--mode` to `cpu` accordingly." ] } ],