From de30777ef51ee06f0a03f2200eb19bc3ce3be2e2 Mon Sep 17 00:00:00 2001 From: Balaji Alwar Date: Fri, 10 May 2024 22:15:05 -0700 Subject: [PATCH] test nb --- test_notebooks/lab14.ipynb | 995 +++++++++++++++++++++++++++++++++++++ 1 file changed, 995 insertions(+) create mode 100644 test_notebooks/lab14.ipynb diff --git a/test_notebooks/lab14.ipynb b/test_notebooks/lab14.ipynb new file mode 100644 index 0000000..074082e --- /dev/null +++ b/test_notebooks/lab14.ipynb @@ -0,0 +1,995 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false + }, + "outputs": [], + "source": [ + "# Initialize Otter\n", + "import otter\n", + "grader = otter.Notebook(\"lab14.ipynb\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "intro", + "locked": true, + "schema_version": 2, + "solution": false + } + }, + "source": [ + "# Lab 14: Clustering\n", + "\n", + "In this lab, you will explore K-Means, Hierarchical Agglomerative Clustering, and Spectral Clustering. Spectral Clustering is out of scope for Spring 2023.\n", + "\n", + "### Due Date\n", + "\n", + "The on-time deadline is **Tuesday, May 2nd, 11:59 PM PT**. Please read the syllabus for the grace period policy. No late submissions beyond the grace period will be accepted." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "imports", + "locked": true, + "schema_version": 2, + "solution": false + } + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn import cluster\n", + "\n", + "# Uncomment the following lines for more readable exceptions. \n", + "# You may need to restart your server to load the new package after running this cell. \n", + "# %pip install --quiet iwut\n", + "# %load_ext iwut\n", + "# %wut on" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Lab Walk-Through\n", + "In addition to the lab notebook, we have also released a prerecorded walk-through video of the lab. We encourage you to reference this video as you work through the lab. Run the cell below to display the video." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import YouTubeVideo\n", + "YouTubeVideo(\"JutlJczX9RE\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "In the first part of this lab, we work with three different toy datasets, all with different clustering characteristics. In the second part, we explore a real-world dataset from the World Bank.\n", + "\n", + "
\n", + "\n", + "## Toy Data 1: Balanced Clusters\n", + "\n", + "Let's begin with a toy dataset with three groups that are completely separated with the variables given in a 2D space. All groups/clusters have the same number of points and the variance in all groups is the same." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell to plot the datapoints.\n", + "np.random.seed(1337)\n", + "\n", + "c1 = np.random.normal(size = (25, 2))\n", + "c2 = np.array([2, 8]) + np.random.normal(size = (25, 2))\n", + "c3 = np.array([8, 4]) + np.random.normal(size = (25, 2))\n", + "\n", + "x1 = np.vstack((c1, c2, c3))\n", + "\n", + "sns.scatterplot(x = x1[:, 0], y = x1[:, 1]);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below, we create a `cluster.KMeans` object ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)) which implements the K-Means algorithm." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plotting K-Means output\n", + "\n", + "def plot_k_means(data, classifier):\n", + " palette=['orange','brown','dodgerblue', 'red']\n", + " sns.scatterplot(x = data[:, 0], y = data[:, 1], \n", + " hue = classifier.labels_, \n", + " palette = palette[:classifier.n_clusters])\n", + " sns.scatterplot(x = classifier.cluster_centers_[:, 0], \n", + " y = classifier.cluster_centers_[:, 1],\n", + " color = 'green', marker = 'x', s = 150, linewidth = 3);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Run this cell to plot the K-Means cluster.\n", + "\n", + "kmeans = cluster.KMeans(n_clusters = 3, random_state = 42).fit(x1)\n", + "plot_k_means(x1, kmeans)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We observe that K-Means is able to accurately pick out the three initial clusters. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "---\n", + "\n", + "## Question 1: Initial Centers\n", + "\n", + "In the previous example, the K-Means algorithm was able to accurately find the three initial clusters. However, changing the starting centers for K-Means can change the final clusters that K-Means gives us. Change the initial centers to the points `[0, 1]`, `[1, 1]`, and `[2, 2]`; and fit a `cluster.KMeans` object ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)) called `kmeans_q1` on the toy dataset from the previous example. Keep the `random_state` parameter as 42 and the `n_clusters` parameter as 3.\n", + "\n", + "**Hint:** You will need to change the `init` and `n_init` parameters in `cluster.KMeans`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "kmeans_q1 = ..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false + }, + "outputs": [], + "source": [ + "grader.check(\"q1\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Running the K-Means algorithm with these centers gives us a different result from before, and this particular run of K-Means was unable to accurately find the three initial clusters." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "plot_k_means(x1, kmeans_q1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "*Note*: Instead of using predetermined centers, one option is to use `init=\"random\"` and sklearn will randomly select initial centers. As shown above, the clustering result varies significantly based on your initial centers. Therefore, `init=\"random` is often combined with `n_init` of greater than 1. sklearn will then run `n_init` number of times with different randomly initial centers and return the best result. In our case, we are interested in specific initial centers, so we set `n_init=1`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "## Toy Data 2: Clusters of Different Sizes\n", + "\n", + "Sometimes, K-Means will have a difficult time finding the \"correct\" clusters even with ideal starting centers. For example, consider the data below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell to plot the datapoints.\n", + "np.random.seed(1337)\n", + "\n", + "c1 = 0.5 * np.random.normal(size = (25, 2))\n", + "c2 = np.array([10, 10]) + 3 * np.random.normal(size = (475, 2))\n", + "x2 = np.vstack((c1, c2))\n", + "sns.scatterplot(x = x2[:, 0], y = x2[:, 1]);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are two groups of different sizes in two different senses: **variability** (i.e., spread) and **number of datapoints**. The smaller group has both smaller variability and has fewer datapoints, and the larger of the two groups is more diffuse and populated." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "---\n", + "\n", + "## Question 2a: K-Means\n", + "\n", + "Fit a `cluster.KMeans` object ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)) called `kmeans_q2a` on the dataset above with two clusters and a `random_state` parameter of 42.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "kmeans_q2a = ..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false + }, + "outputs": [], + "source": [ + "grader.check(\"q2a\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "For notational simplicity we will call the initial cluster on the bottom left $A$ and the initial cluster on the top right $B$. We will call the bottom left cluster found by K-Means as cluster $a$ and the top right cluster found by K-Means as cluster $b$. \n", + "\n", + "As seen below, K-Means is unable to find the two intial clusters because cluster $A$ includes points from cluster $B$. Recall that K-Means attempts to minimize inertia, so it makes sense that points in the bottom left of cluster $B$ would prefer to be in cluster $A$ rather than cluster $B$. If these points were in cluster $B$ instead, then the resulting cluster assignments would have a larger distortion." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell to plot the kmeans cluster.\n", + "\n", + "plot_k_means(x2, kmeans_q2a)\n", + "\n", + "plt.annotate('B', xy=kmeans_q2a.cluster_centers_[0], fontsize=20)\n", + "plt.annotate('A', xy=kmeans_q2a.cluster_centers_[1], fontsize=20);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "---\n", + "\n", + "## Hierarchical Agglomerative Clustering: The Linkage Criterion\n", + "\n", + "As also shown in the lecture 27, hierarchical agglomerative clustering works better for this task, as long as we choose the right definition of distance between two clusters. Recall that hierarchical clustering starts with every data point in its own cluster and iteratively joins the two closest clusters until there are $k$ clusters remaining. However, the \"distance\" between two clusters is ambiguous. \n", + "\n", + "In lecture, we used the maximum distance between a point in the first cluster and a point in the second as this notion of distance (Complete Linkage hierarchical clusyering), but there are other ways to define the distance between two clusters. \n", + "\n", + "Our choice of definition for the distance is sometimes called the \"linkage criterion.\" We will discuss three linkage criteria, each of which is a different definition of \"distance\" between two clusters:\n", + "\n", + "- **Complete linkage** considers the distance between two clusters as the **maximum** distance between a point in the first cluster and a point in the second. This is what you saw in Lecture 27.\n", + "- **Single linkage** considers the distance between two clusters as the **minimum** distance between a point in the first cluster and a point in the second.\n", + "- **Average linkage** considers the distance between two clusters as the **average** distance between points in the first and second clusters.\n", + "\n", + "Below, we fit a `cluster.AgglomerativeClustering` object ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)) called `agg_complete` on the dataset above with two clusters, using the **complete linkage criterion**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell to fit perform agglomerative clustering on the data.\n", + "agg_complete = cluster.AgglomerativeClustering(n_clusters = 2, linkage = 'complete').fit(x2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below we visualize the results:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell to plot the agglomerative clusters.\n", + "sns.scatterplot(x = x2[:, 0], y = x2[:, 1], hue = agg_complete.labels_);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It looks like complete linkage agglomerative clustering has the same issue as K-Means! The bottom left cluster found by complete linkage agglomerative clustering includes points from the top right cluster. However, we can remedy this by picking a different linkage criterion." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "---\n", + "\n", + "## Question 2b: Agglomerative Clustering\n", + "\n", + "Now, use the **single linkage criterion** to fit a `cluster.AgglomerativeClustering` object called `agg_single` on the dataset above with two clusters.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "agg_single = ..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false + }, + "outputs": [], + "source": [ + "grader.check(\"q2b\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we see that single linkage agglomerative clustering is able to find the two initial clusters. (We have also included code to plot a dendrogram, but dendrogram is not in-scope for the exam.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from scipy.cluster.hierarchy import dendrogram\n", + "def plot_dendrogram(model, **kwargs):\n", + " # Create linkage matrix and then plot the dendrogram\n", + " # Create the counts of samples under each node\n", + " counts = np.zeros(model.children_.shape[0])\n", + " n_samples = len(model.labels_)\n", + " for i, merge in enumerate(model.children_):\n", + " current_count = 0\n", + " for child_idx in merge:\n", + " if child_idx < n_samples:\n", + " current_count += 1 # leaf node\n", + " else:\n", + " current_count += counts[child_idx - n_samples]\n", + " counts[i] = current_count\n", + "\n", + " distance = np.arange(model.children_.shape[0])\n", + "\n", + " linkage_matrix = np.column_stack(\n", + " [model.children_, distance, counts]\n", + " ).astype(float)\n", + "\n", + " # Plot the corresponding dendrogram\n", + " dendrogram(linkage_matrix, **kwargs)\n", + "\n", + "plt.figure(figsize=(15, 5))\n", + "plt.subplot(121)\n", + "plt.title(\"Agglomerative Cluster\")\n", + "sns.scatterplot(x = x2[:, 0], y = x2[:, 1], hue = agg_single.labels_);\n", + "\n", + "plt.subplot(122)\n", + "plt.title(\"Dendrogram\")\n", + "plot_dendrogram(agg_single, truncate_mode=\"level\", p=3)\n", + "plt.xlabel(\"Number of points in node (or index of point if no parenthesis).\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You might be curious why single linkage \"works\" while complete linkage does not in this scenario; we will leave this as an exercise for students who are interested." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "## Toy Data 3: Oddly-Shaped Clusters\n", + "\n", + "Another example when k-means fails is when the clusters have odd shapes. For example, look at the following dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "np.random.seed(100)\n", + "\n", + "data = np.random.normal(0, 7, size = (1000, 2))\n", + "lengths = np.linalg.norm(data, axis = 1, ord = 2)\n", + "x3 = data[(lengths < 2) | ((lengths > 5) & (lengths < 7)) | ((lengths > 11) & (lengths < 15))]\n", + "\n", + "sns.scatterplot(x = x3[:, 0], y = x3[:, 1]);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looking at this data, we might say there are 3 clusters, corresponding to each of the 3 concentric circles, with the same center. However, k-means will fail." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "kmeans_q3 = cluster.KMeans(n_clusters = 3, random_state = 42).fit(x3)\n", + "plot_k_means(x3, kmeans_q3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "---\n", + "\n", + "## (Bonus) Question 3: Spectral Clustering\n", + "\n", + "(Note in Spring 2023 we did not go over Spectral Clustering. Spectral Clustering is out of scope for exams.) \n", + "\n", + "Let's try spectral clustering instead. \n", + "\n", + "In the cell below, create and fit a `cluster.SpectralClustering` object ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html)), and assign it to `spectral`. Use 3 clusters, and make sure you set `affinity` to `\"nearest_neighbors\"` and a `random_state` of 10.\n", + "\n", + "**Note:** Ignore any warnings about the graph not being fully connected.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "spectral = ..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sanity check\n", + "sorted(np.unique(spectral.labels_, return_counts = True)[1]) == [42, 170, 174]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below, we see that spectral clustering is able to find the three rings, when k-means does not." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sns.scatterplot(x = x3[:, 0], y = x3[:, 1], hue = spectral.labels_);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "## The World Bank Dataset\n", + "\n", + "In the previous three questions, we looked at clustering on two dimensional datasets. However, we can easily use clustering on data which have more than two dimensions. For this, let us turn to a World Bank dataset, containing various features for the world's countries.\n", + "\n", + "This data comes from https://databank.worldbank.org/source/world-development-indicators#.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "world_bank_data = pd.read_csv(\"world_bank_data.csv\", index_col = 'country')\n", + "world_bank_data.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are some missing values. For the sake of convenience and of keeping the lab short, we will fill them all with zeros. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "world_bank_data = world_bank_data.fillna(0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Like with PCA, it sometimes makes sense to center and scale our data so that features with higher variance don't dominate the analysis. For example, without standardization, statistics like population will completely dominate features like \"percent of total population that live in urban areas.\" This is because the range of populations is on the order of billions, whereas percentages are always between 0 and 100. The ultimate effect is that many of our columns are not really considered by our clustering algorithm.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "---\n", + "\n", + "\n", + "## Question 4\n", + "\n", + "Below, fit a `cluster.KMeans` object called `kmeans_q4` with four clusters and a `random_state` parameter of 42.\n", + "\n", + "\n", + "Make sure you should use a centered and scaled version of the world bank data. By centered and scaled we mean that the mean in each column should be zero and the variance should be 1. Assign this centered and scaled data to `world_bank_data_centered_and_scaled`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "...\n", + "...\n", + "kmeans_q4 = ..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false + }, + "outputs": [], + "source": [ + "grader.check(\"q4\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looking at these new clusters, we see that they seem to correspond to:\n", + "\n", + "* 0: Very small countries.\n", + "* 1: Developed countries.\n", + "* 2: Less developed countries.\n", + "* 3: Huge countries." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Run this cell to print the clusters.\n", + "\n", + "labeled_world_bank_data_q4 = pd.Series(kmeans_q4.labels_, name = \"cluster\", index = world_bank_data.index).to_frame()\n", + "\n", + "for c in range(4):\n", + " print(f\">>> Cluster {c}:\")\n", + " print(list(labeled_world_bank_data_q4.query(f'cluster == {c}').index))\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Silhouette Score\n", + "\n", + "How many number of cluster should we use? We can use silhouette score and silhouette plot to determine this. The following code is taken from the documentation [here](https://scikit-learn.org/1.1/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#). Note that we are using an older version of sklearn (1.1). sklearn 1.2 requires a slightly different code." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics import silhouette_samples, silhouette_score\n", + "import matplotlib.cm as cm\n", + "\n", + "X = world_bank_data_centered_and_scaled.to_numpy()\n", + "range_n_clusters = [2, 3, 4, 5, 6, 10, 20, 30]\n", + "inertia = []\n", + "\n", + "plt.figure(figsize=(15, 8))\n", + "for n_clusters in range_n_clusters:\n", + " ax1 = plt.subplot(2, 4, range_n_clusters.index(n_clusters)+1)\n", + " # The (n_clusters+1)*10 is for inserting blank space between silhouette\n", + " # plots of individual clusters, to demarcate them clearly.\n", + " ax1.set_xlim([-0.1, 0.8])\n", + " ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])\n", + "\n", + " # Initialize the clusterer with n_clusters value and a random generator\n", + " # seed of 42 for reproducibility.\n", + " clusterer = cluster.KMeans(n_clusters=n_clusters, random_state=42)\n", + " cluster_labels = clusterer.fit_predict(X)\n", + " inertia += [clusterer.inertia_]\n", + "\n", + " # The silhouette_score gives the average value for all the samples.\n", + " # This gives a perspective into the density and separation of the formed clusters\n", + " silhouette_avg = silhouette_score(X, cluster_labels)\n", + "\n", + " # Compute the silhouette scores for each sample\n", + " sample_silhouette_values = silhouette_samples(X, cluster_labels)\n", + "\n", + " y_lower = 10\n", + " for i in range(n_clusters):\n", + " # Aggregate the silhouette scores for samples belonging to\n", + " # cluster i, and sort them\n", + " ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]\n", + " ith_cluster_silhouette_values.sort()\n", + " size_cluster_i = ith_cluster_silhouette_values.shape[0]\n", + " y_upper = y_lower + size_cluster_i\n", + " color = cm.nipy_spectral(float(i) / n_clusters)\n", + " ax1.fill_betweenx(\n", + " np.arange(y_lower, y_upper),\n", + " 0,\n", + " ith_cluster_silhouette_values,\n", + " facecolor=color,\n", + " edgecolor=color,\n", + " alpha=0.7,\n", + " )\n", + "\n", + " # Label the silhouette plots with their cluster numbers at the middle\n", + " ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))\n", + "\n", + " # Compute the new y_lower for next plot\n", + " y_lower = y_upper + 10 # 10 for the 0 samples\n", + "\n", + " ax1.set_title(\"n_clusters = %d\"% n_clusters, fontsize=10)\n", + " ax1.set_xlabel(\"silhouette coefficient values\", fontsize=10)\n", + " ax1.set_ylabel(\"cluster label\", fontsize=3)\n", + "\n", + " # The vertical line for average silhouette score of all the values\n", + " ax1.axvline(x=silhouette_avg, color=\"red\", linestyle=\"--\")\n", + "\n", + " ax1.set_yticks([]) # Clear the yaxis labels / ticks\n", + " ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8])\n", + "\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "
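The average silhouette score for a single candidate number of clusters can also be computed directly, without the full plot (a minimal sketch, reusing `X` and `silhouette_score` from the cell above):\n", + "\n", + "```python\n", + "# Higher average silhouette = tighter, better-separated clusters\n", + "labels_k4 = cluster.KMeans(n_clusters = 4, random_state = 42).fit_predict(X)\n", + "print(silhouette_score(X, labels_k4))\n", + "```\n", + "\n", + "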
\n", + "\n", + "---\n", + "\n", + "## Question 5\n", + "\n", + "Based on the plots above how many number of cluster would you use and why? There is no single correct answer." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_Type your answer here, replacing this text._" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "\n", + "---\n", + "\n", + "## Question 6\n", + "\n", + "We have saved the inertia generated using different number of cluster above in the array `inertia`. Below is an elbow plot. Based on the plot, what number of cluster you would choose and why? There is no single correct answer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.plot(range_n_clusters, inertia)\n", + "plt.scatter(range_n_clusters, inertia)\n", + "plt.title(\"Elbow plot\")\n", + "plt.xlabel(\"number of cluster\")\n", + "plt.ylabel(\"inertia\")\n", + "plt.xticks(range_n_clusters);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_Type your answer here, replacing this text._" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "
\n", + "
\n", + "\n", + "# Congratulations! You finished the lab 14!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false + }, + "source": [ + "## Submission\n", + "\n", + "Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false + }, + "outputs": [], + "source": [ + "# Save your notebook first, then run this cell to export your submission.\n", + "grader.export(pdf=False, run_tests=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " " + ] + } + ], + "metadata": { + "celltoolbar": "Create Assignment", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.0" + }, + "otter": { + "OK_FORMAT": true, + "tests": { + "q1": { + "name": "q1", + "points": null, + "suites": [ + { + "cases": [ + { + "code": ">>> sorted(np.unique(kmeans_q1.labels_, return_counts = True)[1]) == [12, 13, 50]\nTrue", + "hidden": false, + "locked": false + } + ], + "scored": true, + "setup": "", + "teardown": "", + "type": "doctest" + } + ] + }, + "q2a": { + "name": "q2a", + "points": null, + "suites": [ + { + "cases": [ + { + "code": ">>> sorted(np.unique(kmeans_q2a.labels_, return_counts = True)[1]) == [45, 455]\nTrue", + "hidden": false, + "locked": false + } + ], + "scored": true, + "setup": "", + "teardown": "", + "type": "doctest" + } + ] + }, + "q2b": { + "name": "q2b", + "points": null, + "suites": [ + { + "cases": [ + { + "code": ">>> sorted(np.unique(agg_single.labels_, return_counts = True)[1]) == [25, 475]\nTrue", + "hidden": false, + "locked": false + } + ], + "scored": true, + "setup": "", + "teardown": "", + "type": "doctest" + } + ] + }, + "q4": { + "name": "q4", + "points": null, + "suites": [ + { + "cases": [ + { + "code": ">>> np.allclose(np.mean(world_bank_data_centered_and_scaled, axis=0), 0)\nTrue", + "hidden": false, + "locked": false + }, + { + "code": ">>> np.allclose(np.std(world_bank_data_centered_and_scaled, axis=0), 1)\nTrue", + "hidden": false, + "locked": false + }, + { + "code": ">>> sorted(np.unique(kmeans_q4.labels_, return_counts = True)[1]) == [3, 25, 90, 99]\nTrue", + "hidden": false, + "locked": false + } + ], + "scored": true, + "setup": "", + "teardown": "", + "type": "doctest" + } + ] + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}