Move from pandas plot method to matplotlib exclusively in chapter 9 (#94)

* add intro to jupyter to preparation
* remove/replace pandas plot method from chapter 9
* remove kernel change from preparation.ipynb
* Update preparation.ipynb
* Update book/data-science-with-pandas-4.ipynb
UtrechtUniversity · Nov 30, 2023 · e89ee4a
1 parent 48ed75f
commit e89ee4a
Showing 3 changed files with 64 additions and 36 deletions.
diff --git a/book/data-science-with-pandas-4.ipynb b/book/data-science-with-pandas-4.ipynb
@@ -162,7 +162,7 @@
    "id": "c07d856c-6497-4a66-83f6-98d664d56842",
    "metadata": {},
    "source": [
-    "As you already found in session 6.5, data stored in pandas DataFrames can be visualized using a 'method' called (surprisingly!) `plot`. This 'method' contains all the functionalities of the `pyplot` module of `matplotlib` and can be used on pandas DataFrames directly, without explicitly calling (and importing) pyplot. Let's have a look at a simple example, let's plot the weight of our penguins as a function of hindfoot length:"
+    "As you already found in session 6.5, data stored in pandas DataFrames can be visualized using a 'method' called (surprisingly!) `plot`:"
    ]
   },
   {
@@ -182,7 +182,9 @@
    "id": "f62b81da-cd14-4860-933e-97234267b113",
    "metadata": {},
    "source": [
-    "In the previous example we plotted weight VS hindfoot length using a scatter plot. The type of plot is specified via the argument `kind`, you can check out all the available plot categories [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html). One of the advantages of the DataFrame `plot` method is that we can specify the columns to plot simply referring to their names, these will also be automatically used as labels for the x and y axes (or only the x axis depending on the kind of plot). In a single line we can explore possible correlations between the columns of our DataFrame. \n",
+    "This is convenient since it allows for quick plotting with a limited amount of code. Check the [documentation of Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) to learn more about this. \n",
+    "\n",
+    "From now on we will create figures and plots using the `matplotlib` library. The `plot` method of pandas DataFrames is actually a wrapper around the plotting functionality of `matplotlib`.\n",
     "We will now try to create the same plot using the `scatter` function from `matplotlib`'s `pyplot` module: "
    ]
   },
@@ -197,16 +199,16 @@
    "source": [
     "plt.scatter(x = surveys['hindfoot_length'],y = surveys['weight'], s=16)\n",
     "plt.grid()\n",
-    "plt.ylabel('weight')\n",
-    "plt.xlabel('hindfoot_length')"
+    "plt.ylabel('Weight [g]')\n",
+    "plt.xlabel('Hindfoot length [mm]')"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "8e50a9d2",
    "metadata": {},
    "source": [
-    "As you can see the resulting plot is the same as above. When using `plt.scatter` we need some few more lines of code to get to this result, but `plt` provides more options to further customize the plot if we would want to. <br>\n"
+    "As you can see the resulting plot is the same as above. When using `plt.scatter` we need a few more lines of code to get to this result, but `plt` provides more options to further customize the plot if we would want to. <br>\n"
    ]
   },
   {
@@ -251,7 +253,7 @@
    "id": "ad5f1325-2ac5-4158-b456-d5cc61475172",
    "metadata": {},
    "source": [
-    "Looking at our previous visualization, it seems that x and y label are too small, data points often overlap each other, and a title for the plot is missing. Furthermore, for publication purposes, we want our plot to occupy a space of 6x6 inches. Let's apply these specifications to our simple visualization: "
+    "Looking at our previous visualization, it seems that the x and y labels are too small, data points often overlap each other, and a title for the plot is missing. Furthermore, for publication purposes, we want our plot to occupy a space of 6x6 inches. Let's apply these specifications to our visualization: "
    ]
   },
   {
@@ -263,25 +265,26 @@
    },
    "outputs": [],
    "source": [
-    "ax = surveys.plot(x='hindfoot_length',y='weight',kind='scatter',grid=True, s=12, figsize=(6,6), title = 'Scatter plot')\n",
-    "ax.set_xlabel(xlabel = 'Hindfoot Length [cm]', fontsize=14)\n",
-    "ax.set_ylabel(ylabel = 'Weight [Kg]', fontsize=14)\n",
-    "\n",
-    "print(type(ax))"
+    "fig, ax = plt.subplots(figsize=(6, 6))\n",
+    "ax.scatter(x = surveys['hindfoot_length'],y = surveys['weight'], s=12)\n",
+    "ax.grid(True)\n",
+    "ax.set_title('Scatter plot of weight vs. hindfoot length', fontsize=16)\n",
+    "ax.set_ylabel('Weight [g]', fontsize=14)\n",
+    "ax.set_xlabel('Hindfoot length [mm]', fontsize=14)"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "02919834-6764-473f-b497-114ec861ef5f",
    "metadata": {},
    "source": [
-    "This time we added some new parameters to our call: \n",
+    "This time we first created a Figure (`fig`) and an Axes (`ax`) object. On the Axes object we add the plot and specify some customizations: \n",
     "\n",
     "- `s` regulates the size of the data points \n",
     "- `figsize` the (x,y) dimention in inches\n",
-    "- `title` is a string containing the title of our plot\n",
+    "- `ax.set_xlabel` and `ax.set_title` are used to add the titles and labels to our plot\n",
     "\n",
-    "To modify the character sizes of the x and y labels we need to write two extra lines of code. To be able to do this we assigned the first line to a 'variable' or 'object' that we name `ax`. Then we used the `set_xlabel` method of the `ax` object to specify the x label and its character size. We did the same for the y label."
+    "To modify the character sizes of the x and y labels, we can use the `fontsize` parameter of the `set_xlabel` and `set_ylabel` methods."
    ]
   },
   {
@@ -295,11 +298,24 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "098c75c6",
+   "id": "8dc5920e",
    "metadata": {},
    "outputs": [],
    "source": [
-    "surveys.plot.hist(column=[\"hindfoot_length\"], by=\"sex\", bins=20, figsize=(10, 8))"
+    "fig, ax = plt.subplots(figsize=(10, 8))\n",
+    "\n",
+    "# Filter the data for the two groups\n",
+    "female_data = surveys[surveys[\"sex\"] == \"F\"]\n",
+    "male_data = surveys[surveys[\"sex\"] == \"M\"]\n",
+    "\n",
+    "# Plot the histograms\n",
+    "ax.hist([female_data[\"hindfoot_length\"].dropna(), male_data[\"hindfoot_length\"].dropna()], bins=20, rwidth=0.8, color=['green', 'blue'], label=['Female', 'Male'])\n",
+    "\n",
+    "ax.set_xlabel('Hindfoot length')\n",
+    "ax.set_ylabel('Frequency')\n",
+    "ax.legend(loc='upper right')\n",
+    "\n",
+    "plt.show()"
    ]
   },
   {
@@ -332,8 +348,16 @@
    "outputs": [],
    "source": [
     "fig, (ax1,ax2) = plt.subplots(nrows=1,ncols=2,figsize=(12,6))\n",
-    "surveys.plot(x='hindfoot_length',y='weight',kind='scatter',grid=True, s=12, ax=ax1)\n",
-    "surveys.plot(x='hindfoot_length',y='weight',kind='scatter',grid=True, s=12, ax=ax2, color='orange')"
+    "\n",
+    "ax1.scatter(x = female_data['hindfoot_length'],y = female_data['weight'], s=12)\n",
+    "ax1.set_xlabel('Hindfoot length [mm]')\n",
+    "ax1.set_ylabel('Weight [g]')\n",
+    "ax1.set_title('Females')\n",
+    "\n",
+    "ax2.scatter(x = male_data['hindfoot_length'],y = male_data['weight'], s=12, color='orange')\n",
+    "ax2.set_xlabel('Hindfoot length [mm]')\n",
+    "ax2.set_ylabel('Weight [g]')\n",
+    "ax2.set_title('Males')"
    ]
   },
   {
@@ -362,8 +386,19 @@
    "outputs": [],
    "source": [
     "fig, axes = plt.subplots(nrows=3,ncols=3,figsize=(12,12))\n",
-    "surveys.plot(x='hindfoot_length',y='weight',kind='scatter',grid=True, s=12, ax=axes[1][1]) # row 1, column 1 (note that rows and columns start at 0)\n",
-    "surveys.plot(x='hindfoot_length',y='weight',kind='scatter',grid=True, s=12, ax=axes[2][0], color='orange')  # row 2, column 0"
+    "\n",
+    "ax1 = axes[1][1] # row 1, column 1 (note that rows and columns start at 0)\n",
+    "\n",
+    "ax1.scatter(x = female_data['hindfoot_length'],y = female_data['weight'], s=12)\n",
+    "ax1.set_xlabel('Hindfoot length [mm]')\n",
+    "ax1.set_ylabel('Weight [g]')\n",
+    "ax1.set_title('Females')\n",
+    "\n",
+    "ax2 = axes[2][0] # row 2, column 0\n",
+    "ax2.scatter(x = male_data['hindfoot_length'],y = male_data['weight'], s=12, color='orange')\n",
+    "ax2.set_xlabel('Hindfoot length [mm]')\n",
+    "ax2.set_ylabel('Weight [g]')\n",
+    "ax2.set_title('Males')\n"
    ]
   },
   {
@@ -385,11 +420,14 @@
    "source": [
     "# prepare a matplotlib figure\n",
     "fig, ax1 = plt.subplots(figsize=(6,6))\n",
-    "surveys.plot(x='hindfoot_length',y='weight',kind='scatter',grid=True, s=12, ax=ax1)\n",
+    "ax1.scatter(x = female_data['hindfoot_length'],y = female_data['weight'], s=12)\n",
+    "ax1.set_xlabel('Hindfoot length [mm]')\n",
+    "ax1.set_ylabel('Weight [g]')\n",
+    "ax1.set_title('Females')\n",
     "\n",
-    "ax2 = fig.add_axes([0.5, 0.5, 0.33, 0.33])\n",
-    "ax2.scatter(surveys['hindfoot_length'],surveys['weight'], color='orange')\n",
-    "ax2.grid()"
+    "ax2 = fig.add_axes([0.65, 0.65, 0.25, 0.25])\n",
+    "ax2.scatter(x = male_data['hindfoot_length'],y = male_data['weight'], s=12, color='orange')\n",
+    "ax2.set_title('Males')"
    ]
   },
   {
@@ -420,7 +458,7 @@
    "id": "b42ee1de-479e-439b-91c3-113550acd474",
    "metadata": {},
    "source": [
-    "You have already seen how to group data stored in pandas DataFrames. If we want to quickly check if the pattern we observed in previously plotted data is the same in males and females, we can use a for loop and the groupby method to overlay two plots on top of each other in the same Axes object."
+    "You have already seen how to group data stored in pandas DataFrames in chapter 7. If we want to check if the pattern we observed in previously plotted data is the same in males and females without creating separate data objects (`female_data` and `male_data`), we can use a for loop and the groupby method, e.g. to overlay two plots on top of each other in the same Axes object."
    ]
   },
   {
@@ -557,7 +595,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.13"
+   "version": "3.10.11"
   },
   "toc": {
    "base_numbering": 1,

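Reviewer note: the reworded markdown cell in the last chapter 9 hunk refers to a for loop over `groupby` that overlays both sexes in one Axes object, but that code cell is unchanged and so does not appear in this diff. A minimal sketch of what such a loop might look like with matplotlib only is given here. The small `surveys` DataFrame is a hypothetical stand-in for the course data, and the notebook's actual cell may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the course's `surveys` DataFrame (values are made up)
surveys = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "hindfoot_length": [32.0, 33.0, 35.0, 36.0, 20.0, 21.0],
    "weight": [40.0, 42.0, 48.0, 50.0, 18.0, 19.0],
})

fig, ax = plt.subplots(figsize=(6, 6))

# groupby yields (group label, sub-DataFrame) pairs;
# each pass through the loop draws one more scatter layer on the same Axes
for sex, group in surveys.groupby("sex"):
    ax.scatter(x=group["hindfoot_length"], y=group["weight"], s=12, label=sex)

ax.set_xlabel("Hindfoot length [mm]")
ax.set_ylabel("Weight [g]")
ax.legend(title="Sex")
```

Because each call to `ax.scatter` draws on the same Axes, matplotlib assigns a new default color per group, and no separate `female_data` and `male_data` objects are needed.
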
diff --git a/course_materials/data/UUlogo.png b/course_materials/data/UUlogo.png
diff --git a/course_materials/preparation.ipynb b/course_materials/preparation.ipynb
@@ -31,8 +31,6 @@
     "\n",
     "The next cell is a `Code` cell. The current notebook uses a Python interpreter (check the top right corner to verify, it will display `Python 3 (ipykernel)`) which means it expects Python code and can only interpret Python code. Different setups are also possible (e.g. with an `R` or `Julia` interpreter). \n",
     "\n",
-    "Click `Python 3 (ipykernel)`, and change the kernel to `geo-kernel`. This is a different Python interpreter where we preinstalled all Python packages that we need for this workshop.\n",
-    "\n",
     "Select the following cell, verify at the top of this window if it displays `Code` instead of `Markdown`, and either click the 'play'-button above or press `<ctrl>-<enter>` simultaneously to run the Python code."
    ]
   },
@@ -163,14 +161,6 @@
     "print(np.__version__)\n",
     "print(\"No errors! Ready to code!\")"
    ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b9d2ce0a-75fb-48c4-8524-b4cb8e4ffeff",
-   "metadata": {},
-   "source": [
-    "# Chapter 2"
-   ]
   }
  ],
  "metadata": {