Optional exercises and demos for pandas. (#107)

UtrechtUniversity · Feb 12, 2024 · 0917eee · 0917eee
1 parent ff81da5
commit 0917eee

Showing 3 changed files with 66 additions and 41 deletions.
diff --git a/book/data-science-with-pandas-2.ipynb b/book/data-science-with-pandas-2.ipynb
@@ -390,12 +390,23 @@
     "not | **!** | **~**|"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b1dc9211",
+   "metadata": {},
+   "source": [
+    "<div class=\"alert alert-block alert-success\">\n",
+    "<b>Exercise 4</b>\n",
+    "    \n",
+    "Now go to the Jupyter Dashboard in your internet browser and continue with afternoon exercise 4."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "0fd42879",
    "metadata": {},
    "source": [
-    "## DataFrame Cleaning"
+    "## Optional: DataFrame Cleaning"
    ]
   },
   {
@@ -580,17 +591,17 @@
    "metadata": {},
    "source": [
     "<div class=\"alert alert-block alert-success\">\n",
-    "<b>Exercise 4 and 5</b>\n",
-    "\n",
-    "Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercises 5 and 6."
+    "<b>(Optional) Exercise 5</b>\n",
+    "    \n",
+    "To deepen your knowledge, you can continue with Exercise 5."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "2cf8ba70-5aba-4c52-96e4-f4cda4c47268",
    "metadata": {},
    "source": [
-    "## Grouping"
+    "## Optional: Grouping"
    ]
   },
   {
@@ -663,7 +674,7 @@
    "id": "71794bfb",
    "metadata": {},
    "source": [
-    "## Structure of a groupby object\n",
+    "## Optional: Structure of a groupby object\n",
     "We can investigate which rows are assigned to which group as follows:"
    ]
   },
@@ -684,7 +695,7 @@
    "id": "0cb9e651",
    "metadata": {},
    "source": [
-    "## Grouping by multiple columns\n",
+    "## Optional: Grouping by multiple columns\n",
     "Now let's have a look at a more complex grouping example. We want an overview statistics of the weight of all females and males by plot id. So in fact we want to group by *sex* and by *plot_id* at the same time.\n",
     "\n",
     "This will give us exactly 48 groups for our survey data:\n",
@@ -753,7 +764,7 @@
    "id": "70c925fa",
    "metadata": {},
    "source": [
-    "## Summary grouping\n",
+    "## Optional: Summary grouping\n",
     "Grouping is one of the most common operation in data analysis. Data often consists of different measurements on the same samples. In many cases we are not only interested in one particular measurement but in the cross product of measurements. In the picture below we labeled samples with green lines, blue dots and red lines. We are now interested how these three different groups relate to each other given the all other measurements in the dataframe. Pandas' groupby function gives us the means to compare these three groups with several built-in statistical methods."
    ]
   },
@@ -773,9 +784,9 @@
    "metadata": {},
    "source": [
     "<div class=\"alert alert-block alert-success\">\n",
-    "<b>Exercise 6 to 8</b>\n",
+    "<b>(Optional) Exercises 6 to 8</b>\n",
     "    \n",
-    "Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercises 6 to 8."
+    "To deepen your knowledge, you can do Exercises 6 to 8."
    ]
   },
   {

diff --git a/book/solutions/afternoon_exercises_solutions.ipynb b/book/solutions/afternoon_exercises_solutions.ipynb
@@ -161,20 +161,41 @@
    "metadata": {},
    "source": [
     "### Exercise 4\n",
-    "- Create a new DataFrame that only contains observations from the original with sex values that are not female or male. Print the number of rows in this new DataFrame. Verify the result by comparing the number of rows in the new DataFrame with the number of rows in the surveys DataFrame where sex is NaN (hint: there is a function `isnull`)."
+    "- Find all entries in the column `sex` which do not contain an `M` or an `F`.\n",
+    "- Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "id": "53f7777a",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "NameError",
+     "evalue": "name 'surveys_df' is not defined",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
+      "Input \u001b[0;32mIn [1]\u001b[0m, in \u001b[0;36m<cell line: 1>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m df \u001b[38;5;241m=\u001b[39m \u001b[43msurveys_df\u001b[49m[(surveys_df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124msex\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mM\u001b[39m\u001b[38;5;124m'\u001b[39m) \u001b[38;5;241m&\u001b[39m (surveys_df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124msex\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mF\u001b[39m\u001b[38;5;124m'\u001b[39m)]\n\u001b[1;32m      2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNumber of rows not female or male:\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28mlen\u001b[39m(df))\n",
+      "\u001b[0;31mNameError\u001b[0m: name 'surveys_df' is not defined"
+     ]
+    }
+   ],
    "source": [
     "df = surveys_df[(surveys_df['sex'] != 'M') & (surveys_df['sex'] != 'F')]\n",
-    "print(\"Number of rows not female or male:\", len(df))\n",
-    "print(\"Number of rows NaN:\", len(surveys_df['sex'].isnull()))\n",
-    "print(\"Unique values in column 'sex':\", df['sex'].unique())"
+    "print(\"Number of rows not female or male:\", len(df))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "93ab8968",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = surveys_df[((surveys_df['sex'] == 'M') | (surveys_df['sex'] == 'F')) & (surveys_df['weight'] > 0)]"
    ]
   },
   {

diff --git a/course_materials/afternoon_exercises.ipynb b/course_materials/afternoon_exercises.ipynb
@@ -13,7 +13,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
    "id": "77038fb1-920d-4832-8b59-2d0cefa13bc5",
    "metadata": {
     "tags": []
@@ -34,12 +34,12 @@
     "\n",
     "Type the following commands and check the outputs. Can you tell what each command does? What is the difference between commands with and without parenthesis?\n",
     "\n",
-    "`surveys_df.shape` Answer:\n",
-    "`surveys_df.columns` Answer:\n",
-    "`surveys_df.index` Answer:\n",
-    "`surveys_df.dtypes` Answer:\n",
-    "`surveys_df.head(<try_various_integers_here>)` Answer:\n",
-    "`surveys_df.tail(<try_various_integers_here>)` Answer:\n",
+    "- `surveys_df.shape` Answer:\n",
+    "- `surveys_df.columns` Answer:\n",
+    "- `surveys_df.index` Answer:\n",
+    "- `surveys_df.dtypes` Answer:\n",
+    "- `surveys_df.head(<try_various_integers_here>)` Answer:\n",
+    "- `surveys_df.tail(<try_various_integers_here>)` Answer:\n",
     "\n",
     "[Course book chapter 5 for reference](https://utrechtuniversity.github.io/workshop-introduction-to-python/data-science-with-pandas-1.html)"
    ]
@@ -155,7 +155,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "40ad6cd7-7d9f-4c5b-b269-718e98f35bf6",
+   "id": "1426dfcc",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -247,7 +247,7 @@
    "metadata": {},
    "source": [
     "### Exercise 4\n",
-    "- Create a new DataFrame that only contains observations from the original DataFrame with sex values that are not female or male. Print the number of rows in this new DataFrame. Verify the result by comparing the number of rows in the new DataFrame with the number of rows in the surveys DataFrame where sex is NaN (hint: there is a function `isnull`).\n",
+    "- Find all entries in the column `sex` which do not contain an `M` or an `F`.\n",
     "- Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0."
    ]
   },
@@ -272,7 +272,7 @@
    "id": "b34b321b",
    "metadata": {},
    "source": [
-    "### Exercise 5: Putting it all together \n",
+    "### Exercise 5 (optional): Putting it all together \n",
     "1. Clean the column *sex* (leave out samples of which we do not know whether they are male or female) and save the result as a new dataframe `clean_df`.\n",
     "2. Replace undefined *weight* values with the mean of all (defined) weights in `surveys_df`.\n",
     "3. Calculate the average weight of that new DataFrame `clean_df`"
@@ -291,7 +291,7 @@
    "id": "ccb33c2e",
    "metadata": {},
    "source": [
-    "### Exercise 6\n",
+    "### Exercise 6 (optional)\n",
     "Let's see in which plots animals get more food. Calculate the average weight per plot! Complete the code below."
    ]
   },
@@ -311,7 +311,7 @@
    "id": "2bccb9da",
    "metadata": {},
    "source": [
-    "### Exercise 7\n",
+    "### Exercise 7 (optional)\n",
     "See below a more complex grouping example. Investigate the group keys and row indexes for this more complex grouping example. \n",
     "Why are there more than 48 groups? Answer: \n",
     "Calculate the average weight per group.\n",
@@ -342,7 +342,7 @@
    "id": "b0f1ab75",
    "metadata": {},
    "source": [
-    "### Exercise 8\n",
+    "### Exercise 8 (optional)\n",
     "Would it make sense to group our data frame by the column *weight*? Why or why not?"
    ]
   },
@@ -351,7 +351,7 @@
    "id": "0c7ae97d",
    "metadata": {},
    "source": [
-    "### Exercise 9\n",
+    "### Exercise 9 (optional)\n",
     "In the given example of vertical concatenation, you concatenated two DataFrames with the same columns. What would happen if the two DataFrames to concatenate have different column number and names?\n",
     "\n",
     "  1. Create a new DataFrame using the last 10 rows of the species DataFrame (`species_df`);\n",
@@ -365,18 +365,14 @@
    "id": "1a685e40",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "species_df = pd.read_csv(\"../data/species.csv\")\n",
-    "\n",
-    "surveys_df_sub_first10 = surveys_df.head(10)"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
    "id": "afa7dd9c",
    "metadata": {},
    "source": [
-    "### Exercise 10\n",
+    "### Exercise 10 (optional)\n",
     "  1. Looking at the `inner_join` example, can you explain how much of each of the two DataFrames is missing from the result?\n",
     "\n",
     "Now consider the other types of joins, for each one, can you predict the number of rows and the contents of the resulting DataFrame, based on the diagrams in the picture?\n",
@@ -392,10 +388,7 @@
    "id": "427314ec",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "left_df = surveys_df.head(10)\n",
-    "right_df = species_df.head(20)"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -406,7 +399,7 @@
     "\n",
     "Time to play with plots! Create a multiplot following these instructions:\n",
     "- Using the matplotlib.pyplot function `subplots()`, create a single figure (10x10 inches) with four subplots organized in two rows and two columns; \n",
-    "- In the top row plot hindfoot_length VS weight for female and male in two different plots with two differen colors;\n",
+    "- In the top row plot hindfoot_length VS weight for female and male in two different plots with two different colors;\n",
     "- In the bottom row, plot the same data of the top row, but using data collected before (left plot) and after (right plot) 1990;\n",
     "- Give to each plot an appropriate descriptive title and optimize plot labels.\n",
     "<br>\n",