Commit

Merge pull request #80 from UtrechtUniversity/67-reduce-chapter-8-to-2-examples

Limit the number of examples in chapter 8
jelletreep authored Sep 28, 2023
2 parents c30141e + 572c63a commit 7290e51
Showing 7 changed files with 91 additions and 341 deletions.
2 changes: 0 additions & 2 deletions book/course-materials.qmd
@@ -13,8 +13,6 @@ Don't forget to extract the contents of the zipped file after downloading!
```
introduction-python
├── data
│ ├── EU_capitals_tiny.csv
│ ├── Netherlands_town_weather_tiny.csv
│ ├── species.csv
│ ├── surveys.csv
│ └── plots.csv
226 changes: 30 additions & 196 deletions book/data-science-with-pandas-3.ipynb
@@ -49,8 +49,8 @@
"metadata": {},
"outputs": [],
"source": [
"surveys_df = pd.read_csv(\"../course_materials/data/surveys.csv\", keep_default_na=False, na_values=[\"\"])\n",
"species_df = pd.read_csv(\"../course_materials/data/species.csv\", keep_default_na=False, na_values=[\"\"])"
"surveys_df = pd.read_csv(\"../course_materials/data/surveys.csv\")\n",
"species_df = pd.read_csv(\"../course_materials/data/species.csv\")"
]
},
{
@@ -96,19 +96,21 @@
"id": "0e7b909b-05ad-4cc6-a752-c803cf197750",
"metadata": {},
"source": [
"The first way we will combine DataFrames is **concatenation**, i.e. simply putting DataFrames one after the other either **verically** or **horizontally**.\n",
"The first way we will combine DataFrames is **concatenation**, i.e. simply putting DataFrames one after the other either **vertically** or **horizontally**.\n",
"\n",
"Concatenation can be used if the DataFrames are similar, meaning that they either have the same rows or columns. We will see examples of this later.\n",
"\n",
"To concatenate two DataFrames you will use the function ```pd.concat()```, specifying as arguments the DataFrames to concatenate and ```axis=0``` or ```axis=1``` for vertical or horizontal concatenation, respectively."
"To concatenate two DataFrames you will use the function ```pd.concat()```, specifying as arguments the DataFrames to concatenate and ```axis=0``` or ```axis=1``` for vertical or horizontal concatenation, respectively.\n",
"\n",
"We will only be looking at vertical concatenation, but it is good to be aware that there is more to be discovered beyond what we describe here."
]
},
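The effect of the `axis` argument described above can be sketched with two tiny toy DataFrames (the data and names here are hypothetical, purely for illustration):

```python
import pandas as pd

# Two toy DataFrames with identical columns (hypothetical data)
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

# axis=0 stacks rows vertically: 4 rows, 2 columns
vertical = pd.concat([a, b], axis=0)
print(vertical.shape)    # (4, 2)

# axis=1 places the frames side by side: 2 rows, 4 columns
horizontal = pd.concat([a, b], axis=1)
print(horizontal.shape)  # (2, 4)
```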
{
"cell_type": "markdown",
"id": "c75b3d8a-0405-42d4-88d8-2b0e4fe80cea",
"metadata": {},
"source": [
"Let us first obtain two small DataFrames from the larger `surveys.csv` dataset."
"Let us first obtain two small DataFrames from the larger `surveys_df` dataset."
]
},
{
@@ -131,20 +133,12 @@
"We now have two DataFrames, one with the first ten rows of the original dataset, and another with the last ten rows."
]
},
{
"cell_type": "markdown",
"id": "d4f58cc7",
"metadata": {},
"source": [
"### Vertical concatenation"
]
},
{
"cell_type": "markdown",
"id": "c7356bfc-eed9-4179-982c-1ee13a7bd06f",
"metadata": {},
"source": [
"Let's start with **vertical stacking**. In this case the two DataFrames are simply stacked 'on top of' eachother (remember to specify ```axis=0```).\n",
"Now, let us do some **vertical concatenation** or **stacking**. In this case the two DataFrames are simply stacked 'on top of' each other (remember to specify ```axis=0```).\n",
"<div>\n",
"<img src=\"images/vertical_stacking.jpeg\" width=\"300\"/>\n",
"</div>\n",
@@ -193,122 +187,6 @@
"vertical_stack.reset_index()"
]
},
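As an aside, instead of calling `reset_index()` after the fact, `pd.concat` can renumber the rows directly via its `ignore_index=True` argument. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# Without ignore_index the original row labels are kept: 0, 1, 0, 1
stacked = pd.concat([a, b], axis=0)
print(list(stacked.index))     # [0, 1, 0, 1]

# With ignore_index=True the result gets a fresh 0..n-1 index
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]
```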
{
"cell_type": "markdown",
"id": "7bc415a6",
"metadata": {},
"source": [
"### Horizontal concatenation"
]
},
{
"cell_type": "markdown",
"id": "73f9de9e-f237-49bd-8834-eb1e8c2e089a",
"metadata": {},
"source": [
"It's now time to try **horizontal stacking**. In this case the two DataFrames are simply put one after the other (remember to specify ```axis=1```).\n",
"<div>\n",
"<img src=\"images/horizontal_stacking.jpeg\" width=\"300\"/>\n",
"</div>\n",
"Horizontal stacking can be understood as combining two DataFrames that have different measurements on the same observed objects. In our example, it may be that one field researcher has registered the weight and hindfoot length of an individual, and another wrote down their species and sex. They both registered different information of the same individuals. If we combine them, we have one list with all the information of the individual, rather than two lists with partial information.\n",
"\n",
"We now go back to our DataFrames with 10 survey results each, and concatenate those. In this case, we would expect a resulting DataFrame with the same number of rows as the original ones (10 rows) and twice the number of columns (18 columns)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8014b0c4-e639-4da2-b343-9fe8f0e48c7c",
"metadata": {},
"outputs": [],
"source": [
"# Place the DataFrames side by side\n",
"horizontal_stack = pd.concat([surveys_df_sub_first10, surveys_df_sub_last10], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e667112b-577c-4f3e-92cc-84e7315311ab",
"metadata": {},
"outputs": [],
"source": [
"print(horizontal_stack.info())\n",
"horizontal_stack"
]
},
{
"cell_type": "markdown",
"id": "be73380e-334f-4c75-b3e5-67bef66d071d",
"metadata": {},
"source": [
"Looking at the result of our horizontal concatenation, we may realise that something went wrong. The total number of rows on the resulting DataFrame is 20, instead of 10.\n",
"\n",
"This happens because horizontal stacking will only merge rows that actually \"belong together\". Rows that relate to the same observed object are merged. To determine this, it compares the indices of the rows. In our two DataFrames, the rows have different indices (1-9 and 35539-35548 respectively). It will therefore not merge any of the rows together, as it does not find any two rows that relate to the same observation.\n",
"\n",
"If we want to force the DataFrames into the form we had in mind, we need to reset the indices of the second DataFrame so that they will match the ones of the first DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f2e244b-22d4-4a3f-9bec-7d517916c20c",
"metadata": {},
"outputs": [],
"source": [
"surveys_df_sub_last10 = surveys_df_sub_last10.reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f04ee12-4f49-4712-82bb-b7cad48dfa7a",
"metadata": {},
"outputs": [],
"source": [
"surveys_df_sub_last10"
]
},
{
"cell_type": "markdown",
"id": "49387832-b435-4899-80f8-226408280f3a",
"metadata": {},
"source": [
"Now that the index has been reset, we can concatenate the two DataFrames."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fac34bb5-d7de-4d5a-894e-14308678946f",
"metadata": {},
"outputs": [],
"source": [
"horizontal_stack = pd.concat([surveys_df_sub_first10, surveys_df_sub_last10], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2759da9f-f5d0-4e14-bec7-b09d378ac5ed",
"metadata": {},
"outputs": [],
"source": [
"print(horizontal_stack.info())\n",
"horizontal_stack"
]
},
{
"cell_type": "markdown",
"id": "9d03dd2c-5f89-4869-b2e1-48c45985fe37",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\">\n",
"<b>Exercise 10 and 11</b>\n",
" \n",
"Now go to the Jupyter Dashboard in your internet browser and continue with exercise 10 and 11.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "29d2e1ab-5a9d-4c6e-a6cb-8f153dd85aca",
@@ -331,15 +209,17 @@
" 3. You choose the type of join;\n",
" 4. You perform the join running the function `pd.merge()` with the specified inputs and options.\n",
"\n",
"What it means for a DataFrame to be 'left' or 'right' depends on the type of join, and will become clear in the examples below. For now, just remember that it matters which DataFrame you mention first when performing a join."
"What it means for a DataFrame to be 'left' or 'right' depends on the type of join. The main thing to remember is that it matters which DataFrame you mention first when performing a join.\n",
"\n",
"We will only be looking at what is known as an *inner join*. Investigating the other types of joins is left as an exercise."
]
},
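The four steps above can be sketched on two toy DataFrames (the names `left`, `right`, and `key` here are hypothetical, not from the course data):

```python
import pandas as pd

# 1. Choose the left and right DataFrames
left = pd.DataFrame({"key": ["a", "b", "c"], "val_l": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "val_r": [10, 20, 30]})

# 2. Choose the column(s) to join on; 3. choose the join type;
# 4. run pd.merge with those inputs
result = pd.merge(left, right, on="key", how="inner")
print(result)  # only the rows whose 'key' appears in both frames: b and c
```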
{
"cell_type": "markdown",
"id": "ec5cb030-5a87-4331-9caa-bb86c87eafdd",
"metadata": {},
"source": [
"Let's see some join example considering two tiny (few rows) DataFrames. Our left DataFrame contains general data of European capitals, and our right DataFrame contains weather measurements for some Dutch towns. We first need to import these datasets:"
"Let's see an example of a join using the first ten rows of our `surveys_df` DataFrame (`surveys_df_sub_first10`). This will be our left DataFrame, containing data on individual animals. Our right DataFrame will be the first 20 rows of `species_df`, containing some information on each species. Let us rename them for the purpose of the join:"
]
},
{
Expand All @@ -349,8 +229,8 @@
"metadata": {},
"outputs": [],
"source": [
"left_df = pd.read_csv(\"../course_materials/data/EU_capitals_tiny.csv\", sep=\",\", header=0)\n",
"right_df = pd.read_csv(\"../course_materials/data/Netherlands_town_weather_tiny.csv\", sep=\",\", header=0)"
"left_df = surveys_df_sub_first10\n",
"right_df = species_df.head(20)"
]
},
{
@@ -378,7 +258,7 @@
"id": "ab5817c8-6f2d-4aa4-b1de-d6f3e35a928c",
"metadata": {},
"source": [
"The column we want to perform the join on is the one containing information about the town. In the left DataFrame this has name *Capital* while in the right one *Town*."
"The column we want to perform the join on is `species_id`. Conveniently, this column has the same label in both DataFrames. Note that this is not always the case."
]
},
{
@@ -388,7 +268,7 @@
"metadata": {},
"outputs": [],
"source": [
"inner_join = pd.merge(left_df,right_df,left_on='Capital',right_on='Town',how='inner')\n",
"inner_join = pd.merge(left_df, right_df, left_on='species_id', right_on='species_id', how='inner')\n",
"inner_join"
]
},
Expand All @@ -397,58 +277,15 @@
"id": "399cd8cb-1fcf-45f2-a654-48e1a3f004fe",
"metadata": {},
"source": [
"As you may notice, the resulting DataFrame has only one line, the only row that the columns *Capital* and *Town* have in common (*Amsterdam*). This is because an inner join selects only those row values that are **the same** in the two columns (mathematically, an intersection).\n",
"\n",
"The columns of the two DataFrames are all preserved, even if they have the same name. In our case, both left and right DataFrames have a column with the same name (*Elevation*). After merging, the two columns are preserved, but with a suffix to distinguish them. If you are not happy with the default suffix, you may specify yours in the list of arguments of the ```pd.merge``` functions."
]
},
{
"cell_type": "markdown",
"id": "8abc7b44-4857-474f-91ad-cd9704fda396",
"metadata": {},
"source": [
"Let's now look at the other joins:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56a56e33-3a0a-41ad-baf2-7c5d3fa8d01a",
"metadata": {},
"outputs": [],
"source": [
"left_join = pd.merge(left_df,right_df,left_on='Capital',right_on='Town',how='left')\n",
"left_join"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05684c4e-34bd-4c8f-b0f4-0f030deb3196",
"metadata": {},
"outputs": [],
"source": [
"right_join = pd.merge(left_df,right_df,left_on='Capital',right_on='Town',how='right')\n",
"right_join"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80356c8a-3a56-4692-adbf-662908005f3a",
"metadata": {},
"outputs": [],
"source": [
"outer_join = pd.merge(left_df,right_df,left_on='Capital',right_on='Town',how='outer')\n",
"outer_join"
"As you may notice, the resulting DataFrame has only seven rows, while our original DataFrames had 10 and 20 rows, respectively. This is because an inner join selects only those rows where the value of the joined column occurs in both DataFrames (mathematically, an intersection)."
]
},
{
"cell_type": "markdown",
"id": "8fe2c0d4-5db7-4de4-b092-502ad6568c99",
"metadata": {},
"source": [
"To resume: a join always merges rows that have matching values in the columns that you merge on. Which rows you find in the resulting DataFrame, depends on the type of join:\n",
"Aside from the inner join, there are three more types of join that you can do using the `merge()` function:\n",
"- An inner join selects only the rows that result from the combination of matching rows in both the original left and right DataFrames (intersection);\n",
"- A left join selects all rows that were in the original left DataFrame, some of which may have been joined with a matching entry from the right DataFrame;\n",
"- A right join selects all rows that were in the original right DataFrame, some of which may have been joined with a matching entry from the left DataFrame;\n",
@@ -472,6 +309,17 @@
" - Do you want to get **all the information** from the two DataFrames? Then you use an **outer join**."
]
},
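Using the same kind of toy setup as before, the four join types differ only in the `how=` argument passed to `pd.merge` (a minimal sketch; the data is hypothetical):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "val_l": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "val_r": [10, 20, 30]})

inner = pd.merge(left, right, on="key", how="inner")    # keys b, c       -> 2 rows
left_j = pd.merge(left, right, on="key", how="left")    # keys a, b, c    -> 3 rows
right_j = pd.merge(left, right, on="key", how="right")  # keys b, c, d    -> 3 rows
outer = pd.merge(left, right, on="key", how="outer")    # keys a, b, c, d -> 4 rows
```

Rows without a match on the other side (e.g. key `a` in a left join) are kept, with the missing columns filled with `NaN`.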
{
"cell_type": "markdown",
"id": "9d03dd2c-5f89-4869-b2e1-48c45985fe37",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\">\n",
"<b>Exercise 12</b>\n",
" \n",
"Now go to the Jupyter Dashboard in your internet browser and continue with exercise 12.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "56911fb1",
@@ -482,22 +330,8 @@
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
"name": "python"
}
},
"nbformat": 4,