Optional exercises and demos for pandas. (#107)
chStaiger authored Feb 12, 2024
1 parent ff81da5 commit 0917eee
Showing 3 changed files with 66 additions and 41 deletions.
31 changes: 21 additions & 10 deletions book/data-science-with-pandas-2.ipynb
Expand Up @@ -390,12 +390,23 @@
"not | **!** | **~**|"
]
},
{
"cell_type": "markdown",
"id": "b1dc9211",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\">\n",
"<b>Exercise 4</b>\n",
" \n",
"Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercise 4."
]
},
{
"cell_type": "markdown",
"id": "0fd42879",
"metadata": {},
"source": [
"## DataFrame Cleaning"
"## Optional: DataFrame Cleaning"
]
},
{
Expand Down Expand Up @@ -580,17 +591,17 @@
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\">\n",
"<b>Exercise 4 and 5</b>\n",
"\n",
"Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercises 5 and 6."
"<b>(Optional) Exercise 5</b>\n",
" \n",
"To deepen the knowledge you can continue with Exercise 5."
]
},
{
"cell_type": "markdown",
"id": "2cf8ba70-5aba-4c52-96e4-f4cda4c47268",
"metadata": {},
"source": [
"## Grouping"
"## Optional: Grouping"
]
},
{
Expand Down Expand Up @@ -663,7 +674,7 @@
"id": "71794bfb",
"metadata": {},
"source": [
"## Structure of a groupby object\n",
"## Optional: Structure of a groupby object\n",
"We can investigate which rows are assigned to which group as follows:"
]
},
Expand All @@ -684,7 +695,7 @@
"id": "0cb9e651",
"metadata": {},
"source": [
"## Grouping by multiple columns\n",
"## Optional: Grouping by multiple columns\n",
"Now let's have a look at a more complex grouping example. We want an overview statistics of the weight of all females and males by plot id. So in fact we want to group by *sex* and by *plot_id* at the same time.\n",
"\n",
"This will give us exactly 48 groups for our survey data:\n",
Expand Down Expand Up @@ -753,7 +764,7 @@
"id": "70c925fa",
"metadata": {},
"source": [
"## Summary grouping\n",
"## Optional: Summary grouping\n",
"Grouping is one of the most common operation in data analysis. Data often consists of different measurements on the same samples. In many cases we are not only interested in one particular measurement but in the cross product of measurements. In the picture below we labeled samples with green lines, blue dots and red lines. We are now interested how these three different groups relate to each other given the all other measurements in the dataframe. Pandas' groupby function gives us the means to compare these three groups with several built-in statistical methods."
]
},
Expand All @@ -773,9 +784,9 @@
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\">\n",
"<b>Exercise 6 to 8</b>\n",
"<b>Optional Exercise 6 to 8</b>\n",
" \n",
"Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercises 6 to 8."
"To deepen the knowledge you can do Exercises 6 to 8."
]
},
{
Expand Down
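The grouping cells touched by this commit group the survey data by *sex* and *plot_id* at the same time. A minimal sketch of that idea on a made-up six-row table follows; the `toy_df` name and its values are illustrative assumptions, not the course's survey data.

```python
import pandas as pd

# Hypothetical stand-in for the survey data: two sexes, two plots.
toy_df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F", "M"],
    "plot_id": [1, 1, 1, 2, 2, 2],
    "weight": [30.0, 34.0, 40.0, 42.0, 28.0, 46.0],
})

# Passing a list of columns creates one group per observed (sex, plot_id) pair.
grouped = toy_df.groupby(["sex", "plot_id"])

# Mean weight per group, indexed by a (sex, plot_id) MultiIndex.
mean_weight = grouped["weight"].mean()

# Mapping from each group key to the row labels that belong to it.
group_rows = grouped.groups

print(grouped.ngroups)
print(mean_weight)
```

With this toy table there are four groups, one for each combination of sex and plot that actually occurs in the rows.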
33 changes: 27 additions & 6 deletions book/solutions/afternoon_exercises_solutions.ipynb
Expand Up @@ -161,20 +161,41 @@
"metadata": {},
"source": [
"### Exercise 4\n",
"- Create a new DataFrame that only contains observations from the original with sex values that are not female or male. Print the number of rows in this new DataFrame. Verify the result by comparing the number of rows in the new DataFrame with the number of rows in the surveys DataFrame where sex is NaN (hint: there is a function `isnull`)."
"- Find all entries in the column `sex` which do not contain an `M` or a `F`.\n",
"- Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "53f7777a",
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "NameError",
"evalue": "name 'surveys_df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [1]\u001b[0m, in \u001b[0;36m<cell line: 1>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m df \u001b[38;5;241m=\u001b[39m \u001b[43msurveys_df\u001b[49m[(surveys_df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124msex\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mM\u001b[39m\u001b[38;5;124m'\u001b[39m) \u001b[38;5;241m&\u001b[39m (surveys_df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124msex\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mF\u001b[39m\u001b[38;5;124m'\u001b[39m)]\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNumber of rows not female or male:\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28mlen\u001b[39m(df))\n",
"\u001b[0;31mNameError\u001b[0m: name 'surveys_df' is not defined"
]
}
],
"source": [
"df = surveys_df[(surveys_df['sex'] != 'M') & (surveys_df['sex'] != 'F')]\n",
"print(\"Number of rows not female or male:\", len(df))\n",
"print(\"Number of rows NaN:\", len(surveys_df['sex'].isnull()))\n",
"print(\"Unique values in column 'sex':\", df['sex'].unique())"
"print(\"Number of rows not female or male:\", len(df))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93ab8968",
"metadata": {},
"outputs": [],
"source": [
"df = surveys_df[((surveys_df['sex'] == 'M') | (surveys_df['sex'] == 'F')) & surveys_df['weight'] > 0]"
]
},
{
Expand Down
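The Exercise 4 solution combines several conditions in one boolean mask. In Python, `&` and `|` bind more tightly than comparison operators such as `>` or `==`, so each comparison needs its own parentheses. The sketch below uses a made-up five-row table (`toy_df` and its values are assumptions for illustration) and `isin`, which is one of several equivalent ways to express "M or F".

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for surveys_df with a missing sex and a missing weight.
toy_df = pd.DataFrame({
    "sex": ["M", "F", np.nan, "F", "M"],
    "weight": [40.0, 0.0, 12.0, 35.0, np.nan],
})

# True where sex is exactly 'M' or 'F'; NaN entries evaluate to False.
is_m_or_f = toy_df["sex"].isin(["M", "F"])

# Part 1: rows whose sex is neither 'M' nor 'F'.
unknown_sex = toy_df[~is_m_or_f]

# Part 2: known sex AND weight > 0. The comparison is parenthesised
# so that it is evaluated before the & operator is applied.
clean = toy_df[is_m_or_f & (toy_df["weight"] > 0)]

print(len(unknown_sex), len(clean))
```

Rows with a missing weight drop out of `clean` as well, because a comparison against `NaN` evaluates to `False`.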
43 changes: 18 additions & 25 deletions course_materials/afternoon_exercises.ipynb
Expand Up @@ -13,7 +13,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "77038fb1-920d-4832-8b59-2d0cefa13bc5",
"metadata": {
"tags": []
Expand All @@ -34,12 +34,12 @@
"\n",
"Type the following commands and check the outputs. Can you tell what each command does? What is the difference between commands with and without parenthesis?\n",
"\n",
"`surveys_df.shape` Answer:\n",
"`surveys_df.columns` Answer:\n",
"`surveys_df.index` Answer:\n",
"`surveys_df.dtypes` Answer:\n",
"`surveys_df.head(<try_various_integers_here>)` Answer:\n",
"`surveys_df.tail(<try_various_integers_here>)` Answer:\n",
"- `surveys_df.shape` Answer:\n",
"- `surveys_df.columns` Answer:\n",
"- `surveys_df.index` Answer:\n",
"- `surveys_df.dtypes` Answer:\n",
"- `surveys_df.head(<try_various_integers_here>)` Answer:\n",
"- `surveys_df.tail(<try_various_integers_here>)` Answer:\n",
"\n",
"[Course book chapter 5 for reference](https://utrechtuniversity.github.io/workshop-introduction-to-python/data-science-with-pandas-1.html)"
]
Expand Down Expand Up @@ -155,7 +155,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "40ad6cd7-7d9f-4c5b-b269-718e98f35bf6",
"id": "1426dfcc",
"metadata": {},
"outputs": [],
"source": []
Expand Down Expand Up @@ -247,7 +247,7 @@
"metadata": {},
"source": [
"### Exercise 4\n",
"- Create a new DataFrame that only contains observations from the original DataFrame with sex values that are not female or male. Print the number of rows in this new DataFrame. Verify the result by comparing the number of rows in the new DataFrame with the number of rows in the surveys DataFrame where sex is NaN (hint: there is a function `isnull`).\n",
"- Find all entries in the column `sex` which do not contain an `M` or a `F`.\n",
"- Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0."
]
},
Expand All @@ -272,7 +272,7 @@
"id": "b34b321b",
"metadata": {},
"source": [
"### Exercise 5: Putting it all together \n",
"### Exercise 5 (optional): Putting it all together \n",
"1. Clean the column *sex* (leave out samples of which we do not know whether they are male or female) and save the result as a new dataframe `clean_df`.\n",
"2. Replace undefined *weight* values with the mean of all (defined) weights in `surveys_df`.\n",
"3. Calculate the average weight of that new DataFrame `clean_df`"
Expand All @@ -291,7 +291,7 @@
"id": "ccb33c2e",
"metadata": {},
"source": [
"### Exercise 6\n",
"### Exercise 6 (optional)\n",
"Let's see in which plots animals get more food. Calculate the average weight per plot! Complete the code below."
]
},
Expand All @@ -311,7 +311,7 @@
"id": "2bccb9da",
"metadata": {},
"source": [
"### Exercise 7\n",
"### Exercise 7 (optional)\n",
"See below a more complex grouping example. Investigate the group keys and row indexes for this more complex grouping example. \n",
"Why are there more than 48 groups? Answer: \n",
"Calculate the average weight per group.\n",
Expand Down Expand Up @@ -342,7 +342,7 @@
"id": "b0f1ab75",
"metadata": {},
"source": [
"### Exercise 8\n",
"### Exercise 8 (optional)\n",
"Would it make sense to group our data frame by the column *weight*? Why or why not?"
]
},
Expand All @@ -351,7 +351,7 @@
"id": "0c7ae97d",
"metadata": {},
"source": [
"### Exercise 9\n",
"### Exercise 9 (optional)\n",
"In the given example of vertical concatenation, you concatenated two DataFrames with the same columns. What would happen if the two DataFrames to concatenate have different column number and names?\n",
"\n",
" 1. Create a new DataFrame using the last 10 rows of the species DataFrame (`species_df`);\n",
Expand All @@ -365,18 +365,14 @@
"id": "1a685e40",
"metadata": {},
"outputs": [],
"source": [
"species_df = pd.read_csv(\"../data/species.csv\")\n",
"\n",
"surveys_df_sub_first10 = surveys_df.head(10)"
]
"source": []
},
{
"cell_type": "markdown",
"id": "afa7dd9c",
"metadata": {},
"source": [
"### Exercise 10\n",
"### Exercise 10 (optional)\n",
" 1. Looking at the `inner_join` example, can you explain how much of each of the two DataFrames is missing from the result?\n",
"\n",
"Now consider the other types of joins, for each one, can you predict the number of rows and the contents of the resulting DataFrame, based on the diagrams in the picture?\n",
Expand All @@ -392,10 +388,7 @@
"id": "427314ec",
"metadata": {},
"outputs": [],
"source": [
"left_df = surveys_df.head(10)\n",
"right_df = species_df.head(20)"
]
"source": []
},
{
"cell_type": "markdown",
Expand All @@ -406,7 +399,7 @@
"\n",
"Time to play with plots! Create a multiplot following these instructions:\n",
"- Using the matplotlib.pyplot function `subplots()`, create a single figure (10x10 inches) with four subplots organized in two rows and two columns; \n",
"- In the top row plot hindfoot_length VS weight for female and male in two different plots with two differen colors;\n",
"- In the top row plot hindfoot_length VS weight for female and male in two different plots with two different colors;\n",
"- In the bottom row, plot the same data of the top row, but using data collected before (left plot) and after (right plot) 1990;\n",
"- Give to each plot an appropriate descriptive title and optimize plot labels.\n",
"<br>\n",
Expand Down
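Exercise 9 asks what happens when the DataFrames passed to `pd.concat` do not share the same columns. A small sketch with two made-up tables (the names `surveys_part` and `species_part` and all values are illustrative assumptions, not the course data) suggests the behaviour to expect: by default the result has the union of the columns, and cells without a source value become `NaN`.

```python
import pandas as pd

# Hypothetical slice of a surveys-like table.
surveys_part = pd.DataFrame({
    "record_id": [1, 2],
    "species_id": ["NL", "DM"],
})

# Hypothetical slice of a species-like table with different columns.
species_part = pd.DataFrame({
    "species_id": ["NL", "DM"],
    "genus": ["Neotoma", "Dipodomys"],
    "taxa": ["Rodent", "Rodent"],
})

# Vertical concatenation; ignore_index renumbers the rows 0..n-1.
stacked = pd.concat([surveys_part, species_part], axis=0, ignore_index=True)

print(stacked)
print(stacked.isna().sum())
```

Only `species_id` is filled in every row here, since it is the one column both tables have in common.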
