From 7a59ca7612ed9dd68010f58015b9a41416012c45 Mon Sep 17 00:00:00 2001 From: Jose Castillo <114josecastillo@gmail.com> Date: Mon, 7 Nov 2022 22:12:50 -0500 Subject: [PATCH] finished --- index.ipynb | 1530 ++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 1529 insertions(+), 1 deletion(-) diff --git a/index.ipynb b/index.ipynb index 12aaa708..45db1177 100644 --- a/index.ipynb +++ b/index.ipynb @@ -1 +1,1529 @@ -{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# EDA with Pandas - Cumulative Lab\n", "\n", "## Introduction\n", "\n", "In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data. In this cumulative lab, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains information about home sales in Ames, Iowa between 2006 and 2010.\n", "\n", "## Objectives\n", "\n", "You will be able to:\n", "\n", "* Practice loading data with pandas\n", "* Practice calculating measures of centrality and dispersion with pandas\n", "* Practice creating subsets of data with pandas\n", "* Practice using data visualizations to explore data, and interpreting those visualizations\n", "* Perform a full exploratory data analysis process to gain insight about a dataset "]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Your Task: Explore the Ames Housing Dataset with Pandas\n", "\n", "![aerial photo of a neighborhood](images/neighborhood_aerial.jpg)\n", "\n", "Photo by Matt Donders on Unsplash\n", "\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data Understanding\n", "\n", "Each record (row) in this dataset represents a home that was sold in Ames, IA.\n", "\n", "Each feature (column) in this dataset is some attribute of that home sale. 
You can view the file `data/data_description.txt` in this repository for a full explanation of all variables in this dataset \u2014 80 columns in total.\n", "\n", "We are going to focus on the following features:\n", "\n", "**SalePrice**: `Sale price of the house in dollars`\n", "\n", "**TotRmsAbvGrd**: `Total rooms above grade (does not include bathrooms)`\n", "\n", "**OverallCond**: `Rates the overall condition of the house`\n", "```\n", " 10\tVery Excellent\n", " 9\t Excellent\n", " 8\t Very Good\n", " 7\t Good\n", " 6\t Above Average\t\n", " 5\t Average\n", " 4\t Below Average\t\n", " 3\t Fair\n", " 2\t Poor\n", " 1\t Very Poor\n", "```\n", "\n", "**YrSold**: `Year Sold (YYYY)`\n", "\n", "**YearBuilt**: `Original construction date`\n", "\n", "**LandSlope**: `Slope of property`\n", "```\n", " Gtl\tGentle slope\n", " Mod\tModerate Slope\t\n", " Sev\tSevere Slope\n", "```"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Requirements\n", "\n", "In this lab you will use your data munging and visualization skills to conduct an exploratory analysis of the dataset.\n", "\n", "#### 1. Load the Dataset with Pandas\n", "\n", "Import pandas with the standard alias `pd` and load the data into a dataframe with the standard name `df`.\n", "\n", "#### 2. Explore Data Distributions\n", "\n", "Produce summary statistics, visualizations, and interpretive text describing the distributions of `SalePrice`, `TotRmsAbvGrd`, and `OverallCond`.\n", "\n", "#### 3. Explore Differences between Subsets\n", "\n", "Separate the data into subsets based on `OverallCond`, then demonstrate how this split impacts the distribution of `SalePrice`.\n", "\n", "#### 4. Explore Correlations\n", "\n", "Find the features that have the strongest positive and negative correlations with `SalePrice`, and produce plots representing these relationships.\n", "\n", "#### 5. 
Engineer and Explore a New Feature\n", "\n", "Create a new feature `Age`, which represents the difference between the year sold and the year built, and plot the relationship between the age and sale price."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 1. Load the Dataset with Pandas\n", "\n", "In the cell below, import:\n", "* `pandas` with the standard alias `pd`\n", "* `matplotlib.pyplot` with the standard alias `plt`\n", "\n", "And set `%matplotlib inline` so the graphs will display immediately below the cell that creates them."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, use pandas to open the file located at `data/ames.csv` ([documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). Specify the argument `index_col=0` in order to avoid creating an extra `Id` column. Name the resulting dataframe `df`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The following code checks that you loaded the data correctly:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Check that df is a dataframe\n", "assert type(df) == pd.DataFrame\n", "\n", "# Check that there are the correct number of rows\n", "assert df.shape[0] == 1460\n", "\n", "# Check that there are the correct number of columns\n", "# (if this crashes, make sure you specified `index_col=0`)\n", "assert df.shape[1] == 80"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Inspect the contents of the dataframe:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "df"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": ["# Run this cell without changes\n", "df.info()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 2. Explore Data Distributions\n", "\n", "Write code to produce histograms showing the distributions of `SalePrice`, `TotRmsAbvGrd`, and `OverallCond`.\n", "\n", "Each histogram should have appropriate title and axes labels, as well as a black vertical line indicating the mean of the dataset. See the documentation for [plotting histograms](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.hist.html), [customizing axes](https://matplotlib.org/stable/api/axes_api.html#axis-labels-title-and-legend), and [plotting vertical lines](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.axvline.html#matplotlib.axes.Axes.axvline) as needed."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Sale Price\n", "\n", "In the cell below, produce a histogram for `SalePrice`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, print out the mean, median, and standard deviation:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In the cell below, interpret the above information."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Total Rooms Above Grade\n", "\n", "In the cell below, produce a histogram for `TotRmsAbvGrd`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, print out the mean, median, and standard deviation:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In the cell below, interpret the above information."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Overall Condition\n", "\n", "In the cell below, produce a histogram for `OverallCond`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, print out the mean, median, and standard deviation:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In the cell below, interpret the above information."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 3. Explore Differences between Subsets\n", "\n", "As you might have noted in the previous step, the overall condition of the house seems like we should treat it as more of a categorical variable, rather than a numeric variable.\n", "\n", "One useful way to explore a categorical variable is to create subsets of the full dataset based on that categorical variable, then plot their distributions based on some other variable. 
Since this dataset is traditionally used for predicting the sale price of a house, let's use `SalePrice` as that other variable.\n", "\n", "In the cell below, create three variables, each of which represents a record-wise subset of `df` (meaning, it has the same columns as `df`, but only some of the rows).\n", "\n", "* `below_average_condition`: home sales where the overall condition was less than 5\n", "* `average_condition`: home sales where the overall condition was exactly 5\n", "* `above_average_condition`: home sales where the overall condition was greater than 5"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "below_average_condition = None\n", "average_condition = None\n", "above_average_condition = None"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The following code checks that you created the subsets correctly:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Check that all of them still have 80 columns\n", "assert below_average_condition.shape[1] == 80\n", "assert average_condition.shape[1] == 80\n", "assert above_average_condition.shape[1] == 80\n", "\n", "# Check the numbers of rows of each subset\n", "assert below_average_condition.shape[0] == 88\n", "assert average_condition.shape[0] == 821\n", "assert above_average_condition.shape[0] == 551"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The following code will produce a plot of the distributions of sale price for each of these subsets:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Set up plot\n", "fig, ax = plt.subplots(figsize=(15,5))\n", "\n", "# Create custom bins so all are on the same scale\n", "bins = range(df[\"SalePrice\"].min(), df[\"SalePrice\"].max(), int(df[\"SalePrice\"].median()) // 20)\n", "\n", 
"# Plot three histograms, with reduced opacity (alpha) so we\n", "# can see them overlapping\n", "ax.hist(\n", " x=above_average_condition[\"SalePrice\"],\n", " label=\"above average condition\",\n", " bins=bins,\n", " color=\"cyan\",\n", " alpha=0.5\n", ")\n", "ax.hist(\n", " x=average_condition[\"SalePrice\"],\n", " label=\"average condition\",\n", " bins=bins,\n", " color=\"gray\",\n", " alpha=0.3\n", ")\n", "ax.hist(\n", " x=below_average_condition[\"SalePrice\"],\n", " label=\"below average condition\",\n", " bins=bins,\n", " color=\"yellow\",\n", " alpha=0.5\n", ")\n", "\n", "# Customize labels\n", "ax.set_title(\"Distributions of Sale Price Grouped by Condition\")\n", "ax.set_xlabel(\"Sale Price\")\n", "ax.set_ylabel(\"Number of Houses\")\n", "ax.legend();"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Interpret the plot above. What does it tell us about these overall condition categories, and the relationship between overall condition and sale price? Is there anything surprising?"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 4. Explore Correlations\n", "\n", "To understand more about what features of these homes lead to higher sale prices, let's look at some correlations. 
We'll return to using the full `df`, rather than the subsets.\n", "\n", "In the cell below, print out both the name of the column and the Pearson correlation for the column that is ***most positively correlated*** with `SalePrice` (other than `SalePrice`, which is perfectly correlated with itself).\n", "\n", "We'll only check the correlations with some kind of numeric data type.\n", "\n", "You can import additional libraries, although it is possible to do this just using pandas."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, find the ***most negatively correlated*** column:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Once you have your answer, edit the code below so that it produces a box plot of the relevant columns."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "\n", "import seaborn as sns\n", "\n", "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5))\n", "\n", "# Plot distribution of column with highest correlation\n", "sns.boxplot(\n", " x=None,\n", " y=df[\"SalePrice\"],\n", " ax=ax1\n", ")\n", "# Plot distribution of column with most negative correlation\n", "sns.boxplot(\n", " x=None,\n", " y=df[\"SalePrice\"],\n", " ax=ax2\n", ")\n", "\n", "# Customize labels\n", "ax1.set_title(None)\n", "ax1.set_xlabel(None)\n", "ax1.set_ylabel(\"Sale Price\")\n", "ax2.set_title(None)\n", "ax2.set_xlabel(None)\n", "ax2.set_ylabel(\"Sale Price\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Interpret the results below. 
Consult `data/data_description.txt` as needed."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 5. Engineer and Explore a New Feature\n", "\n", "Here the code is written for you, all you need to do is interpret it.\n", "\n", "We note that the data spans across several years of sales:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "df[\"YrSold\"].value_counts().sort_index()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Maybe we can learn something interesting from the age of the home when it was sold. This uses information from the `YrBuilt` and `YrSold` columns, but represents a truly distinct feature."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Make a new column, Age\n", "df[\"Age\"] = df[\"YrSold\"] - df[\"YearBuilt\"]\n", "\n", "# Set up plot\n", "fig, ax = plt.subplots(figsize=(15,5))\n", "\n", "# Plot Age vs. SalePrice\n", "ax.scatter(df[\"Age\"], df[\"SalePrice\"], alpha=0.3, color=\"green\")\n", "ax.set_title(\"Home Age vs. Sale Price\")\n", "ax.set_xlabel(\"Age of Home at Time of Sale\")\n", "ax.set_ylabel(\"Sale Price\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Interpret this plot below:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Summary\n", "\n", "Congratulations, you've completed an exploratory data analysis of a popular dataset! 
You saw how to inspect the distributions of individual columns, subsets of columns, correlations, and new engineered features."]}], "metadata": {"kernelspec": {"display_name": "python (learn-env)", "language": "python", "name": "learn-env"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5"}}, "nbformat": 4, "nbformat_minor": 2} \ No newline at end of file +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# EDA with Pandas - Cumulative Lab\n", + "\n", + "## Introduction\n", + "\n", + "In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data. In this cumulative lab, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains information about home sales in Ames, Iowa between 2006 and 2010.\n", + "\n", + "## Objectives\n", + "\n", + "You will be able to:\n", + "\n", + "* Practice loading data with pandas\n", + "* Practice calculating measures of centrality and dispersion with pandas\n", + "* Practice creating subsets of data with pandas\n", + "* Practice using data visualizations to explore data, and interpreting those visualizations\n", + "* Perform a full exploratory data analysis process to gain insight about a dataset " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Your Task: Explore the Ames Housing Dataset with Pandas\n", + "\n", + "![aerial photo of a neighborhood](images/neighborhood_aerial.jpg)\n", + "\n", + "Photo by Matt Donders on Unsplash\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Understanding\n", + "\n", + "Each record (row) in this dataset represents a home that was sold in Ames, IA.\n", + "\n", + "Each feature (column) in this dataset is some attribute of 
that home sale. You can view the file `data/data_description.txt` in this repository for a full explanation of all variables in this dataset — 80 columns in total.\n", + "\n", + "We are going to focus on the following features:\n", + "\n", + "**SalePrice**: `Sale price of the house in dollars`\n", + "\n", + "**TotRmsAbvGrd**: `Total rooms above grade (does not include bathrooms)`\n", + "\n", + "**OverallCond**: `Rates the overall condition of the house`\n", + "```\n", + " 10\tVery Excellent\n", + " 9\t Excellent\n", + " 8\t Very Good\n", + " 7\t Good\n", + " 6\t Above Average\t\n", + " 5\t Average\n", + " 4\t Below Average\t\n", + " 3\t Fair\n", + " 2\t Poor\n", + " 1\t Very Poor\n", + "```\n", + "\n", + "**YrSold**: `Year Sold (YYYY)`\n", + "\n", + "**YearBuilt**: `Original construction date`\n", + "\n", + "**LandSlope**: `Slope of property`\n", + "```\n", + " Gtl\tGentle slope\n", + " Mod\tModerate Slope\t\n", + " Sev\tSevere Slope\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Requirements\n", + "\n", + "In this lab you will use your data munging and visualization skills to conduct an exploratory analysis of the dataset.\n", + "\n", + "#### 1. Load the Dataset with Pandas\n", + "\n", + "Import pandas with the standard alias `pd` and load the data into a dataframe with the standard name `df`.\n", + "\n", + "#### 2. Explore Data Distributions\n", + "\n", + "Produce summary statistics, visualizations, and interpretive text describing the distributions of `SalePrice`, `TotRmsAbvGrd`, and `OverallCond`.\n", + "\n", + "#### 3. Explore Differences between Subsets\n", + "\n", + "Separate the data into subsets based on `OverallCond`, then demonstrate how this split impacts the distribution of `SalePrice`.\n", + "\n", + "#### 4. 
Explore Correlations\n", + "\n", + "Find the features that have the strongest positive and negative correlations with `SalePrice`, and produce plots representing these relationships.\n", + "\n", + "#### 5. Engineer and Explore a New Feature\n", + "\n", + "Create a new feature `Age`, which represents the difference between the year sold and the year built, and plot the relationship between the age and sale price." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Load the Dataset with Pandas\n", + "\n", + "In the cell below, import:\n", + "* `pandas` with the standard alias `pd`\n", + "* `matplotlib.pyplot` with the standard alias `plt`\n", + "\n", + "And set `%matplotlib inline` so the graphs will display immediately below the cell that creates them." + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "\n", + "%matplotlib inline\n", + "\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, use pandas to open the file located at `data/ames.csv` ([documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). Specify the argument `index_col=0` in order to avoid creating an extra `Id` column. Name the resulting dataframe `df`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv('data/ames.csv', index_col=0)\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following code checks that you loaded the data correctly:" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell without changes\n", + "\n", + "# Check that df is a dataframe\n", + "assert type(df) == pd.DataFrame\n", + "\n", + "# Check that there are the correct number of rows\n", + "assert df.shape[0] == 1460\n", + "\n", + "# Check that there are the correct number of columns\n", + "# (if this crashes, make sure you specified `index_col=0`)\n", + "assert df.shape[1] == 80" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Inspect the contents of the dataframe:" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", + "Id \n", + "1 60 RL 65.0 8450 Pave NaN Reg \n", + "2 20 RL 80.0 9600 Pave NaN Reg \n", + "3 60 RL 68.0 11250 Pave NaN IR1 \n", + "4 70 RL 60.0 9550 Pave NaN IR1 \n", + "5 60 RL 84.0 14260 Pave NaN IR1 \n", + "... ... ... ... ... ... ... ... \n", + "1456 60 RL 62.0 7917 Pave NaN Reg \n", + "1457 20 RL 85.0 13175 Pave NaN Reg \n", + "1458 70 RL 66.0 9042 Pave NaN Reg \n", + "1459 20 RL 68.0 9717 Pave NaN Reg \n", + "1460 20 RL 75.0 9937 Pave NaN Reg \n", + "\n", + " LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature \\\n", + "Id ... \n", + "1 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "2 Lvl AllPub FR2 ... 0 NaN NaN NaN \n", + "3 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "4 Lvl AllPub Corner ... 0 NaN NaN NaN \n", + "5 Lvl AllPub FR2 ... 0 NaN NaN NaN \n", + "... ... ... ... ... ... ... ... ... \n", + "1456 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "1457 Lvl AllPub Inside ... 0 NaN MnPrv NaN \n", + "1458 Lvl AllPub Inside ... 0 NaN GdPrv Shed \n", + "1459 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "1460 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "\n", + " MiscVal MoSold YrSold SaleType SaleCondition SalePrice \n", + "Id \n", + "1 0 2 2008 WD Normal 208500 \n", + "2 0 5 2007 WD Normal 181500 \n", + "3 0 9 2008 WD Normal 223500 \n", + "4 0 2 2006 WD Abnorml 140000 \n", + "5 0 12 2008 WD Normal 250000 \n", + "... ... ... ... ... ... ... 
\n", + "1456 0 8 2007 WD Normal 175000 \n", + "1457 0 2 2010 WD Normal 210000 \n", + "1458 2500 5 2010 WD Normal 266500 \n", + "1459 0 4 2010 WD Normal 142125 \n", + "1460 0 6 2008 WD Normal 147500 \n", + "\n", + "[1460 rows x 80 columns]" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Run this cell without changes\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 1460 entries, 1 to 1460\n", + "Data columns (total 80 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 MSSubClass 1460 non-null int64 \n", + " 1 MSZoning 1460 non-null object \n", + " 2 LotFrontage 1201 non-null float64\n", + " 3 LotArea 1460 non-null int64 \n", + " 4 Street 1460 non-null object \n", + " 5 Alley 91 non-null object \n", + " 6 LotShape 1460 non-null object \n", + " 7 LandContour 1460 non-null object \n", + " 8 Utilities 1460 non-null object \n", + " 9 LotConfig 1460 non-null object \n", + " 10 LandSlope 1460 non-null object \n", + " 11 Neighborhood 1460 non-null object \n", + " 12 Condition1 1460 non-null object \n", + " 13 Condition2 1460 non-null object \n", + " 14 BldgType 1460 non-null object \n", + " 15 HouseStyle 1460 non-null object \n", + " 16 OverallQual 1460 non-null int64 \n", + " 17 OverallCond 1460 non-null int64 \n", + " 18 YearBuilt 1460 non-null int64 \n", + " 19 YearRemodAdd 1460 non-null int64 \n", + " 20 RoofStyle 1460 non-null object \n", + " 21 RoofMatl 1460 non-null object \n", + " 22 Exterior1st 1460 non-null object \n", + " 23 Exterior2nd 1460 non-null object \n", + " 24 MasVnrType 1452 non-null object \n", + " 25 MasVnrArea 1452 non-null float64\n", + " 26 ExterQual 1460 non-null object \n", + " 27 ExterCond 1460 non-null object \n", + " 28 Foundation 1460 non-null object \n", + " 29 BsmtQual 1423 non-null object 
\n", + " 30 BsmtCond 1423 non-null object \n", + " 31 BsmtExposure 1422 non-null object \n", + " 32 BsmtFinType1 1423 non-null object \n", + " 33 BsmtFinSF1 1460 non-null int64 \n", + " 34 BsmtFinType2 1422 non-null object \n", + " 35 BsmtFinSF2 1460 non-null int64 \n", + " 36 BsmtUnfSF 1460 non-null int64 \n", + " 37 TotalBsmtSF 1460 non-null int64 \n", + " 38 Heating 1460 non-null object \n", + " 39 HeatingQC 1460 non-null object \n", + " 40 CentralAir 1460 non-null object \n", + " 41 Electrical 1459 non-null object \n", + " 42 1stFlrSF 1460 non-null int64 \n", + " 43 2ndFlrSF 1460 non-null int64 \n", + " 44 LowQualFinSF 1460 non-null int64 \n", + " 45 GrLivArea 1460 non-null int64 \n", + " 46 BsmtFullBath 1460 non-null int64 \n", + " 47 BsmtHalfBath 1460 non-null int64 \n", + " 48 FullBath 1460 non-null int64 \n", + " 49 HalfBath 1460 non-null int64 \n", + " 50 BedroomAbvGr 1460 non-null int64 \n", + " 51 KitchenAbvGr 1460 non-null int64 \n", + " 52 KitchenQual 1460 non-null object \n", + " 53 TotRmsAbvGrd 1460 non-null int64 \n", + " 54 Functional 1460 non-null object \n", + " 55 Fireplaces 1460 non-null int64 \n", + " 56 FireplaceQu 770 non-null object \n", + " 57 GarageType 1379 non-null object \n", + " 58 GarageYrBlt 1379 non-null float64\n", + " 59 GarageFinish 1379 non-null object \n", + " 60 GarageCars 1460 non-null int64 \n", + " 61 GarageArea 1460 non-null int64 \n", + " 62 GarageQual 1379 non-null object \n", + " 63 GarageCond 1379 non-null object \n", + " 64 PavedDrive 1460 non-null object \n", + " 65 WoodDeckSF 1460 non-null int64 \n", + " 66 OpenPorchSF 1460 non-null int64 \n", + " 67 EnclosedPorch 1460 non-null int64 \n", + " 68 3SsnPorch 1460 non-null int64 \n", + " 69 ScreenPorch 1460 non-null int64 \n", + " 70 PoolArea 1460 non-null int64 \n", + " 71 PoolQC 7 non-null object \n", + " 72 Fence 281 non-null object \n", + " 73 MiscFeature 54 non-null object \n", + " 74 MiscVal 1460 non-null int64 \n", + " 75 MoSold 1460 non-null int64 \n", + " 76 
YrSold 1460 non-null int64 \n", + " 77 SaleType 1460 non-null object \n", + " 78 SaleCondition 1460 non-null object \n", + " 79 SalePrice 1460 non-null int64 \n", + "dtypes: float64(3), int64(34), object(43)\n", + "memory usage: 923.9+ KB\n" + ] + } + ], + "source": [ + "# Run this cell without changes\n", + "df.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Explore Data Distributions\n", + "\n", + "Write code to produce histograms showing the distributions of `SalePrice`, `TotRmsAbvGrd`, and `OverallCond`.\n", + "\n", + "Each histogram should have an appropriate title and axis labels, as well as a black vertical line indicating the mean of the plotted column. See the documentation for [plotting histograms](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.hist.html), [customizing axes](https://matplotlib.org/stable/api/axes_api.html#axis-labels-title-and-legend), and [plotting vertical lines](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.axvline.html#matplotlib.axes.Axes.axvline) as needed." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sale Price\n", + "\n", + "In the cell below, produce a histogram for `SalePrice`." + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "def plot_histogram(df, column, title, xlabel, ylabel):\n", + "\n", + "    # Extract the relevant data\n", + "    data = df[column]\n", + "    mean = data.mean()\n", + "\n", + "    # Set up plot\n", + "    fig, ax = plt.subplots(figsize=(7,7))\n", + "\n", + "    # Plot histogram\n", + "    ax.hist(data, bins='auto')\n", + "\n", + "    # Plot vertical line\n", + "    ax.axvline(mean, color=\"black\")\n", + "\n", + "    # Customize title and axes labels\n", + "    ax.set_title(title)\n", + "    ax.set_xlabel(xlabel)\n", + "    ax.set_ylabel(ylabel)\n", + "\n", + "plot_histogram(\n", + "    df,\n", + "    \"SalePrice\",\n", + "    \"Distribution of Sale Prices\",\n", + "    \"Sale Price\",\n", + "    \"Number of Houses\"\n", + ")\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, print out the mean, median, and standard deviation:" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mean: 180921.19589041095\n", + "Median: 163000.0\n", + "Standard Deviation: 79442.50288288662\n" + ] + } + ], + "source": [ + "def print_stats(df, column):\n", + "    print('Mean:', df[column].mean())\n", + "    print('Median:', df[column].median())\n", + "    print('Standard Deviation:', df[column].std())\n", + "\n", + "print_stats(df, 'SalePrice')\n", + "\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the cell below, interpret the above information." + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\nThe mean (about 181k) is noticeably higher than the median (163k), and the standard deviation is large (about 79k).\\nThis suggests a right-skewed distribution with a long tail of expensive homes.\\n'" + ] + }, + "execution_count": 71, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "The mean (about 181k) is noticeably higher than the median (163k), and the standard deviation is large (about 79k).\n", + "This suggests a right-skewed distribution with a long tail of expensive homes.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Total Rooms Above Grade\n", + "\n", + "In the cell below, produce a histogram for `TotRmsAbvGrd`." + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plot_histogram(df, 'TotRmsAbvGrd', 'Distribution of Total Rooms Above Grade', 'Total Rooms', 'Number of Houses')\n", + "\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, print out the mean, median, and standard deviation:" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mean: 6.517808219178082\n", + "Median: 6.0\n", + "Standard Deviation: 1.6253932905840505\n" + ] + } + ], + "source": [ + "print_stats(df, 'TotRmsAbvGrd')\n", + "\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the cell below, interpret the above information." + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\nFor this distribution, the mean (about 6.5) and the median (6) are close to each other, and most data points are clustered around them.\\nThe mean is slightly above the median, which suggests a mild right skew.\\n'" + ] + }, + "execution_count": 74, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "For this distribution, the mean (about 6.5) and the median (6) are close to each other, and most data points are clustered around them.\n", + "The mean is slightly above the median, which suggests a mild right skew.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Overall Condition\n", + "\n", + "In the cell below, produce a histogram for `OverallCond`." + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "\n", + "# We are again reusing the same function\n", + "\n", + "plot_histogram(\n", + "    df,\n", + "    \"OverallCond\",\n", + "    \"Distribution of Overall Condition of Houses on a 1-10 Scale\",\n", + "    \"Condition of House\",\n", + "    \"Number of Houses\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, print out the mean, median, and standard deviation:" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mean: 5.575342465753424\n", + "Median: 5.0\n", + "Standard Deviation: 1.1127993367127316\n" + ] + } + ], + "source": [ + "print_stats(df, 'OverallCond')\n", + "\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the cell below, interpret the above information." + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\nMost homes received a rating of 5, which is also the median; the mean is slightly higher, at about 5.6.\\n'" + ] + }, + "execution_count": 77, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "Most homes received a rating of 5, which is also the median; the mean is slightly higher, at about 5.6.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Explore Differences between Subsets\n", + "\n", + "As you might have noted in the previous step, the overall condition of the house looks like something we should treat as a categorical variable rather than a numeric one.\n", + "\n", + "One useful way to explore a categorical variable is to create subsets of the full dataset based on that categorical variable, then plot their distributions based on some other variable.
Since this dataset is traditionally used for predicting the sale price of a house, let's use `SalePrice` as that other variable.\n", + "\n", + "In the cell below, create three variables, each of which represents a record-wise subset of `df` (meaning, it has the same columns as `df`, but only some of the rows).\n", + "\n", + "* `below_average_condition`: home sales where the overall condition was less than 5\n", + "* `average_condition`: home sales where the overall condition was exactly 5\n", + "* `above_average_condition`: home sales where the overall condition was greater than 5" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [], + "source": [ + "below_average_condition = df[df[\"OverallCond\"] < 5]\n", + "average_condition = df[df[\"OverallCond\"] == 5]\n", + "above_average_condition = df[df[\"OverallCond\"] > 5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following code checks that you created the subsets correctly:" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell without changes\n", + "\n", + "# Check that all of them still have 80 columns\n", + "assert below_average_condition.shape[1] == 80\n", + "assert average_condition.shape[1] == 80\n", + "assert above_average_condition.shape[1] == 80\n", + "\n", + "# Check the numbers of rows of each subset\n", + "assert below_average_condition.shape[0] == 88\n", + "assert average_condition.shape[0] == 821\n", + "assert above_average_condition.shape[0] == 551" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following code will produce a plot of the distributions of sale price for each of these subsets:" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Run this cell without changes\n", + "\n", + "# Set up plot\n", + "fig, ax = plt.subplots(figsize=(15,5))\n", + "\n", + "# Create custom bins so all are on the same scale\n", + "bins = range(df[\"SalePrice\"].min(), df[\"SalePrice\"].max(), int(df[\"SalePrice\"].median()) // 20)\n", + "\n", + "# Plot three histograms, with reduced opacity (alpha) so we\n", + "# can see them overlapping\n", + "ax.hist(\n", + " x=above_average_condition[\"SalePrice\"],\n", + " label=\"above average condition\",\n", + " bins=bins,\n", + " color=\"cyan\",\n", + " alpha=0.5\n", + ")\n", + "ax.hist(\n", + " x=average_condition[\"SalePrice\"],\n", + " label=\"average condition\",\n", + " bins=bins,\n", + " color=\"gray\",\n", + " alpha=0.3\n", + ")\n", + "ax.hist(\n", + " x=below_average_condition[\"SalePrice\"],\n", + " label=\"below average condition\",\n", + " bins=bins,\n", + " color=\"yellow\",\n", + " alpha=0.5\n", + ")\n", + "\n", + "# Customize labels\n", + "ax.set_title(\"Distributions of Sale Price Grouped by Condition\")\n", + "ax.set_xlabel(\"Sale Price\")\n", + "ax.set_ylabel(\"Number of Houses\")\n", + "ax.legend();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interpret the plot above. What does it tell us about these overall condition categories, and the relationship between overall condition and sale price? Is there anything surprising?" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\nThe majority of houses sell at around the 100k-300k\\n'" + ] + }, + "execution_count": 81, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "The majority of houses sell at around the 100k-300k. 
We also note a clear difference in price between\n", + "below-average and above-average homes; surprisingly, however, average-condition homes tend to sell for more\n", + "than above-average homes.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Explore Correlations\n", + "\n", + "To understand more about what features of these homes lead to higher sale prices, let's look at some correlations. We'll return to using the full `df`, rather than the subsets.\n", + "\n", + "In the cell below, print out both the name of the column and the Pearson correlation for the column that is ***most positively correlated*** with `SalePrice` (other than `SalePrice`, which is perfectly correlated with itself).\n", + "\n", + "We'll only check correlations for columns with a numeric data type.\n", + "\n", + "You can import additional libraries, although it is possible to do this just using pandas." + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 1460 entries, 1 to 1460\n", + "Data columns (total 80 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 MSSubClass 1460 non-null int64 \n", + " 1 MSZoning 1460 non-null object \n", + " 2 LotFrontage 1201 non-null float64\n", + " 3 LotArea 1460 non-null int64 \n", + " 4 Street 1460 non-null object \n", + " 5 Alley 91 non-null object \n", + " 6 LotShape 1460 non-null object \n", + " 7 LandContour 1460 non-null object \n", + " 8 Utilities 1460 non-null object \n", + " 9 LotConfig 1460 non-null object \n", + " 10 LandSlope 1460 non-null object \n", + " 11 Neighborhood 1460 non-null object \n", + " 12 Condition1 1460 non-null object \n", + " 13 Condition2 1460 non-null object \n", + " 14 BldgType 1460 non-null object \n", + " 15 HouseStyle 1460 non-null object \n", + " 16 OverallQual 1460
non-null int64 \n", + " 17 OverallCond 1460 non-null int64 \n", + " 18 YearBuilt 1460 non-null int64 \n", + " 19 YearRemodAdd 1460 non-null int64 \n", + " 20 RoofStyle 1460 non-null object \n", + " 21 RoofMatl 1460 non-null object \n", + " 22 Exterior1st 1460 non-null object \n", + " 23 Exterior2nd 1460 non-null object \n", + " 24 MasVnrType 1452 non-null object \n", + " 25 MasVnrArea 1452 non-null float64\n", + " 26 ExterQual 1460 non-null object \n", + " 27 ExterCond 1460 non-null object \n", + " 28 Foundation 1460 non-null object \n", + " 29 BsmtQual 1423 non-null object \n", + " 30 BsmtCond 1423 non-null object \n", + " 31 BsmtExposure 1422 non-null object \n", + " 32 BsmtFinType1 1423 non-null object \n", + " 33 BsmtFinSF1 1460 non-null int64 \n", + " 34 BsmtFinType2 1422 non-null object \n", + " 35 BsmtFinSF2 1460 non-null int64 \n", + " 36 BsmtUnfSF 1460 non-null int64 \n", + " 37 TotalBsmtSF 1460 non-null int64 \n", + " 38 Heating 1460 non-null object \n", + " 39 HeatingQC 1460 non-null object \n", + " 40 CentralAir 1460 non-null object \n", + " 41 Electrical 1459 non-null object \n", + " 42 1stFlrSF 1460 non-null int64 \n", + " 43 2ndFlrSF 1460 non-null int64 \n", + " 44 LowQualFinSF 1460 non-null int64 \n", + " 45 GrLivArea 1460 non-null int64 \n", + " 46 BsmtFullBath 1460 non-null int64 \n", + " 47 BsmtHalfBath 1460 non-null int64 \n", + " 48 FullBath 1460 non-null int64 \n", + " 49 HalfBath 1460 non-null int64 \n", + " 50 BedroomAbvGr 1460 non-null int64 \n", + " 51 KitchenAbvGr 1460 non-null int64 \n", + " 52 KitchenQual 1460 non-null object \n", + " 53 TotRmsAbvGrd 1460 non-null int64 \n", + " 54 Functional 1460 non-null object \n", + " 55 Fireplaces 1460 non-null int64 \n", + " 56 FireplaceQu 770 non-null object \n", + " 57 GarageType 1379 non-null object \n", + " 58 GarageYrBlt 1379 non-null float64\n", + " 59 GarageFinish 1379 non-null object \n", + " 60 GarageCars 1460 non-null int64 \n", + " 61 GarageArea 1460 non-null int64 \n", + " 62 
GarageQual 1379 non-null object \n", + " 63 GarageCond 1379 non-null object \n", + " 64 PavedDrive 1460 non-null object \n", + " 65 WoodDeckSF 1460 non-null int64 \n", + " 66 OpenPorchSF 1460 non-null int64 \n", + " 67 EnclosedPorch 1460 non-null int64 \n", + " 68 3SsnPorch 1460 non-null int64 \n", + " 69 ScreenPorch 1460 non-null int64 \n", + " 70 PoolArea 1460 non-null int64 \n", + " 71 PoolQC 7 non-null object \n", + " 72 Fence 281 non-null object \n", + " 73 MiscFeature 54 non-null object \n", + " 74 MiscVal 1460 non-null int64 \n", + " 75 MoSold 1460 non-null int64 \n", + " 76 YrSold 1460 non-null int64 \n", + " 77 SaleType 1460 non-null object \n", + " 78 SaleCondition 1460 non-null object \n", + " 79 SalePrice 1460 non-null int64 \n", + "dtypes: float64(3), int64(34), object(43)\n", + "memory usage: 923.9+ KB\n" + ] + } + ], + "source": [ + "df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 98, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "MSSubClass -0.084284\n", + "LotFrontage 0.351799\n", + "LotArea 0.263843\n", + "OverallQual 0.790982\n", + "OverallCond -0.077856\n", + "YearBuilt 0.522897\n", + "KitchenAbvGr -0.135907\n", + "dtype: float64" + ] + }, + "execution_count": 98, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[['MSSubClass','LotFrontage','LotArea','OverallQual','OverallCond','YearBuilt','KitchenAbvGr']].corrwith(df['SalePrice'])\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, find the ***most negatively correlated*** column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\"\"\"\n", + "KitchenAbvGr has the most negative correlation with Sales Price\n", + "\"\"\"\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once you have your answer, edit the code below so that it produces a box plot of the 
relevant columns." + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Replace None with appropriate code\n", + "\n", + "import seaborn as sns\n", + "\n", + "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5))\n", + "\n", + "# Plot distribution of column with highest correlation\n", + "sns.boxplot(\n", + "    x=df['OverallQual'],\n", + "    y=df[\"SalePrice\"],\n", + "    ax=ax1\n", + ")\n", + "# Plot distribution of column with most negative correlation\n", + "sns.boxplot(\n", + "    x=df['KitchenAbvGr'],\n", + "    y=df[\"SalePrice\"],\n", + "    ax=ax2\n", + ")\n", + "\n", + "# Customize labels\n", + "ax1.set_title(\"Overall Quality vs. Sale Price\")\n", + "ax1.set_xlabel(\"Overall Quality\")\n", + "ax1.set_ylabel(\"Sale Price\")\n", + "ax2.set_title(\"Number of Kitchens Above Grade vs. Sale Price\")\n", + "ax2.set_xlabel(\"Number of Kitchens Above Grade\")\n", + "ax2.set_ylabel(\"Sale Price\");" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interpret the results below. Consult `data/data_description.txt` as needed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "There is a clear positive correlation (about 0.79) between the overall quality of a home and its sale price.\n", + "The correlation with the number of kitchens above grade is negative but weak (about -0.14), and most of the\n", + "data points are clustered in the box for homes with exactly one kitchen.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.
Engineer and Explore a New Feature\n", + "\n", + "Here the code is written for you; all you need to do is interpret it.\n", + "\n", + "We note that the data spans several years of sales:" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2006 314\n", + "2007 329\n", + "2008 304\n", + "2009 338\n", + "2010 175\n", + "Name: YrSold, dtype: int64" + ] + }, + "execution_count": 101, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Run this cell without changes\n", + "df[\"YrSold\"].value_counts().sort_index()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Maybe we can learn something interesting from the age of the home when it was sold. This uses information from the `YearBuilt` and `YrSold` columns, but represents a truly distinct feature." + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Run this cell without changes\n", + "\n", + "# Make a new column, Age\n", + "df[\"Age\"] = df[\"YrSold\"] - df[\"YearBuilt\"]\n", + "\n", + "# Set up plot\n", + "fig, ax = plt.subplots(figsize=(15,5))\n", + "\n", + "# Plot Age vs. SalePrice\n", + "ax.scatter(df[\"Age\"], df[\"SalePrice\"], alpha=0.3, color=\"green\")\n", + "ax.set_title(\"Home Age vs. Sale Price\")\n", + "ax.set_xlabel(\"Age of Home at Time of Sale\")\n", + "ax.set_ylabel(\"Sale Price\");" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interpret this plot below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "There is high demand for new homes, and buyers are willing to pay top dollar for them. There is also a significant number of homes\n", + "selling at around 25-60 years of age.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "Congratulations, you've completed an exploratory data analysis of a popular dataset! You saw how to inspect the distributions of individual columns, subsets of columns, correlations, and new engineered features." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python (learn-env)", + "language": "python", + "name": "learn-env" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}