From aff7f691b67da619706c80a68f0bebcddd63aeb7 Mon Sep 17 00:00:00 2001
From: Myra Minayo kadenge
Date: Mon, 18 Mar 2024 19:40:35 +0300
Subject: [PATCH] finished lab

---
 README.md   |    1 +
 index.ipynb | 1463 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 1463 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 405c0c2..fe499c9 100644
--- a/README.md
+++ b/README.md
@@ -416,3 +416,4 @@ None
 ## Summary
 
 Congratulations, you've completed an exploratory data analysis of a popular dataset. You saw how to inspect the distributions of individual columns, subsets of columns, correlations, and new engineered features.
+# git_practice
diff --git a/index.ipynb b/index.ipynb
index e562147..89f55ba 100644
--- a/index.ipynb
+++ b/index.ipynb
@@ -1 +1,1462 @@
-{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# EDA with Pandas - Cumulative Lab\n", "\n", "## Introduction\n", "\n", "In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data.
In this cumulative lab, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains information about home sales in Ames, Iowa between 2006 and 2010.\n", "\n", "## Objectives\n", "\n", "You will be able to:\n", "\n", "* Practice loading data with pandas\n", "* Practice calculating measures of centrality and dispersion with pandas\n", "* Practice creating subsets of data with pandas\n", "* Practice using data visualizations to explore data, and interpreting those visualizations\n", "* Perform a full exploratory data analysis process to gain insight about a dataset "]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Your Task: Explore the Ames Housing Dataset with Pandas\n", "\n", "![aerial photo of a neighborhood](images/neighborhood_aerial.jpg)\n", "\n", "Photo by Matt Donders on Unsplash\n", "\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data Understanding\n", "\n", "Each record (row) in this dataset represents a home that was sold in Ames, IA.\n", "\n", "Each feature (column) in this dataset is some attribute of that home sale. 
You can view the file `data/data_description.txt` in this repository for a full explanation of all variables in this dataset \u2014 80 columns in total.\n", "\n", "We are going to focus on the following features:\n", "\n", "**SalePrice**: `Sale price of the house in dollars`\n", "\n", "**TotRmsAbvGrd**: `Total rooms above grade (does not include bathrooms)`\n", "\n", "**OverallCond**: `Rates the overall condition of the house`\n", "```\n", " 10\tVery Excellent\n", " 9\t Excellent\n", " 8\t Very Good\n", " 7\t Good\n", " 6\t Above Average\t\n", " 5\t Average\n", " 4\t Below Average\t\n", " 3\t Fair\n", " 2\t Poor\n", " 1\t Very Poor\n", "```\n", "\n", "**YrSold**: `Year Sold (YYYY)`\n", "\n", "**YearBuilt**: `Original construction date`\n", "\n", "**LandSlope**: `Slope of property`\n", "```\n", " Gtl\tGentle slope\n", " Mod\tModerate Slope\t\n", " Sev\tSevere Slope\n", "```"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Requirements\n", "\n", "In this lab you will use your data munging and visualization skills to conduct an exploratory analysis of the dataset.\n", "\n", "#### 1. Load the Dataset with Pandas\n", "\n", "Import pandas with the standard alias `pd` and load the data into a dataframe with the standard name `df`.\n", "\n", "#### 2. Explore Data Distributions\n", "\n", "Produce summary statistics, visualizations, and interpretive text describing the distributions of `SalePrice`, `TotRmsAbvGrd`, and `OverallCond`.\n", "\n", "#### 3. Explore Differences between Subsets\n", "\n", "Separate the data into subsets based on `OverallCond`, then demonstrate how this split impacts the distribution of `SalePrice`.\n", "\n", "#### 4. Explore Correlations\n", "\n", "Find the features that have the strongest positive and negative correlations with `SalePrice`, and produce plots representing these relationships.\n", "\n", "#### 5. 
Engineer and Explore a New Feature\n", "\n", "Create a new feature `Age`, which represents the difference between the year sold and the year built, and plot the relationship between the age and sale price."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 1. Load the Dataset with Pandas\n", "\n", "In the cell below, import:\n", "* `pandas` with the standard alias `pd`\n", "* `matplotlib.pyplot` with the standard alias `plt`\n", "\n", "And set `%matplotlib inline` so the graphs will display immediately below the cell that creates them."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, use pandas to open the file located at `data/ames.csv` ([documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). Specify the argument `index_col=0` in order to avoid creating an extra `Id` column. Name the resulting dataframe `df`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The following code checks that you loaded the data correctly:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Check that df is a dataframe\n", "assert type(df) == pd.DataFrame\n", "\n", "# Check that there are the correct number of rows\n", "assert df.shape[0] == 1460\n", "\n", "# Check that there are the correct number of columns\n", "# (if this crashes, make sure you specified `index_col=0`)\n", "assert df.shape[1] == 80"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Inspect the contents of the dataframe:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "df"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": ["# Run this cell without changes\n", "df.info()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 2. Explore Data Distributions\n", "\n", "Write code to produce histograms showing the distributions of `SalePrice`, `TotRmsAbvGrd`, and `OverallCond`.\n", "\n", "Each histogram should have appropriate title and axes labels, as well as a black vertical line indicating the mean of the dataset. See the documentation for [plotting histograms](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.hist.html), [customizing axes](https://matplotlib.org/stable/api/axes_api.html#axis-labels-title-and-legend), and [plotting vertical lines](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.axvline.html#matplotlib.axes.Axes.axvline) as needed."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Sale Price\n", "\n", "In the cell below, produce a histogram for `SalePrice`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, print out the mean, median, and standard deviation:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In the cell below, interpret the above information."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Total Rooms Above Grade\n", "\n", "In the cell below, produce a histogram for `TotRmsAbvGrd`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, print out the mean, median, and standard deviation:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In the cell below, interpret the above information."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Overall Condition\n", "\n", "In the cell below, produce a histogram for `OverallCond`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, print out the mean, median, and standard deviation:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In the cell below, interpret the above information."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 3. Explore Differences between Subsets\n", "\n", "As you might have noted in the previous step, the overall condition of the house seems like we should treat it as more of a categorical variable, rather than a numeric variable.\n", "\n", "One useful way to explore a categorical variable is to create subsets of the full dataset based on that categorical variable, then plot their distributions based on some other variable. 
Since this dataset is traditionally used for predicting the sale price of a house, let's use `SalePrice` as that other variable.\n", "\n", "In the cell below, create three variables, each of which represents a record-wise subset of `df` (meaning, it has the same columns as `df`, but only some of the rows).\n", "\n", "* `below_average_condition`: home sales where the overall condition was less than 5\n", "* `average_condition`: home sales where the overall condition was exactly 5\n", "* `above_average_condition`: home sales where the overall condition was greater than 5"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "below_average_condition = None\n", "average_condition = None\n", "above_average_condition = None"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The following code checks that you created the subsets correctly:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Check that all of them still have 80 columns\n", "assert below_average_condition.shape[1] == 80\n", "assert average_condition.shape[1] == 80\n", "assert above_average_condition.shape[1] == 80\n", "\n", "# Check the numbers of rows of each subset\n", "assert below_average_condition.shape[0] == 88\n", "assert average_condition.shape[0] == 821\n", "assert above_average_condition.shape[0] == 551"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The following code will produce a plot of the distributions of sale price for each of these subsets:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Set up plot\n", "fig, ax = plt.subplots(figsize=(15,5))\n", "\n", "# Create custom bins so all are on the same scale\n", "bins = range(df[\"SalePrice\"].min(), df[\"SalePrice\"].max(), int(df[\"SalePrice\"].median()) // 20)\n", "\n", 
"# Plot three histograms, with reduced opacity (alpha) so we\n", "# can see them overlapping\n", "ax.hist(\n", " x=above_average_condition[\"SalePrice\"],\n", " label=\"above average condition\",\n", " bins=bins,\n", " color=\"cyan\",\n", " alpha=0.5\n", ")\n", "ax.hist(\n", " x=average_condition[\"SalePrice\"],\n", " label=\"average condition\",\n", " bins=bins,\n", " color=\"gray\",\n", " alpha=0.3\n", ")\n", "ax.hist(\n", " x=below_average_condition[\"SalePrice\"],\n", " label=\"below average condition\",\n", " bins=bins,\n", " color=\"yellow\",\n", " alpha=0.5\n", ")\n", "\n", "# Customize labels\n", "ax.set_title(\"Distributions of Sale Price Grouped by Condition\")\n", "ax.set_xlabel(\"Sale Price\")\n", "ax.set_ylabel(\"Number of Houses\")\n", "ax.legend();"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Interpret the plot above. What does it tell us about these overall condition categories, and the relationship between overall condition and sale price? Is there anything surprising?"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 4. Explore Correlations\n", "\n", "To understand more about what features of these homes lead to higher sale prices, let's look at some correlations. 
We'll return to using the full `df`, rather than the subsets.\n", "\n", "In the cell below, print out both the name of the column and the Pearson correlation for the column that is ***most positively correlated*** with `SalePrice` (other than `SalePrice`, which is perfectly correlated with itself).\n", "\n", "We'll only check the correlations with some kind of numeric data type.\n", "\n", "You can import additional libraries, although it is possible to do this just using pandas."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, find the ***most negatively correlated*** column:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Your code here"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Once you have your answer, edit the code below so that it produces a box plot of the relevant columns."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "\n", "import seaborn as sns\n", "\n", "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5))\n", "\n", "# Plot distribution of column with highest correlation\n", "sns.boxplot(\n", " x=None,\n", " y=df[\"SalePrice\"],\n", " ax=ax1\n", ")\n", "# Plot distribution of column with most negative correlation\n", "sns.boxplot(\n", " x=None,\n", " y=df[\"SalePrice\"],\n", " ax=ax2\n", ")\n", "\n", "# Customize labels\n", "ax1.set_title(None)\n", "ax1.set_xlabel(None)\n", "ax1.set_ylabel(\"Sale Price\")\n", "ax2.set_title(None)\n", "ax2.set_xlabel(None)\n", "ax2.set_ylabel(\"Sale Price\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Interpret the results below. 
Consult `data/data_description.txt` as needed."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 5. Engineer and Explore a New Feature\n", "\n", "Here the code is written for you, all you need to do is interpret it.\n", "\n", "We note that the data spans across several years of sales:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "df[\"YrSold\"].value_counts().sort_index()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Maybe we can learn something interesting from the age of the home when it was sold. This uses information from the `YrBuilt` and `YrSold` columns, but represents a truly distinct feature."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Make a new column, Age\n", "df[\"Age\"] = df[\"YrSold\"] - df[\"YearBuilt\"]\n", "\n", "# Set up plot\n", "fig, ax = plt.subplots(figsize=(15,5))\n", "\n", "# Plot Age vs. SalePrice\n", "ax.scatter(df[\"Age\"], df[\"SalePrice\"], alpha=0.3, color=\"green\")\n", "ax.set_title(\"Home Age vs. Sale Price\")\n", "ax.set_xlabel(\"Age of Home at Time of Sale\")\n", "ax.set_ylabel(\"Sale Price\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Interpret this plot below:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Summary\n", "\n", "Congratulations, you've completed an exploratory data analysis of a popular dataset. 
You saw how to inspect the distributions of individual columns, subsets of columns, correlations, and new engineered features."]}], "metadata": {"kernelspec": {"display_name": "python (learn-env)", "language": "python", "name": "learn-env"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5"}}, "nbformat": 4, "nbformat_minor": 2} \ No newline at end of file +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# EDA with Pandas - Cumulative Lab\n", + "\n", + "## Introduction\n", + "\n", + "In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data. In this cumulative lab, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains information about home sales in Ames, Iowa between 2006 and 2010.\n", + "\n", + "## Objectives\n", + "\n", + "You will be able to:\n", + "\n", + "* Practice loading data with pandas\n", + "* Practice calculating measures of centrality and dispersion with pandas\n", + "* Practice creating subsets of data with pandas\n", + "* Practice using data visualizations to explore data, and interpreting those visualizations\n", + "* Perform a full exploratory data analysis process to gain insight about a dataset " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Your Task: Explore the Ames Housing Dataset with Pandas\n", + "\n", + "![aerial photo of a neighborhood](images/neighborhood_aerial.jpg)\n", + "\n", + "Photo by Matt Donders on Unsplash\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Understanding\n", + "\n", + "Each record (row) in this dataset represents a home that was sold in Ames, IA.\n", + "\n", + "Each feature (column) in this dataset is some attribute of 
that home sale. You can view the file `data/data_description.txt` in this repository for a full explanation of all variables in this dataset — 80 columns in total.\n", + "\n", + "We are going to focus on the following features:\n", + "\n", + "**SalePrice**: `Sale price of the house in dollars`\n", + "\n", + "**TotRmsAbvGrd**: `Total rooms above grade (does not include bathrooms)`\n", + "\n", + "**OverallCond**: `Rates the overall condition of the house`\n", + "```\n", + " 10\tVery Excellent\n", + " 9\t Excellent\n", + " 8\t Very Good\n", + " 7\t Good\n", + " 6\t Above Average\t\n", + " 5\t Average\n", + " 4\t Below Average\t\n", + " 3\t Fair\n", + " 2\t Poor\n", + " 1\t Very Poor\n", + "```\n", + "\n", + "**YrSold**: `Year Sold (YYYY)`\n", + "\n", + "**YearBuilt**: `Original construction date`\n", + "\n", + "**LandSlope**: `Slope of property`\n", + "```\n", + " Gtl\tGentle slope\n", + " Mod\tModerate Slope\t\n", + " Sev\tSevere Slope\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Requirements\n", + "\n", + "In this lab you will use your data munging and visualization skills to conduct an exploratory analysis of the dataset.\n", + "\n", + "#### 1. Load the Dataset with Pandas\n", + "\n", + "Import pandas with the standard alias `pd` and load the data into a dataframe with the standard name `df`.\n", + "\n", + "#### 2. Explore Data Distributions\n", + "\n", + "Produce summary statistics, visualizations, and interpretive text describing the distributions of `SalePrice`, `TotRmsAbvGrd`, and `OverallCond`.\n", + "\n", + "#### 3. Explore Differences between Subsets\n", + "\n", + "Separate the data into subsets based on `OverallCond`, then demonstrate how this split impacts the distribution of `SalePrice`.\n", + "\n", + "#### 4. 
Explore Correlations\n", + "\n", + "Find the features that have the strongest positive and negative correlations with `SalePrice`, and produce plots representing these relationships.\n", + "\n", + "#### 5. Engineer and Explore a New Feature\n", + "\n", + "Create a new feature `Age`, which represents the difference between the year sold and the year built, and plot the relationship between the age and sale price." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Load the Dataset with Pandas\n", + "\n", + "In the cell below, import:\n", + "* `pandas` with the standard alias `pd`\n", + "* `matplotlib.pyplot` with the standard alias `plt`\n", + "\n", + "And set `%matplotlib inline` so the graphs will display immediately below the cell that creates them." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, use pandas to open the file located at `data/ames.csv` ([documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). Specify the argument `index_col=0` in order to avoid creating an extra `Id` column. Name the resulting dataframe `df`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here\n", + "df=pd.read_csv('data/ames.csv', index_col=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following code checks that you loaded the data correctly:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell without changes\n", + "\n", + "# Check that df is a dataframe\n", + "assert type(df) == pd.DataFrame\n", + "\n", + "# Check that there are the correct number of rows\n", + "assert df.shape[0] == 1460\n", + "\n", + "# Check that there are the correct number of columns\n", + "# (if this crashes, make sure you specified `index_col=0`)\n", + "assert df.shape[1] == 80" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Inspect the contents of the dataframe:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
MSSubClassMSZoningLotFrontageLotAreaStreetAlleyLotShapeLandContourUtilitiesLotConfig...PoolAreaPoolQCFenceMiscFeatureMiscValMoSoldYrSoldSaleTypeSaleConditionSalePrice
Id
160RL65.08450PaveNaNRegLvlAllPubInside...0NaNNaNNaN022008WDNormal208500
220RL80.09600PaveNaNRegLvlAllPubFR2...0NaNNaNNaN052007WDNormal181500
360RL68.011250PaveNaNIR1LvlAllPubInside...0NaNNaNNaN092008WDNormal223500
470RL60.09550PaveNaNIR1LvlAllPubCorner...0NaNNaNNaN022006WDAbnorml140000
560RL84.014260PaveNaNIR1LvlAllPubFR2...0NaNNaNNaN0122008WDNormal250000
..................................................................
145660RL62.07917PaveNaNRegLvlAllPubInside...0NaNNaNNaN082007WDNormal175000
145720RL85.013175PaveNaNRegLvlAllPubInside...0NaNMnPrvNaN022010WDNormal210000
145870RL66.09042PaveNaNRegLvlAllPubInside...0NaNGdPrvShed250052010WDNormal266500
145920RL68.09717PaveNaNRegLvlAllPubInside...0NaNNaNNaN042010WDNormal142125
146020RL75.09937PaveNaNRegLvlAllPubInside...0NaNNaNNaN062008WDNormal147500
\n", + "

1460 rows × 80 columns

\n", + "
" + ], + "text/plain": [ + " MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", + "Id \n", + "1 60 RL 65.0 8450 Pave NaN Reg \n", + "2 20 RL 80.0 9600 Pave NaN Reg \n", + "3 60 RL 68.0 11250 Pave NaN IR1 \n", + "4 70 RL 60.0 9550 Pave NaN IR1 \n", + "5 60 RL 84.0 14260 Pave NaN IR1 \n", + "... ... ... ... ... ... ... ... \n", + "1456 60 RL 62.0 7917 Pave NaN Reg \n", + "1457 20 RL 85.0 13175 Pave NaN Reg \n", + "1458 70 RL 66.0 9042 Pave NaN Reg \n", + "1459 20 RL 68.0 9717 Pave NaN Reg \n", + "1460 20 RL 75.0 9937 Pave NaN Reg \n", + "\n", + " LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature \\\n", + "Id ... \n", + "1 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "2 Lvl AllPub FR2 ... 0 NaN NaN NaN \n", + "3 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "4 Lvl AllPub Corner ... 0 NaN NaN NaN \n", + "5 Lvl AllPub FR2 ... 0 NaN NaN NaN \n", + "... ... ... ... ... ... ... ... ... \n", + "1456 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "1457 Lvl AllPub Inside ... 0 NaN MnPrv NaN \n", + "1458 Lvl AllPub Inside ... 0 NaN GdPrv Shed \n", + "1459 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "1460 Lvl AllPub Inside ... 0 NaN NaN NaN \n", + "\n", + " MiscVal MoSold YrSold SaleType SaleCondition SalePrice \n", + "Id \n", + "1 0 2 2008 WD Normal 208500 \n", + "2 0 5 2007 WD Normal 181500 \n", + "3 0 9 2008 WD Normal 223500 \n", + "4 0 2 2006 WD Abnorml 140000 \n", + "5 0 12 2008 WD Normal 250000 \n", + "... ... ... ... ... ... ... 
\n", + "1456 0 8 2007 WD Normal 175000 \n", + "1457 0 2 2010 WD Normal 210000 \n", + "1458 2500 5 2010 WD Normal 266500 \n", + "1459 0 4 2010 WD Normal 142125 \n", + "1460 0 6 2008 WD Normal 147500 \n", + "\n", + "[1460 rows x 80 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Run this cell without changes\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 1460 entries, 1 to 1460\n", + "Data columns (total 80 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 MSSubClass 1460 non-null int64 \n", + " 1 MSZoning 1460 non-null object \n", + " 2 LotFrontage 1201 non-null float64\n", + " 3 LotArea 1460 non-null int64 \n", + " 4 Street 1460 non-null object \n", + " 5 Alley 91 non-null object \n", + " 6 LotShape 1460 non-null object \n", + " 7 LandContour 1460 non-null object \n", + " 8 Utilities 1460 non-null object \n", + " 9 LotConfig 1460 non-null object \n", + " 10 LandSlope 1460 non-null object \n", + " 11 Neighborhood 1460 non-null object \n", + " 12 Condition1 1460 non-null object \n", + " 13 Condition2 1460 non-null object \n", + " 14 BldgType 1460 non-null object \n", + " 15 HouseStyle 1460 non-null object \n", + " 16 OverallQual 1460 non-null int64 \n", + " 17 OverallCond 1460 non-null int64 \n", + " 18 YearBuilt 1460 non-null int64 \n", + " 19 YearRemodAdd 1460 non-null int64 \n", + " 20 RoofStyle 1460 non-null object \n", + " 21 RoofMatl 1460 non-null object \n", + " 22 Exterior1st 1460 non-null object \n", + " 23 Exterior2nd 1460 non-null object \n", + " 24 MasVnrType 1452 non-null object \n", + " 25 MasVnrArea 1452 non-null float64\n", + " 26 ExterQual 1460 non-null object \n", + " 27 ExterCond 1460 non-null object \n", + " 28 Foundation 1460 non-null object \n", + " 29 BsmtQual 1423 non-null object 
\n", + " 30 BsmtCond 1423 non-null object \n", + " 31 BsmtExposure 1422 non-null object \n", + " 32 BsmtFinType1 1423 non-null object \n", + " 33 BsmtFinSF1 1460 non-null int64 \n", + " 34 BsmtFinType2 1422 non-null object \n", + " 35 BsmtFinSF2 1460 non-null int64 \n", + " 36 BsmtUnfSF 1460 non-null int64 \n", + " 37 TotalBsmtSF 1460 non-null int64 \n", + " 38 Heating 1460 non-null object \n", + " 39 HeatingQC 1460 non-null object \n", + " 40 CentralAir 1460 non-null object \n", + " 41 Electrical 1459 non-null object \n", + " 42 1stFlrSF 1460 non-null int64 \n", + " 43 2ndFlrSF 1460 non-null int64 \n", + " 44 LowQualFinSF 1460 non-null int64 \n", + " 45 GrLivArea 1460 non-null int64 \n", + " 46 BsmtFullBath 1460 non-null int64 \n", + " 47 BsmtHalfBath 1460 non-null int64 \n", + " 48 FullBath 1460 non-null int64 \n", + " 49 HalfBath 1460 non-null int64 \n", + " 50 BedroomAbvGr 1460 non-null int64 \n", + " 51 KitchenAbvGr 1460 non-null int64 \n", + " 52 KitchenQual 1460 non-null object \n", + " 53 TotRmsAbvGrd 1460 non-null int64 \n", + " 54 Functional 1460 non-null object \n", + " 55 Fireplaces 1460 non-null int64 \n", + " 56 FireplaceQu 770 non-null object \n", + " 57 GarageType 1379 non-null object \n", + " 58 GarageYrBlt 1379 non-null float64\n", + " 59 GarageFinish 1379 non-null object \n", + " 60 GarageCars 1460 non-null int64 \n", + " 61 GarageArea 1460 non-null int64 \n", + " 62 GarageQual 1379 non-null object \n", + " 63 GarageCond 1379 non-null object \n", + " 64 PavedDrive 1460 non-null object \n", + " 65 WoodDeckSF 1460 non-null int64 \n", + " 66 OpenPorchSF 1460 non-null int64 \n", + " 67 EnclosedPorch 1460 non-null int64 \n", + " 68 3SsnPorch 1460 non-null int64 \n", + " 69 ScreenPorch 1460 non-null int64 \n", + " 70 PoolArea 1460 non-null int64 \n", + " 71 PoolQC 7 non-null object \n", + " 72 Fence 281 non-null object \n", + " 73 MiscFeature 54 non-null object \n", + " 74 MiscVal 1460 non-null int64 \n", + " 75 MoSold 1460 non-null int64 \n", + " 76 
YrSold 1460 non-null int64 \n", + " 77 SaleType 1460 non-null object \n", + " 78 SaleCondition 1460 non-null object \n", + " 79 SalePrice 1460 non-null int64 \n", + "dtypes: float64(3), int64(34), object(43)\n", + "memory usage: 923.9+ KB\n" + ] + } + ], + "source": [ + "# Run this cell without changes\n", + "df.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Explore Data Distributions\n", + "\n", + "Write code to produce histograms showing the distributions of `SalePrice`, `TotRmsAbvGrd`, and `OverallCond`.\n", + "\n", + "Each histogram should have appropriate title and axes labels, as well as a black vertical line indicating the mean of the dataset. See the documentation for [plotting histograms](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.hist.html), [customizing axes](https://matplotlib.org/stable/api/axes_api.html#axis-labels-title-and-legend), and [plotting vertical lines](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.axvline.html#matplotlib.axes.Axes.axvline) as needed." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sale Price\n", + "\n", + "In the cell below, produce a histogram for `SalePrice`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([ 16., 74., 184., 345., 252., 199., 125., 85., 61., 38., 27.,\n", + " 15., 14., 8., 6., 2., 1., 1., 2., 1., 2., 0.,\n", + " 0., 0., 2.]),\n", + " array([ 34900., 63704., 92508., 121312., 150116., 178920., 207724.,\n", + " 236528., 265332., 294136., 322940., 351744., 380548., 409352.,\n", + " 438156., 466960., 495764., 524568., 553372., 582176., 610980.,\n", + " 639784., 668588., 697392., 726196., 755000.]),\n", + " )" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/d3fzzAAAACXBIWXMAAAsTAAALEwEAmpwYAAATPklEQVR4nO3dbYwd133f8e8vpEQ/NpYqSmC4NFc2mKCU0VDsgrGhIlDtNJKFILSBuqBeuESrgAYqATYaoBBjtLFfEHLTOE6L1m7oWAnbOpbZ2I4EI2nCEDYS90H0iqFlURIrxqKkFVmRsePabguqpP99cUfhJbkPd3fv5V4efT/A4M49c+bOf/nwu8MzZ4apKiRJbfmRlS5AkjR8hrskNchwl6QGGe6S1CDDXZIatHqlCwC44YYbanJycqXLkKSrymOPPfYXVbV2tm1jEe6Tk5NMT0+vdBmSdFVJ8txc2xyWkaQGGe6S1CDDXZIatGC4J3lNkkNJvpHkaJKPdu0fSfJikiPdclffPruTHE9yLMkdo/wBJEmXG+SC6lngnVX1gyTXAF9L8gfdtk9U1a/2d06yGdgB3AL8GPDHSX68qs4Ps3BJ0twWPHOvnh90b6/plvmeNrYdeKiqzlbVs8BxYNuyK5UkDWygMfckq5IcAU4DB6rq0W7TfUkeT/Jgkuu6tvXAC327z3Rtl37mriTTSabPnDmz9J9AknSZgcK9qs5X1RZgAtiW5G3Ap4C3AluAU8DHu+6Z7SNm+cy9VTVVVVNr1846B1+StESLmi1TVd8FvgrcWVUvdaH/Q+DTXBh6mQE29O02AZxcfqmSpEENMltmbZI3deuvBX4GeDrJur5u7wWe6NYfAXYkWZPkZmATcGioVY+5iY2TJBl4mdg4udIlS2rMILNl1gH7kqyi92Wwv6q+nOQ/JNlCb8jlBPABgKo6mmQ/8CRwDrj31TZT5sXnn+OBw4NfR9i91WEpScO1YLhX1ePArbO0v3+effYAe5ZXmiRpqbxDVZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBC4Z7ktckOZTkG0mOJvlo1359kgNJnuler+vbZ3eS40mOJbljlD+AJOlyg5y5nwXeWVU/CWwB7kzyduB+4GBVbQIOdu9JshnYAdwC3Al8MsmqEdQuSZrDguFePT/o3l7TLQVsB/Z17fuA93Tr24GHqupsVT0LHAe2DbNoSdL8BhpzT7IqyRH
gNHCgqh4FbqqqUwDd641d9/XAC327z3Rtl37mriTTSabPnDmzjB9BknSpgcK9qs5X1RZgAtiW5G3zdM9sHzHLZ+6tqqmqmlq7du1AxUqSBrOo2TJV9V3gq/TG0l9Ksg6gez3ddZsBNvTtNgGcXG6hkqTBDTJbZm2SN3XrrwV+BngaeATY2XXbCTzcrT8C7EiyJsnNwCbg0JDrliTNY/UAfdYB+7oZLz8C7K+qLyf5b8D+JPcAzwPvA6iqo0n2A08C54B7q+r8aMqXJM1mwXCvqseBW2dp/zbwrjn22QPsWXZ1kqQl8Q5VSWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYtGO5JNiT5SpKnkhxN8sGu/SNJXkxypFvu6ttnd5LjSY4luWOUP4Ak6XKrB+hzDvjFqjqc5I3AY0kOdNs+UVW/2t85yWZgB3AL8GPAHyf58ao6P8zCJUlzW/DMvapOVdXhbv37wFPA+nl22Q48VFVnq+pZ4DiwbRjFtmr1tWtIMvAysXFypUuWNOYGOXP/K0kmgVuBR4HbgPuS/ANgmt7Z/V/SC/7/3rfbDLN8GSTZBewCePOb37yU2ptx7uWzPHD4zMD9d29dO8JqJLVg4AuqSd4AfAH4UFV9D/gU8FZgC3AK+PgrXWfZvS5rqNpbVVNVNbV2rWElScM0ULgnuYZesH+2qr4IUFUvVdX5qvoh8GkuDL3MABv6dp8ATg6vZEnSQgaZLRPgM8BTVfVrfe3r+rq9F3iiW38E2JFkTZKbgU3AoeGVLElayCBj7rcB7we+meRI1/ZLwN1JttAbcjkBfACgqo4m2Q88SW+mzb3OlJGkK2vBcK+qrzH7OPrvz7PPHmDPMuqSJC2Dd6hKUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNWjBcE+yIclXkjyV5GiSD3bt1yc5kOSZ7vW6vn12Jzme5FiSO0b5A0iSLjfImfs54Ber6m8AbwfuTbIZuB84WFWbgIPde7ptO4BbgDuBTyZZNYriJUmzWzDcq+pUVR3u1r8PPAWsB7YD+7pu+4D3dOvbgYeq6mxVPQscB7YNuW5J0jwWNeaeZBK4FXgUuKmqTkHvCwC4seu2Hnihb7eZru3Sz9qVZDrJ9JkzZ5ZQ+qvX6mvXkGRRy8TGyZUuW9IVtHrQjkneAHwB+FBVfS/JnF1naavLGqr2AnsBpqamLtuuuZ17+SwPHF7cF+LurWtHVI2kcTTQmXuSa+gF+2er6otd80tJ1nXb1wGnu/YZYEPf7hPAyeGUK0kaxCCzZQJ8Bniqqn6tb9MjwM5ufSfwcF/7jiRrktwMbAIODa9kSdJCBhmWuQ14P/DNJEe6tl8CPgbsT3IP8DzwPoCqOppkP/AkvZk291bV+WEXLkma24LhXlVfY/ZxdIB3zbHPHmDPMuqSJC2Dd6hKUoMMd0lqkOEuSQ0y3AcwsXFyUTcMSdJKG/gmplezF59/blE3DXnDkKSV5pm7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBC4Z7kgeTnE7yRF/bR5K8mORIt9zVt213kuNJjiW5Y1SFS5LmNsiZ+28Dd87S/omq2tItvw+QZDOwA7il2+eTSVYNq1hJ0mAWDPeq+hPgOwN+3nbgoao6W1XPAseBbcuoT5K0BMsZc78vyePdsM11Xdt64IW+PjNd22WS7EoynWT6zJnB/39SSdLClhrunwLeCmwBTgEf79ozS9+a7QOqam9VTVXV1Nq1/ofSkjR
MSwr3qnqpqs5X1Q+BT3Nh6GUG2NDXdQI4ubwSJUmLtaRwT7Ku7+17gVdm0jwC7EiyJsnNwCbg0PJKlCQt1uqFOiT5HHA7cEOSGeCXgduTbKE35HIC+ABAVR1Nsh94EjgH3FtV50dSuSRpTguGe1XdPUvzZ+bpvwfYs5yiJEnL4x2qktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMP9VWL1tWtIMvAysXFypUuWtAwLPn5AbTj38lkeODz4c/N3b/UxzNLVzDN3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ1aMNyTPJjkdJIn+tquT3IgyTPd63V923YnOZ7kWJI7RlW4JGlug5y5/zZw5yVt9wMHq2oTcLB7T5LNwA7glm6fTyZZNbRqJUkDWTDcq+pPgO9c0rwd2Net7wPe09f+UFWdrapngePAtuGUKkka1FLH3G+qqlMA3euNXft64IW+fjNd22WS7EoynWT6zJnBn1YoSVrYsC+oZpa2mq1jVe2tqqmqmlq71sfLStIwLTXcX0qyDqB7Pd21zwAb+vpNACeXXp4kaSmWGu6PADu79Z3Aw33tO5KsSXIzsAk4tLwSJUmLteD/xJTkc8DtwA1JZoBfBj4G7E9yD/A88D6AqjqaZD/wJHAOuLeqzo+odknSHBYM96q6e45N75qj/x5gz3KKkiQtj3eoSlKDDHdJapDhLkkNMtw1q9XXriHJwMvExsmVLllSnwUvqOrV6dzLZ3ng8OB3Du/e6o1o0jjxzF2SGmS4S1KDDHdJapDhLkkNelWG+8TGyUXNBJGkq82rcrbMi88/50wQSU17VZ65S1LrDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7hoKnyIpjZdX5Tx3DZ9PkZTGi2fuktQgw12SGrSsYZkkJ4DvA+eBc1U1leR64PPAJHAC+PtV9ZfLK1OStBjDOHP/O1W1paqmuvf3AwerahNwsHsvSbqCRjEssx3Y163vA94zgmNIkuax3HAv4I+SPJZkV9d2U1WdAuheb5xtxyS7kkwnmT5zZvBZFpKkhS13KuRtVXUyyY3AgSRPD7pjVe0F9gJMTU3VMuuQJPVZ1pl7VZ3sXk8DXwK2AS8lWQfQvZ5ebpGSpMVZcrgneX2SN76yDvws8ATwCLCz67YTeHi5RUqSFmc5wzI3AV/q/hu61cDvVNV/TvJ1YH+Se4Dngfctv0xJ0mIsOdyr6lvAT87S/m3gXcspSpK0PN6hKkkNMty1InyKpDRaPhVSK8KnSEqj5Zm7JDXIcJekBhnuktQgw12SGmS4S1KDDHddFZw6KS2OUyF1VXDqpLQ4nrlLUoMMd0lqkOEuSQ0y3CWpQU2E+8TGyUXNpJCk1jUxW+bF559zJoUu8srUycW49rWv4+X/+38G7r/+zRuZee7EIiuTrowmwl261GKnTkLvS9+TBLWiiWEZSdLFDHdJapDhLi2Rj0TQOHPMXVqixY7r/7O3TyzqIq8XbLUcIwv3JHcC/wpYBfxmVX1sVMeSrgY+H0dX0kiGZZKsAv4t8G5gM3B3ks2jOJYk6XKjGnPfBhyvqm9V1cvAQ8D2ER1LatJix/TXvO71i+q/lH3Grf84XsdY7E2Vo/oZUlXD/9Dk7wF3VtUvdO/fD/xUVd3X12cXsKt7+xPAt4G/GHoxw3cD1jlsV0ut1jlcV0udML61bqyqWcfvRjXmPttVo4u+RapqL7D3r3ZIpqtqakT1DI11Dt/VUqt1DtfVUidcXbW+YlTDMjPAhr73E8DJER1LknSJUYX714FNSW5Oci2wA3hkRMeSJF1iJMMyVXUuyX3AH9KbCvlgVR1dYLe9C2wfF9Y5fFdLrdY5XFdLnXB11QqM6IKqJGll+fgBSWqQ4S5JLaqqFV2AO4FjwHHg/hEe50HgNPBEX9v1wAHgme71ur5tu7uajgF39LX/LeCb3bZ/zYWhrTX
A57v2R4HJvn12dsd4Bti5QJ0bgK8ATwFHgQ+OY63Aa4BDwDe6Oj86jnX29V8F/Bnw5TGv80R3jCPA9LjWCrwJ+F3gaXp/Vt8xbnXSu3/mSN/yPeBD41bnqJYrerA5/sL9OfAW4Fp6QbF5RMf6aWArF4f7r9B9oQD3A/+iW9/c1bIGuLmrcVW37VD3BznAHwDv7tr/MfDvuvUdwOf7/mJ+q3u9rlu/bp461wFbu/U3Av+jq2esau0+8w3d+jXdH+y3j1udffX+E+B3uBDu41rnCeCGS9rGrlZgH/AL3fq19MJ+7Oq8JGv+J7BxnOscauZdyYPN8gv+DuAP+97vBnaP8HiTXBzux4B13fo64NhsddCb9fOOrs/Tfe13A7/R36dbX03vbrb09+m2/QZw9yJqfhj4u+NcK/A64DDwU+NYJ737LA4C7+RCuI9dnV2fE1we7mNVK/DXgGfpzl7Htc5LavtZ4L+Me53DXFZ6zH098ELf+5mu7Uq5qapOAXSvNy5Q1/pu/dL2i/apqnPA/wL++jyftaAkk8Ct9M6Kx67WJKuSHKE33HWgqsayTuDXgX8K/LCvbRzrhN6d3H+U5LHuER3jWOtbgDPAbyX5syS/meT1Y1hnvx3A57r1ca5zaFY63Bd8TMEKmauu+epdyj5zF5C8AfgC8KGq+t58XZdw3KHUWlXnq2oLvTPjbUneNm51Jvk54HRVPTZPbRftsoRjDvP3/raq2krviar3JvnpefquVK2r6Q1xfqqqbgX+N73hjXGrs/dBvRspfx74T/P1W+Ixh/r3fphWOtxX+jEFLyVZB9C9nl6grplu/dL2i/ZJshr4UeA783zWnJJcQy/YP1tVXxznWgGq6rvAV+ldHB+3Om8Dfj7JCXpPJ31nkv84hnUCUFUnu9fTwJfoPWF13GqdAWa6f6lB78Lq1jGs8xXvBg5X1Uvd+3Gtc7iu5BjQLONgq+ldaLiZCxdUbxnh8Sa5eMz9X3LxhZVf6dZv4eILK9/iwoWVr9O7cPjKhZW7uvZ7ufjCyv5u/Xp645PXdcuzwPXz1Bjg3wO/fkn7WNUKrAXe1K2/FvhT4OfGrc5Lar6dC2PuY1cn8HrgjX3r/5XeF+Y41vqnwE906x/pahy7Ort9HgL+4bj+XRpZ3l3Jg83xC38XvRkhfw58eITH+RxwCvh/9L5V76E3NnaQ3lSlg/2/+MCHu5qO0V0Z79qngCe6bf+GC1OiXkPvn33H6V1Zf0vfPv+oaz/e/4dsjjr/Nr1/vj3OhSlcd41brcDfpDe18PHuGP+8ax+rOi+p+XYuhPvY1UlvLPsbXJhe+uExrnULMN39/v8evQAbxzpfR+9x4j/a1zZ2dY5i8fEDktSglR5zlySNgOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGvT/ATtJgRf37zJWAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Your code here\n", + "plt.hist(df['SalePrice'], bins=25, color='skyblue', edgecolor='black')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, print out the mean, median, and standard deviation:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "180921.19589041095" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Your code here\n", + "df['SalePrice'].mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "79442.50288288662" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['SalePrice'].std()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "163000.0" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['SalePrice'].median()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 140000\n", + "dtype: int64" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['SalePrice'].mode()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the cell below, interpret the above information." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\nthe sale price data is rightly skewed meaning the mean is towards the right of the data\\nthis means we have outliers in our saleprice.\\nThe large standard deviation indicates greater variability around the saleprices. 
Hence the saleprices vary\\nconsiderably around the mean.\\nThe average sale of the houses indicated by the mean is 180921 and since the mean is greater than the median \\nit suggests the data is positively skewed which is proven right by the histogram.\\nThe most frequently occuring saleprice inidcated by the mode is 140000 sowing a significant number of houses \\nhave this saleprice\\n'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "the sale price data is rightly skewed meaning the mean is towards the right of the data\n", + "this means we have outliers in our saleprice.\n", + "The large standard deviation indicates greater variability around the saleprices. Hence the saleprices vary\n", + "considerably around the mean.\n", + "The average sale of the houses indicated by the mean is 180921 and since the mean is greater than the median \n", + "it suggests the data is positively skewed which is proven right by the histogram.\n", + "The most frequently occuring saleprice inidcated by the mode is 140000 sowing a significant number of houses \n", + "have this saleprice\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Total Rooms Above Grade\n", + "\n", + "In the cell below, produce a histogram for `TotRmsAbvGrd`." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([ 1., 0., 17., 0., 97., 0., 275., 0., 402., 0., 329.,\n", + " 0., 187., 0., 75., 0., 47., 0., 18., 0., 11., 0.,\n", + " 0., 0., 1.]),\n", + " array([ 2. , 2.48, 2.96, 3.44, 3.92, 4.4 , 4.88, 5.36, 5.84,\n", + " 6.32, 6.8 , 7.28, 7.76, 8.24, 8.72, 9.2 , 9.68, 10.16,\n", + " 10.64, 11.12, 11.6 , 12.08, 12.56, 13.04, 13.52, 14. 
]),\n", + " )" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/d3fzzAAAACXBIWXMAAAsTAAALEwEAmpwYAAAS30lEQVR4nO3dbYwd53ne8f9lSpVfA0vQSqG5NFc1GCeUUVPCgnUroHAtp2IVw5SBKqDRCASqgv5ANXLgIBETIHEKsBJQv6RAKxe0rYhoFKuEXyDCSFIzjA3DQCJmxciyKJoVEUnUkgy5setabgCmpO9+2FF9Qu5yz+45y8N99P8BBzPznGfm3APyXDv77LykqpAkteV1oy5AkjR8hrskNchwl6QGGe6S1CDDXZIadNWoCwC4/vrra2JiYtRlSNKK8tRTT/1NVY3N9d4VEe4TExNMTU2NugxJWlGSvDTfew7LSFKDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUoL7DPcmqJH+Z5Kvd8nVJ9id5vpte29N3Z5JjSY4muWM5CpckzW8xR+73A0d6lh8ADlTVeuBAt0ySDcBW4GZgM/BwklXDKVeS1I++wj3JOPALwOd6mrcAe7r5PcBdPe2PV9XZqnoBOAZsGkq1umKNr5sgSd+v8XUToy5Zalq/tx/4XeDXgLf0tN1YVacAqupUkhu69jXAn/f0m+7a/p4k24HtAG9/+9sXV7WuOCeOv8SDh2b67r/z1jlvhyFpSBY8ck/yAeBMVT3V5zYzR9tFz/Krqt1VNVlVk2NjftElaZj6OXK/DfhgkjuB1wM/leT3gdNJVndH7auBM13/aWBtz/rjwMlhFi1JurQFj9yramdVjVfVBLN/KP3TqvolYB+wreu2DXiim98HbE1yTZKbgPXAwaFXLkma1yC3/H0I2JvkXuA4cDdAVR1Oshd4DjgH7Kiq8wNXKknq26LCvaq+AXyjm/8ecPs8/XYBuwasTZK0RF6hKkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqUD8PyH59koNJvp3kcJLf6do/nuREkqe715096+xMcizJ0SR3LOcOSJIu1s+TmM4C76uqHyW5GvhWkj/q3vt0VX2it3OSDcw+a/Vm4G3AnyT5GR+1J0mXTz8PyK6q+lG3eHX3qkussgV4vKrOVtULwDFg08CVSpL61teYe5JVSZ4GzgD7q+rJ7q37kjyT5JEk13Zta4CXe1af7tokSZdJX+FeVeeraiMwDmxK8i7gM8A7gI3AKeCTXffMtYkLG5JsTzKVZGpmZmYJpUuS5rOos2Wq6gfAN4DNVXW6C/0fA5/lJ0Mv08DantXGgZNzbGt3VU1W1eTY2NhSapckzaOfs2XGkry1m38D8H7gu0lW93T7EPBsN78P2JrkmiQ3AeuBg0OtWpJ0Sf2cLbMa2JNkFbM/DPZW1VeT/LckG5kdcnkR+AhAVR1Oshd4DjgH7PBMGUm6vBYM96p6BrhljvZ7LrHOLmDXYKVJkpbKK1QlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuGtFGF83QZK+X+PrJkZdsjRS/ZznLo3cieMv8eCh/m9TsfNWr3rWa5tH7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIa1M8zVF+f5GCSbyc5nOR3uvbrkuxP8nw3vbZnnZ1JjiU5muSO5dwBSdLF+jlyPwu8r6reDWwENid5D/AAcKCq1gMHumWSbAC2AjcDm4GHu+evSpIukwXDvWb9qFu8unsVsA
XY07XvAe7q5rcAj1fV2ap6ATgGbBpm0ZKkS+trzD3JqiRPA2eA/VX1JHBjVZ0C6KY3dN3XAC/3rD7dtV24ze1JppJMzcz0f7c/SdLC+gr3qjpfVRuBcWBTknddonvm2sQc29xdVZNVNTk25u1ZJWmYFnW2TFX9APgGs2Ppp5OsBuimZ7pu08DantXGgZODFipJ6l8/Z8uMJXlrN/8G4P3Ad4F9wLau2zbgiW5+H7A1yTVJbgLWAweHXLck6RL6eRLTamBPd8bL64C9VfXVJH8G7E1yL3AcuBugqg4n2Qs8B5wDdlTV+eUpX5I0lwXDvaqeAW6Zo/17wO3zrLML2DVwdZKkJfEKVUlqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4v0aMr5sgSd+v8XUToy5Z0gD6uUJVDThx/CUePNT/3Td33urN3KSVzCN3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoP6eYbq2iRfT3IkyeEk93ftH09yIsnT3evOnnV2JjmW5GiSO5ZzByRJF+vnCtVzwMeq6lCStwBPJdnfvffpqvpEb+ckG4CtwM3A24A/SfIzPkdVki6fBY/cq+pUVR3q5l8BjgBrLrHKFuDxqjpbVS8Ax4BNwyhWktSfRY25J5lg9mHZT3ZN9yV5JskjSa7t2tYAL/esNs0cPwySbE8ylWRqZqb/e55IkhbWd7gneTPwJeCjVfVD4DPAO4CNwCngk692nWP1uqihandVTVbV5NiYN6mSpGHqK9yTXM1ssD9WVV8GqKrTVXW+qn4MfJafDL1MA2t7Vh8HTg6vZEnSQvo5WybA54EjVfWpnvbVPd0+BDzbze8Dtia5JslNwHrg4PBKliQtpJ+zZW4D7gG+k+Tpru03gA8n2cjskMuLwEcAqupwkr3Ac8yeabPDM2Uk6fJaMNyr6lvMPY7+h5dYZxewa4C6JEkD8ApVSWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJalA/z1Bdm+TrSY4kOZzk/q79uiT7kzzfTa/tWWdnkmNJjia5Yzl3QJJ0sX6O3M8BH6uqnwPeA+xIsgF4ADhQVeuBA90y3XtbgZuBzcDDSVYtR/GSpLktGO5VdaqqDnXzrwBHgDXAFmBP120PcFc3vwV4vKrOVtULwDFg05DrliRdwqLG3JNMALcATwI3VtUpmP0BANzQdVsDvNyz2nTXduG2tieZSjI1MzOzhNIlSfPpO9yTvBn4EvDRqvrhpbrO0VYXNVTtrqrJqpocGxvrtwxJUh/6CvckVzMb7I9V1Ze75tNJVnfvrwbOdO3TwNqe1ceBk8MpV5LUj37OlgnweeBIVX2q5619wLZufhvwRE/71iTXJLkJWA8cHF7JkqSFXNVHn9uAe4DvJHm6a/sN4CFgb5J7gePA3QBVdTjJXuA5Zs+02VFV54dduCRpfguGe1V9i7nH0QFun2edXcCuAeqSJA3AK1Slzvi6CZL0/RpfNzHqkqV59TMsI70mnDj+Eg8e6v+03J23epaXrlweuUtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBvXzmL1HkpxJ8mxP28eTnEjydPe6s+e9nUmOJTma5I7lKlySNL9+jtwfBTbP0f7pqtrYvf4QIMkGYCtwc7fOw0lWDatYSVJ/Fgz3qvom8P0+t7cFeLyqzlbVC8AxYNMA9UmSlmCQMff7kjzTDdtc27WtAV7u6TPdtUmSLqOlhvtngHcAG4FTwCe79rkepF1zbSDJ9iRTSaZmZvp/tJkkaWFLCveqOl1V56vqx8Bn+cnQyzSwtqfrOHBynm3srqrJqpocG/NZlJI0TEsK9ySrexY/BLx6Js0+YGuSa5LcBKwHDg5Woi
Rpsa5aqEOSLwDvBa5PMg38NvDeJBuZHXJ5EfgIQFUdTrIXeA44B+yoqvPLUrkkaV4LhntVfXiO5s9fov8uYNcgRUmSBuMVqpLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktSgBcM9ySNJziR5tqftuiT7kzzfTa/teW9nkmNJjia5Y7kKlyTNr58j90eBzRe0PQAcqKr1wIFumSQbgK3Azd06DydZNbRqJUl9WTDcq+qbwPcvaN4C7Onm9wB39bQ/XlVnq+oF4BiwaTilSpL6tdQx9xur6hRAN72ha18DvNzTb7pru0iS7UmmkkzNzMwssQxJ0lyG/QfVzNFWc3Wsqt1VNVlVk2NjY0MuQ5Je25Ya7qeTrAbopme69mlgbU+/ceDk0st7bRhfN0GSvl/j6yZGXbKkK9xVS1xvH7ANeKibPtHT/gdJPgW8DVgPHBy0yNadOP4SDx7qf2hq563+piPp0hYM9yRfAN4LXJ9kGvhtZkN9b5J7gePA3QBVdTjJXuA54Bywo6rOL1PtkqR5LBjuVfXhed66fZ7+u4BdgxQlSRqMV6hKUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw126TLzNhC6npd5+QNIieZsJXU4euUtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaNNBFTEleBF4BzgPnqmoyyXXAfwcmgBeBX6yq/zVYmZKkxRjGkfs/r6qNVTXZLT8AHKiq9cCBblmSdBktx7DMFmBPN78HuGsZPkOSdAmDhnsBX0vyVJLtXduNVXUKoJveMNeKSbYnmUoyNTPT//02JEkLG/TGYbdV1ckkNwD7k3y33xWrajewG2BycrIGrEOS1GOgI/eqOtlNzwBfATYBp5OsBuimZwYtUtLCFntLYW8r3LYlH7kneRPwuqp6pZv/F8C/B/YB24CHuukTwyhU0qUt9pbC4G2FWzbIsMyNwFeSvLqdP6iqP07yF8DeJPcCx4G7By9TkrQYSw73qvor4N1ztH8PuH2QoiRJg/EKVUlqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGex8WeytVb6OqVvldWDkGfVjHa8Jib6XqbVTVKr8LK4dH7pLUIMNd0hXDYZ/hcVhG0hXDYZ/h8chdkhq0bOGeZHOSo0mOJXlguT5HknSxZQn3JKuA/wL8S2AD8OEkG5bjsyRJF1uuI/dNwLGq+quq+jvgcWDLMn2Wf4SRdMW4UvIoVTX8jSb/CthcVf+2W74H+MdVdV9Pn+3A9m7xncDRAT7yeuBvBlj/StHKfoD7ciVqZT/AfXnVuqqa86/Ky3W2TOZo+3s/RapqN7B7KB+WTFXV5DC2NUqt7Ae4L1eiVvYD3Jd+LNewzDSwtmd5HDi5TJ8lSbrAcoX7XwDrk9yU5B8AW4F9y/RZkqQLLMuwTFWdS3If8D+AVcAjVXV4OT6rM5ThnStAK/sB7suVqJX9APdlQcvyB1VJ0mh5haokNchwl6QGrdhwT7I2ydeTHElyOMn9o65pUElWJfnLJF8ddS2DSPLWJF9M8t3u3+efjLqmpUjyK93/rWeTfCHJ60ddU7+SPJLkTJJne9quS7I/yfPd9NpR1tivefblP3b/v55J8pUkbx1hiX2ba1963vvVJJXk+mF81ooNd+Ac8LGq+jngPcCOBm5xcD9wZNRFDMF/Av64qn4WeDcrcJ+SrAF+GZisqncxe2LA1tFWtSiPApsvaHsAOFBV64ED3fJK8CgX78t+4F1V9Y+A/wnsvNxFLdGjXLwvJFkL/DxwfFgftGLDvapOVdWhbv4VZgNkzWirWrok48AvAJ8bdS2DSPJTwD8DPg9QVX9XVT8YaVFLdxXwhiRXAW9kBV2rUVXfBL5/QfMWYE83vwe463
LWtFRz7UtVfa2qznWLf87stTRXvHn+XQA+DfwaF1zsOYgVG+69kkwAtwBPjriUQfwus/+4Px5xHYP6h8AM8HvdENPnkrxp1EUtVlWdAD7B7JHUKeB/V9XXRlvVwG6sqlMwe3AE3DDieobl3wB/NOoilirJB4ETVfXtYW53xYd7kjcDXwI+WlU/HHU9S5HkA8CZqnpq1LUMwVXArcBnquoW4P+wcn79//+68egtwE3A24A3Jfml0ValCyX5TWaHaB8bdS1LkeSNwG8CvzXsba/ocE9yNbPB/lhVfXnU9QzgNuCDSV5k9g6a70vy+6MtacmmgemqevW3qC8yG/YrzfuBF6pqpqr+L/Bl4J+OuKZBnU6yGqCbnhlxPQNJsg34APCva+VesPMOZg8gvt19/8eBQ0l+etANr9hwTxJmx3WPVNWnRl3PIKpqZ1WNV9UEs3+0+9OqWpFHiVX118DLSd7ZNd0OPDfCkpbqOPCeJG/s/q/dzgr8w/AF9gHbuvltwBMjrGUgSTYDvw58sKr+dtT1LFVVfaeqbqiqie77Pw3c2n2PBrJiw53Zo917mD3Kfbp73TnqogTAvwMeS/IMsBH4D6MtZ/G63zy+CBwCvsPsd2XFXPKe5AvAnwHvTDKd5F7gIeDnkzzP7JkZD42yxn7Nsy//GXgLsL/77v/XkRbZp3n2ZXk+a+X+NiNJms9KPnKXJM3DcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkN+n87JkRkYXHYCwAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Your code here\n", + "plt.hist(df['TotRmsAbvGrd'], bins=25, color='skyblue', edgecolor='black')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, print out the mean, median, and standard deviation:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "6.517808219178082" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Your code here\n", + "df['TotRmsAbvGrd'].mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "6.0" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['TotRmsAbvGrd'].median()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 6\n", + "dtype: int64" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['TotRmsAbvGrd'].mode()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "invalid syntax (, line 1)", + "output_type": "error", + "traceback": [ + "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m df['TotRmsAbvGrd'].std()b\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" + ] + } + ], + "source": [ + "df['TotRmsAbvGrd'].std()b" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the cell below, interpret the above information." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "The number of rooms is roughly symmetric around the mean, with a slight right skew (mean 6.5 vs. median 6).\n", + "The most common number of rooms, given by the mode, is 6.\n", + "The mean of 6.5 indicates that houses on average have slightly more than 6 rooms.\n", + "The middle value in the dataset is 6, which means that at least half of the houses have 6 or fewer\n", + "rooms above grade.\n", + "The values are clustered closely around the mean (most homes have between 5 and 8 rooms),\n", + "which gives the histogram its roughly bell-shaped form.\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Overall Condition\n", + "\n", + "In the cell below, produce a histogram for `OverallCond`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here\n", + "plt.hist(df['OverallCond'], bins=20, color='green', edgecolor='black')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, print out the mean, median, and standard deviation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here\n", + "df['OverallCond'].mean()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df['OverallCond'].median()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df['OverallCond'].mode()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df['OverallCond'].std()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the cell below, interpret the above information." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "The data is clustered closely around a rating of 5 rather than spread evenly across the scale.\n", + "In short, a typical rating of 5 shows that most houses are considered to be in average condition.\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Explore Differences between Subsets\n", + "\n", + "As you might have noted in the previous step, the overall condition of the house seems like we should treat it as more of a categorical variable, rather than a numeric variable.\n", + "\n", + "One useful way to explore a categorical variable is to create subsets of the full dataset based on that categorical variable, then plot their distributions based on some other variable. Since this dataset is traditionally used for predicting the sale price of a house, let's use `SalePrice` as that other variable.\n", + "\n", + "In the cell below, create three variables, each of which represents a record-wise subset of `df` (meaning, it has the same columns as `df`, but only some of the rows).\n", + "\n", + "* `below_average_condition`: home sales where the overall condition was less than 5\n", + "* `average_condition`: home sales where the overall condition was exactly 5\n", + "* `above_average_condition`: home sales where the overall condition was greater than 5" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace None with appropriate code\n", + "below_average_condition = df[df['OverallCond'] < 5]\n", + "average_condition = df[df['OverallCond'] == 5]\n", + "above_average_condition = df[df['OverallCond'] > 5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following code checks that you created the subsets correctly:" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell without changes\n", + "\n", + "# Check that all of them still have 80 columns\n", + "assert below_average_condition.shape[1] == 80\n", + "assert average_condition.shape[1] == 80\n", + "assert above_average_condition.shape[1] == 80\n", + "\n", + "# Check the numbers of rows of each subset\n", + "assert below_average_condition.shape[0] == 88\n", + "assert average_condition.shape[0] == 821\n", + "assert above_average_condition.shape[0] == 551" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following code will produce a plot of the distributions of sale price for each of these subsets:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell without changes\n", + "\n", + "# Set up plot\n", + "fig, ax = plt.subplots(figsize=(15,5))\n", + "\n", + "# Create custom bins so all are on the same scale\n", + "bins = range(df[\"SalePrice\"].min(), df[\"SalePrice\"].max(), int(df[\"SalePrice\"].median()) // 20)\n", + "\n", + "# Plot three histograms, with reduced opacity (alpha) so we\n", + "# can see them overlapping\n", + "ax.hist(\n", + " x=above_average_condition[\"SalePrice\"],\n", + " label=\"above average condition\",\n", + " bins=bins,\n", + " color=\"cyan\",\n", + " alpha=0.5\n", + ")\n", + "ax.hist(\n", + " x=average_condition[\"SalePrice\"],\n", + " label=\"average condition\",\n", + " bins=bins,\n", + " color=\"gray\",\n", + " alpha=0.3\n", + ")\n", + "ax.hist(\n", + " x=below_average_condition[\"SalePrice\"],\n", + " label=\"below average condition\",\n", + " bins=bins,\n", + " color=\"yellow\",\n", + " alpha=0.5\n", + ")\n", + "\n", + "# Customize labels\n", + "ax.set_title(\"Distributions of Sale Price Grouped by Condition\")\n", + "ax.set_xlabel(\"Sale Price\")\n", + "ax.set_ylabel(\"Number of Houses\")\n", + "ax.legend();" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "Interpret the plot above. What does it tell us about these overall condition categories, and the relationship between overall condition and sale price? Is there anything surprising?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "All the categories seem to be rightly skewed showing the distribution of sale prices within each category \n", + "seems to be leaning towards higher prices.\n", + "It is possible that houses with better overall conditions tend to have higher sale prices as buyers may be \n", + "willing to pay more for homes that are in better condition, which drives up the prices for those properties.\n", + "As for the average and low conditioned houses, homeowners may invest in renovation in their homes thus expecting\n", + "higher returns on the investment when selling their homes.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Explore Correlations\n", + "\n", + "To understand more about what features of these homes lead to higher sale prices, let's look at some correlations. We'll return to using the full `df`, rather than the subsets.\n", + "\n", + "In the cell below, print out both the name of the column and the Pearson correlation for the column that is ***most positively correlated*** with `SalePrice` (other than `SalePrice`, which is perfectly correlated with itself).\n", + "\n", + "We'll only check the correlations with some kind of numeric data type.\n", + "\n", + "You can import additional libraries, although it is possible to do this just using pandas." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here\n", + "# Restrict to numeric columns first; newer pandas versions raise an error\n", + "# when corr() is called on a DataFrame that contains string columns\n", + "correlations = df.select_dtypes(include='number').corr()['SalePrice']\n", + "# Drop SalePrice itself, which is perfectly correlated with itself\n", + "correlations = correlations.drop('SalePrice')\n", + "print(\"Column:\", correlations.idxmax())\n", + "print(\"Correlation coefficient:\", correlations.max())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, find the ***most negatively correlated*** column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here\n", + "# Same numeric-only correlations as above, but now take the minimum\n", + "correlations = df.select_dtypes(include='number').corr()['SalePrice']\n", + "correlations = correlations.drop('SalePrice')\n", + "print(\"Column:\", correlations.idxmin())\n", + "print(\"Correlation coefficient:\", correlations.min())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once you have your answer, edit the code below so that it produces a box plot of the relevant columns." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace None with appropriate code\n", + "\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5))\n", + "\n", + "# Plot distribution of column with highest correlation\n", + "sns.boxplot(\n", + " x=df[\"OverallQual\"],\n", + " y=df[\"SalePrice\"],\n", + " ax=ax1\n", + ")\n", + "# Plot distribution of column with most negative correlation\n", + "sns.boxplot(\n", + " x=df[\"KitchenAbvGr\"],\n", + " y=df[\"SalePrice\"],\n", + " ax=ax2\n", + ")\n", + "\n", + "# Customize labels\n", + "ax1.set_title(\"Overall Quality vs. Sale Price\")\n", + "ax1.set_xlabel(\"OverallQual\")\n", + "ax1.set_ylabel(\"Sale Price\")\n", + "ax2.set_title(\"Number of Kitchens Above Grade vs. Sale Price\")\n", + "ax2.set_xlabel(\"KitchenAbvGr\")\n", + "ax2.set_ylabel(\"Sale Price\");" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interpret the results below. Consult `data/data_description.txt` as needed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "From the above box plots, several of the overall quality categories contain outliers. This means there\n", + "are some houses within those categories whose sale prices differ from the rest of the houses they have\n", + "been categorised with.\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Engineer and Explore a New Feature\n", + "\n", + "Here the code is written for you; all you need to do is interpret it.\n", + "\n", + "We note that the data spans several years of sales:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell without changes\n", + "df[\"YrSold\"].value_counts().sort_index()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Maybe we can learn something interesting from the age of the home when it was sold. This uses information from the `YearBuilt` and `YrSold` columns, but represents a truly distinct feature." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell without changes\n", + "\n", + "# Make a new column, Age\n", + "df[\"Age\"] = df[\"YrSold\"] - df[\"YearBuilt\"]\n", + "\n", + "# Set up plot\n", + "fig, ax = plt.subplots(figsize=(15,5))\n", + "\n", + "# Plot Age vs. SalePrice\n", + "ax.scatter(df[\"Age\"], df[\"SalePrice\"], alpha=0.3, color=\"green\")\n", + "ax.set_title(\"Home Age vs. Sale Price\")\n", + "ax.set_xlabel(\"Age of Home at Time of Sale\")\n", + "ax.set_ylabel(\"Sale Price\");" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interpret this plot below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace None with appropriate text\n", + "\"\"\"\n", + "Newer homes (those with a lower age at the time of sale) tend to have higher selling prices than\n", + "older homes. This may be because a new home is often valued at\n", + "a higher price due to the latest features used to furnish it. New homes may also come with warranties\n", + "on major items such as roofs or appliances. 
Older homes, by contrast, tend to be lower in price because buyers often\n", + "have to spend additional money on renovating the homes and replacing faulty appliances.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "Congratulations, you've completed an exploratory data analysis of a popular dataset. You saw how to inspect the distributions of individual columns, subsets of columns, correlations, and new engineered features." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}
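The histogram cells in section 2 call `plt.hist` directly, so the resulting plots have no title, no axis labels, and no mean line, all of which the lab instructions ask for. The following is a minimal sketch of one way to add them, assuming `df` is the Ames DataFrame loaded earlier in the notebook. The helper name `plot_histogram_with_mean` and its parameters are illustrative only and are not part of the lab, pandas, or matplotlib; the small synthetic frame merely stands in for the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_histogram_with_mean(df, column, title, xlabel, bins=25):
    """Plot a histogram of df[column] with a black vertical line at the mean.

    Hypothetical helper for illustration; returns the Axes and a dict of
    summary statistics so the numbers can be printed next to the plot.
    """
    data = df[column].dropna()
    stats = {
        "mean": data.mean(),
        "median": data.median(),
        "std": data.std(),
    }

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.hist(data, bins=bins, color="skyblue", edgecolor="black")
    # Black vertical line marking the mean, as the lab instructions request
    ax.axvline(stats["mean"], color="black", label="mean")
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Number of Houses")
    ax.legend()
    return ax, stats


# Synthetic stand-in for the real data (values are made up for the demo)
demo = pd.DataFrame({"SalePrice": [100000, 120000, 140000, 140000, 400000]})
ax, stats = plot_histogram_with_mean(
    demo, "SalePrice", "Distribution of Sale Prices", "Sale Price"
)
print(stats["mean"], stats["median"])
```

With the real data this could be called as `plot_histogram_with_mean(df, "SalePrice", "Distribution of Sale Prices", "Sale Price")`, and likewise for `TotRmsAbvGrd` and `OverallCond`. For those integer-valued columns, a smaller `bins` value may avoid the empty gaps between bars that appear in the recorded `TotRmsAbvGrd` output.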