EDA with Pandas - Cumulative Lab

Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data. In this cumulative lab, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains information about home sales in Ames, Iowa between 2006 and 2010.

Objectives

You will be able to:

  • Practice loading data with pandas
  • Practice calculating measures of centrality and dispersion with pandas
  • Practice creating subsets of data with pandas
  • Practice using data visualizations to explore data, and interpreting those visualizations
  • Perform a full exploratory data analysis process to gain insight about a dataset

Your Task: Explore the Ames Housing Dataset with Pandas

Aerial photo of a neighborhood (Photo by Matt Donders on Unsplash)

Data Understanding

Each record (row) in this dataset represents a home that was sold in Ames, IA.

Each feature (column) in this dataset is some attribute of that home sale. You can view the file data/data_description.txt in this repository for a full explanation of all variables in this dataset — 80 columns in total.

We are going to focus on the following features:

SalePrice: Sale price of the house in dollars

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

OverallCond: Rates the overall condition of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor

YrSold: Year Sold (YYYY)

YearBuilt: Original construction date

LandSlope: Slope of property

       Gtl	Gentle slope
       Mod	Moderate Slope	
       Sev	Severe Slope

Requirements

In this lab you will use your data munging and visualization skills to conduct an exploratory analysis of the dataset.

1. Load the Dataset with Pandas

Import pandas with the standard alias pd and load the data into a dataframe with the standard name df.

2. Explore Data Distributions

Produce summary statistics, visualizations, and interpretive text describing the distributions of SalePrice, TotRmsAbvGrd, and OverallCond.

3. Explore Differences between Subsets

Separate the data into subsets based on OverallCond, then demonstrate how this split impacts the distribution of SalePrice.

4. Explore Correlations

Find the features that have the strongest positive and negative correlations with SalePrice, and produce plots representing these relationships.

5. Engineer and Explore a New Feature

Create a new feature Age, which represents the difference between the year sold and the year built, and plot the relationship between the age and sale price.

1. Load the Dataset with Pandas

In the cell below, import:

  • pandas with the standard alias pd
  • matplotlib.pyplot with the standard alias plt

And set %matplotlib inline so the graphs will display immediately below the cell that creates them.

# Your code here
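If you get stuck, a minimal sketch of this cell looks like the following (the last line is Jupyter notebook magic, not Python):

# Import libraries with their standard aliases
import pandas as pd
import matplotlib.pyplot as plt

# Display plots directly below the cells that create them
%matplotlib inline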

Now, use pandas to open the file located at data/ames.csv (see the pandas read_csv documentation as needed). Specify the argument index_col=0 in order to avoid creating an extra Id column. Name the resulting dataframe df.

# Your code here
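A minimal sketch, assuming the file lives at data/ames.csv relative to the notebook; index_col=0 tells pandas to use the first column (Id) as the index rather than creating a regular column for it:

# Load the Ames housing data, using the Id column as the index
df = pd.read_csv("data/ames.csv", index_col=0)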

The following code checks that you loaded the data correctly:

# Run this cell without changes

# Check that df is a dataframe
assert type(df) == pd.DataFrame

# Check that there are the correct number of rows
assert df.shape[0] == 1460

# Check that there are the correct number of columns
# (if this crashes, make sure you specified `index_col=0`)
assert df.shape[1] == 80

Inspect the contents of the dataframe:

# Run this cell without changes
df
# Run this cell without changes
df.info()

2. Explore Data Distributions

Write code to produce histograms showing the distributions of SalePrice, TotRmsAbvGrd, and OverallCond.

Each histogram should have an appropriate title and axis labels, as well as a black vertical line indicating the mean of the dataset. See the matplotlib documentation for plotting histograms, customizing axes, and plotting vertical lines as needed.

Sale Price

In the cell below, produce a histogram for SalePrice.

# Your code here
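One possible sketch (not the only valid plot); ax.axvline draws the black vertical line at the mean:

# Set up plot
fig, ax = plt.subplots(figsize=(10, 5))

# Histogram of sale prices
ax.hist(df["SalePrice"], bins=50)

# Black vertical line at the mean
ax.axvline(df["SalePrice"].mean(), color="black")

# Customize labels
ax.set_title("Distribution of Sale Prices")
ax.set_xlabel("Sale Price")
ax.set_ylabel("Number of Houses");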

Now, print out the mean, median, and standard deviation:

# Your code here
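A sketch using pandas' built-in aggregation methods:

# Descriptive statistics for SalePrice
print("Mean:              ", df["SalePrice"].mean())
print("Median:            ", df["SalePrice"].median())
print("Standard deviation:", df["SalePrice"].std())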

In the cell below, interpret the above information.

# Replace None with appropriate text
"""
None
"""

Total Rooms Above Grade

In the cell below, produce a histogram for TotRmsAbvGrd.

# Your code here

Now, print out the mean, median, and standard deviation:

# Your code here

In the cell below, interpret the above information.

# Replace None with appropriate text
"""
None
"""

Overall Condition

In the cell below, produce a histogram for OverallCond.

# Your code here

Now, print out the mean, median, and standard deviation:

# Your code here

In the cell below, interpret the above information.

# Replace None with appropriate text
"""
None
"""

3. Explore Differences between Subsets

As you might have noted in the previous step, it seems like we should treat the overall condition of the house as a categorical variable rather than a numeric one.

One useful way to explore a categorical variable is to create subsets of the full dataset based on that categorical variable, then plot their distributions based on some other variable. Since this dataset is traditionally used for predicting the sale price of a house, let's use SalePrice as that other variable.

In the cell below, create three variables, each of which represents a record-wise subset of df (meaning, it has the same columns as df, but only some of the rows).

  • below_average_condition: home sales where the overall condition was less than 5
  • average_condition: home sales where the overall condition was exactly 5
  • above_average_condition: home sales where the overall condition was greater than 5
# Replace None with appropriate code
below_average_condition = None
average_condition = None
above_average_condition = None
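If you get stuck, one possible approach is boolean mask indexing on the OverallCond column (a sketch, not the only way to write it):

# Filter rows by overall condition rating
below_average_condition = df[df["OverallCond"] < 5]
average_condition = df[df["OverallCond"] == 5]
above_average_condition = df[df["OverallCond"] > 5]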

The following code checks that you created the subsets correctly:

# Run this cell without changes

# Check that all of them still have 80 columns
assert below_average_condition.shape[1] == 80
assert average_condition.shape[1] == 80
assert above_average_condition.shape[1] == 80

# Check the numbers of rows of each subset
assert below_average_condition.shape[0] == 88
assert average_condition.shape[0] == 821
assert above_average_condition.shape[0] == 551

The following code will produce a plot of the distributions of sale price for each of these subsets:

# Run this cell without changes

# Set up plot
fig, ax = plt.subplots(figsize=(15,5))

# Create custom bins so all are on the same scale
bins = range(df["SalePrice"].min(), df["SalePrice"].max(), int(df["SalePrice"].median()) // 20)

# Plot three histograms, with reduced opacity (alpha) so we
# can see them overlapping
ax.hist(
    x=above_average_condition["SalePrice"],
    label="above average condition",
    bins=bins,
    color="cyan",
    alpha=0.5
)
ax.hist(
    x=average_condition["SalePrice"],
    label="average condition",
    bins=bins,
    color="gray",
    alpha=0.3
)
ax.hist(
    x=below_average_condition["SalePrice"],
    label="below average condition",
    bins=bins,
    color="yellow",
    alpha=0.5
)

# Customize labels
ax.set_title("Distributions of Sale Price Grouped by Condition")
ax.set_xlabel("Sale Price")
ax.set_ylabel("Number of Houses")
ax.legend();

Interpret the plot above. What does it tell us about these overall condition categories, and the relationship between overall condition and sale price? Is there anything surprising?

# Replace None with appropriate text
"""
None
"""

4. Explore Correlations

To understand more about what features of these homes lead to higher sale prices, let's look at some correlations. We'll return to using the full df, rather than the subsets.

In the cell below, print out both the name of the column and the Pearson correlation for the column that is most positively correlated with SalePrice (other than SalePrice, which is perfectly correlated with itself).

We'll only check correlations for columns with a numeric data type.

You can import additional libraries, although it is possible to do this just using pandas.

# Your code here
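One possible sketch using only pandas; the numeric_only argument restricts the correlation matrix to numeric columns (on older pandas versions this is the default behavior and the argument can be omitted):

# Pearson correlations of numeric columns with SalePrice,
# excluding SalePrice's correlation with itself
correlations = df.corr(numeric_only=True)["SalePrice"].drop("SalePrice")

# Column with the strongest positive correlation
most_positive = correlations.idxmax()
print("Most positively correlated column:", most_positive)
print("Pearson correlation:", correlations[most_positive])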

Now, find the most negatively correlated column:

# Your code here
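Continuing the sketch above, the same correlations series gives the most negative value with idxmin:

# Column with the strongest negative correlation
most_negative = correlations.idxmin()
print("Most negatively correlated column:", most_negative)
print("Pearson correlation:", correlations[most_negative])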

Once you have your answer, edit the code below so that it produces box plots of the relevant columns.

# Replace None with appropriate code

import seaborn as sns

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5))

# Plot distribution of column with highest correlation
sns.boxplot(
    x=None,
    y=df["SalePrice"],
    ax=ax1
)
# Plot distribution of column with most negative correlation
sns.boxplot(
    x=None,
    y=df["SalePrice"],
    ax=ax2
)

# Customize labels
ax1.set_title(None)
ax1.set_xlabel(None)
ax1.set_ylabel("Sale Price")
ax2.set_title(None)
ax2.set_xlabel(None)
ax2.set_ylabel("Sale Price");
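For reference, a sketch of how the placeholders might be filled in, assuming the most_positive and most_negative variables from the correlation sketches above hold the column names you found (only the changed lines are shown):

# Plot distribution of column with highest correlation
sns.boxplot(
    x=df[most_positive],
    y=df["SalePrice"],
    ax=ax1
)
# Plot distribution of column with most negative correlation
sns.boxplot(
    x=df[most_negative],
    y=df["SalePrice"],
    ax=ax2
)

# Customize labels using the column names found above
ax1.set_title(f"Sale Price vs. {most_positive}")
ax1.set_xlabel(most_positive)
ax2.set_title(f"Sale Price vs. {most_negative}")
ax2.set_xlabel(most_negative)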

Interpret the results below. Consult data/data_description.txt as needed.

# Replace None with appropriate text
"""
None
"""

5. Engineer and Explore a New Feature

Here the code is written for you; all you need to do is interpret it.

We note that the data spans several years of sales:

# Run this cell without changes
df["YrSold"].value_counts().sort_index()

Maybe we can learn something interesting from the age of the home when it was sold. This uses information from the YearBuilt and YrSold columns, but represents a truly distinct feature.

# Run this cell without changes

# Make a new column, Age
df["Age"] = df["YrSold"] - df["YearBuilt"]

# Set up plot
fig, ax = plt.subplots(figsize=(15,5))

# Plot Age vs. SalePrice
ax.scatter(df["Age"], df["SalePrice"], alpha=0.3, color="green")
ax.set_title("Home Age vs. Sale Price")
ax.set_xlabel("Age of Home at Time of Sale")
ax.set_ylabel("Sale Price");

Interpret this plot below:

# Replace None with appropriate text
"""
None
"""

Summary

Congratulations, you've completed an exploratory data analysis of a popular dataset. You saw how to inspect the distributions of individual columns, compare subsets of the data, explore correlations, and engineer and interpret a new feature.