Linear Regression in StatsModels - Lab

Introduction

It's time to apply the StatsModels skills from the previous lesson! In this lab, you'll explore a slightly more complex example: how spending on different advertising channels relates to total sales.

Objectives

You will be able to:

  • Perform a linear regression using StatsModels
  • Evaluate a linear regression model using StatsModels
  • Interpret linear regression coefficients using StatsModels

Let's Get Started

In this lab, you'll work with the "Advertising Dataset", which is a very popular dataset for studying simple regression. The dataset is available on Kaggle, but we have downloaded it for you. It is available in this repository as advertising.csv. You'll use this dataset to answer this question:

Which advertising channel has the strongest relationship with sales volume, and can be used to model and predict the sales?

The columns in this dataset are:

  1. sales: the number of widgets sold (in thousands)
  2. TV: the amount of money (in thousands of dollars) spent on TV ads
  3. radio: the amount of money (in thousands of dollars) spent on radio ads
  4. newspaper: the amount of money (in thousands of dollars) spent on newspaper ads

Step 1: Exploratory Data Analysis

# Load necessary libraries and import the data
# Check the columns and first few rows
# Generate summary statistics for data with .describe()

Based on what you have seen so far, describe the contents of this dataset. Remember that our business problem is asking us to build a model that predicts sales.

# Your answer here

Answer

Every record in our dataset shows the advertising budget spent on TV, newspaper, and radio campaigns as well as a target variable, sales.

The count for each is 200, which means that we do not have any missing data.

Looking at the mean values, it appears that spending on TV is highest, and spending on radio is lowest. This aligns with what we see in the output from head().
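The exploration described above could be sketched roughly as follows. This is one possible approach, not the official solution: it assumes pandas is installed and that advertising.csv sits in the working directory, and the helper name `summarize` is hypothetical.

```python
import pandas as pd


def summarize(df):
    """Print the columns and first rows, then return summary statistics."""
    print(df.columns.tolist())
    print(df.head())
    # A count equal to len(df) in every column means no missing values
    return df.describe()


# Assumed usage, with advertising.csv in the working directory:
# data = pd.read_csv("advertising.csv")
# summarize(data)
```

Comparing the `count` row of the result with `len(data)` is a quick way to confirm that no column has missing values.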

Now, use scatter plots to plot each predictor (TV, radio, newspaper) against the target variable.

# Visualize the relationship between the predictors and the target using scatter plots

Does there appear to be a linear relationship between these predictors and the target?

# Record your observations on linearity here

Answer

TV seems to be a good predictor because it has the most linear relationship with sales.

radio also seems to have a linear relationship, but there is more variance than with TV. We would expect a model using radio to be able to predict the target, but not as well as a model using TV.

newspaper has the least linear-looking relationship. There is a lot of variance as well. It's not clear from this plot whether a model using newspaper would be able to predict the target.
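One way to produce the scatter plots discussed above is sketched below. It assumes matplotlib and pandas are available; the function name `plot_predictors` is hypothetical, and the column names are taken from the list in this lab, so they may need adjusting to match the CSV.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_predictors(df, predictors, target="sales"):
    """Draw one scatter plot per predictor against the target."""
    n = len(predictors)
    # squeeze=False keeps axes two-dimensional even for a single predictor
    fig, axes = plt.subplots(ncols=n, figsize=(5 * n, 4), sharey=True, squeeze=False)
    for ax, col in zip(axes[0], predictors):
        ax.scatter(df[col], df[target], alpha=0.5)
        ax.set_xlabel(col)
    axes[0][0].set_ylabel(target)
    return fig


# Assumed usage once the data is loaded into a DataFrame named data:
# plot_predictors(data, ["TV", "radio", "newspaper"])
```

Sharing the y-axis makes it easier to compare how tightly each predictor tracks sales.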

Step 2: Run a Simple Linear Regression with TV as the Predictor

As the analysis above indicates, TV looks like it has the strongest relationship with sales. Let's attempt to quantify that using linear regression.

# Import libraries

# Determine X and y values

# Create an OLS model
# Get model results

# Display results summary

Step 3: Evaluate and Interpret Results from Step 2

How does this model perform overall? What do the coefficients say about the relationship between the variables?

# Your answer here

Answer

Overall, the model and coefficients are statistically significant, with all p-values well below a standard alpha of 0.05.

The R-squared value is about 0.61, i.e., about 61% of the variance in the target variable can be explained by TV spending.

The intercept is about 7.0, meaning that if we spent 0 on TV, we would expect sales of about 7k widgets (the units of sales are in thousands of widgets).

The TV coefficient is about 0.05, meaning that for each additional $1k spent on TV (the units of TV are in thousands of dollars), we would expect to sell an additional 50 widgets. (More precisely, 47.5 widgets.)

Note that all of these coefficients represent associations rather than causation. It's possible that better sales are what leads to more TV spending! Either way, TV seems to have a strong relationship with sales.
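The unit conversions in this interpretation can be checked with a little arithmetic. The sketch below hard-codes the rounded values quoted above rather than refitting the model; the variable and function names are illustrative only.

```python
# Rounded values quoted in the answer above (not recomputed here)
intercept = 7.0    # thousands of widgets expected when TV spending is 0
tv_slope = 0.0475  # thousands of widgets per $1k of TV spending


def predicted_sales(tv_spend_thousands):
    """Predicted sales, in thousands of widgets, for a TV budget given in $1k."""
    return intercept + tv_slope * tv_spend_thousands


# Extra widgets associated with one more $1k of TV spending
extra_widgets = tv_slope * 1000
print(round(extra_widgets, 1))         # 47.5
print(round(predicted_sales(100), 2))  # 11.75
```

Under these rounded values, a TV budget of $100k corresponds to predicted sales of about 11.75 thousand widgets. This is an association in the data, not a guaranteed return on spending.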

Step 4: Visualize Model with TV as Predictor

Create at least one visualization that shows the prediction line against a scatter plot of TV vs. sales, as well as at least one visualization that shows the residuals.

# Plot the model fit (scatter plot and regression line)
# Plot the model residuals

Step 5: Repeat Steps 2-4 with radio as Predictor

Compare and contrast the model performance, coefficient value, etc. The goal is to answer the business question described above.

# Run model

# Display results
# Visualize model fit
# Visualize residuals
# Your interpretation here

Answer

As with TV, the model using radio to predict sales, as well as its parameters, is statistically significant (p-values well below 0.05).

However, this model explains less of the variance. It only explains about 33% of the variance in sales, compared to about 61% explained by TV. If our main focus is the percentage of variance explained, this is a worse model than the TV model.

On the other hand, the coefficient for radio is much higher. An increase of $1k in radio spending is associated with an increase of sales of about 200 widgets! This is roughly 4x the increase of widget sales that we see for TV.

When visualized, this model doesn't look much different from the TV model.

So, how should we answer the business question? Realistically, you would need to return to your stakeholders to get a better understanding of what they are looking for. Do they care more about the variable that explains more variance, or do they care more about where an extra $1k of advertising spending is likely to make the most difference?

Step 6: Repeat Steps 2-4 with newspaper as Predictor

Once again, use this information to compare and contrast.

# Run model

# Display results
# Visualize model fit
# Visualize residuals
# Your interpretation here

Answer

Technically, our model and coefficients are still statistically significant at an alpha of 0.05, but the p-values are much higher than in the TV and radio models. For both the F-statistic (overall model significance) and the newspaper coefficient, the p-values are about 0.001, meaning that if there were truly no linear relationship, we would expect to see statistics at least this extreme only about 0.1% of the time. That is a small chance of a false positive, so we'll consider the model to be statistically significant and move on to interpreting the other results.

The R-squared here is the smallest we have seen yet: 0.05. This means that the model explains only about 5% of the variance in sales, well below both the radio model (33%) and the TV model (61%).

The coefficient is also small, though similar to the TV coefficient. An increase of $1k in newspaper spending is associated with about 50 additional widget sales (more precisely, about 54.7). This is still much less than the 200-widget increase associated with $1k of additional radio spending.

When we visualize this model, the best-fit line is clearly not a strong predictor. On the other hand, the residuals exhibit homoscedasticity, meaning that the spread of the residuals doesn't vary much with the value of newspaper. This contrasts with the radio and TV residuals, which exhibit a "cone" shape where the errors grow larger as the x-axis value increases. Homoscedasticity of residuals is a good thing, which we will describe in more depth when we discuss regression assumptions.
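To make the contrast between even spread and a "cone" shape concrete, here is a rough numeric check on synthetic residuals (not the residuals from the advertising models). The helper `residual_spread_by_half` is hypothetical and is only an informal illustration, not a formal test of the regression assumptions.

```python
import numpy as np


def residual_spread_by_half(x, residuals):
    """Return the residual standard deviation for the lower and upper half of x."""
    x = np.asarray(x)
    residuals = np.asarray(residuals)
    upper = x > np.median(x)
    return residuals[~upper].std(), residuals[upper].std()


# Synthetic residuals (NOT from the advertising models)
rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=500)
even = rng.normal(0, 2, size=500)  # constant spread: homoscedastic
cone = rng.normal(0, 0.05 * x)     # spread grows with x: "cone" shape

print(residual_spread_by_half(x, even))
print(residual_spread_by_half(x, cone))
```

For the evenly spread residuals the two standard deviations should come out similar, while for the cone-shaped residuals the upper half should be noticeably larger.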

Once again, how should we answer the business question? Regardless of the framing, it is unlikely that newspaper is the answer that your stakeholders want. This model has neither the highest R-Squared nor the highest coefficient.

Summary

In this lab, you ran a complete regression analysis with a simple dataset. You used StatsModels to perform linear regression and evaluated your models using statistical metrics as well as visualizations. You also reached a conclusion about how you would answer a business question using linear regression.
