In this codealong, you'll get some hands-on practice developing a simple linear regression model. In practice, you would typically use a code library rather than writing linear regression code from scratch, but this is an exercise designed to help you see what is happening "under the hood".
You will be able to:
- Perform a linear regression using self-constructed functions
- Interpret the parameters of a simple linear regression model in relation to what they signify for specific data
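Remember that the data for a simple linear regression consists of $y$ (the dependent variable) and $x$ (the independent variable). The model parameters are the slope of the line, denoted $m$, and the intercept (the value of $y$ when $x$ is 0), denoted $c$.
Thus the overall model notation is
$$y = mx + c$$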
or, alternatively
In the example below,
In other words, the overall equation is
If you think back to basic algebra, you might remember that the slope between two points can be calculated by finding the change in y over the change in x, i.e. $\Delta y / \Delta x$.
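As a quick illustration of the two-point slope, the sketch below uses three made-up points (hypothetical values, not the lesson's data). With more than two points, each consecutive pair can give a different slope, which is why a regression has to estimate a single slope for all of the data.

```python
import numpy as np

# Hypothetical points chosen only for this illustration
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])

# Change in y over change in x for each consecutive pair of points
pairwise_slopes = np.diff(y) / np.diff(x)

print(pairwise_slopes)  # the two pairs do not share the same slope
```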
Because these are estimations, we'll use the "hat" notation for the variables, i.e.
$$\hat{y} = \hat{m}x + \hat{c}$$
or
Everything in these equations represented with a "hat" (e.g. $\hat{y}$ rather than just $y$) is an estimate or an approximation.
So, how do you find the line with the best fit? You may think that you have to try lots and lots of different lines to see which one fits best. Fortunately, this task is not as complicated as it may seem. Given some data points, the best-fit line always has a distinct slope and y-intercept that can be calculated using simple linear algebraic approaches.
We can calculate $\hat{m}$, the slope of the best-fit line, using this formula:
$$\hat{m} = \rho \frac{S_y}{S_x}$$
Breaking down those components, we have:
- $\hat{m}$: the estimated slope
- $\rho$: the Pearson correlation, represented by the Greek letter "Rho"
- $S_y$: the standard deviation of the y values
- $S_x$: the standard deviation of the x values
(You can visit this Wikipedia link to take a look at the math behind the derivation of this formula.)
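To make the slope formula concrete, here is a minimal sketch that computes $\rho$, $S_y$, and $S_x$ with NumPy on a small made-up dataset (hypothetical values, not the lesson's data) and cross-checks the result against `np.polyfit`, which performs a least-squares fit of a degree-1 polynomial.

```python
import numpy as np

# Hypothetical toy data, used only for this illustration
x = np.array([1, 2, 3, 4, 5], dtype=np.float64)
y = np.array([2, 4, 5, 4, 5], dtype=np.float64)

# Components of the slope formula
rho = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y
s_y = y.std()                  # standard deviation of y (NumPy default, ddof=0)
s_x = x.std()                  # standard deviation of x (NumPy default, ddof=0)

slope_from_formula = rho * s_y / s_x

# Cross-check: np.polyfit returns coefficients from highest degree to lowest
slope_from_polyfit, intercept_from_polyfit = np.polyfit(x, y, 1)

print(slope_from_formula, slope_from_polyfit)
```

The two slopes should agree up to floating-point error, which suggests the formula and the least-squares fit describe the same line.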
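Then once we have the slope value ($\hat{m}$), we can calculate the intercept. The best-fit line passes through the point of means $(\bar{x}, \bar{y})$, which gives
$$\bar{y} = \hat{c} + \hat{m}\bar{x}$$
so
$$\hat{c} = \bar{y} - \hat{m}\bar{x}$$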
Breaking down those components, we have:
- $\hat{c}$: the estimated intercept
- $\bar{y}$: the mean of the y values
- $\hat{m}$: the estimated slope
- $\bar{x}$: the mean of the x values
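The intercept calculation can be sketched the same way. The example below reuses a small made-up dataset (hypothetical values) and checks that a line built from these two estimates passes through the point of means $(\bar{x}, \bar{y})$.

```python
import numpy as np

# Hypothetical toy data, used only for this illustration
x = np.array([1, 2, 3, 4, 5], dtype=np.float64)
y = np.array([2, 4, 5, 4, 5], dtype=np.float64)

# Slope estimate: correlation times the ratio of standard deviations
slope = np.corrcoef(x, y)[0, 1] * y.std() / x.std()

# Intercept estimate: mean of y minus slope times mean of x
intercept = y.mean() - slope * x.mean()

# Evaluating the line at the mean of x should give back the mean of y
y_at_x_mean = slope * x.mean() + intercept

print(intercept, np.isclose(y_at_x_mean, y.mean()))
```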
In the cell below, we import the necessary libraries and provide you with some toy data:
# Run this cell without changes
# import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
# Initialize arrays X and Y with given values
# X = Independent Variable
X = np.array([1,2,3,4,5,6,8,8,9,10], dtype=np.float64)
# Y = Dependent Variable
Y = np.array([7,7,8,9,9,10,10,11,11,12], dtype=np.float64)
Before performing a linear regression analysis, it's a best practice to look at a scatter plot of the independent variable vs. the dependent variable. Linear regression is only appropriate if there is a linear relationship between them. In the cell below, create a quick scatter plot showing x vs. y.
Solution code (click to reveal)
plt.scatter(X, Y);
# Your code here
Based on the plot above, does linear regression analysis seem appropriate?
Answer (click to reveal)
Yes. The relationship is strongly linear, though not perfectly linear.
The best fit line should be able to explain this relationship with very low error.
# Your answer here
Write a function `calc_slope` that returns $\hat{m}$ for a given set of x and y data.
The formula is:
$$\hat{m} = \rho \frac{S_y}{S_x}$$
Remember that you can use NumPy methods to calculate correlation and standard deviation.
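As a hedged sketch of those NumPy methods on made-up data (hypothetical values): `np.corrcoef` returns a 2×2 correlation matrix, so the correlation between x and y is the off-diagonal entry. The `std` method defaults to the population standard deviation (`ddof=0`), but because the formula only uses the ratio $S_y / S_x$, the sample version (`ddof=1`) gives the same ratio as long as both use the same setting.

```python
import numpy as np

# Hypothetical toy data, used only for this illustration
x = np.array([1, 2, 3, 4, 5], dtype=np.float64)
y = np.array([2, 4, 5, 4, 5], dtype=np.float64)

# 2x2 matrix: diagonal entries are 1, off-diagonal entries are corr(x, y)
corr_matrix = np.corrcoef(x, y)
rho = corr_matrix[0, 1]

# Ratio of standard deviations under both conventions
ratio_population = y.std() / x.std()                # ddof=0 (NumPy default)
ratio_sample = y.std(ddof=1) / x.std(ddof=1)        # ddof=1

print(corr_matrix.shape, rho)
print(np.isclose(ratio_population, ratio_sample))
```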
Solution code (click to reveal)
def calc_slope(x_vals, y_vals):
    # setting up components of formula
    rho = np.corrcoef(x_vals, y_vals)[0][1]
    s_y = y_vals.std()
    s_x = x_vals.std()
    # calculating slope estimate
    m = rho * s_y / s_x
    return m
def calc_slope(x_vals, y_vals):
    # Your code here
m = calc_slope(X,Y)
m # should produce approximately 0.539
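Now that we have our estimated slope $\hat{m}$, we can calculate the estimated intercept $\hat{c}$.
As a reminder, the calculation for the best-fit line's y-intercept is:
$$\hat{c} = \bar{y} - \hat{m}\bar{x}$$
Write a function `calc_intercept` that returns $\hat{c}$ for a given $\hat{m}$, x, and y.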
Solution code (click to reveal)
def calc_intercept(m, x_vals, y_vals):
    # setting up components of formula
    y_mean = y_vals.mean()
    x_mean = x_vals.mean()
    # calculating intercept estimate
    c = y_mean - m * x_mean
    return c
def calc_intercept(m, x_vals, y_vals):
    # Your code here
c = calc_intercept(m, X, Y)
c # should produce approximately 6.38
So, how might you go about actually making a prediction based on this model you just made?
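Now that we have a working model with $\hat{m}$ and $\hat{c}$ as model parameters, we can plug in any value of $x$ to get the corresponding prediction $\hat{y}$.
Let's try to find a y prediction for a new value of $x$, say $x = 7$, using the equation $\hat{y} = \hat{m}x + \hat{c}$.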
Solution code (click to reveal)
y_predicted = m * x_new + c
# Replace None with appropriate code
x_new = 7
y_predicted = None
y_predicted # should be about 10.155
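The same calculation extends to several new x values at once, since NumPy applies the arithmetic element-wise. The sketch below recomputes the slope and intercept from the lesson's toy data so that it runs on its own; `predict` is a hypothetical helper name introduced here for illustration, not part of the exercise.

```python
import numpy as np

# Toy data from this lesson
X = np.array([1, 2, 3, 4, 5, 6, 8, 8, 9, 10], dtype=np.float64)
Y = np.array([7, 7, 8, 9, 9, 10, 10, 11, 11, 12], dtype=np.float64)

# Slope and intercept estimates, following the formulas above
m = np.corrcoef(X, Y)[0, 1] * Y.std() / X.std()
c = Y.mean() - m * X.mean()

def predict(x_values, slope, intercept):
    """Hypothetical helper: evaluate the fitted line at each x value."""
    return slope * np.asarray(x_values, dtype=np.float64) + intercept

# x = 0 returns the intercept; x = 12 lies outside the observed range of X
new_x = [0, 7, 12]
predictions = predict(new_x, m, c)

print(predictions)
```

Predictions for x values outside the range of the original data (such as 12 here) are extrapolations and should be treated with more caution than predictions inside that range.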
Write a function `best_fit` that takes in x and y values, calculates and prints the coefficient and intercept, and plots the original data points along with the best fit line. Be sure to reuse the functions we have already written!
Solution code (click to reveal)
def best_fit(x_vals, y_vals):
    # Create a scatter plot of x vs. y
    fig, ax = plt.subplots()
    ax.scatter(x_vals, y_vals, color='#003F72', label="Data points")
    # Calculate and print coefficient and intercept
    m = calc_slope(x_vals, y_vals)
    c = calc_intercept(m, x_vals, y_vals)
    print(f"Coefficient: {m:.3f}, intercept: {c:.3f}")
    # Plot line created by coefficient and intercept
    regression_line = m * x_vals + c
    ax.plot(x_vals, regression_line, label="Regression Line")
    ax.legend()
def best_fit(x_vals, y_vals):
    # Create a scatter plot of x vs. y
    # Calculate and print coefficient and intercept
    # Plot line created by coefficient and intercept
best_fit(X, Y)
So there we have it: our least squares regression line. This is the best-fit line, and it describes the data well, though not perfectly.
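One way to see what "least squares" means is to compare the sum of squared errors of the fitted line with that of slightly different lines. The sketch below recomputes the slope and intercept from the lesson's toy data so it runs on its own; `sum_squared_errors` is a hypothetical helper, and the nudge sizes are arbitrary choices for illustration.

```python
import numpy as np

# Toy data from this lesson
X = np.array([1, 2, 3, 4, 5, 6, 8, 8, 9, 10], dtype=np.float64)
Y = np.array([7, 7, 8, 9, 9, 10, 10, 11, 11, 12], dtype=np.float64)

# Slope and intercept estimates, following the formulas above
m = np.corrcoef(X, Y)[0, 1] * Y.std() / X.std()
c = Y.mean() - m * X.mean()

def sum_squared_errors(slope, intercept, x_vals, y_vals):
    """Hypothetical helper: sum of squared vertical distances to the line."""
    residuals = y_vals - (slope * x_vals + intercept)
    return np.sum(residuals ** 2)

best_sse = sum_squared_errors(m, c, X, Y)

# Nudge the slope and/or intercept and recompute the error each time
perturbed = [
    sum_squared_errors(m + dm, c + dc, X, Y)
    for dm in (-0.05, 0, 0.05)
    for dc in (-0.2, 0, 0.2)
    if (dm, dc) != (0, 0)
]

print(best_sse, min(perturbed))
```

Every nudged line should have a larger error than the fitted one, which is the sense in which the fitted line is the "best" fit.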
What is the overall formula of the model you have created? How would you describe the slope and intercept, and what do they say about the relationship between x and y?
Answer (click to reveal)
The overall formula is $\hat{y} = 0.54x + 6.38$.
The intercept (where the line crosses the y-axis) is at about 6.38. This means that if x is equal to 0, the predicted value of y would be about 6.38.
The slope of the line is about 0.54. This means that every increase of 1 in the value of x is associated with an increase of about 0.54 in the value of y.
# Your answer here
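The interpretation of the slope and intercept can also be checked numerically. In this sketch, which recomputes the parameters from the lesson's toy data, `predict` is a hypothetical helper; the difference between predictions one unit apart should equal the slope, and the prediction at zero should equal the intercept.

```python
import numpy as np

# Toy data from this lesson
X = np.array([1, 2, 3, 4, 5, 6, 8, 8, 9, 10], dtype=np.float64)
Y = np.array([7, 7, 8, 9, 9, 10, 10, 11, 11, 12], dtype=np.float64)

# Slope and intercept estimates, following the formulas above
m = np.corrcoef(X, Y)[0, 1] * Y.std() / X.std()
c = Y.mean() - m * X.mean()

def predict(x_value):
    """Hypothetical helper: evaluate the fitted line at x_value."""
    return m * x_value + c

# Increasing x by 1 changes the prediction by the slope
change_per_unit = predict(4.0) - predict(3.0)

# At x = 0 the prediction is the intercept
value_at_zero = predict(0.0)

print(change_per_unit, value_at_zero)
```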
In this lesson, you learned how to perform linear regression from scratch using NumPy methods. You first calculated the slope and intercept parameters of the regression line that best fit the data. You then used the regression line parameters to predict the value ($\hat{y}$) for a previously unseen $x$ value.