Please fill out:
- Student name: Hellen Samuel, Calvine Dasilver, Sandra Kiptum, Jack Otieno, Salahudin
- Student pace: full time
- Scheduled project review date/time:
- Instructor name: NIKITA
Business Understanding
The real estate market is a vital component of regional economic health and stability. This project examines the dynamics of house sales in a specific northwestern county in the United States, aiming to identify the key factors influencing property valuation in this area.
- Real estate data complexity, encompassing diverse property features and local market trends.
- Accurately identifying and quantifying the impact of each factor on house prices.
- Consideration of external factors like economic conditions and interest rates.
The proposed approach uses multiple linear regression, a widely used statistical modelling technique, to analyze a large dataset of house sales and identify statistical relationships between property features and sale prices.
- Develop a robust multiple linear regression model for accurate house price prediction.
- Identify significant factors influencing property value in the specific market.
- Provide insights into regional housing market dynamics.
Dataset Description
The analysis utilizes the King County House Sales dataset, comprising over 21,500 records and 20 distinct features. Spanning house sales from May 2014 to May 2015, the dataset offers a comprehensive snapshot of the housing market.
Key Columns
Constraints and Considerations
Data Preparation
We import the necessary libraries and clean the data in the following ways:
- checking the data and null values
- deleting the columns with null values
- checking for non-numeric columns
- checking for duplicates
- creating the necessary columns
- checking for outliers using the box plot and deleting the outliers
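The cleaning steps listed above can be sketched as follows. This is a minimal illustration on a small synthetic table, not the notebook's actual code: the values are made up, the derived `house_age` column is a hypothetical example of "creating the necessary columns", and the 1.5 × IQR rule is assumed as the numeric equivalent of reading outliers off a box plot.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the house sales data (all values are made up).
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5, 6],
    "price": [300000.0, 450000.0, 450000.0, 520000.0, 410000.0, 380000.0, 7000000.0],
    "sqft_living": [1200, 1800, 1800, 2100, 1600, 1500, 9000],
    "yr_built": [1990, 2001, 2001, 1975, 2010, 1988, 2005],
    "yr_renovated": [np.nan, 0.0, 0.0, np.nan, 0.0, 2010.0, np.nan],
})

# Inspect null counts and non-numeric columns before changing anything.
null_counts = raw.isna().sum()
non_numeric = raw.select_dtypes(exclude="number").columns.tolist()

def clean(df: pd.DataFrame, target: str = "price") -> pd.DataFrame:
    """Apply the cleaning steps in order and return a new DataFrame."""
    out = df.copy()
    # Drop every column that contains at least one null value.
    out = out.dropna(axis=1)
    # Remove exact duplicate rows.
    out = out.drop_duplicates()
    # Create a derived column (hypothetical example: age of the house in 2015).
    out["house_age"] = 2015 - out["yr_built"]
    # Remove outliers in the target with the 1.5 * IQR rule used by box plots.
    q1, q3 = out[target].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out[target].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[keep].reset_index(drop=True)

cleaned = clean(raw)
print(null_counts)
print(cleaned)
```

Dropping whole columns that contain nulls is the simplest option; imputing values would preserve more features at the cost of extra assumptions.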
Exploratory Data Analysis
We perform exploratory data analysis (EDA) to understand the data better and to discover patterns and trends using univariate, bivariate and multivariate analysis.
We will use descriptive statistics and visualizations to summarize the main characteristics of the data and examine the relationships between the features and the target variable.
We will also check the distribution and correlation of the variables and identify any potential problems or opportunities for the analysis.
Univariate analysis examines a single variable at a time. We focus on the summary statistics of the target variable, price, to help us understand the distribution and skewness of house prices.
The histogram shows that the distribution of house price is positively skewed suggesting that while most houses are concentrated around lower prices, there are some properties with significantly higher prices.
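A positive skew can also be confirmed numerically. The sketch below uses synthetic log-normal prices as a stand-in for the real `price` column (an assumption made only so the example runs on its own) and computes the summary statistics, the sample skewness, and the histogram bin counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed prices (log-normal); NOT the real King County data.
price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000), name="price")

# Summary statistics: count, mean, std, min, quartiles, max.
summary = price.describe()

# Sample skewness: values above 0 indicate a longer right tail.
skewness = price.skew()

# Bin counts behind a 30-bin histogram of price.
counts, bin_edges = np.histogram(price, bins=30)

print(summary)
print(f"skewness: {skewness:.2f}")
```

For a right-skewed variable the mean sits above the median, which is a quick check that needs no plot.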
We perform bivariate analysis to examine the relationship between the target variable, price, and the other numeric and continuous features in the data. We use scatter plots, which show the direction, strength, and shape of the relationship between two numeric variables.
The scatter plots show a positive relationship between most of the independent variables and the price of a house. This means that houses with higher values for these variables tend to be more expensive.
In this section, we perform multivariate analysis to examine the relationship between the target variable, price, and multiple features in the data. We use a heatmap to visualize the correlation matrix of the features and see how they are related to each other and to the price.
In the heatmap, positive correlations are shown in shades of red and negative correlations in shades of blue. We note that bathrooms and sqft_living are highly positively correlated.
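The correlation matrix behind such a heatmap can be computed directly. The following is a sketch on synthetic data: the column names `price`, `sqft_living` and `bathrooms` come from the text, `bedrooms` is an assumed extra column, and the 0.7 threshold for "highly correlated" is an illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
# Synthetic features with built-in dependence; NOT the real data.
sqft_living = rng.normal(2000, 600, n)
bathrooms = sqft_living / 900 + rng.normal(0, 0.5, n)
bedrooms = sqft_living / 700 + rng.normal(0, 1.2, n)   # assumed column
price = 250 * sqft_living + 20000 * bathrooms + rng.normal(0, 80000, n)

df = pd.DataFrame({
    "price": price,
    "sqft_living": sqft_living,
    "bathrooms": bathrooms,
    "bedrooms": bedrooms,
})

# Pearson correlation matrix; this is the table a heatmap colours in.
corr = df.corr()

# Correlation of each feature with the target, strongest first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)

# Pairs of predictors whose absolute correlation exceeds the threshold.
threshold = 0.7  # illustrative cut-off
predictors = [c for c in df.columns if c != "price"]
high_pairs = [
    (a, b)
    for i, a in enumerate(predictors)
    for b in predictors[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(price_corr)
print(high_pairs)
```

Strongly correlated predictor pairs are worth flagging before multiple regression, since they make individual coefficients harder to interpret.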
Regression Modelling
For simple linear regression we use the single column with the strongest correlation to price; this will also be our baseline model for the multiple linear regression.
- Checking for correlation
From the correlation matrix, sqft_living has the highest correlation with price. We therefore use sqft_living as the exogenous variable and price as the endogenous variable, and plot the two against each other using a scatter plot.
From this we can see that the relationship between the two variables is approximately linear, satisfying the first of the four LINE assumptions (Linearity, Independence, Normality, Equal variance).
- building the model
We build the model and interpret its results.
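A simple linear regression of price on sqft_living can be sketched with plain NumPy least squares. The notebook's own modelling code is not shown here, so the library choice, the synthetic data, and the "true" slope of 280 are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Synthetic data with a known linear relationship; NOT the real data.
sqft_living = rng.uniform(500, 4000, n)
price = 50000 + 280 * sqft_living + rng.normal(0, 100000, n)

# Design matrix: a column of ones for the intercept plus the predictor.
X = np.column_stack([np.ones(n), sqft_living])

# Ordinary least squares estimate of [intercept, slope].
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = coef

# Fitted values, residuals, and the coefficient of determination.
fitted = X @ coef
residuals = price - fitted
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"intercept: {intercept:.0f}, slope: {slope:.1f}, R^2: {r_squared:.3f}")
```

The slope is read as the expected change in price for one additional square foot of living space, and R² as the share of price variance the single predictor explains.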
We plot the residuals to understand where the model performs best and where it performs poorly.
The plots give the same information as the model summary: the residuals are not normally distributed. We attempt to address this by using multiple linear regression.
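Normality of residuals can also be checked with a number rather than a plot. The sketch below computes the skewness, excess kurtosis and Jarque-Bera statistic by hand; whether the notebook used this particular test is an assumption, and the two residual samples are synthetic.

```python
import numpy as np

def jarque_bera(residuals):
    """Return (skewness, excess kurtosis, Jarque-Bera statistic)."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    n = r.size
    m2 = np.mean(r ** 2)                       # second central moment
    skew = np.mean(r ** 3) / m2 ** 1.5
    excess_kurt = np.mean(r ** 4) / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    return skew, excess_kurt, jb

rng = np.random.default_rng(3)
# Synthetic residuals: one roughly normal sample, one clearly right-skewed.
normal_resid = rng.normal(0, 1, 2000)
skewed_resid = rng.exponential(1.0, 2000) - 1.0

skew_n, kurt_n, jb_n = jarque_bera(normal_resid)
skew_s, kurt_s, jb_s = jarque_bera(skewed_resid)

# Under normality the statistic follows a chi-square distribution with
# 2 degrees of freedom, whose 5% critical value is about 5.99.
print(f"normal-like: skew={skew_n:.2f}, JB={jb_n:.1f}")
print(f"skewed:      skew={skew_s:.2f}, JB={jb_s:.1f}")
```

A statistic far above 5.99 is evidence against normally distributed residuals, which matches what a skewed residual histogram or a bent Q-Q plot shows visually.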
For multiple linear regression, we use more than one predictor variable to predict price.
Our baseline for this model is the simple linear regression above. We then clean our data.
The image above is a heatmap of the cleaned data
- we build the model, fit it and interpret the results
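Fitting a multiple regression and comparing it with the single-predictor baseline might look like the sketch below. It uses NumPy least squares on synthetic data; `house_age` is a hypothetical predictor and the coefficients used to generate the data are made up.

```python
import numpy as np

def fit_ols(features, target):
    """Least-squares fit with an intercept; returns (coefficients, R^2)."""
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return coef, r2

rng = np.random.default_rng(4)
n = 1000
# Synthetic predictors and target; NOT the real data.
sqft_living = rng.uniform(500, 4000, n)
bathrooms = sqft_living / 900 + rng.normal(0, 0.5, n)
house_age = rng.uniform(0, 100, n)            # hypothetical predictor
price = (40000 + 250 * sqft_living + 20000 * bathrooms
         - 800 * house_age + rng.normal(0, 90000, n))

# Baseline: one predictor. Multiple: three predictors.
_, r2_baseline = fit_ols(sqft_living, price)
features = np.column_stack([sqft_living, bathrooms, house_age])
coef_multi, r2_multi = fit_ols(features, price)

# Adjusted R^2 penalises the number of predictors k.
k = features.shape[1]
adj_r2_multi = 1 - (1 - r2_multi) * (n - 1) / (n - k - 1)

print(f"baseline R^2: {r2_baseline:.3f}")
print(f"multiple R^2: {r2_multi:.3f}, adjusted: {adj_r2_multi:.3f}")
```

In-sample R² never decreases when predictors are added to a nested model, so adjusted R² is the fairer number to compare against the baseline.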
- we check for normality
From the diagram above we can see that the errors are not normally distributed; we therefore check the other assumptions to evaluate the model.
- plotting the model
- independence of errors
We compute the model's predicted y values and calculate the residuals from them.
This shows where our model works best.
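One common numeric check for independence of errors is the Durbin-Watson statistic, computed from the residuals in their row order. It is shown here as an assumed example of such a check, not necessarily the one used in the notebook, and both residual series are synthetic.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(5)
n = 2000

# Independent residuals: the statistic should be close to 2.
independent = rng.normal(0, 1, n)

# Positively autocorrelated residuals (AR(1) with coefficient 0.9):
# the statistic should fall well below 2.
noise = rng.normal(0, 1, n)
correlated = np.empty(n)
correlated[0] = noise[0]
for t in range(1, n):
    correlated[t] = 0.9 * correlated[t - 1] + noise[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)

print(f"independent: {dw_independent:.2f}")
print(f"correlated:  {dw_correlated:.2f}")
```

The statistic ranges from 0 to 4; values near 2 suggest no first-order autocorrelation. For cross-sectional house sales the result depends on how the rows are ordered, for example by sale date.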
- evaluating the model
Our MSE and RMSE are high, which we attribute to outliers, nonlinear relationships, heteroscedasticity and overfitting. We will build another model to remedy these factors.
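For reference, MSE and RMSE are computed from the prediction errors as in the sketch below. The four actual and predicted prices are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

def mse_rmse(y_true, y_pred):
    """Return (mean squared error, root mean squared error)."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = np.mean(err ** 2)
    return mse, np.sqrt(mse)

# Made-up actual and predicted prices, for illustration only.
y_true = np.array([300000.0, 450000.0, 520000.0, 410000.0])
y_pred = np.array([320000.0, 430000.0, 500000.0, 450000.0])

mse, rmse = mse_rmse(y_true, y_pred)
print(f"MSE:  {mse:.0f}")    # errors of 20k, 20k, 20k, 40k give 7e8
print(f"RMSE: {rmse:.2f}")   # square root of 7e8, about 26457.51
```

RMSE is expressed in the same units as price, so whether a value is "high" is best judged relative to typical house prices; MSE is in squared units and grows quickly with a few large errors.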
From the three models built, we advise potential buyers or sellers to consider model 3 when determining the price of a house. The factor that most affects the price of a house is the square footage of living space (sqft_living); sellers should also consider increasing the number of bathrooms during renovations.
1. Find more features that home buyers often value highly to add to the model.
2. Correlate the information from this model with that of models for other states.