diff --git a/Project Description.pdf b/Project Description.pdf new file mode 100644 index 00000000..78931adc Binary files /dev/null and b/Project Description.pdf differ diff --git a/README.md b/README.md index 5dd0f84d..ef0aa521 100644 --- a/README.md +++ b/README.md @@ -1,285 +1,95 @@ -# Phase 2 Project Description -Another module down - you're almost half way there! - -![awesome](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-2-project-v2-3/main/halfway-there.gif) - -All that remains in Phase 2 is to put your newfound data science skills to use with a large project! - -In this project description, we will cover: - -* Project Overview: the project goal, audience, and dataset -* Deliverables: the specific items you are required to produce for this project -* Grading: how your project will be scored -* Getting Started: guidance for how to begin working +## Project Description +### Analysis of House Sales in a King's County ## Project Overview +Primetime Realtors situated in the heart of a North Western County acts as the conduit for transforming home ownership aspirations into tangible realities. Committed to unwavering excellence and employing data-driven methodologies, the agency aspires to lead the way in achieving optimal pricing and facilitating successful real estate endeavors. Its overarching objective is to surpass traditional limitations by leveraging technology and analytical insights to revolutionize the real estate landscape as we perceive it. -For this project, you will use multiple linear regression modeling to analyze house sales in a northwestern county. - -### Business Problem - -It is up to you to define a stakeholder and business problem appropriate to this dataset. - -If you are struggling to define a stakeholder, we recommend you complete a project for a real estate agency that helps homeowners buy and/or sell homes. 
A business problem you could focus on for this stakeholder is the need to provide advice to homeowners about how home renovations might increase the estimated value of their homes, and by what amount. - -### The Data - -This project uses the King County House Sales dataset, which can be found in `kc_house_data.csv` in the data folder in this assignment's GitHub repository. The description of the column names can be found in `column_names.md` in the same folder. As with most real world data sets, the column names are not perfectly described, so you'll have to do some research or use your best judgment if you have questions about what the data means. - -It is up to you to decide what data from this dataset to use and how to use it. If you are feeling overwhelmed or behind, we recommend you **ignore** some or all of the following features: - -* `date` -* `view` -* `sqft_above` -* `sqft_basement` -* `yr_renovated` -* `zipcode` -* `lat` -* `long` -* `sqft_living15` -* `sqft_lot15` - -### Key Points - -* **Your goal in regression modeling is to yield findings to support relevant recommendations. Those findings should include a metric describing overall model performance as well as at least two regression model coefficients.** As you explore the data and refine your stakeholder and business problem definitions, make sure you are also thinking about how a linear regression model adds value to your analysis. "The assignment was to use linear regression" is not an acceptable answer! You can also use additional statistical techniques other than linear regression, so long as you clearly explain why you are using each technique. - -* **You should demonstrate an iterative approach to modeling.** This means that you must build multiple models. Begin with a basic model, evaluate it, and then provide justification for and proceed to a new model. After you finish refining your models, you should provide 1-3 paragraphs in the notebook discussing your final model. 
- -* **Data visualization and analysis are no longer explicit project requirements, but they are still very important.** In Phase 1, your project stopped earlier in the CRISP-DM process. Now you are going a step further, to modeling. Data visualization and analysis will help you build better models and tell a better story to your stakeholders. - -## Deliverables - -There are three deliverables for this project: - -* A **non-technical presentation** -* A **Jupyter Notebook** -* A **GitHub repository** - -The deliverables requirements are almost the same as in the Phase 1 Project, and you can review those extended descriptions [here](https://github.com/learn-co-curriculum/dsc-phase-1-project-v2-3#deliverables). In general, everything is the same except the "Data Visualization" and "Data Analysis" requirements have been replaced by "Modeling" and "Regression Results" requirements. - -### Non-Technical Presentation - -Recall that the non-technical presentation is a slide deck presenting your analysis to ***business stakeholders***, and should be presented live as well as submitted in PDF form on Canvas. - -We recommend that you follow this structure, although the slide titles should be specific to your project: - -1. Beginning - - Overview - - Business and Data Understanding -2. Middle - - **Modeling** - - **Regression Results** -3. End - - Recommendations - - Next Steps - - Thank you - -Make sure that your discussion of modeling and regression results is geared towards a non-technical audience! Assume that their prior knowledge of regression modeling is minimal. You don't need to explain how linear regression works, but you should explain why linear regression is useful for the problem context. Make sure you translate any metrics or coefficients into their plain language implications. - -The graded elements for the non-technical presentation are the same as in [Phase 1](https://github.com/learn-co-curriculum/dsc-phase-1-project-v2-3#deliverables). 
- -### Jupyter Notebook - -Recall that the Jupyter Notebook is a notebook that uses Python and Markdown to present your analysis to a ***data science audience***. You will submit the notebook in PDF format on Canvas as well as in `.ipynb` format in your GitHub repository. - -The graded elements for the Jupyter Notebook are: - -* Business Understanding -* Data Understanding -* Data Preparation -* **Modeling** -* **Regression Results** -* Code Quality - -### GitHub Repository - -Recall that the GitHub repository is the cloud-hosted directory containing all of your project files as well as their version history. - -The requirements are the same as in [Phase 1](https://github.com/learn-co-curriculum/dsc-phase-1-project-v2-3#github-repository), except for the required sections in the `README.md`. - -For this project, the `README.md` file should contain: - -* Overview -* Business and Data Understanding - * Explain your stakeholder audience here -* **Modeling** -* **Regression Results** -* Conclusion - -Just like in Phase 1, the `README.md` file should be the bridge between your non technical presentation and the Jupyter Notebook. It should not contain the code used to develop your analysis, but should provide a more in-depth explanation of your methodology and analysis than what is described in your presentation slides. - -## Grading - -***To pass this project, you must pass each project rubric objective.*** The project rubric objectives for Phase 2 are: - -1. Attention to Detail -2. Statistical Communication -3. Data Preparation Fundamentals -4. Linear Modeling - -### Attention to Detail - -Just like in Phase 1, this rubric objective is based on your completion of checklist items. 
***In Phase 2, you need to complete 70% (7 out of 10) or more of the checklist elements in order to pass the Attention to Detail objective.*** - -**NOTE THAT THE PASSING BAR IS HIGHER IN PHASE 2 THAN IT WAS IN PHASE 1!** - -The standard will increase with each Phase, until you will be required to complete all elements to pass Phase 5 (Capstone). - -#### Exceeds Objective - -80% or more of the project checklist items are complete - -#### Meets Objective (Passing Bar) - -70% of the project checklist items are complete - -#### Approaching Objective - -60% of the project checklist items are complete - -#### Does Not Meet Objective - -50% or fewer of the project checklist items are complete - -### Statistical Communication - -Recall that communication is one of the key data science "soft skills". In Phase 2, we are specifically focused on Statistical Communication. We define Statistical Communication as: - -> Communicating **results of statistical analyses** to diverse audiences via writing and live presentation - -Note that this is the same as in Phase 1, except we are replacing "basic data analysis" with "statistical analyses". - -High-quality Statistical Communication includes rationale, results, limitations, and recommendations: - -* **Rationale:** Explaining why you are using statistical analyses rather than basic data analysis - * For example, why are you using regression coefficients rather than just a graph? - * What about the problem or data is suitable for this form of analysis? - * For a data science audience, this includes your reasoning for the changes you applied while iterating between models. -* **Results:** Describing the overall model metrics and feature coefficients - * You need at least one overall model metric (e.g. r-squared or RMSE) and at least two feature coefficients. - * For a business audience, make sure you connect any metrics to real-world implications. You do not need to get into the details of how linear regression works. 
- * For a data science audience, you don't need to explain what a metric is, but make sure you explain why you chose that particular one. -* **Limitations:** Identifying the limitations and/or uncertainty present in your analysis - * This could include p-values/alpha values, confidence intervals, assumptions of linear regression, missing data, etc. - * In general, this should be more in-depth for a data science audience and more surface-level for a business audience. -* **Recommendations:** Interpreting the model results and limitations in the context of the business problem - * What should stakeholders _do_ with this information? - -#### Exceeds Objective - -Communicates the rationale, results, limitations, and specific recommendations of statistical analyses - -> See above for extended explanations of these terms. - -#### Meets Objective (Passing Bar) - -Successfully communicates the results of statistical analyses without any major errors - -> The minimum requirement is to communicate the _results_, meaning at least one overall model metric (e.g. r-squared or RMSE) as well as at least two feature coefficients. See the Approaching Objective section for an explanation of what a "major error" means. - -#### Approaching Objective - -Communicates the results of statistical analyses with at least one major error - -> A major error means that some aspect of your explanation is fundamentally incorrect. For example, if a feature coefficient is negative and you say that an increase in that feature results in an increase of the target, that would be a major error. Another example would be if you say that the feature with the highest coefficient is the "most statistically significant" while ignoring the p-value. 
One more example would be reporting a coefficient that is not statistically significant, rather than saying "no statistically significant linear relationship was found" - -> "**If a coefficient's t-statistic is not significant, don't interpret it at all.** You can't be sure that the value of the corresponding parameter in the underlying regression model isn't really zero." _DeVeaux, Velleman, and Bock (2012), Stats: Data and Models, 3rd edition, pg. 801_. Check out [this website](https://web.ma.utexas.edu/users/mks/statmistakes/TOC.html) for extensive additional examples of mistakes using statistics. - -> The easiest way to avoid making a major error is to have someone double-check your work. Reach out to peers on Slack and ask them to confirm whether your interpretation makes sense! - -#### Does Not Meet Objective - -Does not communicate the results of statistical analyses - -> It is not sufficient to just display the entire results summary. You need to pull out at least one overall model metric (e.g. r-squared, RMSE) and at least two feature coefficients, and explain what those numbers mean. -### Data Preparation Fundamentals +## Business Problem -We define this objective as: +In our role at a real estate agency, we are examining data from the King County House Sales dataset to advise our agency on strategies to boost home values in King County through renovations. Our goal is to identify the most impactful renovation factors that can enhance a home's value. By pinpointing these factors, our agency can effectively guide homeowners in maximizing their profits when selling their homes. -> Applying appropriate **preprocessing** and feature engineering steps to tabular data in preparation for statistical modeling -The two most important components of preprocessing for the Phase 2 project are: +## The Data -* **Handling Missing Values:** Missing values may be present in the features you want to use, either encoded as `NaN` or as some other value such as `"?"`. 
Before you can build a linear regression model, make sure you identify and address any missing values using techniques such as dropping or replacing data. -* **Handling Non-Numeric Data:** A linear regression model needs all of the features to be numeric, not categorical. For this project, ***be sure to pick at least one non-numeric feature and try including it in a model.*** You can identify that a feature is currently non-numeric if the type is `object` when you run `.info()` on your dataframe. Once you have identified the non-numeric features, address them using techniques such as ordinal or one-hot (dummy) encoding. +This project uses the King County House Sales dataset, which can be found in kc_house_data.csv in the data folder in this assignment's GitHub repository. The description of the column names can be found in column_names.md in the same folder. As with most real-world data sets, the column names are not perfectly described, so you'll have to do some research or use your best judgment if you have questions about what the data means. -There is no single correct way to handle either of these situations! Use your best judgement to decide what to do, and be sure to explain your rationale in the Markdown of your notebook. +### Data Cleaning -Feature engineering is encouraged but not required for this project. +Next, we chose to clean our data by dropping unnecessary columns that would not help us achieve the result we were aiming for. 
These columns included: -#### Exceeds Objective +id, +date, +waterfront, +view, +grade, +sqft_above, +sqft_basement, +yr_renovated, +zipcode, +lat, +long, +sqft_living15, +sqft_lot15 -Goes above and beyond with data preparation, such as feature engineering or merging in outside datasets +After dropping these columns, the remaining columns that we experimented with were: -> One example of feature engineering could be using the `date` feature to create a new feature called `season`, which represents whether the home was sold in Spring, Summer, Fall, or Winter. +Price, +bedrooms, +bathrooms, +sqft_living, +sqft_lot, +floors, +condition, +Waterfronts -> One example of merging in outside datasets could be finding data based on ZIP Code, such as household income or walkability, and joining that data with the provided CSV. + +After examining the data types of each column, we identified the only non-numeric column as "condition." To handle this, we transformed the "condition" column using one-hot encoding, which split it into subcategories: cond_avg, cond_fair, cond_good, cond_poor, and cond_verygood, all of which were converted to float data types. -#### Meets Objective (Passing Bar) +Further, we categorized the remaining columns into separate arrays based on whether they were continuous or categorical variables. We then analyzed each array individually: for categorical variables, we generated histograms, while for continuous variables, we created a scatter matrix to explore their relationships further. -Successfully prepares data for modeling, including converting at least one non-numeric feature into ordinal or binary data and handling missing data as needed -> As a reminder, you can identify the non-numeric features by calling `.info()` on the dataframe and looking for type `object`. 
+## Data Preparation -> Your final model does not necessarily need to include any features that were originally non-numeric, but you need to demonstrate your ability to handle this type of data. -#### Approaching Objective +To ensure the integrity of our target variable, "price," we employed the train-test split method to normalize it, separating it from the original dataframe and assigning it to "y," while designating the remaining variables as "X". -Prepares some data successfully, but is unable to utilize non-numeric data +Following the normalization of the target variable, we constructed a heatmap to visualize the correlations between all columns and the target variable, "price." This heatmap revealed that "sqft_living" exhibited the highest correlation. To validate this correlation further, we utilized cross-validation, which demonstrated a minimal discrepancy of about 0.01 between the training and validation scores, indicating a robust model. -> If you simply subset the dataframe to only columns with type `int64` or `float64`, your model will run, but you will not pass this objective. +Lastly, in the preparation phase, we crafted two models: one showcasing the variables ranked by their correlation strength with the target variable and another displaying a scatter matrix of all variables. This approach aimed to highlight any non-normal distributions, thereby guiding the normalization process for the modeling stage. -#### Does Not Meet Objective +![Image](https://github.com/paulngatia/Dsc-Phase-2-Project-v2-3/blob/main/image.png) -Does not prepare data for modeling +## Modelling -### Linear Modeling +Our baseline model, a simple linear regression, established an initial understanding of the relationship between the square footage of living space and house prices. The simple linear model turned out to be a poor fit for this analysis. 
The model's R2 value was extremely low, indicating that numerous other factors affect house prices. These results showed that another iteration was needed. In the second model, one-hot encoding was introduced to convert the categorical variables to a binary representation, which is suitable for machine learning algorithms. -According to [Kaggle's 2020 State of Data Science and Machine Learning Survey](https://www.kaggle.com/kaggle-survey-2020), linear and logistic regression are the most popular machine learning algorithms, used by 83.7% of data scientists. They are small, fast models compared to some of the models you will learn later, but have limitations in the kinds of relationships they are able to learn. +The second model certainly improved on the baseline model, meaning that at least one of the independent variables has a significant effect on the dependent variable. To further improve the model, a third iteration was needed. In this model, a log transformation was introduced to reduce skewness. This iteration reduced skewness while simultaneously increasing the R2, making the model highly significant. -In this project you are required to use linear regression as the primary statistical analysis, although you are free to use additional statistical techniques as appropriate. +The final model, a multiple linear regression, incorporated additional features to improve the predictive power. Evaluation metrics such as R-squared and F-statistic were used to assess the model's performance and significance. In this model, insignificant variables were dropped to achieve the highest predictive power among the models. This model's summary displayed the highest R2 value and the best fit of residuals to a normal distribution. 
-#### Exceeds Objective +![Actual vs Predicted](https://github.com/paulngatia/Dsc-Phase-2-Project-v2-3/blob/main/image-2.png) -Goes above and beyond in the modeling process, such as recursive feature selection -#### Meets Objective (Passing Bar) +## Conclusion -Successfully builds a baseline model as well as at least one iterated model, and correctly extracts insights from a final model without any major errors + The multiple linear regression model is better than the simple linear regression model because it has a slightly higher R-squared value (0.498 vs. 0.473). A higher R-squared value indicates that the multiple linear regression model explains a larger proportion of the variance in the target variable (house prices) than the simple linear regression model. -> We are looking for you to (1) create a baseline model, (2) iterate on that model, making adjustments that are supported by regression theory or by descriptive analysis of the data, and (3) select a final model and report on its metrics and coefficients + Furthermore, the analysis revealed that the most influential predictor of home prices was indeed "sqft_living", aligning with our initial expectations. -> Ideally you would include written justifications for each model iteration, but at minimum the iterations must be _justifiable_ + Moving forward, we believe it would be beneficial to incorporate additional predictive factors beyond "sqft_living" and "sqft_lot." By expanding the scope of factors considered, we can gain deeper insights into other home renovation aspects that contribute to enhancing home values. -> For an explanation of "major errors", see the description below -#### Approaching Objective -Builds multiple models with at least one major error -> The number one major error to avoid is including the target as one of your features. 
For example, if the target is `price` you should NOT make a "price per square foot" feature, because that feature would not be available if you didn't already know the price. -> Other examples of major errors include: using a target other than `price`, attempting only simple linear regression (not multiple linear regression), dropping multiple one-hot encoded columns without explaining the resulting baseline, or using a unique identifier (`id` in this dataset) as a feature. -#### Does Not Meet Objective -Does not build multiple linear regression models -## Getting Started -Please start by reviewing the contents of this project description. If you have any questions, please ask your instructor ASAP. -Next, you will need to complete the [***Project Proposal***](#project_proposal) which must be reviewed by your instructor before you can continue with the project. -Here are some suggestions for creating your GitHub repository: -1. Fork the [Phase 2 Project Repository](https://github.com/learn-co-curriculum/dsc-phase-2-project-v2-3), clone it locally, and work in the `student.ipynb` file. Make sure to also add and commit a PDF of your presentation to your repository with a file name of `presentation.pdf`. -2. Or, create a new repository from scratch by going to [github.com/new](https://github.com/new) and copying the data files from the Phase 2 Project Repository into your new repository. - - Recall that you can refer to the [Phase 1 Project Template](https://github.com/learn-co-curriculum/dsc-project-template) as an example structure - - This option will result in the most professional-looking portfolio repository, but can be more complicated to use. So if you are getting stuck with this option, try forking the project repository instead -## Summary -This is your first modeling project! Take what you have learned in Phase 2 to create a project with a more sophisticated analysis than you completed in Phase 1. 
You will build on these skills as we move into the predictive machine learning mindset in Phase 3. You've got this! diff --git a/git b/git new file mode 100644 index 00000000..e69de29b diff --git a/image-1.png b/image-1.png new file mode 100644 index 00000000..6daa870b Binary files /dev/null and b/image-1.png differ diff --git a/image-2.png b/image-2.png new file mode 100644 index 00000000..c02eab9b Binary files /dev/null and b/image-2.png differ diff --git a/image.png b/image.png new file mode 100644 index 00000000..6daa870b Binary files /dev/null and b/image.png differ diff --git a/student.ipynb b/student.ipynb index d3bb34af..187124a6 100644 --- a/student.ipynb +++ b/student.ipynb @@ -4,14 +4,2055 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Final Project Submission\n", + "# **Phase 2 Project** \n", "\n", - "Please fill out:\n", - "* Student name: \n", - "* Student pace: self paced / part time / full time\n", - "* Scheduled project review date/time: \n", - "* Instructor name: \n", - "* Blog post URL:\n" + " #### GROUP 6\n", + " 1. PAUL NGATIA\n", + " 2. HARRY ATULAH\n", + " 3. PASCALIA MAIGA\n", + " 4. RONNY KABIRU " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Business Understanding\n", + " ## Overview\n", + "Primetime Realtors situated in the heart of a North Western County acts as the conduit for transforming homeownership aspirations into tangible realities. Committed to unwavering excellence and employing data-driven methodologies, the agency aspires to lead the way in achieving optimal pricing and facilitating successful real estate endeavors. Its overarching objective is to surpass traditional limitations by leveraging technology and analytical insights to revolutionize the real estate landscape as we perceive it.\n", + " ## Business Problem \n", + " The housing market in King County displays diverse trends and influences on property prices. 
Yet, a more thorough understanding of these factors is necessary to assist real estate stakeholders in making informed choices. The main challenge is to construct a reliable pricing model capable of accurately forecasting house prices using multiple features. This model should offer insights into the most influential features on property prices, empowering Primetime Realtors to make well-informed decisions.\n", + " ## Objectives\n", + "- To identify key features that significantly influence house prices in the northwestern county.\n", + "- To develop an optimal pricing strategy using a robust multiple linear regression model.\n", + "- To help improve the agency's annual revenue by leveraging the analytical insights and pricing strategy developed through this project.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Importing the Data" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " id date price bedrooms bathrooms sqft_living \\\n", + "0 7129300520 10/13/2014 221900.0 3 1.00 1180 \n", + "1 6414100192 12/9/2014 538000.0 3 2.25 2570 \n", + "2 5631500400 2/25/2015 180000.0 2 1.00 770 \n", + "3 2487200875 12/9/2014 604000.0 4 3.00 1960 \n", + "4 1954400510 2/18/2015 510000.0 3 2.00 1680 \n", + "\n", + " sqft_lot floors waterfront view ... grade sqft_above \\\n", + "0 5650 1.0 NaN NONE ... 7 Average 1180 \n", + "1 7242 2.0 NO NONE ... 7 Average 2170 \n", + "2 10000 1.0 NO NONE ... 6 Low Average 770 \n", + "3 5000 1.0 NO NONE ... 7 Average 1050 \n", + "4 8080 1.0 NO NONE ... 8 Good 1680 \n", + "\n", + " sqft_basement yr_built yr_renovated zipcode lat long \\\n", + "0 0.0 1955 0.0 98178 47.5112 -122.257 \n", + "1 400.0 1951 1991.0 98125 47.7210 -122.319 \n", + "2 0.0 1933 NaN 98028 47.7379 -122.233 \n", + "3 910.0 1965 0.0 98136 47.5208 -122.393 \n", + "4 0.0 1987 0.0 98074 47.6168 -122.045 \n", + "\n", + " sqft_living15 sqft_lot15 \n", + "0 1340 5650 \n", + "1 1690 7639 \n", + "2 2720 8062 \n", + "3 1360 5000 \n", + "4 1800 7503 \n", + "\n", + "[5 rows x 21 columns]" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# importing the necessary libraries\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "import statsmodels.api as sm \n", + "import warnings\n", + "%matplotlib inline\n", + "sns.set_style('dark')\n", + "warnings.filterwarnings('ignore')\n", + "data = pd.read_csv('data/kc_house_data.csv')\n", + "\n", + "data.head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loading the Data Into a Data Frame" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\"data/kc_house_data.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data 
Understanding & Preparation\n", + "Here we will explore the data to get a better understanding of its state, then decide on the steps we need to take to clean it. We will begin by defining some functions for the following tasks:\n", + "- getting the shape of the data\n", + "- getting data info\n", + "- simple check for missing data\n", + "- duplicates\n", + "- descriptive stats\n", + "\n", + "We will then group together the helper functions under a new function that explores the data for the above attributes. " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# function for shape of the data \n", + "\n", + "def data_shape(data):\n", + " \"\"\"Simple function to provide the shape of the data\"\"\"\n", + " out = print(f\"The DataFrame has:\\n\\t* {data.shape[0]} rows\\n\\t* {data.shape[1]} columns\", '\\n')\n", + "\n", + " return out" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# function for info of the data \n", + "\n", + "def data_info(data):\n", + " \"\"\"Simple function to provide the info of the data\"\"\"\n", + " out = print(data.info(), '\\n')\n", + " \n", + " return out" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# function to check for missing values\n", + "\n", + "def data_missing(data):\n", + " \"\"\"Identify if the data has missing values\"\"\"\n", + " # identify if data has missing values(data.isnull().any())\n", + " # empty list to store missing values\n", + " missing = []\n", + " for i in data.isnull().any():\n", + " # add the bool values to empty list \n", + " missing.append(i)\n", + " # convert list to set (if any column has a missing value, the set contains True)\n", + " missing_set = set(missing)\n", + " if (True not in missing_set):\n", + " out = print(\"The Data has no missing values\", '\\n')\n", + " else:\n", + " out = print(f\"The Data has 
missing values.\", '\\n')\n", + "\n", + " return out" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# function to check for duplicates\n", + "\n", + "def identify_duplicates(data):\n", + " \"\"\"Simple function to identify any duplicates\"\"\"\n", + " # identify the duplicates (dataframename.duplicated() , can add .sum() to get total count)\n", + " # empty list to store Bool results from duplicated\n", + " duplicates = []\n", + " for i in data.duplicated():\n", + " duplicates.append(i)\n", + " # identify if there is any duplicates. (If there is any we expect a True value in the list duplicates)\n", + " duplicates_set = set(duplicates) \n", + " if (len(duplicates_set) == 1):\n", + " out = print(\"The Data has no duplicates\", '\\n')\n", + " else:\n", + " no_true = 0\n", + " for val in duplicates:\n", + " if (val == True):\n", + " no_true += 1\n", + " # percentage of the data represented by duplicates \n", + " duplicates_percentage = np.round(((no_true / len(data)) * 100), 3)\n", + " out = print(f\"The Data has {no_true} duplicated rows.\\nThis constitutes {duplicates_percentage}% of the data set.\", '\\n')" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# function to check for duplicates on the ID column\n", + "\n", + "def unique_column_duplicates(data, column):\n", + " \"\"\"handling duplicates in unique column\"\"\"\n", + " # empty list to store the duplicate bools\n", + " duplicates = []\n", + " for i in data[column].duplicated():\n", + " duplicates.append(i)\n", + " \n", + " # identify if there are any duplicates\n", + " duplicates_set = set(duplicates)\n", + " if (len(duplicates_set) == 1):\n", + " out = print(f\"The column {column.title()} has no duplicates\", '\\n')\n", + " else:\n", + " no_true = 0\n", + " for val in duplicates:\n", + " if (val == True):\n", + " no_true += 1\n", + " # percentage of the data represented by 
duplicates \n", + " duplicates_percentage = np.round(((no_true / len(data)) * 100), 3)\n", + " out = print(f\"The column {column.title()} has {no_true} duplicated rows.\\nThis constitutes {duplicates_percentage}% of the data set.\", '\\n')" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# helper function to check for descriptive stats\n", + "\n", + "def data_describe(data):\n", + " \"\"\"Simple function to check the descriptive values of the data\"\"\"\n", + " out = print(data.describe(), '\\n')\n", + " \n", + " return out" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# overall function for data understanding\n", + "\n", + "def explore(data):\n", + " \"\"\"Group of functions to explore data \"\"\"\n", + " out1 = data_shape(data)\n", + " out2 = data_info(data)\n", + " out3 = data_missing(data)\n", + " out4 = identify_duplicates(data)\n", + " out5 = unique_column_duplicates(data, 'id')\n", + " out6 = data_describe(data)\n", + " \n", + " return out1, out2, out3, out4, out5, out6" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the checks below, the data has:\n", + "- 21597 houses sold\n", + "- 21 house features: 6 string variables and 15 numeric variables. The `date` column is encoded as a string instead of datetime, while `sqft_basement` is encoded as a string instead of float. Both will be corrected\n", + "- Missing values, which will be investigated and treated\n", + "- No fully duplicated rows. However, the `id` column, which should contain unique identifiers, has 177 duplicated values. These will be checked\n", + "- From the descriptive stats, there is also potential for some outliers, which will need to be verified. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Duplicated Id Column\n", + "Id column duplicates to be dropped in the process below" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The Data has no duplicates \n", + "\n", + "None\n", + "The column Id has 177 duplicated rows.\n", + "This constitutes 0.82% of the data set. \n", + "\n", + "None\n" + ] + } + ], + "source": [ + "print(identify_duplicates(data))\n", + "print(unique_column_duplicates(data, 'id')) " + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The column Id has no duplicates \n", + "\n" + ] + } + ], + "source": [ + "def drop_duplicates(df, column):\n", + " \"\"\"function to drop duplicated rows\"\"\"\n", + " \n", + " df.drop_duplicates(subset=column, keep='first', inplace=True)\n", + " confirmation = unique_column_duplicates(data, 'id')\n", + " return confirmation\n", + "\n", + "drop_duplicates(data, 'id') " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 21597 entries, 0 to 21596\n", + "Data columns (total 21 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 21597 non-null int64 \n", + " 1 date 21597 non-null object \n", + " 2 price 21597 non-null float64\n", + " 3 bedrooms 21597 non-null int64 \n", + " 4 bathrooms 21597 non-null float64\n", + " 5 sqft_living 21597 non-null int64 \n", + " 6 sqft_lot 21597 non-null int64 \n", + " 7 floors 21597 non-null float64\n", + " 8 waterfront 19221 non-null object \n", + " 9 view 21534 non-null object \n", + " 10 condition 21597 non-null object \n", + " 11 grade 21597 non-null object \n", + " 12 sqft_above 21597 non-null int64 \n", 
+ " 13 sqft_basement 21597 non-null object \n", + " 14 yr_built 21597 non-null int64 \n", + " 15 yr_renovated 17755 non-null float64\n", + " 16 zipcode 21597 non-null int64 \n", + " 17 lat 21597 non-null float64\n", + " 18 long 21597 non-null float64\n", + " 19 sqft_living15 21597 non-null int64 \n", + " 20 sqft_lot15 21597 non-null int64 \n", + "dtypes: float64(6), int64(9), object(6)\n", + "memory usage: 3.5+ MB\n" + ] + } + ], + "source": [ + "df.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our data has 6 variables as objects i.e Date, waterfront,view, condition, grade,sqft_basement" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dealing with the Missing Values" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "id 0.00\n", + "date 0.00\n", + "price 0.00\n", + "bedrooms 0.00\n", + "bathrooms 0.00\n", + "sqft_living 0.00\n", + "sqft_lot 0.00\n", + "floors 0.00\n", + "waterfront 11.00\n", + "view 0.29\n", + "condition 0.00\n", + "grade 0.00\n", + "sqft_above 0.00\n", + "sqft_basement 0.00\n", + "yr_built 0.00\n", + "yr_renovated 17.79\n", + "zipcode 0.00\n", + "lat 0.00\n", + "long 0.00\n", + "sqft_living15 0.00\n", + "sqft_lot15 0.00\n", + "dtype: float64\n" + ] + } + ], + "source": [ + "# Find the percentage of missing values in each column\n", + "percantage_msng_values = df.isnull().sum()* 100 / len (df)\n", + "\n", + "print(percantage_msng_values.round(2))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our data frame has some missing values in waterfront, view and year renovated columns constituting to 11%, 0.29% and 17.79% respectively." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[nan 'NO' 'YES']\n" + ] + } + ], + "source": [ + "# look for unique values in the \"waterfront\" column\n", + "unique_values_wf = df['waterfront'].unique()\n", + "\n", + "print(unique_values_wf)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total number of houses where 'waterfront' is 'YES': 146\n" + ] + } + ], + "source": [ + "# look for total number of houses where 'waterfront' is 'YES'\n", + "total_waterfront_yes = len(df[df['waterfront'] == 'YES'])\n", + "\n", + "# Print the total number\n", + "print(\"Total number of houses where 'waterfront' is 'YES':\", total_waterfront_yes)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace null values in 'waterfront' with 'NO'\n", + "df['waterfront'].fillna('NO', inplace=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `waterfront` column contains three unique values: NaN, 'NO', and 'YES'. 
We filled the NaN entries with 'NO', on the assumption that a missing entry means the house has no waterfront." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['NONE' nan 'GOOD' 'EXCELLENT' 'AVERAGE' 'FAIR']\n" + ] + } + ], + "source": [ + "# look for unique values in the view column\n", + "unique_values_view = df['view'].unique()\n", + "\n", + "print(unique_values_view)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total number of houses with null values under the 'view' column: 63\n" + ] + } + ], + "source": [ + "# search for how many houses have null values under the 'view' column\n", + "null_values_view = df['view'].isnull().sum()\n", + "\n", + "# Print the total number\n", + "print(\"Total number of houses with null values under the 'view' column:\", null_values_view)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total number of houses where 'view' is 'NONE': 19422\n" + ] + } + ], + "source": [ + "# find out how many houses have no view\n", + "total_view_none = len(df[df['view'] == 'NONE'])\n", + "\n", + "# Print the total number\n", + "print(\"Total number of houses where 'view' is 'NONE':\", total_view_none)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace NaN values in the 'view' column with 'NONE'\n", + "df['view'].fillna('NONE', inplace=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `view` column takes the values 'NONE', 'FAIR', 'AVERAGE', 'GOOD', and 'EXCELLENT'. There are also 63 missing entries, shown as 'nan'. 
After careful consideration, we opted to replace these 'nan' entries with 'NONE', under the assumption that these particular houses do not possess a view of notable quality." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[ 0. 1934. 1940. 1944. 1945. 1946. 1948. 1950. 1951. 1953. 1954. 1955.\n", + " 1956. 1957. 1958. 1959. 1960. 1962. 1963. 1964. 1965. 1967. 1968. 1969.\n", + " 1970. 1971. 1972. 1973. 1974. 1975. 1976. 1977. 1978. 1979. 1980. 1981.\n", + " 1982. 1983. 1984. 1985. 1986. 1987. 1988. 1989. 1990. 1991. 1992. 1993.\n", + " 1994. 1995. 1996. 1997. 1998. 1999. 2000. 2001. 2002. 2003. 2004. 2005.\n", + " 2006. 2007. 2008. 2009. 2010. 2011. 2012. 2013. 2014. 2015. nan]\n" + ] + } + ], + "source": [ + "# checking unique values in the yr_renovated column\n", + "unique_values_renovation = df['yr_renovated'].unique()\n", + "unique_values_renovation.sort()\n", + "\n", + "print(unique_values_renovation) " + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Number of houses with 'yr_renovated' greater than 0: 744\n" + ] + } + ], + "source": [ + "# Checking the number of houses with 'yr_renovated' greater than 0 \n", + "houses_with_renovations = df[df['yr_renovated'] > 0]\n", + "\n", + "# Print the number of houses with 'yr_renovated' greater than 0\n", + "print(\"\\nNumber of houses with 'yr_renovated' greater than 0:\", len(houses_with_renovations)) " + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace null values in 'yr_renovated' with 0\n", + "df['yr_renovated'].fillna(0, inplace=True) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The yr_renovated column contains years ranging from 1934 to 2015, but it also includes entries with the 
values 0 and 'nan'. We inferred that these entries represent houses that have never undergone renovation. Subsequently, we replaced the 'nan' entries with 0 to reflect this assumption." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Average' 'Very Good' 'Good' 'Poor' 'Fair']\n" + ] + } + ], + "source": [ + "# Checking the condition column\n", + "unique_values_condition = df['condition'].unique()\n", + "\n", + "print(unique_values_condition) " + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['10 Very Good' '11 Excellent' '12 Luxury' '13 Mansion' '3 Poor' '4 Low'\n", + " '5 Fair' '6 Low Average' '7 Average' '8 Good' '9 Better']\n" + ] + } + ], + "source": [ + "# Checking the grade column\n", + "unique_values_grade = df['grade'].unique()\n", + "\n", + "# Sorting the unique values in ascending order\n", + "unique_values_grade.sort()\n", + "\n", + "print(unique_values_grade) " + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "id 0.0\n", + "date 0.0\n", + "price 0.0\n", + "bedrooms 0.0\n", + "bathrooms 0.0\n", + "sqft_living 0.0\n", + "sqft_lot 0.0\n", + "floors 0.0\n", + "waterfront 0.0\n", + "view 0.0\n", + "condition 0.0\n", + "grade 0.0\n", + "sqft_above 0.0\n", + "sqft_basement 0.0\n", + "yr_built 0.0\n", + "yr_renovated 0.0\n", + "zipcode 0.0\n", + "lat 0.0\n", + "long 0.0\n", + "sqft_living15 0.0\n", + "sqft_lot15 0.0\n", + "dtype: float64\n" + ] + } + ], + "source": [ + "missing_values_percent = df.isnull().sum() * 100 / len(df)\n", + "\n", + "print(missing_values_percent) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our data has been thoroughly cleaned, and there are no longer any missing values 
present." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.0\n", + "1 400.0\n", + "2 0.0\n", + "3 910.0\n", + "4 0.0\n", + "5 1530.0\n", + "6 ?\n", + "7 0.0\n", + "Name: sqft_basement, dtype: object" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Check the 'sqft_basement' column\n", + "df['sqft_basement'].head(8) " + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of '?' in 'sqft_basement' column: 454\n", + "Percentage of '?' in 'sqft_basement' column: 2.1\n" + ] + } + ], + "source": [ + "# check number of '?' in 'sqft_basement' column\n", + "count_question_marks = df['sqft_basement'].str.count('\\?').sum()\n", + "\n", + "# Calculate the percentage of '?' in 'sqft_basement' column\n", + "percentage_question_marks = (count_question_marks / len(df['sqft_basement'])) * 100\n", + "\n", + "print(\"Number of '?' in 'sqft_basement' column:\", count_question_marks)\n", + "print(\"Percentage of '?' in 'sqft_basement' column:\", percentage_question_marks.round(2))" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "# Drop rows with '?' 
in the 'sqft_basement' column\n", + "# (.copy() makes the filtered result an independent DataFrame before it is modified)\n", + "df = df[df['sqft_basement'] != '?'].copy()\n", + "\n", + "# Convert the 'sqft_basement' column to float\n", + "df['sqft_basement'] = df['sqft_basement'].astype(float)" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "# Remove duplicates in 'df' in place\n", + "df.drop_duplicates(inplace=True) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Conversion\n", + "Here we convert several features to their correct data types." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "# Convert the 'yr_renovated' column to integers\n", + "df['yr_renovated'] = df['yr_renovated'].astype(int) " + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "# Convert the 'date' column to datetime format\n", + "df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y') " + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "# 
Convert View, condition and grade into representative numbers for easier Exploratory analysis.\n", + "df['view'] = df['view'].map({'NONE': 1,'FAIR': 2,'AVERAGE': 3,'GOOD': 4,'EXCELLENT': 5}).astype(float)\n", + "df['condition'] = df['condition'].map({'Poor': 1,'Fair': 2,'Average': 3,'Good': 4,'Very Good': 5}).astype(float)\n", + "df['grade'] = df['grade'].map({'3 Poor': 1,'4 Low': 2,'5 Fair': 3,'6 Low Average': 4,'7 Average': 5,'8 Good': 6,'9 Better': 7,'10 Very Good': 8,'11 Excellent': 9,'12 Luxury': 10,'13 Mansion': 11}).astype(float) " + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 21143 entries, 0 to 21596\n", + "Data columns (total 21 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 21143 non-null int64 \n", + " 1 date 21143 non-null datetime64[ns]\n", + " 2 price 21143 non-null float64 \n", + " 3 bedrooms 21143 non-null int64 \n", + " 4 bathrooms 21143 non-null float64 \n", + " 5 sqft_living 21143 non-null int64 \n", + " 6 sqft_lot 21143 non-null int64 \n", + " 7 floors 21143 non-null float64 \n", + " 8 waterfront 21143 non-null object \n", + " 9 view 21143 non-null float64 \n", + " 10 condition 21143 non-null float64 \n", + " 11 grade 21143 non-null float64 \n", + " 12 sqft_above 21143 non-null int64 \n", + " 13 sqft_basement 21143 non-null float64 \n", + " 14 yr_built 21143 non-null int64 \n", + " 15 yr_renovated 21143 non-null int32 \n", + " 16 zipcode 21143 non-null int64 \n", + " 17 lat 21143 non-null float64 \n", + " 18 long 21143 non-null float64 \n", + " 19 sqft_living15 21143 non-null int64 \n", + " 20 sqft_lot15 21143 non-null int64 \n", + "dtypes: datetime64[ns](1), float64(9), int32(1), int64(9), object(1)\n", + "memory usage: 3.5+ MB\n" + ] + } + ], + "source": [ + "df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + 
"outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " id date price bedrooms bathrooms sqft_living \\\n", + "0 7129300520 2014-10-13 221900.0 3 1.00 1180 \n", + "1 6414100192 2014-12-09 538000.0 3 2.25 2570 \n", + "2 5631500400 2015-02-25 180000.0 2 1.00 770 \n", + "3 2487200875 2014-12-09 604000.0 4 3.00 1960 \n", + "4 1954400510 2015-02-18 510000.0 3 2.00 1680 \n", + "\n", + " sqft_lot floors waterfront view ... grade sqft_above sqft_basement \\\n", + "0 5650 1.0 NO 1.0 ... 5.0 1180 0.0 \n", + "1 7242 2.0 NO 1.0 ... 5.0 2170 400.0 \n", + "2 10000 1.0 NO 1.0 ... 4.0 770 0.0 \n", + "3 5000 1.0 NO 1.0 ... 5.0 1050 910.0 \n", + "4 8080 1.0 NO 1.0 ... 6.0 1680 0.0 \n", + "\n", + " yr_built yr_renovated zipcode lat long sqft_living15 \\\n", + "0 1955 0 98178 47.5112 -122.257 1340 \n", + "1 1951 1991 98125 47.7210 -122.319 1690 \n", + "2 1933 0 98028 47.7379 -122.233 2720 \n", + "3 1965 0 98136 47.5208 -122.393 1360 \n", + "4 1987 0 98074 47.6168 -122.045 1800 \n", + "\n", + " sqft_lot15 \n", + "0 5650 \n", + "1 7639 \n", + "2 8062 \n", + "3 5000 \n", + "4 7503 \n", + "\n", + "[5 rows x 21 columns]" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exploratory Data Analysis" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Raw Price Distribution" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# View price distribution\n", + "plt.figure(figsize=(8,5))\n", + "dist=sns.distplot(df[\"price\"])\n", + "dist.set_title(\"Price distribution\")\n", + "plt.xlabel('Price in USD')\n", + "plt.title('Distribution of Price')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The distribution is positively skewed, meaning that the mean is much greater than the median which should not be the case for a normal distribution.\n", + "We have to normalize the distribution." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Normalized Price Distribution" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Normalizing Price Distribution\n", + "fig, ax = plt.subplots(figsize=(11, 8))\n", + "\n", + "sns.distplot(np.log(df['price']), bins = 100) \n", + "\n", + "ax.set_xlabel(\"Normalized Price\")\n", + "ax.set_ylabel(\"Number of houses\")\n", + "ax.set_title(\"Normalized house prices distribution\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The distribution above now assumes an inverted bell shape. We have normalized the distribution and the mean and median now fall in a central range of the data points." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## `How do different features influence house pricing?`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### *Landscape Features*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The landscape feature is the waterfront.\n", + "#### Waterfront" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "146" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "total_waterfront_yes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Most properties do not have a waterfront. 
Only 146 do.*" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "
MZQZ7rVq1MGnSJHTq1Al//PEH8vPz8fXXXwMovqRvaWrUqIG1a9dWbKVERPRSygz2Pn36aP/foEEDSYshIqLXV2awPzlRiYiI3gw855yISGYY7EREMsNgJyKSGQY7EZHMMNiJiGSGwU5EJDMMdiIimWGwExHJDIOdiEhmyjzzlIhe3Q8/bEVy8i1Dl1EpPHkdFi6cZ+BKKocmTezxwQcfStI2g51IQsnJt5B8LR6Na6kNXYrB1YUCAKBJu2DgSgzvdq60lzFnsBNJrHEtNb5wyDZ0GVSJLD9fR9L2OcZORCQzDHYiIplhsBMRyUyFj7GrVCp88cUXuHPnDgoLC+Hj41NisQ4iIpJWhQf7/v37YWFhgVWrVkGpVGL48OEMdiIiParwYB84cCAGDBig3TY2lnZaDxERlVThwV6zZk0AQG5uLj755BNMnz69ondBREQvIMnB07t372LcuHEYOnQoXF1dpdgFERHpUOE99j///BMTJ06Ev78/unXrVtHNExFRGSq8x75p0yZkZ2dj48aN8Pb2hre3N/Lz8yt6N0REpEOF99j9/Pzg5+dX0c0SEdFL4glKREQyw2AnIpIZBjsRkcww2ImIZIbBTkQkM1xog0hCWVlKKHOMJV9Ygd4s/8sxhmWWUrL22WMnIpIZ9tiJJGRhYYk6eclcGo9KWH6+DowsLCVrnz12IiKZYbATEckMg52ISGYY7EREMsNgJyKSGQY7EZHMMNiJiGSGwU5EJDMMdiIimeGZpxUkOvoojh49YugynrNw4TyD7Ld3737o1au3QfZNVNVJ1mP//fff4e3tLVXzRESkgyQ99q1bt2L//v2oXr26FM1XSr169TZ4D9Xd3fW52xYvXmGASojIkCTpsTdu3BjBwcFSNE0v8NZbb5fYfvttGwNVQkSGJEmwDxgwACYmHL7Xtw0btpbYDg7eYqBKiMiQOCtGZoyNjQGwt05UlbFbLTMtW7YGwLF1oqqMPXYiIpmRrMfeqFEj7N69W6rmid4Yt3O55ikAPCxUAADqmgkDV2J4t3ON0UTC9jkUQyShJk3sDV1CpfEw+RYAwLIhX5MmkPazwWAnktAHH3xo6BIqjSdnQfP4j/Q4xk5EJDOy6LH/8MNWJP//P/Oquievg6GuEVOZNGlizx4zVUmyCPbk5Fv44+o1qGtYGboUg1NoTAEA8Sl/GrgSwzLOyzR0CUQGI4tgBwB1DSs8bjXY0GVQJVE98aChSyAyGI6xExHJDIOdiEhmZDEUk5WlhHHeA/75TVrGeQ+QlWVs6DKIDII9diIimZFFj93CwhK3c9Q8eEpa1RMPwsLC0tBlEBkEe+xERDIjix47UDxvmWPsgEL1GAAgTKvOsoSlKZ7HXt/QZRAZhCyCnRda+suTM0+bvPuugSsxtPr8XPx/0dFHcfToEUOXUWnOiu7du5/B1yeWmiyCnaeN/4UXWqLKysKCZ4briyyCnYh069Wrt+x7qFQSD54SEckMg52ISGYUQogKX6dKo9Fg0aJFuHr1KszMzLBs2TLY2trqvL9KpUZWVl5Fl6FXle0AlaEPHFaFA1REhmZtXbvU2yXpsUdERKCwsBC7du3CZ599hsDAQCl2Q6WwsLDiQSqiKk6Sg6fnzp2Dk5MTAKB9+/a4fPmyFLupVHiAiogqC0l67Lm5uahVq5Z229jYGEVFRVLsioiIniFJsNeqVQuPHj3Sbms0GpiYcGYlEZE+SBLsDg4OiImJAQBcvHgRLVq0kGI3RERUCkm60f369cOJEyfg4eEBIQSWL18uxW6IiKgUkkx3LC85THckItI3vU53JCIiw2GwExHJDIOdiEhmKsUYOxERVRz22ImIZIbBTkQkMwx2IiKZYbATEckMg52ISGYY7EREMsNgJyKSGQZ7KU6dOoWOHTvi7t272tuCgoKwZ8+eV24zICAAaWlpFVHec9RqNSZNmgRPT088fPjwtdo6c+YMEhMTX/r+q1evxogRI3Dq1KlX3mdaWhqOHj3
6yo+nFxs3bhzi4+MBAIWFhejQoQO+++477c+9vLx0vuflfW8iIiIwZMgQbN++/ZXrLSgowM8///xS95Xye/UmY7DrYGpqinnz5qGizt+aP38+GjZsWCFtPSsjIwNKpRI7d+5E3bp1X6ut8PBw3L9//6Xvf/DgQWzfvh1dunR55X3GxcXh/Pnzr/x4ejFHR0ecPXsWQPHqZo6OjoiKigJQHKJ3795Fq1atSn1sed+bY8eOYebMmRg3btwr15uRkfHSwS7l9+pNxtUvdOjatSs0Gg127NgBLy+vEj/7/vvvceDAAZiYmKBjx474/PPPtT/LzMzE2LFjcfDgQSgUCixevBjdu3fH9u3bsWjRIrz11luYP38+lEolAMDPzw+//fabttft7+8PMzMz+Pn5YePGjXj33Xfh6uqqbX///v348ccfYWZmhiZNmmDJkiVYsGABkpOT4e/vjyVLlgAAEhISsGbNGmzevBn/+c9/sGXLFuzfvx9nz57Fvn374Ovri0WLFqGgoABZWVnw9fXF22+/jdjYWPzxxx9o1qwZfv/9d2zbtg1GRkbo0KEDZs2aheDgYFy4cAF5eXlwdHTEvXv38M9//hNTpkzB+vXrYWpqilGjRsHa2hpr1qxBtWrVYGFhgeXLlyMhIQFbt26FqakpUlNTMXjwYEyZMgVbtmxBfn4+/u///g99+vTRw7tbtXTv3h0bN27ExIkTER0dDXd3dwQFBSEnJwd//PEHOnfuDLVaDX9/f9y7dw9KpRI9e/bEtGnTSrw3jRo1wrJlywBA+55euXIFQUFBMDU1hbu7O6KiohAfHw9LS0vMnDkT9vb2sLe3x/jx4zF//nwUFRVBoVDAz88PrVq1Qv/+/eHg4ICkpCTUq1cPwcHB2LRpE27cuIH169fj448/BiD990p2BD0nLi5OTJ8+XWRmZoo+ffqIpKQksWrVKhEeHi4SExOFm5ubKCwsFBqNRvj6+oqjR4+WePynn34qTp8+LQoKCsTgwYOFSqUSXl5e4saNG+LLL78UO3bsEEIIkZSUJDw8PMSdO3fE+PHjhRBCeHl5CTc3NyGEEJ6eniInJ0fbbmZmpujbt6/2toCAABESEiJSUlKEu7v7c89jyJAhIj8/X8yePVu8//77IiMjQ6xcuVJER0eLEydOiLi4OCGEEOfOnRMTJkwQQggxZ84cER0dLZRKpRg0aJDIy8sTQggxa9Yscfz4cbFu3TqxdOlS7T5cXFxEfn6+iIuLE66urkIIITQajXBxcRH37t0TQgixbds2ERgYKOLi4sSgQYOESqUSjx49Eg4ODkIIIcLDw8WqVate4x2jF1Gr1WLAgAFCo9GIESNGiIKCAhEYGCgOHTok1q5dKw4cOCBSUlLE7t27hRBC5Ofni86dOwshSr437u7u4vr160IIIXbv3i2+/vrrEu+7EH99foQQomXLliIzM1MIIcS0adPEkSNHhBBCXLlyRQwfPlwIIUSrVq1EWlqaEEKI0aNHiwsXLuj8PEv1vZIj9thfwNLSEl988QXmzp0LBwcHAMCtW7fw3nvvwdTUFADQsWNHXL9+HS4uLtrHjRo1Cnv37kVGRgZ69+5dYlnAa9euIS4uDr/++isAIDs7Gw0bNkR+fj7i4+PRtGlTpKWlIT4+HrVr1y6xdmxKSgqaNWumva1Tp044fvw4nJ2dS63f0dERp06dwt27d+Hq6oqTJ0/i7NmzmDFjBpKTk/HNN98gLCwMCoXiuTVpb9++jczMTEyZMgUA8OjRI6SkpAAA7OzsSt3fk9uVSiVq1aqFBg0aaOv8+uuv4ezsjBYtWsDExAQmJiYwNzd/iXeBXpeRkRFatWqFmJgYWFtbw8zMDD179kRUVBQSExMxbtw4mJiY4NKlS4iLi0OtWrVQWFj4XDs3b97E4sWLAQAqlUr7fuv6PFhaWsLS0lL72E6dOgEAWrdujXv37mnvY2NjAwCwsbFBQUGBzuch1fdKjjjGXobevXvDzs4Oe/f
uBQDY29sjPj4eRUVFEELgzJkzz32wu3XrhoSEBISHh8PNza3Ez+zt7TFhwgSEhIRgzZo12j8He/XqhVWrVsHR0RGOjo5YtmwZ+vbtW+KxjRo1ws2bN5GXV7woyenTp3V+qQCgb9++2Lp1K1q2bAlHR0fs2LEDtra2MDU1xdq1azF06FCsWrUKXbp00R5LUCgUEEKgUaNGsLGxwffff4+QkBB4eXnhvffeA1AcFKV5crulpSVyc3O1Y/WnT59GkyZNtO2X9jiNRqPzedDr69GjBzZv3gwnJycAQIcOHXDlyhUAxcMqe/bsQe3atfHVV19h4sSJyM/PhxCixHtjZ2eHlStXIiQkBJ9//jl69eoFoOzPAwA0bdpUO86fkJCA+vXrAyjf50Gq75UcMdhfwvz587W9y5YtW2LQoEHw9PSEm5sb3nnnnec+KAqFAgMGDIBKpYKtrW2Jn3300Uf49ddf4e3tjcmTJ6N58+YAgP79++P8+fPo2rUrHB0dcfny5efGm62srDBt2jSMGzcOo0aNglKphKenp866n4xdOjo6olWrVrhz5w769+8PABg4cCACAgIwZswYnDx5Ujs2+d577yEoKAhKpRITJkyAt7c33N3dERMTow3nsigUCixbtgzTpk2Dh4cHfvvtN0ydOlXn/Vu0aIHIyEgcOHDgpdqn8uvevTvOnTunDWMzMzPUrl1b24vu1q0bYmJi4OHhgUWLFsHW1hb3798v8d4sWrQIc+bMwZgxY/DVV1+hZcuWL73/2bNn46effsLYsWOxaNEiBAQE6LxvvXr1oFKpsGrVqhK3S/W9kiNetpeISGbYYycikhkGOxGRzDDYiYhkhsFORCQzDHYiIplhsBMRyQyDnYhIZv4fYJ7r9fGUuwoAAAAASUVORK5CYII=", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Plot boxplot of waterfront feature\n", + "sns.boxplot(x = df['waterfront'], y = df['price'])\n", + "plt.title(\"Boxplot of waterfront feature vs. price\")\n", + "plt.ylabel(\"price in USD\")\n", + "plt.xlabel(None)\n", + "plt.xticks(np.arange(2), ('No view of waterfront', 'Waterfront view'))\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean price for a house with a waterfront is 1731020.07 USD\n", + "The mean price for a house without a waterfront is 532461.0 USD\n", + "Percentage of houses with waterfront is: 0.6716170836683536\n" + ] + } + ], + "source": [ + "waterfront_mean = df[df['waterfront'] == 'YES']['price'].mean()\n", + "no_waterfront_mean = df[df['waterfront'] == 'NO']['price'].mean()\n", + "print(f\"The mean price for a house with a waterfront is {round(waterfront_mean,2)} USD\")\n", + "print(f\"The mean price for a house without a waterfront is {round(no_waterfront_mean,2)} USD\")\n", + "print(f\"Percentage of houses with waterfront is: {len(df[df['waterfront'] == 'YES'])/len(df)*100}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Waterfront : No Waterfront = 3.25 : 1\n" + ] + } + ], + "source": [ + "# Mean Price Ratio of Houses with Waterfront : Houses without Waterfront\n", + "mpr_waterfront = waterfront_mean / no_waterfront_mean\n", + "print(f'Waterfront : No Waterfront = ',round(mpr_waterfront, 2),': 1')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Waterfront Feature Conclusion\n", + "Houses with waterfronts are significantly more pricy than those without. Those with waterfront are more than 3 times the price of those without waterfronts. 
However, only 0.67% of the properties have the waterfront feature." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### *House Features*" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',\n", + " 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',\n", + " 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',\n", + " 'lat', 'long', 'sqft_living15', 'sqft_lot15'],\n", + " dtype='object')" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# List the dataframe columns to identify the house features\n", + "df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# categorical variables\n", + "features = ['bedrooms', 'bathrooms', 'floors', 'view', 'grade', 'condition']\n", + "\n", + "# We are not using 'yr_built' and 'yr_renovated' due to many outliers and similar results across their distribution\n", + "\n", + "# plot boxplots\n", + "for feature in features:\n", + " sns.boxplot(x = df[feature], y = df['price'], whis=1.0, width=0.75, linewidth=0.9, saturation=1.0)\n", + " plt.title(f\"Boxplot of {feature} vs. price\")\n", + " plt.ylabel(\"price in USD\")\n", + " plt.xlabel(f\"{feature}\")\n", + " plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Bedrooms vs Price\n", + "Price tends to increase with the number of bedrooms. 8-bedroom houses appear to be the most preferred.\n", + "\n", + "### Bathrooms vs Price\n", + "Price generally increases with the number of bathrooms, although not in every instance.\n", + "\n", + "### Floors vs Price\n", + "Price increases with the number of floors up to 2.5 floors, after which prices begin to drop. Most buyers appear to settle for 2.5 floors.\n", + "\n", + "### View vs Price\n", + "Price increases with the quality of the view, with 5 (Excellent) being the most expensive and apparently the most purchased.\n", + "\n", + "### Grade vs Price\n", + "Price increases with grade. The highest grade (11) is the most expensive and apparently the most purchased.\n", + "\n", + "### Condition vs Price\n", + "No clear trend can be determined, as houses in Poor condition are pricier than houses in Very Good condition.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Heatmap" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(10, 6))\n", + "sns.heatmap(df[['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot']].corr(), annot=True, cmap='coolwarm')\n", + "plt.title('Correlation Heatmap')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The correlation heatmap shows the pairwise Pearson correlations between price and the features bedrooms, bathrooms, sqft_living, and sqft_lot. Features that correlate strongly with the target variable (price) are candidate predictors for the linear models, while strong correlations among the predictors themselves flag possible multicollinearity in the multiple linear regression model. The heatmap therefore serves as a guide for feature selection and model interpretation in the regression analysis." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Bar Graph" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(10, 6))\n", + "sns.barplot(x='condition', y='price', data=df)\n", + "plt.title('Mean Price by Condition')\n", + "plt.xlabel('Condition')\n", + "plt.ylabel('Price')\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The bar plot shows the mean price for each condition category, which is what seaborn's `barplot` estimates by default. It indicates which condition categories are associated with higher or lower prices, and so helps decide whether `condition` is worth including as a predictor and how to encode it as a categorical variable in the regression models." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Scatter Plot" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(12, 6))\n", + "plt.subplot(1, 2, 1)\n", + "sns.scatterplot(x='sqft_living', y='price', data=df)\n", + "plt.title('Price vs Square Footage of Living Space')\n", + "plt.subplot(1, 2, 2)\n", + "sns.scatterplot(x='sqft_lot', y='price', data=df)\n", + "plt.title('Price vs Square Footage of Lot')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The scatter plots of price against square footage of living space (`sqft_living`) and lot size (`sqft_lot`) provide insights for the simple and multiple linear regression models. They show how price relates to these predictors, helping assess linearity and identify outliers. Clear trends in these plots guide decisions on model complexity and feature engineering." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## MODELLING" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Linear Regression" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Baseline Model Performance:\n", + "Train RMSE: 261631.4910742736\n", + "Test RMSE: 263524.60505394044\n", + "Train MSE: 68451037121.74771\n", + "Test MSE: 69445217468.8353\n", + "Train R-squared: 0.49807283088908505\n", + "Test R-squared: 0.4733338519535445\n" + ] + } + ], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.metrics import mean_squared_error\n", + "from sklearn.metrics import r2_score\n", + "\n", + "# Defining features (X) and target variable (y)\n", + "X = df[['sqft_living']]\n", + "y = df['price']\n", + "\n", + "# Splitting the dataset into training and testing sets\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
random_state=42)\n", + "\n", + "# Build baseline linear regression model\n", + "baseline_model = LinearRegression()\n", + "baseline_model.fit(X_train, y_train)\n", + "\n", + "# Predictions\n", + "baseline_train_pred = baseline_model.predict(X_train)\n", + "baseline_test_pred = baseline_model.predict(X_test)\n", + "\n", + "# MSE\n", + "baseline_train_mse = mean_squared_error(y_train, baseline_train_pred)\n", + "baseline_test_mse = mean_squared_error(y_test, baseline_test_pred)\n", + "\n", + "# RMSE (square root of the MSE; avoids the `squared=False` argument, which is deprecated in recent scikit-learn releases)\n", + "baseline_train_rmse = baseline_train_mse ** 0.5\n", + "baseline_test_rmse = baseline_test_mse ** 0.5\n", + "\n", + "# R-squared\n", + "baseline_train_r2 = r2_score(y_train, baseline_train_pred)\n", + "baseline_test_r2 = r2_score(y_test, baseline_test_pred)\n", + "\n", + "print(\"Baseline Model Performance:\")\n", + "print(\"Train RMSE:\", baseline_train_rmse)\n", + "print(\"Test RMSE:\", baseline_test_rmse)\n", + "print(\"Train MSE:\", baseline_train_mse)\n", + "print(\"Test MSE:\", baseline_test_mse)\n", + "print(\"Train R-squared:\", baseline_train_r2)\n", + "print(\"Test R-squared:\", baseline_test_r2)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The baseline linear regression model exhibits moderate performance in predicting house prices based solely on square footage of living area. The training RMSE (approximately $261,631) and the test RMSE (approximately $263,525) are close to each other, which suggests that the model generalizes to unseen data without overfitting. In absolute terms, however, the typical error is large: roughly half of the mean price of a house without a waterfront ($532,461). The model's explanatory power is also limited, as evidenced by the training and test R-squared values of approximately 0.498 and 0.473, respectively. 
Further refinement and feature engineering are needed to improve the model's accuracy and capture more nuances in house price prediction."
+ ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: price R-squared: 0.498\n", + "Model: OLS Adj. R-squared: 0.498\n", + "Method: Least Squares F-statistic: 1.678e+04\n", + "Date: Wed, 01 May 2024 Prob (F-statistic): 0.00\n", + "Time: 13:54:00 Log-Likelihood: -2.3500e+05\n", + "No. Observations: 16914 AIC: 4.700e+05\n", + "Df Residuals: 16912 BIC: 4.700e+05\n", + "Df Model: 1 \n", + "Covariance Type: nonrobust \n", + "===============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "-------------------------------------------------------------------------------\n", + "const -4.66e+04 4959.944 -9.395 0.000 -5.63e+04 -3.69e+04\n", + "sqft_living 282.3650 2.180 129.546 0.000 278.093 286.637\n", + "==============================================================================\n", + "Omnibus: 11454.600 Durbin-Watson: 2.022\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 401978.016\n", + "Skew: 2.781 Prob(JB): 0.00\n", + "Kurtosis: 26.226 Cond. No. 5.61e+03\n", + "==============================================================================\n", + "\n", + "Notes:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 5.61e+03. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ] + } + ], + "source": [ + "# Add a constant term to the predictor variable for the intercept\n", + "X_train_with_const = sm.add_constant(X_train)\n", + "# Fit the linear regression model\n", + "model = sm.OLS(y_train, X_train_with_const)\n", + "results = model.fit()\n", + "# Displaying the summary table of regression results\n", + "print(results.summary())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This linear regression model predicts house prices based on square footage of living area. The model indicates that, on average, each additional square foot of living area is associated with an increase in price of approximately $282.37. The intercept term suggests that the price of a house with zero square feet of living area (which is not practically meaningful) is approximately -$46,600. The model explains about 49.8% of the variance in house prices in the training data, as indicated by the R-squared value. Additionally, the F-statistic (1.678e+04, p-value reported as 0.00) shows that the overall model is statistically significant. However, the strong skewness (2.78) and kurtosis (26.2) of the residuals indicate that they are far from normally distributed, which violates a model assumption. Because the model has a single predictor, multicollinearity cannot occur here; the large condition number (5.61e+03) most likely reflects the unscaled `sqft_living` values." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Plotting the scatter plot of the data points\n", + "plt.scatter(X_test, y_test, color='blue', label='Actual Prices')\n", + "\n", + "# Plotting the regression line\n", + "plt.plot(X_test, baseline_model.predict(X_test), color='red', label='Regression Line')\n", + "\n", + "# Adding labels and legend\n", + "plt.xlabel('Square Footage of Living Area')\n", + "plt.ylabel('Price')\n", + "plt.title('Regression Line for House Prices')\n", + "plt.legend()\n", + "\n", + "# Show plot\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Multiple Linear Regression" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", + "import pandas as pd\n", + "\n", + "def train_test(df, target, test_size=0.20, random_state=42):\n", + " '''\n", + " Takes a dataframe df and a target column name and returns the encoded, scaled train and test split\n", + " Default test size is 0.20 (20% of the rows), default random state is 42\n", + " '''\n", + " \n", + " # Drop rows with missing values\n", + " df = df.dropna()\n", + " \n", + " # Separating predictors (X) and target (y)\n", + " X = df.drop(target, axis=1)\n", + " y = df[target]\n", + " \n", + " # Creating train-test split\n", + " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)\n", + " \n", + " # Resetting indices to ensure alignment\n", + " X_train.reset_index(drop=True, inplace=True)\n", + " X_test.reset_index(drop=True, inplace=True)\n", + " y_train.reset_index(drop=True, inplace=True)\n", + " y_test.reset_index(drop=True, inplace=True)\n", + " \n", + " # Selecting categorical columns\n", + " categorical = 
X.select_dtypes(include=['object']).columns.tolist()\n", + " \n", + " # Instantiating OneHotEncoder object\n", + " ohe = OneHotEncoder(sparse_output=False, handle_unknown='error', drop='first')\n", + " \n", + " # Fitting the encoder on the train set, then transforming both train and test sets\n", + " X_train_ohe = ohe.fit_transform(X_train[categorical])\n", + " X_test_ohe = ohe.transform(X_test[categorical])\n", + "\n", + " # Get feature names for one-hot encoded columns\n", + " feature_names = []\n", + " for cat, categories in zip(categorical, ohe.categories_):\n", + " feature_names.extend([f\"{cat}_{val}\" for val in categories[1:]]) # Skip the first category\n", + " \n", + " # Placing column names onto new categorical columns and formatting as DataFrame\n", + " X_train_ohe_df = pd.DataFrame(X_train_ohe, columns=feature_names)\n", + " X_test_ohe_df = pd.DataFrame(X_test_ohe, columns=feature_names)\n", + " \n", + " # Combining categoricals with rest of data\n", + " X_train = pd.concat([X_train.select_dtypes(include=['number']), X_train_ohe_df], axis=1)\n", + " X_test = pd.concat([X_test.select_dtypes(include=['number']), X_test_ohe_df], axis=1)\n", + "\n", + " # Scale predictors into z-scores\n", + " # Fit the scaler on the training set only to avoid leaking test-set statistics\n", + " ss = StandardScaler()\n", + " X_train = pd.DataFrame(ss.fit_transform(X_train), columns=X_train.columns)\n", + " \n", + " # Apply the training-set scaling to the test set\n", + " X_test = pd.DataFrame(ss.transform(X_test), columns=X_test.columns)\n", + " \n", + " return X_train, X_test, y_train, y_test\n", + "\n", + "\n", + "# Split data into train and test (called at module level, after the function definition, so that it actually runs)\n", + "X_train, X_test, y_train, y_test = train_test(output, 'price')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot actual vs. 
predicted prices for a fitted model\n", + "def actual_vs_predicted(model, X_test, y_test):\n", + " \"\"\"\n", + " Plots the actual y vs the predicted y\n", + " \"\"\"\n", + " y_predicted = 
model.predict(X_test)\n", + " fig, ax = plt.subplots(figsize=(12,8))\n", + " ax.scatter(x=y_test, y=y_predicted)\n", + " ax.set_xlabel(\"Actual Price Values\")\n", + " ax.set_ylabel(\"Predicted Price Values\")\n", + " ax.set_title(\"Actual vs Predicted\")\n", + " \n", + " p1 = max(max(y_test), max(y_predicted))\n", + " p2 = min(min(y_test), min(y_predicted))\n", + " plt.plot([p1, p2], [p1, p2], 'b-')\n", + "\n", + "# Create linear regression model for price\n", + "model1 = LinearRegression()\n", + "model1.fit(X_train, y_train)\n", + "\n", + "# Plot actual vs predicted\n", + "actual_vs_predicted(model1, X_test, y_test)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#add constant to X_train\n", + "X_train = sm.add_constant(X_train)\n", + "#finding OLS for train data set\n", + "model1_ols = sm.OLS(y_train, X_train).fit()" ] }, { @@ -20,8 +2061,104 @@ "metadata": {}, "outputs": [], "source": [ - "# Your code here - remember to use markdown cells for comments as well!" + "model1_ols.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Findings" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "1. 
**Linear Regression:**\n", + " - The baseline linear regression model predicts house prices based solely on the square footage of living area.\n", + " - The model's performance metrics are as follows:\n", + " - Train RMSE: 261,631.49\n", + " - Test RMSE: 263,524.61\n", + " - Train MSE: 68,451,037,121.75\n", + " - Test MSE: 69,445,217,468.84\n", + " - Train R-squared: 0.498\n", + " - Test R-squared: 0.473\n", + " - The model exhibits moderate performance: it captures about 49.8% of the variance in house prices on the training set and 47.3% on the test set, and its train and test errors are close to each other.\n", + " - The regression results indicate that each additional square foot of living area is associated with an increase in price of approximately $282.37.\n", + " - However, the strong skewness and kurtosis of the residuals suggest that the normality assumption is violated. The large condition number reflects the unscaled predictor rather than multicollinearity, which cannot arise with a single predictor.\n", + "\n", + "2. **Multiple Linear Regression:**\n", + " - The multiple linear regression model is meant to extend the prediction by incorporating additional features besides square footage of living area.\n", + " - The reported R-squared value of 0.498 and the estimated `sqft_living` coefficient of approximately 282.37 are identical to the baseline model's training results.\n", + " - This suggests that the model was fitted on the same single-feature training split as the baseline, so these figures should be regenerated once the multi-feature split is applied.\n", + " - Multicollinearity among the added predictors still needs to be checked once the model is refitted." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " The multiple linear regression model is expected to improve on the simple linear regression model, since adding predictors never lowers the training R-squared. The figures reported so far do not yet demonstrate this, because 0.498 and 0.473 are the training and test R-squared values of the same single-feature baseline. 
A higher test R-squared value for the multiple linear regression model, measured on the same test set, would indicate that it explains a larger proportion of the variance in the target variable (house prices) than the simple linear regression model.\n", + "\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Recommendations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "1.\tExpanding Dataset Variety: Divide the dataset into categories based on property size (small, medium, large) to develop specialized models tailored to different types of properties. Increase dataset diversity by incorporating additional data on various property sizes or from neighboring counties.\n", + "2.\tOptimizing Property Pricing: Use the multiple linear regression model to fine-tune property pricing strategies. Draw on features like square footage, location (zipcode, latitude, longitude), and overall condition (grade, waterfront, view) to evaluate property values and establish competitive yet profitable listing prices.\n", + "3.\tFocusing on Key Property Attributes: Prioritize features that strongly affect property value, such as living space (sqft_living), bedroom and bathroom count, construction quality (grade), and proximity to amenities (waterfront, view). Emphasize these features in marketing materials to attract targeted buyer segments.\n", + "4.\tInvesting in Renovation and Upgrades: Identify properties with potential for value enhancement based on regression coefficients (e.g., sqft_above, sqft_basement). 
Consider strategic renovations and upgrades to maximize ROI and appeal to discerning buyers.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Model Improvement" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Based on the findings and results from the linear regression and multiple linear regression models, here are some necessary recommendations to improve the predictive accuracy and explanatory power of the models:\n", + "\n", + "1. **Feature Engineering:**\n", + " - Explore additional relevant features that might influence house prices, such as the number of bedrooms, bathrooms, location factors, amenities, and neighborhood characteristics.\n", + " - Conduct thorough data analysis and research to identify potential predictors that have a strong correlation with house prices.\n", + "\n", + "2. **Address Multicollinearity:**\n", + " - Investigate the presence of multicollinearity among the predictor variables, especially in the multiple linear regression model.\n", + " - Use techniques such as variance inflation factor (VIF) analysis to identify and mitigate multicollinearity by removing highly correlated predictors or employing dimensionality reduction techniques.\n", + "\n", + "3. **Model Assumptions:**\n", + " - Validate the assumptions of the regression models, including linearity, homoscedasticity, normality of residuals, and independence of errors.\n", + " - Apply appropriate transformations or adjustments to the data to meet these assumptions if necessary.\n", + "\n", + "4. **Regularization Techniques:**\n", + " - Implement regularization techniques like Ridge regression or Lasso regression to prevent overfitting and improve model generalization, especially in cases of high-dimensional data or multicollinearity." 
+ ] } ], "metadata": { @@ -40,7 +2177,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.4" + "version": "3.8.5" } }, "nbformat": 4,