diff --git a/Grey Modern Professional Business Project Presentation.pdf b/Grey Modern Professional Business Project Presentation.pdf new file mode 100644 index 00000000..eaf5457e Binary files /dev/null and b/Grey Modern Professional Business Project Presentation.pdf differ diff --git a/README.md b/README.md index 5dd0f84d..282c6cc5 100644 --- a/README.md +++ b/README.md @@ -1,285 +1,187 @@ -# Phase 2 Project Description +# Final Project Submission -Another module down - you're almost half way there! +Please fill out: +* Student name: Hellen Samuel,Calvine Dasilver,Sandra kiptum ,Jack Otieno,Salahudin +* Student pace: full time +* Scheduled project review date/time: +* Instructor name: NIKITA -![awesome](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-2-project-v2-3/main/halfway-there.gif) -All that remains in Phase 2 is to put your newfound data science skills to use with a large project! -In this project description, we will cover: -* Project Overview: the project goal, audience, and dataset -* Deliverables: the specific items you are required to produce for this project -* Grading: how your project will be scored -* Getting Started: guidance for how to begin working + # Demystifying House Sales Analysis with Regression Modeling in a Northwestern County -## Project Overview - -For this project, you will use multiple linear regression modeling to analyze house sales in a northwestern county. - -### Business Problem - -It is up to you to define a stakeholder and business problem appropriate to this dataset. - -If you are struggling to define a stakeholder, we recommend you complete a project for a real estate agency that helps homeowners buy and/or sell homes. A business problem you could focus on for this stakeholder is the need to provide advice to homeowners about how home renovations might increase the estimated value of their homes, and by what amount. 
- -### The Data - -This project uses the King County House Sales dataset, which can be found in `kc_house_data.csv` in the data folder in this assignment's GitHub repository. The description of the column names can be found in `column_names.md` in the same folder. As with most real world data sets, the column names are not perfectly described, so you'll have to do some research or use your best judgment if you have questions about what the data means. - -It is up to you to decide what data from this dataset to use and how to use it. If you are feeling overwhelmed or behind, we recommend you **ignore** some or all of the following features: - -* `date` -* `view` -* `sqft_above` -* `sqft_basement` -* `yr_renovated` -* `zipcode` -* `lat` -* `long` -* `sqft_living15` -* `sqft_lot15` - -### Key Points - -* **Your goal in regression modeling is to yield findings to support relevant recommendations. Those findings should include a metric describing overall model performance as well as at least two regression model coefficients.** As you explore the data and refine your stakeholder and business problem definitions, make sure you are also thinking about how a linear regression model adds value to your analysis. "The assignment was to use linear regression" is not an acceptable answer! You can also use additional statistical techniques other than linear regression, so long as you clearly explain why you are using each technique. - -* **You should demonstrate an iterative approach to modeling.** This means that you must build multiple models. Begin with a basic model, evaluate it, and then provide justification for and proceed to a new model. After you finish refining your models, you should provide 1-3 paragraphs in the notebook discussing your final model. - -* **Data visualization and analysis are no longer explicit project requirements, but they are still very important.** In Phase 1, your project stopped earlier in the CRISP-DM process. 
Now you are going a step further, to modeling. Data visualization and analysis will help you build better models and tell a better story to your stakeholders. - -## Deliverables - -There are three deliverables for this project: - -* A **non-technical presentation** -* A **Jupyter Notebook** -* A **GitHub repository** - -The deliverables requirements are almost the same as in the Phase 1 Project, and you can review those extended descriptions [here](https://github.com/learn-co-curriculum/dsc-phase-1-project-v2-3#deliverables). In general, everything is the same except the "Data Visualization" and "Data Analysis" requirements have been replaced by "Modeling" and "Regression Results" requirements. - -### Non-Technical Presentation - -Recall that the non-technical presentation is a slide deck presenting your analysis to ***business stakeholders***, and should be presented live as well as submitted in PDF form on Canvas. - -We recommend that you follow this structure, although the slide titles should be specific to your project: + ![alt text](image-14.png) -1. Beginning - - Overview - - Business and Data Understanding -2. Middle - - **Modeling** - - **Regression Results** -3. End - - Recommendations - - Next Steps - - Thank you - -Make sure that your discussion of modeling and regression results is geared towards a non-technical audience! Assume that their prior knowledge of regression modeling is minimal. You don't need to explain how linear regression works, but you should explain why linear regression is useful for the problem context. Make sure you translate any metrics or coefficients into their plain language implications. - -The graded elements for the non-technical presentation are the same as in [Phase 1](https://github.com/learn-co-curriculum/dsc-phase-1-project-v2-3#deliverables). - -### Jupyter Notebook - -Recall that the Jupyter Notebook is a notebook that uses Python and Markdown to present your analysis to a ***data science audience***. 
You will submit the notebook in PDF format on Canvas as well as in `.ipynb` format in your GitHub repository. - -The graded elements for the Jupyter Notebook are: - -* Business Understanding -* Data Understanding -* Data Preparation -* **Modeling** -* **Regression Results** -* Code Quality - -### GitHub Repository - -Recall that the GitHub repository is the cloud-hosted directory containing all of your project files as well as their version history. - -The requirements are the same as in [Phase 1](https://github.com/learn-co-curriculum/dsc-phase-1-project-v2-3#github-repository), except for the required sections in the `README.md`. - -For this project, the `README.md` file should contain: - -* Overview -* Business and Data Understanding - * Explain your stakeholder audience here -* **Modeling** -* **Regression Results** -* Conclusion - -Just like in Phase 1, the `README.md` file should be the bridge between your non technical presentation and the Jupyter Notebook. It should not contain the code used to develop your analysis, but should provide a more in-depth explanation of your methodology and analysis than what is described in your presentation slides. - -## Grading - -***To pass this project, you must pass each project rubric objective.*** The project rubric objectives for Phase 2 are: - -1. Attention to Detail -2. Statistical Communication -3. Data Preparation Fundamentals -4. Linear Modeling - -### Attention to Detail - -Just like in Phase 1, this rubric objective is based on your completion of checklist items. ***In Phase 2, you need to complete 70% (7 out of 10) or more of the checklist elements in order to pass the Attention to Detail objective.*** - -**NOTE THAT THE PASSING BAR IS HIGHER IN PHASE 2 THAN IT WAS IN PHASE 1!** +## Project Overview +Business Understanding +The real estate market is a vital component of regional economic health and stability. 
This project delves into the dynamics of house sales in a specific northwestern county in the United States, aiming to unravel the key factors influencing property valuation in this area. -The standard will increase with each Phase, until you will be required to complete all elements to pass Phase 5 (Capstone). -#### Exceeds Objective +## Problem Statements -80% or more of the project checklist items are complete +
  • What are the most significant factors influencing house prices in this northwestern county? -#### Meets Objective (Passing Bar) +
  • How can we quantify the relationship between these factors and property value? -70% of the project checklist items are complete +
  • Can we develop a reliable model to predict house prices based on relevant characteristics? -#### Approaching Objective -60% of the project checklist items are complete +## Challenges -#### Does Not Meet Objective +1. Real estate data complexity, encompassing diverse property features and local market trends. +2. Accurately identifying and quantifying the impact of each factor on house prices. +3. Consideration of external factors like economic conditions and interest rates. -50% or fewer of the project checklist items are complete +## Proposed Solutions -### Statistical Communication +Utilizing multiple linear regression, a powerful machine learning technique, to analyze a large dataset of house sales and identify statistical relationships between property features and sale prices. -Recall that communication is one of the key data science "soft skills". In Phase 2, we are specifically focused on Statistical Communication. We define Statistical Communication as: +## Objectives -> Communicating **results of statistical analyses** to diverse audiences via writing and live presentation +1. Develop a robust multiple linear regression model for accurate house price prediction. +2. Identify significant factors influencing property value in the specific market. +3. Provide insights into regional housing market dynamics. -Note that this is the same as in Phase 1, except we are replacing "basic data analysis" with "statistical analyses". +## Research Questions -High-quality Statistical Communication includes rationale, results, limitations, and recommendations: +
  • How do bedrooms, bathrooms, grade, and square footage correlate with sale price in King County? +
  • What increase in home value can homeowners expect after specific renovation projects? +
  • Which renovation projects have the greatest impact on a home's market value? +
  • Are there specific combinations of renovation projects that provide an interdependent effect on home value? -* **Rationale:** Explaining why you are using statistical analyses rather than basic data analysis - * For example, why are you using regression coefficients rather than just a graph? - * What about the problem or data is suitable for this form of analysis? - * For a data science audience, this includes your reasoning for the changes you applied while iterating between models. -* **Results:** Describing the overall model metrics and feature coefficients - * You need at least one overall model metric (e.g. r-squared or RMSE) and at least two feature coefficients. - * For a business audience, make sure you connect any metrics to real-world implications. You do not need to get into the details of how linear regression works. - * For a data science audience, you don't need to explain what a metric is, but make sure you explain why you chose that particular one. -* **Limitations:** Identifying the limitations and/or uncertainty present in your analysis - * This could include p-values/alpha values, confidence intervals, assumptions of linear regression, missing data, etc. - * In general, this should be more in-depth for a data science audience and more surface-level for a business audience. -* **Recommendations:** Interpreting the model results and limitations in the context of the business problem - * What should stakeholders _do_ with this information? +## Data Understanding -#### Exceeds Objective +**Dataset Description** -Communicates the rationale, results, limitations, and specific recommendations of statistical analyses +The analysis utilizes the King County House Sales dataset, comprising over 21,500 records and 20 distinct features. Spanning house sales from May 2014 to May 2015, the dataset offers a comprehensive snapshot of the housing market. -> See above for extended explanations of these terms. 
+**Key Columns** -#### Meets Objective (Passing Bar) +
  • id: Unique identifier for a house +
  • date: Date of house sale +
  • price: Sale price (prediction target) +
  • bedrooms, bathrooms, sqft_living, sqft_lot, floors, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, sqft_living15, sqft_lot15, sell_yr -Successfully communicates the results of statistical analyses without any major errors +**Constraints and Considerations** -> The minimum requirement is to communicate the _results_, meaning at least one overall model metric (e.g. r-squared or RMSE) as well as at least two feature coefficients. See the Approaching Objective section for an explanation of what a "major error" means. +
  • Data may contain anomalies or inconsistencies necessitating careful examination. +
  • Time frame (May 2014 - May 2015) may not fully reflect current market dynamics. +
  • Scope of data may not capture external factors such as interest rates or economic climate influencing property values. -#### Approaching Objective -Communicates the results of statistical analyses with at least one major error +**Data preparation** +we import the necessary functions and clean the data in the following ways -> A major error means that some aspect of your explanation is fundamentally incorrect. For example, if a feature coefficient is negative and you say that an increase in that feature results in an increase of the target, that would be a major error. Another example would be if you say that the feature with the highest coefficient is the "most statistically significant" while ignoring the p-value. One more example would be reporting a coefficient that is not statistically significant, rather than saying "no statistically significant linear relationship was found" +1. checking the data and null values +2. deleting the columns with null values +3. checking for non-numeric columns +4. checking for duplicates +5. creating the necessary columns +6. checking for outliers using the box plot and deleting the outliers -> "**If a coefficient's t-statistic is not significant, don't interpret it at all.** You can't be sure that the value of the corresponding parameter in the underlying regression model isn't really zero." _DeVeaux, Velleman, and Bock (2012), Stats: Data and Models, 3rd edition, pg. 801_. Check out [this website](https://web.ma.utexas.edu/users/mks/statmistakes/TOC.html) for extensive additional examples of mistakes using statistics. +![alt text](image.png) -> The easiest way to avoid making a major error is to have someone double-check your work. Reach out to peers on Slack and ask them to confirm whether your interpretation makes sense! 
-#### Does Not Meet Objective +**Exploratory Data Analysis** -Does not communicate the results of statistical analyses +We will perform exploratory data analysis (EDA) to understand the data better and discover patterns and trends using univariate, bivariate, and multivariate analysis -> It is not sufficient to just display the entire results summary. You need to pull out at least one overall model metric (e.g. r-squared, RMSE) and at least two feature coefficients, and explain what those numbers mean. +We will use descriptive statistics and visualizations to summarize the main characteristics of the data and examine the relationships between the features and the target variable. -### Data Preparation Fundamentals +We will also check the distribution and correlation of the variables and identify any potential problems or opportunities for the analysis. -We define this objective as: + # Univariate Analysis -> Applying appropriate **preprocessing** and feature engineering steps to tabular data in preparation for statistical modeling +Univariate analysis involves the examination of single variables. We focus on the summary statistics of the target variable, price, to help us understand the distribution and skewness of house prices. -The two most important components of preprocessing for the Phase 2 project are: + # Visualizing the distribution of 'price' using a histogram -* **Handling Missing Values:** Missing values may be present in the features you want to use, either encoded as `NaN` or as some other value such as `"?"`. Before you can build a linear regression model, make sure you identify and address any missing values using techniques such as dropping or replacing data. -* **Handling Non-Numeric Data:** A linear regression model needs all of the features to be numeric, not categorical. 
For this project, ***be sure to pick at least one non-numeric feature and try including it in a model.*** You can identify that a feature is currently non-numeric if the type is `object` when you run `.info()` on your dataframe. Once you have identified the non-numeric features, address them using techniques such as ordinal or one-hot (dummy) encoding. + ![alt text](image-1.png) -There is no single correct way to handle either of these situations! Use your best judgement to decide what to do, and be sure to explain your rationale in the Markdown of your notebook. + The histogram shows that the distribution of house price is positively skewed suggesting that while most houses are concentrated around lower prices, there are some properties with significantly higher prices. -Feature engineering is encouraged but not required for this project. + # Bivariate Analysis -#### Exceeds Objective +We perform bivariate analysis to examine the relationship between the target variable - price and the other numeric and continuous features in the data using the scatter plots to show the direction, strength, and shape of the relationship between two numeric variables. -Goes above and beyond with data preparation, such as feature engineering or merging in outside datasets +![alt text](image-3.png) -> One example of feature engineering could be using the `date` feature to create a new feature called `season`, which represents whether the home was sold in Spring, Summer, Fall, or Winter. +The scatter plots show that there is a positive relationship between most of the independent variables and the price of a house. This means that houses with higher values for these variables tend to be more expensive -> One example of merging in outside datasets could be finding data based on ZIP Code, such as household income or walkability, and joining that data with the provided CSV. 
+ # Multivariate Analysis -#### Meets Objective (Passing Bar) + In this section, we will perform multivariate analysis to examine the relationship between the target variable - price and multiple features in the data. We will use a heatmap to visualize the correlation matrix of the features and see how they are related to each other and to the price. -Successfully prepares data for modeling, including converting at least one non-numeric feature into ordinal or binary data and handling missing data as needed + ![alt text](image-4.png) -> As a reminder, you can identify the non-numeric features by calling `.info()` on the dataframe and looking for type `object`. +In the heatmap, positive correlations are represented by shades of red, and negative correlations by shades of blue. We note that bathrooms and sqft_living are highly positively correlated. -> Your final model does not necessarily need to include any features that were originally non-numeric, but you need to demonstrate your ability to handle this type of data. -#### Approaching Objective +**Regression Modelling** + # Simple Linear Regression -Prepares some data successfully, but is unable to utilize non-numeric data +For simple linear regression we will use the one column that has the strongest correlation to the price; this will also be our baseline model for the multiple linear regression. -> If you simply subset the dataframe to only columns with type `int64` or `float64`, your model will run, but you will not pass this objective. +1. Checking for correlation -#### Does Not Meet Objective +From the correlation matrix, sqft_living has the highest correlation with price; we will therefore use sqft_living as the exogenous variable and price as our endogenous variable. 
+plot using a scatter plot -Does not prepare data for modeling -### Linear Modeling +![alt text](image-5.png) -According to [Kaggle's 2020 State of Data Science and Machine Learning Survey](https://www.kaggle.com/kaggle-survey-2020), linear and logistic regression are the most popular machine learning algorithms, used by 83.7% of data scientists. They are small, fast models compared to some of the models you will learn later, but have limitations in the kinds of relationships they are able to learn. +From this we can see that the relationship between the two variables is linear, satisfying one of the four LINE assumptions. -In this project you are required to use linear regression as the primary statistical analysis, although you are free to use additional statistical techniques as appropriate. +2. building the model -#### Exceeds Objective +We build the model and interpret its results -Goes above and beyond in the modeling process, such as recursive feature selection +![alt text](image-6.png) -#### Meets Objective (Passing Bar) +We plot the residuals to understand where our model performs best and where it performs poorly -Successfully builds a baseline model as well as at least one iterated model, and correctly extracts insights from a final model without any major errors +![alt text](image-7.png) -> We are looking for you to (1) create a baseline model, (2) iterate on that model, making adjustments that are supported by regression theory or by descriptive analysis of the data, and (3) select a final model and report on its metrics and coefficients +Our graphs give us the same information as our summary did +From this we can see that our residuals are not normally distributed; we will try to address this using multiple linear regression -> Ideally you would include written justifications for each model iteration, but at minimum the iterations must be _justifiable_ + # Multiple Linear Regression -> For an explanation of "major errors", see the 
description below +For Multiple Linear Regression, we are going to use more than one predictor variable to predict price -#### Approaching Objective +Our baseline for this model will be the simple linear regression that we just did above +We then clean our data -Builds multiple models with at least one major error +![alt text](image-9.png) +The image above is a heatmap of the cleaned data + + # Building the model -> The number one major error to avoid is including the target as one of your features. For example, if the target is `price` you should NOT make a "price per square foot" feature, because that feature would not be available if you didn't already know the price. +1. we build the model, fit it, and interpret the results -> Other examples of major errors include: using a target other than `price`, attempting only simple linear regression (not multiple linear regression), dropping multiple one-hot encoded columns without explaining the resulting baseline, or using a unique identifier (`id` in this dataset) as a feature. +2. we check for normality -#### Does Not Meet Objective +![alt text](image-10.png) -Does not build multiple linear regression models +From the diagram above we can see that the errors are not normally distributed, and therefore we will check the other assumptions to evaluate the model -## Getting Started +3. plotting the model -Please start by reviewing the contents of this project description. If you have any questions, please ask your instructor ASAP. +4. independence of errors +We compute the model's predicted y values and calculate the residuals from them -Next, you will need to complete the [***Project Proposal***](#project_proposal) which must be reviewed by your instructor before you can continue with the project. +![alt text](image-11.png) +This shows where our model works best -Here are some suggestions for creating your GitHub repository: +5. 
evaluating the model +From this we can see that, due to outliers, nonlinear relationships, heteroscedasticity, and overfitting, our MSE and RMSE are high; we will build another model to remedy these factors. -1. Fork the [Phase 2 Project Repository](https://github.com/learn-co-curriculum/dsc-phase-2-project-v2-3), clone it locally, and work in the `student.ipynb` file. Make sure to also add and commit a PDF of your presentation to your repository with a file name of `presentation.pdf`. -2. Or, create a new repository from scratch by going to [github.com/new](https://github.com/new) and copying the data files from the Phase 2 Project Repository into your new repository. - - Recall that you can refer to the [Phase 1 Project Template](https://github.com/learn-co-curriculum/dsc-project-template) as an example structure - - This option will result in the most professional-looking portfolio repository, but can be more complicated to use. So if you are getting stuck with this option, try forking the project repository instead +## RECOMMENDATIONS +From the 3 models built, we advise potential buyers or sellers to consider model 3 in determining the price of a house. We also suggest that the factor affecting the price of a house most is the square footage of living space (sqft_living), but sellers should consider increasing the number of bathrooms during renovations. -## Summary +## NEXT STEPS +1. Find more features that home buyers often value highly to add to the model +2. Correlate the information of this model with ones for other states -This is your first modeling project! Take what you have learned in Phase 2 to create a project with a more sophisticated analysis than you completed in Phase 1. You will build on these skills as we move into the predictive machine learning mindset in Phase 3. You've got this! 
+![alt text](image-13.png) \ No newline at end of file diff --git a/images/image-1.png b/images/image-1.png new file mode 100644 index 00000000..8c94665e Binary files /dev/null and b/images/image-1.png differ diff --git a/images/image-10.png b/images/image-10.png new file mode 100644 index 00000000..efba9815 Binary files /dev/null and b/images/image-10.png differ diff --git a/images/image-11.png b/images/image-11.png new file mode 100644 index 00000000..7b88e1ea Binary files /dev/null and b/images/image-11.png differ diff --git a/images/image-12.png b/images/image-12.png new file mode 100644 index 00000000..1087477c Binary files /dev/null and b/images/image-12.png differ diff --git a/images/image-13.png b/images/image-13.png new file mode 100644 index 00000000..f3af4e62 Binary files /dev/null and b/images/image-13.png differ diff --git a/images/image-14.png b/images/image-14.png new file mode 100644 index 00000000..b2059aa4 Binary files /dev/null and b/images/image-14.png differ diff --git a/images/image-2.png b/images/image-2.png new file mode 100644 index 00000000..ba2e1d45 Binary files /dev/null and b/images/image-2.png differ diff --git a/images/image-3.png b/images/image-3.png new file mode 100644 index 00000000..ba2e1d45 Binary files /dev/null and b/images/image-3.png differ diff --git a/images/image-4.png b/images/image-4.png new file mode 100644 index 00000000..cbd3c61a Binary files /dev/null and b/images/image-4.png differ diff --git a/images/image-5.png b/images/image-5.png new file mode 100644 index 00000000..72a192ca Binary files /dev/null and b/images/image-5.png differ diff --git a/images/image-6.png b/images/image-6.png new file mode 100644 index 00000000..12861223 Binary files /dev/null and b/images/image-6.png differ diff --git a/images/image-7.png b/images/image-7.png new file mode 100644 index 00000000..b9ef57e0 Binary files /dev/null and b/images/image-7.png differ diff --git a/images/image-8.png b/images/image-8.png new file mode 100644 index 
00000000..f17bdf79 Binary files /dev/null and b/images/image-8.png differ diff --git a/images/image-9.png b/images/image-9.png new file mode 100644 index 00000000..f17bdf79 Binary files /dev/null and b/images/image-9.png differ diff --git a/images/image.png b/images/image.png new file mode 100644 index 00000000..8423d801 Binary files /dev/null and b/images/image.png differ diff --git a/student.ipynb b/student.ipynb index d3bb34af..378e2f01 100644 --- a/student.ipynb +++ b/student.ipynb @@ -7,13 +7,109 @@ "## Final Project Submission\n", "\n", "Please fill out:\n", - "* Student name: \n", - "* Student pace: self paced / part time / full time\n", + "* Student name: Calvine Dasilver\n", + "* Student pace: full time\n", "* Scheduled project review date/time: \n", - "* Instructor name: \n", + "* Instructor name: Nikita\n", "* Blog post URL:\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " # Demystifying House Sales Analysis with Regression Modeling in a Northwestern County\n", + "\n", + " ## Project Overview" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " ###
  • **Business Understanding**\n", + "\n", + "The real estate market plays a crucial role in the economic health and stability of a region. Understanding the factors that influence house prices is essential for both buyers and sellers to navigate the market effectively. This project focuses on a specific northwestern county in the United States, aiming to shed light on the key determinants of property valuation in this area.\n", + " ##### Problem Statements:\n", + "What are the most significant factors influencing house prices in this northwestern county? How can we quantify the relationship between these factors and property value? Can we develop a reliable model to predict house prices based on relevant characteristics?\n", + " ##### Challenges:\n", + "* Real estate data can be complex and multifaceted, encompassing various property features and local market trends.\n", + "* Accurately identifying and quantifying the relative impact of each factor on house prices can be challenging.\n", + "* External factors like economic conditions and interest rates might also influence prices, requiring careful consideration.\n", + "\n", + "##### Proposed Solutions:\n", + "We propose utilizing multiple linear regression, a powerful machine learning technique. This method allows us to analyze a large dataset of house sales and identify the statistical relationships between various property features (e.g., square footage, number of bedrooms, location) and the corresponding sale prices.\n", + " ##### Objectives:\n", + "1. Develop a robust multiple linear regression model that accurately predicts house prices in the chosen northwestern county.\n", + "2. Identify the most significant factors influencing property value within this specific market.\n", + "3. 
Provide valuable insights into the housing market dynamics of the region, benefiting potential buyers, real estate agents, and other stakeholders.\n", + " \n", + " \n", + "**Research questions that would help to achieve the objectives**:\n", + "\n", + "1. How does the number of bedrooms, bathrooms, grade and square footage of a house correlate with its sale price in King County?\n", + "2. How much can a homeowner expect the value of their home to increase after a specific renovation project?\n", + "3. Which renovation projects have the most significant impact on a home's market value in the northwestern county?\n", + "4. Are there specific combinations of renovation projects that provide an interdependent effect on a home's market value?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " ###
  • **Data Understanding**\n", + "\n", + "Our analysis leverages the King County House Sales dataset - a rich resource containing over 21,500 records and 20 distinct features (columns). Spanning house sales from May 2014 to May 2015, this dataset provides a comprehensive snapshot of the King County housing market during that period.\n", + "\n", + "**The King County House Sales dataset contains the following columns;**\n", + "\n", + "id - unique identifier for a house\n", + "\n", + "date - Date house was sold \n", + "\n", + "Price - Sale price (prediction target)\n", + "\n", + "bedrooms - Number of bedrooms,\n", + "\n", + "bathrooms - Number of bathrooms,\n", + "\n", + "sqft_living - Square footage of living space in the home,\n", + "\n", + "sqft_lot - Square footage of the lot,\n", + "\n", + "floors - Number of floors (levels) in house,\n", + "\n", + "view - Quality of view from house,\n", + "\n", + "condition - How good the overall condition of the house is. Related to maintenance of house,\n", + "\n", + "grade - Overall grade of the house. Related to the construction and design of the house,\n", + "\n", + "sqft_above - Square footage of house apart from basement,\n", + "\n", + "sqft_basement - Square footage of the basement,\n", + "\n", + "yr_built - Year when house was built,\n", + "\n", + "yr_renovated - Year when house was renovated,\n", + "\n", + "zipcode - ZIP Code used by the United States Postal Service,\n", + "\n", + "sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors,\n", + "\n", + "sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors, and\n", + "\n", + "sell_yr - Date house was sold.\n", + "\n", + "\n", + "We need to be aware of certain constraints within the data, as these might influence our analysis and interpretation of the results. From the sources;\n", + "\n", + "1. The data may contain anomalies or inconsistencies that require careful examination during analysis. 
For instance, a record lists a house with 33 bedrooms, which appears to be an outlier\n", + "\n", + "2. It's important to consider the time frame of the data (May 2014 - May 2015) as it may not fully capture the current market dynamics in King County.\n", + "3. It's important to acknowledge the scope of the data. While it provides details on house features, it may not capture external factors such as interest rates or the overall economic climate, which can also play a role in determining property values." + ] + }, { "cell_type": "code", "execution_count": null, @@ -26,21 +122,22 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python (learn-env)", "language": "python", - "name": "python3" + "name": "learn-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, + "feature/data-preparation": "main", "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.4" + "version": "3.8.5" } }, "nbformat": 4,