diff --git a/student.ipynb b/student.ipynb index d3bb34af..139e4f02 100644 --- a/student.ipynb +++ b/student.ipynb @@ -7,26 +7,457 @@ "## Final Project Submission\n", "\n", "Please fill out:\n", - "* Student name: \n", - "* Student pace: self paced / part time / full time\n", + "* Student name: Calvine Dasilver\n", + "* Student pace: full time\n", "* Scheduled project review date/time: \n", - "* Instructor name: \n", + "* Instructor name: Nikita\n", "* Blog post URL:\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " # Demystifying House Sales Analysis with Regression Modeling in a Northwestern County\n", + "\n", + " ## Project Overview" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " ###
  • **Business Understanding**\n", + "\n", + "The real estate market plays a crucial role in the economic health and stability of a region. Understanding the factors that influence house prices is essential for both buyers and sellers to navigate the market effectively. This project focuses on a specific northwestern county in the United States, aiming to shed light on the key determinants of property valuation in this area.\n", + "\n", + " ##### Challenges of a Fluctuating Real Estate Market:\n", + "* Market fluctuations make it difficult for real estate agents to price houses and guide clients on offers.\n", + "* Rapid price changes make it hard for homebuyers to secure a good deal and avoid overpaying.\n", + "* Timing a sale for maximum profit feels like playing the lottery: stressful, unpredictable, and with slim odds of success.\n", + "* High land prices, combined with buyers who struggle to afford homes, make it difficult for builders to build new houses.\n", + "\n", + " \n", + "##### Problem Statements:\n", + "We want to find out what makes houses expensive in a county in the northwestern US. We will also measure how much features such as the number of bedrooms or location affect the price. Finally, we will see whether we can build a tool that predicts house prices from these features.\n", + "\n", + " ##### Conclusion\n", + " The ups and downs of the housing market in this northwestern county make things tough for everyone involved. To help, we are building a tool to predict house prices, giving real estate agents information they can use to advise their clients in this unpredictable market.\n", + "\n", + " ##### Proposed Solutions:\n", + "We propose using multiple linear regression, a powerful machine learning technique. 
This method allows us to analyze a large dataset of house sales and identify the statistical relationships between various property features (e.g., square footage, number of bedrooms, location) and the corresponding sale prices.\n", + "\n", + "\n", + " ##### Objectives:\n", + "1. Develop a robust multiple linear regression model that accurately predicts house prices in the chosen northwestern county.\n", + "2. Identify the most significant factors influencing property value within this specific market.\n", + "3. Provide valuable insights into the housing market dynamics of the region, benefiting potential buyers, real estate agents, and other stakeholders.\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " ###
  • **Data Understanding**\n", + "\n", + "Our analysis uses the King County House Sales dataset, stored in a file called \"kc_house_data.csv\".\n", + " It contains over 21,500 records and 21 columns (the sale price plus 20 other fields). Spanning house sales from May 2014 to May 2015, this dataset provides a snapshot of the King County housing market during that period.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>id</th><th>date</th><th>price</th><th>bedrooms</th><th>bathrooms</th><th>sqft_living</th><th>sqft_lot</th><th>floors</th><th>waterfront</th><th>view</th><th>...</th><th>grade</th><th>sqft_above</th><th>sqft_basement</th><th>yr_built</th><th>yr_renovated</th><th>zipcode</th><th>lat</th><th>long</th><th>sqft_living15</th><th>sqft_lot15</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>7129300520</td><td>10/13/2014</td><td>221900.0</td><td>3</td><td>1.00</td><td>1180</td><td>5650</td><td>1.0</td><td>NaN</td><td>NONE</td><td>...</td><td>7 Average</td><td>1180</td><td>0.0</td><td>1955</td><td>0.0</td><td>98178</td><td>47.5112</td><td>-122.257</td><td>1340</td><td>5650</td></tr>\n",
+ "    <tr><th>1</th><td>6414100192</td><td>12/9/2014</td><td>538000.0</td><td>3</td><td>2.25</td><td>2570</td><td>7242</td><td>2.0</td><td>NO</td><td>NONE</td><td>...</td><td>7 Average</td><td>2170</td><td>400.0</td><td>1951</td><td>1991.0</td><td>98125</td><td>47.7210</td><td>-122.319</td><td>1690</td><td>7639</td></tr>\n",
+ "    <tr><th>2</th><td>5631500400</td><td>2/25/2015</td><td>180000.0</td><td>2</td><td>1.00</td><td>770</td><td>10000</td><td>1.0</td><td>NO</td><td>NONE</td><td>...</td><td>6 Low Average</td><td>770</td><td>0.0</td><td>1933</td><td>NaN</td><td>98028</td><td>47.7379</td><td>-122.233</td><td>2720</td><td>8062</td></tr>\n",
+ "    <tr><th>3</th><td>2487200875</td><td>12/9/2014</td><td>604000.0</td><td>4</td><td>3.00</td><td>1960</td><td>5000</td><td>1.0</td><td>NO</td><td>NONE</td><td>...</td><td>7 Average</td><td>1050</td><td>910.0</td><td>1965</td><td>0.0</td><td>98136</td><td>47.5208</td><td>-122.393</td><td>1360</td><td>5000</td></tr>\n",
+ "    <tr><th>4</th><td>1954400510</td><td>2/18/2015</td><td>510000.0</td><td>3</td><td>2.00</td><td>1680</td><td>8080</td><td>1.0</td><td>NO</td><td>NONE</td><td>...</td><td>8 Good</td><td>1680</td><td>0.0</td><td>1987</td><td>0.0</td><td>98074</td><td>47.6168</td><td>-122.045</td><td>1800</td><td>7503</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>5 rows × 21 columns</p>\n",
+ "</div>
    " + ], + "text/plain": [ + " id date price bedrooms bathrooms sqft_living \\\n", + "0 7129300520 10/13/2014 221900.0 3 1.00 1180 \n", + "1 6414100192 12/9/2014 538000.0 3 2.25 2570 \n", + "2 5631500400 2/25/2015 180000.0 2 1.00 770 \n", + "3 2487200875 12/9/2014 604000.0 4 3.00 1960 \n", + "4 1954400510 2/18/2015 510000.0 3 2.00 1680 \n", + "\n", + " sqft_lot floors waterfront view ... grade sqft_above \\\n", + "0 5650 1.0 NaN NONE ... 7 Average 1180 \n", + "1 7242 2.0 NO NONE ... 7 Average 2170 \n", + "2 10000 1.0 NO NONE ... 6 Low Average 770 \n", + "3 5000 1.0 NO NONE ... 7 Average 1050 \n", + "4 8080 1.0 NO NONE ... 8 Good 1680 \n", + "\n", + " sqft_basement yr_built yr_renovated zipcode lat long \\\n", + "0 0.0 1955 0.0 98178 47.5112 -122.257 \n", + "1 400.0 1951 1991.0 98125 47.7210 -122.319 \n", + "2 0.0 1933 NaN 98028 47.7379 -122.233 \n", + "3 910.0 1965 0.0 98136 47.5208 -122.393 \n", + "4 0.0 1987 0.0 98074 47.6168 -122.045 \n", + "\n", + " sqft_living15 sqft_lot15 \n", + "0 1340 5650 \n", + "1 1690 7639 \n", + "2 2720 8062 \n", + "3 1360 5000 \n", + "4 1800 7503 \n", + "\n", + "[5 rows x 21 columns]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "# Load the data\n", + "data = pd.read_csv('data/kc_house_data.csv')\n", + "data.head()" + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dataframe Info:\n", + "\n", + "RangeIndex: 21597 entries, 0 to 21596\n", + "Data columns (total 21 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 21597 non-null int64 \n", + " 1 date 21597 non-null object \n", + " 2 price 21597 non-null float64\n", + " 3 bedrooms 21597 non-null int64 \n", + " 4 bathrooms 21597 non-null float64\n", + " 5 sqft_living 21597 non-null int64 \n", + 
" 6 sqft_lot 21597 non-null int64 \n", + " 7 floors 21597 non-null float64\n", + " 8 waterfront 19221 non-null object \n", + " 9 view 21534 non-null object \n", + " 10 condition 21597 non-null object \n", + " 11 grade 21597 non-null object \n", + " 12 sqft_above 21597 non-null int64 \n", + " 13 sqft_basement 21597 non-null object \n", + " 14 yr_built 21597 non-null int64 \n", + " 15 yr_renovated 17755 non-null float64\n", + " 16 zipcode 21597 non-null int64 \n", + " 17 lat 21597 non-null float64\n", + " 18 long 21597 non-null float64\n", + " 19 sqft_living15 21597 non-null int64 \n", + " 20 sqft_lot15 21597 non-null int64 \n", + "dtypes: float64(6), int64(9), object(6)\n", + "memory usage: 3.5+ MB\n", + "(21597, 21)\n", + "\n", + "Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',\n", + " 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',\n", + " 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',\n", + " 'lat', 'long', 'sqft_living15', 'sqft_lot15'],\n", + " dtype='object')\n", + "\n", + "id int64\n", + "date object\n", + "price float64\n", + "bedrooms int64\n", + "bathrooms float64\n", + "sqft_living int64\n", + "sqft_lot int64\n", + "floors float64\n", + "waterfront object\n", + "view object\n", + "condition object\n", + "grade object\n", + "sqft_above int64\n", + "sqft_basement object\n", + "yr_built int64\n", + "yr_renovated float64\n", + "zipcode int64\n", + "lat float64\n", + "long float64\n", + "sqft_living15 int64\n", + "sqft_lot15 int64\n", + "dtype: object\n", + "\n", + "None\n", + "\n", + " id price bedrooms bathrooms sqft_living \\\n", + "count 2.159700e+04 2.159700e+04 21597.000000 21597.000000 21597.000000 \n", + "mean 4.580474e+09 5.402966e+05 3.373200 2.115826 2080.321850 \n", + "std 2.876736e+09 3.673681e+05 0.926299 0.768984 918.106125 \n", + "min 1.000102e+06 7.800000e+04 1.000000 0.500000 370.000000 \n", + "25% 2.123049e+09 3.220000e+05 3.000000 1.750000 1430.000000 \n", + "50% 3.904930e+09 
4.500000e+05 3.000000 2.250000 1910.000000 \n", + "75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 \n", + "max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 \n", + "\n", + " sqft_lot floors sqft_above yr_built yr_renovated \\\n", + "count 2.159700e+04 21597.000000 21597.000000 21597.000000 17755.000000 \n", + "mean 1.509941e+04 1.494096 1788.596842 1970.999676 83.636778 \n", + "std 4.141264e+04 0.539683 827.759761 29.375234 399.946414 \n", + "min 5.200000e+02 1.000000 370.000000 1900.000000 0.000000 \n", + "25% 5.040000e+03 1.000000 1190.000000 1951.000000 0.000000 \n", + "50% 7.618000e+03 1.500000 1560.000000 1975.000000 0.000000 \n", + "75% 1.068500e+04 2.000000 2210.000000 1997.000000 0.000000 \n", + "max 1.651359e+06 3.500000 9410.000000 2015.000000 2015.000000 \n", + "\n", + " zipcode lat long sqft_living15 sqft_lot15 \n", + "count 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 \n", + "mean 98077.951845 47.560093 -122.213982 1986.620318 12758.283512 \n", + "std 53.513072 0.138552 0.140724 685.230472 27274.441950 \n", + "min 98001.000000 47.155900 -122.519000 399.000000 651.000000 \n", + "25% 98033.000000 47.471100 -122.328000 1490.000000 5100.000000 \n", + "50% 98065.000000 47.571800 -122.231000 1840.000000 7620.000000 \n", + "75% 98118.000000 47.678000 -122.125000 2360.000000 10083.000000 \n", + "max 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 \n" + ] + } + ], "source": [ - "# Your code here - remember to use markdown cells for comments as well!" 
+ "\n", + "def dataset_info(file_path):\n", + "    # Load the CSV and print its shape, columns, dtypes, and summary statistics.\n", + "    data = pd.read_csv(file_path)\n", + "    print(\"Dataframe Info:\")\n", + "    # data.info() prints its report as soon as it is evaluated and returns None,\n", + "    # so the report appears first and a literal None is printed in its slot.\n", + "    print(data.shape, data.columns, data.dtypes, data.info(), data.describe(), sep=\"\\n\\n\")\n", + "\n", + "dataset_info('data/kc_house_data.csv')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Data Buckets**\n", + "\n", + "* Six columns hold descriptive text values, such as view, condition, and grade (categorical data).\n", + "* There are 12 columns with numerical values like square footage and bedrooms (numeric data).\n", + "* Three columns contain details related to time, like year built (temporal data).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**The King County House Sales dataset contains the following columns:**\n", + "\n", + "id - Unique identifier for a house\n", + "\n", + "date - Date house was sold\n", + "\n", + "price - Sale price (prediction target)\n", + "\n", + "bedrooms - Number of bedrooms,\n", + "\n", + "bathrooms - Number of bathrooms,\n", + "\n", + "sqft_living - Square footage of living space in the home,\n", + "\n", + "sqft_lot - Square footage of the lot,\n", + "\n", + "floors - Number of floors (levels) in house,\n", + "\n", + "waterfront - Whether the house is on a waterfront,\n", + "\n", + "view - Quality of view from house,\n", + "\n", + "condition - How good the overall condition of the house is. Related to maintenance of house,\n", + "\n", + "grade - Overall grade of the house. 
Related to the construction and design of the house,\n", + "\n", + "sqft_above - Square footage of house apart from basement,\n", + "\n", + "sqft_basement - Square footage of the basement,\n", + "\n", + "yr_built - Year when house was built,\n", + "\n", + "yr_renovated - Year when house was renovated,\n", + "\n", + "zipcode - ZIP Code used by the United States Postal Service,\n", + "\n", + "lat - Latitude coordinate,\n", + "\n", + "long - Longitude coordinate,\n", + "\n", + "sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors,\n", + "\n", + "sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors, and\n", + "\n", + "sell_yr - Year the house was sold; this is not one of the 21 columns in kc_house_data.csv and would need to be derived from date." ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "learn-env", "language": "python", "name": "python3" }, @@ -40,7 +471,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.4" + "version": "3.8.5" } }, "nbformat": 4,
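Note on `dataset_info`: the helper added in this diff passes `data.info()` as an argument to `print`, which is why the recorded output shows the info report first and a stray `None` further down. One possible alternative is sketched below. It is only a sketch under stated assumptions: the function name `summarize` and the small `toy` frame are hypothetical and not part of the notebook, and the toy values are borrowed from the first rows shown in the diff so the example runs without `data/kc_house_data.csv`.

```python
import io

import pandas as pd


def summarize(df: pd.DataFrame) -> str:
    """Return shape, the info() report, and describe() as one string.

    Hypothetical helper, not the notebook's own function.
    """
    buffer = io.StringIO()
    # info() writes to stdout by default and returns None; passing buf=
    # captures the report as text instead.
    df.info(buf=buffer)
    parts = [
        f"Shape: {df.shape}",
        buffer.getvalue().rstrip(),
        df.describe().to_string(),
    ]
    return "\n\n".join(parts)


# Tiny stand-in frame (assumption: three columns copied from the first rows
# of the dataset preview) so the sketch is runnable on its own.
toy = pd.DataFrame(
    {
        "price": [221900.0, 538000.0, 180000.0],
        "bedrooms": [3, 3, 2],
        "waterfront": [None, "NO", "NO"],
    }
)

report = summarize(toy)
print(report)
```

Capturing the report in a buffer keeps the sections in the order they are listed and avoids printing the return value of `info()`. With the real file, `summarize(pd.read_csv('data/kc_house_data.csv'))` would be the equivalent call.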