From 02f5aa4204c822d69b4dbdae7f7b5c30b52b9031 Mon Sep 17 00:00:00 2001 From: sandrakiptumm Date: Sat, 27 Apr 2024 21:33:42 +0300 Subject: [PATCH 01/25] Importing modules --- student.ipynb | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/student.ipynb b/student.ipynb index d3bb34af..1fd5648e 100644 --- a/student.ipynb +++ b/student.ipynb @@ -22,6 +22,19 @@ "source": [ "# Your code here - remember to use markdown cells for comments as well!" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import sqlite3\n", + "import numpy as np\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt" + ] } ], "metadata": { From 99392888b6ff63405f09a378a4624fdc68e37097 Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Sat, 27 Apr 2024 22:23:48 +0300 Subject: [PATCH 02/25] Update student.ipynb --- student.ipynb | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/student.ipynb b/student.ipynb index 1fd5648e..981801dd 100644 --- a/student.ipynb +++ b/student.ipynb @@ -7,10 +7,10 @@ "## Final Project Submission\n", "\n", "Please fill out:\n", - "* Student name: \n", - "* Student pace: self paced / part time / full time\n", + "* Student name: Calvine Dasilver\n", + "* Student pace: full time\n", "* Scheduled project review date/time: \n", - "* Instructor name: \n", + "* Instructor name: Nikita\n", "* Blog post URL:\n" ] }, From dc953c2a66c113466d164708d18731017bb1c16d Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Sat, 27 Apr 2024 23:25:55 +0300 Subject: [PATCH 03/25] Business-understanding business overview and understanding --- student.ipynb | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/student.ipynb b/student.ipynb index 981801dd..4a8a8f58 100644 --- a/student.ipynb +++ b/student.ipynb @@ -14,6 +14,33 @@ "* 
Blog post URL:\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " # King County House Sales Analysis & Regression Modeling\n", + "\n", + " ## Project Overview\n", + "\n", + " ###
  • Business Understanding\n", + "\n", + "The real estate market plays a crucial role in the economic health and stability of a region. Understanding the factors that influence house prices is essential for both buyers and sellers to navigate the market effectively. This project focuses on a specific northwestern county in the United States, aiming to shed light on the key determinants of property valuation in this area.\n", + " ##### Problem Statements:\n", + "What are the most significant factors influencing house prices in this northwestern county? How can we quantify the relationship between these factors and property value? Can we develop a reliable model to predict house prices based on relevant characteristics?\n", + " ##### Challenges:\n", + "* Real estate data can be complex and multifaceted, encompassing various property features and local market trends.\n", + "* Accurately identifying and quantifying the relative impact of each factor on house prices can be challenging.\n", + "* External factors like economic conditions and interest rates might also influence prices, requiring careful consideration.\n", + "\n", + "##### Proposed Solutions:\n", + "We propose utilizing multiple linear regression, a powerful machine learning technique. This method allows us to analyze a large dataset of house sales and identify the statistical relationships between various property features (e.g., square footage, number of bedrooms, location) and the corresponding sale prices.\n", + " ##### Objectives:\n", + "1. Develop a robust multiple linear regression model that accurately predicts house prices in the chosen northwestern county.\n", + "2. Identify the most significant factors influencing property value within this specific market.\n", + "3. 
Provide valuable insights into the housing market dynamics of the region, benefiting potential buyers, real estate agents, and other stakeholders.\n", + " " + ] + }, { "cell_type": "code", "execution_count": null, From 94c04b3aa09b8277ab40916e09f15a0bf9a8cc12 Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Sat, 27 Apr 2024 23:45:03 +0300 Subject: [PATCH 04/25] Update student.ipynb --- student.ipynb | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/student.ipynb b/student.ipynb index 4a8a8f58..ccbd36f3 100644 --- a/student.ipynb +++ b/student.ipynb @@ -38,7 +38,14 @@ "1. Develop a robust multiple linear regression model that accurately predicts house prices in the chosen northwestern county.\n", "2. Identify the most significant factors influencing property value within this specific market.\n", "3. Provide valuable insights into the housing market dynamics of the region, benefiting potential buyers, real estate agents, and other stakeholders.\n", - " " + " \n", + " \n", + "**Research questions that would help to achieve the objectives**:\n", + "\n", + "1. How does the number of bedrooms, bathrooms, grade and square footage of a house correlate with its sale price in King County?\n", + "2. How much can a homeowner expect the value of their home to increase after a specific renovation project?\n", + "3. Which renovation projects have the most significant impact on a home's market value in the northwestern county?\n", + "4. Are there specific combinations of renovation projects that provide an interdependent effect on a home's market value?" 
] }, { From a0708a0e09bedc96b91113f89f4007d9916561eb Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Sun, 28 Apr 2024 00:13:40 +0300 Subject: [PATCH 05/25] Data-understanding data source and data description --- student.ipynb | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 53 insertions(+) diff --git a/student.ipynb b/student.ipynb index ccbd36f3..0e889c5f 100644 --- a/student.ipynb +++ b/student.ipynb @@ -48,6 +48,59 @@ "4. Are there specific combinations of renovation projects that provide an interdependent effect on a home's market value?" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " ###
  • Data Understanding\n", + "\n", + "Our analysis leverages the King County House Sales dataset - a rich resource containing over 21,500 records and 20 distinct features(columns). Spanning house sales from May 2014 to May 2015, this dataset provides a comprehensive snapshot of the King County housing market during that period.\n", + "\n", + "**The King County House Sales dataset contains the following columns:**\n", + "\n", + "price - Sale price (prediction target),\n", + "\n", + "bedrooms - Number of bedrooms,\n", + "\n", + "bathrooms - Number of bathrooms,\n", + "\n", + "sqft_living - Square footage of living space in the home,\n", + "\n", + "sqft_lot - Square footage of the lot,\n", + "\n", + "floors - Number of floors (levels) in house,\n", + "\n", + "view - Quality of view from house,\n", + "\n", + "condition - How good the overall condition of the house is. Related to maintenance of house,\n", + "\n", + "grade - Overall grade of the house. Related to the construction and design of the house,\n", + "\n", + "sqft_above - Square footage of house apart from basement,\n", + "\n", + "sqft_basement - Square footage of the basement,\n", + "\n", + "yr_built - Year when house was built,\n", + "\n", + "yr_renovated - Year when house was renovated,\n", + "\n", + "zipcode - ZIP Code used by the United States Postal Service,\n", + "\n", + "sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors,\n", + "\n", + "sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors, and\n", + "\n", + "sell_yr - Date house was sold.\n", + "\n", + "\n", + "We need to be aware of certain constraints within the data, as these might influence our analysis and interpretation of the results. From the sources:\n", + "\n", + "1. The data may contain anomalies or inconsistencies that require careful examination during analysis. For instance, a record lists a house with 33 bedrooms, which appears to be an outlier.\n", + "\n", + "2. 
It's important to consider the time frame of the data (May 2014 - May 2015) as it may not fully capture the current market dynamics in King County.\n", + "3. It's important to acknowledge the scope of the data. While it provides details on house features, it may not capture external factors such as interest rates or the overall economic climate, which can also play a role in determining property values." + ] + }, { "cell_type": "code", "execution_count": null, From 27b7de987bc0143cc0410d7f149c4e786dfe8868 Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Sun, 28 Apr 2024 00:18:00 +0300 Subject: [PATCH 06/25] Update student.ipynb --- student.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/student.ipynb b/student.ipynb index 0e889c5f..67a5b8be 100644 --- a/student.ipynb +++ b/student.ipynb @@ -22,7 +22,7 @@ "\n", " ## Project Overview\n", "\n", - " ###
  • Business Understanding\n", + " ###
  • **Business Understanding**\n", "\n", "The real estate market plays a crucial role in the economic health and stability of a region. Understanding the factors that influence house prices is essential for both buyers and sellers to navigate the market effectively. This project focuses on a specific northwestern county in the United States, aiming to shed light on the key determinants of property valuation in this area.\n", " ##### Problem Statements:\n", @@ -52,7 +52,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - " ###
  • Data Understanding\n", + " ###
  • **Data Understanding**\n", "\n", "Our analysis leverages the King County House Sales dataset - a rich resource containing over 21,500 records and 20 distinct features(columns). Spanning house sales from May 2014 to May 2015, this dataset provides a comprehensive snapshot of the King County housing market during that period.\n", "\n", From 5dca9f8514bf6244ef977d8f4a11f58ca51b22ae Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Sun, 28 Apr 2024 00:28:27 +0300 Subject: [PATCH 07/25] Update student.ipynb --- student.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/student.ipynb b/student.ipynb index 67a5b8be..d54668c9 100644 --- a/student.ipynb +++ b/student.ipynb @@ -18,7 +18,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - " # King County House Sales Analysis & Regression Modeling\n", + " # Demystifying House Sales Analysis & Regression Modeling in a Northwestern County\n", "\n", " ## Project Overview\n", "\n", From 39a544a20a1e5a8cfc9aee46a8e07974752b8d90 Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Sun, 28 Apr 2024 00:43:50 +0300 Subject: [PATCH 08/25] Update student.ipynb --- student.ipynb | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/student.ipynb b/student.ipynb index d54668c9..98841fb8 100644 --- a/student.ipynb +++ b/student.ipynb @@ -18,10 +18,26 @@ "cell_type": "markdown", "metadata": {}, "source": [ - " # Demystifying House Sales Analysis & Regression Modeling in a Northwestern County\n", - "\n", - " ## Project Overview\n", + " # Demystifying House Sales Analysis & Regression Modeling in a Northwestern County\n", "\n", + " ## Project Overview" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import Image, display\n", + "image_path = '/content/Real estate.jpg'\n", + 
"display(Image(filename=image_path))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ " ###
  • **Business Understanding**\n", "\n", "The real estate market plays a crucial role in the economic health and stability of a region. Understanding the factors that influence house prices is essential for both buyers and sellers to navigate the market effectively. This project focuses on a specific northwestern county in the United States, aiming to shed light on the key determinants of property valuation in this area.\n", From 516f422bb15391106448be3ce0d1955d731c0687 Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Sun, 28 Apr 2024 01:13:08 +0300 Subject: [PATCH 09/25] Update student.ipynb --- student.ipynb | 15 ++------------- 1 file changed, 2 insertions(+), 13 deletions(-) diff --git a/student.ipynb b/student.ipynb index 98841fb8..c2d72a52 100644 --- a/student.ipynb +++ b/student.ipynb @@ -18,22 +18,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - " # Demystifying House Sales Analysis & Regression Modeling in a Northwestern County\n", + " # Demystifying House Sales Analysis with Regression Modeling in a Northwestern County\n", "\n", " ## Project Overview" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import Image, display\n", - "image_path = '/content/Real estate.jpg'\n", - "display(Image(filename=image_path))" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -156,7 +145,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.4" + "version": "3.8.5" } }, "nbformat": 4, From 602cb3201712b528de133d596b0fdbc136ecb368 Mon Sep 17 00:00:00 2001 From: jackoti Date: Sat, 27 Apr 2024 21:46:28 -0700 Subject: [PATCH 10/25] import module and read data --- student.ipynb | 240 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 239 insertions(+), 1 deletion(-) diff --git a/student.ipynb b/student.ipynb index d3bb34af..1f3f5c1e 100644 --- a/student.ipynb +++ 
b/student.ipynb @@ -22,6 +22,244 @@ "source": [ "# Your code here - remember to use markdown cells for comments as well!" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "DATA PREPARATION" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import csv\n", + "import pandas as pd\n", + "import sqlite3\n", + "import numpy as np\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[HTML rendering of the DataFrame preview omitted: 5 rows × 21 columns, same data as the text/plain output that follows]
    " + ], + "text/plain": [ + " id date price bedrooms bathrooms sqft_living \\\n", + "0 7129300520 10/13/2014 221900.0 3 1.00 1180 \n", + "1 6414100192 12/9/2014 538000.0 3 2.25 2570 \n", + "2 5631500400 2/25/2015 180000.0 2 1.00 770 \n", + "3 2487200875 12/9/2014 604000.0 4 3.00 1960 \n", + "4 1954400510 2/18/2015 510000.0 3 2.00 1680 \n", + "\n", + " sqft_lot floors waterfront view ... grade sqft_above \\\n", + "0 5650 1.0 NaN NONE ... 7 Average 1180 \n", + "1 7242 2.0 NO NONE ... 7 Average 2170 \n", + "2 10000 1.0 NO NONE ... 6 Low Average 770 \n", + "3 5000 1.0 NO NONE ... 7 Average 1050 \n", + "4 8080 1.0 NO NONE ... 8 Good 1680 \n", + "\n", + " sqft_basement yr_built yr_renovated zipcode lat long \\\n", + "0 0.0 1955 0.0 98178 47.5112 -122.257 \n", + "1 400.0 1951 1991.0 98125 47.7210 -122.319 \n", + "2 0.0 1933 NaN 98028 47.7379 -122.233 \n", + "3 910.0 1965 0.0 98136 47.5208 -122.393 \n", + "4 0.0 1987 0.0 98074 47.6168 -122.045 \n", + "\n", + " sqft_living15 sqft_lot15 \n", + "0 1340 5650 \n", + "1 1690 7639 \n", + "2 2720 8062 \n", + "3 1360 5000 \n", + "4 1800 7503 \n", + "\n", + "[5 rows x 21 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data = pd.read_csv(\"data/kc_house_data.csv\")\n", + "data.head()" + ] } ], "metadata": { @@ -40,7 +278,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.4" + "version": "3.11.5" } }, "nbformat": 4, From a5f1a608f1c8331cdcb566cab26b015cb6e94baa Mon Sep 17 00:00:00 2001 From: sandrakiptumm Date: Sun, 28 Apr 2024 14:44:45 +0300 Subject: [PATCH 11/25] deleting importations in main --- student.ipynb | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) diff --git a/student.ipynb b/student.ipynb index 1fd5648e..a1e5b9cd 100644 --- a/student.ipynb +++ b/student.ipynb @@ -22,26 +22,13 @@ "source": [ "# Your code here - remember to use markdown cells for comments as well!" 
] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import sqlite3\n", - "import numpy as np\n", - "import seaborn as sns\n", - "import matplotlib.pyplot as plt" - ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python (learn-env)", "language": "python", - "name": "python3" + "name": "learn-env" }, "language_info": { "codemirror_mode": { @@ -53,7 +40,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.4" + "version": "3.8.5" } }, "nbformat": 4, From 8b8b1c9a37f3b0643582498e08333f4f820d49fb Mon Sep 17 00:00:00 2001 From: jackoti Date: Sun, 28 Apr 2024 09:51:49 -0700 Subject: [PATCH 12/25] check columns and missing values --- student.ipynb | 239 ++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 233 insertions(+), 6 deletions(-) diff --git a/student.ipynb b/student.ipynb index 1f3f5c1e..220f662f 100644 --- a/student.ipynb +++ b/student.ipynb @@ -32,7 +32,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 60, "metadata": {}, "outputs": [], "source": [ @@ -41,12 +41,22 @@ "import sqlite3\n", "import numpy as np\n", "import seaborn as sns\n", - "import matplotlib.pyplot as plt" + "import matplotlib.pyplot as plt\n", + "from statsmodels.api import OLS\n", + "%matplotlib inline\n", + "\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.metrics import mean_squared_error\n", + "from sklearn.preprocessing import PolynomialFeatures\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.model_selection import cross_validate\n", + "from sklearn.preprocessing import LabelEncoder" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 61, "metadata": {}, "outputs": [ { @@ -251,15 +261,232 @@ "[5 rows x 21 columns]" ] }, - "execution_count": 3, 
+ "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.read_csv(\"data/kc_house_data.csv\")\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',\n", + " 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',\n", + " 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',\n", + " 'lat', 'long', 'sqft_living15', 'sqft_lot15'],\n", + " dtype='object')\n" + ] + } + ], + "source": [ + "print(df.columns)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "id 0\n", + "date 0\n", + "price 0\n", + "bedrooms 0\n", + "bathrooms 0\n", + "sqft_living 0\n", + "sqft_lot 0\n", + "floors 0\n", + "waterfront 2376\n", + "view 63\n", + "condition 0\n", + "grade 0\n", + "sqft_above 0\n", + "sqft_basement 0\n", + "yr_built 0\n", + "yr_renovated 3842\n", + "zipcode 0\n", + "lat 0\n", + "long 0\n", + "sqft_living15 0\n", + "sqft_lot15 0\n", + "dtype: int64" + ] + }, + "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "data = pd.read_csv(\"data/kc_house_data.csv\")\n", - "data.head()" + "#checking null values\n", + "df.isna().sum()" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": 
"code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { From 5cb2038f4b3b5262ca773d660e7975695db67766 Mon Sep 17 00:00:00 2001 From: jackoti Date: Sun, 28 Apr 2024 10:07:16 -0700 Subject: [PATCH 13/25] check 
data info and statistical summary --- student.ipynb | 413 ++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 401 insertions(+), 12 deletions(-) diff --git a/student.ipynb b/student.ipynb index 220f662f..0183696a 100644 --- a/student.ipynb +++ b/student.ipynb @@ -336,31 +336,420 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 66, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 21597 entries, 0 to 21596\n", + "Data columns (total 21 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 21597 non-null int64 \n", + " 1 date 21597 non-null object \n", + " 2 price 21597 non-null float64\n", + " 3 bedrooms 21597 non-null int64 \n", + " 4 bathrooms 21597 non-null float64\n", + " 5 sqft_living 21597 non-null int64 \n", + " 6 sqft_lot 21597 non-null int64 \n", + " 7 floors 21597 non-null float64\n", + " 8 waterfront 19221 non-null object \n", + " 9 view 21534 non-null object \n", + " 10 condition 21597 non-null object \n", + " 11 grade 21597 non-null object \n", + " 12 sqft_above 21597 non-null int64 \n", + " 13 sqft_basement 21597 non-null object \n", + " 14 yr_built 21597 non-null int64 \n", + " 15 yr_renovated 17755 non-null float64\n", + " 16 zipcode 21597 non-null int64 \n", + " 17 lat 21597 non-null float64\n", + " 18 long 21597 non-null float64\n", + " 19 sqft_living15 21597 non-null int64 \n", + " 20 sqft_lot15 21597 non-null int64 \n", + "dtypes: float64(6), int64(9), object(6)\n", + "memory usage: 3.5+ MB\n" + ] + } + ], + "source": [ + "#checking on data information\n", + "df.info()" + ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 67, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/html": [ + "
[HTML rendering of the df.describe() summary omitted: same statistics as the text/plain output that follows]
    " + ], + "text/plain": [ + " id price bedrooms bathrooms sqft_living \\\n", + "count 2.159700e+04 2.159700e+04 21597.000000 21597.000000 21597.000000 \n", + "mean 4.580474e+09 5.402966e+05 3.373200 2.115826 2080.321850 \n", + "std 2.876736e+09 3.673681e+05 0.926299 0.768984 918.106125 \n", + "min 1.000102e+06 7.800000e+04 1.000000 0.500000 370.000000 \n", + "25% 2.123049e+09 3.220000e+05 3.000000 1.750000 1430.000000 \n", + "50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 \n", + "75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 \n", + "max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 \n", + "\n", + " sqft_lot floors sqft_above yr_built yr_renovated \\\n", + "count 2.159700e+04 21597.000000 21597.000000 21597.000000 17755.000000 \n", + "mean 1.509941e+04 1.494096 1788.596842 1970.999676 83.636778 \n", + "std 4.141264e+04 0.539683 827.759761 29.375234 399.946414 \n", + "min 5.200000e+02 1.000000 370.000000 1900.000000 0.000000 \n", + "25% 5.040000e+03 1.000000 1190.000000 1951.000000 0.000000 \n", + "50% 7.618000e+03 1.500000 1560.000000 1975.000000 0.000000 \n", + "75% 1.068500e+04 2.000000 2210.000000 1997.000000 0.000000 \n", + "max 1.651359e+06 3.500000 9410.000000 2015.000000 2015.000000 \n", + "\n", + " zipcode lat long sqft_living15 sqft_lot15 \n", + "count 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 \n", + "mean 98077.951845 47.560093 -122.213982 1986.620318 12758.283512 \n", + "std 53.513072 0.138552 0.140724 685.230472 27274.441950 \n", + "min 98001.000000 47.155900 -122.519000 399.000000 651.000000 \n", + "25% 98033.000000 47.471100 -122.328000 1490.000000 5100.000000 \n", + "50% 98065.000000 47.571800 -122.231000 1840.000000 7620.000000 \n", + "75% 98118.000000 47.678000 -122.125000 2360.000000 10083.000000 \n", + "max 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 " + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"#statistical summary\n", + "df.describe()" + ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 68, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/html": [ + "
[HTML rendering of df.head() omitted: 5 rows × 7 columns, same data as the text/plain output that follows]
    " + ], + "text/plain": [ + " date price bedrooms bathrooms sqft_living condition yr_built\n", + "0 10/13/2014 221900.0 3 1.00 1180 Average 1955\n", + "1 12/9/2014 538000.0 3 2.25 2570 Average 1951\n", + "2 2/25/2015 180000.0 2 1.00 770 Average 1933\n", + "3 12/9/2014 604000.0 4 3.00 1960 Very Good 1965\n", + "4 2/18/2015 510000.0 3 2.00 1680 Average 1987" + ] + }, + "execution_count": 68, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#we'll focus on the columns mentioned above and drop the rest\n", + "df.drop(columns=['id','lat','long','sqft_lot','floors','waterfront','view','zipcode','sqft_living15','sqft_lot15','grade','yr_renovated','sqft_above','sqft_basement'],inplace=True)\n", + "df.head()" + ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 69, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/plain": [ + "date 0\n", + "price 0\n", + "bedrooms 0\n", + "bathrooms 0\n", + "sqft_living 0\n", + "condition 0\n", + "yr_built 0\n", + "dtype: int64" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#checking for for null values\n", + "df.isna().sum()" + ] }, { "cell_type": "code", From 7b42ab768a5ec8dda38793a846996a72371a3af4 Mon Sep 17 00:00:00 2001 From: jackoti Date: Sun, 28 Apr 2024 10:11:38 -0700 Subject: [PATCH 14/25] added additional information on the statistical summary --- student.ipynb | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/student.ipynb b/student.ipynb index 0183696a..bece4158 100644 --- a/student.ipynb +++ b/student.ipynb @@ -613,6 +613,16 @@ "df.describe()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Statistical summary observation\n", + "count for each column is 21597 this shows that we dont have missing values\n", + "The mean value of the house price is USD 540297 while the minimum house price is USD 78000 and maximum house price is USD 
7700000\n", + "The standard deviation of the house price stands at USD 367368." + ] + }, { "cell_type": "code", "execution_count": 68, From eb067369b051a15a46737bc67ace4576e3c30230 Mon Sep 17 00:00:00 2001 From: jackoti Date: Sun, 28 Apr 2024 10:34:20 -0700 Subject: [PATCH 15/25] added additional information --- student.ipynb | 650 +++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 612 insertions(+), 38 deletions(-) diff --git a/student.ipynb b/student.ipynb index bece4158..55d0227c 100644 --- a/student.ipynb +++ b/student.ipynb @@ -32,7 +32,7 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": 76, "metadata": {}, "outputs": [], "source": [ @@ -56,7 +56,7 @@ }, { "cell_type": "code", - "execution_count": 61, + "execution_count": 77, "metadata": {}, "outputs": [ { @@ -261,7 +261,7 @@ "[5 rows x 21 columns]" ] }, - "execution_count": 61, + "execution_count": 77, "metadata": {}, "output_type": "execute_result" } @@ -273,7 +273,7 @@ }, { "cell_type": "code", - "execution_count": 64, + "execution_count": 78, "metadata": {}, "outputs": [ { @@ -294,7 +294,7 @@ }, { "cell_type": "code", - "execution_count": 65, + "execution_count": 79, "metadata": {}, "outputs": [ { @@ -324,7 +324,7 @@ "dtype: int64" ] }, - "execution_count": 65, + "execution_count": 79, "metadata": {}, "output_type": "execute_result" } @@ -336,7 +336,7 @@ }, { "cell_type": "code", - "execution_count": 66, + "execution_count": 80, "metadata": {}, "outputs": [ { @@ -381,7 +381,7 @@ }, { "cell_type": "code", - "execution_count": 67, + "execution_count": 81, "metadata": {}, "outputs": [ { @@ -603,7 +603,7 @@ "max 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 " ] }, - "execution_count": 67, + "execution_count": 81, "metadata": {}, "output_type": "execute_result" } @@ -625,7 +625,7 @@ }, { "cell_type": "code", - "execution_count": 68, + "execution_count": 82, "metadata": {}, "outputs": [ { @@ -722,7 +722,7 @@ "4 2/18/2015 510000.0 3 2.00 1680 Average 1987" ] 
}, - "execution_count": 68, + "execution_count": 82, "metadata": {}, "output_type": "execute_result" } @@ -735,7 +735,7 @@ }, { "cell_type": "code", - "execution_count": 69, + "execution_count": 83, "metadata": {}, "outputs": [ { @@ -751,7 +751,7 @@ "dtype: int64" ] }, - "execution_count": 69, + "execution_count": 83, "metadata": {}, "output_type": "execute_result" } @@ -763,52 +763,626 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, + "execution_count": 84, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/html": [ + "
    \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    datepricebedroomsbathroomssqft_livingconditionyr_built
    02014221900.031.001180Average1955
    12014538000.032.252570Average1951
    22015180000.021.00770Average1933
    32014604000.043.001960Very Good1965
    42015510000.032.001680Average1987
    \n", + "
    " + ], + "text/plain": [ + " date price bedrooms bathrooms sqft_living condition yr_built\n", + "0 2014 221900.0 3 1.00 1180 Average 1955\n", + "1 2014 538000.0 3 2.25 2570 Average 1951\n", + "2 2015 180000.0 2 1.00 770 Average 1933\n", + "3 2014 604000.0 4 3.00 1960 Very Good 1965\n", + "4 2015 510000.0 3 2.00 1680 Average 1987" + ] + }, + "execution_count": 84, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#converting date to datetime format\n", + "df['date']=pd.to_datetime(df['date'])\n", + "#extracting year from date column\n", + "df.date=df['date'].dt.year\n", + "df.head()" + ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 85, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/html": [ + "
    \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    pricebedroomsbathroomssqft_livingconditionyr_builtsell_yr
    0221900.031.001180Average19552014
    1538000.032.252570Average19512014
    2180000.021.00770Average19332015
    3604000.043.001960Very Good19652014
    4510000.032.001680Average19872015
    \n", + "
    " + ], + "text/plain": [ + " price bedrooms bathrooms sqft_living condition yr_built sell_yr\n", + "0 221900.0 3 1.00 1180 Average 1955 2014\n", + "1 538000.0 3 2.25 2570 Average 1951 2014\n", + "2 180000.0 2 1.00 770 Average 1933 2015\n", + "3 604000.0 4 3.00 1960 Very Good 1965 2014\n", + "4 510000.0 3 2.00 1680 Average 1987 2015" + ] + }, + "execution_count": 85, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Creating a new column for sell year\n", + "df['sell_yr'] = pd.to_datetime(df['date'],format='%Y').dt.year\n", + "df.drop(columns='date', inplace=True)\n", + "df.head()" + ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 86, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/html": [ + "
    \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    pricebedroomsbathroomssqft_livingconditionyr_builtsell_yrhouse_age
    0221900.031.001180Average1955201459
    1538000.032.252570Average1951201463
    2180000.021.00770Average1933201582
    3604000.043.001960Very Good1965201449
    4510000.032.001680Average1987201528
    \n", + "
    " + ], + "text/plain": [ + " price bedrooms bathrooms sqft_living condition yr_built sell_yr \\\n", + "0 221900.0 3 1.00 1180 Average 1955 2014 \n", + "1 538000.0 3 2.25 2570 Average 1951 2014 \n", + "2 180000.0 2 1.00 770 Average 1933 2015 \n", + "3 604000.0 4 3.00 1960 Very Good 1965 2014 \n", + "4 510000.0 3 2.00 1680 Average 1987 2015 \n", + "\n", + " house_age \n", + "0 59 \n", + "1 63 \n", + "2 82 \n", + "3 49 \n", + "4 28 " + ] + }, + "execution_count": 86, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#creating column house age at year of sale\n", + "df['house_age']=df['sell_yr']-df['yr_built']\n", + "df.head()" + ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 87, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/html": [ + "
    \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    pricebedroomsbathroomssqft_livingconditionyr_builtsell_yrhouse_age
    0221900.031.00118031955201459
    1538000.032.25257031951201463
    2180000.021.0077031933201582
    3604000.043.00196051965201449
    4510000.032.00168031987201528
    \n", + "
    " + ], + "text/plain": [ + " price bedrooms bathrooms sqft_living condition yr_built sell_yr \\\n", + "0 221900.0 3 1.00 1180 3 1955 2014 \n", + "1 538000.0 3 2.25 2570 3 1951 2014 \n", + "2 180000.0 2 1.00 770 3 1933 2015 \n", + "3 604000.0 4 3.00 1960 5 1965 2014 \n", + "4 510000.0 3 2.00 1680 3 1987 2015 \n", + "\n", + " house_age \n", + "0 59 \n", + "1 63 \n", + "2 82 \n", + "3 49 \n", + "4 28 " + ] + }, + "execution_count": 87, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#transformation of categorical values 'condition' column\n", + "df['condition'].replace(to_replace=['Poor', 'Fair', 'Average', 'Good', 'Very Good'], value=[1, 2, 3, 4, 5], inplace=True)\n", + "df.head()" + ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 88, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/html": [ + "
    \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    pricebedroomsbathroomssqft_livingconditionyr_builtsell_yrhouse_age
    0221900.031.00118031955201459
    1538000.032.25257031951201463
    2180000.021.0077031933201582
    3604000.043.00196051965201449
    4510000.032.00168031987201528
    \n", + "
    " + ], + "text/plain": [ + " price bedrooms bathrooms sqft_living condition yr_built sell_yr \\\n", + "0 221900.0 3 1.00 1180 3 1955 2014 \n", + "1 538000.0 3 2.25 2570 3 1951 2014 \n", + "2 180000.0 2 1.00 770 3 1933 2015 \n", + "3 604000.0 4 3.00 1960 5 1965 2014 \n", + "4 510000.0 3 2.00 1680 3 1987 2015 \n", + "\n", + " house_age \n", + "0 59 \n", + "1 63 \n", + "2 82 \n", + "3 49 \n", + "4 28 " + ] + }, + "execution_count": 88, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#converting the 'sell_yr' column to int64\n", + "df['sell_yr'] = df['sell_yr'].astype('int64')\n", + "df.head()" + ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 75, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 21597 entries, 0 to 21596\n", + "Data columns (total 8 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 price 21597 non-null float64\n", + " 1 bedrooms 21597 non-null int64 \n", + " 2 bathrooms 21597 non-null float64\n", + " 3 sqft_living 21597 non-null int64 \n", + " 4 condition 21597 non-null int64 \n", + " 5 yr_built 21597 non-null int64 \n", + " 6 sell_yr 21597 non-null int64 \n", + " 7 house_age 21597 non-null int64 \n", + "dtypes: float64(2), int64(6)\n", + "memory usage: 1.3 MB\n" + ] + } + ], + "source": [ + "df.info()" + ] }, { "cell_type": "code", From 77c916b1260f1ff52b6ac1e9dcb6d65eab356c41 Mon Sep 17 00:00:00 2001 From: Sandra Kiptum <100568848+Sandrakiptumm@users.noreply.github.com> Date: Sun, 28 Apr 2024 14:56:07 +0300 Subject: [PATCH 16/25] Update student.ipynb --- student.ipynb | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/student.ipynb b/student.ipynb index 9959cf3b..49b18223 100644 --- a/student.ipynb +++ b/student.ipynb @@ -1472,16 +1472,18 @@ "codemirror_mode": { "name": "ipython", "version": 3 - }, - 
"file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - feature/data-preparation - "version": "3.11.5" - main - } + + { + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5", + "feature": "data-preparation", + "main": true +} + }, "nbformat": 4, "nbformat_minor": 2 From 8ca69ac8fedf7107c0e7809bd2fd1077e4175a1e Mon Sep 17 00:00:00 2001 From: sandrakiptumm Date: Sun, 28 Apr 2024 15:19:16 +0300 Subject: [PATCH 17/25] Revert "added additional information" This reverts commit eb067369b051a15a46737bc67ace4576e3c30230. --- student.ipynb | 650 +++----------------------------------------------- 1 file changed, 38 insertions(+), 612 deletions(-) diff --git a/student.ipynb b/student.ipynb index 49b18223..64a1b461 100644 --- a/student.ipynb +++ b/student.ipynb @@ -32,7 +32,7 @@ }, { "cell_type": "code", - "execution_count": 76, + "execution_count": 60, "metadata": {}, "outputs": [], "source": [ @@ -56,7 +56,7 @@ }, { "cell_type": "code", - "execution_count": 77, + "execution_count": 61, "metadata": {}, "outputs": [ { @@ -261,7 +261,7 @@ "[5 rows x 21 columns]" ] }, - "execution_count": 77, + "execution_count": 61, "metadata": {}, "output_type": "execute_result" } @@ -273,7 +273,7 @@ }, { "cell_type": "code", - "execution_count": 78, + "execution_count": 64, "metadata": {}, "outputs": [ { @@ -294,7 +294,7 @@ }, { "cell_type": "code", - "execution_count": 79, + "execution_count": 65, "metadata": {}, "outputs": [ { @@ -324,7 +324,7 @@ "dtype: int64" ] }, - "execution_count": 79, + "execution_count": 65, "metadata": {}, "output_type": "execute_result" } @@ -336,7 +336,7 @@ }, { "cell_type": "code", - "execution_count": 80, + "execution_count": 66, "metadata": {}, "outputs": [ { @@ -381,7 +381,7 @@ }, { "cell_type": "code", - "execution_count": 81, + "execution_count": 67, "metadata": 
{}, "outputs": [ { @@ -603,7 +603,7 @@ "max 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 " ] }, - "execution_count": 81, + "execution_count": 67, "metadata": {}, "output_type": "execute_result" } @@ -625,7 +625,7 @@ }, { "cell_type": "code", - "execution_count": 82, + "execution_count": 68, "metadata": {}, "outputs": [ { @@ -722,7 +722,7 @@ "4 2/18/2015 510000.0 3 2.00 1680 Average 1987" ] }, - "execution_count": 82, + "execution_count": 68, "metadata": {}, "output_type": "execute_result" } @@ -735,7 +735,7 @@ }, { "cell_type": "code", - "execution_count": 83, + "execution_count": 69, "metadata": {}, "outputs": [ { @@ -751,7 +751,7 @@ "dtype: int64" ] }, - "execution_count": 83, + "execution_count": 69, "metadata": {}, "output_type": "execute_result" } @@ -763,626 +763,52 @@ }, { "cell_type": "code", - "execution_count": 84, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
    \n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    datepricebedroomsbathroomssqft_livingconditionyr_built
    02014221900.031.001180Average1955
    12014538000.032.252570Average1951
    22015180000.021.00770Average1933
    32014604000.043.001960Very Good1965
    42015510000.032.001680Average1987
    \n", - "
    " - ], - "text/plain": [ - " date price bedrooms bathrooms sqft_living condition yr_built\n", - "0 2014 221900.0 3 1.00 1180 Average 1955\n", - "1 2014 538000.0 3 2.25 2570 Average 1951\n", - "2 2015 180000.0 2 1.00 770 Average 1933\n", - "3 2014 604000.0 4 3.00 1960 Very Good 1965\n", - "4 2015 510000.0 3 2.00 1680 Average 1987" - ] - }, - "execution_count": 84, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#converting date to datetime format\n", - "df['date']=pd.to_datetime(df['date'])\n", - "#extracting year from date column\n", - "df.date=df['date'].dt.year\n", - "df.head()" - ] + "outputs": [], + "source": [] }, { "cell_type": "code", - "execution_count": 85, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
    \n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    pricebedroomsbathroomssqft_livingconditionyr_builtsell_yr
    0221900.031.001180Average19552014
    1538000.032.252570Average19512014
    2180000.021.00770Average19332015
    3604000.043.001960Very Good19652014
    4510000.032.001680Average19872015
    \n", - "
    " - ], - "text/plain": [ - " price bedrooms bathrooms sqft_living condition yr_built sell_yr\n", - "0 221900.0 3 1.00 1180 Average 1955 2014\n", - "1 538000.0 3 2.25 2570 Average 1951 2014\n", - "2 180000.0 2 1.00 770 Average 1933 2015\n", - "3 604000.0 4 3.00 1960 Very Good 1965 2014\n", - "4 510000.0 3 2.00 1680 Average 1987 2015" - ] - }, - "execution_count": 85, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Creating a new column for sell year\n", - "df['sell_yr'] = pd.to_datetime(df['date'],format='%Y').dt.year\n", - "df.drop(columns='date', inplace=True)\n", - "df.head()" - ] + "outputs": [], + "source": [] }, { "cell_type": "code", - "execution_count": 86, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
    \n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    pricebedroomsbathroomssqft_livingconditionyr_builtsell_yrhouse_age
    0221900.031.001180Average1955201459
    1538000.032.252570Average1951201463
    2180000.021.00770Average1933201582
    3604000.043.001960Very Good1965201449
    4510000.032.001680Average1987201528
    \n", - "
    " - ], - "text/plain": [ - " price bedrooms bathrooms sqft_living condition yr_built sell_yr \\\n", - "0 221900.0 3 1.00 1180 Average 1955 2014 \n", - "1 538000.0 3 2.25 2570 Average 1951 2014 \n", - "2 180000.0 2 1.00 770 Average 1933 2015 \n", - "3 604000.0 4 3.00 1960 Very Good 1965 2014 \n", - "4 510000.0 3 2.00 1680 Average 1987 2015 \n", - "\n", - " house_age \n", - "0 59 \n", - "1 63 \n", - "2 82 \n", - "3 49 \n", - "4 28 " - ] - }, - "execution_count": 86, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#creating column house age at year of sale\n", - "df['house_age']=df['sell_yr']-df['yr_built']\n", - "df.head()" - ] + "outputs": [], + "source": [] }, { "cell_type": "code", - "execution_count": 87, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
    \n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    pricebedroomsbathroomssqft_livingconditionyr_builtsell_yrhouse_age
    0221900.031.00118031955201459
    1538000.032.25257031951201463
    2180000.021.0077031933201582
    3604000.043.00196051965201449
    4510000.032.00168031987201528
    \n", - "
    " - ], - "text/plain": [ - " price bedrooms bathrooms sqft_living condition yr_built sell_yr \\\n", - "0 221900.0 3 1.00 1180 3 1955 2014 \n", - "1 538000.0 3 2.25 2570 3 1951 2014 \n", - "2 180000.0 2 1.00 770 3 1933 2015 \n", - "3 604000.0 4 3.00 1960 5 1965 2014 \n", - "4 510000.0 3 2.00 1680 3 1987 2015 \n", - "\n", - " house_age \n", - "0 59 \n", - "1 63 \n", - "2 82 \n", - "3 49 \n", - "4 28 " - ] - }, - "execution_count": 87, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#transformation of categorical values 'condition' column\n", - "df['condition'].replace(to_replace=['Poor', 'Fair', 'Average', 'Good', 'Very Good'], value=[1, 2, 3, 4, 5], inplace=True)\n", - "df.head()" - ] + "outputs": [], + "source": [] }, { "cell_type": "code", - "execution_count": 88, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
    \n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    pricebedroomsbathroomssqft_livingconditionyr_builtsell_yrhouse_age
    0221900.031.00118031955201459
    1538000.032.25257031951201463
    2180000.021.0077031933201582
    3604000.043.00196051965201449
    4510000.032.00168031987201528
    \n", - "
    " - ], - "text/plain": [ - " price bedrooms bathrooms sqft_living condition yr_built sell_yr \\\n", - "0 221900.0 3 1.00 1180 3 1955 2014 \n", - "1 538000.0 3 2.25 2570 3 1951 2014 \n", - "2 180000.0 2 1.00 770 3 1933 2015 \n", - "3 604000.0 4 3.00 1960 5 1965 2014 \n", - "4 510000.0 3 2.00 1680 3 1987 2015 \n", - "\n", - " house_age \n", - "0 59 \n", - "1 63 \n", - "2 82 \n", - "3 49 \n", - "4 28 " - ] - }, - "execution_count": 88, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#converting the 'sell_yr' column to int64\n", - "df['sell_yr'] = df['sell_yr'].astype('int64')\n", - "df.head()" - ] + "outputs": [], + "source": [] }, { "cell_type": "code", - "execution_count": 75, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 21597 entries, 0 to 21596\n", - "Data columns (total 8 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 price 21597 non-null float64\n", - " 1 bedrooms 21597 non-null int64 \n", - " 2 bathrooms 21597 non-null float64\n", - " 3 sqft_living 21597 non-null int64 \n", - " 4 condition 21597 non-null int64 \n", - " 5 yr_built 21597 non-null int64 \n", - " 6 sell_yr 21597 non-null int64 \n", - " 7 house_age 21597 non-null int64 \n", - "dtypes: float64(2), int64(6)\n", - "memory usage: 1.3 MB\n" - ] - } - ], - "source": [ - "df.info()" - ] + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] }, { "cell_type": "code", From 9a13fdc738c3b9d4503d3e747366b177a6f76813 Mon Sep 17 00:00:00 2001 From: Sandra Kiptum <100568848+Sandrakiptumm@users.noreply.github.com> Date: Sun, 28 Apr 2024 15:41:34 +0300 Subject: [PATCH 18/25] Update student.ipynb updating main json --- student.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/student.ipynb b/student.ipynb index 
64a1b461..055a180a 100644 --- a/student.ipynb +++ b/student.ipynb @@ -906,8 +906,8 @@ "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5", - "feature": "data-preparation", - "main": true + "feature/data-preparation": "main" + } }, From 32ee66dcfacb98a831997b91eea0ffe5b29a8ac1 Mon Sep 17 00:00:00 2001 From: Sandra Kiptum <100568848+Sandrakiptumm@users.noreply.github.com> Date: Sun, 28 Apr 2024 15:56:49 +0300 Subject: [PATCH 19/25] Update student.ipynb --- student.ipynb | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/student.ipynb b/student.ipynb index 055a180a..74572468 100644 --- a/student.ipynb +++ b/student.ipynb @@ -899,17 +899,16 @@ "name": "ipython", "version": 3 - { - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5", - "feature/data-preparation": "main" - -} - + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5", + "feature/data-preparation": + "main" + } }, "nbformat": 4, "nbformat_minor": 2 From ca3a1af0a138dcc0fc423c7e6f8321bd6ac24443 Mon Sep 17 00:00:00 2001 From: sandrakiptumm Date: Sun, 28 Apr 2024 20:15:22 +0300 Subject: [PATCH 20/25] Reverting student.ipynb --- student.ipynb | 872 +------------------------------------------------- 1 file changed, 3 insertions(+), 869 deletions(-) diff --git a/student.ipynb b/student.ipynb index 74572468..82ef194b 100644 --- a/student.ipynb +++ b/student.ipynb @@ -22,870 +22,6 @@ "source": [ "# Your code here - remember to use markdown cells for comments as well!" 
] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "DATA PREPARATION" - ] - }, - { - "cell_type": "code", - "execution_count": 60, - "metadata": {}, - "outputs": [], - "source": [ - "import csv\n", - "import pandas as pd\n", - "import sqlite3\n", - "import numpy as np\n", - "import seaborn as sns\n", - "import matplotlib.pyplot as plt\n", - "from statsmodels.api import OLS\n", - "%matplotlib inline\n", - "\n", - "from sklearn.linear_model import LinearRegression\n", - "from sklearn.metrics import mean_squared_error\n", - "from sklearn.preprocessing import PolynomialFeatures\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.preprocessing import StandardScaler\n", - "from sklearn.model_selection import cross_validate\n", - "from sklearn.preprocessing import LabelEncoder" - ] - }, - { - "cell_type": "code", - "execution_count": 61, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table omitted: df.head() preview, 5 rows × 21 columns; identical to the text/plain output that follows]
    " - ], - "text/plain": [ - " id date price bedrooms bathrooms sqft_living \\\n", - "0 7129300520 10/13/2014 221900.0 3 1.00 1180 \n", - "1 6414100192 12/9/2014 538000.0 3 2.25 2570 \n", - "2 5631500400 2/25/2015 180000.0 2 1.00 770 \n", - "3 2487200875 12/9/2014 604000.0 4 3.00 1960 \n", - "4 1954400510 2/18/2015 510000.0 3 2.00 1680 \n", - "\n", - " sqft_lot floors waterfront view ... grade sqft_above \\\n", - "0 5650 1.0 NaN NONE ... 7 Average 1180 \n", - "1 7242 2.0 NO NONE ... 7 Average 2170 \n", - "2 10000 1.0 NO NONE ... 6 Low Average 770 \n", - "3 5000 1.0 NO NONE ... 7 Average 1050 \n", - "4 8080 1.0 NO NONE ... 8 Good 1680 \n", - "\n", - " sqft_basement yr_built yr_renovated zipcode lat long \\\n", - "0 0.0 1955 0.0 98178 47.5112 -122.257 \n", - "1 400.0 1951 1991.0 98125 47.7210 -122.319 \n", - "2 0.0 1933 NaN 98028 47.7379 -122.233 \n", - "3 910.0 1965 0.0 98136 47.5208 -122.393 \n", - "4 0.0 1987 0.0 98074 47.6168 -122.045 \n", - "\n", - " sqft_living15 sqft_lot15 \n", - "0 1340 5650 \n", - "1 1690 7639 \n", - "2 2720 8062 \n", - "3 1360 5000 \n", - "4 1800 7503 \n", - "\n", - "[5 rows x 21 columns]" - ] - }, - "execution_count": 61, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df = pd.read_csv(\"data/kc_house_data.csv\")\n", - "df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',\n", - " 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',\n", - " 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',\n", - " 'lat', 'long', 'sqft_living15', 'sqft_lot15'],\n", - " dtype='object')\n" - ] - } - ], - "source": [ - "print(df.columns)\n" - ] - }, - { - "cell_type": "code", - "execution_count": 65, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "id 0\n", - "date 0\n", - "price 0\n", - 
"bedrooms 0\n", - "bathrooms 0\n", - "sqft_living 0\n", - "sqft_lot 0\n", - "floors 0\n", - "waterfront 2376\n", - "view 63\n", - "condition 0\n", - "grade 0\n", - "sqft_above 0\n", - "sqft_basement 0\n", - "yr_built 0\n", - "yr_renovated 3842\n", - "zipcode 0\n", - "lat 0\n", - "long 0\n", - "sqft_living15 0\n", - "sqft_lot15 0\n", - "dtype: int64" - ] - }, - "execution_count": 65, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#checking null values\n", - "df.isna().sum()" - ] - }, - { - "cell_type": "code", - "execution_count": 66, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 21597 entries, 0 to 21596\n", - "Data columns (total 21 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 id 21597 non-null int64 \n", - " 1 date 21597 non-null object \n", - " 2 price 21597 non-null float64\n", - " 3 bedrooms 21597 non-null int64 \n", - " 4 bathrooms 21597 non-null float64\n", - " 5 sqft_living 21597 non-null int64 \n", - " 6 sqft_lot 21597 non-null int64 \n", - " 7 floors 21597 non-null float64\n", - " 8 waterfront 19221 non-null object \n", - " 9 view 21534 non-null object \n", - " 10 condition 21597 non-null object \n", - " 11 grade 21597 non-null object \n", - " 12 sqft_above 21597 non-null int64 \n", - " 13 sqft_basement 21597 non-null object \n", - " 14 yr_built 21597 non-null int64 \n", - " 15 yr_renovated 17755 non-null float64\n", - " 16 zipcode 21597 non-null int64 \n", - " 17 lat 21597 non-null float64\n", - " 18 long 21597 non-null float64\n", - " 19 sqft_living15 21597 non-null int64 \n", - " 20 sqft_lot15 21597 non-null int64 \n", - "dtypes: float64(6), int64(9), object(6)\n", - "memory usage: 3.5+ MB\n" - ] - } - ], - "source": [ - "#checking on data information\n", - "df.info()" - ] - }, - { - "cell_type": "code", - "execution_count": 67, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table omitted: df.describe() summary; identical to the text/plain output that follows]
    " - ], - "text/plain": [ - " id price bedrooms bathrooms sqft_living \\\n", - "count 2.159700e+04 2.159700e+04 21597.000000 21597.000000 21597.000000 \n", - "mean 4.580474e+09 5.402966e+05 3.373200 2.115826 2080.321850 \n", - "std 2.876736e+09 3.673681e+05 0.926299 0.768984 918.106125 \n", - "min 1.000102e+06 7.800000e+04 1.000000 0.500000 370.000000 \n", - "25% 2.123049e+09 3.220000e+05 3.000000 1.750000 1430.000000 \n", - "50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 \n", - "75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 \n", - "max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 \n", - "\n", - " sqft_lot floors sqft_above yr_built yr_renovated \\\n", - "count 2.159700e+04 21597.000000 21597.000000 21597.000000 17755.000000 \n", - "mean 1.509941e+04 1.494096 1788.596842 1970.999676 83.636778 \n", - "std 4.141264e+04 0.539683 827.759761 29.375234 399.946414 \n", - "min 5.200000e+02 1.000000 370.000000 1900.000000 0.000000 \n", - "25% 5.040000e+03 1.000000 1190.000000 1951.000000 0.000000 \n", - "50% 7.618000e+03 1.500000 1560.000000 1975.000000 0.000000 \n", - "75% 1.068500e+04 2.000000 2210.000000 1997.000000 0.000000 \n", - "max 1.651359e+06 3.500000 9410.000000 2015.000000 2015.000000 \n", - "\n", - " zipcode lat long sqft_living15 sqft_lot15 \n", - "count 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 \n", - "mean 98077.951845 47.560093 -122.213982 1986.620318 12758.283512 \n", - "std 53.513072 0.138552 0.140724 685.230472 27274.441950 \n", - "min 98001.000000 47.155900 -122.519000 399.000000 651.000000 \n", - "25% 98033.000000 47.471100 -122.328000 1490.000000 5100.000000 \n", - "50% 98065.000000 47.571800 -122.231000 1840.000000 7620.000000 \n", - "75% 98118.000000 47.678000 -122.125000 2360.000000 10083.000000 \n", - "max 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 " - ] - }, - "execution_count": 67, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - 
"#statistical summary\n", - "df.describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Statistical summary observation\n", - "The count is 21597 for every numeric column except yr_renovated (17755), so yr_renovated has missing values; waterfront and view also contain nulls, as df.isna().sum() shows.\n", - "The mean house price is USD 540297, the minimum is USD 78000 and the maximum is USD 7700000.\n", - "The standard deviation of the house price is USD 367368." - ] - }, - { - "cell_type": "code", - "execution_count": 68, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table omitted: df.head() preview of the seven retained columns; identical to the text/plain output that follows]
    " - ], - "text/plain": [ - " date price bedrooms bathrooms sqft_living condition yr_built\n", - "0 10/13/2014 221900.0 3 1.00 1180 Average 1955\n", - "1 12/9/2014 538000.0 3 2.25 2570 Average 1951\n", - "2 2/25/2015 180000.0 2 1.00 770 Average 1933\n", - "3 12/9/2014 604000.0 4 3.00 1960 Very Good 1965\n", - "4 2/18/2015 510000.0 3 2.00 1680 Average 1987" - ] - }, - "execution_count": 68, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#we'll focus on the columns mentioned above and drop the rest\n", - "df.drop(columns=['id','lat','long','sqft_lot','floors','waterfront','view','zipcode','sqft_living15','sqft_lot15','grade','yr_renovated','sqft_above','sqft_basement'],inplace=True)\n", - "df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 69, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "date 0\n", - "price 0\n", - "bedrooms 0\n", - "bathrooms 0\n", - "sqft_living 0\n", - "condition 0\n", - "yr_built 0\n", - "dtype: int64" - ] - }, - "execution_count": 69, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#checking for for null values\n", - "df.isna().sum()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": 
null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { @@ -898,16 +34,14 @@ "codemirror_mode": { "name": "ipython", "version": 3 - - }, + }, + "feature/data-preparation": "main", "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.5", - "feature/data-preparation": - "main" + "version": "3.11.5" } }, "nbformat": 4, From b776935908faf95b524d4aaa36cea14334beab8f Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Sun, 28 Apr 2024 20:27:51 +0300 Subject: [PATCH 21/25] Update student.ipynb --- student.ipynb | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/student.ipynb b/student.ipynb index c2d72a52..dcd5adbb 100644 --- a/student.ipynb +++ b/student.ipynb @@ -63,6 +63,10 @@ "\n", "**The King County House Sales dataset contains 
the following columns;**\n", "\n", + "id - unique identifier for a house\n", + "\n", + "date - Date house was sold \n", + "\n", "Price - Sale price (prediction target)\n", "\n", "bedrooms - Number of bedrooms,\n", From f5f282a63536e227e6ad900d76b9885efc752b83 Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Thu, 2 May 2024 13:00:27 +0300 Subject: [PATCH 22/25] Update student.ipynb --- student.ipynb | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/student.ipynb b/student.ipynb index dcd5adbb..e5cca6b8 100644 --- a/student.ipynb +++ b/student.ipynb @@ -31,7 +31,8 @@ "\n", "The real estate market plays a crucial role in the economic health and stability of a region. Understanding the factors that influence house prices is essential for both buyers and sellers to navigate the market effectively. This project focuses on a specific northwestern county in the United States, aiming to shed light on the key determinants of property valuation in this area.\n", " ##### Problem Statements:\n", - "What are the most significant factors influencing house prices in this northwestern county?How can we quantify the relationship between these factors and property value?Can we develop a reliable model to predict house prices based on relevant characteristics?\n", + "We want to find out what makes houses expensive in a certain county in the northwest US. We'll also look at ways to measure how much these things like number of bedrooms or location affect the price. 
Finally, we'll see if we can build a tool to predict house prices based on these important features.\n", + "\n", " ##### Challenges:\n", "* Real estate data can be complex and multifaceted, encompassing various property features and local market trends.\n", "* Accurately identifying and quantifying the relative impact of each factor on house prices can be challenging.\n", From 7a14aa6d5336297a947f1b4edcb9d4a4e148919c Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Thu, 2 May 2024 14:24:45 +0300 Subject: [PATCH 23/25] updating business understanding --- student.ipynb | 39 ++++++++++++++++++--------------------- 1 file changed, 18 insertions(+), 21 deletions(-) diff --git a/student.ipynb b/student.ipynb index 681ea55f..26b0cfb4 100644 --- a/student.ipynb +++ b/student.ipynb @@ -30,28 +30,29 @@ " ###
  • **Business Understanding**\n", "\n", "The real estate market plays a crucial role in the economic health and stability of a region. Understanding the factors that influence house prices is essential for both buyers and sellers to navigate the market effectively. This project focuses on a specific northwestern county in the United States, aiming to shed light on the key determinants of property valuation in this area.\n", - " ##### Problem Statements:\n", + "\n", + " ##### Challenges of a Fluctuating Real Estate Market:\n", + "* Market fluctuations make it difficult for real estate agents to price houses and guide clients on offers.\n", + "* Rapid price fluctuations create a challenging environment for homebuyers, making it difficult to secure a good deal and avoid overpaying.\n", + "* Trying to pick the perfect moment to sell a house for maximum profit feels like playing the lottery – stressful, unpredictable, and with slim odds of success.\n", + "* High land prices and buyers struggling to afford homes make it difficult for builders to build new houses.\n", + "\n", + " \n", + "##### Problem Statements:\n", "We want to find out what makes houses expensive in a certain county in the northwest US. We'll also look at ways to measure how much these things like number of bedrooms or location affect the price. 
Finally, we'll see if we can build a tool to predict house prices based on these important features.\n", "\n", - " ##### Challenges:\n", - "* Real estate data can be complex and multifaceted, encompassing various property features and local market trends.\n", - "* Accurately identifying and quantifying the relative impact of each factor on house prices can be challenging.\n", - "* External factors like economic conditions and interest rates might also influence prices, requiring careful consideration.\n", + " ##### Conclusion\n", + " Our study looked at how the ups and downs of the housing market in a northwestern county are making things tough for everyone involved. To help out, we're building a tool to predict house prices. This will give real estate agents valuable information so they can give their clients the best advice in this unpredictable market.\n", "\n", - "##### Proposed Solutions:\n", + " ##### Proposed Solutions:\n", "We propose utilizing multiple linear regression, a powerful machine learning technique. This method allows us to analyze a large dataset of house sales and identify the statistical relationships between various property features (e.g., square footage, number of bedrooms, location) and the corresponding sale prices.\n", + "\n", + "\n", " ##### Objectives:\n", "1. Develop a robust multiple linear regression model that accurately predicts house prices in the chosen northwestern county.\n", "2. Identify the most significant factors influencing property value within this specific market.\n", "3. Provide valuable insights into the housing market dynamics of the region, benefiting potential buyers, real estate agents, and other stakeholders.\n", - " \n", - " \n", - "**Research questions that would help to achieve the objectives**:\n", - "\n", - "1. How does the number of bedrooms, bathrooms, grade and square footage of a house correlate with its sale price in King County?\n", - "2. 
How much can a homeowner expect the value of their home to increase after a specific renovation project?\n", - "3. Which renovation projects have the most significant impact on a home's market value in the northwestern county?\n", - "4. Are there specific combinations of renovation projects that provide an interdependent effect on a home's market value?" + " " ] }, { @@ -60,7 +61,8 @@ "source": [ " ###
  • **Data Understanding**\n", "\n", - "Our analysis leverages the King County House Sales dataset - a rich resource containing over 21,500 records and 20 distinct features(columns). Spanning house sales from May 2014 to May 2015, this dataset provides a comprehensive snapshot of the King County housing market during that period.\n", + "Our analysis leverages the King County House Sales dataset.This information is stored in a file called \"kc_house_data.csv\".\n", + " It's a rich resource containing over 21,500 records and 20 distinct features(columns). Spanning house sales from May 2014 to May 2015, this dataset provides a comprehensive snapshot of the King County housing market during that period.\n", "\n", "**The King County House Sales dataset contains the following columns;**\n", "\n", @@ -103,12 +105,7 @@ "sell_yr - Date house was sold.\n", "\n", "\n", - "We need to be aware of certain constraints within the data, as these might influence our analysis and interpretation of the results. From the sources;\n", - "\n", - "1. The data may contain anomalies or inconsistencies that require careful examination during analysis. For instance, a record lists a house with 33 bedrooms, which appears to be an outlier\n", - "\n", - "2. It's important to consider the time frame of the data (May 2014 - May 2015) as it may not fully capture the current market dynamics in King County.\n", - "3. It's important to acknowledge the scope of the data. While it provides details on house features, it may not capture external factors such as interest rates or the overall economic climate, which can also play a role in determining property values." 
+ "\n" ] }, { From a1ab02ca4716f962b9264aafc3a1e78af92c6507 Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Thu, 2 May 2024 14:58:36 +0300 Subject: [PATCH 24/25] Data understanding --- student.ipynb | 368 +++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 351 insertions(+), 17 deletions(-) diff --git a/student.ipynb b/student.ipynb index 26b0cfb4..7e236fa4 100644 --- a/student.ipynb +++ b/student.ipynb @@ -62,8 +62,355 @@ " ###
  • **Data Understanding**\n", "\n", "Our analysis leverages the King County House Sales dataset.This information is stored in a file called \"kc_house_data.csv\".\n", - " It's a rich resource containing over 21,500 records and 20 distinct features(columns). Spanning house sales from May 2014 to May 2015, this dataset provides a comprehensive snapshot of the King County housing market during that period.\n", + " It's a rich resource containing over 21,500 records and 20 distinct features(columns). Spanning house sales from May 2014 to May 2015, this dataset provides a comprehensive snapshot of the King County housing market during that period.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[HTML table omitted: data.head() preview, 5 rows × 21 columns; identical to the text/plain output that follows]
    " + ], + "text/plain": [ + " id date price bedrooms bathrooms sqft_living \\\n", + "0 7129300520 10/13/2014 221900.0 3 1.00 1180 \n", + "1 6414100192 12/9/2014 538000.0 3 2.25 2570 \n", + "2 5631500400 2/25/2015 180000.0 2 1.00 770 \n", + "3 2487200875 12/9/2014 604000.0 4 3.00 1960 \n", + "4 1954400510 2/18/2015 510000.0 3 2.00 1680 \n", + "\n", + " sqft_lot floors waterfront view ... grade sqft_above \\\n", + "0 5650 1.0 NaN NONE ... 7 Average 1180 \n", + "1 7242 2.0 NO NONE ... 7 Average 2170 \n", + "2 10000 1.0 NO NONE ... 6 Low Average 770 \n", + "3 5000 1.0 NO NONE ... 7 Average 1050 \n", + "4 8080 1.0 NO NONE ... 8 Good 1680 \n", + "\n", + " sqft_basement yr_built yr_renovated zipcode lat long \\\n", + "0 0.0 1955 0.0 98178 47.5112 -122.257 \n", + "1 400.0 1951 1991.0 98125 47.7210 -122.319 \n", + "2 0.0 1933 NaN 98028 47.7379 -122.233 \n", + "3 910.0 1965 0.0 98136 47.5208 -122.393 \n", + "4 0.0 1987 0.0 98074 47.6168 -122.045 \n", + "\n", + " sqft_living15 sqft_lot15 \n", + "0 1340 5650 \n", + "1 1690 7639 \n", + "2 2720 8062 \n", + "3 1360 5000 \n", + "4 1800 7503 \n", + "\n", + "[5 rows x 21 columns]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "# Load the data\n", + "data = pd.read_csv('data/kc_house_data.csv')\n", + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dataframe Info:\n", + "\n", + "RangeIndex: 21597 entries, 0 to 21596\n", + "Data columns (total 21 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 21597 non-null int64 \n", + " 1 date 21597 non-null object \n", + " 2 price 21597 non-null float64\n", + " 3 bedrooms 21597 non-null int64 \n", + " 4 bathrooms 21597 non-null float64\n", + " 5 sqft_living 21597 non-null int64 \n", + " 6 sqft_lot 21597 non-null int64 
\n", + " 7 floors 21597 non-null float64\n", + " 8 waterfront 19221 non-null object \n", + " 9 view 21534 non-null object \n", + " 10 condition 21597 non-null object \n", + " 11 grade 21597 non-null object \n", + " 12 sqft_above 21597 non-null int64 \n", + " 13 sqft_basement 21597 non-null object \n", + " 14 yr_built 21597 non-null int64 \n", + " 15 yr_renovated 17755 non-null float64\n", + " 16 zipcode 21597 non-null int64 \n", + " 17 lat 21597 non-null float64\n", + " 18 long 21597 non-null float64\n", + " 19 sqft_living15 21597 non-null int64 \n", + " 20 sqft_lot15 21597 non-null int64 \n", + "dtypes: float64(6), int64(9), object(6)\n", + "memory usage: 3.5+ MB\n", + "(21597, 21)\n", + "\n", + "Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',\n", + " 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',\n", + " 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',\n", + " 'lat', 'long', 'sqft_living15', 'sqft_lot15'],\n", + " dtype='object')\n", + "\n", + "id int64\n", + "date object\n", + "price float64\n", + "bedrooms int64\n", + "bathrooms float64\n", + "sqft_living int64\n", + "sqft_lot int64\n", + "floors float64\n", + "waterfront object\n", + "view object\n", + "condition object\n", + "grade object\n", + "sqft_above int64\n", + "sqft_basement object\n", + "yr_built int64\n", + "yr_renovated float64\n", + "zipcode int64\n", + "lat float64\n", + "long float64\n", + "sqft_living15 int64\n", + "sqft_lot15 int64\n", + "dtype: object\n", + "\n", + "None\n", + "\n", + " id price bedrooms bathrooms sqft_living \\\n", + "count 2.159700e+04 2.159700e+04 21597.000000 21597.000000 21597.000000 \n", + "mean 4.580474e+09 5.402966e+05 3.373200 2.115826 2080.321850 \n", + "std 2.876736e+09 3.673681e+05 0.926299 0.768984 918.106125 \n", + "min 1.000102e+06 7.800000e+04 1.000000 0.500000 370.000000 \n", + "25% 2.123049e+09 3.220000e+05 3.000000 1.750000 1430.000000 \n", + "50% 3.904930e+09 4.500000e+05 3.000000 2.250000 
1910.000000 \n", + "75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 \n", + "max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 \n", + "\n", + " sqft_lot floors sqft_above yr_built yr_renovated \\\n", + "count 2.159700e+04 21597.000000 21597.000000 21597.000000 17755.000000 \n", + "mean 1.509941e+04 1.494096 1788.596842 1970.999676 83.636778 \n", + "std 4.141264e+04 0.539683 827.759761 29.375234 399.946414 \n", + "min 5.200000e+02 1.000000 370.000000 1900.000000 0.000000 \n", + "25% 5.040000e+03 1.000000 1190.000000 1951.000000 0.000000 \n", + "50% 7.618000e+03 1.500000 1560.000000 1975.000000 0.000000 \n", + "75% 1.068500e+04 2.000000 2210.000000 1997.000000 0.000000 \n", + "max 1.651359e+06 3.500000 9410.000000 2015.000000 2015.000000 \n", + "\n", + " zipcode lat long sqft_living15 sqft_lot15 \n", + "count 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 \n", + "mean 98077.951845 47.560093 -122.213982 1986.620318 12758.283512 \n", + "std 53.513072 0.138552 0.140724 685.230472 27274.441950 \n", + "min 98001.000000 47.155900 -122.519000 399.000000 651.000000 \n", + "25% 98033.000000 47.471100 -122.328000 1490.000000 5100.000000 \n", + "50% 98065.000000 47.571800 -122.231000 1840.000000 7620.000000 \n", + "75% 98118.000000 47.678000 -122.125000 2360.000000 10083.000000 \n", + "max 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 \n" + ] + } + ], + "source": [ + "\n", + "def dataset_info(file_path):\n", + " data = pd.read_csv(file_path)\n", + " print(\"Dataframe Info:\")\n", + " print(data.shape, data.columns, data.dtypes, data.info(), data.describe(), sep=\"\\n\\n\")\n", "\n", + "dataset_info('data/kc_house_data.csv')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Data Buckets**\n", + "\n", + "There are 6 categorical columns, there are 12 numeric columns and 3 columns that contain temporal data." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "**The King County House Sales dataset contains the following columns;**\n", "\n", "id - unique identifier for a house\n", @@ -102,34 +449,21 @@ "\n", "sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors, and\n", "\n", - "sell_yr - Date house was sold.\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Your code here - remember to use markdown cells for comments as well!" + "sell_yr - Date house was sold." ] } ], "metadata": { "kernelspec": { - "display_name": "Python (learn-env)", + "display_name": "learn-env", "language": "python", - "name": "learn-env" + "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, - "feature/data-preparation": "main", "file_extension": ".py", "mimetype": "text/x-python", "name": "python", From c18538a0d7b7cdfc84d56ec76f040d6b735b0dfb Mon Sep 17 00:00:00 2001 From: Calvine Dasilver <162190387+Cdasilver29@users.noreply.github.com> Date: Thu, 2 May 2024 20:02:18 +0300 Subject: [PATCH 25/25] update --- student.ipynb | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/student.ipynb b/student.ipynb index 7e236fa4..139e4f02 100644 --- a/student.ipynb +++ b/student.ipynb @@ -404,7 +404,9 @@ "source": [ "**Data Buckets**\n", "\n", - "There are 6 categorical columns, there are 12 numeric columns and 3 columns that contain temporal data." + "* Six columns hold descriptive information such as condition and grade (categorical data).\n", + "* There are 12 columns with numerical values such as square footage and bedrooms (numeric data).\n", + "* Three columns contain time-related details such as year built (temporal data).\n" ] }, {