Data analysis #123

Closed
wants to merge 15 commits into from
Update student.ipynb
Drop nulls, duplicated values, and columns, and replace values.
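The cleaning steps this commit describes (dropping duplicates, nulls, and named columns, then replacing missing `waterfront` values) can be sketched on a toy frame as below. This is a minimal sketch under stated assumptions, not the notebook's code: the function name `clean_frame`, its keyword parameters, and the sample data are hypothetical and chosen only for illustration.

```python
import pandas as pd


def clean_frame(df, drop_duplicates=False, drop_nulls=False, drop_columns=None):
    """Return a cleaned copy of df (hypothetical helper, not the notebook's dropper)."""
    out = df.copy()
    if drop_duplicates:
        out = out.drop_duplicates()
    if drop_nulls:
        out = out.dropna()
    if drop_columns:
        out = out.drop(columns=drop_columns)
    return out.reset_index(drop=True)


# Toy data, made up for illustration: one duplicated row and one missing
# waterfront value.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [221900.0, 538000.0, 538000.0, 180000.0],
    "waterfront": [None, "NO", "NO", "YES"],
})

# Fill missing waterfront values with the string 'NONE' first, so that rows
# with an unknown waterfront status survive the later null drop.
toy["waterfront"] = toy["waterfront"].fillna("NONE")

cleaned = clean_frame(toy, drop_duplicates=True, drop_nulls=True)
waterfront_counts = cleaned["waterfront"].value_counts()
print(cleaned)
print(waterfront_counts)
```

Keyword flags are used here instead of positional request strings so that each call states explicitly which cleaning steps run. Filling before dropping is a design choice: reversing the order would discard every row whose `waterfront` value is missing.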
Sally-52 committed Apr 28, 2024
commit 459371ee39d26d043f0944b249ed8bd62189e47b
156 changes: 153 additions & 3 deletions student.ipynb
@@ -712,6 +712,25 @@
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row represents a specific house, and each column describes one of its characteristics. For example, the house with id 7129300520 is priced at 221900 and has three bedrooms, one bathroom, 1180 square feet of living space, and a lot of 5650 square feet.\n",
"The other rows are read the same way.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function transform_and_drop_yr_renovated(df) transforms the 'yr_renovated' column of a DataFrame and then drops the original column. In its place we now have a 'house_renovation' column stating whether a renovation took place, instead of the year in which it took place.\n",
"\n",
"This transformation categorizes each house as renovated ('Yes') or not ('No'), based on the presence or absence of a renovation year in the original 'yr_renovated' column.\n",
"\n",
"The df.head() statement prints the first few rows of the transformed DataFrame to check the result."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -820,12 +839,143 @@
"print (info_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 'waterfront' column has 2376 null values.\n",
"\n",
"The 'view' column has 63 null values.\n",
"\n",
"All other columns have zero null values.\n",
"\n",
"No Duplicated Rows Found: this line indicates that there are no duplicated rows in the DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Function for dropping duplicates, nulls, and columns, and replacing values\n",
"We use the Python function 'dropper' to clean a DataFrame by dropping duplicates, null values, and selected columns. The cell below also replaces the NaN values in the 'waterfront' column with the string 'NONE'."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": []
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" id date price bedrooms bathrooms sqft_living \\\n",
"0 7129300520 10/13/2014 221900.0 3 1.00 1180 \n",
"1 6414100192 12/9/2014 538000.0 3 2.25 2570 \n",
"2 5631500400 2/25/2015 180000.0 2 1.00 770 \n",
"3 2487200875 12/9/2014 604000.0 4 3.00 1960 \n",
"4 1954400510 2/18/2015 510000.0 3 2.00 1680 \n",
"... ... ... ... ... ... ... \n",
"21592 263000018 5/21/2014 360000.0 3 2.50 1530 \n",
"21593 6600060120 2/23/2015 400000.0 4 2.50 2310 \n",
"21594 1523300141 6/23/2014 402101.0 2 0.75 1020 \n",
"21595 291310100 1/16/2015 400000.0 3 2.50 1600 \n",
"21596 1523300157 10/15/2014 325000.0 2 0.75 1020 \n",
"\n",
" sqft_lot floors waterfront view ... grade sqft_above \\\n",
"0 5650 1.0 NONE NONE ... 7 Average 1180 \n",
"1 7242 2.0 NO NONE ... 7 Average 2170 \n",
"2 10000 1.0 NO NONE ... 6 Low Average 770 \n",
"3 5000 1.0 NO NONE ... 7 Average 1050 \n",
"4 8080 1.0 NO NONE ... 8 Good 1680 \n",
"... ... ... ... ... ... ... ... \n",
"21592 1131 3.0 NO NONE ... 8 Good 1530 \n",
"21593 5813 2.0 NO NONE ... 8 Good 2310 \n",
"21594 1350 2.0 NO NONE ... 7 Average 1020 \n",
"21595 2388 2.0 NONE NONE ... 8 Good 1600 \n",
"21596 1076 2.0 NO NONE ... 7 Average 1020 \n",
"\n",
" sqft_basement yr_built zipcode lat long sqft_living15 \\\n",
"0 0.0 1955 98178 47.5112 -122.257 1340 \n",
"1 400.0 1951 98125 47.7210 -122.319 1690 \n",
"2 0.0 1933 98028 47.7379 -122.233 2720 \n",
"3 910.0 1965 98136 47.5208 -122.393 1360 \n",
"4 0.0 1987 98074 47.6168 -122.045 1800 \n",
"... ... ... ... ... ... ... \n",
"21592 0.0 2009 98103 47.6993 -122.346 1530 \n",
"21593 0.0 2014 98146 47.5107 -122.362 1830 \n",
"21594 0.0 2009 98144 47.5944 -122.299 1020 \n",
"21595 0.0 2004 98027 47.5345 -122.069 1410 \n",
"21596 0.0 2008 98144 47.5941 -122.299 1020 \n",
"\n",
" sqft_lot15 house_renovation \n",
"0 5650 No \n",
"1 7639 Yes \n",
"2 8062 No \n",
"3 5000 No \n",
"4 7503 No \n",
"... ... ... \n",
"21592 1509 No \n",
"21593 7200 No \n",
"21594 2007 No \n",
"21595 1287 No \n",
"21596 1357 No \n",
"\n",
"[21597 rows x 21 columns]\n",
"NO 19075\n",
"NONE 2376\n",
"YES 146\n",
"Name: waterfront, dtype: int64\n"
]
}
],
"source": [
"def dropper(df, one=None, two=None, three=None):\n",
"    '''\n",
"    Clean a DataFrame according to up to three requests.\n",
"\n",
"    Input: DataFrame, request 1, request 2, request 3\n",
"    Each request may be:\n",
"        'duplicates' to drop duplicated rows\n",
"        'nulls' to drop rows containing null values\n",
"        a list of df column names to drop, l = ['', '', '']\n",
"    Returns the cleaned DataFrame.\n",
"    '''\n",
"    request = [one, two, three]\n",
"    if 'duplicates' in request:\n",
"        df = df.drop_duplicates()\n",
"    if 'nulls' in request:\n",
"        df = df.dropna()\n",
"    for req in request:\n",
"        # A list request names the columns to drop\n",
"        if isinstance(req, list):\n",
"            df = df.drop(columns=req).reset_index(drop=True)\n",
"    return df\n",
"\n",
"data_info = check_dtypes(df)\n",
"print(df)\n",
"\n",
"# Replace missing values in the waterfront column with the string 'NONE'\n",
"df['waterfront'] = df['waterfront'].fillna('NONE')\n",
"print(df['waterfront'].value_counts())"
]
},
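{
"cell_type": "markdown",
"metadata": {},
"source": [
"After counting the null values in the previous step, we can drop them with df = df.dropna(), which 'dropper' applies when it is called with the 'nulls' request. The cell above only defines 'dropper' and does not call it, so the printed DataFrame still has all 21597 rows."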
]
},
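{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this DataFrame, we replaced the NaN values in the 'waterfront' column with the string 'NONE' using fillna().\n",
"The 'waterfront' column holds text labels; the int64 dtype shown in the output belongs to the counts returned by value_counts(), not to the column itself.\n",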
"\n",
"'NO': There are 19075 occurrences of 'NO' in the 'waterfront' column. This indicates that these properties do not have a waterfront view.\n",
"\n",
"'NONE': There are 2376 occurrences of 'NONE' in the 'waterfront' column. This likely indicates that these records originally had missing values (NaN) for the waterfront attribute, and they have been replaced with the string 'NONE'.\n",
"\n",
"'YES': There are 146 occurrences of 'YES' in the 'waterfront' column. This indicates that these properties have a waterfront view."
]
}
],
"metadata": {