- Cleaning and preparing a messy dataset for analysis.
- Building SQL logic for data cleaning operations such as handling missing values, duplicates, etc.
- Exploratory data analysis using SQL.
- SQL (PostgreSQL)
- Python: pandas, psycopg2 (VS Code)
psycopg2 is a Python library used to connect to and interact with PostgreSQL databases.
To install psycopg2:
sudo apt update
pip3 install psycopg2
Step 1: Creating a database and a dynamic table
Step 2: Import the values into the created table
Step 3: Checking and changing data types
Step 4: Checking and cleaning NULL values
In an end-to-end data analytics project, data cleaning and data exploration are two important steps. Unless the data is cleaned, we can't discover actual insights from it.
I mostly use Excel Power Query for data cleaning (you can see one of my data analysis projects where I used Excel Power Query for data cleaning), but for a huge amount of data, or for high-dimensional data, Excel is not good enough.
Comparatively, a SQL engine is faster than Excel, and with SQL we can implement advanced logic for a task.
So I decided to do some data cleaning using SQL to sharpen my SQL logic.
The dataset is collected from Kaggle. It's a hotel booking dataset with a total of 32 fields and 119,390 rows. It is an ideal dataset for data cleaning practice as it contains lots of NULL values, missing values, and erroneous data.
The collected dataset in CSV format, before cleaning.
The collected dataset is in .csv format. We can insert data from a CSV file into a SQL database, but first we need to create a table with matching column names.
As of now, SQL databases can't create a table automatically from a CSV file.
For a dataset with a few columns (let's say 4 or 5), we can create the table manually, but for a high-dimensional dataset the manual process consumes a lot of time.
The selected dataset contains a total of 32 columns, so it's not a good idea to create the table manually and add all 32 columns.
So I used the psycopg2 API to interact with the PostgreSQL database and automate the creation of a dynamic table with dynamic column names and data types.
createTable(name='TableName',
            allcolumns=list_of_extracted_columns,
            alldatatypes=list_of_extracted_datatypes)
The function I created (see this file) can create a dynamic table in a database, with dynamic column names and data types extracted from the CSV file using pandas. In this way, we can automate the process of creating a database table from a CSV file.
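The helper itself isn't reproduced here, but its core idea can be sketched as follows (build_create_table_sql, the dtype map, and the demo frame are illustrative, not the actual implementation):

```python
import pandas as pd

# Map pandas dtypes to PostgreSQL types; anything else (pandas 'object')
# falls back to varchar, which is why strings and dates need fixing later.
DTYPE_MAP = {"int64": "integer", "float64": "double precision"}

def build_create_table_sql(df: pd.DataFrame, table_name: str) -> str:
    """Build a CREATE TABLE statement from a DataFrame's columns and dtypes."""
    cols = ", ".join(
        f"{col} {DTYPE_MAP.get(str(dtype), 'varchar')}"
        for col, dtype in df.dtypes.items()
    )
    return f"CREATE TABLE {table_name} ({cols});"

# A tiny frame standing in for the 32-column CSV
demo = pd.DataFrame({"hotel": ["Resort Hotel"], "lead_time": [342], "adr": [75.5]})
print(build_create_table_sql(demo, "bookings"))
# The generated statement can then be executed with psycopg2:
#   cur.execute(build_create_table_sql(df, "bookings"))
```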
We can also automate the process of inserting values into a database table from a CSV file using psycopg2. For this, we need to write at least 30 lines of code.
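For anyone who does want the scripted route, here is a hedged sketch using psycopg2's COPY support (the table name and connection string are placeholders):

```python
def build_copy_sql(table: str) -> str:
    # HEADER true tells PostgreSQL to skip the CSV's first (header) line
    return f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)"

def load_csv(path: str, dsn: str, table: str = "bookings") -> None:
    """Stream a CSV file into an existing table via COPY."""
    import psycopg2  # deferred import so the sketch reads without the driver
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        with open(path) as f:
            cur.copy_expert(build_copy_sql(table), f)

# load_csv("hotel_bookings.csv", "dbname=hotels user=postgres")
```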
But using pgAdmin, I can easily insert the values from a CSV file with the 'Import/Export Data' option.
When I can do something easily and quickly, why should I choose the longest way? This is my philosophy, and I am a little bit lazy :) :)
So I imported the values using the GUI option in pgAdmin.
Step 3: Checking and changing data types
From the CSV file, pandas extracted the column data types as 'int64', 'float64', and 'object'. But in SQL, the corresponding data types are 'integer', 'real', or 'double precision'. The function I created maps 'int64' to 'integer' and 'float64' to 'double precision', but all other data types (like string, date-time, and boolean) were identified as 'object' by pandas and were created as 'varchar' when the table was built.
So now it's time to check the columns (there are a total of 5 columns with wrong data types) and correct the respective data types where required.
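The exact five columns aren't listed in this section; as an illustration, assuming 'children' should be integer and 'reservation_status_date' should be date (both are columns in this Kaggle dataset), the corrections can be generated like this:

```python
# Target PostgreSQL types for columns that came out wrong; the real project
# fixes five columns in total, these two are just examples.
fixes = {
    "children": "integer",
    "reservation_status_date": "date",
}

def build_alter_sql(table: str, fixes: dict) -> list:
    """One ALTER TABLE per column, with a USING cast for the existing values."""
    return [
        f"ALTER TABLE {table} ALTER COLUMN {col} TYPE {typ} USING {col}::{typ};"
        for col, typ in fixes.items()
    ]

for stmt in build_alter_sql("bookings", fixes):
    print(stmt)
```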
Step 4: Checking and cleaning NULL values
Normally, the following SQL query helps to find the number of NULL values present in a column.
SELECT count(*) FROM Table
WHERE column is NULL;
But for a high-dimensional dataset it is not practical to check the NULL values in each column one by one. So what can we do to count the NULL values in all the columns at once?
I created a PL/pgSQL procedure that can check for the presence of NULL values in all columns, no matter how many columns a dataset has.
PL/pgSQL procedure to count NULL values in all columns
--Count column-wise number of NULL values.
create or replace procedure columnWiseNullCount(total_columns int)
language plpgsql
as $$
declare
columnName varchar(50);
total_nulls int;
begin
for i in 1..total_columns
loop
--Select the i-th column of the table
SELECT column_name into columnName
FROM information_schema.columns
WHERE table_name = 'bookings'
ORDER BY ordinal_position
LIMIT 1
OFFSET i-1;
--Count the NULL values in that column; dynamic SQL is needed
--because the column name is held in a variable
EXECUTE format('SELECT count(*) FROM bookings WHERE %I IS NULL', columnName)
INTO total_nulls;
if total_nulls > 0 then
raise notice '%, NULLs: %', columnName, total_nulls;
end if;
end loop;
end;
$$;
If we need to create a PL/pgSQL procedure for each task, it will again take a long time. Let's use the logic: "If I can do something easily, why should I work hard?"
So rather than working hard, let's work smart. Let's import the dataset into a Python IDE and generate an automated EDA report using pandas-profiling.
#import the libraries
import pandas as pd
import pandas_profiling as pf
# import the dataset
df = pd.read_csv('hotel_bookings.csv')
report = pf.ProfileReport(df,explorative=True, dark_mode=True)
report.to_file('report.html')
We can open the 'report.html' file in a browser just by clicking it, and based on the report we can do whatever needs to be done with this dataset.
The generated EDA report shows that the following columns contain missing values.
So now it has become easier to find which columns contain NULL values.
- Since the columns 'agent' and 'company' contain a high percentage of NULL values, the entire columns are deleted.
- Since the columns 'children' and 'country' contain only a few rows with NULL values, the respective rows are removed rather than removing the entire columns.
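The same two decisions expressed in pandas, on a toy frame (the actual cleaning was done in SQL):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the four columns in question
df = pd.DataFrame({
    "agent":    [np.nan, np.nan, 9.0],
    "company":  [np.nan, np.nan, np.nan],
    "children": [0.0, np.nan, 2.0],
    "country":  ["PRT", "GBR", None],
})

# Mostly-NULL columns: drop them entirely
df = df.drop(columns=["agent", "company"])
# Sparse NULLs: drop only the affected rows
df = df.dropna(subset=["children", "country"])
print(df)
```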
Check this file to see how I cleaned the missing and NULL values.
Using ctid in PostgreSQL, we can check the number of rows that are exact duplicates, and then delete the duplicate rows if there are any. Check this file.
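A quick pandas cross-check of the same idea (synthetic rows for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "hotel":     ["City Hotel", "City Hotel", "Resort Hotel"],
    "lead_time": [10, 10, 85],
})

# Rows that repeat an earlier row column-for-column
n_dupes = int(df.duplicated().sum())
print(f"exact duplicates: {n_dupes}")

# Keep the first occurrence, drop the rest
df = df.drop_duplicates()
```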
Now let's explore the dataset to answer the following questions:
- What is the busiest month, i.e., in which month were the most bookings made?
- Were there more customers on week nights or on weekend nights?
- Which kind of hotel is preferred more, city hotel or resort hotel?
- What is the most reserved room type?
- What is the most used way to book a hotel?
- How many customers bring babies or children with them?
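As an example of how the first question can be answered ('arrival_date_month' is a real column in this dataset; the rows below are synthetic):

```python
import pandas as pd

bookings = pd.DataFrame({
    "arrival_date_month": ["August", "August", "August", "July", "May"],
})

# Count bookings per month and take the month with the highest count
per_month = bookings["arrival_date_month"].value_counts()
busiest = per_month.idxmax()
print(busiest)  # August, on this toy data
```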