Azure Databricks repository is a set of blog posts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
- Dec 01: What is Azure Databricks
- Dec 02: How to get started with Azure Databricks
- Dec 03: Getting to know the workspace and Azure Databricks platform
- Dec 04: Creating your first Azure Databricks cluster
- Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
- Dec 06: Importing and storing data to Azure Databricks
- Dec 07: Starting with Databricks notebooks and loading data to DBFS
- Dec 08: Using Databricks CLI and DBFS CLI for file upload
- Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
- Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
- Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
- Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
- Dec 13: Using Python Databricks Koalas with Azure Databricks
- Dec 14: From configuration to execution of Databricks jobs
- Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
- Dec 16: Databricks experiments, models and MLFlow
In the past couple of days we looked into configuration and infrastructure. Today it is again time to do some analysis; let's call it an end-to-end analysis using R, Python, or SQL.
Create a new notebook (I am calling mine Day17_Analysis) and select Python as the default language. Attach a cluster to your notebook and start it (if it is not yet running). Import the data using SparkR:
```r
%r
library(SparkR)
data_r <- read.df("/FileStore/Day16_wine_quality.csv", source = "csv", header = "true")
display(data_r)
data_r <- as.data.frame(data_r)
```
And we can do the same with Python:
```python
import pandas as pd

data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=';')
```
We can also use Python to load the data and get an insight into the dataset:
```python
import matplotlib.pyplot as plt
import seaborn as sns

data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=',')
data_py.info()
```
We also import all the other packages that will be needed in the following steps:
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
```
So let's continue with Python. You can get a sense of the dataset by using the pandas describe function:
```python
data_py.describe()
```
We also handle duplicate values (remove them) and missing values (remove them or replace them with the mean value):
```python
# remove duplicates
sum(data_py.duplicated())
data_py.drop_duplicates(inplace=True)

# remove rows with empty values
data_py.isnull().sum(axis=0)
data_py.dropna(axis=0, how='any', inplace=True)

# fill the missing values with the mean
data_py.fillna(0, inplace=True)
data_py['quality'].fillna(data_py['quality'].mean(), inplace=True)
data_py.apply(lambda x: x.fillna(x.mean(), inplace=True), axis=0)
```
You can also find and filter out outliers by using the IQR (interquartile range):
```python
Q1 = data_py.quantile(0.25)
Q3 = data_py.quantile(0.75)
IQR = Q3 - Q1

# keep only the rows that fall within 1.5 * IQR of the quartiles
data_py2 = data_py[~((data_py < (Q1 - 1.5 * IQR)) | (data_py > (Q3 + 1.5 * IQR))).any(axis=1)]
# print(data_py2.shape)
print((data_py2 < (Q1 - 1.5 * IQR)) | (data_py2 > (Q3 + 1.5 * IQR)))
```
3. Exploring the dataset
We can check the distribution of some variables, and the best way to show it is with graphs:
```python
fig, axs = plt.subplots(1, 5, figsize=(20, 4), constrained_layout=True)
data_py['fixed acidity'].plot(kind='hist', ax=axs[0])
data_py['pH'].plot(kind='hist', ax=axs[1])
data_py['quality'].plot(kind='line', ax=axs[2])
data_py['alcohol'].plot(kind='hist', ax=axs[3])
data_py['total sulfur dioxide'].plot(kind='hist', ax=axs[4])
```
We also add a plot of counts per quality:
```python
counts = data_py.groupby(['quality']).count()['pH']  # pH or any other column - just for the count
counts.plot(kind='bar', title='Quantity by Quality')
plt.xlabel('Quality', fontsize=18)
plt.ylabel('Count', fontsize=18)
```
Adding some boxplots will also give a good understanding of the data and the statistics of a particular variable. Let's take pH and quality:
```python
sns.boxplot(x='quality', y='pH', data=data_py, palette='GnBu_d')
plt.title("Boxplot - Quality and pH")
plt.show()
```
or quality with fixed acidity:
sns.boxplot(x="quality",y="fixed acidity",data=data_py,palette="coolwarm") plt.title("Boxplot of Quality and Fixed Acidity") plt.show()
And we also look at the correlations among all the variables in the dataset:
```python
plt.figure(figsize=(10, 10))
sns.heatmap(data_py.corr(), annot=True, linewidth=0.5, center=0, cmap='coolwarm')
plt.show()
```
4. Modeling
We will split the dataset into the Y-set (our predicted variable) and the X-set (all the other variables). After that, we will split both into train and test subsets.
```python
X = data_py.iloc[:, :11].values
Y = data_py.iloc[:, -1].values

# Splitting the dataset into training and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
```
We will also do feature scaling:
```python
# Feature scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # reuse the scaler fitted on the training set
```
And get a general understanding of the explained variance with PCA:
```python
# Applying PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
```
You will see that the three principal components together explain more than 50% of the total variance.
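A quick way to verify this is to print the ratios and their cumulative sum; a minimal sketch (not part of the original notebook) that reuses the explained_variance array from the cell above:

```python
# Inspect how much variance each principal component explains
print(explained_variance)           # variance ratio of each of the 3 components
print(explained_variance.cumsum())  # cumulative share of the total variance
```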
Based on the train and test sets, let us now fit different types of models to the dataset. Starting with logistic regression:
```python
# Fitting Logistic Regression to the dataset
lr_c = LogisticRegression(random_state=0)
lr_c.fit(X_train, Y_train)
lr_pred = lr_c.predict(X_test)
lr_cm = confusion_matrix(Y_test, lr_pred)
print("The accuracy of LogisticRegression is:", accuracy_score(Y_test, lr_pred))
```
and create a confusion matrix to see the correctly predicted values per category.
```python
# Making the confusion matrix
print(lr_cm)
```
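If you prefer a visual overview, here is a minimal sketch (an addition, not part of the original notebook) that plots the same matrix as a seaborn heatmap, reusing lr_cm from the cell above:

```python
# Optional: plot the confusion matrix as a heatmap (assumes lr_cm from the previous cell)
sns.heatmap(lr_cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted quality')
plt.ylabel('Actual quality')
plt.title('Logistic Regression - confusion matrix')
plt.show()
```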
I will repeat this for the following algorithms: SVM, Random Forest, KNN and Naive Bayes, and I will make a comparison at the end.
SVM
```python
# Fitting SVM to the dataset
cl = SVC(kernel="rbf")
cl.fit(X_train, Y_train)
svm_pred = cl.predict(X_test)
svm_cm = confusion_matrix(Y_test, svm_pred)
print("The accuracy of SVM is:", accuracy_score(Y_test, svm_pred))
```
RandomForest
```python
# Fitting Random Forest to the dataset
rdf_c = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
rdf_c.fit(X_train, Y_train)
rdf_pred = rdf_c.predict(X_test)
rdf_cm = confusion_matrix(Y_test, rdf_pred)
print("The accuracy of RandomForestClassifier is:", accuracy_score(Y_test, rdf_pred))
```
KNN
```python
# Fitting KNN to the dataset
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
knn_pred = knn.predict(X_test)
knn_cm = confusion_matrix(Y_test, knn_pred)
print("The accuracy of KNeighborsClassifier is:", accuracy_score(Y_test, knn_pred))
```
and Naive Bayes
```python
# Fitting Naive Bayes to the dataset
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
bayes_pred = gaussian.predict(X_test)
bayes_cm = confusion_matrix(Y_test, bayes_pred)
print("The accuracy of Naive Bayes is:", accuracy_score(Y_test, bayes_pred))
```
And the accuracy of all the fitted models is the following (a short comparison sketch follows the list):
- LogisticRegression: 0.4722502522704339
- SVM: 0.48335015136226034
- KNeighborsClassifier: 0.39455095862764883
- Naive Bayes: 0.46316851664984865
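As a hedged sketch of that comparison (illustrative only, assuming the predictions from the cells above are still in memory), the accuracies can be collected and plotted side by side:

```python
# Collect the accuracies of all fitted models into one overview (illustrative only)
results = {
    'LogisticRegression': accuracy_score(Y_test, lr_pred),
    'SVM': accuracy_score(Y_test, svm_pred),
    'RandomForest': accuracy_score(Y_test, rdf_pred),
    'KNN': accuracy_score(Y_test, knn_pred),
    'NaiveBayes': accuracy_score(Y_test, bayes_pred),
}
pd.Series(results).sort_values(ascending=False).plot(kind='bar', title='Model accuracy comparison')
plt.ylabel('Accuracy')
plt.show()
```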
It is clear from the comparison which models would need further improvements.
Tomorrow we will use Azure Data Factory with Databricks.
The complete set of code and notebooks will be available at the GitHub repository.
Happy Coding and Stay Healthy!