eda_and_prediction

This report covers exploratory data analysis (EDA) and outcome prediction on a diabetes dataset. The dataset used for the predictions is diabetic_data.csv.

The statistical and graphical exploratory data analysis was done using PySpark.

Eight machine learning algorithms were used to predict the outcome: five single models and three ensemble methods. The five single models are Logistic Regression (LR), Decision Tree (DT), K-Nearest Neighbour (KNN), Naive Bayes (NB), and a Multi-Layer Perceptron (MLP), which is a type of Artificial Neural Network (ANN). The three ensemble methods are Random Forest (RF), Gradient Boosting (GB), and AdaBoost (AB).
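The eight-model comparison described above can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes scikit-learn (the README only names PySpark for the EDA), uses synthetic data from `make_classification` in place of diabetic_data.csv, and picks hyperparameters, the 80/20 split, and weighted F1 averaging arbitrarily. The helper names `build_models` and `evaluate_models` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return the five single models and three ensembles, keyed by name.

    Hyperparameters are illustrative assumptions, not the notebook's settings.
    """
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "K-Nearest Neighbour": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),
        "Multi-Layer Perceptron": MLPClassifier(max_iter=500, random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Gradient Boosting": GradientBoostingClassifier(random_state=seed),
        "AdaBoost": AdaBoostClassifier(random_state=seed),
    }


def evaluate_models(X, y, seed=0):
    """Fit every model on a train split and score it on a held-out test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    results = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            # Weighted averaging is an assumption; the README does not say
            # which F1 variant was reported.
            "f1": f1_score(y_test, pred, average="weighted"),
        }
    return results


# Synthetic stand-in data, used only so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
results = evaluate_models(X, y)
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f}, f1={scores['f1']:.2f}")
```

Keeping the models in one dictionary means every algorithm is trained and scored on the same split, so the numbers are directly comparable.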

Testing accuracy:

- Logistic Regression - 57%
- Decision Tree - 50%
- K-Nearest Neighbour - 51%
- Naive Bayes - 14%
- Artificial Neural Networks - 50%
- Random Forest - 58%
- Gradient Boosting - 59%
- AdaBoost - 58%

F1 score:

- Logistic Regression - 0.52
- Decision Tree - 0.49
- K-Nearest Neighbour - 0.50
- Naive Bayes - 0.08
- Artificial Neural Networks - 0.50
- Random Forest - 0.54
- Gradient Boosting - 0.53
- AdaBoost - 0.54
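Accuracy and F1 measure different things, which is why the two sets of numbers above do not move in lockstep. The sketch below computes both from scratch on a small made-up label set (not taken from diabetic_data.csv). Weighted averaging of per-class F1 is an assumption here, since the README does not state which variant was reported, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Made-up example: a model that always predicts the majority class.
y_true = [0] * 6 + [1] * 4
y_pred = [0] * 10

print(f"accuracy    = {accuracy(y_true, y_pred):.2f}")     # 0.60
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.2f}")  # 0.45
```

In this toy case the classifier gets 60% accuracy by ignoring the minority class entirely, while the weighted F1 drops to 0.45 because that class contributes an F1 of zero.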
