BrainStation Capstone Project - Heart Disease Prediction
Yumemi Kinsella
Github: https://github.com/Yuuuuume7/CapstoneProject
Data source: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?select=heart_2020_cleaned.csv
Application : https://yuuuuume7-heartdisease-pred.streamlit.app/
- Notebooks
- Supplementary Files
- Project Overview
- Project Organization and Flowchart
Notebooks
Notebook '1.Capstone Project - EDA':
- Includes data cleaning and EDA
Notebook '2.Capstone Project - Sampling imbaranced':
- Includes sampling process
Sampled the original dataset in 3 ways:
- Under sampling
- Over sampling
- SMOTE
Notebook '3.Capstone Project - Feature Engineering':
- Includes converting categorical values into numerical values
- Includes feature selection
Tried 3 methods:
- SelectKBest
- RFE
- RFECV
Notebook '4. Capstone Project - Hyperparameter Optimization; Logistic Regression'
- Includes optimizing hyperparameters of Logistic Regression Model
- Cross Validation
- Pipeline
- Includes model evaluation
- Precision/Recall/F1
- Includes selecting best Logistic Regression Model
Notebook '5. Capstone Project - Hyperparameter Optimization; Naive Bayes'
- Includes baseline model
- Includes optimizing hyperparameters of Naive Bayes Model
- Cross Validation
- Pipeline
- Includes model evaluation
- Precision/Recall/F1
- Includes selecting best Naive Bayes Model
Notebook '6. Capstone Project - Hyperparameter Optimization; Decision Trees'
- Includes baseline model
- Includes optimizing hyperparameters of Decision Trees Model
- GridSearchCV
- Includes model evaluation
- Precision/Recall/F1
- Includes selecting best Decision Trees Model
Notebook '7. Capstone Project - Hyperparameter Optimization; Random Forest'
- Includes baseline model
- Includes optimizing hyperparameters of Random Forest Model
- GridSearchCV
- Cross Validation
- Includes model evaluation
- Precision/Recall/F1
- Includes selecting best Random Forest Model
Notebook '8. Capstone Project - Hyperparameter Optimization; SVM'
- Includes baseline model
- Includes optimizing hyperparameters of SVM Model
- GridSearchCV
- Cross Validation
- Includes model evaluation
- Precision/Recall/F1
- Includes selecting best SVM Model
Notebook '9. Capstone Project - Final model'
- Includes comparing the best models of 6 machine learning methods
- Cross Validation
- Precision/Recall/F1
- Includes selecting the final model
Supplementary Files
File 'heart_2020_cleaned':
- The original csv file
- from Kaggle
- for Notebook 1
File 'capstone_clean_heart_disease':
- The cleaned csv file, with unnecessary columns and rows removed
- from Notebook 1
- for Notebook 2, Notebook 3
File 'under_sampled_df':
- An under sampled csv file containing only the train set. Some of the "No" values of the target variable were removed to balance the data.
- from Notebook 2
- for Notebook 3
File 'over_sampled_df':
- An over sampled csv file containing only the train set. Rows were added by randomly duplicating examples from the minority class until it matched the size of the majority class.
- from Notebook 2
- for Notebook 3
File 'smote_df':
- A csv file sampled with the SMOTE method, containing only the train set. Rows were added by creating synthetic minority-class examples based on the existing data until it matched the size of the majority class.
- from Notebook 2
- for Notebook 3
File 'test_sampled_df':
- A test set csv file for under sampled data and over sampled data
- from Notebook 2
- for Notebook 3
File 'test_smote_df':
- A test set csv file for SMOTE data
- from Notebook 2
- for Notebook 3
File 'capstone_clean_heart_disease_fe':
- The original cleaned dataset with its features converted into numerical values
- from Notebook 3
- for Notebook 4
File 'under_sampled_df_fe':
- The under sampled dataset with features converted into numerical values (train set only)
- from Notebook 3
- for Notebook 4
File 'over_sampled_df_fe':
- The over sampled dataset with features converted into numerical values (train set only)
- from Notebook 3
- for Notebook 4
File 'test_sampled_df_fe':
- The test set for the under/over sampled data, with features converted into numerical values
- from Notebook 3
- for Notebook 4
Project Overview
- The Problem Area, including those affected
Firstly, according to the Centers for Disease Control and Prevention (CDC), heart disease is the leading cause of death in the United States. About 695,000 people in the United States died from heart disease in 2021, which is about 1 in every 5 deaths, meaning roughly 20% of Americans die from heart disease. Secondly, the official poverty rate in the US was 11% in 2020.
Thirdly, about 22% of Americans have avoided some sort of medical care — including doctor visits, medications, vaccinations, annual exams, screenings, vision checks and routine blood work — because of the expense.
Medical bills in the United States are very expensive. Some people may have died from heart disease without ever seeing a doctor because of the cost, and those people might still be alive if they had been able to see a doctor.
- My proposed Data Science solution
To reduce deaths from heart disease among low-income people, I created a heart disease prediction model based on personal information such as Body Mass Index (BMI), how the person has been feeling mentally and physically recently, and so forth. The prediction is aimed at people who cannot easily afford to see a doctor and therefore probably do not know much about their health condition, so the inputs do not include medical information such as blood pressure or existing diagnoses. For everyday use, I created a web application that tells users whether heart disease is suspected and gives advice on lifestyle and habits based on the prediction model.
- The impact of my solution
Target users: people who cannot afford medical care because of low income.
Societal value: people do not have to spend extra money on medical expenses if the model does not suggest heart disease, and they can change their lifestyle or habits to lower their risk of heart disease if needed.
- A description of my dataset
I found this dataset on Kaggle, but it originally comes from the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to collect data on the health status of US residents. The cleaned dataset contains 319,073 respondents' data and has 14 columns:
4 numerical variables: BMI, PhysicalHealth, MentalHealth and SleepTime
10 categorical variables: Smoking, AlcoholDrinking, DiffWalking, Sex, AgeCategory, Race, PhysicalActivity, GenHealth, Asthma, HeartDisease
The dataset has no missing or invalid values.
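As a quick check of these claims, a minimal sketch that loads the cleaned file described above (assuming it is saved as capstone_clean_heart_disease.csv; the exact filename and extension are assumptions):

```python
import pandas as pd

# Load the cleaned dataset produced by Notebook 1 (filename assumed)
df = pd.read_csv("capstone_clean_heart_disease.csv")

print(df.shape)               # expected: roughly (319073, 14)
print(df.isna().sum().sum())  # expected: 0 missing values

# The target is heavily imbalanced, which motivates the sampling step later
print(df["HeartDisease"].value_counts())
```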
- Column Information
HeartDisease : Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) / categorical
BMI : Body Mass Index (BMI) / numeric
Smoking : Have you smoked at least 100 cigarettes in your entire life? (Note: 5 packs = 100 cigarettes) / categorical
AlcoholDrinking : Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) / categorical
PhysicalHealth : Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? / numeric
MentalHealth : Thinking about your mental health, for how many days during the past 30 days was your mental health not good? / numeric
DiffWalking : Do you have serious difficulty walking or climbing stairs? / categorical
Sex : Are you male or female? / categorical
AgeCategory : Fourteen-level age category / categorical
Race : Imputed race/ethnicity value / categorical
PhysicalActivity : Adults who reported doing physical activity or exercise during the past 30 days other than their regular job / categorical
GenHealth : Would you say that in general your health is... / categorical
SleepTime : On average, how many hours of sleep do you get in a 24-hour period? / numeric
Asthma : (Ever told) (you had) asthma? / categorical
- Original dataset
  - Total rows: 319,073
  - Train set: "Yes" 19,091 rows / "No" 204,260 rows
  - Test set: "Yes" 8,178 rows / "No" 87,544 rows
- Under sampled dataset
  - Total rows: 133,904
  - Train set: "Yes" 19,091 rows / "No" 19,091 rows
  - Test set: "Yes" 8,178 rows / "No" 87,544 rows
- Over sampled dataset
  - Total rows: 504,242
  - Train set: "Yes" 204,260 rows / "No" 204,260 rows
  - Test set: "Yes" 8,178 rows / "No" 87,544 rows
- SMOTE dataset
  - Total rows: 504,242
  - Train set: "Yes" 204,260 rows / "No" 204,260 rows
  - Test set: "Yes" 8,178 rows / "No" 87,544 rows
- Libraries
Basic
- pandas
- numpy
- matplotlib.pyplot
- seaborn
For sampling
- RandomUnderSampler from imblearn.under_sampling
- RandomOverSampler from imblearn.over_sampling
- SMOTE from imblearn.over_sampling
- Counter from collections
For model preparation
- train_test_split from sklearn.model_selection
- StandardScaler from sklearn.preprocessing
- MinMaxScaler from sklearn.preprocessing
- PCA from sklearn.decomposition
- KernelPCA from sklearn.decomposition
For machine learning
- LogisticRegression from sklearn.linear_model
- GaussianNB from sklearn.naive_bayes
- MultinomialNB from sklearn.naive_bayes
- BernoulliNB from sklearn.naive_bayes
- DecisionTreeClassifier from sklearn.tree
- RandomForestClassifier from sklearn.ensemble
- LinearSVC from sklearn.svm
- SVC from sklearn.svm
For feature selecting
- SelectKBest from sklearn.feature_selection
- f_regression from sklearn.feature_selection
- RFE from sklearn.feature_selection
- RFECV from sklearn.feature_selection
For hyperparameter optimization
- Pipeline from sklearn.pipeline
- GridSearchCV from sklearn.model_selection
For evaluation
- classification_report from sklearn.metrics
- ConfusionMatrixDisplay from sklearn.metrics
- f1_score from sklearn.metrics
- accuracy_score from sklearn.metrics
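A short illustration of how these evaluation helpers are called, using placeholder labels rather than project results:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, f1_score

# Placeholder true/predicted labels, only to show the API calls
y_true = ["No", "No", "Yes", "Yes", "No", "Yes"]
y_pred = ["No", "Yes", "Yes", "No", "No", "Yes"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 (for the 'Yes' class):", f1_score(y_true, y_pred, pos_label="Yes"))

# Confusion matrix plot
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()
```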
Project Organization and Flowchart
- Data Collection
- Download the data from Kaggle
- Data Cleaning
- Remove duplicate rows
- Remove unnecessary columns: this project is aimed at ordinary people who do not go to see a doctor, so it is assumed that they do not know what diseases they have; therefore, the disease-related columns were removed (see the sketch below)
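A minimal sketch of this cleaning step, assuming the original Kaggle filename and inferring the dropped disease columns from the 14 columns listed in the Project Overview:

```python
import pandas as pd

# Original Kaggle export
df = pd.read_csv("heart_2020_cleaned.csv")

# Remove exact duplicate rows
df = df.drop_duplicates()

# Remove disease columns an untreated user would not know about
# (names inferred from the Kaggle schema and the retained columns above)
df = df.drop(columns=["Stroke", "Diabetic", "KidneyDisease", "SkinCancer"])

# Save the cleaned file used by the later notebooks (filename assumed)
df.to_csv("capstone_clean_heart_disease.csv", index=False)
```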
- EDA
- Check the relationship between the presence of heart disease and the other features (see the sketch below)
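A small EDA sketch in the same spirit, assuming the cleaned csv from the previous step:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("capstone_clean_heart_disease.csv")

# Rate of heart disease within each smoking group (one categorical example)
print(pd.crosstab(df["Smoking"], df["HeartDisease"], normalize="index"))

# Distribution of a numeric feature split by the target
sns.boxplot(data=df, x="HeartDisease", y="BMI")
plt.show()
```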
- Data Sampling
The original data is imbalanced, so I sampled it in 3 ways (see the sketch after this list):
- Under sampling
- Over sampling
- SMOTE
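A minimal sketch of the three sampling approaches, assuming the cleaned csv and a 70/30 train/test split consistent with the row counts above. Only the train set is resampled; SMOTE needs numeric inputs, so the categoricals are one-hot encoded here, whereas the project converts them in Notebook 3:

```python
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

df = pd.read_csv("capstone_clean_heart_disease.csv")
X = pd.get_dummies(df.drop(columns="HeartDisease"))  # numeric features for SMOTE
y = df["HeartDisease"]

# Hold out the test set first; only the train set is resampled
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

for name, sampler in [("under sampling", RandomUnderSampler(random_state=42)),
                      ("over sampling", RandomOverSampler(random_state=42)),
                      ("SMOTE", SMOTE(random_state=42))]:
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, Counter(y_res))  # both classes end up the same size
```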
- Feature Engineering
- Convert categorical values into numerical values
- Feature Selection
I used and compared the results of 3 feature selection methods (see the sketch after this list):
- SelectKBest
- RFE
- RFECV
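A sketch of the three feature selection methods on encoded data; the estimator, the k value, and the f_classif scoring are illustrative choices (the notebook lists f_regression), and the subsample just keeps the example fast:

```python
import pandas as pd
from sklearn.feature_selection import RFE, RFECV, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("capstone_clean_heart_disease.csv").sample(20_000, random_state=42)

# Convert categorical values into numbers (one-hot here; the notebook may map them differently)
X = pd.get_dummies(df.drop(columns="HeartDisease"))
y = df["HeartDisease"].map({"No": 0, "Yes": 1})

# 1) SelectKBest: keep the k highest-scoring features
kbest = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# 2) RFE: recursively eliminate features until a fixed number remains
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# 3) RFECV: like RFE, but cross-validation chooses how many features to keep
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=5).fit(X, y)

for name, support in [("SelectKBest", kbest.get_support()),
                      ("RFE", rfe.support_),
                      ("RFECV", rfecv.support_)]:
    print(name, list(X.columns[support]))
```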
- Hyperparameter Tuning
I optimized 5 machine learning methods on each of the datasets (see the sketch after this list).
- Machine Learning Methods
- Logistic Regression
- Naive Bayes
- Decision Tree
- Random Forest
- SVM
- Hyperparameter Optimization Methods
- Cross Validation
- Pipeline
- GridSearchCV
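A minimal sketch of the Pipeline + GridSearchCV pattern, shown for the Logistic Regression case; the data setup, parameter grid, and F1 scoring here are illustrative, and each model family in Notebooks 4 to 8 gets its own grid:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data setup (the notebooks use the resampled, feature-engineered train sets)
df = pd.read_csv("capstone_clean_heart_disease.csv")
X = pd.get_dummies(df.drop(columns="HeartDisease"))
y = df["HeartDisease"].map({"No": 0, "Yes": 1})

# Scaling and the classifier live in one Pipeline, so cross-validation
# fits the scaler on each training fold only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

param_grid = {"model__C": [0.01, 0.1, 1, 10]}  # illustrative grid

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```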
- Modeling
- Create the final model with Logistic Regression (see the sketch below)
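A sketch of the final comparison and model, assuming the cleaned csv and placeholder (untuned) hyperparameters; the project itself compares the tuned best model from each method:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("capstone_clean_heart_disease.csv")
X = pd.get_dummies(df.drop(columns="HeartDisease"))
y = df["HeartDisease"].map({"No": 0, "Yes": 1})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Compare a few candidates on cross-validated F1 (hyperparameters are placeholders)
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in candidates.items():
    f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()
    print(name, round(f1, 3))

# Fit the chosen final model (Logistic Regression) and evaluate on the held-out test set
final_model = candidates["Logistic Regression"].fit(X_train, y_train)
print(classification_report(y_test, final_model.predict(X_test)))
```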
- Create an Application
- Create a web application with Streamlit
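A minimal sketch of the Streamlit front end; the input fields shown here are a small subset, and the deployed app (linked at the top of this README) collects every model input and calls the trained model instead of the toy rule below:

```python
# app.py - run with: streamlit run app.py
import streamlit as st

st.title("Heart Disease Prediction")

# A few example inputs mirroring the dataset's columns
bmi = st.number_input("BMI", min_value=10.0, max_value=60.0, value=25.0)
sleep = st.slider("Average hours of sleep per day", 1, 12, 7)
smoker = st.selectbox("Have you smoked at least 100 cigarettes in your life?", ["No", "Yes"])

if st.button("Predict"):
    # Placeholder rule; the real app passes all inputs to the trained final model
    at_risk = bmi >= 30 or smoker == "Yes"
    if at_risk:
        st.warning("Heart disease may be suspected. Consider reviewing your lifestyle and habits.")
    else:
        st.success("No strong signs of heart disease risk from these inputs.")
```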
