The goal of this project is to build a machine learning model that predicts whether a patient has diabetes from a set of diagnostic measurements.
Data:
The dataset I used for this project is the Pima Indians Diabetes Dataset, available on Kaggle. It consists of 768 observations and 8 features:
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (μU/mL)
- BMI: Body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
The target variable is binary: 1 if the patient has diabetes, 0 otherwise.
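As a rough sketch of how the data described above could be loaded and sanity-checked with pandas: the target column name `Outcome`, the helper name `load_diabetes`, and the choice of which columns treat a zero as a missing value are assumptions for illustration, not something this project prescribes.

```python
import numpy as np
import pandas as pd

FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
TARGET = "Outcome"  # assumed name of the target column in the CSV

# Assumption: a zero in these columns is not a plausible measurement,
# so it is treated as a missing value.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def load_diabetes(source) -> pd.DataFrame:
    """Read the CSV, check the expected columns, and mark implausible zeros as NaN."""
    df = pd.read_csv(source)
    missing = set(FEATURES + [TARGET]) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    df = df[FEATURES + [TARGET]].copy()
    for col in ZERO_AS_MISSING:
        # mask() replaces entries where the condition holds with NaN
        df[col] = df[col].mask(df[col] == 0)
    return df
```

Calling `load_diabetes("diabetes.csv")` (the file name is also an assumption) and then `df.isna().sum()` would show how many values per column were flagged as missing.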
Steps:
- Load the Pima Indians Diabetes Dataset (available at https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).
- Explore the dataset to gain insights and visualize the data.
- Preprocess the data by handling missing values (in this dataset they are recorded as zeros in several columns), scaling the features, and splitting the data into training and testing sets.
- Build and train several machine learning models, including Logistic Regression, Decision Tree, Random Forest, and Support Vector Machines (SVM).
- Evaluate the performance of each model using appropriate evaluation metrics, such as accuracy, precision, recall, and F1 score.
- Select the best-performing model.
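
The preprocessing, training, evaluation, and selection steps above could be sketched with scikit-learn roughly as follows. This is a minimal sketch under several assumptions, not the project's definitive implementation: median imputation, an 80/20 stratified split, default hyperparameters, and F1 as the selection metric are all illustrative choices, and `compare_models` and `best_model` are hypothetical helper names. The imputer and scaler sit inside a pipeline so that they are fitted on the training split only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, test_size=0.2, seed=42):
    """Train four classifiers on one split and return their test-set metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "SVM": SVC(),
    }
    results = {}
    for name, model in models.items():
        # Impute missing values, scale the features, then fit the classifier.
        pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), model)
        pipe.fit(X_train, y_train)
        pred = pipe.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
        }
    return results


def best_model(results, metric="f1"):
    """Return the name of the model with the highest score on the chosen metric."""
    return max(results, key=lambda name: results[name][metric])
```

Here `X` would hold the eight feature columns and `y` the binary target. Because the classes in this kind of data are typically imbalanced, accuracy alone can be misleading, which is why the sketch selects on F1; recall may be the better choice if missing a diabetic patient is the costlier error.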