This project focuses on classification: we apply several machine learning models to predict the presence of Parkinson's disease in a patient from voice measurements.
Parkinson’s disease is a progressive disorder that affects the nervous system and the parts of the body controlled by the nerves. Symptoms start slowly. The first symptom may be a barely noticeable tremor in just one hand. Tremors are common, but the disorder may also cause stiffness or slowing of movement. Although Parkinson’s disease can’t be cured, medications might significantly improve your symptoms. Occasionally, your health care provider may suggest surgery to regulate certain regions of your brain and improve your symptoms.
The dataset is available on Kaggle: https://www.kaggle.com/datasets/gargmanas/parkinsonsdataset
The dataset consists of 195 records and 24 columns: 23 attributes and 1 target variable.
Citation: 'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering OnLine 2007, 6:23 (26 June 2007)
- name - ASCII subject name and recording number
- MDVP:Fo(Hz) - Average vocal fundamental frequency
- MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
- MDVP:Flo(Hz) - Minimum vocal fundamental frequency
- MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
- MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
- NHR, HNR - Two measures of ratio of noise to tonal components in the voice
- status - Health status of the subject (one) - Parkinson's, (zero) - healthy
- RPDE, D2 - Two nonlinear dynamical complexity measures
- DFA - Signal fractal scaling exponent
- spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation
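As a quick start, here is a minimal loading sketch, assuming the Kaggle CSV is saved as `parkinsons.csv` (the filename is an assumption):

```python
import pandas as pd

# Load the dataset (download it from the Kaggle link above)
df = pd.read_csv("parkinsons.csv")
print(df.shape)  # expected: (195, 24)

# "name" is only a subject/recording identifier; "status" is the binary target
X = df.drop(columns=["name", "status"])
y = df["status"]
print(y.value_counts())  # class balance: 1 = Parkinson's, 0 = healthy
```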
The following classification models were applied (a comparison sketch follows the list):
- Logistic Regression
- Decision Tree
- Pruned Decision Tree
- Random Forest
- XGBoost Classifier
- Support Vector Machine
- K-Nearest Neighbors (KNN)
- Artificial Neural Network (ANN)
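A minimal sketch of how several of the scikit-learn models above could be compared on a common train/test split. Hyperparameters here are illustrative defaults, not the project's tuned values, and the XGBoost and ANN models are omitted since they come from separate packages:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("parkinsons.csv")  # filename assumed
X, y = df.drop(columns=["name", "status"]), df["status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale once; distance- and margin-based models (KNN, SVM) are scale-sensitive
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Pruned Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

for name, clf in models.items():
    clf.fit(X_train_s, y_train)
    print(f"{name}: {accuracy_score(y_test, clf.predict(X_test_s)):.3f}")
```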
The following preprocessing steps were applied (a sketch of the steps chained together follows the list):
- Data Cleaning
- Handling Missing Values
- Feature Selection (Removing highly correlated data)
- Feature Scaling
- Feature Dimensionality Reduction (PCA)
- SMOTE (Synthetic Minority Over-sampling Technique)
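A sketch of these preprocessing steps chained together. The 0.95 correlation cutoff and the 95%-variance PCA setting are illustrative assumptions, not necessarily the values used in this project; SMOTE comes from the `imbalanced-learn` package:

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

df = pd.read_csv("parkinsons.csv")
print(df.isnull().sum().sum())  # missing-value check

X = df.drop(columns=["name", "status"])
y = df["status"]

# Feature selection: drop one feature from every highly correlated pair
# (0.95 is an illustrative cutoff)
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

# Standardize, then reduce dimensionality with PCA
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=0.95).fit_transform(X_scaled)  # keep 95% of variance

# Balance classes with SMOTE (in practice, apply to the training split only)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_pca, y)
print(pd.Series(y_res).value_counts())
```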
Model performance is evaluated with the following metrics (a helper sketch follows the list):
- Accuracy
- Confusion Matrix
- Precision
- Recall
- F1 Score
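All five metrics are available in scikit-learn. A small helper (the function name is hypothetical) that prints them for a fitted model's test predictions:

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix,
    precision_score, recall_score, f1_score,
)

def report(y_test, y_pred):
    """Print all five evaluation metrics for test labels and predictions."""
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1 score :", f1_score(y_test, y_pred))
```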
For our best-performing artificial neural network (ANN) model, we took a meticulous approach to improving both feature quality and class balance, leading to superior predictive performance. The key steps involved in crafting this robust model are detailed below, followed by an end-to-end sketch of the pipeline:
- Feature Selection: We prioritized the quality of input features by conducting feature selection, specifically removing highly correlated data. This process ensures that the model focuses on the most informative aspects of the dataset, contributing to improved generalization.
- Data Standardization: To facilitate consistent and meaningful comparisons between features, we applied data standardization. This step ensures that all features are on a similar scale, preventing any single feature from dominating the learning process. Standardization contributes to a stable and efficient training process.
- SMOTE Oversampling: Addressing class imbalance is crucial for training a model that is sensitive to all classes. We leveraged the Synthetic Minority Over-sampling Technique (SMOTE) to balance the class distribution. This approach involves generating synthetic samples of the minority class, creating a more representative dataset and preventing the model from being biased towards the majority class.
- Model Training Configuration: For training the ANN, we chose Binary Cross Entropy as the loss function. This loss function is suitable for binary classification tasks, aligning with the nature of our problem. We employed the Adam optimizer, a widely used optimization algorithm known for its efficiency and effectiveness in optimizing neural networks.
- Evaluation Metric: To measure the model's overall performance, we utilized accuracy as the evaluation metric. Accuracy provides a comprehensive view of the model's ability to correctly classify instances, making it a meaningful metric for our binary classification task.
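Putting these steps together, here is a minimal end-to-end sketch of the ANN pipeline. The layer sizes, epoch count, and 0.95 correlation cutoff are illustrative assumptions, and the network is built with Keras:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from tensorflow import keras

df = pd.read_csv("parkinsons.csv")
X = df.drop(columns=["name", "status"])
y = df["status"]

# 1) Feature selection: drop one of each highly correlated pair (illustrative 0.95 cutoff)
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2) Standardization: fit on the training split only to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3) SMOTE: oversample the minority class in the training set only
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# 4) ANN trained with binary cross-entropy and the Adam optimizer
model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=100, batch_size=16, verbose=0)

# 5) Evaluate with accuracy on the untouched test set
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {acc:.3f}")
```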
This combination of feature selection, data standardization, SMOTE oversampling, and thoughtful training configuration resulted in an ANN that excels in both training and generalization. The model demonstrates robustness, effectively addressing challenges such as class imbalance and high feature correlation, and its performance is summarized using accuracy, a metric that aligns with our goal of accurate and balanced predictions. This approach not only enhances the model's predictive power but also supports its reliability in real-world scenarios, where class imbalance and correlated features are common challenges.