This project aims to predict student dropout and academic success using demographic, socioeconomic, and academic data. The project is implemented in Python and R.
The data is sourced from Kaggle and covers 4,424 students, with attributes such as gender, age, nationality, qualifications, and academic performance. The full data dictionary is available at the link below.
- Dataset Link: https://www.kaggle.com/datasets/thedevastator/higher-education-predictors-of-student-retention
The Python and R implementations are in the following files in the src folder:
- student_predictions.ipynb: Interactive Python notebook covering data preprocessing, EDA, machine learning models, and deep learning models.
- student_predictions.py: Complete Python implementation including data preprocessing, EDA, machine learning models, and deep learning models.
- student_predictions.R: Complete R implementation including data preprocessing, EDA, machine learning models, and deep learning models.
The following machine learning and deep learning models are implemented:
- Logistic Regression
- K-Nearest Neighbors
- Random Forest (primary model)
- Decision Tree
- Support Vector Machine
- 2-layer Neural Network
- 3-layer Neural Network
- 5-layer Neural Network with Dropout
- Convolutional Neural Network (best performer)
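As a rough illustration of how the classical models in this list can be set up side by side, here is a minimal scikit-learn sketch. It is not taken from the project's notebooks: the synthetic data, the three-class target, the hyperparameters, and the `models` and `scores` names are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student data (assumption: three imbalanced classes).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=3, weights=[0.5, 0.3, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Distance- and margin-based models get a scaler; tree models do not need one.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

Accuracy on synthetic data says nothing about the real dataset; the point is only the shared fit/score loop, which makes it easy to swap models in and out.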
The notebooks can be run end-to-end to reproduce the analysis and modeling pipeline. Models are evaluated using 5-fold stratified cross-validation. Class imbalance is handled via oversampling and undersampling techniques.
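The evaluation setup described above can be sketched roughly as follows. This is a hedged example, not the project's actual code: it assumes random oversampling applied only to the training folds (so duplicated rows never leak into the test fold), a Random Forest classifier, macro-averaged F1 as the metric, and synthetic data. The helper `oversample_minority` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample


def oversample_minority(X, y, random_state=0):
    """Hypothetical helper: duplicate random rows until all classes match the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        X_cls = X[y == cls]
        if count < target:
            # Draw the missing rows with replacement and append them to the originals.
            extra = resample(X_cls, replace=True, n_samples=target - count,
                             random_state=random_state)
            X_cls = np.vstack([X_cls, extra])
        X_parts.append(X_cls)
        y_parts.append(np.full(target, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Synthetic, imbalanced stand-in for the student data (assumption).
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=6,
    n_classes=3, weights=[0.6, 0.25, 0.15], random_state=0,
)

# Stratified folds keep the class proportions roughly equal in every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    # Balance the training fold only; the test fold keeps its natural distribution.
    X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_bal, y_bal)
    pred = clf.predict(X[test_idx])
    fold_scores.append(f1_score(y[test_idx], pred, average="macro"))

print(f"Macro F1 per fold: {np.round(fold_scores, 3)}")
print(f"Mean macro F1: {np.mean(fold_scores):.3f}")
```

Resampling inside the loop rather than before the split is a deliberate choice in this sketch: balancing the full dataset first would place copies of the same row in both training and test folds and inflate the scores.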
Python libraries:
- Pandas, NumPy, Matplotlib, Seaborn
- Scikit-learn, Keras, TensorFlow

R libraries:
- dplyr, tidyverse, corrplot, skimr, keras
- cluster, factoextra, moments, caret, randomForest, gridExtra
- Naman Pandey - https://www.linkedin.com/in/nmn-pandey/
- Drishti Doshi - https://www.linkedin.com/in/drishti-doshi-45060221a/
- Sujeet Sharma - https://www.linkedin.com/in/sujeet-sharma-644247109/
Let me know if you have any questions.