This project applies machine learning to credit risk analysis: it predicts whether a loan will be fully paid or charged off (i.e., defaulted). The goal is to assist financial institutions in assessing borrower risk using historical loan and customer financial data.
The analysis includes data preprocessing, exploratory data analysis (EDA), feature engineering, and model development using multiple classifiers. The final model aims to identify key features that influence creditworthiness and assist in automated loan decision-making.
- Load and preprocess loan application data
- Perform EDA to discover trends and risk patterns
- Engineer meaningful features to improve model learning
- Train and compare various ML models
- Evaluate models using appropriate classification metrics
- Identify top predictive features
The dataset includes anonymized financial details of borrowers and their loan status. Key features include:
- `loan_amnt` – Total amount funded
- `term` – Length of the loan
- `purpose` – Reason for the loan
- `emp_length` – Employment length
- `annual_inc` – Annual income
- `loan_status` – Target variable (`Fully Paid` or `Charged Off`)
- `dti` – Debt-to-income ratio
- `credit_score` – Credit score of applicant
Note: The dataset was preprocessed to remove irrelevant or missing values, and categorical variables were encoded.
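The preprocessing described in this note could look roughly like the sketch below. The small inline DataFrame, the `preprocess` helper, and the choice to drop incomplete rows rather than impute them are assumptions made for illustration; only the column names come from this README.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, then encode the target and text columns."""
    # Keep only the two outcomes the project predicts.
    df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])]
    # Assumption: rows with missing values are dropped, not imputed.
    df = df.dropna().copy()
    # Encode the target: 1 = charged off (default), 0 = fully paid.
    df["loan_status"] = (df["loan_status"] == "Charged Off").astype(int)
    # Label-encode every remaining non-numeric column as integer codes.
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].astype("category").cat.codes
    return df


# Hypothetical sample rows standing in for the real loan data.
raw = pd.DataFrame({
    "loan_amnt": [10000, 5000, 20000, 7500],
    "purpose": ["car", "debt_consolidation", None, "car"],
    "annual_inc": [55000.0, 42000.0, 90000.0, 38000.0],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Charged Off"],
})
clean = preprocess(raw)
print(clean)
```

Category codes are assigned in sorted order of the distinct values, so the mapping depends on which categories are present. If the same encoding must be reused at prediction time, the mapping would need to be saved alongside the model.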
- Python
- Jupyter Notebook
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, XGBoost
The following classification algorithms were applied:
- Logistic Regression
- Random Forest Classifier
- XGBoost Classifier
Evaluation Metrics:
- Accuracy
- Precision, Recall, F1-Score
- ROC-AUC Score
- Confusion Matrix
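As a rough illustration of how the classifiers and metrics listed above fit together, the sketch below trains each model on the same split and collects the same metrics for each. It uses synthetic data from `make_classification` as a stand-in for the loan dataset, and the hyperparameters are placeholders, not the project's actual settings. XGBoost is included only if the package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (class 1 = charged off).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(n_estimators=100, eval_metric="logloss",
                                      random_state=42)
except ImportError:
    pass  # XGBoost is optional in this sketch.

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # ROC-AUC needs scores, so use the predicted probability of class 1.
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, proba),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }

for name, metrics in results.items():
    print(f"{name}: ROC-AUC={metrics['roc_auc']:.3f}, F1={metrics['f1']:.3f}")
```

Because charged-off loans are typically the minority class, accuracy alone can look good for a model that rarely predicts default. Recall and ROC-AUC are usually more informative for comparing models in that setting.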
- Distribution of `loan_status`
- Loan purpose vs default rate
- Credit score and DTI distributions
- Annual income trends and correlation with loan status
- Visualizations: bar plots, histograms, correlation heatmaps
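Two of the checks above, the `loan_status` distribution and the default rate per loan purpose, can be sketched in a few lines of pandas. The sample rows and the `is_default` helper column are made up for illustration and are not taken from the project's data.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
loans = pd.DataFrame({
    "purpose": ["car", "car", "small_business", "small_business", "home"],
    "loan_status": ["Fully Paid", "Charged Off", "Charged Off",
                    "Charged Off", "Fully Paid"],
})

# Share of each outcome, to check class balance.
status_share = loans["loan_status"].value_counts(normalize=True)

# Default rate per purpose: mean of a 0/1 default flag within each group.
loans["is_default"] = (loans["loan_status"] == "Charged Off").astype(int)
default_rate = (loans.groupby("purpose")["is_default"]
                .mean()
                .sort_values(ascending=False))

print(status_share)
print(default_rate)
```

With Matplotlib installed, `default_rate.plot(kind="bar")` would turn the second result into a bar plot like those mentioned in the list.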
- Label encoding of categorical features
- Credit score binning for risk segmentation
- Ratio features: `income_to_loan`
- Feature selection based on correlation and model importance
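The binning and ratio steps above might look like the following sketch. The bin edges, band labels, and `credit_band` column name are hypothetical, since the project's actual cut-offs are not stated here. `income_to_loan` is assumed to mean `annual_inc` divided by `loan_amnt`.

```python
import pandas as pd

# Hypothetical applicants standing in for the real dataset.
applicants = pd.DataFrame({
    "credit_score": [580, 640, 700, 780],
    "annual_inc": [30000.0, 48000.0, 75000.0, 120000.0],
    "loan_amnt": [10000.0, 12000.0, 15000.0, 20000.0],
})

# Assumed bin edges and labels; intervals are closed on the right,
# e.g. a score of 600 falls in "poor" and 601 in "fair".
bins = [300, 600, 660, 720, 850]
labels = ["poor", "fair", "good", "excellent"]
applicants["credit_band"] = pd.cut(applicants["credit_score"], bins=bins,
                                   labels=labels, include_lowest=True)

# Ratio feature: how many times the annual income covers the loan amount.
applicants["income_to_loan"] = applicants["annual_inc"] / applicants["loan_amnt"]

print(applicants)
```

A ratio like this lets the model compare borrowers on relative affordability, which raw income and raw loan amount only express indirectly.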
- Random Forest and XGBoost outperformed Logistic Regression, with XGBoost yielding the best ROC-AUC score.
- Key predictive features: `credit_score`, `dti`, `annual_inc`, `purpose`
- The models successfully distinguish high-risk (charged off) vs low-risk (fully paid) applicants.
- Clone the repository
    git clone https://github.com/yourusername/credit-risk-analysis.git
    cd credit-risk-analysis