
This project builds a personal loan prediction web app using a data science pipeline. It includes supervised learning with Gradient Boosting, feature engineering, classification, and validation. Built with Streamlit/CLI, it estimates loan eligibility and repayment, featuring modular code, notebooks, and deployment scripts.


AshwinAshok3/Personal-Loan


📁 Project Structure Overview

The directory structure is as follows:

  • data/: Contains the dataset(s) used for training and evaluation.
  • src/: Houses all source code modules, including data preprocessing, feature engineering, model training, and evaluation scripts.
  • models/: Stores serialized models for future inference or analysis.
  • artifact/: Stores intermediate outputs and artifacts generated during pipeline execution.
  • catboost_info/: Contains logs and information specific to CatBoost model training.
  • docx/: Contains project documentation and reports.
  • .idea/: Configuration files for the development environment (e.g., PyCharm).
  • application.py: Entry point that serves the trained model for inference and deployment.
  • main.py: The main execution script orchestrating the entire pipeline.
  • requirements.txt: Lists all Python dependencies required to run the project.
  • setup.py: Script for installing the project as a package.
  • README.md: Provides an overview and instructions for the project.
  • Methods.txt: Details the methodologies and approaches used in the project.

🔍 Data Exploration and Preprocessing

Data Loading and Initial Analysis

  • Loaded the dataset from the data/ directory using Pandas, ensuring efficient memory usage and correct data types.
  • Conducted an initial exploration to understand the distribution, central tendencies, and variability of features.
  • Identified and handled missing values, ensuring data integrity for subsequent analysis.
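
A minimal loading sketch for this step; the CSV file name is an assumption, and the column names are taken from the features mentioned elsewhere in this README:

```python
import pandas as pd

# Load the dataset from data/ (file name assumed; adjust to the actual CSV).
df = pd.read_csv(
    "data/personal_loan.csv",
    dtype={"ZIP Code": "string"},  # keep ZIP codes categorical rather than numeric
)

# Initial exploration: shape, dtypes, central tendencies, and missing values.
print(df.shape)
print(df.dtypes)
print(df.describe().T)
print(df.isna().sum())

# Handle any missing values; median imputation for numeric columns is one simple option.
df = df.fillna(df.median(numeric_only=True))
```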

Data Cleaning

  • Removed irrelevant or redundant features such as ID and ZIP Code to prevent noise in the model.
  • Addressed duplicate entries to maintain data quality.
  • Detected and treated outliers using statistical methods to prevent skewed model training.
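
A sketch of those cleaning steps, assuming the column names mentioned in this README (ID, ZIP Code, Personal Loan) and using the 1.5 × IQR rule as the statistical outlier treatment:

```python
# Drop identifier-like columns that add noise rather than signal.
df = df.drop(columns=["ID", "ZIP Code"], errors="ignore")

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Cap outliers in continuous numeric columns with the 1.5 * IQR rule;
# binary/low-cardinality columns and the target are skipped so flags are not distorted.
continuous_cols = [
    c for c in df.select_dtypes(include="number").columns
    if df[c].nunique() > 10 and c != "Personal Loan"  # "Personal Loan" is the assumed target
]
for col in continuous_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```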

Feature Engineering

  • Created new features that capture underlying patterns, such as interaction terms or aggregated metrics.
  • Transformed categorical variables using one-hot encoding to convert them into a machine-readable format.
  • Scaled numerical features using StandardScaler to ensure uniformity across features.
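
A sketch combining those transforms; the interaction term, the encoded column (Education), and the target name (Personal Loan) are assumptions based on the features named later in this README:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative engineered feature: credit-card spend relative to income.
df["CCAvg_to_Income"] = np.where(df["Income"] > 0, df["CCAvg"] / df["Income"], 0.0)

# One-hot encode the categorical education level.
df = pd.get_dummies(df, columns=["Education"], drop_first=True, dtype=int)

# Scale the feature columns; in the full pipeline the scaler is fit on the
# training split only so that no information leaks from the test set.
target = "Personal Loan"
feature_cols = [c for c in df.columns if c != target]
df[feature_cols] = StandardScaler().fit_transform(df[feature_cols])
```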

🧠 Feature Selection and Model Building

Recursive Feature Elimination with Cross-Validation (RFECV)

  • Implemented RFECV to identify the most significant features contributing to the target variable.
  • Utilized models like Logistic Regression, Random Forest, Gradient Boosting, and Decision Tree as estimators in RFECV.
  • Determined the optimal number of features that yield the best cross-validation score, enhancing model performance and reducing overfitting.
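
A sketch of the RFECV step with Gradient Boosting as the estimator; the fold count and scoring metric are illustrative choices:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X = df.drop(columns=["Personal Loan"])  # assumed target name
y = df["Personal Loan"]

selector = RFECV(
    estimator=GradientBoostingClassifier(random_state=42),
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",
    n_jobs=-1,
)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Selected features:", list(X.columns[selector.support_]))
X_selected = X.loc[:, selector.support_]
```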

Handling Class Imbalance

  • Addressed the imbalance in the target variable using SMOTE (Synthetic Minority Over-sampling Technique).
  • Generated synthetic samples for the minority class, achieving a balanced dataset and improving model generalization.
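
A minimal SMOTE sketch using imbalanced-learn; oversampling is applied to the training split only, so the test set keeps the natural class distribution:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, stratify=y, random_state=42
)

smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print("Before:", Counter(y_train), "After:", Counter(y_train_bal))
```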

Model Training

Trained a suite of supervised classification models, including:

  • Logistic Regression: Served as a baseline model due to its simplicity and interpretability.
  • Support Vector Classifier (SVC): Captured complex relationships using kernel tricks.
  • Random Forest Classifier: Leveraged ensemble learning to improve prediction accuracy.
  • K-Nearest Neighbors (KNN): Classified instances based on proximity in feature space.
  • Radius Neighbors Classifier: Similar to KNN but considered all points within a fixed radius.
  • AdaBoost & Bagging Classifier: Combined weak learners to form a strong classifier.
  • Gradient Boosting, CatBoost, LightGBM, XGBoost, XGBRF: Employed advanced boosting techniques for superior performance.
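
A sketch of training a subset of that suite on the balanced data and comparing F1 scores (default hyperparameters; CatBoost and LightGBM follow the same fit/predict pattern and are omitted for brevity):

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(probability=True),
    "RandomForest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train_bal, y_train_bal)
    results[name] = f1_score(y_test, model.predict(X_test))

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>18}: F1 = {score:.3f}")
```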

Model Evaluation

  • Evaluated models using metrics such as Accuracy, F1 Score, ROC-AUC, and Confusion Matrix.
  • Applied Stratified K-Fold cross-validation so every fold preserves the class distribution, yielding robust and reliable performance estimates.
  • Visualized model performance using ROC curves and precision-recall plots.
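
A sketch of the cross-validated evaluation on one of the stronger models; the fold count and metrics shown are illustrative:

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

best_model = models["XGBoost"]

# Stratified folds keep the loan/no-loan ratio consistent across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(best_model, X_train_bal, y_train_bal, cv=cv, scoring="roc_auc")
print(f"CV ROC-AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")

# Held-out test evaluation.
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Test ROC-AUC:", roc_auc_score(y_test, y_proba))
```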

🛠️ Custom Functions and Utilities

Within the src/ directory, the project defines several custom modules and functions:

  • Data Preprocessing Module: Encapsulated functions for data cleaning, transformation, and feature engineering.
  • Model Training Module: Included functions to train various models, perform hyperparameter tuning, and evaluate performance.
  • Visualization Module: Contained functions to generate insightful plots for EDA and model evaluation.
  • Utility Functions: Provided helper functions for tasks like saving/loading models, logging, and configuration management.

Each function was designed with modularity and reusability in mind, adhering to best coding practices.
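
For instance, a model persistence utility of the kind described above might look like the following sketch (function names and paths are illustrative, not the repository's actual API):

```python
import logging
from pathlib import Path

import joblib

logger = logging.getLogger(__name__)


def save_model(model, path: str) -> None:
    """Serialize a fitted model to disk, creating parent directories as needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path)
    logger.info("Saved model to %s", path)


def load_model(path: str):
    """Load a previously serialized model for inference."""
    return joblib.load(path)
```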


🧪 Exception Handling and Logging

  • Implemented comprehensive exception handling across modules to capture and log errors gracefully.
  • Ensured that the pipeline could handle unexpected inputs or issues without crashing.
  • Maintained detailed logs for debugging and monitoring purposes, facilitating easier maintenance and updates.
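
A minimal sketch of that pattern; the log path, message format, and exception class are illustrative:

```python
import logging
from pathlib import Path

Path("logs").mkdir(exist_ok=True)
logging.basicConfig(
    filename="logs/pipeline.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)


class PipelineError(Exception):
    """Wraps lower-level errors with context about which pipeline step failed."""


def run_step(step_name, func, *args, **kwargs):
    """Run one pipeline step, logging the full traceback if it fails."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        logger.exception("Step '%s' failed", step_name)
        raise PipelineError(f"{step_name} failed: {exc}") from exc
```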

🤖 Neural Network Implementation

  • Integrated neural networks using TensorFlow and Keras for modeling complex patterns in the data.
  • Designed architectures with appropriate layers, activation functions, and regularization techniques.
  • Conducted hyperparameter tuning using Keras Tuner to optimize network performance.
  • Compared neural network results with traditional machine learning models to assess improvements.
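
A sketch of a small Keras network for this tabular task; layer sizes, dropout rates, and training settings are illustrative rather than the project's exact architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

nn = keras.Sequential([
    layers.Input(shape=(X_train_bal.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])

nn.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="auc")],
)

nn.fit(
    X_train_bal.to_numpy(dtype="float32"),
    y_train_bal.to_numpy(dtype="float32"),
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    callbacks=[keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)],
    verbose=0,
)

print("Test AUC:", nn.evaluate(X_test.to_numpy(dtype="float32"),
                               y_test.to_numpy(dtype="float32"), verbose=0)[1])
```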

📊 Results and Insights

  • Achieved high performance on test data, with models like XGBoost and CatBoost delivering superior results.
  • Identified key features influencing personal loan acceptance, such as Income, CCAvg, Education, and CD Account.
  • Provided actionable insights for the marketing department to target potential customers effectively.

📂 Deployment and Application

  • Developed application.py to serve the trained model for inference through the Streamlit/CLI interface (a minimal sketch follows this list).
  • Ensured that the application could handle real-time predictions with appropriate input validations.
  • Facilitated easy deployment and scalability of the model in production environments.
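
A minimal Streamlit sketch of what application.py could look like; the model path, form fields, and feature names are assumptions for illustration, not the repository's actual interface:

```python
import joblib
import pandas as pd
import streamlit as st

st.title("Personal Loan Eligibility Predictor")

# Load a serialized model from models/ (path assumed).
model = joblib.load("models/best_model.pkl")

# Collect validated inputs mirroring the key features identified above.
income = st.number_input("Annual income ($000s)", min_value=0.0, value=50.0)
ccavg = st.number_input("Avg. monthly credit-card spend ($000s)", min_value=0.0, value=1.5)
education = st.selectbox("Education level", options=[1, 2, 3])
cd_account = st.checkbox("Has a CD account")

if st.button("Predict"):
    row = pd.DataFrame(
        [{"Income": income, "CCAvg": ccavg, "Education": education, "CD Account": int(cd_account)}]
    )
    # In practice, the same preprocessing/encoding used at training time must be applied here.
    proba = model.predict_proba(row)[0, 1]
    st.write(f"Estimated probability of accepting a personal loan: {proba:.1%}")
```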

📝 Documentation and Reporting

  • Maintained detailed documentation in README.md and Methods.txt, outlining the project's objectives, methodologies, and usage instructions.
  • Structured the codebase for clarity, with comments and docstrings explaining the purpose and functionality of each component.
  • Created reports and visualizations to communicate findings effectively to stakeholders.

⏱️ Time and Effort Investment

  • Dedicated significant time to data exploration, understanding the nuances of the dataset.
  • Invested effort in implementing and comparing multiple models, ensuring a comprehensive analysis.
  • Focused on building a robust and scalable pipeline, reflecting a deep understanding of machine learning workflows.

📌 Conclusion

This Personal Loan Classification project exemplifies a thorough and methodical approach to solving a real-world problem. By combining data preprocessing, feature engineering, advanced modeling techniques, and thoughtful deployment strategies, it delivers a comprehensive solution that can significantly aid targeted marketing efforts.

