This team project was adapted from the Kaggle competition "Costa Rican Household Poverty Level Prediction" (https://www.kaggle.com/competitions/costa-rican-household-poverty-prediction). Sponsored by the Inter-American Development Bank, the competition seeks to improve on traditional poverty-prediction methods by promoting the development and application of machine learning models.
In this study, we propose and evaluate the effectiveness of four supervised machine learning models in predicting Costa Rican household poverty levels from the individual- and household-level features provided in the dataset: Gradient Boosted Decision Trees, Naive Bayes, Logistic Regression, and K-Nearest Neighbors (KNN). We conclude by selecting the model that performs most effectively at predicting the poverty levels of Costa Rican households. This research aims to support policy-making and strategic planning by offering an efficient, data-driven approach to improving household well-being. This capability is particularly crucial in developing and implementing social welfare programs, where the objective is to target households experiencing varying dimensions and levels of deprivation.
The primary dataset used in this study was compiled by the Inter-American Development Bank through a proxy means test, which combines self-reported answers about household composition and educational outcomes with observable physical characteristics of housing conditions (e.g., overcrowding, roof type) and asset ownership (computers and other electronic devices, among others). The target is an ordinal variable representing 'extreme poverty' (1), 'moderate poverty' (2), 'vulnerable households' (3), and 'non-vulnerable households' (4).
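For orientation, the short snippet below loads the raw training data and tabulates the target classes. It assumes the repository layout described below (kaggle_data/train.csv) and the Target column name from the Kaggle codebook.

```python
import pandas as pd

# Load the raw Kaggle training data (path follows the repo layout below).
train = pd.read_csv("kaggle_data/train.csv")

# Per the Kaggle codebook: 1 = extreme poverty, 2 = moderate poverty,
# 3 = vulnerable households, 4 = non-vulnerable households.
print(train["Target"].value_counts().sort_index())
```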
In our project, we start by reviewing the literature and performing exploratory analysis on the training dataset provided by Kaggle. Next, we clean the data, impute missing values where appropriate, aggregate individual-level observations into a single household-level observation, and engineer new features. The result is the modified dataset that we use to build and test our models going forward. We then test four different machine learning models to evaluate which is most effective at predicting the target class; these models were selected for their varying characteristics and their effectiveness on classification tasks across a diverse set of domains. Of the four, we select decision trees and move forward to build the final model. Finally, we experiment with decision trees under various model specifications: feature selection (keeping only features with an absolute correlation of at least 0.15 with the target), SMOTE (to balance our dataset), and random forest modeling. Based on the successes and limitations of these specifications on our modified dataset, we conclude by choosing the Random Forest model trained on the selected features from the SMOTE-balanced dataset as our final model.
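As a concrete example, here is a minimal sketch of the correlation-based feature selection described above, run against the modified household-level dataset; that the Target column name carries over into the modified dataset is an assumption.

```python
import pandas as pd

# Household-level dataset produced in Checkpoint 2a.
train = pd.read_csv("modified_data/modified_train.csv")

# Keep features whose absolute Pearson correlation with the target
# is at least 0.15 (the threshold used in the final models).
corr = train.corr(numeric_only=True)["Target"].drop("Target")
selected = corr[corr.abs() >= 0.15].index.tolist()
X, y = train[selected], train["Target"]
```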
kaggle_data (folder):
- train.csv: Original training dataset provided by Kaggle
- test.csv: Original testing dataset provided by Kaggle
- codebook.csv: Description of features found in the original training dataset
Checkpoint1.ipynb: Starting with the original dataset, we review the literature and explore the data to familiarize ourselves with the scope and direction of the project.
Checkpoint 2a.ipynb: Following our findings in Checkpoint 1, we clean the original dataset and engineer new features, producing our own modified dataset. We also aggregate all individual-level observations into a single household-level observation, as sketched below.
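A minimal sketch of that aggregation step, assuming the idhogar (household identifier) and escolari (years of schooling) columns from the Kaggle codebook; the notebook aggregates far more features than shown here.

```python
import pandas as pd

# One row per person in the raw data; 'idhogar' identifies the household.
train = pd.read_csv("kaggle_data/train.csv")

# Collapse to one row per household.
households = train.groupby("idhogar").agg(
    n_members=("Target", "size"),    # row count = people per household
    mean_schooling=("escolari", "mean"),
    Target=("Target", "first"),      # Target is defined at the household level
).reset_index()
```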
modified_data (folder):
- modified_train.csv: Our modified dataset, created and saved in Checkpoint 2a. Contains the original data aggregated to the household level, after data cleaning and feature engineering.
- modified_codebook.csv: Descriptions of the feature labels in our modified dataset. Contains the features from the original codebook plus descriptions of our newly generated features.
Checkpoint2b.ipynb: We build and test four machine learning models on our modified dataset. After evaluating their performance, we present a comparative analysis of these four models to determine the most effective approach.
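For illustration, a minimal cross-validation harness for such a comparison is sketched below, scored with the competition's macro F1 metric; the default hyperparameters and the all-numeric feature matrix are illustrative assumptions, not the exact settings from the notebook.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("modified_data/modified_train.csv")
X = train.drop(columns=["Target"]).select_dtypes("number")
y = train["Target"]

models = {
    "Gradient Boosted Trees": GradientBoostingClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Macro F1 is the Kaggle competition metric.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```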
Final Checkpoint.ipynb: This is our code for testing various specifications of decision tree models, comparing their performance, and selecting our final model.
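A sketch of the final specification (a random forest trained on a SMOTE-balanced training split), assuming the imbalanced-learn package and illustrative hyperparameters; in the notebook, SMOTE is combined with the correlation-based feature selection shown earlier.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("modified_data/modified_train.csv")
X = train.drop(columns=["Target"]).select_dtypes("number")
y = train["Target"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training split so the held-out evaluation stays unbiased.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_bal, y_bal)
print("macro F1:", f1_score(y_te, rf.predict(X_te), average="macro"))
```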
Final_tree.png: A visual representation of our final random forest model.
Final Report_Pura Vida.pdf: Our final project report. The paper presents a comprehensive discussion of our research process, approach, and final conclusions about which model is most effective for predicting household poverty level in Costa Rica.