Input data plays an essential role in training a machine learning model. However, real-world data is rarely perfect, and an ML algorithm trained on such data is often biased.
This repo studies different data imputation techniques and their effects on the bias produced by several popular ML algorithms. It covers:
- Methods of generating missing data entries
  - Missing Completely At Random (MCAR)
  - Missing At Random (MAR)
  - Not Missing At Random (NMAR)
- Methods of imputing a missing data entry
  - Mean Imputation
  - Similar Imputation (KNNImputer)
  - Multiple Imputation (IterativeImputer)
- Different machine learning algorithms
  - Logistic Regression
  - Multi-Layer Perceptron
  - Decision Tree
  - Random Forest
  - SVM (linear)
  - K-Nearest Neighbors
- Several popular biased datasets
  - Iris Dataset (UCI) (No Longer Used)
  - Bank Dataset (UCI)
  - Adult Dataset (UCI)
  - COMPAS Dataset
  - Heart Disease Dataset (UCI) (No Longer Used)
  - Drug Consumption Dataset (UCI) (No Longer Used)
  - Titanic Dataset (Kaggle) (Kaggle account required)
  - German Credit Dataset (UCI)
  - Communities and Crime Dataset (UCI)
  - Recidivism in juvenile justice (No Longer Used)
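
The missingness mechanisms and imputation methods listed above can be sketched as follows. This is a minimal illustration, not the code in utils/*.py: the function names `induce_mcar`, `induce_mar`, and `induce_nmar`, the median-based dropping rule, and the synthetic data are all assumptions made for the example. The imputers are scikit-learn's `SimpleImputer`, `KNNImputer`, and `IterativeImputer`.

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn and must be enabled first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer


def induce_mcar(X, ratio, rng):
    """MCAR: every cell is dropped with the same probability, independent of the data."""
    X = X.astype(float).copy()
    X[rng.random(X.shape) < ratio] = np.nan
    return X


def induce_mar(X, target_col, observed_col, ratio, rng):
    """MAR: missingness in target_col depends only on another, fully observed column."""
    X = X.astype(float).copy()
    above_median = X[:, observed_col] > np.median(X[:, observed_col])
    drop = above_median & (rng.random(len(X)) < ratio)
    X[drop, target_col] = np.nan
    return X


def induce_nmar(X, target_col, ratio, rng):
    """NMAR: missingness in target_col depends on the value that goes missing."""
    X = X.astype(float).copy()
    above_median = X[:, target_col] > np.median(X[:, target_col])
    drop = above_median & (rng.random(len(X)) < ratio)
    X[drop, target_col] = np.nan
    return X


# Hypothetical settings; the repo's actual hyperparameters may differ.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "similar": KNNImputer(n_neighbors=5),
    "multiple": IterativeImputer(max_iter=10, random_state=0),
}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # synthetic stand-in for a real dataset
X_missing = induce_mcar(X, ratio=0.2, rng=rng)

# Each imputer fills the NaN cells and returns a complete array.
imputed = {name: imp.fit_transform(X_missing) for name, imp in imputers.items()}
```

Note that `IterativeImputer` returns a single completed dataset per call. To obtain several imputed datasets in the multiple-imputation sense, it can be run repeatedly with `sample_posterior=True` and different random seeds.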

Folders:
- ratio_analysis_plots: plots for MCAR experiments
- other_analysis_plots: plots for MAR and NMAR experiments
- dataset_analysis_plots: plots for Feature Selection experiments
- nouse: outdated experimental data

Scripts:
- utils/*.py: main body of experiment setup (dataset loading, imputation methods, missingness induction functions)
- main.py: download the required datasets to local folder
- script_prepare.py: parameter search on classifiers for each dataset
- script_single_task.py: multi-process MCAR experiment script
- script_single_task_ext.py: multi-process MAR and NMAR experiment script
- script_plot.py: generate MCAR related plots from experimental outputs
- script_plot_ext.py: generate MAR and NMAR related plots from experimental outputs
- script_dataset_analysis.py: generate Feature Selection experimental plots

Note: because of how Python 3 multiprocessing behaves on Windows, the scripts that use multiprocessing cannot be run there.

Notebooks:
- research notes.ipynb: literature search and notes of AI fairness papers
- notebooks/*.ipynb: analysis of outputs (MCAR, MAR, NMAR experiments) and initial work for Feature Selection experiments
- AIF360_Related/*.ipynb: experiments of our methods in combination with preprocessing methods provided by IBM AIF360 package

Instead of inducing MCAR missingness on the whole dataset, the Feature Selection experiments induce it only on features chosen by feature selection, then apply imputation with the aim of a better bias reduction.
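
The idea of restricting MCAR missingness to selected features can be sketched as below. This is only an illustration under stated assumptions: the helper `induce_mcar_on_columns`, the choice of scikit-learn's `SelectKBest` with `f_classif` as the selector, the value `k=3`, and the synthetic dataset are all hypothetical, and the repo's experiments may select and rank features differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer


def induce_mcar_on_columns(X, columns, ratio, rng):
    """Drop cells completely at random, but only within the given columns."""
    X = X.astype(float).copy()
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, columns] = rng.random((X.shape[0], len(columns))) < ratio
    X[mask] = np.nan
    return X


rng = np.random.default_rng(0)
# Synthetic stand-in for one of the datasets listed earlier.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Assumed selector: keep the 3 features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected = np.flatnonzero(selector.get_support())

# Missingness is induced on the selected columns only, then imputed.
X_missing = induce_mcar_on_columns(X, selected, ratio=0.3, rng=rng)
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)
```

A classifier would then be trained on `X_imputed` and its bias compared against a model trained on the original data; that comparison is left out here because it depends on the fairness metric used.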
- IBM AIF360
- Missing-data imputation
- Multiple Imputation in Stata
- COMPAS Recidivism Risk Score Data and Analysis
- Responsibly
- Fairness Measures
- More in research notes.ipynb