This is an imputation package for missing data, which can be easily installed with pip.
This is the code repository for the paper "Conditional expectation with regularization for missing data imputation," currently under journal review. A preprint is available at https://arxiv.org/abs/2302.00911.
Conditional Distribution-based Imputation of Missing Values with Regularization (DIMV): an algorithm for imputing missing data with low RMSE, scalability, and explainability. Ideal for critical domains such as medicine and finance, DIMV offers reliable analysis, approximate confidence regions, and robustness to assumptions, making it a versatile choice for data imputation. DIMV relies on the assumption of normally distributed data as part of its theoretical foundation; normality is a common assumption in statistical methods and imputation techniques because it simplifies data modeling.
In this comparison, we evaluate DIMV's performance on both small datasets with randomly missing data and medium datasets (MNIST and FashionMNIST) with monotone missing data patterns (a patch removed from the top-right corner of each image; see the sketch below).
For small datasets with random missing data:
For medium datasets (MNIST and FashionMNIST):
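To make the monotone pattern concrete, here is a minimal sketch of masking out a top-right patch; the patch size (14x14 on 28x28 images) is an illustrative assumption, not the exact setting used in the experiments.

import numpy as np

# Illustrative sketch: simulate a monotone missing pattern by removing a
# top-right patch from each 28x28 image (patch size is an assumption).
def mask_top_right(images, patch_h=14, patch_w=14):
    """Set the top-right patch of each flattened 28x28 image to NaN."""
    imgs = images.reshape(-1, 28, 28).astype(float)  # float copy, NaN-capable
    imgs[:, :patch_h, -patch_w:] = np.nan            # first rows, last columns
    return imgs.reshape(len(imgs), -1)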

Here's an illustration of DIMV's imputation for MNIST and FashionMNIST:


DIMV has shown promising performance in terms of computational efficiency and robustness across small to medium datasets, accommodating a variety of missing data patterns. However, like many imputation methods, DIMV may face challenges with computational time on large or high-dimensional datasets; popular methods such as k-Nearest Neighbors Imputation (KNNI) can encounter similar performance issues in these scenarios.
The code is structured as follows:
.
├── README.md
├── example.ipynb
├── requirements.txt
└── src
    ├── DIMVImputation.py
    ├── __init__.py
    ├── conditional_expectation.py
    ├── dpers.py
    └── utils.py
In the /src folder:
- DIMVImputation.py implements the DIMV imputation algorithm for imputing missing data.
- dpers.py implements the DPER algorithm for computing the covariance matrix used in the DIMV (Conditional expectation with regularization for missing data imputation) algorithm; its input is a normalized matrix.
- conditional_expectation.py contains the computation of the regularized conditional expectation for a sliced position in the dataset, given the covariance matrix.
example.ipynb is a Jupyter Notebook that contains examples demonstrating how to use the package's functionalities and methods.
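As a rough illustration of the idea behind conditional_expectation.py, here is a minimal numpy sketch of a regularized conditional expectation for one sample, assuming zero-mean (normalized) data; this is a sketch of the technique, not the package's actual code.

import numpy as np

# Sketch (not the package's code): impute the missing entries of one
# zero-mean sample x from its observed entries, given the covariance
# matrix S and ridge regularization strength alpha.
def reg_conditional_expectation(x, S, alpha=0.1):
    m = np.isnan(x)                          # mask of missing features
    o = ~m                                   # mask of observed features
    S_oo = S[np.ix_(o, o)]                   # covariance among observed
    S_mo = S[np.ix_(m, o)]                   # cross-covariance missing/observed
    ridge = S_oo + alpha * np.eye(o.sum())   # regularized observed block
    x_imp = x.copy()
    x_imp[m] = S_mo @ np.linalg.solve(ridge, x[o])
    return x_imp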
Install the package with:
pip install git+https://github.com/maianhpuco/DIMVImputation.git
Alternatively, install from source:
- Step 1: Clone the repository
git clone <repository-url>
Then create and activate a virtual environment.
- Step 2: Install the libraries from the "requirements.txt" file.
pip install -r requirements.txt
For example, let's create a sample dataset named missing_data as a numpy array, using np.nan to mark the missing entries.
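A minimal sketch (the sizes, correlation structure, and 20% missingness rate are illustrative assumptions; any numeric array with np.nan for missing cells works):

import numpy as np

# Hypothetical sample data: 1000 samples of 5 correlated Gaussian features,
# with 20% of the entries set to NaN completely at random.
rng = np.random.default_rng(0)
data = rng.multivariate_normal(
    mean=np.zeros(5),
    cov=0.5 * np.eye(5) + 0.5,   # unit variances, 0.5 pairwise correlation
    size=1000,
)
missing_data = data.copy()
missing_data[rng.random(data.shape) < 0.2] = np.nan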
# Create train/test split
test_size = .2
split_index = int(len(missing_data) * (1 - test_size))
X_train_ori, X_test_ori = data[:split_index, :], data[split_index:, :]
X_train_miss = missing_data[:split_index, :]
X_test_miss = missing_data[split_index:, :]
from DIMVImputation import DIMVImputation
# Create an instance of the DIMVImputation class
imputer = DIMVImputation()
# Fit the imputer on the training set to compute the covariance matrix
imputer.fit(X_train_miss, initializing=False)
# Apply imputation to the missing data that we want to impute
X_test_imputed = imputer.transform(X_test_miss)
The .fit() method computes the covariance matrix from the training set; .transform() then uses this matrix to impute the missing entries.
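Since the ground-truth test set X_test_ori was kept aside above, imputation quality can be checked with a quick RMSE computation over the originally missing entries (a sketch, assuming plain numpy arrays):

import numpy as np

# RMSE restricted to the entries that were actually missing in the test set
mask = np.isnan(X_test_miss)
rmse = np.sqrt(np.mean((X_test_imputed[mask] - X_test_ori[mask]) ** 2))
print(f"RMSE on missing entries: {rmse:.4f}")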
By default, DIMVImputation uses cross-validation to determine the optimal value of the regularization parameter alpha, searching over the default grid of 0.0, 0.01, 0.1, 1.0, 10.0, and 100.0. By default, 100% of the training data is used for cross-validation.
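With these defaults, no extra configuration is needed; judging from the cross_validation=False flag used below, a plain call appears to run the grid search internally (an assumption about the default behavior):

# Assumed default behavior: cross-validates over the built-in alpha grid
X_test_imp = imputer.transform(X_test_miss)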
- To specify a custom range of alpha values, use .cross_validate() to conduct a grid search for the best alpha. Once it is determined, .transform() applies the imputation to the missing data (X_test_miss). For instance:
# Define your alpha grid and specify the data percentage for cross-validation
imputer.cross_validate(alphas=[0.0, 0.01, 0.1, 1.0])
X_test_imp = imputer.transform(X_test_miss, cross_validation=False)
- If you aim to modify the percentage of training data utilized in cross-validation (note: this doesn't affect the .fit() method's training set), you can adjust it as follows:
# Define your alpha grid and set the data percentage for cross-validation
imputer.cross_validate(train_percent=80, alphas=[0.0, 0.01, 0.1, 1.0])
X_test_imp = imputer.transform(X_test_miss, cross_validation=False)
- To incorporate feature selection, which eliminates irrelevant features based on a correlation threshold, apply the following settings. The feature selection criterion is applied to both cross-validation and the .fit() method:
imputer.cross_validate(
    train_percent=80,
    alphas=[0.0, 0.01, 0.1, 1.0],
    features_corr_threshold=0.3,
    mlargest_features=5
)
X_test_imp = imputer.transform(X_test_miss, cross_validation=False)