The goal of this library is to automate several tasks related to Deep Generative Models based on JSON configuration files.
- Requirements
- Data
- Metadata
- Runner
- Tasks
  - Base Tasks
  - Runnable Tasks
    - Compute Means And Modes
    - Encode
    - Generate Missing Mask
    - Impute With MIDA
    - Means And Modes Imputation
    - MissForest Imputation
    - Multi-Process Task Runner
    - Normal Noise Imputation
    - Sample ARAE
    - Sample GAN
    - Sample MedGAN
    - Sample VAE
    - Serial Task Runner
    - Train ARAE
    - Train AutoEncoder
    - Train DeNoising AutoEncoder
    - Train GAIN
    - Train GAN
    - Train MIDA
    - Train MedGAN
    - Train MissForest
    - Train VAE
    - Zero Imputation
- Architecture
- Examples
- References
This code was developed with Python 3.
All the Python libraries required to run this code are listed in the file requirements.txt.
I recommend installing everything with pip inside a virtual environment.
Run this command from the project root directory:
pip install -r requirements.txt
Make sure that the project main module (deep_generative_models) can be found by Python.
There are several ways to achieve this:
- Executing the scripts with the project root directory as the working directory (i.e. executing cd to the project root directory before executing the project scripts).
- Adding the project root directory to the PYTHONPATH environment variable (see the Python documentation).
- Adding a .pth file to the site-packages directory of the virtual environment with the absolute path of the project root directory (see the Python documentation).
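For example, assuming the repository was cloned to the hypothetical path /home/user/deep-generative-models, the second option could be achieved with:
export PYTHONPATH="$PYTHONPATH:/home/user/deep-generative-models"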
This library works with data in numpy format. Categorical variables should be one-hot-encoded and numerical variables should be scaled between 0 and 1.
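For illustration, here is a minimal sketch of how such a feature matrix could be prepared, assuming pandas and scikit-learn are available; the column names and values are hypothetical and not part of this library:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# hypothetical raw data with one categorical and one numerical variable
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "age": [23, 45, 31, 52],
})

# one-hot encode the categorical variable and scale the numerical one to [0, 1]
one_hot = pd.get_dummies(df["color"], prefix="color").astype(float)
scaled = MinMaxScaler().fit_transform(df[["age"]])

# concatenate into a single feature matrix and save it in numpy format
features = np.concatenate([one_hot.to_numpy(), scaled], axis=1)
np.save("features.npy", features)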
Some specific information about the data is needed for this library to work. Metadata JSON files should have the following format:
{
"variables": [],
"variable_sizes": [],
"variable_types": [],
"features": [],
"index_to_value": [],
"value_to_index": {},
"num_features": 0,
"num_variables": 0,
"num_binary_variables": 0,
"num_categorical_variables": 0,
"num_numerical_variables": 0,
"num_samples": 0,
"num_classes": 0,
"classes": []
}
Properties:
"variables"
(string array): Each position determines the name of the corresponding variable."variable_types"
(string array): Each position determines the type of the corresponding variable and can take the value"categorical"
,"numerical"
, or"binary"
."variable_sizes"
(int array): Each position determines the size of the corresponding variable (see constraints)."num_variables"
(int): Number of variables."num_binary_variables"
(int): Number of variables with binary type."num_categorical_variables"
(int): Number of variables with categorical type."num_numerical_variables"
(int): Number of variables with numerical type."features"
(string array): Each position determines the name of the corresponding feature."num_features"
(int): Number of features."value_to_index"
(dictionary of dictionaries): The mappingi = value_to_index["category"]["value"]
indicates that the variable with name"category"
is mapped to the feature with indexi
when the variable value is equal to"value"
."index_to_value"
(string-array array): Each positioni
ofindex_to_value
has the form["category", "value"]
, which is the inverse ofvalue_to_index
(see constraints)."num_samples"
(int): Number of samples."num_classes"
(optional int): Number of different classes (if present)."classes"
(optional string array): The name of each class (if present).
I am aware that some values are redundant, but having them pre-calculated saves some time.
Constraints:
len(variables) == len(variable_sizes) == len(variable_types) == num_variables > 0
sum([1 for i in range(num_variables) if variable_types[i] == "binary"]) == num_binary_variables >= 0
sum([1 for i in range(num_variables) if variable_types[i] == "categorical"]) == num_categorical_variables >= 0
sum([1 for i in range(num_variables) if variable_types[i] == "numerical"]) == num_numerical_variables >= 0
num_variables == num_binary_variables + num_categorical_variables + num_numerical_variables
variable_sizes[i] > 2 for i in range(num_variables) if variable_types[i] == "categorical"
variable_sizes[i] == 1 for i in range(num_variables) if variable_types[i] != "categorical"
len(features) == len(index_to_value) == num_features >= num_variables
index_to_value[index][0] == category and index_to_value[index][1] == value for category in value_to_index.keys() for value, index in value_to_index[category].items()
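As an illustration only, a hypothetical metadata file for a toy dataset with two categorical variables (the names and values are made up) that satisfies all of the constraints above could look like this:
{
    "variables": ["color", "size"],
    "variable_sizes": [3, 3],
    "variable_types": ["categorical", "categorical"],
    "features": ["color_red", "color_green", "color_blue", "size_small", "size_medium", "size_large"],
    "index_to_value": [["color", "red"], ["color", "green"], ["color", "blue"], ["size", "small"], ["size", "medium"], ["size", "large"]],
    "value_to_index": {
        "color": {"red": 0, "green": 1, "blue": 2},
        "size": {"small": 3, "medium": 4, "large": 5}
    },
    "num_features": 6,
    "num_variables": 2,
    "num_binary_variables": 0,
    "num_categorical_variables": 2,
    "num_numerical_variables": 0,
    "num_samples": 1000,
    "num_classes": 2,
    "classes": ["negative", "positive"]
}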
The runner is the main entry point for this library. It receives a configuration file as the only argument.
python -m deep_generative_models.tasks.runner configuration.json
The file configuration.json
should have the following format:
{
"task": "...",
"arguments": {}
}
The "task"
can take any of the following string values:
- Train:
"TrainARAE"
"TrainAutoEncoder"
"TrainDeNoisingAutoEncoder"
"TrainGAIN"
"TrainGAN"
"TrainMIDA"
"TrainMedGAN"
"TrainVAE"
- Sample:
"SampleARAE"
"SampleGAN"
"SampleMedGAN"
"SampleVAE"
- Encode:
"Encode"
- Imputation:
"ComputeMeansAndModes"
"GANIterativeImputation"
"GenerateMissingMask"
"ImputeWithGAIN"
"ImputeWithMIDA"
"ImputeWithVAE"
"MeansAndModesImputation"
"MissForestImputation"
"NormalNoiseImputation"
"TrainMissForest"
"ZeroImputation"
- Tasks that run other tasks:
"MultiProcessTaskRunner"
"SerialTaskRunner"
The arguments depend on the task (see each task's documentation).
Each task type defines the list of mandatory and optional arguments it accepts. Some tasks are a "base" for other tasks and cannot be executed directly. The runnable tasks inherit the arguments from their base tasks and can be executed in two ways:
- using the runner,
- using the task script directly (see each task documentation).
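For instance, a minimal configuration that invokes the GAN training task through the runner has the following shape; the contents of "arguments" are intentionally left empty here because the actual argument names are defined by each task's documentation:
{
    "task": "TrainGAN",
    "arguments": {}
}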
The architecture JSON file should have the following format:
{
"arguments": {},
"components": {}
}
The "arguments"
object contains global architecture configuration which is shared among components.
The key for each element in the "components"
object defines a reference name for the component, and the value should have the following format:
{
"factory": "...",
"arguments": {}
}
The "factory"
defines who will create the component and can take any of the following string values:
"CodeCritic"
"CodeDiscriminator"
"CodeGenerator"
"Critic"
"Discriminator"
"MultiInputEncoder"
"MultiInputGAINDiscriminator"
"MultiOutputDecoder"
"MultiOutputGenerator"
"MultiVariableAutoEncoder"
"MultiVariableDeNoisingAutoencoder"
"MultiVariableGAINGenerator"
"MultiVariableMIDA"
"MultiVariableVAE"
"SingleInputEncoder"
"SingleInputGAINDiscriminator"
"SingleOutputDecoder"
"SingleOutputGenerator"
"SingleVariableAutoEncoder"
"SingleVariableDeNoisingAutoencoder"
"SingleVariableGAINGenerator"
"SingleVariableMIDA"
"SingleVariableVAE"
Components can in turn reference other factories:
- Layers:
"AdditiveNormalNoise"
"HiddenLayers"
"MeanAndModesImputation"
"MultiInputDropout"
"MultiInputLayer"
"MultiOutputLayer"
"NormalNoiseImputation"
"SingleInputLayer"
"SingleOutputLayer"
"ZeroImputation"
- Activations:
"GumbelSoftmaxSampling"
"SoftmaxSampling"
- Losses:
"AutoEncoderLoss"
"GAINDiscriminatorLoss"
"GAINGeneratorLoss"
"GANDiscriminatorLoss"
"GANGeneratorLoss"
"MultiReconstructionLoss"
"RMSE"
"VAELoss"
"ValidationImputationLoss"
"ValidationReconstructionLoss"
"WGANCriticLoss"
"WGANCriticLossWithGradientPenalty"
"WGANGeneratorLoss"
- Optimizers:
"WGANOptimizer"
There are also factories for PyTorch elements (see the argument definitions in the corresponding PyTorch documentation):
- Layers:
"Dropout"
- Activations:
"LeakyReLU"
"ReLU"
"Sigmoid"
"Tanh"
- Losses:
"BCE"
"CrossEntropy"
"MSE"
- Optimizers:
"Adam"
"SGD"
IMPORTANT: The following examples by no means describe the best combination of architecture or hyper-parameters to achieve good results. The goal of the examples is only to illustrate how to use this library.
Data is not included in the examples because of space limitations and because it does not belong to me. You can find example scripts to download the kind of data and metadata required for this project in rcamino/dataset-pre-processing.
For example, to download and pre-process the "Default of Credit Card Clients" dataset from the UCI repository, first create the directory examples/data/default_of_credit_card_clients and then execute:
python $DATASET_PRE_PROCESSING_ROOT/uci/default_of_credit_card_clients/download_and_transform.py \
--directory=examples/data/default_of_credit_card_clients
If you want to run examples that need the data split into train and test (e.g. a 90% train / 10% test split), run the following, where $DIRECTORY is the dataset directory (in this case examples/data/default_of_credit_card_clients):
python $DATASET_PRE_PROCESSING_ROOT/train_test_split.py --stratify --shuffle \
$DIRECTORY/features.npy \
0.9 \
$DIRECTORY/features-train.npy \
$DIRECTORY/features-test.npy \
--labels=$DIRECTORY/labels.npy \
--train_labels=$DIRECTORY/labels-train.npy \
--test_labels=$DIRECTORY/labels-test.npy
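To quickly verify the split (optional), the resulting files can be inspected with numpy; the paths below assume $DIRECTORY was set to examples/data/default_of_credit_card_clients:
import numpy as np

# load the generated train and test splits and check their sizes
train = np.load("examples/data/default_of_credit_card_clients/features-train.npy")
test = np.load("examples/data/default_of_credit_card_clients/features-test.npy")
print(train.shape, test.shape)  # the rows should follow roughly a 90% / 10% split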
You can find more information about the dataset in the dataset repository or in my code repository.
For simplicity, these tasks use paths relative to the project root. This means that, unless you edit the task configuration files yourself, most of these tasks expect that:
- data can be found in examples/data,
- and other task configuration files can be found in examples/tasks.
The tasks and architectures depend heavily on the data and metadata.
- Description: Train and sample from a single-variable GAN using the Default of Credit Card Clients dataset. It is called "single-variable" (as opposed to "multi-variable") because it does not take into account the type of each variable; the metadata is basically ignored. This is the more "traditional" way of training GANs.
- Task directory: examples/tasks/default_of_credit_card_clients/single_variable_gan.
- Expected data directory: examples/data/default_of_credit_card_clients.
- Architecture (architecture.json):
  - The generator and the discriminator have only one hidden layer of size 50.
  - The input noise for the generator is also of size 50.
  - The discriminator has one training step per generator training step.
  - Both use an Adam optimizer and a learning rate of 0.001.
- Tasks:
  - train.json: the train task with 10 epochs and batch size of 128.
  - sample.json: the sample task using the default sampling method.
  - train-then-sample.json: a serial task that runs the train task and then the sample task.
- Files that will be generated:
  - checkpoint.pickle: the trained models.
  - train.csv: training information (i.e. the loss).
  - sample.npy: a small dataset generated by the trained model.
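To run this example end to end, the serial task can be passed to the runner (the path follows the task directory listed above):
python -m deep_generative_models.tasks.runner examples/tasks/default_of_credit_card_clients/single_variable_gan/train-then-sample.json
Afterwards, the generated samples can be inspected with numpy; this assumes the output files are written next to the task configuration files, so adjust the path if your configuration stores them elsewhere:
import numpy as np

samples = np.load("examples/tasks/default_of_credit_card_clients/single_variable_gan/sample.npy")
print(samples.shape)                 # (number of generated samples, num_features)
print(samples.min(), samples.max())  # features should lie roughly in [0, 1]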
Many of the concepts in this library are defined and discussed in my publications:
- Camino, R., Hammerschmidt, C., & State, R. (2018). Generating multi-categorical samples with generative adversarial networks. arXiv preprint arXiv:1807.01202. (arXiv) (ICML 2018 TADGM) (old code)
- Camino, R. D., Hammerschmidt, C. A., & State, R. (2019). Improving missing data imputation with deep generative models. arXiv preprint arXiv:1902.10666. (arXiv) (old code)
- Camino, R. D., & Hammerschmidt, C. (2020). Working with deep generative models and tabular data imputation. (OpenReview) (ICML 2020 Artemiss)
- Camino, R. D., & Hammerschmidt, C. A. (2020). Oversampling Tabular Data with Deep Generative Models: Is it worth the effort? (PMLR) (OpenReview) (NeurIPS 2020 ICBINB)
Also check these other repositories and papers that I referenced: