Datasets.

rcamino committed Jul 6, 2018
1 parent f574404 commit ce64e96
Showing 9 changed files with 409 additions and 0 deletions.
Empty file.
5 changes: 5 additions & 0 deletions multi_categorical_gans/datasets/README.md
@@ -0,0 +1,5 @@
# Datasets
In this package you will find scripts to process or generate the datasets from the paper:

- [Synthetic data generation](synthetic/README.md)
- [US Census 1990](uscensus/README.md)
Empty file.
76 changes: 76 additions & 0 deletions multi_categorical_gans/datasets/synthetic/README.md
@@ -0,0 +1,76 @@
# Synthetic data generation

We generated several synthetic datasets for our experiments.

We decided to save the data in `data/synthetic`:

```bash
mkdir -p data/synthetic/fixed_2
mkdir -p data/synthetic/fixed_10
mkdir -p data/synthetic/mix_small
mkdir -p data/synthetic/mix_big
```

The basic arguments for the script that generates synthetic datasets are:

```bash
python multi_categorical_gans/datasets/synthetic/generate.py
usage: generate.py [-h] [--min_variable_size MIN_VARIABLE_SIZE]
[--max_variable_size MAX_VARIABLE_SIZE] [--seed SEED]
[--class_distribution CLASS_DISTRIBUTION]
[--class_distribution_type {probs,logits,uniform}]
num_samples num_variables metadata_path output_path

```

The first variable can be considered a class or label.
It has a fixed categorical distribution that is defined with the `class_distribution` and `class_distribution_type` parameters (see the examples after this list):

- when `class_distribution_type=uniform`, `class_distribution` must be an integer defining the number of classes;
- when `class_distribution_type=probs`, `class_distribution` must be a list of comma-separated positive floats
adding up to one that defines the probability of each class;
- when `class_distribution_type=logits`, `class_distribution` must be a list of comma-separated floats
that will be passed through a softmax to define the probability of each class.

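For example, the class variable can be configured in any of these three ways (the values are only illustrative, and the flags are combined with the positional arguments shown above):

```bash
# four equally likely classes
--class_distribution=4 --class_distribution_type=uniform

# three classes with explicit probabilities (must add up to one)
--class_distribution=0.5,0.3,0.2 --class_distribution_type=probs

# three classes whose probabilities are softmax(2.0, 1.0, 0.0)
--class_distribution=2.0,1.0,0.0 --class_distribution_type=logits
```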

For each of the following variables, one categorical distribution is defined at random for every possible value of the previous variable.
The parameters `min_variable_size` and `max_variable_size` define the range for the number of possible values of each variable.
During the generation of a sample, the categorical distribution for each variable is selected depending on the value drawn for the previous variable, as sketched below.

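A minimal sketch of this cascade sampling, mirroring what `generate.py` (included further below) does; the sizes and probabilities here are illustrative:

```python
import torch
from torch.distributions.one_hot_categorical import OneHotCategorical

# class variable: three classes with fixed probabilities
class_variable = OneHotCategorical(probs=torch.tensor([0.5, 0.3, 0.2]))

# next variable: one random conditional distribution per possible class value
conditionals = {value: OneHotCategorical(logits=torch.randn(5)) for value in range(3)}

class_sample = class_variable.sample()  # one-hot vector of size 3
k = class_sample.argmax().item()        # value drawn for the class
next_sample = conditionals[k].sample()  # one-hot vector of size 5
```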
To generate a dataset similar to the one called `FIXED 2` in the paper:

```bash
python multi_categorical_gans/datasets/synthetic/generate.py --min_variable_size=2 --max_variable_size=2 10000 9 \
data/synthetic/fixed_2/metadata.json \
data/synthetic/fixed_2/synthetic.features.npz
```

To generate a dataset similar to the one called `FIXED 10` in the paper:

```bash
python multi_categorical_gans/datasets/synthetic/generate.py --min_variable_size=10 --max_variable_size=10 10000 9 \
data/synthetic/fixed_10/metadata.json \
data/synthetic/fixed_10/synthetic.features.npz
```

To generate a dataset similar to the one called `MIX SMALL` in the paper:

```bash
python multi_categorical_gans/datasets/synthetic/generate.py --min_variable_size=2 --max_variable_size=10 10000 9 \
    data/synthetic/mix_small/metadata.json \
    data/synthetic/mix_small/synthetic.features.npz
```

To generate a dataset similar to the one called `MIX BIG` in the paper:

```bash
python multi_categorical_gans/datasets/synthetic/generate.py --min_variable_size=2 --max_variable_size=10 10000 99 \
    data/synthetic/mix_big/metadata.json \
    data/synthetic/mix_big/synthetic.features.npz
```
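To sanity-check a generated dataset, the sparse features and the metadata can be loaded back; a minimal sketch, assuming the `FIXED 2` paths from above:

```python
import json

from scipy.sparse import load_npz

features = load_npz("data/synthetic/fixed_2/synthetic.features.npz")
with open("data/synthetic/fixed_2/metadata.json") as metadata_file:
    metadata = json.load(metadata_file)

print(features.shape)              # (num_samples, num_features)
print(metadata["variable_sizes"])  # one-hot size of each variable, class first
```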

For more information about the generation script, run:

```bash
python multi_categorical_gans/datasets/synthetic/generate.py -h
```
Empty file.
185 changes: 185 additions & 0 deletions multi_categorical_gans/datasets/synthetic/generate.py
@@ -0,0 +1,185 @@
from __future__ import print_function

import argparse
import json
import torch

import numpy as np

from scipy.sparse import csr_matrix, save_npz
from torch.distributions.one_hot_categorical import OneHotCategorical


distribution_types = ["probs", "logits", "uniform"]


class Variable(object):

def __init__(self, distributions):
self.distributions = distributions

def sample(self, previous_sample):
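        # the value drawn for the previous variable selects the conditional distribution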
k = previous_sample.argmax().item()
distribution = self.distributions[k]
return distribution.sample()


def add_one(ones, rows, cols, i, j, sample):
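    # record the column of the single one in this one-hot sample (COO format)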
k = sample.argmax().item()
ones.append(1)
rows.append(i)
cols.append(j + k)


def generate_one_hot_variable(distribution, distribution_type):
assert distribution_type in distribution_types
variable = OneHotCategorical(**{distribution_type: torch.FloatTensor(distribution)})
assert all([prob > 0 for prob in variable.probs])
return variable


def print_matrix_stats(matrix, num_samples, num_features):
num_ones = matrix.sum()
num_positions = num_samples * num_features

num_ones_per_row = np.asarray(matrix.sum(axis=1)).ravel()
num_ones_per_column = np.asarray(matrix.sum(axis=0)).ravel()

print("Min:", matrix.min())
print("Max:", matrix.max())
print("Rows:", matrix.shape[0])
print("Columns:", matrix.shape[1])
print("Mean ones per row:", num_ones_per_row.mean())
print("Mean ones per column:", num_ones_per_column.mean())
print("Total ones:", num_ones)
print("Total positions:", num_positions)
print("Total ratio of ones:", num_ones / float(num_positions))
print("Empty rows:", np.sum(num_ones_per_row == 0))
print("Full rows:", np.sum(num_ones_per_row == num_features))
print("Empty columns:", np.sum(num_ones_per_column == 0))
print("Full columns:", np.sum(num_ones_per_column == num_samples))


def generate_one_hot(num_samples, num_variables, min_variable_size, max_variable_size, metadata_path, output_path,
                     class_distribution=(2.0,), class_distribution_type="uniform", seed=None):

    if seed is not None:
        np.random.seed(seed)
        # the random logits and all samples below are drawn through torch, so seed it as well
        torch.manual_seed(seed)

assert 2 <= min_variable_size <= max_variable_size
assert class_distribution is not None
if class_distribution_type == "uniform":
num_classes = int(class_distribution[0])
class_distribution = [1.0 / num_classes for _ in range(num_classes)]
class_distribution_type = "probs"

# generate classes
class_variable = generate_one_hot_variable(class_distribution, class_distribution_type)
num_classes = class_variable.event_shape[0]

# generate variables
variables = []
variable_sizes = [num_classes]
num_features = num_classes
last_variable_size = num_classes
for _ in range(num_variables):
if min_variable_size == max_variable_size:
variable_size = min_variable_size
else:
variable_size = np.random.randint(low=min_variable_size, high=max_variable_size + 1)

variable_sizes.append(variable_size)
distributions = {}
for input_value in range(last_variable_size):
            logits = torch.randn(variable_size)  # standard normal logits for this conditional distribution
distributions[input_value] = OneHotCategorical(logits=logits)

variables.append(Variable(distributions))
num_features += variable_size
last_variable_size = variable_size

# generate metadata
metadata = {
"seed": seed,
"variable_sizes": variable_sizes,
"class_probs": class_variable.probs.tolist(),
"variable_probs": [[sub_variable.probs.tolist() for sub_variable in variable.distributions.values()]
for variable in variables]
}

with open(metadata_path, "w") as metadata_file:
json.dump(metadata, metadata_file, indent=2)

# generate data
ones = []
rows = []
cols = []

for i in range(num_samples):
j = 0
class_sample = class_variable.sample()
add_one(ones, rows, cols, i, j, class_sample)
j += class_sample.shape[0]
previous_sample = class_sample
for variable in variables:
sample = variable.sample(previous_sample)
add_one(ones, rows, cols, i, j, sample)
j += sample.shape[0]
previous_sample = sample

output = csr_matrix((ones, (rows, cols)), shape=(num_samples, num_features), dtype=np.uint8)

print_matrix_stats(output, num_samples, num_features)

save_npz(output_path, output)


def main():
    options_parser = argparse.ArgumentParser(description="Generate one-hot encoded data with cascade dependencies.")

options_parser.add_argument("num_samples", type=int, help="Number of output samples.")

options_parser.add_argument("num_variables", type=int, help="Number of output categorical variables.")

options_parser.add_argument("metadata_path", type=str,
help="Output data file path indicating the class distribution and the variable maps.")

options_parser.add_argument("output_path", type=str,
help="Output data file path in sparse format.")

options_parser.add_argument("--min_variable_size", type=int, default=2,
help="Minimum random size of each categorical variable. Should be at least 2.")

options_parser.add_argument("--max_variable_size", type=int, default=10,
help="Maximum random size of each categorical variable.")

options_parser.add_argument("--seed", type=int, help="Random number generator seed.", default=42)

options_parser.add_argument("--class_distribution", type=str, default="2",
help="Defines the distribution of the class variable. See 'class_distribution_type'.")

options_parser.add_argument("--class_distribution_type", type=str, default="uniform", choices=distribution_types,
help="If uniform, same probability is assigned to every class;" +
" the 'class_distribution' should be the number of classes." +
"\nIf probs, explicit probabilities per class" +
" are defined in 'class_distribution' separated by commas." +
"\nIf logits, the values separated by commas defined in 'class_distribution'" +
" will be used as softmax logits."
)

options = options_parser.parse_args()

generate_one_hot(options.num_samples,
options.num_variables,
options.min_variable_size,
options.max_variable_size,
options.metadata_path,
options.output_path,
[float(x) for x in options.class_distribution.split(",")],
options.class_distribution_type,
options.seed
)


if __name__ == "__main__":
main()
33 changes: 33 additions & 0 deletions multi_categorical_gans/datasets/uscensus/README.md
@@ -0,0 +1,33 @@
# US Census 1990

This is one of the datasets we used for our experiments.

We decided to save the data in `data/uscensus`:

```bash
mkdir -p data/uscensus
```

To download the data you can visit the
[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990))
or download it directly:

```bash
cd data/uscensus
wget https://archive.ics.uci.edu/ml/machine-learning-databases/census1990-mld/USCensus1990.data.txt
```

To transform the CSV data, one-hot encoding each categorical variable, run:

```bash
python multi_categorical_gans/datasets/uscensus/transform.py \
data/uscensus/USCensus1990.data.txt \
data/uscensus/USCensus1990.features.npz \
data/uscensus/metadata.json
```
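The resulting sparse matrix can be loaded back with SciPy for a quick sanity check (a minimal sketch; the exact shape depends on the transformation):

```python
from scipy.sparse import load_npz

features = load_npz("data/uscensus/USCensus1990.features.npz")
print(features.shape)  # (num_samples, num_one_hot_features)
```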

For more information about the transformation, run:

```bash
python multi_categorical_gans/datasets/uscensus/transform.py -h
```
Empty file.