# Datasets

In this package you will find scripts to process or generate the datasets from the paper:

- [Synthetic data generation](synthetic/README.md)
- [US Census 1990](uscensus/README.md)

# Synthetic data generation

We generated several synthetic datasets for our experiments.

We decided to save the data in `data/synthetic`:

```bash
mkdir -p data/synthetic/fixed_2
mkdir -p data/synthetic/fixed_10
mkdir -p data/synthetic/mix_small
mkdir -p data/synthetic/mix_big
```

The basic arguments for the script that generates synthetic datasets are:

```bash
python multi_categorical_gans/cases/synthetic/generate.py
usage: generate.py [-h] [--min_variable_size MIN_VARIABLE_SIZE]
                   [--max_variable_size MAX_VARIABLE_SIZE] [--seed SEED]
                   [--class_distribution CLASS_DISTRIBUTION]
                   [--class_distribution_type {probs,logits,uniform}]
                   num_samples num_variables metadata_path output_path
```

The first variable can be considered a class or label.
It has a fixed categorical distribution that can be defined with the `class_distribution` and `class_distribution_type` parameters:

- when `class_distribution_type=uniform`, `class_distribution` must be an integer defining the number of classes;
- when `class_distribution_type=probs`, `class_distribution` must be a list of comma-separated positive floats adding up to one that defines the probability of each class;
- when `class_distribution_type=logits`, `class_distribution` must be a list of comma-separated floats that will be passed through a softmax to define the probability of each class (see the sketch after this list).
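
For illustration, here is how the three modes map onto PyTorch's `OneHotCategorical` (a minimal sketch mirroring `generate_one_hot_variable` in the script further below; the example values are made up):

```python
import torch
from torch.distributions.one_hot_categorical import OneHotCategorical

# uniform: --class_distribution=4 becomes four equal probabilities
num_classes = 4
uniform_class = OneHotCategorical(probs=torch.full((num_classes,), 1.0 / num_classes))

# probs: --class_distribution=0.7,0.2,0.1 is used directly as probabilities
probs_class = OneHotCategorical(probs=torch.FloatTensor([0.7, 0.2, 0.1]))

# logits: --class_distribution=2.0,0.0,-1.0 is normalized by a softmax
logits_class = OneHotCategorical(logits=torch.FloatTensor([2.0, 0.0, -1.0]))

print(logits_class.probs)    # the softmax of the logits
print(probs_class.sample())  # a one-hot vector, e.g. tensor([1., 0., 0.])
```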

For each of the following variables, one categorical distribution is defined at random for each possible value of the previous variable.
The parameters `min_variable_size` and `max_variable_size` define the range for the number of possible values of every variable.
During the generation of a sample, the categorical distribution is selected depending on the value drawn for the previous variable, as in the sketch below.
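
A minimal sketch of this cascade sampling, with illustrative sizes (the actual implementation lives in `generate.py`, shown further below):

```python
import torch
from torch.distributions.one_hot_categorical import OneHotCategorical

# one random categorical distribution per possible value of the previous variable
last_variable_size, variable_size = 2, 3
distributions = {
    value: OneHotCategorical(logits=torch.randn(variable_size))
    for value in range(last_variable_size)
}

# the value drawn for the previous variable selects the distribution
previous_sample = OneHotCategorical(probs=torch.tensor([0.5, 0.5])).sample()
k = previous_sample.argmax().item()
sample = distributions[k].sample()
print(sample)  # a one-hot vector of length 3
```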

To generate a dataset similar to the one called `FIXED 2` in the paper:

```bash
python multi_categorical_gans/cases/synthetic/generate.py --min_variable_size=2 --max_variable_size=2 10000 9 \
    data/synthetic/fixed_2/metadata.json \
    data/synthetic/fixed_2/synthetic.features.npz
```

With the default uniform class distribution of 2 classes and 9 binary variables, this yields 10,000 samples with 2 + 9 × 2 = 20 one-hot features.

To generate a dataset similar to the one called `FIXED 10` in the paper:

```bash
python multi_categorical_gans/cases/synthetic/generate.py --min_variable_size=10 --max_variable_size=10 10000 9 \
    data/synthetic/fixed_10/metadata.json \
    data/synthetic/fixed_10/synthetic.features.npz
```

To generate a dataset similar to the one called `MIX SMALL` in the paper:

```bash
python multi_categorical_gans/cases/synthetic/generate.py --min_variable_size=2 --max_variable_size=10 10000 9 \
    data/synthetic/mix_small/metadata.json \
    data/synthetic/mix_small/synthetic.features.npz
```

To generate a dataset similar to the one called `MIX BIG` in the paper:

```bash
python multi_categorical_gans/cases/synthetic/generate.py --min_variable_size=2 --max_variable_size=10 10000 99 \
    data/synthetic/mix_big/metadata.json \
    data/synthetic/mix_big/synthetic.features.npz
```

For more information about the generation script run:

```bash
python multi_categorical_gans/cases/synthetic/generate.py -h
```
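
The generated features are saved as a scipy sparse matrix, so they can be loaded back for inspection; a small example using the `FIXED 2` paths from above:

```python
import json

from scipy.sparse import load_npz

features = load_npz("data/synthetic/fixed_2/synthetic.features.npz")
print(features.shape)  # (10000, 20) for the FIXED 2 example

with open("data/synthetic/fixed_2/metadata.json") as metadata_file:
    metadata = json.load(metadata_file)
print(metadata["variable_sizes"])  # sizes of the class and each variable
```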

The generator script, `multi_categorical_gans/cases/synthetic/generate.py`:

```python
from __future__ import print_function

import argparse
import json
import torch

import numpy as np

from scipy.sparse import csr_matrix, save_npz
from torch.distributions.one_hot_categorical import OneHotCategorical


distribution_types = ["probs", "logits", "uniform"]


class Variable(object):
    """A categorical variable with one distribution per value of the previous variable."""

    def __init__(self, distributions):
        self.distributions = distributions

    def sample(self, previous_sample):
        # the value drawn for the previous variable selects the distribution
        k = previous_sample.argmax().item()
        distribution = self.distributions[k]
        return distribution.sample()


def add_one(ones, rows, cols, i, j, sample):
    # record the position of the one in the one-hot sample as a sparse triplet
    k = sample.argmax().item()
    ones.append(1)
    rows.append(i)
    cols.append(j + k)


def generate_one_hot_variable(distribution, distribution_type):
    assert distribution_type in distribution_types
    variable = OneHotCategorical(**{distribution_type: torch.FloatTensor(distribution)})
    assert all([prob > 0 for prob in variable.probs])
    return variable


def print_matrix_stats(matrix, num_samples, num_features):
    num_ones = matrix.sum()
    num_positions = num_samples * num_features

    num_ones_per_row = np.asarray(matrix.sum(axis=1)).ravel()
    num_ones_per_column = np.asarray(matrix.sum(axis=0)).ravel()

    print("Min:", matrix.min())
    print("Max:", matrix.max())
    print("Rows:", matrix.shape[0])
    print("Columns:", matrix.shape[1])
    print("Mean ones per row:", num_ones_per_row.mean())
    print("Mean ones per column:", num_ones_per_column.mean())
    print("Total ones:", num_ones)
    print("Total positions:", num_positions)
    print("Total ratio of ones:", num_ones / float(num_positions))
    print("Empty rows:", np.sum(num_ones_per_row == 0))
    print("Full rows:", np.sum(num_ones_per_row == num_features))
    print("Empty columns:", np.sum(num_ones_per_column == 0))
    print("Full columns:", np.sum(num_ones_per_column == num_samples))


def generate_one_hot(num_samples, num_variables, min_variable_size, max_variable_size, metadata_path, output_path,
                     class_distribution=(2,), class_distribution_type="uniform", seed=None):

    if seed is not None:
        # seed both numpy (random variable sizes) and torch (sampling) for reproducibility
        np.random.seed(seed)
        torch.manual_seed(seed)

    assert 2 <= min_variable_size <= max_variable_size
    assert class_distribution is not None
    if class_distribution_type == "uniform":
        # in the uniform case the first value is the number of classes
        num_classes = int(class_distribution[0])
        class_distribution = [1.0 / num_classes for _ in range(num_classes)]
        class_distribution_type = "probs"

    # generate the class variable
    class_variable = generate_one_hot_variable(class_distribution, class_distribution_type)
    num_classes = class_variable.event_shape[0]

    # generate the remaining variables
    variables = []
    variable_sizes = [num_classes]
    num_features = num_classes
    last_variable_size = num_classes
    for _ in range(num_variables):
        if min_variable_size == max_variable_size:
            variable_size = min_variable_size
        else:
            variable_size = np.random.randint(low=min_variable_size, high=max_variable_size + 1)

        variable_sizes.append(variable_size)
        # one random categorical distribution per possible value of the previous variable
        distributions = {}
        for input_value in range(last_variable_size):
            logits = torch.FloatTensor(size=(variable_size,)).normal_(0, 1)
            distributions[input_value] = OneHotCategorical(logits=logits)

        variables.append(Variable(distributions))
        num_features += variable_size
        last_variable_size = variable_size

    # generate metadata
    metadata = {
        "seed": seed,
        "variable_sizes": variable_sizes,
        "class_probs": class_variable.probs.tolist(),
        "variable_probs": [[sub_variable.probs.tolist() for sub_variable in variable.distributions.values()]
                           for variable in variables]
    }

    with open(metadata_path, "w") as metadata_file:
        json.dump(metadata, metadata_file, indent=2)

    # generate the data as sparse triplets
    ones = []
    rows = []
    cols = []

    for i in range(num_samples):
        j = 0
        class_sample = class_variable.sample()
        add_one(ones, rows, cols, i, j, class_sample)
        j += class_sample.shape[0]
        previous_sample = class_sample
        for variable in variables:
            sample = variable.sample(previous_sample)
            add_one(ones, rows, cols, i, j, sample)
            j += sample.shape[0]
            previous_sample = sample

    output = csr_matrix((ones, (rows, cols)), shape=(num_samples, num_features), dtype=np.uint8)

    print_matrix_stats(output, num_samples, num_features)

    save_npz(output_path, output)


def main():
    options_parser = argparse.ArgumentParser(description="Generate one hot encoded data with cascade dependencies.")

    options_parser.add_argument("num_samples", type=int, help="Number of output samples.")

    options_parser.add_argument("num_variables", type=int, help="Number of output categorical variables.")

    options_parser.add_argument("metadata_path", type=str,
                                help="Output data file path indicating the class distribution and the variable maps.")

    options_parser.add_argument("output_path", type=str,
                                help="Output data file path in sparse format.")

    options_parser.add_argument("--min_variable_size", type=int, default=2,
                                help="Minimum random size of each categorical variable. Should be at least 2.")

    options_parser.add_argument("--max_variable_size", type=int, default=10,
                                help="Maximum random size of each categorical variable.")

    options_parser.add_argument("--seed", type=int, help="Random number generator seed.", default=42)

    options_parser.add_argument("--class_distribution", type=str, default="2",
                                help="Defines the distribution of the class variable. See 'class_distribution_type'.")

    options_parser.add_argument("--class_distribution_type", type=str, default="uniform", choices=distribution_types,
                                help="If uniform, same probability is assigned to every class;" +
                                     " the 'class_distribution' should be the number of classes." +
                                     "\nIf probs, explicit probabilities per class" +
                                     " are defined in 'class_distribution' separated by commas." +
                                     "\nIf logits, the values separated by commas defined in 'class_distribution'" +
                                     " will be used as softmax logits.")

    options = options_parser.parse_args()

    generate_one_hot(options.num_samples,
                     options.num_variables,
                     options.min_variable_size,
                     options.max_variable_size,
                     options.metadata_path,
                     options.output_path,
                     [float(x) for x in options.class_distribution.split(",")],
                     options.class_distribution_type,
                     options.seed)


if __name__ == "__main__":
    main()
```
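
For quick experiments the generator can also be called directly from Python; a sketch equivalent to the `FIXED 2` command above, assuming the package layout implied by that command (the module path is an assumption):

```python
from multi_categorical_gans.cases.synthetic.generate import generate_one_hot

generate_one_hot(
    num_samples=10000,
    num_variables=9,
    min_variable_size=2,
    max_variable_size=2,
    metadata_path="data/synthetic/fixed_2/metadata.json",
    output_path="data/synthetic/fixed_2/synthetic.features.npz",
    class_distribution=[2],  # with the default "uniform" type: two equally likely classes
    class_distribution_type="uniform",
    seed=42,
)
```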

# US Census 1990

This is one of the datasets we used for our experiments.

We decided to save the data in `data/uscensus`:

```bash
mkdir -p data/uscensus
```

To download the data, you can visit the
[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990))
or download it directly:

```bash
cd data/uscensus
wget https://archive.ics.uci.edu/ml/machine-learning-databases/census1990-mld/USCensus1990.data.txt
```

To transform the CSV data, one-hot encoding each categorical variable, run:

```bash
python multi_categorical_gans/cases/uscensus/transform.py \
    data/uscensus/USCensus1990.data.txt \
    data/uscensus/USCensus1990.features.npz \
    data/uscensus/metadata.json
```

For more information about the transformation run:

```bash
python multi_categorical_gans/cases/uscensus/transform.py -h
```
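
The transformed features can then be loaded like the synthetic ones (a sketch that assumes `transform.py` saves a scipy sparse matrix in the same `.npz` format as the synthetic generator):

```python
import json

from scipy.sparse import load_npz

# assumption: transform.py writes a scipy sparse matrix and a JSON metadata file
features = load_npz("data/uscensus/USCensus1990.features.npz")
print(features.shape)

with open("data/uscensus/metadata.json") as metadata_file:
    metadata = json.load(metadata_file)
```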