diff --git a/multi_categorical_gans/datasets/README.md b/multi_categorical_gans/datasets/README.md
index 36d5de5..609ce52 100644
--- a/multi_categorical_gans/datasets/README.md
+++ b/multi_categorical_gans/datasets/README.md
@@ -1,5 +1,34 @@
 # Datasets
+
 In this package you will find scripts to process or generate the datasets from the paper:
 
 - [Synthetic data generation](synthetic/)
 - [US Census 1990](uscensus/)
+
+## Loading and saving
+
+We work either with dense or sparse numpy arrays. The module `multi_categorical_gans.datasets.formats` presents some
+functions to operate with both data formats in an abstract way.
+
+## Train and test split
+
+Example of how to split a dataset into 90% train and 10% test:
+
+```bash
+python multi_categorical_gans/datasets/train_test_split.py \
+    data/uscensus/USCensus1990.features.npz \
+    --percent 90 \
+    data/uscensus/USCensus1990-train.features.npz \
+    data/uscensus/USCensus1990-test.features.npz
+```
+
+For more information about the split run:
+
+```bash
+python multi_categorical_gans/datasets/train_test_split.py -h
+```
+
+## The dataset wrapper
+
+The class `multi_categorical_gans.datasets.dataset.Dataset` can wrap a dense numpy array to provide simple operations
+for training, like `split(proportion)` (useful for validation) or `batch_iterator(batch_size, shuffle=True)`.
\ No newline at end of file
diff --git a/multi_categorical_gans/datasets/synthetic/README.md b/multi_categorical_gans/datasets/synthetic/README.md
index d2a943b..8f6bb07 100644
--- a/multi_categorical_gans/datasets/synthetic/README.md
+++ b/multi_categorical_gans/datasets/synthetic/README.md
@@ -43,7 +43,7 @@ To generate a dataset similar to the one called `FIXED 2` in the paper:
 python multi_categorical_gans/datasets/synthetic/generate.py 10000 9 \
     data/synthetic/fixed_2/metadata.json \
     data/synthetic/fixed_2/synthetic.features.npz \
-    -min_variable_size=2 --max_variable_size=2
+    --min_variable_size=2 --max_variable_size=2
 ```
 
 To generate a dataset similar to the one called `FIXED 10` in the paper:
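
Review note: the new "The dataset wrapper" section in `datasets/README.md` names `split(proportion)` and `batch_iterator(batch_size, shuffle=True)` but does not show them in use. The sketch below illustrates what such a wrapper could look like. It is not the repository's `Dataset` class: `DatasetSketch`, the `seed` argument, and the reading of `proportion` as the fraction of rows kept in the first part are all assumptions made for illustration.

```python
import numpy as np


class DatasetSketch:
    """Hypothetical stand-in for the wrapper described in the README (not the repository's code)."""

    def __init__(self, features, seed=None):
        self.features = np.asarray(features)
        # Assumption: a per-instance random generator; the real class may handle randomness differently.
        self.rng = np.random.default_rng(seed)

    def split(self, proportion):
        # Assumption: `proportion` is the fraction of rows that goes to the first part.
        assert 0 < proportion < 1
        limit = int(np.floor(len(self.features) * proportion))
        permutation = self.rng.permutation(len(self.features))
        first = DatasetSketch(self.features[permutation[:limit]])
        second = DatasetSketch(self.features[permutation[limit:]])
        return first, second

    def batch_iterator(self, batch_size, shuffle=True):
        # Yield consecutive row batches; the last batch may be smaller than batch_size.
        indices = np.arange(len(self.features))
        if shuffle:
            self.rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            yield self.features[indices[start:start + batch_size]]


# Usage: hold out 10% of 20 rows for validation, then iterate over the rest in batches of 5.
data = DatasetSketch(np.arange(40).reshape(20, 2), seed=0)
train, validation = data.split(0.9)
batches = list(train.batch_iterator(batch_size=5, shuffle=True))
```

A short usage example along these lines in the README would make the expected meaning of `proportion` explicit for readers.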