This repository contains the code used in the paper "Beyond the Selected Completely At Random Assumption for Learning from Positive and Unlabeled Data": https://arxiv.org/abs/1809.03207
Install using pip:
$ pip install git+https://github.com/bluelabsio/BL-SAR-PU.git
To install from a specific branch under development, use:
$ pip install git+https://github.com/bluelabsio/BL-SAR-PU.git@<branch-name>
# Data
The data directory contains the original data, the preprocessed data and the PU labelings. For each dataset, there is a data directory `<dataset>`.
### Data Directory Structure
Each data directory has the following structure:
```bash
<dataset>
├── original
│   └── <original_downloaded_dataset>
└── processed
    ├── <dataset>._.class.csv
    ├── <dataset>._.data.csv
    ├── labelings
    │   └── <dataset>._.train_test._.<partition_id>.csv
    └── partitions
        ├── <dataset>._.propmodel.<modeltype>._.train_test._.<propensity_attributes>.e.csv
        └── <dataset>._.propmodel.<modeltype>._.train_test._.<propensity_attributes>._.lab.<sampling_id>.csv
```
- `<dataset>/original` contains the original data.
- `<dataset>/processed` contains the processed/reformatted data, the data partitions (train/test) and the PU labelings.

A PU dataset is the combination of 4 files, where each line is an example:

- `<dataset>._.data.csv` contains the attribute values, separated by spaces.
- `<dataset>._.class.csv` contains the true class values (0/1).
- `<dataset>._.train_test._.<partition_id>.csv` contains the partitions: 1 for train, 2 for test.
- `<dataset>._.propmodel.<modeltype>._.train_test._.<propensity_attributes>._.lab.<sampling_id>.csv` contains the PU labels, where the positive examples were sampled according to the propensity scores in `<dataset>._.propmodel.<modeltype>._.train_test._.<propensity_attributes>.e.csv`, assigned by the propensity model of type `modeltype` for the attributes `propensity_attributes`.
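As a rough illustration of how the four files line up, the sketch below reads them line by line and checks that only positive examples carry a PU label. The helper names (`read_rows`, `load_pu_dataset`) are hypothetical and not part of `sarpu`. The sketch assumes that the files have no header, that values are separated by whitespace, and that the class, partition and label files hold a single value per line. File paths are passed in explicitly because their location depends on your data directory.

```python
from pathlib import Path


def read_rows(path):
    """Read a whitespace-separated numeric file, one example per line."""
    rows = []
    for line in Path(path).read_text().splitlines():
        if line.strip():  # skip blank lines
            rows.append([float(v) for v in line.split()])
    return rows


def load_pu_dataset(data_path, class_path, partition_path, label_path):
    """Combine the four files of a PU dataset (hypothetical helper).

    Assumes no header lines and that line i of every file describes example i.
    """
    x = read_rows(data_path)
    y = [int(r[0]) for r in read_rows(class_path)]         # true class, 0/1
    part = [int(r[0]) for r in read_rows(partition_path)]  # 1 = train, 2 = test
    s = [int(r[0]) for r in read_rows(label_path)]         # PU label, 0/1
    if not (len(x) == len(y) == len(part) == len(s)):
        raise ValueError("the four files must have the same number of lines")
    # In PU data only positive examples can be labeled.
    if any(si == 1 and yi != 1 for si, yi in zip(s, y)):
        raise ValueError("found a labeled example whose true class is not 1")
    train = [i for i, p in enumerate(part) if p == 1]
    test = [i for i, p in enumerate(part) if p == 2]
    return {"x": x, "y": y, "s": s, "train": train, "test": test}
```

The returned `train` and `test` entries are lists of example indices, so the same index selects matching rows from `x`, `y` and `s`.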
For each dataset there is a notebook to download and preprocess it: `notebooks/data_preprocessing/<Dataset>.ipynb`
Currently, the available datasets are:
- 20ng
- Adult
- BreastCancer
- Covtype
- Diabetes
- ImageSegmentation
- Mushroom
- Splice
All the notebooks can be run from the terminal with a shell script, using the provided jupyter kernel:
$ ./generateData.sh env_sarpu
To be able to do some controlled experiments, we extended the datasets with artificially generated attributes.
Extended versions of the datasets are generated by the notebook `notebooks/data_preprocessing/Extended Data.ipynb`.
The notebook `notebooks/Experiments` shows how to run experiments.
From the command line, experiments can be run as follows:
(env_sarpu) $ python -m sarpu label $data_dir $data_name $labeling_model_type $propensity_attributes $nb_assignments
- `data_dir`: The base directory for the data. This is probably "Data".
- `data_name`: The dataset to use.
- `labeling_model_type`: A unique labeling mechanism name, usually "simple_0.2_0.8".
- `propensity_attributes`: For example, `1.-3.5` for attributes [1,3,5] and signs [1,-1,1].
- `nb_assignments`: How many labelings to produce.
The labelings are saved in the data directory under `<dataset>/processed/labelings/`.
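To make the `propensity_attributes` encoding concrete, here is a minimal sketch that splits a value such as `1.-3.5` into attribute indices and signs. The function name `parse_propensity_attributes` is hypothetical and this is not necessarily how `sarpu` parses the argument internally; it only mirrors the example given above.

```python
def parse_propensity_attributes(spec):
    """Split a spec such as "1.-3.5" into attribute indices and signs."""
    attributes, signs = [], []
    for token in spec.split("."):
        negative = token.startswith("-")  # a leading minus marks a negative sign
        attributes.append(int(token.lstrip("-")))
        signs.append(-1 if negative else 1)
    return attributes, signs


print(parse_propensity_attributes("1.-3.5"))  # ([1, 3, 5], [1, -1, 1])
```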
(env_sarpu) $ python -m sarpu train_eval $data_dir $results_dir $data_name $labeling_model_type $propensity_attributes $labeling $partition $settings $pu_method
- `data_dir`: The base directory for the data. This is probably "Data".
- `results_dir`: The base directory for the results. This is probably "Results".
- `data_name`: The dataset to use.
- `labeling_model_type`: A unique labeling mechanism name, usually "simple_0.2_0.8".
- `propensity_attributes`: For example, `1.-3.5` for attributes [1,3,5] and signs [1,-1,1].
- `labeling`: Which labeling to use (id).
- `partition`: Which train/test partition to use (id).
- `settings`: Which settings to use, i.e. which type of model for classification, which type of model for propensity scores and which attributes to use for classification. The three values are separated by `._.`. The models can be "lr" (logistic regression) and the classification attributes are either "all" or something like "1.3-5.9-11", which indicates that attributes [1,3,4,5,9,10,11] should be used for classification.
- `pu_method`: Which PU method to use:
  - `supervised`: standard supervised learning with access to the true labels.
  - `negative`: standard supervised learning given the PU labels.
  - `sar-e`: propensity score weighting given the correct propensity scores.
  - `scar-c`: propensity score weighting given the label frequency as the propensity score for all examples.
  - `sar-em`: the EM-based SAR-PU method.
  - `scar-km2`: propensity score weighting with an estimated label frequency as the propensity score for all examples. km2 is used to estimate the label frequency.
  - `scar-tice`: propensity score weighting with an estimated label frequency as the propensity score for all examples. tice is used to estimate the label frequency.
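The `settings` string packs three values into one argument, which is easier to see in code. The sketch below splits it on `._.` and expands a classification attribute value such as `1.3-5.9-11` into explicit indices. Both helper names are hypothetical, and the parsing rules are inferred from the description above rather than taken from the `sarpu` source.

```python
def parse_classification_attributes(spec):
    """Expand "all" or a spec like "1.3-5.9-11" into a list of indices."""
    if spec == "all":
        return "all"
    attributes = []
    for token in spec.split("."):
        if "-" in token:
            first, last = (int(v) for v in token.split("-"))
            attributes.extend(range(first, last + 1))  # ranges are inclusive
        else:
            attributes.append(int(token))
    return attributes


def parse_settings(settings):
    """Split "<classifier>._.<propensity model>._.<attributes>" (hypothetical helper)."""
    classifier, propensity_model, attributes = settings.split("._.")
    return classifier, propensity_model, parse_classification_attributes(attributes)


print(parse_settings("lr._.lr._.1.3-5.9-11"))
# ('lr', 'lr', [1, 3, 4, 5, 9, 10, 11])
```

With these assumptions, `lr._.lr._.all` would select logistic regression for both models and use every attribute for classification.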
The experiment results are saved in the folder `Results/<data_name>/<labeling_model>._.<propensity_attributes>.__.<settings>.__.<labeling>.__.<partition>/<pu_method>/`. This folder contains the following files:
- e.model: the propensity score model
- f.model: the classification model
- info.csv: info that was output during the training, such as training time, number of iterations and intermediate results
- results.csv: evaluation metrics for the propensity score model and the classification model, calculated on the train and test data.
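When collecting results from many runs, it can help to rebuild the folder name programmatically. The sketch below follows the naming pattern and file list given above; `results_folder` and `missing_files` are hypothetical helpers, not part of `sarpu`, and the argument values in the example are purely illustrative.

```python
from pathlib import Path

# The four files listed above.
EXPECTED_FILES = ["e.model", "f.model", "info.csv", "results.csv"]


def results_folder(results_dir, data_name, labeling_model, propensity_attributes,
                   settings, labeling, partition, pu_method):
    """Build the results path following the naming pattern described above."""
    experiment = ".__.".join([
        f"{labeling_model}._.{propensity_attributes}",
        str(settings), str(labeling), str(partition),
    ])
    return Path(results_dir) / data_name / experiment / pu_method


def missing_files(folder):
    """Return the expected result files that are not present in the folder."""
    return [name for name in EXPECTED_FILES if not (Path(folder) / name).exists()]


folder = results_folder("Results", "Mushroom", "simple_0.2_0.8", "1.-3.5",
                        "lr._.lr._.all", 0, 1, "sar-em")
print(folder.as_posix())
# Results/Mushroom/simple_0.2_0.8._.1.-3.5.__.lr._.lr._.all.__.0.__.1/sar-em
```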
- `Data`: contains the data, both original and SAR PU. The data can be downloaded and generated using the notebooks in notebooks/data_preprocessing/data_preprocessing
- `lib`: external libraries that are used in the code
  - km: the class prior estimator from Ramaswamy
  - tice: class prior estimator from our AAAI paper
- `notebooks`: notebooks for all experiments
  - data_preprocessing: downloads data and generates SAR PU versions
  - Experiments: a fast way to test something. Specify the SAR mechanism, dataset, and settings. Then compare the different methods and analyse the behaviour of the SAR mechanism.
- `Results`: the raw results generated by our experiments. Unless specifically asked otherwise (by setting a flag), this folder is checked for results before running experiments to save time.
- `sarpu`: the library with the sarpu code