A repository to benchmark active learning strategies on microscopy data.
The following active learning strategies are implemented:
- `random`: randomly sample data points from the unlabeled pool (used as a baseline)
- `mc_dropout`: sample data points based on uncertainty quantified using Monte Carlo dropout
- `tta`: sample data points based on uncertainty quantified using test-time augmentation
- `hybrid`: sample data points based on uncertainty quantified using MC dropout and clustering over the data's latent-space representation
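For illustration, here is a minimal sketch of the idea behind MC-dropout uncertainty scoring. This is not the repository's implementation; the model interface, function name, and default values are assumptions:

```python
import torch

def mc_dropout_uncertainty(model: torch.nn.Module, image: torch.Tensor,
                           num_samples: int = 10) -> float:
    """Score one image by the variance of predictions under dropout.

    Assumes `model` contains dropout layers and returns a tensor of fixed
    shape; keeping dropout active yields stochastic forward passes.
    """
    model.train()  # keep dropout active (in practice, batch norm would be frozen)
    with torch.no_grad():
        preds = torch.stack([model(image) for _ in range(num_samples)])
    model.eval()
    # High variance across stochastic passes -> high model uncertainty.
    return preds.var(dim=0).mean().item()
```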
The AL strategy can be specified in the `cfg.yaml` in `./pipeline_configs`. Benchmarking can be run with:

```
python -m src.al_main -c <path_to_config>
```

Installation on Ubuntu 22.04 can be done with the following script:

```
$ ./shell_scripts/install.sh
```

All datasets used need to follow the COCO format.
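For reference, a minimal COCO instance-segmentation file looks roughly like this (ids, file name, and the category are illustrative; `bbox` is `[x, y, width, height]`):

```json
{
  "images": [
    {"id": 1, "file_name": "img_001.png", "height": 512, "width": 512}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1,
     "segmentation": [[10, 10, 40, 10, 40, 40, 10, 40]],
     "bbox": [10, 10, 30, 30], "area": 900, "iscrowd": 0}
  ],
  "categories": [
    {"id": 1, "name": "cell"}
  ]
}
```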
To add a new dataset, the following steps need to be performed:

- create a directory in `data/` named after the new dataset
- the new directory contains folders for every data split, e.g. `train`, `test`
- each data split folder contains the data stored in COCO format in `cell_acdc_coco_ds.json` and the images in the `images` folder
- the dataset name together with its split types must be added in `src/globals.py` to the `DATASETS_DSPLITS` dictionary (see the sketch after this list)
- the dataset can be specified in the configuration YAMLs as `datasetname_splittype`

New datasets can be converted to COCO format using the `Data2cocoConverter` class in `utils.datapreprocessing.data2coco`.
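A sketch of the corresponding registration in `src/globals.py`; the exact shape of the dictionary values is an assumption, and `my_cells` is a hypothetical dataset name:

```python
# src/globals.py (sketch)
DATASETS_DSPLITS = {
    # ... existing entries ...
    "my_cells": ["train", "test"],  # hypothetical new dataset and its splits
}
```

With such an entry, `my_cells_train` and `my_cells_test` would become valid dataset names in the configuration YAMLs.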
The active learning pipeline is built on top of the Detectron2 implementation of Mask R-CNN. Training a model without active learning can be done by running:

```
python -m src.pipeline_runner -c <path_to_config>
```

This project uses a venv, which is initialized by running:

```
$ python -m venv ac_acdc_env
```

The venv is activated using the following command:

```
$ source ac_acdc_env/bin/activate
```

Requirements can be installed by running:

```
$ pip install -r requirements.txt
```

Hyperparameters used for model training, testing, and active learning are specified in configuration YAML files in the `pipeline_configs` directory. The configuration files contain Detectron2 configurations as well as custom active learning configurations.
The following active learning configs can be specified (an example snippet follows the list):
- `DATASETS.TRAIN_UNLABELED`: name of the training dataset used for active learning
- `INCREMENT_SIZE`: number of data points that are annotated in each active learning iteration
- `INIT_SIZE`: size of the initial training dataset
- `MAX_LOOPS`: maximum number of active learning loops
- `NUM_MC_SAMPLES`: number of Monte Carlo samples used for uncertainty estimation
- `OBJECT_TO_IMG_AGG`: aggregation of object uncertainties into an uncertainty value for the entire image (`mean`, `max`, `min`, `sum`)
- `OUTPUT_DIR`: path to the output directory
- `QUERY_STRATEGY`: active learning strategy to use (`random`, `mc_drop`, `tta`, `hybrid`)
- `TTA_MAX_NOISE`: maximum intensity of the Gaussian noise applied during test-time augmentation (only used with `QUERY_STRATEGY: tta`)
- `SAMPLE_EVERY`: number of images that form the subset used for AL sampling
- `MAX_TRAINING_EPOCHS`: number of training epochs during active learning
- `RETRAIN`: flag indicating whether the model is retrained from scratch in every active learning iteration (only `true` is implemented)
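A minimal sketch of the active learning portion of a configuration YAML. The values are illustrative, not recommended defaults, and the exact nesting of the custom keys alongside the Detectron2 ones is an assumption:

```yaml
DATASETS:
  TRAIN_UNLABELED: my_cells_train   # hypothetical dataset name
INIT_SIZE: 20
INCREMENT_SIZE: 10
MAX_LOOPS: 5
MAX_TRAINING_EPOCHS: 50
NUM_MC_SAMPLES: 10
OBJECT_TO_IMG_AGG: mean
OUTPUT_DIR: ./output/my_cells_al
QUERY_STRATEGY: mc_drop
SAMPLE_EVERY: 4
RETRAIN: true
```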