The starter repository for submissions to the GeneDisco challenge for optimized experimental design in genetic perturbation experiments.
GeneDisco (to be published at ICLR-22) is a benchmark suite for evaluating active learning algorithms for experimental design in drug discovery. GeneDisco contains a curated set of multiple publicly available experimental data sets as well as open-source implementations of state-of-the-art active learning policies for experimental design and exploration.
pip install -r requirements.txt
- Create a cache directory. This will hold any preprocessed and downloaded datasets for faster future invocation.
$ mkdir /path/to/genedisco_cache
- Replace the above with your desired cache directory location.
- Create an output directory. This will hold all program outputs and results.
$ mkdir /path/to/genedisco_output
- Replace the above with your desired output directory location.
Experiments (all baselines, acquisition functions, input and target datasets, multiple seeds) included in GeneDisco can be executed sequentially for e.g. acquired batch size 64
, 8
cycles and a bayesian_mlp
model using:
run_experiments \
--cache_directory=/path/to/genedisco_cache \
--output_directory=/path/to/genedisco_output \
--acquisition_batch_size=64 \
--num_active_learning_cycles=8 \
--max_num_jobs=1
Results are written to the folder at /path/to/genedisco_cache
, and processed datasets will be cached at /path/to/genedisco_cache
(please replace both with your desired paths) for faster startup in future invocations.
Note that due to the number of experiments being run by the above command, we recommend execution on a compute cluster.
The GeneDisco codebase also supports execution on slurm compute clusters (the slurm
command must be available on the executing node) using the following command and using dependencies in a Python virtualenv available at /path/to/your/virtualenv
(please replace with your own virtualenv path):
run_experiments \
--cache_directory=/path/to/genedisco_cache \
--output_directory=/path/to/genedisco_output \
--acquisition_batch_size=64 \
--num_active_learning_cycles=8 \
--schedule_on_slurm \
--schedule_children_on_slurm \
--remote_execution_virtualenv_path=/path/to/your/virtualenv
Other scheduling systems are currently not supported by default.
To run one active learning loop cycle, for example, with the "topuncertain"
acquisition function, the "achilles"
feature set and
the "schmidt_2021_ifng"
task, execute the following command:
active_learning_loop \
--cache_directory=/path/to/genedisco/genedisco_cache \
--output_directory=/path/to/genedisco/genedisco_output \
--model_name="bayesian_mlp" \
--acquisition_function_name="topuncertain" \
--acquisition_batch_size=64 \
--num_active_learning_cycles=8 \
--feature_set_name="achilles" \
--dataset_name="schmidt_2021_ifng"
To run a custom acquisition function, set --acquisition_function_name="custom"
and --acquisition_function_path
to the file path that contains your custom acquisition function (e.g. main.py in this repo).
active_learning_loop \
--cache_directory=/path/to/genedisco/genedisco_cache \
--output_directory=/path/to/genedisco/genedisco_output \
--model_name="bayesian_mlp" \
--acquisition_function_name="custom" \
--acquisition_function_path=/path/to/src/main.py \
--acquisition_batch_size=64 \
--num_active_learning_cycles=8 \
--feature_set_name="achilles" \
--dataset_name="schmidt_2021_ifng"
...where "/path/to/custom_acquisition_function.py"
contains code for your custom acquisition function corresponding to the BaseBatchAcquisitionFunction interface, e.g.:
import numpy as np
from typing import AnyStr, List
from slingpy import AbstractDataSource
from slingpy.models.abstract_base_model import AbstractBaseModel
from genedisco.active_learning_methods.acquisition_functions.base_acquisition_function import \
BaseBatchAcquisitionFunction
class RandomBatchAcquisitionFunction(BaseBatchAcquisitionFunction):
def __call__(self,
dataset_x: AbstractDataSource,
batch_size: int,
available_indices: List[AnyStr],
last_selected_indices: List[AnyStr] = None,
model: AbstractBaseModel = None,
temperature: float = 0.9,
) -> List:
selected = np.random.choice(available_indices, size=batch_size, replace=False)
return selected
Note that the last class implementing BaseBatchAcquisitionFunction
is loaded by GeneDisco if there are multiple valid acquisition functions present in the loaded file.
For submission, you will need two things:
- A command line container build tool (Docker or Podman are the most common)
- The eval.ai-CLI tool, which can be installed with pip
- EvalAI-CLI Documentation
- Run
$ pip install evalai
Please note that all your submitted code must either be loaded via a dependency in requirements.txt
or be present in the src/
directory in this starter repository for the submission to succeed.
Once you have set up your submission environment, you will need to create a lightweight container image that contains your acquisition function.
- Navigate to the directory to which you have cloned this repo to.
$ cd /path/to/genedisco-starter
- Ensure you have ONE acquisition function (inheriting from
BaseBatchAcquisitionFunction
) in main.py- This is your pre-defined program entry point.
- Build your container image
$ docker build -t submission:latest .
- Save your image name to a shell variable
$ IMAGE="submission:latest"
- Use the EvalAI-CLI command to submit your image
- Run the following command to submit your container image:
$ evalai push $IMAGE --phase gsk-genedisco-challenge-1528
- Please note that you have a maximum number of submissions that any submission will be counted against.
- Run the following command to submit your container image:
That’s it! Our pipeline will take your image and test your function.
If you have any questions or concerns, please reach out to us at [email protected]
No. Participants are asked to compare their solutions internally against the provided baselines in the genedisco
repository.
A final private leaderboard will be created using the submissions received via eval.ai
after the challenge submission period is closed.
You are able to change the acquisition function only (please see instructions above). We have chosen to fix the predictor and other aspects of the active learning loop in order to enable comparability of the developed solutions (there is a complex dependency between model, acquisition function and data).
We will score submissions against the two subtasks using the metrics calculated in genedisco
- overall hit rate (% of top movers discovered) and model performance (mean squared model error) at the end of the active learning loop (after the last cycle).
Note that we will score the solutions against a held-out set of tasks not known to the participants via genedisco
.
We will use the last submission to calculate the team's score to avoid conferring an advantage to teams that produce more submissions.
"Are the tasks evaluated independently? So I can Make 2 submissions with different Acquisition Functions and win in both subtasks?"
Yes we will open up a 2nd subtask to the challenge on eval.ai shortly so that participating team's can submit independent solutions for each subchallenge.
"The submissions limit (10) seems to be quite low. Especially that the evaluation measure in the challenge is not explicitly stated."
The idea is for participants to develop and evaluate their own solutions internally against the many existing baselines already implemented in GeneDisco - hence there is no public leaderboard. There will be a final private leaderboard that we will score through the eval.ai submission system.
Yes, we will be fixing batch size and cycle number to numbers appropriate for the held-out tasks that the submissions will be evaluated against. You can expect them to be similar to the ones used in the paper.
Performance after the final iteration is the metric by which submissions will be compared. Interim steps will not be contributing to the final score.
"For the competition, will we be provided the models' checkpoints or this is hidden from us?"
The retrained model is exposed to your acquisition function at every iteration via the model parameter to the acquisition function method invocation. You are free to use it as you see fit and as the API allows. However, please note that you will not receive the checkpoint files from us as your acquisition function will be running end-to-end on our cloud servers for evaluation (there is no interactive mode for participants; the code is considered final once submitted).
Please consider citing, if you reference or use our methodology, code or results in your work:
@inproceedings{mehrjou2022genedisco,
title={{GeneDisco: A Benchmark for Experimental Design in Drug Discovery}},
author={Mehrjou, Arash and Soleymani, Ashkan and Jesson, Andrew and Notin, Pascal and Gal, Yarin and Bauer, Stefan and Schwab, Patrick},
booktitle={{International Conference on Learning Representations (ICLR)}},
year={2022}
}
Arash Mehrjou, GlaxoSmithKline plc
Jacob A. Sackett-Sanders, GlaxoSmithKline plc
Patrick Schwab, GlaxoSmithKline plc
PS, JSS and AM are employees and shareholders of GlaxoSmithKline plc.