Commit

deploy: 25e4067
leojklarner committed Dec 10, 2023
0 parents commit f284e38
Showing 161 changed files with 41,539 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: e130a654607eacaeb4961cd81dbb102d
tags: 645f666f9bcd5a90fca523b33c5a78b7
9,192 changes: 9,192 additions & 0 deletions .doctrees/codeautolink-cache.json

Large diffs are not rendered by default.

Binary file added .doctrees/environment.pickle
Binary file not shown.
Binary file added .doctrees/index.doctree
Binary file not shown.
Binary file added .doctrees/modules/dataloader.doctree
Binary file not shown.
Binary file added .doctrees/modules/kernels.doctree
Binary file not shown.
Binary file added .doctrees/modules/representations.doctree
Binary file not shown.

Large diffs are not rendered by default.

402 changes: 402 additions & 0 deletions .doctrees/nbsphinx/notebooks/external_graph_kernels.ipynb

Large diffs are not rendered by default.

691 changes: 691 additions & 0 deletions .doctrees/nbsphinx/notebooks/gp_regression_on_molecules.ipynb

Large diffs are not rendered by default.

311 changes: 311 additions & 0 deletions .doctrees/nbsphinx/notebooks/molecular_preference_learning.ipynb
@@ -0,0 +1,311 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "64c388ef",
"metadata": {},
"source": [
"# Learning an Objective Function through Interaction with a Human Chemist #\n",
"\n",
"This notebook uses Gaussian process preference learning [1] to model the latent utility function of a human medicinal chemist from pairwise preference data. The chemist is shown pairs of molecules $(m_1, m_2)$ and asked to indicate a preference $r(m_1, m_2) \\in \\{0, 1\\}$. The preference Gaussian process then models the latent utility function $g(m)$ by fitting to these pairwise preferences.\n",
"\n",
"In this tutorial, the preference GP is the pairwise GP with Laplace approximation introduced in [1]. A skew GP model, as in [2], could be used instead."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7dbb60d1",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Library imports\"\"\"\n",
"\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\") # Turn off Graphein warnings\n",
"\n",
"# To import from the Gauche package\n",
"import sys\n",
"sys.path.append('..')\n",
"\n",
"from itertools import combinations\n",
"from botorch import fit_gpytorch_model\n",
"from botorch.models.pairwise_gp import PairwiseGP, PairwiseLaplaceMarginalLogLikelihood\n",
"import gpytorch\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"import torch\n",
"from scipy.stats import kendalltau\n",
"from matplotlib import pyplot as plt\n",
"from gauche.dataloader import MolPropLoader\n",
"from gauche.dataloader.data_utils import transform_data\n",
"from gauche.kernels.fingerprint_kernels.tanimoto_kernel import TanimotoKernel\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"id": "30dbbc8a",
"metadata": {},
"source": [
"We use the Photoswitch dataset for illustration."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "001f142e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]\n",
"To turn validation off, use dataloader.read_csv(..., validate=False).\n"
]
}
],
"source": [
"\"\"\"Load the Photoswitch dataset\"\"\"\n",
"\n",
"loader = MolPropLoader()\n",
"loader.load_benchmark(\"Photoswitch\")\n",
"\n",
"# Featurise the molecules using the fragprints representation\n",
"# (a concatenation of Morgan fingerprints and RDKit fragment features).\n",
"loader.featurize('ecfp_fragprints')\n",
"X_fragprints = loader.features\n",
"y = loader.labels"
]
},
{
"cell_type": "markdown",
"id": "94ede917",
"metadata": {},
"source": [
"We implement a utility function that generates ground-truth pairwise comparison data, i.e. the preferences voiced by a simulated medicinal chemist."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ba8c3779",
"metadata": {},
"outputs": [],
"source": [
"def generate_comparisons(y, n_comp, noise=0.0, replace=False):\n",
"    \"\"\"Function simulating the preferences of a human chemist.\n",
"\n",
"    Args:\n",
"        y: 1D NumPy array of training data labels\n",
"        n_comp: Int indicating the number of pairwise comparisons to generate\n",
"        noise: Float indicating the level of noise in the chemist's decisions\n",
"        replace: Bool indicating whether to generate comparisons with replacement\n",
"\n",
"    Returns:\n",
"        comp_pairs: An n_comp x 2 LongTensor of index pairs (m1, m2), ordered so that\n",
"            the first index is the preferred (higher noisy label) molecule\n",
"\n",
"    \"\"\"\n",
"    # generate all possible pairs of indices into y\n",
"    all_pairs = np.array(list(combinations(range(y.shape[0]), 2)))\n",
"    # randomly select n_comp pairs from all_pairs\n",
"    comp_pairs = all_pairs[\n",
"        np.random.choice(range(len(all_pairs)), n_comp, replace=replace)\n",
"    ]\n",
"    # add Gaussian noise to the latent y values\n",
"    c0 = y[comp_pairs[:, 0]] + np.random.standard_normal(len(comp_pairs)) * noise\n",
"    c1 = y[comp_pairs[:, 1]] + np.random.standard_normal(len(comp_pairs)) * noise\n",
"    # swap any pair whose second element has the higher noisy value\n",
"    reverse_comp = (c0 < c1)\n",
"    comp_pairs[reverse_comp, :] = np.flip(comp_pairs[reverse_comp, :], 1)\n",
"    comp_pairs = torch.tensor(comp_pairs).long()\n",
"\n",
"    return comp_pairs"
]
},
{
"cell_type": "markdown",
"id": "d9cecd55",
"metadata": {},
"source": [
"As a performance metric we use the Kendall-Tau rank correlation. A GP trained on pairwise comparisons does not recover the absolute scale of the latent utility, but it does recover the rank order, so a rank-based performance metric is appropriate."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cba57506",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Utility function for computing the Kendall-Tau rank correlation between the predictions and the ground truth.\"\"\"\n",
"\n",
"def eval_kt_cor(model, test_X, test_y):\n",
"    \"\"\"Kendall-Tau rank correlation\n",
"    Args:\n",
"        model: Instance of pairwise GP\n",
"        test_X: n x d Tensor of test input locations\n",
"        test_y: n x 1 Tensor of test labels\n",
"\n",
"    Returns:\n",
"        The Kendall-Tau rank correlation, a number between -1 and 1.\n",
"    \"\"\"\n",
"    pred_y = model.posterior(test_X).mean.squeeze().detach().numpy()\n",
"    return kendalltau(pred_y, test_y).correlation"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "59127084",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Utility function for model evaluation\"\"\"\n",
"\n",
"# Experiment parameters\n",
"n_trials = 20\n",
"test_set_size = 0.2 # train/test split\n",
"m = 500 # number of pairwise comparisons generated per trial\n",
"noise = 0 # simulate a noiseless oracle\n",
"\n",
"def evaluate_model(X, y):\n",
"    \"\"\"Helper function for model evaluation\n",
"    Args:\n",
"        X: n x d NumPy array of the full set of inputs. Typically some molecular representation such as fragprints, fragments or fingerprints\n",
"        y: n x 1 NumPy array of the full set of output labels\n",
"    Returns:\n",
"        None. Prints the mean KT correlation (with standard error) on the train set and on the test set.\n",
"    \"\"\"\n",
"\n",
"    # initialise performance metric lists\n",
"    kt_list_train = []\n",
"    kt_list_test = []\n",
"\n",
"    print('\\nBeginning training loop...')\n",
"\n",
"    for i in range(0, n_trials):\n",
"\n",
"        print(f'Starting trial {i}')\n",
"\n",
"        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_set_size, random_state=i)\n",
"\n",
"        # generate_comparisons already returns a LongTensor, so no extra torch.tensor call is needed\n",
"        train_comp = generate_comparisons(y_train.squeeze(-1), m, noise=noise)\n",
"\n",
"        # Convert numpy arrays to PyTorch tensors and flatten the label vectors\n",
"        X_train = torch.tensor(X_train.astype(np.float64))\n",
"        X_test = torch.tensor(X_test.astype(np.float64))\n",
"        y_train = torch.tensor(y_train).flatten()\n",
"        y_test = torch.tensor(y_test).flatten()\n",
"\n",
"        # initialise pairwise GP model\n",
"        model = PairwiseGP(X_train, train_comp, covar_module=gpytorch.kernels.ScaleKernel(TanimotoKernel()))\n",
"        # Find optimal model hyperparameters\n",
"        mll = PairwiseLaplaceMarginalLogLikelihood(model.likelihood, model)\n",
"\n",
"        # Use the BoTorch utility for fitting GPs in order to use the L-BFGS-B optimiser (recommended)\n",
"        fit_gpytorch_model(mll)\n",
"\n",
"        # Get into evaluation (predictive posterior) mode\n",
"        model.eval()\n",
"\n",
"        # Detach the test labels from the graph and restore the trailing dimension (n x 1)\n",
"        y_test = y_test.detach().unsqueeze(dim=1)\n",
"\n",
"        # Compute Kendall-Tau rank correlation\n",
"        train_score = eval_kt_cor(model, X_train, y_train)\n",
"        test_score = eval_kt_cor(model, X_test, y_test)\n",
"\n",
"        kt_list_train.append(train_score)\n",
"        kt_list_test.append(test_score)\n",
"\n",
"\n",
"    kt_list_train = np.array(kt_list_train)\n",
"    kt_list_test = np.array(kt_list_test)\n",
"\n",
"    # Report the mean and its standard error over the trials\n",
"    print(\"\\nmean train KT: {:.4f} +- {:.4f}\".format(np.mean(kt_list_train), np.std(kt_list_train)/np.sqrt(len(kt_list_train))))\n",
"    print(\"\\nmean test KT: {:.4f} +- {:.4f}\".format(np.mean(kt_list_test), np.std(kt_list_test)/np.sqrt(len(kt_list_test))))\n",
"    "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5d7af0f0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Beginning training loop...\n",
"Starting trial 0\n",
"Starting trial 1\n",
"Starting trial 2\n",
"Starting trial 3\n",
"Starting trial 4\n",
"Starting trial 5\n",
"Starting trial 6\n",
"Starting trial 7\n",
"Starting trial 8\n",
"Starting trial 9\n",
"Starting trial 10\n",
"Starting trial 11\n",
"Starting trial 12\n",
"Starting trial 13\n",
"Starting trial 14\n",
"Starting trial 15\n",
"Starting trial 16\n",
"Starting trial 17\n",
"Starting trial 18\n",
"Starting trial 19\n",
"\n",
"mean train KT: 0.7939 +- 0.0027\n",
"\n",
"mean test KT: 0.7228 +- 0.0046\n"
]
}
],
"source": [
"evaluate_model(X_fragprints, y)"
]
},
{
"cell_type": "markdown",
"id": "a8fd9a3a",
"metadata": {},
"source": [
"## References \n",
"\n",
"[1] Chu, W. and Ghahramani, Z., [Preference learning with Gaussian processes](https://icml.cc/Conferences/2005/proceedings/papers/018_Preference_ChuGhahramani.pdf). ICML, 2005.\n",
"\n",
"[2] Benavoli, A., Azzimonti, D. and Piga, D., 2021, July. [Preferential Bayesian optimisation with Skew Gaussian Processes](https://dl.acm.org/doi/abs/10.1145/3449726.3463128). In Proceedings of the Genetic and Evolutionary Computation Conference Companion (pp. 1842-1850)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "gauche_pip",
"language": "python",
"name": "gauche_pip"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
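
The two helper ideas in the notebook above (ordering each sampled pair so that the preferred item comes first, and scoring with a rank correlation) can be sketched without any of the notebook's dependencies. The snippet below is a minimal, standard-library illustration only; it is not the notebook's code. The names `simulate_comparisons` and `kendall_tau_a` are hypothetical, the toy labels are made up, and the correlation is the simple tau-a variant with no tie correction, which is an assumption made for brevity and differs from SciPy's `kendalltau` when ties are present.

```python
import random
from itertools import combinations


def simulate_comparisons(y, n_comp, noise=0.0, seed=0):
    """Return (preferred, other) index pairs drawn from a list of latent labels y.

    Hypothetical stand-in for the notebook's generate_comparisons helper.
    """
    rng = random.Random(seed)
    all_pairs = list(combinations(range(len(y)), 2))
    chosen = rng.sample(all_pairs, n_comp)  # sample pairs without replacement
    ordered = []
    for i, j in chosen:
        # Perturb each latent label independently with Gaussian noise.
        ui = y[i] + rng.gauss(0.0, 1.0) * noise
        uj = y[j] + rng.gauss(0.0, 1.0) * noise
        # Put the item with the higher noisy label first.
        ordered.append((i, j) if ui >= uj else (j, i))
    return ordered


def kendall_tau_a(a, b):
    """Kendall tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


y = [0.3, 1.2, -0.5, 2.0, 0.9]  # toy labels, purely illustrative
pairs = simulate_comparisons(y, n_comp=6)
# With a noiseless oracle the first index always carries the larger label.
print(all(y[winner] > y[loser] for winner, loser in pairs))  # True

# A monotone rescaling of the predictions leaves the rank correlation at 1 ...
rescaled = [10.0 * v + 3.0 for v in y]
print(kendall_tau_a(rescaled, y))  # 1.0
# ... while reversing the ranking gives -1, so the metric spans [-1, 1].
print(kendall_tau_a([-v for v in y], y))  # -1.0
```

The rescaling check shows why a rank metric suits a model trained only on comparisons: predictions that differ from the labels by any increasing transformation still score a perfect correlation.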