deploy: 25e4067

leojklarner · Dec 10, 2023 · f284e38 · f284e38
commit f284e38
Showing 161 changed files with 41,539 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: e130a654607eacaeb4961cd81dbb102d
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.doctrees/codeautolink-cache.json b/.doctrees/codeautolink-cache.json
diff --git a/.doctrees/environment.pickle b/.doctrees/environment.pickle
diff --git a/.doctrees/index.doctree b/.doctrees/index.doctree
diff --git a/.doctrees/modules/dataloader.doctree b/.doctrees/modules/dataloader.doctree
diff --git a/.doctrees/modules/kernels.doctree b/.doctrees/modules/kernels.doctree
diff --git a/.doctrees/modules/representations.doctree b/.doctrees/modules/representations.doctree
diff --git a/.doctrees/nbsphinx/notebooks/bayesian_optimisation_over_molecules.ipynb b/.doctrees/nbsphinx/notebooks/bayesian_optimisation_over_molecules.ipynb
diff --git a/.doctrees/nbsphinx/notebooks/external_graph_kernels.ipynb b/.doctrees/nbsphinx/notebooks/external_graph_kernels.ipynb
diff --git a/.doctrees/nbsphinx/notebooks/gp_regression_on_molecules.ipynb b/.doctrees/nbsphinx/notebooks/gp_regression_on_molecules.ipynb
diff --git a/.doctrees/nbsphinx/notebooks/molecular_preference_learning.ipynb b/.doctrees/nbsphinx/notebooks/molecular_preference_learning.ipynb
@@ -0,0 +1,311 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "64c388ef",
+   "metadata": {},
+   "source": [
+    "# Learning an Objective Function through Interaction with a Human Chemist #\n",
+    "\n",
+    "An example notebook using Gaussian process preference learning [1] to model the latent utility function of a human medicinal chemist from pairwise preference data. The human medicinal chemist is presented with pairwise observations of molecules $(m_1, m_2)$ and asked to indicate their preference $r(m_1, m_2) \\in \\{0, 1\\}$. The preference Gaussian process then models the latent utility function $g(m)$ via a fit to the pairwise preferences.\n",
+    "\n",
+    "In this tutorial we use the pairwise GP with Laplace approximation introduced in [1] as our preference GP. It is also possible to use a skew GP model as in [2]."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "7dbb60d1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\"\"\"Library imports\"\"\"\n",
+    "\n",
+    "import warnings\n",
+    "warnings.filterwarnings(\"ignore\") # Turn off Graphein warnings\n",
+    "\n",
+    "# To import from the Gauche package\n",
+    "import sys\n",
+    "sys.path.append('..')\n",
+    "\n",
+    "from itertools import combinations\n",
+    "from botorch import fit_gpytorch_model\n",
+    "from botorch.models.pairwise_gp import PairwiseGP, PairwiseLaplaceMarginalLogLikelihood\n",
+    "import gpytorch\n",
+    "import numpy as np\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "import torch\n",
+    "from scipy.stats import kendalltau\n",
+    "from matplotlib import pyplot as plt\n",
+    "from gauche.dataloader import MolPropLoader\n",
+    "from gauche.dataloader.data_utils import transform_data\n",
+    "from gauche.kernels.fingerprint_kernels.tanimoto_kernel import TanimotoKernel\n",
+    "\n",
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "30dbbc8a",
+   "metadata": {},
+   "source": [
+    "We use the photoswitch dataset for purposes of illustration. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "001f142e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]\n",
+      "To turn validation off, use dataloader.read_csv(..., validate=False).\n"
+     ]
+    }
+   ],
+   "source": [
+    "\"\"\"Load the Photoswitch dataset\"\"\"\n",
+    "\n",
+    "loader = MolPropLoader()\n",
+    "loader.load_benchmark(\"Photoswitch\")\n",
+    "\n",
+    "# Featurise the molecules. \n",
+    "\n",
+    "# We use the fragprints representation (a concatenation of Morgan fingerprints and RDKit fragment features)\n",
+    "\n",
+    "loader.featurize('ecfp_fragprints')\n",
+    "X_fragprints = loader.features\n",
+    "y = loader.labels"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "94ede917",
+   "metadata": {},
+   "source": [
+    "We implement a utility function for generating ground-truth pairwise comparison data, i.e. preferences voiced by a simulated medicinal chemist."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "ba8c3779",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def generate_comparisons(y, n_comp, noise=0.0, replace=False):\n",
+    "    \"\"\"Function simulating the preferences of a human chemist.\n",
+    "    \n",
+    "    Args:\n",
+    "        y: 1D NumPy array of training data labels\n",
+    "        n_comp: Int indicating the number of pairwise comparisons to generate\n",
+    "        noise: Float indicating the level of noise in the chemist's decisions\n",
+    "        replace: Bool indicating whether to generate comparisons with replacement\n",
+    "    \n",
+    "    Returns:\n",
+    "        comp_pairs: A torch LongTensor of shape (n_comp, 2) of index pairs (m1, m2), ordered so that m1 is preferred over m2\n",
+    "    \n",
+    "    \"\"\"\n",
+    "    # generate all possible pairs of elements in y\n",
+    "    all_pairs = np.array(list(combinations(range(y.shape[0]), 2)))\n",
+    "    # randomly select n_comp pairs from all_pairs\n",
+    "    comp_pairs = all_pairs[\n",
+    "        np.random.choice(range(len(all_pairs)), n_comp, replace=replace)\n",
+    "    ]\n",
+    "    # add gaussian noise to the latent y values\n",
+    "    c0 = y[comp_pairs[:, 0]] + np.random.standard_normal(len(comp_pairs)) * noise\n",
+    "    c1 = y[comp_pairs[:, 1]] + np.random.standard_normal(len(comp_pairs)) * noise\n",
+    "    reverse_comp = (c0 < c1)\n",
+    "    comp_pairs[reverse_comp, :] = np.flip(comp_pairs[reverse_comp, :], 1)\n",
+    "    comp_pairs = torch.tensor(comp_pairs).long()\n",
+    "\n",
+    "    return comp_pairs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d9cecd55",
+   "metadata": {},
+   "source": [
+    "As a performance metric we use the Kendall-Tau rank correlation. A GP learned on pairwise comparisons recovers the rank order of the labels but not their absolute scale, so a rank-based performance metric is the appropriate choice."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "cba57506",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\"\"\"Utility function for computing the Kendall-Tau rank correlation between the predictions and the ground truth.\"\"\"\n",
+    "\n",
+    "def eval_kt_cor(model, test_X, test_y):\n",
+    "    \"\"\"Kendall-Tau rank correlation\n",
+    "    Args:\n",
+    "        model: Instance of pairwise GP\n",
+    "        test_X: n x d Tensor of test input locations\n",
+    "        test_y: n x 1 Tensor of test labels\n",
+    "        \n",
+    "    Returns:\n",
+    "        The Kendall-Tau rank correlation, a number between -1 and 1.\n",
+    "    \"\"\"\n",
+    "    pred_y = model.posterior(test_X).mean.squeeze().detach().numpy()\n",
+    "    return kendalltau(pred_y, test_y).correlation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "59127084",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\"\"\"Utility function for model evaluation\"\"\"\n",
+    "\n",
+    "# Experiment parameters\n",
+    "n_trials = 20\n",
+    "test_set_size = 0.2 # train/test split\n",
+    "m = 500 # number of pairwise comparisons generated per trial\n",
+    "noise = 0 # simulate a noiseless oracle\n",
+    "\n",
+    "def evaluate_model(X, y):\n",
+    "    \"\"\"Helper function for model evaluation\n",
+    "    Args:\n",
+    "        X: n x d NumPy array of the full set of inputs. Typically some molecular representation such as fragprints, fragments or fingerprints\n",
+    "        y: n x 1 NumPy array of the full set of output labels\n",
+    "    Returns:\n",
+    "        None. Prints the mean KT correlation on the train and test sets, each with its standard error over the trials.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    # initialise performance metric lists\n",
+    "    kt_list_train = []\n",
+    "    kt_list_test = []\n",
+    "    \n",
+    "    print('\\nBeginning training loop...')\n",
+    "\n",
+    "    for i in range(0, n_trials):\n",
+    "        \n",
+    "        print(f'Starting trial {i}')\n",
+    "                \n",
+    "        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_set_size, random_state=i)\n",
+    "        \n",
+    "        train_comp = generate_comparisons(y_train.squeeze(-1), m, noise=noise)  # already a torch LongTensor\n",
+    "\n",
+    "        # Convert numpy arrays to PyTorch tensors and flatten the label vectors\n",
+    "        X_train = torch.tensor(X_train.astype(np.float64))\n",
+    "        X_test = torch.tensor(X_test.astype(np.float64))\n",
+    "        y_train = torch.tensor(y_train).flatten()\n",
+    "        y_test = torch.tensor(y_test).flatten()\n",
+    "\n",
+    "        # initialise pairwise GP model\n",
+    "        model = PairwiseGP(X_train, train_comp, covar_module=gpytorch.kernels.ScaleKernel(TanimotoKernel()))\n",
+    "        # Find optimal model hyperparameters\n",
+    "        mll = PairwiseLaplaceMarginalLogLikelihood(model.likelihood, model)\n",
+    "\n",
+    "        # Use the BoTorch utility for fitting GPs in order to use the L-BFGS-B optimiser (recommended)\n",
+    "        fit_gpytorch_model(mll)\n",
+    "\n",
+    "        # Get into evaluation (predictive posterior) mode\n",
+    "        model.eval()\n",
+    "\n",
+    "        # To compute metrics and detach gradients. Must unsqueeze dimension\n",
+    "        y_test = y_test.detach().unsqueeze(dim=1)\n",
+    "        \n",
+    "        # Compute Kendall-Tau rank correlation\n",
+    "        train_score = eval_kt_cor(model, X_train, y_train)\n",
+    "        test_score = eval_kt_cor(model, X_test, y_test)\n",
+    "        \n",
+    "        kt_list_train.append(train_score)\n",
+    "        kt_list_test.append(test_score)\n",
+    "\n",
+    "        \n",
+    "    kt_list_train = np.array(kt_list_train)\n",
+    "    kt_list_test = np.array(kt_list_test)\n",
+    "    \n",
+    "    print(\"\\nmean train KT: {:.4f} +- {:.4f}\".format(np.mean(kt_list_train), np.std(kt_list_train)/np.sqrt(len(kt_list_train))))\n",
+    "    print(\"\\nmean test KT: {:.4f} +- {:.4f}\".format(np.mean(kt_list_test), np.std(kt_list_test)/np.sqrt(len(kt_list_test))))\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "5d7af0f0",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "Beginning training loop...\n",
+      "Starting trial 0\n",
+      "Starting trial 1\n",
+      "Starting trial 2\n",
+      "Starting trial 3\n",
+      "Starting trial 4\n",
+      "Starting trial 5\n",
+      "Starting trial 6\n",
+      "Starting trial 7\n",
+      "Starting trial 8\n",
+      "Starting trial 9\n",
+      "Starting trial 10\n",
+      "Starting trial 11\n",
+      "Starting trial 12\n",
+      "Starting trial 13\n",
+      "Starting trial 14\n",
+      "Starting trial 15\n",
+      "Starting trial 16\n",
+      "Starting trial 17\n",
+      "Starting trial 18\n",
+      "Starting trial 19\n",
+      "\n",
+      "mean train KT: 0.7939 +- 0.0027\n",
+      "\n",
+      "mean test KT: 0.7228 +- 0.0046\n"
+     ]
+    }
+   ],
+   "source": [
+    "evaluate_model(X_fragprints, y)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a8fd9a3a",
+   "metadata": {},
+   "source": [
+    "## References \n",
+    "\n",
+    "[1] Chu, W. and Ghahramani, Z., [Preference learning with Gaussian processes](https://icml.cc/Conferences/2005/proceedings/papers/018_Preference_ChuGhahramani.pdf). ICML, 2005.\n",
+    "\n",
+    "[2] Benavoli, A., Azzimonti, D. and Piga, D., 2021, July. [Preferential Bayesian optimisation with Skew Gaussian Processes](https://dl.acm.org/doi/abs/10.1145/3449726.3463128). In Proceedings of the Genetic and Evolutionary Computation Conference Companion (pp. 1842-1850)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "gauche_pip",
+   "language": "python",
+   "name": "gauche_pip"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
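
Note on the notebook above: the two conventions it relies on can be checked in isolation. The sketch below is not part of the commit and uses only the standard library. The helper names `kendall_tau_a` and `order_pair` and the label values are made up for illustration. It assumes no tied labels, in which case the tau-a statistic computed here coincides with the tau-b variant that `scipy.stats.kendalltau` returns by default. It shows that a prediction with the right ranking but a different scale scores 1, that a reversed ranking scores -1, and that each comparison pair is stored with the preferred index first, as `generate_comparisons` does.

```python
from itertools import combinations


def kendall_tau_a(pred, truth):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Hypothetical helper for illustration; assumes no ties in either sequence.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(truth)), 2):
        # Positive product: both sequences order the pair (i, j) the same way
        sign = (pred[i] - pred[j]) * (truth[i] - truth[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(truth) * (len(truth) - 1) // 2
    return (concordant - discordant) / n_pairs


def order_pair(i, j, y):
    """Return (preferred, other): the index with the larger label comes first."""
    return (i, j) if y[i] >= y[j] else (j, i)


truth = [0.3, 1.2, 0.7, 2.5]               # made-up labels
latent = [10.0 * v - 4.0 for v in truth]   # same ranking, different scale
reversed_latent = [-v for v in truth]      # exactly reversed ranking

tau_same = kendall_tau_a(latent, truth)
tau_reversed = kendall_tau_a(reversed_latent, truth)
pairs = [order_pair(i, j, truth) for i, j in combinations(range(len(truth)), 2)]

print(tau_same, tau_reversed)
print(pairs)
```

Because the metric depends only on how pairs are ordered, it is unaffected by the unknown scale and offset of the latent utility learned from comparisons.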