deploy: 15af202

leojklarner · Dec 13, 2023 · 343976f · 343976f
commit 343976f
Show file tree

Hide file tree

Showing 173 changed files with 48,781 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: e130a654607eacaeb4961cd81dbb102d
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.doctrees/codeautolink-cache.json b/.doctrees/codeautolink-cache.json
diff --git a/.doctrees/environment.pickle b/.doctrees/environment.pickle
diff --git a/.doctrees/index.doctree b/.doctrees/index.doctree
diff --git a/.doctrees/modules/dataloader.doctree b/.doctrees/modules/dataloader.doctree
diff --git a/.doctrees/modules/kernels.doctree b/.doctrees/modules/kernels.doctree
diff --git a/.doctrees/modules/representations.doctree b/.doctrees/modules/representations.doctree
diff --git a/.doctrees/nbsphinx/notebooks/bayesian_gnn_on_molecules.ipynb b/.doctrees/nbsphinx/notebooks/bayesian_gnn_on_molecules.ipynb
diff --git a/.doctrees/nbsphinx/notebooks/bayesian_optimisation_over_molecules.ipynb b/.doctrees/nbsphinx/notebooks/bayesian_optimisation_over_molecules.ipynb
diff --git a/.doctrees/nbsphinx/notebooks/external_graph_kernels.ipynb b/.doctrees/nbsphinx/notebooks/external_graph_kernels.ipynb
diff --git a/.doctrees/nbsphinx/notebooks/gp_regression_on_molecules.ipynb b/.doctrees/nbsphinx/notebooks/gp_regression_on_molecules.ipynb
diff --git a/.doctrees/nbsphinx/notebooks/loading_and_featurising_molecules.ipynb b/.doctrees/nbsphinx/notebooks/loading_and_featurising_molecules.ipynb
@@ -0,0 +1,369 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d3636d97-3a25-4ed5-bc5e-aa043befa7ad",
+   "metadata": {},
+   "source": [
+    "# Loading and Featurising Molecular Data\n",
+    "\n",
+    "**In this noteboook, we will use GAUCHE's to quickly and easily load and preprocess molecular property and yield prediction datasets**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9d65c40a-8ebb-4e0b-84dc-ab32a3fd4a3e",
+   "metadata": {},
+   "source": [
+    "## Molecular Property Prediction\n",
+    "\n",
+    "The [MolPropLoader class](https://leojklarner.github.io/gauche/modules/dataloader.html#module-gauche.dataloader.molprop_loader) provides a range of useful helper function for loading and featurising molecular property prediction datasets. It comes with a number of built-in datasets that you can use to test your models:\n",
+    "\n",
+    "* `Photoswitch`: The task is to predict the values of the E isomer π − π∗ transition wavelength for 392 photoswitch molecules.\n",
+    "* `ESOL` The task is to predict the logarithmic aqueous solubility values for 1128 organic small molecules.\n",
+    "* `FreeSolv` The task is to predict the hydration free energy values for 642 organic small molecules.\n",
+    "* `Lipophilicity` The task is to predict the octanol/water distribution coefficients for 4200 organic small molecules.\n",
+    "\n",
+    "You can load them by calling the `load_benchmark` function with the corresponding argument. Alternatively, you can simply load your own dataset by calling the `read_csv(path, smiles_column, label_column)` with the path to your dataset and the name of the columns containing the SMILES strings and labels instead."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "cf4f109d-6602-4f9f-b3c6-7e23bf3a2d9c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]\n",
+      "To turn validation off, use dataloader.read_csv(..., validate=False).\n"
+     ]
+    }
+   ],
+   "source": [
+    "from gauche.dataloader import MolPropLoader\n",
+    "\n",
+    "# load a benchmark dataset\n",
+    "loader = MolPropLoader()\n",
+    "loader.load_benchmark(\"Photoswitch\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0c50cec1",
+   "metadata": {},
+   "source": [
+    "\n",
+    "As you can see, the dataloader automatically runs a validation function that filters out invalid SMILES strings and non-numeric labels. The valid and canonicalised SMILES strings and labels are now stored in the `loader.features` and `loader.labels` attributes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "bdfe1f50",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['Cn1nnc(N=Nc2ccccc2)n1',\n",
+       " 'Cn1cnc(N=Nc2ccccc2)n1',\n",
+       " 'Cn1ccc(N=Nc2ccccc2)n1',\n",
+       " 'Cc1cn(C)nc1N=Nc1ccccc1',\n",
+       " 'Cn1cc(N=Nc2ccccc2)cn1']"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[310.],\n",
+       "       [310.],\n",
+       "       [320.],\n",
+       "       [325.],\n",
+       "       [328.]])"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "display(loader.features[:5])\n",
+    "display(loader.labels[:5])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "64ba2e12",
+   "metadata": {},
+   "source": [
+    "We can now use the `loader.featurize` function to featurise the molecules. These featurisers are simply functions that take a list of SMILES strings and return a list of feature vectors. GAUCHE comes with a number of built-in featurisers that you can use:\n",
+    "\n",
+    "* `ecfp_fingerprints`: Extended Connectivity Fingerprints (ECFP) that encode all circular substructures up to a certain diameter.\n",
+    "* `fragments`: A featuriser that encodes the presence of a number of predefined rdkit fragments.\n",
+    "* `ecfp_fragprints`: A combination of `ecfp_fingerprints` and `fragments`.\t\n",
+    "* `molecular_graphs`: A featuriser that encodes the molecular graph as a graph of atoms and bonds.\n",
+    "* `bag_of_smiles`: A featuriser that encodes the SMILES strings as a bag of characters.\n",
+    "* `bag_of_selfies`: A featuriser that encodes the SMILES strings as a bag of SELFIES characters.\n",
+    "\n",
+    "When calling the `loader.featurize` function, we can additionally specify a range of keyword arguments that are passed to the featuriser. For example, we can specify the diameter of the ECFP fingerprints or the maximum number of fragments to encode. For a full list of keyword arguments, please refer to the [documentation](https://leojklarner.github.io/gauche/modules/representations.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "646d0e5d-f103-463a-86a8-4f8603af1acd",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0],\n",
+       "       [0, 0, 0, ..., 0, 0, 0]])"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "loader.featurize(\"ecfp_fingerprints\")\n",
+    "loader.features[:5]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ecd5f08d",
+   "metadata": {},
+   "source": [
+    "We can also pass any custom featuriser that maps a list of SMILES strings to a list of feature vectors. For example, we can just return the length of the SMILES strings as a feature vector:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "7229fa2a-4db5-42c8-8e55-6819aa40db47",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]\n",
+      "To turn validation off, use dataloader.read_csv(..., validate=False).\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[21, 21, 21, 22, 21]"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# load dataset again to undo featurisation\n",
+    "loader = MolPropLoader()\n",
+    "loader.load_benchmark(\"Photoswitch\")\n",
+    "\n",
+    "# define custom featurisation function\n",
+    "def smiles_length(smiles):\n",
+    "    return [len(s) for s in smiles]\n",
+    "\n",
+    "loader.featurize(smiles_length)\n",
+    "loader.features[:5]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bd2c9f11",
+   "metadata": {},
+   "source": [
+    "This was all we needed to do to load and featurise our dataset. The featurised molecules are now stored in the `loader.features` attribute and can be passed to the GP models."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "75f439ff",
+   "metadata": {},
+   "source": [
+    "## Reaction Yield Prediction\n",
+    "\n",
+    "The [ReactionYieldLoader class](https://leojklarner.github.io/gauche/modules/dataloader.html#module-gauche.dataloader.reaction_loader) provides a range of useful helper function for loading and featurising reaction yield prediction datasets. The reaction data can be provided as either multple SMILES columns or a single reaction SMARTS column. It comes with a number of built-in datasets that you can use to test your models:\n",
+    "\n",
+    "* `DreherDoyle`: Data from [Predicting reaction performance in C–N cross-coupling using machine learning. Science, 2018.](https://www.science.org/doi/10.1126/science.aar5169) as multiple SMILES columns. The task is to predict the yields for 3955 Pd-catalysed Buchwald–Hartwig C–N cross-couplings.\n",
+    "* `DreherDoyleRXN`: The `DreherDoyle` dataset as a single reaction SMARTS column.\n",
+    "* `Suzuki-Miyaura`: Data from [A platform for automated nanomole-scale\n",
+    "reaction screening and micromole-scale synthesis in flow. Science, 2018](https://www.science.org/doi/10.1126/science.aap9112). The task is to predict the yields for 5760 Pd-catalysed Suzuki-Miyaura C-C cross-couplings.\n",
+    "* `Suzuki-MiyauraRXN`: The `Suzuki-Miyaura` dataset as a single reaction SMARTS column.\n",
+    "\n",
+    "You can load them by calling the `load_benchmark` function with the corresponding argument. Alternatively, you can simply load your own dataset by calling the `read_csv(path, reactant_column, label_column)` with the path to your dataset and the name of your label column instead. The `reactant_column` argument can either be a single reaction SMARTS column or a list of SMILES columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "0e587b0a-13bf-4164-bf30-871871ea8e7f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0    Clc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2ccccc...\n",
+       "1    Brc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2ccccc...\n",
+       "2    CCc1ccc(I)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2cc...\n",
+       "3    FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#4...\n",
+       "4    COc1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2c...\n",
+       "Name: rxn, dtype: object"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[70.41045785],\n",
+       "       [11.06445724],\n",
+       "       [10.22354965],\n",
+       "       [20.0833829 ],\n",
+       "       [ 0.49266271]])"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from gauche.dataloader import ReactionLoader\n",
+    "\n",
+    "# load a benchmark dataset\n",
+    "loader = ReactionLoader()\n",
+    "loader.load_benchmark(\"DreherDoyleRXN\")\n",
+    "\n",
+    "display(loader.features[:5])\n",
+    "loader.labels[:5]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c8e4a530",
+   "metadata": {},
+   "source": [
+    "We can now use the `loader.featurize` function to featurise the SMILES/SMARTS. GAUCHE comes with a number of built-in featurisers that you can use:\n",
+    "\n",
+    "* `ohe`: A one-hot encoding that specifies which of the components in the different reactant\n",
+    "and reagent categories is present. In the Buchwald-Hartwig example, the OHE would describe which of the aryl halides, Buchwald ligands, bases and additives are used in the reaction\n",
+    "* `drfp`: The [differential reaction fingerprint](https://github.com/reymond-group/drfp); constructed by taking the symmetric difference of the sets containing the molecular substructures on both sides of the reaction arrow. Reagents are added to the reactants. (Only works for reaction SMARTS).\n",
+    "* `rxnfp`: A [data-driven reaction fingerprint](https://github.com/rxn4chemistry/rxnfp) using Transformer models such as BERT and trained in a supervised or an unsupervised fashion on reaction SMILES. (Only works for reaction SMARTS).\n",
+    "* `bag_of_smiles`: A bag of characters representation of the reaction SMARTS. (Only works for reaction SMARTS).\n",
+    "\n",
+    "When calling the `loader.featurize` function, we can additionally specify a range of keyword arguments that are passed to the featuriser. For a full list of keyword arguments, please refer to the [documentation](https://leojklarner.github.io/gauche/modules/representations.html).\n",
+    "\n",
+    "If drfp requirement is not satisfied you can run\n",
+    "\n",
+    "`!pip install drfp`\n",
+    "\n",
+    "in the next cell."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "8abbece0-7e69-449f-9bee-defa7674d1fa",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[0., 0., 0., ..., 1., 0., 0.],\n",
+       "       [0., 0., 0., ..., 0., 0., 0.],\n",
+       "       [0., 0., 0., ..., 0., 0., 0.],\n",
+       "       [0., 0., 0., ..., 0., 0., 0.],\n",
+       "       [0., 0., 0., ..., 1., 0., 0.]])"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "loader.featurize(\"drfp\")\n",
+    "loader.features[:5]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00c5c3d9-c795-464d-a600-756db624aeaf",
+   "metadata": {},
+   "source": [
+    "We can again pass any custom featuriser that maps a list of SMILES or a reaction SMARTS string to a list of feature vectors. For example, we can take the length of of the reaction SMARTS string."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "d51a91f6-529a-4b01-9ef6-0f39d27d9959",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[274, 277, 212, 207, 257]"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# load dataset again to undo featurisation\n",
+    "loader = ReactionLoader()\n",
+    "loader.load_benchmark(\"DreherDoyleRXN\")\n",
+    "\n",
+    "# define custom featurisation function\n",
+    "def smiles_length(smiles):\n",
+    "    return [len(s) for s in smiles]\n",
+    "\n",
+    "loader.featurize(smiles_length)\n",
+    "loader.features[:5]"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}