-
Notifications
You must be signed in to change notification settings - Fork 22
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit 343976f
Showing
173 changed files
with
48,781 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: e130a654607eacaeb4961cd81dbb102d | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Large diffs are not rendered by default.
Oops, something went wrong.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
1,007 changes: 1,007 additions & 0 deletions
1,007
.doctrees/nbsphinx/notebooks/bayesian_gnn_on_molecules.ipynb
Large diffs are not rendered by default.
Oops, something went wrong.
388 changes: 388 additions & 0 deletions
388
.doctrees/nbsphinx/notebooks/bayesian_optimisation_over_molecules.ipynb
Large diffs are not rendered by default.
Oops, something went wrong.
402 changes: 402 additions & 0 deletions
402
.doctrees/nbsphinx/notebooks/external_graph_kernels.ipynb
Large diffs are not rendered by default.
Oops, something went wrong.
687 changes: 687 additions & 0 deletions
687
.doctrees/nbsphinx/notebooks/gp_regression_on_molecules.ipynb
Large diffs are not rendered by default.
Oops, something went wrong.
369 changes: 369 additions & 0 deletions
369
.doctrees/nbsphinx/notebooks/loading_and_featurising_molecules.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,369 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d3636d97-3a25-4ed5-bc5e-aa043befa7ad", | ||
"metadata": {}, | ||
"source": [ | ||
"# Loading and Featurising Molecular Data\n", | ||
"\n", | ||
"**In this noteboook, we will use GAUCHE's to quickly and easily load and preprocess molecular property and yield prediction datasets**" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9d65c40a-8ebb-4e0b-84dc-ab32a3fd4a3e", | ||
"metadata": {}, | ||
"source": [ | ||
"## Molecular Property Prediction\n", | ||
"\n", | ||
"The [MolPropLoader class](https://leojklarner.github.io/gauche/modules/dataloader.html#module-gauche.dataloader.molprop_loader) provides a range of useful helper function for loading and featurising molecular property prediction datasets. It comes with a number of built-in datasets that you can use to test your models:\n", | ||
"\n", | ||
"* `Photoswitch`: The task is to predict the values of the E isomer π − π∗ transition wavelength for 392 photoswitch molecules.\n", | ||
"* `ESOL` The task is to predict the logarithmic aqueous solubility values for 1128 organic small molecules.\n", | ||
"* `FreeSolv` The task is to predict the hydration free energy values for 642 organic small molecules.\n", | ||
"* `Lipophilicity` The task is to predict the octanol/water distribution coefficients for 4200 organic small molecules.\n", | ||
"\n", | ||
"You can load them by calling the `load_benchmark` function with the corresponding argument. Alternatively, you can simply load your own dataset by calling the `read_csv(path, smiles_column, label_column)` with the path to your dataset and the name of the columns containing the SMILES strings and labels instead." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "cf4f109d-6602-4f9f-b3c6-7e23bf3a2d9c", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]\n", | ||
"To turn validation off, use dataloader.read_csv(..., validate=False).\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from gauche.dataloader import MolPropLoader\n", | ||
"\n", | ||
"# load a benchmark dataset\n", | ||
"loader = MolPropLoader()\n", | ||
"loader.load_benchmark(\"Photoswitch\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "0c50cec1", | ||
"metadata": {}, | ||
"source": [ | ||
"\n", | ||
"As you can see, the dataloader automatically runs a validation function that filters out invalid SMILES strings and non-numeric labels. The valid and canonicalised SMILES strings and labels are now stored in the `loader.features` and `loader.labels` attributes." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "bdfe1f50", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"['Cn1nnc(N=Nc2ccccc2)n1',\n", | ||
" 'Cn1cnc(N=Nc2ccccc2)n1',\n", | ||
" 'Cn1ccc(N=Nc2ccccc2)n1',\n", | ||
" 'Cc1cn(C)nc1N=Nc1ccccc1',\n", | ||
" 'Cn1cc(N=Nc2ccccc2)cn1']" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
}, | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"array([[310.],\n", | ||
" [310.],\n", | ||
" [320.],\n", | ||
" [325.],\n", | ||
" [328.]])" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"display(loader.features[:5])\n", | ||
"display(loader.labels[:5])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "64ba2e12", | ||
"metadata": {}, | ||
"source": [ | ||
"We can now use the `loader.featurize` function to featurise the molecules. These featurisers are simply functions that take a list of SMILES strings and return a list of feature vectors. GAUCHE comes with a number of built-in featurisers that you can use:\n", | ||
"\n", | ||
"* `ecfp_fingerprints`: Extended Connectivity Fingerprints (ECFP) that encode all circular substructures up to a certain diameter.\n", | ||
"* `fragments`: A featuriser that encodes the presence of a number of predefined rdkit fragments.\n", | ||
"* `ecfp_fragprints`: A combination of `ecfp_fingerprints` and `fragments`.\t\n", | ||
"* `molecular_graphs`: A featuriser that encodes the molecular graph as a graph of atoms and bonds.\n", | ||
"* `bag_of_smiles`: A featuriser that encodes the SMILES strings as a bag of characters.\n", | ||
"* `bag_of_selfies`: A featuriser that encodes the SMILES strings as a bag of SELFIES characters.\n", | ||
"\n", | ||
"When calling the `loader.featurize` function, we can additionally specify a range of keyword arguments that are passed to the featuriser. For example, we can specify the diameter of the ECFP fingerprints or the maximum number of fragments to encode. For a full list of keyword arguments, please refer to the [documentation](https://leojklarner.github.io/gauche/modules/representations.html)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "646d0e5d-f103-463a-86a8-4f8603af1acd", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"array([[0, 0, 0, ..., 0, 0, 0],\n", | ||
" [0, 0, 0, ..., 0, 0, 0],\n", | ||
" [0, 0, 0, ..., 0, 0, 0],\n", | ||
" [0, 0, 0, ..., 0, 0, 0],\n", | ||
" [0, 0, 0, ..., 0, 0, 0]])" | ||
] | ||
}, | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"loader.featurize(\"ecfp_fingerprints\")\n", | ||
"loader.features[:5]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ecd5f08d", | ||
"metadata": {}, | ||
"source": [ | ||
"We can also pass any custom featuriser that maps a list of SMILES strings to a list of feature vectors. For example, we can just return the length of the SMILES strings as a feature vector:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "7229fa2a-4db5-42c8-8e55-6819aa40db47", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]\n", | ||
"To turn validation off, use dataloader.read_csv(..., validate=False).\n" | ||
] | ||
}, | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"[21, 21, 21, 22, 21]" | ||
] | ||
}, | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"# load dataset again to undo featurisation\n", | ||
"loader = MolPropLoader()\n", | ||
"loader.load_benchmark(\"Photoswitch\")\n", | ||
"\n", | ||
"# define custom featurisation function\n", | ||
"def smiles_length(smiles):\n", | ||
" return [len(s) for s in smiles]\n", | ||
"\n", | ||
"loader.featurize(smiles_length)\n", | ||
"loader.features[:5]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "bd2c9f11", | ||
"metadata": {}, | ||
"source": [ | ||
"This was all we needed to do to load and featurise our dataset. The featurised molecules are now stored in the `loader.features` attribute and can be passed to the GP models." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "75f439ff", | ||
"metadata": {}, | ||
"source": [ | ||
"## Reaction Yield Prediction\n", | ||
"\n", | ||
"The [ReactionYieldLoader class](https://leojklarner.github.io/gauche/modules/dataloader.html#module-gauche.dataloader.reaction_loader) provides a range of useful helper function for loading and featurising reaction yield prediction datasets. The reaction data can be provided as either multple SMILES columns or a single reaction SMARTS column. It comes with a number of built-in datasets that you can use to test your models:\n", | ||
"\n", | ||
"* `DreherDoyle`: Data from [Predicting reaction performance in C–N cross-coupling using machine learning. Science, 2018.](https://www.science.org/doi/10.1126/science.aar5169) as multiple SMILES columns. The task is to predict the yields for 3955 Pd-catalysed Buchwald–Hartwig C–N cross-couplings.\n", | ||
"* `DreherDoyleRXN`: The `DreherDoyle` dataset as a single reaction SMARTS column.\n", | ||
"* `Suzuki-Miyaura`: Data from [A platform for automated nanomole-scale\n", | ||
"reaction screening and micromole-scale synthesis in flow. Science, 2018](https://www.science.org/doi/10.1126/science.aap9112). The task is to predict the yields for 5760 Pd-catalysed Suzuki-Miyaura C-C cross-couplings.\n", | ||
"* `Suzuki-MiyauraRXN`: The `Suzuki-Miyaura` dataset as a single reaction SMARTS column.\n", | ||
"\n", | ||
"You can load them by calling the `load_benchmark` function with the corresponding argument. Alternatively, you can simply load your own dataset by calling the `read_csv(path, reactant_column, label_column)` with the path to your dataset and the name of your label column instead. The `reactant_column` argument can either be a single reaction SMARTS column or a list of SMILES columns." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "0e587b0a-13bf-4164-bf30-871871ea8e7f", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"0 Clc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2ccccc...\n", | ||
"1 Brc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2ccccc...\n", | ||
"2 CCc1ccc(I)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2cc...\n", | ||
"3 FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#4...\n", | ||
"4 COc1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2c...\n", | ||
"Name: rxn, dtype: object" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
}, | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"array([[70.41045785],\n", | ||
" [11.06445724],\n", | ||
" [10.22354965],\n", | ||
" [20.0833829 ],\n", | ||
" [ 0.49266271]])" | ||
] | ||
}, | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"from gauche.dataloader import ReactionLoader\n", | ||
"\n", | ||
"# load a benchmark dataset\n", | ||
"loader = ReactionLoader()\n", | ||
"loader.load_benchmark(\"DreherDoyleRXN\")\n", | ||
"\n", | ||
"display(loader.features[:5])\n", | ||
"loader.labels[:5]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c8e4a530", | ||
"metadata": {}, | ||
"source": [ | ||
"We can now use the `loader.featurize` function to featurise the SMILES/SMARTS. GAUCHE comes with a number of built-in featurisers that you can use:\n", | ||
"\n", | ||
"* `ohe`: A one-hot encoding that specifies which of the components in the different reactant\n", | ||
"and reagent categories is present. In the Buchwald-Hartwig example, the OHE would describe which of the aryl halides, Buchwald ligands, bases and additives are used in the reaction\n", | ||
"* `drfp`: The [differential reaction fingerprint](https://github.com/reymond-group/drfp); constructed by taking the symmetric difference of the sets containing the molecular substructures on both sides of the reaction arrow. Reagents are added to the reactants. (Only works for reaction SMARTS).\n", | ||
"* `rxnfp`: A [data-driven reaction fingerprint](https://github.com/rxn4chemistry/rxnfp) using Transformer models such as BERT and trained in a supervised or an unsupervised fashion on reaction SMILES. (Only works for reaction SMARTS).\n", | ||
"* `bag_of_smiles`: A bag of characters representation of the reaction SMARTS. (Only works for reaction SMARTS).\n", | ||
"\n", | ||
"When calling the `loader.featurize` function, we can additionally specify a range of keyword arguments that are passed to the featuriser. For a full list of keyword arguments, please refer to the [documentation](https://leojklarner.github.io/gauche/modules/representations.html).\n", | ||
"\n", | ||
"If drfp requirement is not satisfied you can run\n", | ||
"\n", | ||
"`!pip install drfp`\n", | ||
"\n", | ||
"in the next cell." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "8abbece0-7e69-449f-9bee-defa7674d1fa", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"array([[0., 0., 0., ..., 1., 0., 0.],\n", | ||
" [0., 0., 0., ..., 0., 0., 0.],\n", | ||
" [0., 0., 0., ..., 0., 0., 0.],\n", | ||
" [0., 0., 0., ..., 0., 0., 0.],\n", | ||
" [0., 0., 0., ..., 1., 0., 0.]])" | ||
] | ||
}, | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"loader.featurize(\"drfp\")\n", | ||
"loader.features[:5]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "00c5c3d9-c795-464d-a600-756db624aeaf", | ||
"metadata": {}, | ||
"source": [ | ||
"We can again pass any custom featuriser that maps a list of SMILES or a reaction SMARTS string to a list of feature vectors. For example, we can take the length of of the reaction SMARTS string." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"id": "d51a91f6-529a-4b01-9ef6-0f39d27d9959", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"[274, 277, 212, 207, 257]" | ||
] | ||
}, | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"# load dataset again to undo featurisation\n", | ||
"loader = ReactionLoader()\n", | ||
"loader.load_benchmark(\"DreherDoyleRXN\")\n", | ||
"\n", | ||
"# define custom featurisation function\n", | ||
"def smiles_length(smiles):\n", | ||
" return [len(s) for s in smiles]\n", | ||
"\n", | ||
"loader.featurize(smiles_length)\n", | ||
"loader.features[:5]" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.0" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Oops, something went wrong.