diff --git a/docs/CONTRIBUTING.doctree b/docs/CONTRIBUTING.doctree new file mode 100644 index 00000000..3e6bccbe Binary files /dev/null and b/docs/CONTRIBUTING.doctree differ diff --git a/docs/CONTRIBUTING.html b/docs/CONTRIBUTING.html new file mode 100644 index 00000000..78580996 --- /dev/null +++ b/docs/CONTRIBUTING.html @@ -0,0 +1,258 @@ + + + + + + + Contributing & Developer Notes — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Contributing & Developer Notes

+

Pull Requests, Bug Reports, and all Contributions are welcome, encouraged, and appreciated! +Please use the appropriate issue or pull request template when making a contribution to help the maintainers get it merged quickly.

+

We make use of the GitHub Discussions page to go over potential features to add. +Please feel free to stop by if you are looking for something to develop or have an idea for a useful feature!

+

When submitting a PR, please mark your PR with the “PR Ready for Review” label when you are finished making changes so that the GitHub actions bots can work their magic!

+
+

Developer Install

+

To contribute to the astartes source code, start by forking and then cloning the repository (i.e. git clone git@github.com:YourUsername/astartes.git) and then inside the repository run pip install -e .[dev]. This will set you up with all the required dependencies to run astartes and conform to our formatting standards (black and isort), which you can configure to run automatically in VSCode like this.

+
+

Warning +Windows (PowerShell) and MacOS Catalina or newer (zsh) require double quotes around the [] characters (i.e. pip install "astartes[dev]")

+
+
+
+

Version Checking

+

astartes uses pyproject.toml to specify all metadata, but the version is also specified in astartes/__init__.py (via __version__) for backwards compatibility with Python 3.7. +To check which version of astartes you have installed, you can run python -c "import astartes; print(astartes.__version__)" on Python 3.7 or python -c "from importlib.metadata import version; print(version('astartes'))" on Python 3.8 or newer.
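The two version checks above can be combined into one helper. This is only a minimal sketch: the function name installed_version is hypothetical and not part of astartes, and it assumes that a missing package should yield None rather than an exception.

```python
import importlib


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        # importlib.metadata is in the standard library from Python 3.8 onward
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Python 3.7: fall back to the __version__ attribute set in __init__.py
        try:
            module = importlib.import_module(package_name)
        except ImportError:
            return None
        return getattr(module, "__version__", None)
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# prints a version string if astartes is installed, otherwise None
print(installed_version("astartes"))
```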

+
+
+

Testing

+

All of the tests in astartes are written using the built-in python unittest module (to allow running without pytest) but we highly recommend using pytest. +To execute the tests from the astartes repository, simply type pytest after running the developer install (or alternately, pytest -v for a more helpful output). +On GitHub, we use actions to run the tests on every Pull Request and on a nightly basis (look in .github/workflows for more information). +These tests include unit tests, functional tests, and regression tests.
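As an illustration of the testing style described above, here is a small test case written only with the built-in unittest module, which pytest can also collect and run. The helper split_sizes and the test names are hypothetical and do not come from the astartes test suite.

```python
import unittest


def split_sizes(n_samples, train_size):
    """Hypothetical helper: number of samples in the train and test sets."""
    n_train = int(round(n_samples * train_size))
    return n_train, n_samples - n_train


class TestSplitSizes(unittest.TestCase):
    def test_sizes_cover_all_samples(self):
        n_train, n_test = split_sizes(100, 0.75)
        self.assertEqual(n_train + n_test, 100)

    def test_train_fraction_is_respected(self):
        n_train, _ = split_sizes(100, 0.75)
        self.assertEqual(n_train, 75)


def run_tests():
    """Run the test case with the built-in runner; returns True on success."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSplitSizes)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Saved in a file whose name starts with test_, the same class would be picked up by pytest without any changes.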

+
+
+

Adding New Samplers

+

A new sampler should extend the abstract base class defined in abstract_sampler.py. +Each subclass should override the _sample method with its own algorithm for data partitioning and, optionally, the _before_sample method to perform any data validation.

+

All samplers in astartes are classified as one of two types: extrapolative or interpolative. +Extrapolative samplers work by clustering data into groups (which are then partitioned into train/validation/test to enforce extrapolation) whereas interpolative samplers provide an exact order in which samples should be moved into the training set.

+

When actually implemented, this means that extrapolative samplers should set the self._samples_clusters attribute and interpolative samplers should set the self._samples_idxs attribute.
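The pattern described above can be sketched without importing astartes. Everything below is a stand-in written for illustration: ToyAbstractSampler is not the real base class (its constructor signature is an assumption), and the two subclasses are deliberately trivial. Only the method names _sample and _before_sample and the attribute names _samples_idxs and _samples_clusters are taken from the text.

```python
from abc import ABC, abstractmethod


class ToyAbstractSampler(ABC):
    """Stand-in for the real base class; NOT the astartes implementation."""

    def __init__(self, X):
        self.X = X
        self._samples_idxs = []      # filled in by interpolative samplers
        self._samples_clusters = []  # filled in by extrapolative samplers
        self._before_sample()
        self._sample()

    def _before_sample(self):
        """Optional hook for data validation; does nothing by default."""

    @abstractmethod
    def _sample(self):
        """Subclasses put their partitioning algorithm here."""


class ReverseOrder(ToyAbstractSampler):
    """Interpolative toy: defines the order in which samples are drawn."""

    def _before_sample(self):
        if len(self.X) == 0:
            raise ValueError("X must contain at least one sample")

    def _sample(self):
        self._samples_idxs = list(range(len(self.X) - 1, -1, -1))


class ParityClusters(ToyAbstractSampler):
    """Extrapolative toy: assigns each sample a cluster label."""

    def _sample(self):
        self._samples_clusters = [i % 2 for i in range(len(self.X))]
```

A real sampler would replace the body of _sample with its own algorithm and leave the rest of the machinery to the base class.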

+

A new sampler can be as simple as a passthrough to another train_test_split, or it can be an original implementation that results in X and y being split into two lists. Take a look at astartes/samplers/interpolation/random_split.py for a basic example!

+

After the sampler has been implemented, add it to __init__.py in astartes/samplers and it will automatically be unit tested. Additional unit tests to verify that hyperparameters can be properly passed, etc. are also recommended.

+

For historical reasons, and as a guide for any developers who would like to add new samplers, below is a running list of samplers which have been considered for addition to astartes but ultimately not added for various reasons.

+
+

Not Implemented Sampling Algorithms

+ + + + + + + + + + + + + + + + + +

Sampler Name

Reasoning

Relevant Link(s)

D-Optimal

Requires a-priori knowledge of the test and train size, which does not fit in the astartes framework (samplers are all agnostic to the size of the sets). It is also questionable whether the use of the Fisher information matrix is actually meaningful in the context of sampling existing data rather than tuning for ideal data.

The Wikipedia article for optimal design does a good job explaining why this is difficult, and points at some potential alternatives.

Duplex

Requires knowing test and train size before execution, and can only partition data into two sets which would make it incompatible with train_val_test_split.

This implementation in R includes helpful references and a reference implementation.

+
+
+
+

Adding New Featurization Schemes

+

All of the sampling methods implemented in astartes accept arbitrary arrays of numbers and return the sampled groups (with the exception of Scaffold.py). If you have an existing featurization scheme (i.e. take an arbitrary input and turn it into an array of numbers), we would be thrilled to include it in astartes.

+

Adding a new interface should take on this format:

+
import numpy as np
+from astartes import train_test_split
+
+def train_test_split_INTERFACE(
+    INTERFACE_input,
+    INTERFACE_ARGS,
+    y: np.array = None,
+    labels: np.array = None,
+    test_size: float = 0.25,
+    train_size: float = 0.75,
+    splitter: str = 'random',
+    hopts: dict = {},
+    INTERFACE_hopts: dict = {},
+):
+    # turn the INTERFACE_input into an input X
+    # based on INTERFACE ARGS where INTERFACE_hopts
+    # specifies additional behavior
+    X = []
+
+    # call train test split with this input
+    return train_test_split(
+        X,
+        y=y,
+        labels=labels,
+        test_size=test_size,
+        train_size=train_size,
+        splitter=splitter,
+        hopts=hopts,
+    )
+
+
+

If possible, we would like to also add an example Jupyter Notebook with any new interface to demonstrate to new users how it functions. See our other examples in the examples directory.
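To make the interface template above concrete, here is a toy version that runs without astartes. All three functions are hypothetical: featurize_words stands in for a real featurization scheme, and toy_split stands in for train_test_split so that the sketch is self-contained.

```python
def featurize_words(words, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Hypothetical featurizer: one letter-count vector per input word."""
    return [[word.lower().count(letter) for letter in alphabet] for word in words]


def toy_split(X, test_size=0.25):
    """Stand-in for train_test_split: the last test_size fraction becomes the test set."""
    n_test = int(round(len(X) * test_size))
    cut = len(X) - n_test
    return X[:cut], X[cut:]


def toy_split_words(words, test_size=0.25, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Interface pattern: featurize first, then delegate to the generic splitter."""
    X = featurize_words(words, alphabet=alphabet)
    return toy_split(X, test_size=test_size)
```

The design point is that the interface function only converts its input into an array of numbers and then hands control to the generic splitter, so every sampler remains available to it.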

+

Contact @JacksonBurns if you need assistance adding an existing workflow to astartes. If this featurization scheme requires additional dependencies to function, we may add it as an additional extra package in the same way that molecules is installed.

+
+
+

The train_val_test_split Function

+

train_val_test_split is the workhorse function of astartes. +It is responsible for instantiating the sampling algorithm, partitioning the data into training, validation, and testing, and then returning the requested results while also keeping an eye on data types. +Under the hood, train_test_split is just calling train_val_test_split with val_size set to 0.0. +For more information on how it works, check out the inline documentation in astartes/main.py.
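The relationship between the two functions can be illustrated with a toy pair that mirrors the delegation described above. This is a sketch under simplifying assumptions, not the code in astartes/main.py: the toy functions keep the input order instead of using a sampler, and their names and default sizes are made up.

```python
def toy_train_val_test_split(X, train_size=0.8, val_size=0.1, test_size=0.1):
    """Toy three-way split that keeps the input order (no sampler involved)."""
    total = train_size + val_size + test_size
    n = len(X)
    n_train = int(round(n * train_size / total))
    n_val = int(round(n * val_size / total))
    X_train = X[:n_train]
    X_val = X[n_train:n_train + n_val]
    X_test = X[n_train + n_val:]
    return X_train, X_val, X_test


def toy_train_test_split(X, train_size=0.75, test_size=0.25):
    """Two-way split implemented as a three-way split with an empty validation set."""
    X_train, _, X_test = toy_train_val_test_split(
        X, train_size=train_size, val_size=0.0, test_size=test_size
    )
    return X_train, X_test
```

Keeping a single three-way implementation means the two-way function cannot drift out of sync with it.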

+
+
+

Development Philosophy

+

The developers of astartes prioritize (1) reproducibility, (2) flexibility, and (3) maintainability.

+
    +
  1. All versions of astartes 1.x should produce the same results across all platforms, so we have thorough unit and regression testing run on a continuous basis.

  2. +
  3. We specify as few dependencies as possible with the loosest possible dependency requirements, which allows integrating astartes with other tools more easily.

    +
      +
    • Dependencies which introduce a lot of requirements and/or specific versions of requirements are shuffled into the extras_require to avoid weighing down the main package.

    • +
    • Compatibility with all versions of modern Python is achieved by not tightly specifying version numbers as well as by regression testing across all versions.

    • +
    +
  4. +
  5. We follow DRY (Don’t Repeat Yourself) principles to avoid code duplication and decrease maintenance burden, have near-perfect test coverage, and enforce consistent formatting style in the source code.

    +
      +
    • Inline comments are critical for maintainability - at the time of writing, astartes has 1 comment line for every 2 lines of source code.

    • +
    +
  6. +
+
+
+
+

JOSS Branch

+

The JOSS paper corresponding to astartes is stored in this repository on a separate branch. You can find paper.md on the aptly named joss-paper branch.

+

Note for Maintainers: To push changes from the main branch into the joss-paper branch, run the Update JOSS Branch workflow.

+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/CONTRIBUTING.rst b/docs/CONTRIBUTING.rst new file mode 100644 index 00000000..677944a2 --- /dev/null +++ b/docs/CONTRIBUTING.rst @@ -0,0 +1,143 @@ + +Contributing & Developer Notes +------------------------------ + +Pull Requests, Bug Reports, and all Contributions are welcome, encouraged, and appreciated! +Please use the appropriate `issue `_ or `pull request `_ template when making a contribution to help the maintainers get it merged quickly. + +We make use of `the GitHub Discussions page `_ to go over potential features to add. +Please feel free to stop by if you are looking for something to develop or have an idea for a useful feature! + +When submitting a PR, please mark your PR with the "PR Ready for Review" label when you are finished making changes so that the GitHub actions bots can work their magic! + +Developer Install +^^^^^^^^^^^^^^^^^ + +To contribute to the ``astartes`` source code, start by forking and then cloning the repository (i.e. ``git clone git@github.com:YourUsername/astartes.git``\ ) and then inside the repository run ``pip install -e .[dev]``. This will set you up with all the required dependencies to run ``astartes`` and conform to our formatting standards (\ ``black`` and ``isort``\ ), which you can configure to run automatically in VSCode `like this `_. + +.. + + **Warning** + Windows (PowerShell) and MacOS Catalina or newer (zsh) require double quotes around the ``[]`` characters (i.e. ``pip install "astartes[dev]"``\ ) + + +Version Checking +^^^^^^^^^^^^^^^^ + +``astartes`` uses ``pyproject.toml`` to specify all metadata, but the version is also specified in ``astartes/__init__.py`` (via ``__version__``\ ) for backwards compatibility with Python 3.7. 
+To check which version of ``astartes`` you have installed, you can run ``python -c "import astartes; print(astartes.__version__)"`` on Python 3.7 or ``python -c "from importlib.metadata import version; print(version('astartes'))"`` on Python 3.8 or newer. + +Testing +^^^^^^^ + +All of the tests in ``astartes`` are written using the built-in python ``unittest`` module (to allow running without ``pytest``\ ) but we *highly* recommend using ``pytest``. +To execute the tests from the ``astartes`` repository, simply type ``pytest`` after running the developer install (or alternately, ``pytest -v`` for a more helpful output). +On GitHub, we use actions to run the tests on every Pull Request and on a nightly basis (look in ``.github/workflows`` for more information). +These tests include unit tests, functional tests, and regression tests. + +Adding New Samplers +^^^^^^^^^^^^^^^^^^^ + +A new sampler should extend the abstract base class defined in ``abstract_sampler.py``. +Each subclass should override the ``_sample`` method with its own algorithm for data partitioning and, optionally, the ``_before_sample`` method to perform any data validation. + +All samplers in ``astartes`` are classified as one of two types: extrapolative or interpolative. +Extrapolative samplers work by clustering data into groups (which are then partitioned into train/validation/test to enforce extrapolation) whereas interpolative samplers provide an exact *order* in which samples should be moved into the training set. + +When actually implemented, this means that extrapolative samplers should set the ``self._samples_clusters`` attribute and interpolative samplers should set the ``self._samples_idxs`` attribute. + +A new sampler can be as simple as a passthrough to another ``train_test_split``\ , or it can be an original implementation that results in X and y being split into two lists. Take a look at ``astartes/samplers/interpolation/random_split.py`` for a basic example! 
+ +After the sampler has been implemented, add it to ``__init__.py`` in ``astartes/samplers`` and it will automatically be unit tested. Additional unit tests to verify that hyperparameters can be properly passed, etc. are also recommended. + +For historical reasons, and as a guide for any developers who would like to add new samplers, below is a running list of samplers which have been *considered* for addition to ``astartes`` but ultimately not added for various reasons. + +Not Implemented Sampling Algorithms +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: + :header-rows: 1 + + * - Sampler Name + - Reasoning + - Relevant Link(s) + * - D-Optimal + - Requires *a-priori* knowledge of the test and train size, which does not fit in the ``astartes`` framework (samplers are all agnostic to the size of the sets). It is also questionable whether the use of the Fisher information matrix is actually meaningful in the context of sampling existing data rather than tuning for ideal data. + - The `Wikipedia article for optimal design `_ does a good job explaining why this is difficult, and points at some potential alternatives. + * - Duplex + - Requires knowing test and train size before execution, and can only partition data into two sets which would make it incompatible with ``train_val_test_split``. + - This `implementation in R `_ includes helpful references and a reference implementation. + + +Adding New Featurization Schemes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +All of the sampling methods implemented in ``astartes`` accept arbitrary arrays of numbers and return the sampled groups (with the exception of ``Scaffold.py``\ ). If you have an existing featurization scheme (i.e. take an arbitrary input and turn it into an array of numbers), we would be thrilled to include it in ``astartes``. + +Adding a new interface should take on this format: + +.. 
code-block:: python + + import numpy as np + from astartes import train_test_split + + def train_test_split_INTERFACE( + INTERFACE_input, + INTERFACE_ARGS, + y: np.array = None, + labels: np.array = None, + test_size: float = 0.25, + train_size: float = 0.75, + splitter: str = 'random', + hopts: dict = {}, + INTERFACE_hopts: dict = {}, + ): + # turn the INTERFACE_input into an input X + # based on INTERFACE ARGS where INTERFACE_hopts + # specifies additional behavior + X = [] + + # call train test split with this input + return train_test_split( + X, + y=y, + labels=labels, + test_size=test_size, + train_size=train_size, + splitter=splitter, + hopts=hopts, + ) + +If possible, we would like to also add an example Jupyter Notebook with any new interface to demonstrate to new users how it functions. See our other examples in the ``examples`` directory. + +Contact `@JacksonBurns `_ if you need assistance adding an existing workflow to ``astartes``. If this featurization scheme requires additional dependencies to function, we may add it as an additional *extra* package in the same way that ``molecules`` is installed. + +The ``train_val_test_split`` Function +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +``train_val_test_split`` is the workhorse function of ``astartes``. +It is responsible for instantiating the sampling algorithm, partitioning the data into training, validation, and testing, and then returning the requested results while also keeping an eye on data types. +Under the hood, ``train_test_split`` is just calling ``train_val_test_split`` with ``val_size`` set to ``0.0``. +For more information on how it works, check out the inline documentation in ``astartes/main.py``. + +Development Philosophy +^^^^^^^^^^^^^^^^^^^^^^ + +The developers of ``astartes`` prioritize (1) reproducibility, (2) flexibility, and (3) maintainability. + + +#. 
All versions of ``astartes`` ``1.x`` should produce the same results across all platforms, so we have thorough unit and regression testing run on a continuous basis. +#. We specify as *few dependencies as possible* with the *loosest possible* dependency requirements, which allows integrating ``astartes`` with other tools more easily. + + * Dependencies which introduce a lot of requirements and/or specific versions of requirements are shuffled into the ``extras_require`` to avoid weighing down the main package. + * Compatibility with all versions of modern Python is achieved by not tightly specifying version numbers as well as by regression testing across all versions. + +#. We follow DRY (Don't Repeat Yourself) principles to avoid code duplication and decrease maintenance burden, have near-perfect test coverage, and enforce consistent formatting style in the source code. + + * Inline comments are *critical* for maintainability - at the time of writing, ``astartes`` has 1 comment line for every 2 lines of source code. + +JOSS Branch +----------- + +The JOSS paper corresponding to ``astartes`` is stored in this repository on a separate branch. You can find ``paper.md`` on the aptly named ``joss-paper`` branch. + +*Note for Maintainers*\ : To push changes from the ``main`` branch into the ``joss-paper`` branch, run the ``Update JOSS Branch`` workflow. diff --git a/docs/README.doctree b/docs/README.doctree new file mode 100644 index 00000000..dcd545cc Binary files /dev/null and b/docs/README.doctree differ diff --git a/docs/README.html b/docs/README.html new file mode 100644 index 00000000..3520d85a --- /dev/null +++ b/docs/README.html @@ -0,0 +1,543 @@ + + + + + + + Online Documentation — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +

astartes

+

(as-tar-tees)

+

Train:Validation:Test Algorithmic Sampling for Molecules and Arbitrary Arrays

+
[Image: astartes logo, https://raw.githubusercontent.com/JacksonBurns/astartes/main/astartes_logo.png]

+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +

Status Badges

Usage | Continuous Integration | Release
PyPI - Python Version | Reproduce Paper | pyOpenSci approved
PyPI - License | Test Status | DOI
PyPI - Total Downloads | PyPI | conda-forge version
GitHub Repo Stars | Project Status: Active – The project has reached a stable, usable state and is being actively developed.
+
+

Online Documentation

+

Follow this link for a nicely-rendered version of this README along with additional tutorials for moving from train_test_split in sklearn to astartes. +Keep reading for an installation guide and links to tutorials!

+
+
+

Installing astartes

+

We recommend installing astartes within a virtual environment, using either venv or conda (or other tools) to simplify dependency management. Python versions 3.7, 3.8, 3.9, 3.10, 3.11, and 3.12 are supported on all platforms.

+
+

Warning +Windows (PowerShell) and MacOS Catalina or newer (zsh) require double quotes around text using the '[]' characters (i.e. pip install "astartes[molecules]").

+
+
+

pip

+

astartes is available on PyPI and can be installed using pip:

+
    +
  • To include the featurization options for chemical data, use pip install astartes[molecules].

  • +
  • To install only the sampling algorithms, use pip install astartes (this install will have fewer dependencies and may be more readily compatible in environments with existing workflows).

  • +
+
+
+

conda

+

The astartes package is also available on conda-forge and can be installed with this command: conda install -c conda-forge astartes. +To install astartes with support for featurizing molecules, use: conda install -c conda-forge astartes aimsim. +This will download the base astartes package as well as aimsim, which is the backend used for molecular featurization.

+
+
+

Source

+

To install astartes from source for development, see the Contributing & Developer Notes section.

+
+
+
+

Statement of Need

+

Machine learning has sparked an explosion of progress in chemical kinetics, materials science, and many other fields as researchers use data-driven methods to accelerate steps in traditional workflows within some acceptable error tolerance. +To facilitate adoption of these models, there are two important tasks to consider:

+
    +
  1. use a validation set when selecting the optimal hyperparameter for the model and separately use a held-out test set to measure performance on unseen data.

  2. +
  3. evaluate model performance on both interpolative and extrapolative tasks so future users are informed of any potential limitations.

  4. +
+

astartes addresses both of these points by implementing an sklearn-compatible train_val_test_split function. +Additional technical detail is provided below as well as in our companion paper in the Journal of Open Source Software: Machine Learning Validation via Rational Dataset Sampling with astartes. +For a demo-based explainer using machine learning on a fast food menu, see the astartes Reproducible Notebook published at the United States Research Software Engineers Conference at this page.

+
+

Target Audience

+

astartes is generally applicable to machine learning involving discovery, inference, and model validation. +There are specific functions in astartes for applications in cheminformatics (astartes.molecules) but the methods implemented are general to all numerical data.

+
+
+
+

Quick Start

+

astartes is designed as a drop-in replacement for sklearn's train_test_split function (see the sklearn documentation). To switch to astartes, change from sklearn.model_selection import train_test_split to from astartes import train_test_split.

+

Like sklearn, astartes accepts any iterable object as X, y, and labels. +Each will be converted to a numpy array for internal operations, and returned as a numpy array with limited exceptions: if X is a pandas DataFrame, y is a Series, or labels is a Series, astartes will cast it back to its original type including its index and column names.

+
+

Note +The developers recommend passing X, y, and labels as numpy arrays and handling the conversion to and from other types explicitly on your own. Behind-the-scenes type casting can lead to unexpected behavior!

+
+

By default, astartes will split data randomly. Additionally, a variety of algorithmic sampling approaches can be used by specifying the sampler argument to the function (see the Table of Implemented Samplers for a complete list of options and their corresponding references):

+
from astartes import train_test_split
+from sklearn.datasets import load_diabetes
+
+X, y = load_diabetes(return_X_y=True)
+
+X_train, X_test, y_train, y_test = train_test_split(
+  X,  # preferably numpy arrays, but astartes will cast it for you
+  y,
+  sampler = 'kennard_stone',  # any of the supported samplers
+)
+
+
+
+

Note +Extrapolation sampling algorithms will return an additional set of arrays (the cluster labels), which will result in a ValueError: too many values to unpack if the call does not unpack them. See the split_comparisons Google Colab demo (https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/split_comparisons/split_comparisons.ipynb) for a full explanation.

+
+

That’s all you need to get started with astartes! +The next sections include more examples and some demo notebooks you can try in your browser.

+
+

Example Notebooks

+

Click the badges in the table below to be taken to a live, interactive demo of astartes:

+

To execute these notebooks locally, clone this repository (i.e. git clone https://github.com/JacksonBurns/astartes.git), navigate to the astartes directory, run pip install .[demos], then open and run the notebooks in your preferred editor. +You do not need to execute the cells prefixed with %%capture - they are only present for compatibility with Google Colab.

+
+
+

Withhold Testing Data with train_val_test_split

+

For rigorous ML research, it is critical to withhold some data during training to use a test set. +The model should never see this data during training (unlike the validation set) so that we can get an accurate measurement of its performance.

+

With astartes, this three-way data split is readily available through train_val_test_split:

+
from astartes import train_val_test_split
+
+# an interpolative sampler is used here; extrapolative samplers also return cluster labels
+X_train, X_val, X_test = train_val_test_split(X, sampler = 'kennard_stone')
+
+
+

You can now train your model with X_train, optimize your model with X_val, and measure its performance with X_test.

+
+
+

Evaluate the Impact of Splitting Algorithms on Regression Models

+

For data with many features it can be difficult to visualize how different sampling algorithms change the distribution of data into training, validation, and testing like we do in some of the demo notebooks. +To aid in analyzing the impact of the algorithms, astartes provides generate_regression_results_dict. +This function allows users to quickly evaluate the impact of different splitting techniques on any sklearn-compatible model’s performance. +All results are stored in a nested dictionary ({sampler:{metric:{split:score}}}) format and can be displayed in a neatly formatted table using the optional print_results argument.

+
from sklearn.svm import LinearSVR
+
+from astartes.utils import generate_regression_results_dict as grrd
+
+sklearn_model = LinearSVR()
+results_dict = grrd(
+    sklearn_model,
+    X,
+    y,
+    print_results=True,
+)
+
+         Train       Val      Test
+----  --------  --------  --------
+MAE   1.41522   3.13435   2.17091
+RMSE  2.03062   3.73721   2.40041
+R2    0.90745   0.80787   0.78412
+
+
+

Additional metrics can be passed to generate_regression_results_dict via the additional_metrics argument, which should be a dictionary mapping the name of the metric (as a string) to the function itself, like this:

+
from sklearn.metrics import mean_absolute_percentage_error
+
+add_met = {"mape": mean_absolute_percentage_error}
+
+grrd(sklearn_model, X, y, additional_metrics=add_met)
+
+
+

See the docstring for generate_regression_results_dict (with help(generate_regression_results_dict)) for more information.
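The nested {sampler:{metric:{split:score}}} layout described above can be walked with three loops. In this sketch the scores are copied from the printed table earlier in this section, but the key names ("random", "MAE", "train", and so on) are assumptions made for illustration, and flatten_results is a hypothetical helper rather than part of astartes.

```python
# Values copied from the printed table above; the key names ("random",
# "MAE", "train", ...) are assumptions made for illustration only.
results_dict = {
    "random": {
        "MAE": {"train": 1.41522, "val": 3.13435, "test": 2.17091},
        "RMSE": {"train": 2.03062, "val": 3.73721, "test": 2.40041},
        "R2": {"train": 0.90745, "val": 0.80787, "test": 0.78412},
    }
}


def flatten_results(results):
    """Turn {sampler: {metric: {split: score}}} into a list of flat rows."""
    rows = []
    for sampler, metrics in results.items():
        for metric, splits in metrics.items():
            for split, score in splits.items():
                rows.append((sampler, metric, split, score))
    return rows


for row in flatten_results(results_dict):
    print(row)
```

Flat rows of this kind are convenient to pass to a table or plotting library when several samplers are compared side by side.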

+
+
+

Using astartes with Categorical Data

+

Any of the implemented sampling algorithms whose hyperparameters allow specifying the metric or distance_metric (effectively 1-metric) can be co-opted to work with categorical data. +Simply encode the data in a format compatible with the sklearn metric of choice and then call astartes with that metric specified:

+
from sklearn.metrics import jaccard_score
+
+X_train, X_test, y_train, y_test = train_test_split(
+  X,
+  y,
+  sampler='kennard_stone',
+  hopts={"metric": jaccard_score},
+)
+
+
+

Other samplers which do not allow specifying a categorical distance metric did not provide a method for doing so in their original inception, though it is possible that they can be adapted for this application. +If you are interested in adding support for categorical metrics to an existing sampler, consider opening a Feature Request!
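To see why a similarity score has to be turned into a distance (the "1-metric" remark above), here is a pure-Python sketch of the Jaccard similarity and distance for binary-encoded categorical rows. The function names and the three example rows are made up for illustration; this is not how astartes or sklearn compute the metric internally.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two equal-length binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # two all-zero vectors are treated as identical by convention here
    return 1.0 if either == 0 else both / either


def jaccard_distance(a, b):
    """Distance is one minus similarity."""
    return 1.0 - jaccard_similarity(a, b)


# one-hot style encoding of three categorical samples (made-up example rows)
red_small = [1, 0, 1, 0]
red_large = [1, 0, 0, 1]
blue_large = [0, 1, 0, 1]

print(jaccard_distance(red_small, red_large))
print(jaccard_distance(red_small, blue_large))
```

Rows that share a category end up closer together than rows that share none, which is the property a distance-based sampler relies on.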

+
+
+

Access Sampling Algorithms Directly

+

The sampling algorithms implemented in astartes can also be directly accessed and run if it is more useful for your applications. +In the below example, we import the Kennard Stone sampler, use it to partition a simple array, and then retrieve a sample.

+
from astartes.samplers.interpolation import KennardStone
+
+kennard_stone = KennardStone([[1, 2], [3, 4], [5, 6]])
+first_2_samples = kennard_stone.get_sample_idxs(2)
+
+
+

All samplers in astartes implement a _sample() method that is called by the constructor (i.e. greedily) and either a get_sample_idxs or get_cluster_idxs method for interpolative and extrapolative samplers, respectively. +For more detail on the implementation and design of samplers in astartes, see the Developer Notes section.
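For intuition about what an interpolative sampler computes, the following is a textbook-style Kennard-Stone ordering written in plain Python. It is a sketch for illustration only and is not the astartes implementation; the function name kennard_stone_order is hypothetical, Euclidean distance is assumed, and math.dist requires Python 3.8 or newer.

```python
import math


def kennard_stone_order(points):
    """Return indices in the order a textbook Kennard-Stone pass selects them."""
    n = len(points)
    if n < 2:
        return list(range(n))
    dist = [[math.dist(p, q) for q in points] for p in points]
    # start from the two points that are farthest apart
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while remaining:
        # pick the point whose nearest selected neighbor is farthest away
        nxt = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


order = kennard_stone_order([[1, 2], [3, 4], [5, 6]])
first_2_samples = order[:2]
```

Because the result is a complete ordering, any number of samples can be taken from the front of the list, which is what makes this family of samplers independent of the requested set sizes.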

+
+
+
+

Theory and Application of astartes

+

This section of the README details some of the theory behind why the algorithms implemented in astartes are important and some motivating examples. +For a comprehensive walkthrough of the theory and implementation of astartes, follow this link to read the companion paper (freely available and hosted here on GitHub).

+
+

Note +We reference open-access publications wherever possible. For articles locked behind a paywall (denoted with :small_blue_diamond:), we instead suggest reading this Wikipedia page and absolutely not attempting to bypass the paywall.

+
+
+

Rational Splitting Algorithms

+

While much machine learning is done with a random choice between training/validation/test data, an alternative is the use of so-called “rational” splitting algorithms. +These approaches use some similarity-based algorithm to divide data into sets. +Some of these algorithms include Kennard-Stone (Kennard & Stone :small_blue_diamond:), Sphere Exclusion (Tropsha et al. :small_blue_diamond:), as well as OptiSim as discussed in Applied Chemoinformatics: Achievements and Future Opportunities :small_blue_diamond:. +Some clustering-based splitting techniques have also been incorporated, such as DBSCAN.

+

There are two broad categories of sampling algorithms implemented in astartes: extrapolative and interpolative. +The former will force your model to predict on out-of-sample data, which creates a more challenging task than interpolative sampling. +See the table below for all of the sampling approaches currently implemented in astartes, as well as the hyperparameters that each algorithm accepts (which are passed in with hopts) and a helpful reference for understanding how the hyperparameters work. +Note that random_state is defined as a keyword argument in train_test_split itself, even though these algorithms will use the random_state in their own work. +Do not provide a random_state in the hopts dictionary - it will be overwritten by the random_state you provide for train_test_split (or the default if none is provided).
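The overriding of random_state described above can be pictured as a dictionary merge in which the top-level argument always wins. This is a sketch of the behavior, not the astartes source: merge_sampler_options is a hypothetical helper, and the default of None is an assumption.

```python
def merge_sampler_options(hopts, random_state=None):
    """Hypothetical helper: the top-level random_state always wins over hopts."""
    merged = dict(hopts)  # copy so the caller's dictionary is left untouched
    merged["random_state"] = random_state
    return merged


user_hopts = {"n_clusters": 4, "random_state": 0}
effective = merge_sampler_options(user_hopts, random_state=42)
print(effective)
```

Under this model, a random_state placed inside hopts never reaches the sampler, which is why it should be passed to train_test_split directly.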

+
+

Implemented Sampling Algorithms

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Sampler Name

Usage String

Type

Hyperparameters

Reference

Notes

Random

‘random’

Interpolative

shuffle

sklearn train_test_split Documentation

This sampler is a direct passthrough to sklearn's train_test_split.

Kennard-Stone

‘kennard_stone’

Interpolative

metric

Original Paper by Kennard & Stone :small_blue_diamond:

Euclidean distance is used by default, as described in the original paper.

Sample set Partitioning based on joint X-Y distances (SPXY)

‘spxy’

Interpolative

distance_metric

Saldanha et al. original paper :small_blue_diamond:

Extension of Kennard Stone that also includes the response when sampling distances.

Mahalanobis Distance Kennard Stone (MDKS)

‘spxy’ (MDKS is derived from SPXY)

Interpolative

none, see Notes

Saptoro et al. original paper

MDKS is SPXY using Mahalanobis distance and can be called by using SPXY with distance_metric="mahalanobis"

Scaffold

‘scaffold’

Extrapolative

include_chirality

Bemis-Murcko Scaffold :small_blue_diamond: as implemented in RDKit

This sampler requires SMILES strings as input (use the molecules subpackage)

Sphere Exclusion

‘sphere_exclusion’

Extrapolative

metric, distance_cutoff

custom implementation

Variation on Sphere Exclusion for arbitrary-valued vectors.

Time Based

‘time_based’

Extrapolative

none

Papers using Time based splitting: Chen et al. :small_blue_diamond:, Sheridan, R. P :small_blue_diamond:, Feinberg et al. :small_blue_diamond:, Struble et al.

This sampler requires labels to be an iterable of either date or datetime objects.

Optimizable K-Dissimilarity Selection (OptiSim)

‘optisim’

Extrapolative

n_clusters, max_subsample_size, distance_cutoff

custom implementation

Variation on OptiSim for arbitrary-valued vectors.

K-Means

‘kmeans’

Extrapolative

n_clusters, n_init

sklearn KMeans (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

Passthrough to sklearn's KMeans.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

‘dbscan’

Extrapolative

eps, min_samples, algorithm, metric, leaf_size

sklearn DBSCAN Documentation (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)

Passthrough to sklearn's DBSCAN.

Minimum Test Set Dissimilarity (MTSD)

~

~

upcoming in astartes v1.x

~

~

Restricted Boltzmann Machine (RBM)

~

~

upcoming in astartes v1.x

~

~

Kohonen Self-Organizing Map (SOM)

~

~

upcoming in astartes v1.x

~

~

SPlit Method

~

~

upcoming in astartes v1.x

~

~

+
+
+
+
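To make the interpolative samplers in the table above more concrete, here is a minimal pure-Python sketch of the Kennard-Stone selection idea using Euclidean distance. It is an illustration only, not the astartes implementation: the function name `kennard_stone_order` and the tie-breaking by index are assumptions.

```python
import math


def kennard_stone_order(points):
    """Return point indices in Kennard-Stone selection order (Euclidean distance)."""
    n = len(points)
    if n < 2:
        return list(range(n))
    # Full pairwise distance matrix; fine for a small illustrative data set.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Seed the selection with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    while remaining:
        # Choose the remaining point whose nearest selected neighbor is farthest away.
        nxt = max(sorted(remaining), key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (4.0, 0.0)]
print(kennard_stone_order(points))  # [0, 2, 3, 1]
```

The returned list is an ordering rather than a partition: under this scheme the earliest entries would be assigned to the training set, and the split sizes decide where the ordering is cut.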

Domain-Specific Applications

+

Below are some field-specific applications of astartes. Interested in adding a new sampling algorithm or featurization approach? See CONTRIBUTING.md (./CONTRIBUTING.md).

+
+

Chemical Data and the astartes.molecules Subpackage

+

Machine Learning is enormously useful in chemistry-related fields due to the high-dimensional feature space of chemical data. +To properly apply ML to chemical data for inference or discovery, it is important to know a model's accuracy in both of these domains. +To simplify the process of partitioning chemical data, astartes implements a pre-built featurizer for common chemistry data formats. +After installing with pip install astartes[molecules], one can import the new train/test splitting function like this: from astartes.molecules import train_test_split_molecules

+

The usage of this function is identical to train_test_split but with the addition of new arguments to control how the molecules are featurized:

+
train_test_split_molecules(
+    molecules=smiles,
+    y=y,
+    test_size=0.2,
+    train_size=0.8,
+    fingerprint="daylight_fingerprint",
+    fprints_hopts={
+        "minPath": 2,
+        "maxPath": 5,
+        "fpSize": 200,
+        "bitsPerHash": 4,
+        "useHs": 1,
+        "tgtDensity": 0.4,
+        "minSize": 64,
+    },
+    sampler="random",
+    random_state=42,
+    hopts={
+        "shuffle": True,
+    },
+)
+
+
+

To see a complete example of using train_test_split_molecules with actual chemical data, take a look in the examples directory and the brief companion paper.

+

Configuration options for the featurization scheme can be found in the documentation for AIMSim though most of the critical configuration options are shown above.

+
+
+
+
+

Reproducibility

+

astartes aims to be completely reproducible across different platforms, Python versions, and dependency configurations - any version of astartes v1.x should result in the exact same splits, always. +To that end, the default behavior of astartes is to use 42 as the random seed and always set it. +Running astartes with the default settings will always produce the exact same results. +We have verified this behavior on Debian Ubuntu, Windows, and Intel Macs from Python versions 3.7 through 3.11 (with appropriate dependencies for each version).
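The effect of always setting a seed can be illustrated with a small standard-library sketch. This is not how astartes is implemented; `seeded_split` is a hypothetical helper whose name and parameters are assumptions, and it only shows why a fixed default seed (42 here, matching the default described above) makes repeated calls return identical partitions.

```python
import random


def seeded_split(n_items, test_fraction=0.25, seed=42):
    """Hypothetical helper: shuffle indices with a fixed seed, then cut once."""
    indices = list(range(n_items))
    # A dedicated Random instance keeps the result independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)


train_a, test_a = seeded_split(20)
train_b, test_b = seeded_split(20)
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
```

Passing a different `seed` changes the partition, which is why reporting the seed alongside results matters whenever the default is not used.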

+
+

Known Reproducibility Limitations

+

Inevitably, external dependencies of astartes will introduce backwards-incompatible changes. +We continually run regression tests to catch these, and will list all known limitations here:

+
    +
  • sklearn v1.3.0 introduced backwards-incompatible changes in the KMeans sampler that changed how the random initialization affects the results, even given the same random seed. Different versions of sklearn will affect the performance of astartes, and we recommend reporting the exact versions of scikit-learn and astartes used, when applicable.

  • +
+
+

Note +We are limited in our ability to test on M1 Macs, but from our limited manual testing we achieve perfect reproducibility in all cases except occasionally with KMeans on Apple silicon. +astartes is still consistent between runs on the same platform in all cases, and other samplers are not impacted by this apparent bug.
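Since the splits can depend on the installed versions, it may help to record them next to any reported results. The sketch below uses only the standard library (`importlib.metadata`, available on Python 3.8 or newer); the helper name `installed_versions` is an assumption made for illustration and is not part of astartes.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("astartes", "scikit-learn")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


# Print the versions so they can be pasted into a methods section or a log file.
print(installed_versions())
```

Saving this dictionary together with the random seed gives readers what they need to rebuild the same environment before comparing splits.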

+
+
+
+
+

How to Cite

+

If you use astartes in your work, please follow the link below to our (Open Access!) paper in the Journal of Open Source Software or use the “Cite this repository” button on GitHub.

+

Machine Learning Validation via Rational Dataset Sampling with astartes

+
+
+

Contributing & Developer Notes

+

See CONTRIBUTING.md for instructions on installing astartes for development, making a contribution, and general guidance on the design of astartes.

+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/README.rst b/docs/README.rst new file mode 100644 index 00000000..cda047e5 --- /dev/null +++ b/docs/README.rst @@ -0,0 +1,486 @@ +.. role:: raw-html-m2r(raw) + :format: html + + +:raw-html-m2r:`

astartes

` + +:raw-html-m2r:`

(as-tar-tees)

` + + +.. raw:: html + +

Train:Validation:Test Algorithmic Sampling for Molecules and Arbitrary Arrays

+ + +:raw-html-m2r:`

+ astarteslogo +

` + + +.. raw:: html + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +

Status Badges

UsageContinuous IntegrationRelease
PyPI - Python VersionReproduce PaperpyOpenSci approved
PyPI - LicenseTest StatusDOI
PyPI - Total DownloadsPyPI conda-forge version
GitHub Repo StarsProject Status: Active – The project has reached a stable, usable state and is being actively developed.
+
+ + +Online Documentation +-------------------- + +Follow `this link `_ for a nicely-rendered version of this README along with additional tutorials for `moving from train_test_split in sklearn to astartes `_. +Keep reading for a installation guide and links to tutorials! + +Installing ``astartes`` +--------------------------- + +We recommend installing ``astartes`` within a virtual environment, using either ``venv`` or ``conda`` (or other tools) to simplify dependency management. Python versions 3.7, 3.8, 3.9, 3.10, 3.11, and 3.12 are supported on all platforms. + +.. + + **Warning** + Windows (PowerShell) and MacOS Catalina or newer (zsh) require double quotes around text using the ``'[]'`` characters (i.e. ``pip install "astartes[molecules]"``\ ). + + +``pip`` +^^^^^^^^^^^ + +``astartes`` is available on ``PyPI`` and can be installed using ``pip``\ : + + +* To include the featurization options for chemical data, use ``pip install astartes[molecules]``. +* To install only the sampling algorithms, use ``pip install astartes`` (this install will have fewer dependencies and may be more readily compatible in environments with existing workflows). + +``conda`` +^^^^^^^^^^^^^ + +``astartes`` package is also available on ``conda-forge`` with this command: ``conda install -c conda-forge astartes``. +To install ``astartes`` with support for featurizing molecules, use: ``conda install -c conda-forge astartes aimsim``. +This will download the base ``astartes`` package as well as ``aimsim``\ , which is the backend used for molecular featurization. + +Source +^^^^^^ + +To install ``astartes`` from source for development, see the `Contributing & Developer Notes <#contributing--developer-notes>`_ section. 
+ +Statement of Need +----------------- + +Machine learning has sparked an explosion of progress in chemical kinetics, materials science, and many other fields as researchers use data-driven methods to accelerate steps in traditional workflows within some acceptable error tolerance. +To facilitate adoption of these models, there are two important tasks to consider: + + +#. use a validation set when selecting the optimal hyperparameter for the model and separately use a held-out test set to measure performance on unseen data. +#. evaluate model performance on both interpolative and extrapolative tasks so future users are informed of any potential limitations. + +``astartes`` addresses both of these points by implementing an ``sklearn``\ -compatible ``train_val_test_split`` function. +Additional technical detail is provided below as well as in our companion paper in the Journal of Open Source Software: `Machine Learning Validation via Rational Dataset Sampling with astartes `_. +For a demo-based explainer using machine learning on a fast food menu, see the ``astartes`` Reproducible Notebook published at the United States Research Software Engineers Conference at `this page `_. + +Target Audience +^^^^^^^^^^^^^^^ + +``astartes`` is generally applicable to machine learning involving both discovery and inference *and* model validation. +There are specific functions in ``astartes`` for applications in cheminformatics (\ ``astartes.molecules``\ ) but the methods implemented are general to all numerical data. + +Quick Start +----------- + +``astartes`` is designed as a drop-in replacement for ``sklearn``\ 's ``train_test_split`` function (see the `sklearn documentation `_\ ). To switch to ``astartes``\ , change ``from sklearn.model_selection import train_test_split`` to ``from astartes import train_test_split``. + +Like ``sklearn``\ , ``astartes`` accepts any iterable object as ``X``\ , ``y``\ , and ``labels``. 
+Each will be converted to a ``numpy`` array for internal operations, and returned as a ``numpy`` array with limited exceptions: if ``X`` is a ``pandas`` ``DataFrame``\ , ``y`` is a ``Series``\ , or ``labels`` is a ``Series``\ , ``astartes`` will cast it back to its original type including its index and column names. + +.. + + **Note** + The developers recommend passing ``X``\ , ``y``\ , and ``labels`` as ``numpy`` arrays and handling the conversion to and from other types explicitly on your own. Behind-the-scenes type casting can lead to unexpected behavior! + + +By default, ``astartes`` will split data randomly. Additionally, a variety of algorithmic sampling approaches can be used by specifying the ``sampler`` argument to the function (see the `Table of Implemented Samplers <#implemented-sampling-algorithms>`_ for a complete list of options and their corresponding references): + +.. code-block:: python + + from sklearn.datasets import load_diabetes + + X, y = load_diabetes(return_X_y=True) + + X_train, X_test, y_train, y_test = train_test_split( + X, # preferably numpy arrays, but astartes will cast it for you + y, + sampler = 'kennard_stone', # any of the supported samplers + ) + +.. + + **Note** + Extrapolation sampling algorithms will return an additional set of arrays (the cluster labels) which will result in a ``ValueError: too many values to unpack`` if not called properly. See the `\ ``split_comparisons`` Google colab demo `_ for a full explanation. + + +That's all you need to get started with ``astartes``\ ! +The next sections include more examples and some demo notebooks you can try in your browser. + +Example Notebooks +^^^^^^^^^^^^^^^^^ + +Click the badges in the table below to be taken to a live, interactive demo of ``astartes``\ : + +.. list-table:: + :header-rows: 1 + + * - Demo + - Topic + - Link + * - Comparing Sampling Algorithms with Fast Food + - Visual representations of how different samplers affect data partitioning + - + .. 
image:: https://colab.research.google.com/assets/colab-badge.svg + :target: https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/split_comparisons/split_comparisons.ipynb + :alt: Colab + + * - Using ``train_val_test_split`` with the ``sklearn`` example datasets + - Demonstrating how withholding a test set with ``train_val_test_split`` can impact performance + - + .. image:: https://colab.research.google.com/assets/colab-badge.svg + :target: https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/train_val_test_split_sklearn_example/train_val_test_split_example.ipynb + :alt: Colab + + * - Cheminformatics sample set partitioning with ``astartes`` + - Extrapolation vs. Interpolation impact on cheminformatics model accuracy + - + .. image:: https://colab.research.google.com/assets/colab-badge.svg + :target: https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/barrier_prediction_with_RDB7/RDB7_barrier_prediction_example.ipynb + :alt: Colab + + * - Comparing partitioning approaches for alkanes + - Visualizing how samplers impact model performance with simple chemicals + - + .. image:: https://colab.research.google.com/assets/colab-badge.svg + :target: https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/mlpds_2023_astartes_demonstration/mlpds_2023_demo.ipynb + :alt: Colab + + + +To execute these notebooks locally, clone this repository (i.e. ``git clone https://github.com/JacksonBurns/astartes.git``\ ), navigate to the ``astartes`` directory, run ``pip install .[demos]``\ , then open and run the notebooks in your preferred editor. +You do *not* need to execute the cells prefixed with ``%%capture`` - they are only present for compatibility with Google Colab. 
+ +Withhold Testing Data with ``train_val_test_split`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For rigorous ML research, it is critical to withhold some data during training to use a ``test`` set. +The model should *never* see this data during training (unlike the validation set) so that we can get an accurate measurement of its performance. + +With ``astartes`` performing this three-way data split is readily available with ``train_val_test_split``\ : + +.. code-block:: python + + from astartes import train_val_test_split + + X_train, X_val, X_test = train_val_test_split(X, sampler = 'sphere_exclusion') + +You can now train your model with ``X_train``\ , optimize your model with ``X_val``\ , and measure its performance with ``X_test``. + +Evaluate the Impact of Splitting Algorithms on Regression Models +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For data with many features it can be difficult to visualize how different sampling algorithms change the distribution of data into training, validation, and testing like we do in some of the demo notebooks. +To aid in analyzing the impact of the algorithms, ``astartes`` provides ``generate_regression_results_dict``. +This function allows users to quickly evaluate the impact of different splitting techniques on any ``sklearn``\ -compatible model's performance. +All results are stored in a nested dictionary (\ ``{sampler:{metric:{split:score}}}``\ ) format and can be displayed in a neatly formatted table using the optional ``print_results`` argument. + +.. 
code-block:: python + + from sklearn.svm import LinearSVR + + from astartes.utils import generate_regression_results_dict as grrd + + sklearn_model = LinearSVR() + results_dict = grrd( + sklearn_model, + X, + y, + print_results=True, + ) + + Train Val Test + ---- -------- -------- -------- + MAE 1.41522 3.13435 2.17091 + RMSE 2.03062 3.73721 2.40041 + R2 0.90745 0.80787 0.78412 + +Additional metrics can be passed to ``generate_regression_results_dict`` via the ``additional_metrics`` argument, which should be a dictionary mapping the name of the metric (as a ``string``\ ) to the function itself, like this: + +.. code-block:: python + + from sklearn.metrics import mean_absolute_percentage_error + + add_met = {"mape": mean_absolute_percentage_error} + + grrd(sklearn_model, X, y, additional_metrics=add_met) + +See the docstring for ``generate_regression_results_dict`` (with ``help(generate_regression_results_dict)``\ ) for more information. + +Using ``astartes`` with Categorical Data +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Any of the implemented sampling algorithms whose hyperparameters allow specifying the ``metric`` or ``distance_metric`` (effectively ``1-metric``\ ) can be co-opted to work with categorical data. +Simply encode the data in a format compatible with the ``sklearn`` metric of choice and then call ``astartes`` with that metric specified: + +.. code-block:: python + + from sklearn.metrics import jaccard_score + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + sampler='kennard_stone', + hopts={"metric": jaccard_score}, + ) + +Other samplers which do not allow specifying a categorical distance metric did not provide a method for doing so in their original inception, though it is possible that they can be adapted for this application. +If you are interested in adding support for categorical metrics to an existing sampler, consider opening a `Feature Request `_\ ! 
+ +Access Sampling Algorithms Directly +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The sampling algorithms implemented in ``astartes`` can also be directly accessed and run if it is more useful for your applications. +In the example below, we import the Kennard Stone sampler, use it to partition a simple array, and then retrieve a sample. + +.. code-block:: python + + from astartes.samplers.interpolation import KennardStone + + kennard_stone = KennardStone([[1, 2], [3, 4], [5, 6]]) + first_2_samples = kennard_stone.get_sample_idxs(2) + +All samplers in ``astartes`` implement a ``_sample()`` method that is called by the constructor (i.e. greedily) and either a ``get_sample_idxs`` or ``get_cluster_idxs`` for interpolative and extrapolative samplers, respectively. +For more detail on the implementation and design of samplers in ``astartes``\ , see the `Developer Notes <#contributing--developer-notes>`_ section. + +Theory and Application of ``astartes`` +------------------------------------------ + +This section of the README details some of the theory behind why the algorithms implemented in ``astartes`` are important and some motivating examples. +For a comprehensive walkthrough of the theory and implementation of ``astartes``\ , follow `this link `_ to read the companion paper (freely available and hosted here on GitHub). + +.. + + **Note** + We reference open-access publications wherever possible. For articles locked behind a paywall (denoted with :small_blue_diamond:), we instead suggest reading `this Wikipedia page `_ and absolutely **not** attempting to bypass the paywall. + + +Rational Splitting Algorithms +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +While much machine learning is done with a random choice between training/validation/test data, an alternative is the use of so-called "rational" splitting algorithms. +These approaches use some similarity-based algorithm to divide data into sets. 
+Some of these algorithms include Kennard-Stone (\ `Kennard & Stone `_ :small_blue_diamond:), Sphere Exclusion (\ `Tropsha et al. `_ :small_blue_diamond:), as well as OptiSim, as discussed in `Applied Chemoinformatics: Achievements and Future Opportunities `_ :small_blue_diamond:. +Some clustering-based splitting techniques have also been incorporated, such as `DBSCAN `_. + +There are two broad categories of sampling algorithms implemented in ``astartes``\ : extrapolative and interpolative. +The former will force your model to predict on out-of-sample data, which creates a more challenging task than interpolative sampling. +See the table below for all of the sampling approaches currently implemented in ``astartes``\ , as well as the hyperparameters that each algorithm accepts (which are passed in with ``hopts``\ ) and a helpful reference for understanding how the hyperparameters work. +Note that ``random_state`` is defined as a keyword argument in ``train_test_split`` itself, even though these algorithms will use the ``random_state`` in their own work. +Do not provide a ``random_state`` in the ``hopts`` dictionary - it will be overwritten by the ``random_state`` you provide for ``train_test_split`` (or the default if none is provided). + +Implemented Sampling Algorithms +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: + :header-rows: 1 + + * - Sampler Name + - Usage String + - Type + - Hyperparameters + - Reference + - Notes + * - Random + - 'random' + - Interpolative + - ``shuffle`` + - `sklearn train_test_split `_ Documentation + - This sampler is a direct passthrough to ``sklearn``\ 's ``train_test_split``. + * - Kennard-Stone + - 'kennard_stone' + - Interpolative + - ``metric`` + - Original Paper by `Kennard & Stone `_ :small_blue_diamond: + - Euclidean distance is used by default, as described in the original paper. + * - Sample set Partitioning based on joint X-Y distances (SPXY) + - 'spxy' + - Interpolative + - ``distance_metric`` + - Saldhana et. 
al `original paper `_ :small_blue_diamond: + - Extension of Kennard Stone that also includes the response when sampling distances. + * - Mahalanobis Distance Kennard Stone (MDKS) + - 'spxy' *(MDKS is derived from SPXY)* + - Interpolative + - *none, see Notes* + - Saptoro et. al `original paper `_ + - MDKS is SPXY using Mahalanobis distance and can be called by using SPXY with ``distance_metric="mahalanobis"`` + * - Scaffold + - 'scaffold' + - Extrapolative + - ``include_chirality`` + - `Bemis-Murcko Scaffold `_ :small_blue_diamond: as implemented in RDKit + - This sampler requires SMILES strings as input (use the ``molecules`` subpackage) + * - Sphere Exclusion + - 'sphere_exclusion' + - Extrapolative + - ``metric``\ , ``distance_cutoff`` + - *custom implementation* + - Variation on Sphere Exclusion for arbitrary-valued vectors. + * - Time Based + - 'time_based' + - Extrapolative + - *none* + - Papers using Time based splitting: `Chen et al. `_ :small_blue_diamond:, `Sheridan, R. P `_ :small_blue_diamond:, `Feinberg et al. `_ :small_blue_diamond:, `Struble et al. `_ + - This sampler requires ``labels`` to be an iterable of either date or datetime objects. + * - Optimizable K-Dissimilarity Selection (OptiSim) + - 'optisim' + - Extrapolative + - ``n_clusters``\ , ``max_subsample_size``\ , ``distance_cutoff`` + - *custom implementation* + - Variation on `OptiSim `_ for arbitrary-valued vectors. + * - K-Means + - 'kmeans' + - Extrapolative + - ``n_clusters``\ , ``n_init`` + - `\ ``sklearn KMeans`` `_ + - Passthrough to ``sklearn``\ 's ``KMeans``. + * - Density-Based Spatial Clustering of Applications with Noise (DBSCAN) + - 'dbscan' + - Extrapolative + - ``eps``\ , ``min_samples``\ , ``algorithm``\ , ``metric``\ , ``leaf_size`` + - `\ ``sklearn DBSCAN`` `_ Documentation + - Passthrough to ``sklearn``\ 's ``DBSCAN``. 
+ * - Minimum Test Set Dissimilarity (MTSD) + - ~ + - ~ + - *upcoming in* ``astartes`` *v1.x* + - ~ + - ~ + * - Restricted Boltzmann Machine (RBM) + - ~ + - ~ + - *upcoming in* ``astartes`` *v1.x* + - ~ + - ~ + * - Kohonen Self-Organizing Map (SOM) + - ~ + - ~ + - *upcoming in* ``astartes`` *v1.x* + - ~ + - ~ + * - SPlit Method + - ~ + - ~ + - *upcoming in* ``astartes`` *v1.x* + - ~ + - ~ + + +Domain-Specific Applications +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Below are some field specific applications of ``astartes``. Interested in adding a new sampling algorithm or featurization approach? See `\ ``CONTRIBUTING.md`` <./CONTRIBUTING.md>`_. + +Chemical Data and the ``astartes.molecules`` Subpackage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Machine Learning is enormously useful in chemistry-related fields due to the high-dimensional feature space of chemical data. +To properly apply ML to chemical data for inference *or* discovery, it is important to know a model's accuracy under the two domains. +To simplify the process of partitioning chemical data, ``astartes`` implements a pre-built featurizer for common chemistry data formats. +After installing with ``pip install astartes[molecules]`` one can import the new train/test splitting function like this: ``from astartes.molecules import train_test_split_molecules`` + +The usage of this function is identical to ``train_test_split`` but with the addition of new arguments to control how the molecules are featurized: + +.. 
code-block:: python + + train_test_split_molecules( + molecules=smiles, + y=y, + test_size=0.2, + train_size=0.8, + fingerprint="daylight_fingerprint", + fprints_hopts={ + "minPath": 2, + "maxPath": 5, + "fpSize": 200, + "bitsPerHash": 4, + "useHs": 1, + "tgtDensity": 0.4, + "minSize": 64, + }, + sampler="random", + random_state=42, + hopts={ + "shuffle": True, + }, + ) + +To see a complete example of using ``train_test_split_molecules`` with actual chemical data, take a look in the ``examples`` directory and the brief `companion paper `_. + +Configuration options for the featurization scheme can be found in the documentation for `AIMSim `_ though most of the critical configuration options are shown above. + +Reproducibility +--------------- + +``astartes`` aims to be completely reproducible across different platforms, Python versions, and dependency configurations - any version of ``astartes`` v1.x should result in the *exact* same splits, always. +To that end, the default behavior of ``astartes`` is to use ``42`` as the random seed and *always* set it. +Running ``astartes`` with the default settings will always produce the exact same results. +We have verified this behavior on Debian Ubuntu, Windows, and Intel Macs from Python versions 3.7 through 3.11 (with appropriate dependencies for each version). + +Known Reproducibility Limitations +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Inevitably, external dependencies of ``astartes`` will introduce backwards-incompatible changes. +We continually run regression tests to catch these, and will list all *known* limitations here: + + +* ``sklearn`` v1.3.0 introduced backwards-incompatible changes in the ``KMeans`` sampler that changed how the random initialization affects the results, even given the same random seed. Different versions of ``sklearn`` will affect the performance of ``astartes``\ , and we recommend reporting the exact versions of ``scikit-learn`` and ``astartes`` used, when applicable. + +.. 
+ + **Note** + We are limited in our ability to test on M1 Macs, but from our limited manual testing we achieve perfect reproducibility in all cases *except occasionally* with ``KMeans`` on Apple silicon. + ``astartes`` is still consistent between runs on the same platform in all cases, and other samplers are not impacted by this apparent bug. + + +How to Cite +----------- + +If you use ``astartes`` in your work, please follow the link below to our (Open Access!) paper in the Journal of Open Source Software or use the "Cite this repository" button on GitHub. + +`Machine Learning Validation via Rational Dataset Sampling with astartes `_ + +Contributing & Developer Notes +------------------------------ + +See `CONTRIBUTING.md <./CONTRIBUTING.md>`_ for instructions on installing ``astartes`` for development, making a contribution, and general guidance on the design of ``astartes``. diff --git a/docs/_sources/CONTRIBUTING.rst.txt b/docs/_sources/CONTRIBUTING.rst.txt new file mode 100644 index 00000000..677944a2 --- /dev/null +++ b/docs/_sources/CONTRIBUTING.rst.txt @@ -0,0 +1,143 @@ + +Contributing & Developer Notes +------------------------------ + +Pull Requests, Bug Reports, and all Contributions are welcome, encouraged, and appreciated! +Please use the appropriate `issue `_ or `pull request `_ template when making a contribution to help the maintainers get it merged quickly. + +We make use of `the GitHub Discussions page `_ to go over potential features to add. +Please feel free to stop by if you are looking for something to develop or have an idea for a useful feature! + +When submitting a PR, please mark your PR with the "PR Ready for Review" label when you are finished making changes so that the GitHub actions bots can work their magic! + +Developer Install +^^^^^^^^^^^^^^^^^ + +To contribute to the ``astartes`` source code, start by forking and then cloning the repository (i.e. 
``git clone git@github.com:YourUsername/astartes.git``\ ) and then inside the repository run ``pip install -e .[dev]``. This will set you up with all the required dependencies to run ``astartes`` and conform to our formatting standards (\ ``black`` and ``isort``\ ), which you can configure to run automatically in VSCode `like this `_. + +.. + + **Warning** + Windows (PowerShell) and MacOS Catalina or newer (zsh) require double quotes around the ``[]`` characters (i.e. ``pip install "astartes[dev]"``\ ) + + +Version Checking +^^^^^^^^^^^^^^^^ + +``astartes`` uses ``pyproject.toml`` to specify all metadata, but the version is also specified in ``astartes/__init__.py`` (via ``__version__``\ ) for backwards compatibility with Python 3.7. +To check which version of ``astartes`` you have installed, you can run ``python -c "import astartes; print(astartes.__version__)"`` on Python 3.7 or ``python -c "from importlib.metadata import version; print(version('astartes'))"`` on Python 3.8 or newer. + +Testing +^^^^^^^ + +All of the tests in ``astartes`` are written using the built-in python ``unittest`` module (to allow running without ``pytest``\ ) but we *highly* recommend using ``pytest``. +To execute the tests from the ``astartes`` repository, simply type ``pytest`` after running the developer install (or alternately, ``pytest -v`` for a more helpful output). +On GitHub, we use actions to run the tests on every Pull Request and on a nightly basis (look in ``.github/workflows`` for more information). +These tests include unit tests, functional tests, and regression tests. + +Adding New Samplers +^^^^^^^^^^^^^^^^^^^ + +A new sampler should extend the abstract base class in ``abstract_sampler.py``. +Each subclass should override the ``_sample`` method with its own algorithm for data partitioning and optionally the ``_before_sample`` method to perform any data validation. + +All samplers in ``astartes`` are classified as one of two types: extrapolative or interpolative. 
+Extrapolative samplers work by clustering data into groups (which are then partitioned into train/validation/test to enforce extrapolation) whereas interpolative samplers provide an exact *order* in which samples should be moved into the training set. + +When actually implemented, this means that extrapolative samplers should set the ``self._samples_clusters`` attribute and interpolative samplers should set the ``self._samples_idxs`` attribute. + +A new sampler can be as simple as a passthrough to another ``train_test_split``\ , or it can be an original implementation that results in X and y being split into two lists. Take a look at ``astartes/samplers/interpolation/random_split.py`` for a basic example! + +After the sampler has been implemented, add it to ``__init__.py`` in ``astartes/samplers`` and it will automatically be unit tested. Additional unit tests to verify that hyperparameters can be properly passed, etc. are also recommended. + +For historical reasons, and as a guide for any developers who would like to add new samplers, below is a running list of samplers which have been *considered* for addition to ``astartes`` but ultimately not added for various reasons. + +Not Implemented Sampling Algorithms +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: + :header-rows: 1 + + * - Sampler Name + - Reasoning + - Relevant Link(s) + * - D-Optimal + - Requires *a-priori* knowledge of the test and train size which does not fit in the ``astartes`` framework (samplers are all agnostic to the size of the sets) and it is questionable if the use of the Fisher information matrix is actually meaningful in the context of sampling existing data rather than tuning for ideal data. + - The `Wikipedia article for optimal design `_ does a good job explaining why this is difficult, and points at some potential alternatives. 
+ * - Duplex + - Requires knowing test and train size before execution, and can only partition data into two sets which would make it incompatible with ``train_val_test_split``. + - This `implementation in R `_ includes helpful references and a reference implementation. + + +Adding New Featurization Schemes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +All of the sampling methods implemented in ``astartes`` accept arbitrary arrays of numbers and return the sampled groups (with the exception of ``Scaffold.py``\ ). If you have an existing featurization scheme (i.e. take an arbitrary input and turn it into an array of numbers), we would be thrilled to include it in ``astartes``. + +Adding a new interface should take on this format: + +.. code-block:: python + + from astartes import train_test_split + + def train_test_split_INTERFACE( + INTERFACE_input, + INTERFACE_ARGS, + y: np.array = None, + labels: np.array = None, + test_size: float = 0.25, + train_size: float = 0.75, + splitter: str = 'random', + hopts: dict = {}, + INTERFACE_hopts: dict = {}, + ): + # turn the INTERFACE_input into an input X + # based on INTERFACE ARGS where INTERFACE_hopts + # specifies additional behavior + X = [] + + # call train test split with this input + return train_test_split( + X, + y=y, + labels=labels, + test_size=test_size, + train_size=train_size, + splitter=splitter, + hopts=hopts, + ) + +If possible, we would like to also add an example Jupyter Notebook with any new interface to demonstrate to new users how it functions. See our other examples in the ``examples`` directory. + +Contact `@JacksonBurns `_ if you need assistance adding an existing workflow to ``astartes``. If this featurization scheme requires additional dependencies to function, we may add it as an additional *extra* package in the same way that ``molecules`` is installed. + +The ``train_val_test_split`` Function +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +``train_val_test_split`` is the workhorse function of ``astartes``. 
+It is responsible for instantiating the sampling algorithm, partitioning the data into training, validation, and testing sets, and then returning the requested results while also keeping an eye on data types. +Under the hood, ``train_test_split`` is just calling ``train_val_test_split`` with ``val_size`` set to ``0.0``. +For more information on how it works, check out the inline documentation in ``astartes/main.py``. + +Development Philosophy +^^^^^^^^^^^^^^^^^^^^^^ + +The developers of ``astartes`` prioritize (1) reproducibility, (2) flexibility, and (3) maintainability. + + +#. All versions of ``astartes`` ``1.x`` should produce the same results across all platforms, so we have thorough unit and regression testing run on a continuous basis. +#. We specify as *few dependencies as possible* with the *loosest possible* dependency requirements, which allows integrating ``astartes`` with other tools more easily. + + * Dependencies which introduce a lot of requirements and/or specific versions of requirements are shuffled into the ``extras_require`` to avoid weighing down the main package. + * Compatibility with all versions of modern Python is achieved by not tightly specifying version numbers as well as by regression testing across all versions. + +#. We follow DRY (Don't Repeat Yourself) principles to avoid code duplication and decrease maintenance burden, have near-perfect test coverage, and enforce consistent formatting style in the source code. + + * Inline comments are *critical* for maintainability - at the time of writing, ``astartes`` has 1 comment line for every 2 lines of source code. + +JOSS Branch +----------- + +The JOSS paper corresponding to ``astartes`` is stored in this repository on a separate branch. You can find ``paper.md`` on the aptly named ``joss-paper`` branch. + +*Note for Maintainers*\ : To push changes from the ``main`` branch into the ``joss-paper`` branch, run the ``Update JOSS Branch`` workflow. 
diff --git a/docs/_sources/README.rst.txt b/docs/_sources/README.rst.txt new file mode 100644 index 00000000..cda047e5 --- /dev/null +++ b/docs/_sources/README.rst.txt @@ -0,0 +1,486 @@ +.. role:: raw-html-m2r(raw) + :format: html + + +:raw-html-m2r:`

astartes

` + +:raw-html-m2r:`

(as-tar-tees)

` + + +.. raw:: html + +

Train:Validation:Test Algorithmic Sampling for Molecules and Arbitrary Arrays

+ + +:raw-html-m2r:`

+ astarteslogo +

` + + +.. raw:: html + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +

Status Badges

UsageContinuous IntegrationRelease
PyPI - Python VersionReproduce PaperpyOpenSci approved
PyPI - LicenseTest StatusDOI
PyPI - Total DownloadsPyPI conda-forge version
GitHub Repo StarsProject Status: Active – The project has reached a stable, usable state and is being actively developed.
+
+ + +Online Documentation +-------------------- + +Follow `this link `_ for a nicely-rendered version of this README along with additional tutorials for `moving from train_test_split in sklearn to astartes `_. +Keep reading for an installation guide and links to tutorials! + +Installing ``astartes`` +--------------------------- + +We recommend installing ``astartes`` within a virtual environment, using either ``venv`` or ``conda`` (or other tools) to simplify dependency management. Python versions 3.7, 3.8, 3.9, 3.10, 3.11, and 3.12 are supported on all platforms. + +.. + + **Warning** + Windows (PowerShell) and MacOS Catalina or newer (zsh) require double quotes around text using the ``'[]'`` characters (i.e. ``pip install "astartes[molecules]"``\ ). + + +``pip`` +^^^^^^^^^^^ + +``astartes`` is available on ``PyPI`` and can be installed using ``pip``\ : + + +* To include the featurization options for chemical data, use ``pip install astartes[molecules]``. +* To install only the sampling algorithms, use ``pip install astartes`` (this install will have fewer dependencies and may be more readily compatible in environments with existing workflows). + +``conda`` +^^^^^^^^^^^^^ + +The ``astartes`` package is also available on ``conda-forge`` and can be installed with this command: ``conda install -c conda-forge astartes``. +To install ``astartes`` with support for featurizing molecules, use: ``conda install -c conda-forge astartes aimsim``. +This will download the base ``astartes`` package as well as ``aimsim``\ , which is the backend used for molecular featurization. + +Source +^^^^^^ + +To install ``astartes`` from source for development, see the `Contributing & Developer Notes <#contributing--developer-notes>`_ section. 
+ +Statement of Need +----------------- + +Machine learning has sparked an explosion of progress in chemical kinetics, materials science, and many other fields as researchers use data-driven methods to accelerate steps in traditional workflows within some acceptable error tolerance. +To facilitate adoption of these models, there are two important tasks to consider: + + +#. use a validation set when selecting the optimal hyperparameter for the model and separately use a held-out test set to measure performance on unseen data. +#. evaluate model performance on both interpolative and extrapolative tasks so future users are informed of any potential limitations. + +``astartes`` addresses both of these points by implementing an ``sklearn``\ -compatible ``train_val_test_split`` function. +Additional technical detail is provided below as well as in our companion paper in the Journal of Open Source Software: `Machine Learning Validation via Rational Dataset Sampling with astartes `_. +For a demo-based explainer using machine learning on a fast food menu, see the ``astartes`` Reproducible Notebook published at the United States Research Software Engineers Conference at `this page `_. + +Target Audience +^^^^^^^^^^^^^^^ + +``astartes`` is generally applicable to machine learning involving both discovery and inference *and* model validation. +There are specific functions in ``astartes`` for applications in cheminformatics (\ ``astartes.molecules``\ ) but the methods implemented are general to all numerical data. + +Quick Start +----------- + +``astartes`` is designed as a drop-in replacement for ``sklearn``\ 's ``train_test_split`` function (see the `sklearn documentation `_\ ). To switch to ``astartes``\ , change ``from sklearn.model_selection import train_test_split`` to ``from astartes import train_test_split``. + +Like ``sklearn``\ , ``astartes`` accepts any iterable object as ``X``\ , ``y``\ , and ``labels``. 
+Each will be converted to a ``numpy`` array for internal operations, and returned as a ``numpy`` array with limited exceptions: if ``X`` is a ``pandas`` ``DataFrame``\ , ``y`` is a ``Series``\ , or ``labels`` is a ``Series``\ , ``astartes`` will cast it back to its original type including its index and column names. + +.. + + **Note** + The developers recommend passing ``X``\ , ``y``\ , and ``labels`` as ``numpy`` arrays and handling the conversion to and from other types explicitly on your own. Behind-the-scenes type casting can lead to unexpected behavior! + + +By default, ``astartes`` will split data randomly. Additionally, a variety of algorithmic sampling approaches can be used by specifying the ``sampler`` argument to the function (see the `Table of Implemented Samplers <#implemented-sampling-algorithms>`_ for a complete list of options and their corresponding references): + +.. code-block:: python + + from sklearn.datasets import load_diabetes + + from astartes import train_test_split + + X, y = load_diabetes(return_X_y=True) + + X_train, X_test, y_train, y_test = train_test_split( + X, # preferably numpy arrays, but astartes will cast it for you + y, + sampler='kennard_stone', # any of the supported samplers + ) + +.. + + **Note** + Extrapolation sampling algorithms will return an additional set of arrays (the cluster labels) which will result in a ``ValueError: too many values to unpack`` if not called properly. See the `\ ``split_comparisons`` Google Colab demo `_ for a full explanation. + + +That's all you need to get started with ``astartes``\ ! +The next sections include more examples and some demo notebooks you can try in your browser. + +Example Notebooks +^^^^^^^^^^^^^^^^^ + +Click the badges in the table below to be taken to a live, interactive demo of ``astartes``\ : + +.. list-table:: + :header-rows: 1 + + * - Demo + - Topic + - Link + * - Comparing Sampling Algorithms with Fast Food + - Visual representations of how different samplers affect data partitioning + - + .. 
image:: https://colab.research.google.com/assets/colab-badge.svg + :target: https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/split_comparisons/split_comparisons.ipynb + :alt: Colab + + * - Using ``train_val_test_split`` with the ``sklearn`` example datasets + - Demonstrating how withholding a test set with ``train_val_test_split`` can impact performance + - + .. image:: https://colab.research.google.com/assets/colab-badge.svg + :target: https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/train_val_test_split_sklearn_example/train_val_test_split_example.ipynb + :alt: Colab + + * - Cheminformatics sample set partitioning with ``astartes`` + - Extrapolation vs. Interpolation impact on cheminformatics model accuracy + - + .. image:: https://colab.research.google.com/assets/colab-badge.svg + :target: https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/barrier_prediction_with_RDB7/RDB7_barrier_prediction_example.ipynb + :alt: Colab + + * - Comparing partitioning approaches for alkanes + - Visualizing how samplers impact model performance with simple chemicals + - + .. image:: https://colab.research.google.com/assets/colab-badge.svg + :target: https://colab.research.google.com/github/JacksonBurns/astartes/blob/main/examples/mlpds_2023_astartes_demonstration/mlpds_2023_demo.ipynb + :alt: Colab + + + +To execute these notebooks locally, clone this repository (i.e. ``git clone https://github.com/JacksonBurns/astartes.git``\ ), navigate to the ``astartes`` directory, run ``pip install .[demos]``\ , then open and run the notebooks in your preferred editor. +You do *not* need to execute the cells prefixed with ``%%capture`` - they are only present for compatibility with Google Colab. 
+ +Withhold Testing Data with ``train_val_test_split`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For rigorous ML research, it is critical to withhold some data during training to use as a ``test`` set. +The model should *never* see this data during training (unlike the validation set) so that we can get an accurate measurement of its performance. + +With ``astartes``\ , this three-way data split is readily available through ``train_val_test_split``\ : + +.. code-block:: python + + from astartes import train_val_test_split + + X_train, X_val, X_test = train_val_test_split(X, sampler = 'sphere_exclusion') + +You can now train your model with ``X_train``\ , optimize your model with ``X_val``\ , and measure its performance with ``X_test``. + +Evaluate the Impact of Splitting Algorithms on Regression Models +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For data with many features, it can be difficult to visualize how different sampling algorithms change the distribution of data into training, validation, and testing sets, as we do in some of the demo notebooks. +To aid in analyzing the impact of the algorithms, ``astartes`` provides ``generate_regression_results_dict``. +This function allows users to quickly evaluate the impact of different splitting techniques on any ``sklearn``\ -compatible model's performance. +All results are stored in a nested dictionary (\ ``{sampler:{metric:{split:score}}}``\ ) format and can be displayed in a neatly formatted table using the optional ``print_results`` argument. + +.. 
code-block:: python + + from sklearn.svm import LinearSVR + + from astartes.utils import generate_regression_results_dict as grrd + + sklearn_model = LinearSVR() + results_dict = grrd( + sklearn_model, + X, + y, + print_results=True, + ) + + Train Val Test + ---- -------- -------- -------- + MAE 1.41522 3.13435 2.17091 + RMSE 2.03062 3.73721 2.40041 + R2 0.90745 0.80787 0.78412 + +Additional metrics can be passed to ``generate_regression_results_dict`` via the ``additional_metrics`` argument, which should be a dictionary mapping the name of the metric (as a ``string``\ ) to the function itself, like this: + +.. code-block:: python + + from sklearn.metrics import mean_absolute_percentage_error + + add_met = {"mape": mean_absolute_percentage_error} + + grrd(sklearn_model, X, y, additional_metrics=add_met) + +See the docstring for ``generate_regression_results_dict`` (with ``help(generate_regression_results_dict)``\ ) for more information. + +Using ``astartes`` with Categorical Data +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Any of the implemented sampling algorithms whose hyperparameters allow specifying the ``metric`` or ``distance_metric`` (effectively ``1-metric``\ ) can be co-opted to work with categorical data. +Simply encode the data in a format compatible with the ``sklearn`` metric of choice and then call ``astartes`` with that metric specified: + +.. code-block:: python + + from sklearn.metrics import jaccard_score + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + sampler='kennard_stone', + hopts={"metric": jaccard_score}, + ) + +Other samplers which do not allow specifying a categorical distance metric did not provide a method for doing so in their original inception, though it is possible that they can be adapted for this application. +If you are interested in adding support for categorical metrics to an existing sampler, consider opening a `Feature Request `_\ ! 
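To make the "encode the data, then choose a metric" step for categorical data concrete, here is a minimal sketch that does not call ``astartes`` at all. The ``colors`` and ``sizes`` arrays and the ``one_hot`` helper are hypothetical names used only for illustration; the sketch one-hot encodes two categorical columns into a boolean matrix and computes pairwise Jaccard distances with ``scipy``:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# hypothetical categorical data: one colour and one size per sample
colors = np.array(["red", "red", "blue", "green"])
sizes = np.array(["S", "S", "S", "L"])


def one_hot(column):
    """Return a boolean one-hot matrix with one column per category."""
    categories = np.unique(column)
    return column[:, None] == categories[None, :]


# stack the encoded columns into a single boolean feature matrix
X = np.hstack([one_hot(colors), one_hot(sizes)])

# Jaccard *distance* (1 - Jaccard similarity) between every pair of rows
distances = squareform(pdist(X, metric="jaccard"))
```

Inspecting a small matrix like ``distances`` is a quick way to check that a metric behaves sensibly on your encoding (identical samples at distance 0, samples sharing no category at distance 1) before handing that metric to a sampler. Whether a particular sampler accepts a given metric name or callable should be confirmed against that sampler's documentation.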
+ +Access Sampling Algorithms Directly +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The sampling algorithms implemented in ``astartes`` can also be directly accessed and run if it is more useful for your applications. +In the example below, we import the Kennard Stone sampler, use it to partition a simple array, and then retrieve a sample. + +.. code-block:: python + + from astartes.samplers.interpolation import KennardStone + + kennard_stone = KennardStone([[1, 2], [3, 4], [5, 6]]) + first_2_samples = kennard_stone.get_sample_idxs(2) + +All samplers in ``astartes`` implement a ``_sample()`` method that is called by the constructor (i.e. greedily) and either a ``get_sample_idxs`` or ``get_cluster_idxs`` for interpolative and extrapolative samplers, respectively. +For more detail on the implementation and design of samplers in ``astartes``\ , see the `Developer Notes <#contributing--developer-notes>`_ section. + +Theory and Application of ``astartes`` +------------------------------------------ + +This section of the README details some of the theory behind why the algorithms implemented in ``astartes`` are important and some motivating examples. +For a comprehensive walkthrough of the theory and implementation of ``astartes``\ , follow `this link `_ to read the companion paper (freely available and hosted here on GitHub). + +.. + + **Note** + We reference open-access publications wherever possible. For articles locked behind a paywall (denoted with :small_blue_diamond:), we instead suggest reading `this Wikipedia page `_ and absolutely **not** attempting to bypass the paywall. + + +Rational Splitting Algorithms +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +While much machine learning is done with a random choice between training/validation/test data, an alternative is the use of so-called "rational" splitting algorithms. +These approaches use some similarity-based algorithm to divide data into sets. 
+Some of these algorithms include Kennard-Stone (\ `Kennard & Stone `_ :small_blue_diamond:), Sphere Exclusion (\ `Tropsha et al. `_ :small_blue_diamond:), as well as OptiSim as discussed in `Applied Chemoinformatics: Achievements and Future Opportunities `_ :small_blue_diamond:. +Some clustering-based splitting techniques have also been incorporated, such as `DBSCAN `_. + +There are two broad categories of sampling algorithms implemented in ``astartes``\ : extrapolative and interpolative. +The former will force your model to predict on out-of-sample data, which creates a more challenging task than interpolative sampling. +See the table below for all of the sampling approaches currently implemented in ``astartes``\ , as well as the hyperparameters that each algorithm accepts (which are passed in with ``hopts``\ ) and a helpful reference for understanding how the hyperparameters work. +Note that ``random_state`` is defined as a keyword argument in ``train_test_split`` itself, even though these algorithms will use the ``random_state`` in their own work. +Do not provide a ``random_state`` in the ``hopts`` dictionary - it will be overwritten by the ``random_state`` you provide for ``train_test_split`` (or the default if none is provided). + +Implemented Sampling Algorithms +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: + :header-rows: 1 + + * - Sampler Name + - Usage String + - Type + - Hyperparameters + - Reference + - Notes + * - Random + - 'random' + - Interpolative + - ``shuffle`` + - `sklearn train_test_split `_ Documentation + - This sampler is a direct passthrough to ``sklearn``\ 's ``train_test_split``. + * - Kennard-Stone + - 'kennard_stone' + - Interpolative + - ``metric`` + - Original Paper by `Kennard & Stone `_ :small_blue_diamond: + - Euclidean distance is used by default, as described in the original paper. + * - Sample set Partitioning based on joint X-Y distances (SPXY) + - 'spxy' + - Interpolative + - ``distance_metric`` + - Saldhana et al. 
`original paper `_ :small_blue_diamond: + - Extension of Kennard Stone that also includes the response when sampling distances. + * - Mahalanobis Distance Kennard Stone (MDKS) + - 'spxy' *(MDKS is derived from SPXY)* + - Interpolative + - *none, see Notes* + - Saptoro et al. `original paper `_ + - MDKS is SPXY using Mahalanobis distance and can be called by using SPXY with ``distance_metric="mahalanobis"`` + * - Scaffold + - 'scaffold' + - Extrapolative + - ``include_chirality`` + - `Bemis-Murcko Scaffold `_ :small_blue_diamond: as implemented in RDKit + - This sampler requires SMILES strings as input (use the ``molecules`` subpackage) + * - Sphere Exclusion + - 'sphere_exclusion' + - Extrapolative + - ``metric``\ , ``distance_cutoff`` + - *custom implementation* + - Variation on Sphere Exclusion for arbitrary-valued vectors. + * - Time Based + - 'time_based' + - Extrapolative + - *none* + - Papers using Time based splitting: `Chen et al. `_ :small_blue_diamond:, `Sheridan, R. P `_ :small_blue_diamond:, `Feinberg et al. `_ :small_blue_diamond:, `Struble et al. `_ + - This sampler requires ``labels`` to be an iterable of either date or datetime objects. + * - Optimizable K-Dissimilarity Selection (OptiSim) + - 'optisim' + - Extrapolative + - ``n_clusters``\ , ``max_subsample_size``\ , ``distance_cutoff`` + - *custom implementation* + - Variation on `OptiSim `_ for arbitrary-valued vectors. + * - K-Means + - 'kmeans' + - Extrapolative + - ``n_clusters``\ , ``n_init`` + - `\ ``sklearn KMeans`` `_ + - Passthrough to ``sklearn``\ 's ``KMeans``. + * - Density-Based Spatial Clustering of Applications with Noise (DBSCAN) + - 'dbscan' + - Extrapolative + - ``eps``\ , ``min_samples``\ , ``algorithm``\ , ``metric``\ , ``leaf_size`` + - `\ ``sklearn DBSCAN`` `_ Documentation + - Passthrough to ``sklearn``\ 's ``DBSCAN``. 
+ * - Minimum Test Set Dissimilarity (MTSD) + - ~ + - ~ + - *upcoming in* ``astartes`` *v1.x* + - ~ + - ~ + * - Restricted Boltzmann Machine (RBM) + - ~ + - ~ + - *upcoming in* ``astartes`` *v1.x* + - ~ + - ~ + * - Kohonen Self-Organizing Map (SOM) + - ~ + - ~ + - *upcoming in* ``astartes`` *v1.x* + - ~ + - ~ + * - SPlit Method + - ~ + - ~ + - *upcoming in* ``astartes`` *v1.x* + - ~ + - ~ + + +Domain-Specific Applications +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Below are some field specific applications of ``astartes``. Interested in adding a new sampling algorithm or featurization approach? See `\ ``CONTRIBUTING.md`` <./CONTRIBUTING.md>`_. + +Chemical Data and the ``astartes.molecules`` Subpackage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Machine Learning is enormously useful in chemistry-related fields due to the high-dimensional feature space of chemical data. +To properly apply ML to chemical data for inference *or* discovery, it is important to know a model's accuracy under the two domains. +To simplify the process of partitioning chemical data, ``astartes`` implements a pre-built featurizer for common chemistry data formats. +After installing with ``pip install astartes[molecules]`` one can import the new train/test splitting function like this: ``from astartes.molecules import train_test_split_molecules`` + +The usage of this function is identical to ``train_test_split`` but with the addition of new arguments to control how the molecules are featurized: + +.. 
code-block:: python + + train_test_split_molecules( + molecules=smiles, + y=y, + test_size=0.2, + train_size=0.8, + fingerprint="daylight_fingerprint", + fprints_hopts={ + "minPath": 2, + "maxPath": 5, + "fpSize": 200, + "bitsPerHash": 4, + "useHs": 1, + "tgtDensity": 0.4, + "minSize": 64, + }, + sampler="random", + random_state=42, + hopts={ + "shuffle": True, + }, + ) + +To see a complete example of using ``train_test_split_molecules`` with actual chemical data, take a look in the ``examples`` directory and the brief `companion paper `_. + +Configuration options for the featurization scheme can be found in the documentation for `AIMSim `_ though most of the critical configuration options are shown above. + +Reproducibility +--------------- + +``astartes`` aims to be completely reproducible across different platforms, Python versions, and dependency configurations - any version of ``astartes`` v1.x should result in the *exact* same splits, always. +To that end, the default behavior of ``astartes`` is to use ``42`` as the random seed and *always* set it. +Running ``astartes`` with the default settings will always produce the exact same results. +We have verified this behavior on Debian Ubuntu, Windows, and Intel Macs with Python versions 3.7 through 3.11 (with appropriate dependencies for each version). + +Known Reproducibility Limitations +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Inevitably, external dependencies of ``astartes`` will introduce backwards-incompatible changes. +We continually run regression tests to catch these, and will list all *known* limitations here: + + +* ``sklearn`` v1.3.0 introduced backwards-incompatible changes in the ``KMeans`` sampler that changed how the random initialization affects the results, even given the same random seed. Different versions of ``sklearn`` will affect the performance of ``astartes`` and we recommend including the exact version of ``scikit-learn`` and ``astartes`` used, when applicable. + +.. 
+ + **Note** + We are limited in our ability to test on M1 Macs, but from our limited manual testing we achieve perfect reproducibility in all cases *except occasionally* with ``KMeans`` on Apple silicon. + ``astartes`` is still consistent between runs on the same platform in all cases, and other samplers are not impacted by this apparent bug. + + +How to Cite +----------- + +If you use ``astartes`` in your work, please follow the link below to our (Open Access!) paper in the Journal of Open Source Software or use the "Cite this repository" button on GitHub. + +`Machine Learning Validation via Rational Dataset Sampling with astartes `_ + +Contributing & Developer Notes +------------------------------ + +See `CONTRIBUTING.md <./CONTRIBUTING.md>`_ for instructions on installing ``astartes`` for development, making a contribution, and general guidance on the design of ``astartes``. diff --git a/docs/_sources/astartes.rst.txt b/docs/_sources/astartes.rst.txt index 2d827941..4af4818c 100644 --- a/docs/_sources/astartes.rst.txt +++ b/docs/_sources/astartes.rst.txt @@ -7,16 +7,24 @@ Subpackages .. toctree:: :maxdepth: 4 - astartes.interfaces astartes.samplers + astartes.utils Submodules ---------- -astartes.astartes module ------------------------- +astartes.main module +-------------------- -.. automodule:: astartes.astartes +.. automodule:: astartes.main + :members: + :undoc-members: + :show-inheritance: + +astartes.molecules module +------------------------- + +.. 
automodule:: astartes.molecules :members: :undoc-members: :show-inheritance: diff --git a/docs/_sources/astartes.samplers.extrapolation.rst.txt b/docs/_sources/astartes.samplers.extrapolation.rst.txt new file mode 100644 index 00000000..6fd2f0b8 --- /dev/null +++ b/docs/_sources/astartes.samplers.extrapolation.rst.txt @@ -0,0 +1,61 @@ +astartes.samplers.extrapolation package +======================================= + +Submodules +---------- + +astartes.samplers.extrapolation.dbscan module +--------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.dbscan + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.kmeans module +--------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.kmeans + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.optisim module +---------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.optisim + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.scaffold module +----------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.scaffold + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.sphere\_exclusion module +-------------------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.sphere_exclusion + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.time\_based module +-------------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.time_based + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. 
automodule:: astartes.samplers.extrapolation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/astartes.samplers.interpolation.rst.txt b/docs/_sources/astartes.samplers.interpolation.rst.txt new file mode 100644 index 00000000..91f887d4 --- /dev/null +++ b/docs/_sources/astartes.samplers.interpolation.rst.txt @@ -0,0 +1,37 @@ +astartes.samplers.interpolation package +======================================= + +Submodules +---------- + +astartes.samplers.interpolation.kennardstone module +--------------------------------------------------- + +.. automodule:: astartes.samplers.interpolation.kennardstone + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.interpolation.random\_split module +---------------------------------------------------- + +.. automodule:: astartes.samplers.interpolation.random_split + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.interpolation.spxy module +------------------------------------------- + +.. automodule:: astartes.samplers.interpolation.spxy + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: astartes.samplers.interpolation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/astartes.samplers.rst.txt b/docs/_sources/astartes.samplers.rst.txt index 1d92b78a..0caaf02a 100644 --- a/docs/_sources/astartes.samplers.rst.txt +++ b/docs/_sources/astartes.samplers.rst.txt @@ -1,69 +1,22 @@ astartes.samplers package ========================= -Submodules ----------- - -astartes.samplers.dbscan module -------------------------------- - -.. automodule:: astartes.samplers.dbscan - :members: - :undoc-members: - :show-inheritance: - -astartes.samplers.doptimal module ---------------------------------- - -.. automodule:: astartes.samplers.doptimal - :members: - :undoc-members: - :show-inheritance: - -astartes.samplers.duplex module -------------------------------- - -.. 
automodule:: astartes.samplers.duplex - :members: - :undoc-members: - :show-inheritance: - -astartes.samplers.kennard\_stone module ---------------------------------------- +Subpackages +----------- -.. automodule:: astartes.samplers.kennard_stone - :members: - :undoc-members: - :show-inheritance: - -astartes.samplers.optisim module --------------------------------- - -.. automodule:: astartes.samplers.optisim - :members: - :undoc-members: - :show-inheritance: - -astartes.samplers.random module -------------------------------- +.. toctree:: + :maxdepth: 4 -.. automodule:: astartes.samplers.random - :members: - :undoc-members: - :show-inheritance: + astartes.samplers.extrapolation + astartes.samplers.interpolation -astartes.samplers.sampler module --------------------------------- - -.. automodule:: astartes.samplers.sampler - :members: - :undoc-members: - :show-inheritance: +Submodules +---------- -astartes.samplers.sphere\_exclusion module +astartes.samplers.abstract\_sampler module ------------------------------------------ -.. automodule:: astartes.samplers.sphere_exclusion +.. automodule:: astartes.samplers.abstract_sampler :members: :undoc-members: :show-inheritance: diff --git a/docs/_sources/astartes.utils.rst.txt b/docs/_sources/astartes.utils.rst.txt new file mode 100644 index 00000000..729aa702 --- /dev/null +++ b/docs/_sources/astartes.utils.rst.txt @@ -0,0 +1,61 @@ +astartes.utils package +====================== + +Submodules +---------- + +astartes.utils.array\_type\_helpers module +------------------------------------------ + +.. automodule:: astartes.utils.array_type_helpers + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.exceptions module +-------------------------------- + +.. automodule:: astartes.utils.exceptions + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.fast\_kennard\_stone module +------------------------------------------ + +.. 
automodule:: astartes.utils.fast_kennard_stone + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.sampler\_factory module +-------------------------------------- + +.. automodule:: astartes.utils.sampler_factory + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.user\_utils module +--------------------------------- + +.. automodule:: astartes.utils.user_utils + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.warnings module +------------------------------ + +.. automodule:: astartes.utils.warnings + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: astartes.utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/index.rst.txt b/docs/_sources/index.rst.txt index 6b398a99..75ed3a60 100644 --- a/docs/_sources/index.rst.txt +++ b/docs/_sources/index.rst.txt @@ -1,5 +1,5 @@ .. astartes documentation master file, created by - sphinx-quickstart on Fri Jul 9 14:25:42 2021. + sphinx-quickstart on Fri Jul 9 14:25:42 2022. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. @@ -11,6 +11,8 @@ astartes documentation :caption: Contents: README + CONTRIBUTING + sklearn_to_astartes modules diff --git a/docs/_sources/modules.rst.txt b/docs/_sources/modules.rst.txt index 5ec05fd6..831c4b4f 100644 --- a/docs/_sources/modules.rst.txt +++ b/docs/_sources/modules.rst.txt @@ -5,5 +5,4 @@ astartes :maxdepth: 4 astartes - setup test diff --git a/docs/_sources/sklearn_to_astartes.rst.txt b/docs/_sources/sklearn_to_astartes.rst.txt new file mode 100644 index 00000000..fa5248f3 --- /dev/null +++ b/docs/_sources/sklearn_to_astartes.rst.txt @@ -0,0 +1,180 @@ + +Transitioning from ``sklearn`` to ``astartes`` +====================================================== + +Step 1. 
Installation +-------------------- + +``astartes`` has been designed to rely on (1) as few packages as possible and (2) packages which are already likely to be installed in a Machine Learning (ML) Python workflow (i.e. Numpy and Sklearn). Because of this, ``astartes`` should be compatible with your *existing* workflow such as a conda environment. + +To install ``astartes`` for general ML use (the sampling of arbitrary vectors): **\ ``pip install astartes``\ ** + +For users in cheminformatics, ``astartes`` has an optional add-on that includes featurization as part of the sampling. To install, type **\ ``pip install 'astartes[molecules]'``\ **. With this extra install, ``astartes`` uses `\ ``AIMSim`` `_ to encode SMILES strings as feature vectors. The SMILES strings are parsed into molecular graphs using RDKit and then sampled with a single function call: ``train_test_split_molecules``. + + +* If your workflow already has a featurization scheme in place (i.e. you already have a vector representation of your chemical of interest), you can directly use ``train_test_split`` (though we invite you to explore the many molecular descriptors made available through AIMSim). + +Step 2. Changing the ``import`` Statement +--------------------------------------------- + +In one of the first few lines of your Python script, you have the line ``from sklearn.model_selection import train_test_split``. To switch to using ``astartes`` change this line to ``from astartes import train_test_split``. + +That's it! You are now using ``astartes``. + +If you were just calling ``train_test_split(X, y)``\ , your script should now work in the exact same way as ``sklearn`` with no changes required. + +.. code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + random_state=42, + ) + +*becomes* + +.. 
code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + random_state=42, + ) + +But we encourage you to try one of our many other samplers (see below)! + +Step 3. Specifying an Algorithmic Sampler +----------------------------------------- + +By default (for interoperability), ``astartes`` will use a random sampler to produce train/test splits - but the real value of ``astartes`` is in the sampling algorithms it implements. Check out the `README for a complete list of available algorithms `_ and how to call and customize them. + +If your existing call to ``train_test_split`` looks like this: + +.. code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + ) + +and you want to try out using Kennard-Stone sampling, switch it to this: + +.. code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + sampler="kennard_stone", + ) + +That's it! + +Step 4. Passing Keyword Arguments +--------------------------------- + +All of the arguments to ``sklearn``\ 's ``train_test_split`` can still be passed to ``astartes``\ ' ``train_test_split``\ : + +.. code-block:: python + + X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split( + X, + y, + labels, + train_size = 0.75, + test_size = 0.25, + sampler = "kmeans", + hopts = {"n_clusters": 4}, + ) + +Some samplers have tunable hyperparameters that allow you to more finely control their behavior. To do this with Sphere Exclusion, for example, switch your call to this: + +.. code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + sampler="sphere_exclusion", + hopts={"distance_cutoff": 0.15}, + ) + +Step 5. 
Useful ``astartes`` Features +---------------------------------------- + +``return_indices``\ : Improve Code Clarity +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There are circumstances where the indices of the train/test data can be useful (for example, if ``y`` or ``labels`` are large, memory-intense objects), and there is no way to directly return these indices in ``sklearn``. ``astartes`` will return the sampling splits themselves by default, but it can also return the indices for the user to manipulate according to their needs: + +.. code-block:: python + + X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split( + X, + y, + labels, + return_indices = False, + ) + +*could instead be* + +.. code-block:: python + + X_train, X_test, y_train, y_test, labels_train, labels_test, indices_train, indices_test = train_test_split( + X, + y, + labels, + return_indices = True, + ) + +If ``y`` or ``labels`` were large, memory-intense objects it could be beneficial to *not* pass them in to ``train_test_split`` and instead separate the existing lists later using the returned indices. + +``train_val_test_split``\ : More Rigorous ML +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Behind the scenes, ``train_test_split`` is actually just a one-line function that calls the real workhorse of ``astartes`` - ``train_val_test_split``\ : + +.. code-block:: python + + def train_test_split( + X: np.array, + ... + return_indices: bool = False, + ): + return train_val_test_split( + X, y, labels, train_size, 0, test_size, sampler, hopts, return_indices + ) + +The function call to ``train_val_test_split`` is identical to ``train_test_split`` and supports all the same samplers and hyperparameters, except for one additional keyword argument ``val_size``\ : + +.. 
code-block:: python + + def train_val_test_split( + X: np.array, + y: np.array = None, + labels: np.array = None, + train_size: float = 0.8, + val_size: float = 0.1, + test_size: float = 0.1, + sampler: str = "random", + hopts: dict = {}, + return_indices: bool = False, + ): + +When called, this will return *three* arrays from ``X``\ , ``y``\ , and ``labels`` (or three arrays of indices, if ``return_indices=True``\ ) rather than the usual two, according to the values given for ``train_size``\ , ``val_size``\ , and ``test_size`` in the function call. + +.. code-block:: python + + X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split( + X, + y, + train_size=0.8, + val_size=0.1, + test_size=0.1, + ) + +For truly rigorous ML modeling, the validation set should be used for hyperparameter tuning and the test set held out until the *very final* change has been made to the model to get a true sense of its performance. For better or for worse, this is *not* the current standard for ML modeling, but the authors believe it should be. + +Custom Warnings: ``ImperfectSplittingWarning`` and ``NormalizationWarning`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In the event that your requested train/validation/test split is not mathematically possible given the dimensions of the input data (e.g. you request 50/25/25 but have 101 data points), ``astartes`` will warn you at runtime that this has occurred. ``sklearn`` simply moves on quietly, and while this is fine *most* of the time, the authors felt it prudent to warn the user. +When entering a train/validation/test split, ``astartes`` will check that it is normalized and normalize it if not, warning the user at runtime. This will hopefully help prevent head-scratching hours of debugging. 
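The two warnings described above can be illustrated with a small standalone sketch. This is not the ``astartes`` implementation: the helper names ``normalize_split_sizes`` and ``split_counts`` are hypothetical, plain ``UserWarning`` stands in for the custom warning classes, and sending the leftover samples to the test set is an assumption made only for this example.

```python
import math
import warnings


def normalize_split_sizes(train_size, val_size, test_size):
    """Rescale the three fractions so they sum to 1, warning if they did not."""
    total = train_size + val_size + test_size
    if abs(total - 1.0) > 1e-9:
        # Stand-in for a NormalizationWarning-style message.
        warnings.warn(f"Split sizes sum to {total}; normalizing them to 1.0.")
    return train_size / total, val_size / total, test_size / total


def split_counts(n_samples, train_size, val_size, test_size):
    """Convert fractions to integer set sizes, warning when they cannot be exact."""
    fractions = normalize_split_sizes(train_size, val_size, test_size)
    ideal = [fraction * n_samples for fraction in fractions]
    # Round down (with a small tolerance for floating-point error).
    n_train = math.floor(ideal[0] + 1e-9)
    n_val = math.floor(ideal[1] + 1e-9)
    # Assumption for this sketch: leftover samples are placed in the test set.
    n_test = n_samples - n_train - n_val
    counts = (n_train, n_val, n_test)
    if any(abs(count - target) > 1e-9 for count, target in zip(counts, ideal)):
        # Stand-in for an ImperfectSplittingWarning-style message.
        warnings.warn(f"Requested split is not exactly achievable; using {counts}.")
    return counts
```

With 101 data points and a requested 50/25/25 split, ``split_counts(101, 0.5, 0.25, 0.25)`` cannot honor the fractions exactly, so it warns and returns whole-number set sizes that still add up to 101.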
diff --git a/docs/_sources/test.functional.rst.txt b/docs/_sources/test.functional.rst.txt new file mode 100644 index 00000000..0c1189a2 --- /dev/null +++ b/docs/_sources/test.functional.rst.txt @@ -0,0 +1,29 @@ +test.functional package +======================= + +Submodules +---------- + +test.functional.test\_astartes module +------------------------------------- + +.. automodule:: test.functional.test_astartes + :members: + :undoc-members: + :show-inheritance: + +test.functional.test\_molecules module +-------------------------------------- + +.. automodule:: test.functional.test_molecules + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: test.functional + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/test.regression.rst.txt b/docs/_sources/test.regression.rst.txt new file mode 100644 index 00000000..c3cb03bb --- /dev/null +++ b/docs/_sources/test.regression.rst.txt @@ -0,0 +1,21 @@ +test.regression package +======================= + +Submodules +---------- + +test.regression.test\_regression module +--------------------------------------- + +.. automodule:: test.regression.test_regression + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: test.regression + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/test.rst.txt b/docs/_sources/test.rst.txt new file mode 100644 index 00000000..9fca9975 --- /dev/null +++ b/docs/_sources/test.rst.txt @@ -0,0 +1,20 @@ +test package +============ + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + test.functional + test.regression + test.unit + +Module contents +--------------- + +.. 
automodule:: test + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/test.unit.rst.txt b/docs/_sources/test.unit.rst.txt new file mode 100644 index 00000000..d11aa2eb --- /dev/null +++ b/docs/_sources/test.unit.rst.txt @@ -0,0 +1,19 @@ +test.unit package +================= + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + test.unit.samplers + test.unit.utils + +Module contents +--------------- + +.. automodule:: test.unit + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/test.unit.samplers.extrapolative.rst.txt b/docs/_sources/test.unit.samplers.extrapolative.rst.txt new file mode 100644 index 00000000..cd6b17ef --- /dev/null +++ b/docs/_sources/test.unit.samplers.extrapolative.rst.txt @@ -0,0 +1,61 @@ +test.unit.samplers.extrapolative package +======================================== + +Submodules +---------- + +test.unit.samplers.extrapolative.test\_DBSCAN module +---------------------------------------------------- + +.. automodule:: test.unit.samplers.extrapolative.test_DBSCAN + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_Scaffold module +------------------------------------------------------ + +.. automodule:: test.unit.samplers.extrapolative.test_Scaffold + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_kmeans module +---------------------------------------------------- + +.. automodule:: test.unit.samplers.extrapolative.test_kmeans + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_optisim module +----------------------------------------------------- + +.. automodule:: test.unit.samplers.extrapolative.test_optisim + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_sphere\_exclusion module +--------------------------------------------------------------- + +.. 
automodule:: test.unit.samplers.extrapolative.test_sphere_exclusion + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_time\_based module +--------------------------------------------------------- + +.. automodule:: test.unit.samplers.extrapolative.test_time_based + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: test.unit.samplers.extrapolative + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/test.unit.samplers.interpolative.rst.txt b/docs/_sources/test.unit.samplers.interpolative.rst.txt new file mode 100644 index 00000000..812af5a9 --- /dev/null +++ b/docs/_sources/test.unit.samplers.interpolative.rst.txt @@ -0,0 +1,37 @@ +test.unit.samplers.interpolative package +======================================== + +Submodules +---------- + +test.unit.samplers.interpolative.test\_kennard\_stone module +------------------------------------------------------------ + +.. automodule:: test.unit.samplers.interpolative.test_kennard_stone + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.interpolative.test\_random module +---------------------------------------------------- + +.. automodule:: test.unit.samplers.interpolative.test_random + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.interpolative.test\_spxy module +-------------------------------------------------- + +.. automodule:: test.unit.samplers.interpolative.test_spxy + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. 
automodule:: test.unit.samplers.interpolative + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/test.unit.samplers.rst.txt b/docs/_sources/test.unit.samplers.rst.txt new file mode 100644 index 00000000..dc112797 --- /dev/null +++ b/docs/_sources/test.unit.samplers.rst.txt @@ -0,0 +1,19 @@ +test.unit.samplers package +========================== + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + test.unit.samplers.extrapolative + test.unit.samplers.interpolative + +Module contents +--------------- + +.. automodule:: test.unit.samplers + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_sources/test.unit.utils.rst.txt b/docs/_sources/test.unit.utils.rst.txt new file mode 100644 index 00000000..d4b96cf4 --- /dev/null +++ b/docs/_sources/test.unit.utils.rst.txt @@ -0,0 +1,37 @@ +test.unit.utils package +======================= + +Submodules +---------- + +test.unit.utils.test\_convert\_to\_array module +----------------------------------------------- + +.. automodule:: test.unit.utils.test_convert_to_array + :members: + :undoc-members: + :show-inheritance: + +test.unit.utils.test\_sampler\_factory module +--------------------------------------------- + +.. automodule:: test.unit.utils.test_sampler_factory + :members: + :undoc-members: + :show-inheritance: + +test.unit.utils.test\_utils module +---------------------------------- + +.. automodule:: test.unit.utils.test_utils + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: test.unit.utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/_static/_sphinx_javascript_frameworks_compat.js b/docs/_static/_sphinx_javascript_frameworks_compat.js new file mode 100644 index 00000000..81415803 --- /dev/null +++ b/docs/_static/_sphinx_javascript_frameworks_compat.js @@ -0,0 +1,123 @@ +/* Compatability shim for jQuery and underscores.js. 
+ * + * Copyright Sphinx contributors + * Released under the two clause BSD licence + */ + +/** + * small helper function to urldecode strings + * + * See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/decodeURIComponent#Decoding_query_parameters_from_a_URL + */ +jQuery.urldecode = function(x) { + if (!x) { + return x + } + return decodeURIComponent(x.replace(/\+/g, ' ')); +}; + +/** + * small helper function to urlencode strings + */ +jQuery.urlencode = encodeURIComponent; + +/** + * This function returns the parsed url parameters of the + * current request. Multiple values per key are supported, + * it will always return arrays of strings for the value parts. + */ +jQuery.getQueryParameters = function(s) { + if (typeof s === 'undefined') + s = document.location.search; + var parts = s.substr(s.indexOf('?') + 1).split('&'); + var result = {}; + for (var i = 0; i < parts.length; i++) { + var tmp = parts[i].split('=', 2); + var key = jQuery.urldecode(tmp[0]); + var value = jQuery.urldecode(tmp[1]); + if (key in result) + result[key].push(value); + else + result[key] = [value]; + } + return result; +}; + +/** + * highlight a given string on a jquery object by wrapping it in + * span elements with the given class name. 
+ */ +jQuery.fn.highlightText = function(text, className) { + function highlight(node, addItems) { + if (node.nodeType === 3) { + var val = node.nodeValue; + var pos = val.toLowerCase().indexOf(text); + if (pos >= 0 && + !jQuery(node.parentNode).hasClass(className) && + !jQuery(node.parentNode).hasClass("nohighlight")) { + var span; + var isInSVG = jQuery(node).closest("body, svg, foreignObject").is("svg"); + if (isInSVG) { + span = document.createElementNS("http://www.w3.org/2000/svg", "tspan"); + } else { + span = document.createElement("span"); + span.className = className; + } + span.appendChild(document.createTextNode(val.substr(pos, text.length))); + node.parentNode.insertBefore(span, node.parentNode.insertBefore( + document.createTextNode(val.substr(pos + text.length)), + node.nextSibling)); + node.nodeValue = val.substr(0, pos); + if (isInSVG) { + var rect = document.createElementNS("http://www.w3.org/2000/svg", "rect"); + var bbox = node.parentElement.getBBox(); + rect.x.baseVal.value = bbox.x; + rect.y.baseVal.value = bbox.y; + rect.width.baseVal.value = bbox.width; + rect.height.baseVal.value = bbox.height; + rect.setAttribute('class', className); + addItems.push({ + "parent": node.parentNode, + "target": rect}); + } + } + } + else if (!jQuery(node).is("button, select, textarea")) { + jQuery.each(node.childNodes, function() { + highlight(this, addItems); + }); + } + } + var addItems = []; + var result = this.each(function() { + highlight(this, addItems); + }); + for (var i = 0; i < addItems.length; ++i) { + jQuery(addItems[i].parent).before(addItems[i].target); + } + return result; +}; + +/* + * backward compatibility for jQuery.browser + * This will be supported until firefox bug is fixed. 
+ */ +if (!jQuery.browser) { + jQuery.uaMatch = function(ua) { + ua = ua.toLowerCase(); + + var match = /(chrome)[ \/]([\w.]+)/.exec(ua) || + /(webkit)[ \/]([\w.]+)/.exec(ua) || + /(opera)(?:.*version|)[ \/]([\w.]+)/.exec(ua) || + /(msie) ([\w.]+)/.exec(ua) || + ua.indexOf("compatible") < 0 && /(mozilla)(?:.*? rv:([\w.]+)|)/.exec(ua) || + []; + + return { + browser: match[ 1 ] || "", + version: match[ 2 ] || "0" + }; + }; + jQuery.browser = {}; + jQuery.browser[jQuery.uaMatch(navigator.userAgent).browser] = true; +} diff --git a/docs/_static/sphinx_highlight.js b/docs/_static/sphinx_highlight.js new file mode 100644 index 00000000..8a96c69a --- /dev/null +++ b/docs/_static/sphinx_highlight.js @@ -0,0 +1,154 @@ +/* Highlighting utilities for Sphinx HTML documentation. */ +"use strict"; + +const SPHINX_HIGHLIGHT_ENABLED = true + +/** + * highlight a given string on a node by wrapping it in + * span elements with the given class name. + */ +const _highlight = (node, addItems, text, className) => { + if (node.nodeType === Node.TEXT_NODE) { + const val = node.nodeValue; + const parent = node.parentNode; + const pos = val.toLowerCase().indexOf(text); + if ( + pos >= 0 && + !parent.classList.contains(className) && + !parent.classList.contains("nohighlight") + ) { + let span; + + const closestNode = parent.closest("body, svg, foreignObject"); + const isInSVG = closestNode && closestNode.matches("svg"); + if (isInSVG) { + span = document.createElementNS("http://www.w3.org/2000/svg", "tspan"); + } else { + span = document.createElement("span"); + span.classList.add(className); + } + + span.appendChild(document.createTextNode(val.substr(pos, text.length))); + const rest = document.createTextNode(val.substr(pos + text.length)); + parent.insertBefore( + span, + parent.insertBefore( + rest, + node.nextSibling + ) + ); + node.nodeValue = val.substr(0, pos); + /* There may be more occurrences of search term in this node. 
So call this + * function recursively on the remaining fragment. + */ + _highlight(rest, addItems, text, className); + + if (isInSVG) { + const rect = document.createElementNS( + "http://www.w3.org/2000/svg", + "rect" + ); + const bbox = parent.getBBox(); + rect.x.baseVal.value = bbox.x; + rect.y.baseVal.value = bbox.y; + rect.width.baseVal.value = bbox.width; + rect.height.baseVal.value = bbox.height; + rect.setAttribute("class", className); + addItems.push({ parent: parent, target: rect }); + } + } + } else if (node.matches && !node.matches("button, select, textarea")) { + node.childNodes.forEach((el) => _highlight(el, addItems, text, className)); + } +}; +const _highlightText = (thisNode, text, className) => { + let addItems = []; + _highlight(thisNode, addItems, text, className); + addItems.forEach((obj) => + obj.parent.insertAdjacentElement("beforebegin", obj.target) + ); +}; + +/** + * Small JavaScript module for the documentation. + */ +const SphinxHighlight = { + + /** + * highlight the search words provided in localstorage in the text + */ + highlightSearchWords: () => { + if (!SPHINX_HIGHLIGHT_ENABLED) return; // bail if no highlight + + // get and clear terms from localstorage + const url = new URL(window.location); + const highlight = + localStorage.getItem("sphinx_highlight_terms") + || url.searchParams.get("highlight") + || ""; + localStorage.removeItem("sphinx_highlight_terms") + url.searchParams.delete("highlight"); + window.history.replaceState({}, "", url); + + // get individual terms from highlight string + const terms = highlight.toLowerCase().split(/\s+/).filter(x => x); + if (terms.length === 0) return; // nothing to do + + // There should never be more than one element matching "div.body" + const divBody = document.querySelectorAll("div.body"); + const body = divBody.length ? 
divBody[0] : document.querySelector("body"); + window.setTimeout(() => { + terms.forEach((term) => _highlightText(body, term, "highlighted")); + }, 10); + + const searchBox = document.getElementById("searchbox"); + if (searchBox === null) return; + searchBox.appendChild( + document + .createRange() + .createContextualFragment( + '" + ) + ); + }, + + /** + * helper function to hide the search marks again + */ + hideSearchWords: () => { + document + .querySelectorAll("#searchbox .highlight-link") + .forEach((el) => el.remove()); + document + .querySelectorAll("span.highlighted") + .forEach((el) => el.classList.remove("highlighted")); + localStorage.removeItem("sphinx_highlight_terms") + }, + + initEscapeListener: () => { + // only install a listener if it is really needed + if (!DOCUMENTATION_OPTIONS.ENABLE_SEARCH_SHORTCUTS) return; + + document.addEventListener("keydown", (event) => { + // bail for input elements + if (BLACKLISTED_KEY_CONTROL_ELEMENTS.has(document.activeElement.tagName)) return; + // bail with special keys + if (event.shiftKey || event.altKey || event.ctrlKey || event.metaKey) return; + if (DOCUMENTATION_OPTIONS.ENABLE_SEARCH_SHORTCUTS && (event.key === "Escape")) { + SphinxHighlight.hideSearchWords(); + event.preventDefault(); + } + }); + }, +}; + +_ready(() => { + /* Do not call highlightSearchWords() when we are on the search page. + * It will highlight words from the *previous* search query. + */ + if (typeof Search === "undefined") SphinxHighlight.highlightSearchWords(); + SphinxHighlight.initEscapeListener(); +}); diff --git a/docs/astartes.doctree b/docs/astartes.doctree new file mode 100644 index 00000000..11f048d0 Binary files /dev/null and b/docs/astartes.doctree differ diff --git a/docs/astartes.html b/docs/astartes.html new file mode 100644 index 00000000..69e5943e --- /dev/null +++ b/docs/astartes.html @@ -0,0 +1,390 @@ + + + + + + + astartes package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

astartes package

+
+

Subpackages

+
+ +
+
+
+

Submodules

+
+
+

astartes.main module

+
+
+astartes.main.train_test_split(X: array, y: array | None = None, labels: array | None = None, train_size: float = 0.75, test_size: float | None = None, sampler: str = 'random', random_state: int | None = None, hopts: dict = {}, return_indices: bool = False)
+

Deterministic train_test_splitting of arbitrary arrays.

+
+
Parameters:
+
    +
  • X (np.array) – Numpy array of feature vectors.

  • +
  • y (np.array, optional) – Targets corresponding to X, must be of same size. Defaults to None.

  • +
  • labels (np.array, optional) – Labels corresponding to X, must be of same size. Defaults to None.

  • +
  • train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.75.

  • +
  • test_size (float, optional) – Fraction of dataset to use in test set. Defaults to None.

  • +
  • sampler (str, optional) – Sampler to use, see IMPLEMENTED_INTER/EXTRAPOLATION_SAMPLERS. Defaults to “random”.

  • +
  • random_state (int, optional) – The random seed used throughout astartes.

  • +
  • hopts (dict, optional) – Hyperparameters for the sampler used above. Defaults to {}.

  • +
  • return_indices (bool, optional) – True to return indices of train/test instead of values. Defaults to False.

  • +
+
+
Returns:
+

X, y, and labels train/test data, or indices.

+
+
Return type:
+

np.array

+
+
+
+ +
+
+astartes.main.train_val_test_split(X: array | DataFrame, y: array | Series | None = None, labels: array | Series | None = None, train_size: float = 0.8, val_size: float = 0.1, test_size: float = 0.1, sampler: str = 'random', random_state: int | None = None, hopts: dict = {}, return_indices: bool = False)
+

Deterministic train/val/test splitting of arbitrary arrays.

+
+
Parameters:
+
    +
  • X (np.array, pd.DataFrame) – Numpy array or pandas DataFrame of feature vectors.

  • +
  • y (np.array, pd.Series, optional) – Targets corresponding to X, must be of same size. Defaults to None.

  • +
  • labels (np.array, pd.Series, optional) – Labels corresponding to X, must be of same size. Defaults to None.

  • +
  • train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.

  • +
  • val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.

  • +
  • test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.

  • +
  • sampler (str, optional) – Sampler to use, see IMPLEMENTED_INTER/EXTRAPOLATION_SAMPLERS. Defaults to “random”.

  • +
  • random_state (int, optional) – The random seed used throughout astartes.

  • +
  • hopts (dict, optional) – Hyperparameters for the sampler used above. Defaults to {}.

  • +
  • return_indices (bool, optional) – True to return indices of train/test after values. Defaults to False.

  • +
+
+
Returns:
+

X, y, and labels train/val/test data, or indices.

+
+
Return type:
+

np.array(s)

+
+
+
+ +
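As a rough illustration of why returned indices can be useful, the sketch below splits an object that was never passed to the splitter. The index values are made up for the example and the helper ``take`` is hypothetical; in practice the indices would come from a call made with ``return_indices=True``.

```python
# Hypothetical index lists, standing in for what a splitter might return
# when return_indices=True; the values are invented for illustration.
indices_train = [0, 2, 3, 5]
indices_test = [1, 4]

# A (potentially large) object that was deliberately NOT passed to the splitter.
labels = ["a", "b", "c", "d", "e", "f"]


def take(sequence, indices):
    """Select the items of `sequence` at the given positions, in order."""
    return [sequence[i] for i in indices]


# Partition the object afterwards, using only the indices.
labels_train = take(labels, indices_train)
labels_test = take(labels, indices_test)
```

The same indices can be reused to partition any number of parallel objects without copying them through the splitting call.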
+
+

astartes.molecules module

+
+
+astartes.molecules.train_test_split_molecules(molecules: array, y: array | None = None, labels: array | None = None, train_size: float = 0.75, test_size: float | None = None, sampler: str = 'random', random_state: int | None = None, hopts: dict = {}, fingerprint: str = 'morgan_fingerprint', fprints_hopts: dict = {}, return_indices: bool = False)
+

Deterministic train/test splitting of molecules (SMILES strings or RDKit objects).

+
+
Parameters:
+
    +
  • molecules (np.array) – List of SMILES strings or RDKit molecule objects representing molecules or reactions.

  • +
  • y (np.array, optional) – Targets corresponding to SMILES, must be of same size. Defaults to None.

  • +
  • labels (np.array, optional) – Labels corresponding to SMILES, must be of same size. Defaults to None.

  • +
  • train_size (float, optional) – Fraction of dataset to use in training (test+train~1). Defaults to 0.75.

  • +
  • test_size (float, optional) – Fraction of dataset to use in test set. Defaults to None.

  • +
  • sampler (str, optional) – Sampler to use, see IMPLEMENTED_INTER/EXTRAPOLATION_SAMPLERS. Defaults to “random”.

  • +
  • random_state (int, optional) – The random seed used throughout astartes. Defaults to None.

  • +
  • hopts (dict, optional) – Hyperparameters for the sampler used above. Defaults to {}.

  • +
  • fingerprint (str, optional) – Molecular fingerprint to be used from AIMSim. Defaults to “morgan_fingerprint”.

  • +
  • fprints_hopts (dict, optional) – Hyperparameters for AIMSim featurization. Defaults to {}.

  • +
  • return_indices (bool, optional) – True to return indices of train/test after the values. Defaults to False.

  • +
+
+
Returns:
+

X, y, and labels train/test data, or indices.

+
+
Return type:
+

np.array

+
+
+
+ +
+
+astartes.molecules.train_val_test_split_molecules(molecules: array, y: array | None = None, labels: array | None = None, train_size: float = 0.8, val_size: float = 0.1, test_size: float = 0.1, sampler: str = 'random', random_state: int | None = None, hopts: dict = {}, fingerprint: str = 'morgan_fingerprint', fprints_hopts: dict = {}, return_indices: bool = False)
+

Deterministic train/val/test splitting of molecules (SMILES strings or RDKit objects).

+
+
Parameters:
+
    +
  • molecules (np.array) – List of SMILES strings or RDKit molecule objects representing molecules or reactions.

  • +
  • y (np.array, optional) – Targets corresponding to SMILES, must be of same size. Defaults to None.

  • +
  • labels (np.array, optional) – Labels corresponding to SMILES, must be of same size. Defaults to None.

  • +
  • train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.

  • +
  • val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.

  • +
  • test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.

  • +
  • sampler (str, optional) – Sampler to use, see IMPLEMENTED_INTER/EXTRAPOLATION_SAMPLERS. Defaults to “random”.

  • +
random_state (int, optional) – The random seed used throughout astartes. Defaults to None.

  • +
  • hopts (dict, optional) – Hyperparameters for the sampler used above. Defaults to {}.

  • +
  • fingerprint (str, optional) – Molecular fingerprint to be used from AIMSim. Defaults to “morgan_fingerprint”.

  • +
  • fprints_hopts (dict, optional) – Hyperparameters for AIMSim featurization. Defaults to {}.

  • +
  • return_indices (bool, optional) – True to return indices of train/test after the values. Defaults to False.

  • +
+
+
Returns:
+

X, y, and labels train/val/test data, or indices.

+
+
Return type:
+

np.array

+
+
+
+ +
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/astartes.rst b/docs/astartes.rst new file mode 100644 index 00000000..4af4818c --- /dev/null +++ b/docs/astartes.rst @@ -0,0 +1,38 @@ +astartes package +================ + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + astartes.samplers + astartes.utils + +Submodules +---------- + +astartes.main module +-------------------- + +.. automodule:: astartes.main + :members: + :undoc-members: + :show-inheritance: + +astartes.molecules module +------------------------- + +.. automodule:: astartes.molecules + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: astartes + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/astartes.samplers.doctree b/docs/astartes.samplers.doctree new file mode 100644 index 00000000..cb821af2 Binary files /dev/null and b/docs/astartes.samplers.doctree differ diff --git a/docs/astartes.samplers.extrapolation.doctree b/docs/astartes.samplers.extrapolation.doctree new file mode 100644 index 00000000..fad02631 Binary files /dev/null and b/docs/astartes.samplers.extrapolation.doctree differ diff --git a/docs/astartes.samplers.extrapolation.html b/docs/astartes.samplers.extrapolation.html new file mode 100644 index 00000000..6dd39674 --- /dev/null +++ b/docs/astartes.samplers.extrapolation.html @@ -0,0 +1,336 @@ + + + + + + + astartes.samplers.extrapolation package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

astartes.samplers.extrapolation package

+
+

Submodules

+
+
+

astartes.samplers.extrapolation.dbscan module

+
+
+class astartes.samplers.extrapolation.dbscan.DBSCAN(X, y, labels, configs)
+

Bases: AbstractSampler

+
+ +
+
+

astartes.samplers.extrapolation.kmeans module

+
+
+class astartes.samplers.extrapolation.kmeans.KMeans(X, y, labels, configs)
+

Bases: AbstractSampler

+
+ +
+
+

astartes.samplers.extrapolation.optisim module

+

The Optimizable K-Dissimilarity Selection (OptiSim) algorithm, as originally described by Clark (https://pubs.acs.org/doi/full/10.1021/ci970282v), adapted to work for arbitrary distance metrics.

+

The original algorithm:

1. Initialization
    • Take a featurized dataset and select an arbitrary starting data point for the selection set.
    • Treat the remaining data as ‘candidates’.
    • Create an empty ‘recycling bin’.
    • Create an empty subsample set.
    • Create an empty selection set.
2. Remove a random point from the candidates.
    • If it has a similarity greater than a given cutoff to any of the members of the selection set (or conversely, if it is within a cutoff distance), recycle it.
    • Otherwise, add it to the subsample set.
3. Repeat Step 2 until one of two conditions is met:
    a. The subsample reaches the pre-determined maximum size K, or
    b. The candidates are exhausted.
4. If Step 3 resulted in condition b, move all data from the recycling bin and go to Step 2.
5. If the subsample is empty, quit (all remaining candidates are similar; the most dissimilar data points have already been identified).
6. Pick the most dissimilar point in the subsample (relative to data points already in the selection set) and add it to the selection set.
7. Move the remaining points in the subsample to the recycling bin.
8. If size(selection set) is sufficient, quit. Otherwise, go to Step 2.

As suggested in the original paper, the members of the selection set are then +used as cluster centers, and we assign every element in the dataset to belong +to the cluster containing the selection set member to which it is the most +similar. To implement this step, use scipy.spatial.distance.cdist.

+

This algorithm seems like it might introduce an infinite loop if the subsample +is not filled and all of the remaining candidates are within the cutoff and cannot +be added. Might need a stop condition here? Unless the empyting of the recycling bin +will somehow fix this. Also possible that one would never have a partially filled +subsample after looking at the full dataset since it is more probable that ALL the +points would be rejected and the subsample would be empty.

+

Likely just check for no more points being possible to fit into the subsample, and +exit if that is the case.
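The procedure and the suggested exit check can be sketched as follows. This is a minimal illustration, not the code in `astartes.samplers.extrapolation.optisim`: the function name, its parameters, the use of Euclidean distance, and the specific guard (the recycling bin is emptied back into the candidates at most once per round) are all assumptions made for the example.

```python
import math
import random


def optisim_select(points, n_select, max_subsample, cutoff, seed=0):
    """Pick up to n_select mutually dissimilar point indices (illustrative sketch)."""
    rng = random.Random(seed)
    candidates = list(range(len(points)))
    # Arbitrary starting point for the selection set; the rest are candidates.
    selection = [candidates.pop(rng.randrange(len(candidates)))]
    recycling = []

    def nearest_selected(i):
        # Distance from point i to the closest member of the selection set.
        return min(math.dist(points[i], points[s]) for s in selection)

    while len(selection) < n_select:
        subsample = []
        refilled = False
        while len(subsample) < max_subsample:
            if not candidates:
                # Candidates exhausted: empty the recycling bin back into the
                # candidates, but only once per round so the loop must end.
                if refilled or not recycling:
                    break
                candidates, recycling = recycling, []
                refilled = True
                continue
            i = candidates.pop(rng.randrange(len(candidates)))
            if nearest_selected(i) < cutoff:
                recycling.append(i)  # too close to an existing selection
            else:
                subsample.append(i)
        if not subsample:
            break  # nothing left that is dissimilar enough
        # Keep the subsample member farthest from the selection set.
        best = max(subsample, key=nearest_selected)
        selection.append(best)
        recycling.extend(i for i in subsample if i != best)
    return selection


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
chosen = optisim_select(points, n_select=3, max_subsample=2, cutoff=1.0)
print(chosen)
```

Because a point enters the subsample only when it is at least `cutoff` away from every current selection, the returned indices are pairwise at least `cutoff` apart, and a round that finds no such point ends the search instead of cycling through the recycling bin indefinitely.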

+
+
+class astartes.samplers.extrapolation.optisim.OptiSim(X, y, labels, configs)
+

Bases: AbstractSampler

+
+
+get_dist(i, j)
+

Calculates pdist and returns distance between two samples

+
+ +
+
+move_item(item, source_set, destintation_set)
+

Moves item from source_set to destination_set

+
+ +
+
+rchoose(set)
+

Choose a random element from a set with self._rng

+
+ +
+ +
+
+

astartes.samplers.extrapolation.scaffold module

+

This sampler partitions the data based on the Bemis-Murcko scaffold function as implemented in RDKit. +Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887−2893. +Landrum, G. et al. RDKit: Open-Source Cheminformatics; 2006; https://www.rdkit.org.

+

The goal is to cluster molecules that share the same scaffold. +Later, these clusters will be assigned to training, validation, and testing split +to create data splits that will measure extrapolation by testing on scaffolds +that are not in the training set.

+
+
+class astartes.samplers.extrapolation.scaffold.Scaffold(X, y, labels, configs)
+

Bases: AbstractSampler

+
+
+generate_bemis_murcko_scaffold(mol, include_chirality=False)
+

Compute the Bemis-Murcko scaffold for an RDKit molecule.

+
+
Params:

mol: A smiles string or an RDKit molecule. +include_chirality: Whether to include chirality.

+
+
+
+
Returns:
+

Bemis-Murcko scaffold

+
+
+
+ +
+
+scaffold_to_smiles(mols)
+

Computes scaffold for each smiles string and returns a mapping from scaffolds to sets of smiles.

+
+
Params:

mols: A list of smiles strings or RDKit molecules.

+
+
+
+
Returns:
+

A dictionary mapping each unique scaffold to all smiles (or smiles indices) which have that scaffold.

+
+
+
+ +
+
+str_to_mol(string)
+

Converts an InChI or SMILES string to an RDKit molecule.

+
+
Params:

string: The InChI or SMILES string.

+
+
+
+
Returns:
+

An RDKit molecule.

+
+
+
+ +
+ +
+
+

astartes.samplers.extrapolation.sphere_exclusion module

+

The Sphere Exclusion clustering algorithm.

+

This re-implementation draws from this blog post on the RDKit blog, +though abstracted to work for arbitrary feature vectors: +http://rdkit.blogspot.com/2020/11/sphere-exclusion-clustering-with-rdkit.html +As well as this paper: +https://www.daylight.com/cheminformatics/whitepapers/ClusteringWhitePaper.pdf

+

But instead of using Tanimoto similarity, which has a domain between zero and one, it uses Euclidean distance to enable processing arbitrary valued vectors.
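To make the idea concrete, here is a small sketch of a sphere-exclusion pass driven by Euclidean distance. It is a generic textbook-style version written for this note, not the `SphereExclusion` class itself; the function name, the `radius` parameter, and the choice to seed spheres in input order are assumptions.

```python
import math


def sphere_exclusion(points, radius):
    """Assign a cluster label to each point with a simple sphere-exclusion pass."""
    labels = [None] * len(points)
    next_label = 0
    for center, center_point in enumerate(points):
        if labels[center] is not None:
            continue  # already swallowed by an earlier sphere
        # This point seeds a new sphere; claim every unassigned point inside it.
        for other, other_point in enumerate(points):
            if labels[other] is None and math.dist(center_point, other_point) <= radius:
                labels[other] = next_label
        next_label += 1
    return labels


# Feature values far outside [0, 1] are fine because only distances matter.
points = [(0.0, 0.0), (1.0, 0.0), (100.0, 250.0), (101.0, 250.0), (-40.0, 7.5)]
sphere_labels = sphere_exclusion(points, radius=2.0)
print(sphere_labels)  # [0, 0, 1, 1, 2]
```

Unlike a Tanimoto cutoff, a Euclidean radius has no natural upper bound, so in practice it would need to be chosen relative to the scale of the features.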

+
+
+class astartes.samplers.extrapolation.sphere_exclusion.SphereExclusion(X, y, labels, configs)
+

Bases: AbstractSampler

+
+ +
+
+

astartes.samplers.extrapolation.time_based module

+
+
+class astartes.samplers.extrapolation.time_based.TimeBased(X, y, labels, configs)
+

Bases: AbstractSampler

+
+ +
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/astartes.samplers.extrapolation.rst b/docs/astartes.samplers.extrapolation.rst new file mode 100644 index 00000000..6fd2f0b8 --- /dev/null +++ b/docs/astartes.samplers.extrapolation.rst @@ -0,0 +1,61 @@ +astartes.samplers.extrapolation package +======================================= + +Submodules +---------- + +astartes.samplers.extrapolation.dbscan module +--------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.dbscan + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.kmeans module +--------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.kmeans + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.optisim module +---------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.optisim + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.scaffold module +----------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.scaffold + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.sphere\_exclusion module +-------------------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.sphere_exclusion + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.extrapolation.time\_based module +-------------------------------------------------- + +.. automodule:: astartes.samplers.extrapolation.time_based + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. 
automodule:: astartes.samplers.extrapolation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/astartes.samplers.html b/docs/astartes.samplers.html new file mode 100644 index 00000000..d9dd7390 --- /dev/null +++ b/docs/astartes.samplers.html @@ -0,0 +1,274 @@ + + + + + + + astartes.samplers package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

astartes.samplers package

+
+

Subpackages

+ +
+
+

Submodules

+
+
+

astartes.samplers.abstract_sampler module

+

Abstract Sampling class

+
+
+class astartes.samplers.abstract_sampler.AbstractSampler(X, y, labels, configs)
+

Bases: ABC

+

Abstract Base Class for all samplers.

+
+
+__init__(X, y, labels, configs)
+

Copies X, y, labels, and configs into class attributes and then calls sampler.

+
+ +
+
+get_clusters()
+

Getter for the cluster labels.

+
+
Returns:
+

Cluster labels.

+
+
Return type:
+

np.array

+
+
+
+ +
+
+get_config(key, default=None)
+

Getter to sampler._configs

+
+
Parameters:
+
    +
  • key (str) – String parameter for the sampler.

  • +
  • default (any, optional) – Default to return if key not present. Defaults to None.

  • +
+
+
Returns:
+

value at provided key, or else default.

+
+
Return type:
+

any

+
+
+
+ +
+
+get_sample_idxs(n_samples)
+

Get idxs of samples.

+
+ +
+
+get_sorted_cluster_counter(max_shufflable_size=None)
+

Return a dict containing cluster_id: number of members sorted by number +of members, ascending

+

If max_shufflable_size is not None, clusters below the passed size will be shuffled into a new order according to random_state in hopts.
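The sorted counter described here can be pictured with the standard library alone. The helper below is a hypothetical stand-in that covers only the ascending sort; the optional shuffling of small clusters is left out, and nothing about the real method's internals is implied.

```python
from collections import Counter


def sorted_cluster_counter(cluster_labels):
    """Map cluster_id -> member count, smallest cluster first (hypothetical helper)."""
    counts = Counter(cluster_labels)
    # dicts preserve insertion order, so sorting the items fixes the iteration order.
    return dict(sorted(counts.items(), key=lambda item: item[1]))


clusters = [2, 0, 0, 1, 0, 2]
counter = sorted_cluster_counter(clusters)
print(counter)  # {1: 1, 2: 2, 0: 3}
```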

+
+ +
+ +
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/astartes.samplers.interpolation.doctree b/docs/astartes.samplers.interpolation.doctree new file mode 100644 index 00000000..984109ce Binary files /dev/null and b/docs/astartes.samplers.interpolation.doctree differ diff --git a/docs/astartes.samplers.interpolation.html b/docs/astartes.samplers.interpolation.html new file mode 100644 index 00000000..52f18375 --- /dev/null +++ b/docs/astartes.samplers.interpolation.html @@ -0,0 +1,183 @@ + + + + + + + astartes.samplers.interpolation package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

astartes.samplers.interpolation package

+
+

Submodules

+
+
+

astartes.samplers.interpolation.kennardstone module

+
+
+class astartes.samplers.interpolation.kennardstone.KennardStone(X, y, labels, configs)
+

Bases: AbstractSampler

+
+ +
+
+

astartes.samplers.interpolation.random_split module

+
+
+class astartes.samplers.interpolation.random_split.Random(X, y, labels, configs)
+

Bases: AbstractSampler

+
+ +
+
+

astartes.samplers.interpolation.spxy module

+

Implements the Sample set Partitioning based on joint X-Y distances algorithm as originally described by Saldanha and coworkers in “A method for calibration and validation subset partitioning” doi:10.1016/j.talanta.2005.03.025

+

This implementation has been validated against their original source code implementation, which can be found in the paper linked above. The corresponding unit tests reflect the expected output from the original implementation. The breaking of ties is different compared to the original, but this is ultimately a minor and likely inconsequential difference.
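For readers unfamiliar with the method, the joint distance at the heart of SPXY is the X-space distance and the y-space distance between two samples, each divided by its maximum over the dataset and then added together. The sketch below computes that matrix in plain Python; the function name is hypothetical, scalar targets are assumed, and it is not the code used by the `SPXY` class.

```python
import math


def joint_xy_distances(X, y):
    """Return the matrix of SPXY joint distances for feature rows X and targets y."""
    n = len(X)
    dx = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]
    # Normalize each matrix by its largest entry so X and y carry equal weight.
    # (Assumes at least two distinct rows in X and two distinct values in y.)
    dx_max = max(max(row) for row in dx)
    dy_max = max(max(row) for row in dy)
    return [
        [dx[i][j] / dx_max + dy[i][j] / dy_max for j in range(n)]
        for i in range(n)
    ]


X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
y = [1.0, 2.0, 5.0]
d = joint_xy_distances(X, y)
print(d[0][2])  # 2.0, the largest possible joint distance
```

The resulting matrix can then be handed to a Kennard-Stone style selection in place of the plain X-space distances.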

+
+
+class astartes.samplers.interpolation.spxy.SPXY(X, y, labels, configs)
+

Bases: AbstractSampler

+
+ +
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/astartes.samplers.interpolation.rst b/docs/astartes.samplers.interpolation.rst new file mode 100644 index 00000000..91f887d4 --- /dev/null +++ b/docs/astartes.samplers.interpolation.rst @@ -0,0 +1,37 @@ +astartes.samplers.interpolation package +======================================= + +Submodules +---------- + +astartes.samplers.interpolation.kennardstone module +--------------------------------------------------- + +.. automodule:: astartes.samplers.interpolation.kennardstone + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.interpolation.random\_split module +---------------------------------------------------- + +.. automodule:: astartes.samplers.interpolation.random_split + :members: + :undoc-members: + :show-inheritance: + +astartes.samplers.interpolation.spxy module +------------------------------------------- + +.. automodule:: astartes.samplers.interpolation.spxy + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: astartes.samplers.interpolation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/astartes.samplers.rst b/docs/astartes.samplers.rst new file mode 100644 index 00000000..0caaf02a --- /dev/null +++ b/docs/astartes.samplers.rst @@ -0,0 +1,30 @@ +astartes.samplers package +========================= + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + astartes.samplers.extrapolation + astartes.samplers.interpolation + +Submodules +---------- + +astartes.samplers.abstract\_sampler module +------------------------------------------ + +.. automodule:: astartes.samplers.abstract_sampler + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. 
automodule:: astartes.samplers + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/astartes.utils.doctree b/docs/astartes.utils.doctree new file mode 100644 index 00000000..d03ce6bc Binary files /dev/null and b/docs/astartes.utils.doctree differ diff --git a/docs/astartes.utils.html b/docs/astartes.utils.html new file mode 100644 index 00000000..246e680f --- /dev/null +++ b/docs/astartes.utils.html @@ -0,0 +1,529 @@ + + + + + + + astartes.utils package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

astartes.utils package

+
+

Submodules

+
+
+

astartes.utils.array_type_helpers module

+
+
+astartes.utils.array_type_helpers.convert_to_array(obj: object, name: str)
+

Attempt to convert obj named name to a numpy array, with appropriate warnings and exceptions.

+
+
Parameters:
+
    +
  • obj (object) – The item to attempt to convert.

  • +
  • name (str) – Human-readable name for printing.

  • +
+
+
+
+ +
+
+astartes.utils.array_type_helpers.panda_handla(X, y, labels)
+

Helper function to deal with supporting Pandas data types in astartes

+
+
Parameters:
+
    +
  • X (Dataframe) – Features with column names

  • +
  • y (Series) – Targets

  • +
  • labels (Series) – Labels for data

  • +
+
+
Returns:
+

Empty if no pandas types, metadata-filled otherwise

+
+
Return type:
+

dict

+
+
+
+ +
+
+astartes.utils.array_type_helpers.return_helper(sampler_instance, train_idxs, val_idxs, test_idxs, return_indices, output_is_pandas)
+

Convenience function to return the requested arrays appropriately.

+
+
Parameters:
+
    +
  • sampler_instance (sampler) – The fit sampler instance.

  • +
  • test_idxs (np.array) – Indices of the data points in the test set.

  • +
  • val_idxs (np.array) – Indices of the data points in the validation set.

  • +
  • train_idxs (np.array) – Indices of the data points in the training set.

  • +
  • return_indices (bool) – Return indices after the value arrays.

  • +
  • output_is_pandas (dict) – metadata about casting to pandas.

  • +
+
+
Returns:
+

Either many arrays or indices in arrays.

+
+
Return type:
+

np.array

+
+
+

Notes

+

This function copies and pastes a lot of code when it could instead +use some loop over (X, y, labels, sampler_instance.get_clusters()) +but such an implementation is more error prone. This is long and +not the prettiest, but it is definitely doing what we want.

+
+ +
+
+

astartes.utils.exceptions module

+

Exceptions used by astartes

+
+
+exception astartes.utils.exceptions.InvalidConfigurationError(message=None)
+

Bases: RuntimeError

+

Used when user-requested split/data would not work.

+
+
+__init__(message=None)
+
+ +
+ +
+
+exception astartes.utils.exceptions.InvalidModelTypeError(message=None)
+

Bases: RuntimeError

+

Used when user-provided model is invalid.

+
+
+__init__(message=None)
+
+ +
+ +
+
+exception astartes.utils.exceptions.MoleculesNotInstalledError(message=None)
+

Bases: RuntimeError

+

Used when attempting to featurize molecules without install.

+
+
+__init__(message=None)
+
+ +
+ +
+
+exception astartes.utils.exceptions.SamplerNotImplementedError(message=None)
+

Bases: RuntimeError

+

Used when attempting to call a non-existent sampler.

+
+
+__init__(message=None)
+
+ +
+ +
+
+exception astartes.utils.exceptions.UncastableInputError(message=None)
+

Bases: RuntimeError

+

Used when X, y, or labels cannot be cast to a np.array.

+
+
+__init__(message=None)
+
+ +
+ +
+
+

astartes.utils.fast_kennard_stone module

+
+
+astartes.utils.fast_kennard_stone.fast_kennard_stone(ks_distance: ndarray) ndarray
+

Implements the Kennard-Stone algorithm

+
+
Parameters:
+

ks_distance (np.ndarray) – Distance matrix

+
+
Returns:
+

Indices in order of Kennard-Stone selection

+
+
Return type:
+

np.ndarray
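A compact way to picture the selection order is the classic Kennard-Stone rule: seed with the two most distant points, then repeatedly add the point whose nearest already-selected neighbor is farthest away. The sketch below follows that rule on a nested-list distance matrix. It is an unoptimized illustration with a hypothetical name, and its tie-breaking (first maximum wins) is an assumption that may differ from `fast_kennard_stone`.

```python
def kennard_stone_order(distance):
    """Return all indices in Kennard-Stone order for a square distance matrix."""
    n = len(distance)
    if n < 2:
        return list(range(n))
    # Seed with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: distance[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]
    while remaining:
        # Next pick: the point whose nearest selected neighbor is farthest away.
        nxt = max(remaining, key=lambda i: min(distance[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


positions = [0.0, 1.0, 4.0, 10.0]
distance = [[abs(a - b) for b in positions] for a in positions]
ks_order = kennard_stone_order(distance)
print(ks_order)  # [0, 3, 2, 1]
```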

+
+
+
+ +
+
+

astartes.utils.sampler_factory module

+
+
+class astartes.utils.sampler_factory.SamplerFactory(sampler)
+

Bases: object

+
+
+__init__(sampler)
+

Initialize SamplerFactory and copy a lowercased ‘sampler’ into an attribute.

+
+
Parameters:
+

sampler (string) – The desired sampler.

+
+
+
+ +
+
+get_sampler(X, y, labels, hopts)
+

Instantiate (which also performs fitting) and return the sampler.

+
+
Parameters:
+
    +
  • X (np.array) – Feature array.

  • +
  • y (np.array) – Target array.

  • +
  • labels (np.array) – Label array.

  • +
  • hopts (dict) – Hyperparameters for the sampler.

  • +
+
+
Raises:
+

SamplerNotImplementedError – Raised when a non-existent or not yet implemented sampler is requested.

+
+
Returns:
+

The fit sampler instance.

+
+
Return type:
+

astartes.sampler
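The factory behavior described above (lowercase the requested name, look it up, instantiate, and raise if it is unknown) might look roughly like the toy version below. Every class here is a hypothetical stand-in defined for the example, including the locally declared exception; none of it is taken from the astartes source.

```python
class SamplerNotImplementedError(RuntimeError):
    """Local stand-in for the exception of the same name documented above."""


class FirstNSampler:
    """Toy sampler (hypothetical): does its 'fitting' on construction."""

    def __init__(self, X, y, labels, hopts):
        self.order = list(range(len(X)))


class ToySamplerFactory:
    """Looks samplers up by a lowercased name (illustrative, not astartes code)."""

    _registry = {"first_n": FirstNSampler}

    def __init__(self, sampler):
        self.sampler = sampler.lower()

    def get_sampler(self, X, y, labels, hopts):
        try:
            sampler_class = self._registry[self.sampler]
        except KeyError:
            raise SamplerNotImplementedError(
                f"Sampler '{self.sampler}' is not implemented."
            ) from None
        # Instantiation is where the fitting happens.
        return sampler_class(X, y, labels, hopts)


fitted = ToySamplerFactory("First_N").get_sampler([[1], [2], [3]], None, None, {})
print(fitted.order)  # [0, 1, 2]
```

Lowercasing once in the constructor keeps the lookup itself case-insensitive without repeating the normalization at every call.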

+
+
+
+ +
+ +
+
+

astartes.utils.user_utils module

+
+
+astartes.utils.user_utils.display_results_as_table(error_dict)
+

Helper function to print a dictionary as a neat tabulate

+
+ +
+
+astartes.utils.user_utils.generate_regression_results_dict(sklearn_model, X, y, samplers=['random'], random_state=0, samplers_hopts={}, train_size=0.8, val_size=0.1, test_size=0.1, print_results=False, additional_metrics={})
+

Helper function to train a sklearn model using the provided data +and provided sampler types.

+
+
Parameters:
+
    +
  • X (np.array, pd.DataFrame) – Numpy array or pandas DataFrame of feature vectors.

  • +
  • y (np.array, pd.Series) – Targets corresponding to X, must be of same size.

  • +
  • train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.

  • +
  • val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.

  • +
  • test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.

  • +
  • random_state (int, optional) – The random seed used throughout astartes.

  • +
  • samplers_hopts (dict, optional) – Should be a dictionary of dictionaries with the keys specifying +the sampler and the values being another dictionary with the +corresponding hyperparameters. Defaults to {}.

  • +
  • print_results (bool, optional) – whether to print the resulting dictionary as a neat table

  • +
  • additional_metrics (dict, optional) – mapping of name (str) to metric (func) for additional metrics +such as those in sklearn.metrics or user-provided functions

  • +
+
+
Returns:
+

+
nested dictionary with the format of

{
    sampler: {
        'mae': {
            'train': [],
            'val': [],
            'test': [],
        },
        'rmse': {
            'train': [],
            'val': [],
            'test': [],
        },
        'R2': {
            'train': [],
            'val': [],
            'test': [],
        },
    },
}

+
+
+

+
+
Return type:
+

dict
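Since the return value is a three-level dictionary (sampler, then metric, then split), a short helper can pull one metric on one split across all samplers. The dictionary below is hand-written in the documented layout with made-up numbers; it is not real output, and `metric_on_split` is a hypothetical name.

```python
# Hypothetical results in the documented layout; the numbers are made up.
results = {
    "random": {
        "mae": {"train": [0.10], "val": [0.15], "test": [0.20]},
        "rmse": {"train": [0.12], "val": [0.18], "test": [0.25]},
        "R2": {"train": [0.95], "val": [0.90], "test": [0.85]},
    },
}


def metric_on_split(results, metric, split):
    """Collect one metric for one split across every sampler in the dict."""
    return {sampler: metrics[metric][split] for sampler, metrics in results.items()}


test_mae = metric_on_split(results, "mae", "test")
print(test_mae)  # {'random': [0.2]}
```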

+
+
+
+ +
+
+

astartes.utils.warnings module

+

Warnings used by astartes

+
+
+exception astartes.utils.warnings.ConversionWarning(message=None)
+

Bases: RuntimeWarning

+

Used when passed data is not a numpy array.

+
+
+__init__(message=None)
+
+ +
+ +
+
+exception astartes.utils.warnings.ImperfectSplittingWarning(message=None)
+

Bases: RuntimeWarning

+

Used when a sampler cannot match requested splits.

+
+
+__init__(message=None)
+
+ +
+ +
+
+exception astartes.utils.warnings.NoMatchingScaffold(message=None)
+

Bases: Warning

+

Used when an RDKit molecule does not match any +Bemis-Murcko scaffold and returns an empty string.

+
+
+__init__(message=None)
+
+ +
+ +
+
+exception astartes.utils.warnings.NormalizationWarning(message=None)
+

Bases: RuntimeWarning

+

Used when a requested split does not add to 1.

+
+
+__init__(message=None)
+
+ +
+ +
+
+

Module contents

+
+
+astartes.utils.generate_regression_results_dict(sklearn_model, X, y, samplers=['random'], random_state=0, samplers_hopts={}, train_size=0.8, val_size=0.1, test_size=0.1, print_results=False, additional_metrics={})
+

Helper function to train a sklearn model using the provided data +and provided sampler types.

+
+
Parameters:
+
    +
  • X (np.array, pd.DataFrame) – Numpy array or pandas DataFrame of feature vectors.

  • +
  • y (np.array, pd.Series) – Targets corresponding to X, must be of same size.

  • +
  • train_size (float, optional) – Fraction of dataset to use in training set. Defaults to 0.8.

  • +
  • val_size (float, optional) – Fraction of dataset to use in validation set. Defaults to 0.1.

  • +
  • test_size (float, optional) – Fraction of dataset to use in test set. Defaults to 0.1.

  • +
  • random_state (int, optional) – The random seed used throughout astartes.

  • +
  • samplers_hopts (dict, optional) – Should be a dictionary of dictionaries with the keys specifying +the sampler and the values being another dictionary with the +corresponding hyperparameters. Defaults to {}.

  • +
  • print_results (bool, optional) – whether to print the resulting dictionary as a neat table

  • +
  • additional_metrics (dict, optional) – mapping of name (str) to metric (func) for additional metrics +such as those in sklearn.metrics or user-provided functions

  • +
+
+
Returns:
+

+
nested dictionary with the format of

{
    sampler: {
        'mae': {
            'train': [],
            'val': [],
            'test': [],
        },
        'rmse': {
            'train': [],
            'val': [],
            'test': [],
        },
        'R2': {
            'train': [],
            'val': [],
            'test': [],
        },
    },
}

+
+
+

+
+
Return type:
+

dict

+
+
+
+ +
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/astartes.utils.rst b/docs/astartes.utils.rst new file mode 100644 index 00000000..729aa702 --- /dev/null +++ b/docs/astartes.utils.rst @@ -0,0 +1,61 @@ +astartes.utils package +====================== + +Submodules +---------- + +astartes.utils.array\_type\_helpers module +------------------------------------------ + +.. automodule:: astartes.utils.array_type_helpers + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.exceptions module +-------------------------------- + +.. automodule:: astartes.utils.exceptions + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.fast\_kennard\_stone module +------------------------------------------ + +.. automodule:: astartes.utils.fast_kennard_stone + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.sampler\_factory module +-------------------------------------- + +.. automodule:: astartes.utils.sampler_factory + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.user\_utils module +--------------------------------- + +.. automodule:: astartes.utils.user_utils + :members: + :undoc-members: + :show-inheritance: + +astartes.utils.warnings module +------------------------------ + +.. automodule:: astartes.utils.warnings + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: astartes.utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/environment.pickle b/docs/environment.pickle new file mode 100644 index 00000000..483395c4 Binary files /dev/null and b/docs/environment.pickle differ diff --git a/docs/genindex.html b/docs/genindex.html new file mode 100644 index 00000000..70d21f47 --- /dev/null +++ b/docs/genindex.html @@ -0,0 +1,969 @@ + + + + + + Index — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+
    +
  • + +
  • +
  • +
+
+
+
+
+ + +

Index

+ +
+ _ + | A + | C + | D + | F + | G + | I + | K + | M + | N + | O + | P + | R + | S + | T + | U + +
+

_

+ + +
+ +

A

+ + + +
    +
  • AbstractSampler (class in astartes.samplers.abstract_sampler) +
  • +
  • + astartes + +
  • +
  • + astartes.main + +
  • +
  • + astartes.molecules + +
  • +
  • + astartes.samplers + +
  • +
  • + astartes.samplers.abstract_sampler + +
  • +
  • + astartes.samplers.extrapolation + +
  • +
  • + astartes.samplers.extrapolation.dbscan + +
  • +
  • + astartes.samplers.extrapolation.kmeans + +
  • +
  • + astartes.samplers.extrapolation.optisim + +
  • +
  • + astartes.samplers.extrapolation.scaffold + +
  • +
  • + astartes.samplers.extrapolation.sphere_exclusion + +
  • +
    +
  • + astartes.samplers.extrapolation.time_based + +
  • +
  • + astartes.samplers.interpolation + +
  • +
  • + astartes.samplers.interpolation.kennardstone + +
  • +
  • + astartes.samplers.interpolation.random_split + +
  • +
  • + astartes.samplers.interpolation.spxy + +
  • +
  • + astartes.utils + +
  • +
  • + astartes.utils.array_type_helpers + +
  • +
  • + astartes.utils.exceptions + +
  • +
  • + astartes.utils.fast_kennard_stone + +
  • +
  • + astartes.utils.sampler_factory + +
  • +
  • + astartes.utils.user_utils + +
  • +
  • + astartes.utils.warnings + +
  • +
+ +

C

+ + + +
+ +

D

+ + + +
+ +

F

+ + +
+ +

G

+ + + +
+ +

I

+ + + +
+ +

K

+ + + +
+ +

M

+ + + +
+ +

N

+ + + +
+ +

O

+ + +
+ +

P

+ + +
+ +

R

+ + + +
+ +

S

+ + + +
+ +

T

+ + + +
+ +

U

+ + +
+ + + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/index.doctree b/docs/index.doctree new file mode 100644 index 00000000..411881fe Binary files /dev/null and b/docs/index.doctree differ diff --git a/docs/index.html b/docs/index.html new file mode 100644 index 00000000..b105a37e --- /dev/null +++ b/docs/index.html @@ -0,0 +1,186 @@ + + + + + + + astartes documentation — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + +
+ + +
+ + +
+
+ + + + \ No newline at end of file diff --git a/docs/modules.doctree b/docs/modules.doctree new file mode 100644 index 00000000..d80d509f Binary files /dev/null and b/docs/modules.doctree differ diff --git a/docs/modules.html b/docs/modules.html new file mode 100644 index 00000000..2ff34e79 --- /dev/null +++ b/docs/modules.html @@ -0,0 +1,191 @@ + + + + + + + astartes — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ + +
+
+ + + + \ No newline at end of file diff --git a/docs/modules.rst b/docs/modules.rst new file mode 100644 index 00000000..831c4b4f --- /dev/null +++ b/docs/modules.rst @@ -0,0 +1,8 @@ +astartes +======== + +.. toctree:: + :maxdepth: 4 + + astartes + test diff --git a/docs/objects.inv b/docs/objects.inv new file mode 100644 index 00000000..007f4353 Binary files /dev/null and b/docs/objects.inv differ diff --git a/docs/py-modindex.html b/docs/py-modindex.html new file mode 100644 index 00000000..0d10e32c --- /dev/null +++ b/docs/py-modindex.html @@ -0,0 +1,363 @@ + + + + + + Python Module Index — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+
    +
  • + +
  • +
  • +
+
+
+
+
+ + +

Python Module Index

+ +
+ a | + t +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
 
+ a
+ astartes +
    + astartes.main +
    + astartes.molecules +
    + astartes.samplers +
    + astartes.samplers.abstract_sampler +
    + astartes.samplers.extrapolation +
    + astartes.samplers.extrapolation.dbscan +
    + astartes.samplers.extrapolation.kmeans +
    + astartes.samplers.extrapolation.optisim +
    + astartes.samplers.extrapolation.scaffold +
    + astartes.samplers.extrapolation.sphere_exclusion +
    + astartes.samplers.extrapolation.time_based +
    + astartes.samplers.interpolation +
    + astartes.samplers.interpolation.kennardstone +
    + astartes.samplers.interpolation.random_split +
    + astartes.samplers.interpolation.spxy +
    + astartes.utils +
    + astartes.utils.array_type_helpers +
    + astartes.utils.exceptions +
    + astartes.utils.fast_kennard_stone +
    + astartes.utils.sampler_factory +
    + astartes.utils.user_utils +
    + astartes.utils.warnings +
 
+ t
+ test +
    + test.functional +
    + test.functional.test_astartes +
    + test.functional.test_molecules +
    + test.regression +
    + test.regression.test_regression +
    + test.unit +
    + test.unit.samplers +
    + test.unit.samplers.extrapolative +
    + test.unit.samplers.extrapolative.test_DBSCAN +
    + test.unit.samplers.extrapolative.test_kmeans +
    + test.unit.samplers.extrapolative.test_optisim +
    + test.unit.samplers.extrapolative.test_Scaffold +
    + test.unit.samplers.extrapolative.test_sphere_exclusion +
    + test.unit.samplers.extrapolative.test_time_based +
    + test.unit.samplers.interpolative +
    + test.unit.samplers.interpolative.test_kennard_stone +
    + test.unit.samplers.interpolative.test_random +
    + test.unit.samplers.interpolative.test_spxy +
    + test.unit.utils +
    + test.unit.utils.test_convert_to_array +
    + test.unit.utils.test_sampler_factory +
    + test.unit.utils.test_utils +
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/search.html b/docs/search.html new file mode 100644 index 00000000..b3370440 --- /dev/null +++ b/docs/search.html @@ -0,0 +1,133 @@ + + + + + + Search — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+
    +
  • + +
  • +
  • +
+
+
+
+
+ + + + +
+ +
+ +
+
+ +
+
+
+
+ + + + + + + + + \ No newline at end of file diff --git a/docs/searchindex.js b/docs/searchindex.js new file mode 100644 index 00000000..24194195 --- /dev/null +++ b/docs/searchindex.js @@ -0,0 +1 @@ +Search.setIndex({"docnames": ["CONTRIBUTING", "README", "astartes", "astartes.samplers", "astartes.samplers.extrapolation", "astartes.samplers.interpolation", "astartes.utils", "index", "modules", "sklearn_to_astartes", "test", "test.functional", "test.regression", "test.unit", "test.unit.samplers", "test.unit.samplers.extrapolative", "test.unit.samplers.interpolative", "test.unit.utils"], "filenames": ["CONTRIBUTING.rst", "README.rst", "astartes.rst", "astartes.samplers.rst", "astartes.samplers.extrapolation.rst", "astartes.samplers.interpolation.rst", "astartes.utils.rst", "index.rst", "modules.rst", "sklearn_to_astartes.rst", "test.rst", "test.functional.rst", "test.regression.rst", "test.unit.rst", "test.unit.samplers.rst", "test.unit.samplers.extrapolative.rst", "test.unit.samplers.interpolative.rst", "test.unit.utils.rst"], "titles": ["Contributing & Developer Notes", "Online Documentation", "astartes package", "astartes.samplers package", "astartes.samplers.extrapolation package", "astartes.samplers.interpolation package", "astartes.utils package", "astartes documentation", "astartes", "Transitioning from sklearn to astartes", "test package", "test.functional package", "test.regression package", "test.unit package", "test.unit.samplers package", "test.unit.samplers.extrapolative package", "test.unit.samplers.interpolative package", "test.unit.utils package"], "terms": {"pull": 0, "request": [0, 1, 6, 9, 11, 16], "bug": [0, 1], "report": 0, "all": [0, 1, 3, 4, 9, 11, 17], "ar": [0, 1, 4, 9], "welcom": 0, "encourag": [0, 9], "appreci": 0, "pleas": [0, 1], "us": [0, 2, 4, 6, 7, 11, 15, 16, 17], "appropri": [0, 1, 6], "issu": 0, "templat": 0, "when": [0, 1, 6, 9, 11, 16, 17], "make": [0, 1, 9], "help": [0, 1, 9, 11], "maintain": 0, "get": [0, 1, 3, 9], "merg": 0, 
"quickli": [0, 1], "we": [0, 1, 4, 6, 9], "github": [0, 1, 9], "discuss": [0, 1], "page": [0, 1], "go": [0, 4], "over": [0, 6], "potenti": [0, 1], "add": [0, 4, 6, 9], "feel": 0, "free": 0, "stop": [0, 4], "you": [0, 1, 9], "look": [0, 1, 4, 9], "someth": [0, 15], "have": [0, 1, 4, 9], "an": [0, 1, 4, 6, 7, 11], "idea": 0, "submit": 0, "pr": 0, "mark": 0, "your": [0, 1, 9], "readi": 0, "review": 0, "label": [0, 1, 2, 3, 4, 5, 6, 9, 11, 12, 15], "finish": 0, "chang": [0, 1, 7], "so": [0, 1, 9], "action": 0, "bot": 0, "can": [0, 1, 5, 9, 11], "work": [0, 1, 4, 6, 9], "magic": 0, "To": [0, 1, 4, 9], "astart": [0, 11], "sourc": [0, 4, 5, 7], "code": [0, 5, 6], "start": [0, 4, 7, 11], "fork": 0, "clone": [0, 1], "repositori": [0, 1], "i": [0, 1, 3, 4, 5, 6, 9, 11, 16], "e": [0, 1, 9], "git": [0, 1], "com": [0, 1, 4], "yourusernam": 0, "insid": 0, "run": [0, 1], "pip": [0, 7, 9], "dev": 0, "thi": [0, 1, 4, 5, 6, 9], "set": [0, 1, 2, 4, 5, 6, 9, 11], "up": 0, "requir": [0, 1, 9, 16], "depend": [0, 1], "conform": 0, "our": [0, 1, 9], "format": [0, 1, 6], "standard": [0, 9], "black": 0, "isort": 0, "which": [0, 1, 4, 5, 6, 9, 12], "configur": [0, 1], "automat": 0, "vscode": 0, "like": [0, 1, 4, 5, 9], "warn": [0, 1, 2, 8, 11, 15, 17], "window": [0, 1], "powershel": [0, 1], "maco": [0, 1], "catalina": [0, 1], "newer": [0, 1, 12], "zsh": [0, 1], "doubl": [0, 1], "quot": [0, 1], "around": [0, 1], "charact": [0, 1], "pyproject": 0, "toml": 0, "specifi": [0, 1, 6, 7, 11, 15], "metadata": [0, 6], "also": [0, 1, 4, 6, 9], "__init__": [0, 2, 3, 6], "py": [0, 17], "via": [0, 1], "__version__": 0, "backward": [0, 1], "compat": [0, 1, 9, 11], "python": [0, 1, 9], "3": [0, 1, 4, 7, 12], "7": [0, 1], "c": [0, 1], "import": [0, 1, 7], "print": [0, 6], "from": [0, 1, 2, 4, 5, 7, 15], "importlib": 0, "8": [0, 1, 2, 6, 9], "written": 0, "built": [0, 1], "unittest": 0, "modul": [0, 1, 7, 8], "allow": [0, 1, 9], "without": [0, 6], "pytest": 0, "highli": 0, "recommend": [0, 1], "execut": [0, 
1], "simpli": [0, 1, 9], "type": [0, 1, 2, 3, 6, 9, 17], "after": [0, 1, 2, 4, 6], "altern": [0, 1], "v": 0, "more": [0, 1, 4, 6], "output": [0, 5], "On": 0, "everi": [0, 4, 11], "nightli": 0, "basi": 0, "workflow": [0, 1, 9], "inform": [0, 1], "These": [0, 1], "includ": [0, 1, 4, 9, 15], "unit": [0, 1, 5, 8, 10], "regress": [0, 7, 8, 10, 17], "should": [0, 1, 6, 9, 11, 15, 16, 17], "extend": 0, "abstract_sampl": [0, 2, 8], "abstract": [0, 3, 4], "base": [0, 1, 3, 4, 5, 6, 11, 12, 15, 16, 17], "class": [0, 3, 4, 5, 6, 11, 12, 15, 16, 17], "each": [0, 1, 4], "subclass": 0, "overrid": 0, "_sampl": [0, 1], "method": [0, 1, 5, 12], "its": [0, 1, 9], "own": [0, 1], "data": [0, 2, 4, 6, 7, 9, 11, 15], "partit": [0, 1, 4, 5], "option": [0, 1, 2, 3, 6, 9, 11], "_before_sampl": 0, "perform": [0, 1, 6, 9], "ani": [0, 1, 3, 4, 6], "valid": [0, 1, 2, 4, 5, 6, 9, 11], "classifi": 0, "one": [0, 1, 4, 9, 16], "two": [0, 1, 4, 9], "extrapol": [0, 1, 2, 3, 11, 12, 13, 14], "interpol": [0, 1, 2, 3, 11, 12, 13, 14], "cluster": [0, 1, 3, 4], "group": 0, "train": [0, 1, 2, 4, 6, 9, 11], "enforc": 0, "wherea": 0, "provid": [0, 1, 3, 6, 16], "exact": [0, 1, 9], "order": [0, 3, 6], "move": [0, 1, 4, 9], "actual": [0, 1, 9], "mean": [0, 1], "self": [0, 1, 4], "_samples_clust": 0, "attribut": [0, 3, 6, 11, 12, 15, 16, 17], "_samples_idx": 0, "simpl": [0, 1, 17], "passthrough": [0, 1], "anoth": [0, 6], "train_test_split": [0, 1, 2, 8, 9, 11, 15, 16], "origin": [0, 1, 4, 5], "result": [0, 1, 4, 6, 11, 12, 15, 16, 17], "x": [0, 1, 2, 3, 4, 5, 6, 9, 11], "y": [0, 1, 2, 3, 4, 5, 6, 9, 11, 16], "being": [0, 4, 6], "split": [0, 2, 4, 6, 7, 9, 11, 12, 16, 17], "list": [0, 1, 2, 4, 9], "take": [0, 1, 4], "random_split": [0, 2, 3], "basic": 0, "exampl": [0, 7, 9], "ha": [0, 1, 4, 5, 9, 12], "been": [0, 1, 4, 5, 9], "addit": [0, 1, 6, 9], "verifi": [0, 1, 15, 16], "hyperparamet": [0, 1, 2, 6, 9, 11], "properli": [0, 1], "pass": [0, 1, 3, 6, 7, 11], "etc": 0, "For": [0, 1, 9], "histor": 0, "reason": 0, 
"guid": [0, 1], "who": 0, "would": [0, 4, 6, 11], "below": [0, 1, 3, 9], "consid": [0, 1], "asart": 0, "ultim": [0, 5], "variou": [0, 11, 15, 16], "name": [0, 1, 6], "relev": 0, "link": [0, 1, 5], "": [0, 1, 2, 9], "d": 0, "optim": [0, 1], "priori": 0, "knowledg": 0, "size": [0, 2, 3, 4, 6, 11], "doe": [0, 6], "fit": [0, 4, 6], "framework": [0, 4], "agnost": 0, "question": 0, "fischer": 0, "matrix": [0, 6], "meaning": 0, "context": 0, "exist": [0, 1, 6, 9], "rather": [0, 9], "than": [0, 1, 4, 9, 15], "tune": [0, 9], "ideal": 0, "wikipedia": [0, 1], "articl": [0, 1], "design": [0, 1, 9], "good": 0, "job": 0, "explain": [0, 1], "why": [0, 1], "difficult": [0, 1], "point": [0, 1, 4, 9], "some": [0, 1, 6, 9], "duplex": 0, "know": [0, 1], "befor": 0, "onli": [0, 1], "incompat": [0, 1], "r": [0, 1], "refer": [0, 1, 12], "accept": [0, 1], "arbitrari": [0, 1, 2, 4, 9], "arrai": [0, 1, 2, 3, 6, 9, 11, 16, 17], "number": [0, 3, 15], "return": [0, 1, 2, 3, 4, 6, 9, 11, 17], "except": [0, 1, 2, 8, 9, 11], "scaffold": [0, 1, 2, 3, 6, 15], "If": [0, 1, 4, 9, 11], "input": [0, 1, 9, 11, 15, 17], "turn": 0, "thrill": 0, "interfac": [0, 11], "def": [0, 9], "train_test_split_interfac": 0, "interface_input": 0, "interface_arg": 0, "np": [0, 2, 3, 6, 9], "none": [0, 1, 2, 3, 6, 9], "test_siz": [0, 1, 2, 6, 9], "float": [0, 2, 6, 9], "0": [0, 1, 2, 6, 9], "25": [0, 9], "train_siz": [0, 1, 2, 6, 9], "75": [0, 2, 9], "splitter": 0, "str": [0, 2, 3, 6, 9], "random": [0, 1, 2, 3, 4, 5, 6, 9, 16], "hopt": [0, 1, 2, 3, 6, 9, 11], "dict": [0, 2, 3, 6, 9], "interface_hopt": 0, "arg": 0, "where": [0, 9], "behavior": [0, 1, 9], "call": [0, 1, 3, 6, 9, 15, 17], "possibl": [0, 1, 4, 9, 16], "jupyt": 0, "notebook": [0, 7], "demonstr": 0, "user": [0, 1, 6, 9, 11], "how": [0, 7, 9], "see": [0, 1, 2, 9], "other": [0, 1, 9, 15], "directori": [0, 1], "contact": 0, "jacksonburn": [0, 1], "need": [0, 4, 7, 9], "assist": 0, "mai": [0, 1], "extra": [0, 9], "packag": [0, 1, 7, 8, 9], "same": [0, 1, 2, 4, 6, 
9], "wai": [0, 1, 9], "molecul": [0, 4, 6, 8, 9, 11, 15], "workhors": [0, 9], "It": 0, "respons": [0, 1], "instanti": [0, 6, 15, 16], "while": [0, 1, 9], "keep": [0, 1], "ey": 0, "under": [0, 1], "hood": 0, "just": [0, 4, 9], "val_siz": [0, 2, 6, 9], "out": [0, 1, 9], "inlin": 0, "document": 0, "main": [0, 1, 8], "priorit": 0, "1": [0, 1, 2, 4, 6, 7], "reproduc": [0, 7], "2": [0, 1, 4, 7, 12], "flexibl": 0, "produc": [0, 1, 9], "across": [0, 1], "platform": [0, 1], "thorough": 0, "continu": [0, 1], "few": [0, 9], "loosest": 0, "integr": [0, 1], "tool": [0, 1], "easili": 0, "introduc": [0, 1, 4], "lot": [0, 6], "specif": [0, 7], "shuffl": [0, 1, 3, 11], "extras_requir": 0, "avoid": 0, "weigh": 0, "down": [0, 11], "modern": 0, "achiev": [0, 1], "tightli": 0, "well": [0, 1, 4, 12], "follow": [0, 1], "dry": 0, "don": 0, "t": [0, 11], "repeat": [0, 4], "yourself": 0, "principl": 0, "duplic": 0, "decreas": 0, "burden": 0, "perfect": [0, 1], "coverag": 0, "consist": [0, 1], "style": 0, "comment": 0, "critic": [0, 1], "time": [0, 1, 9], "write": 0, "line": [0, 9], "correspond": [0, 1, 2, 5, 6], "paper": [0, 1, 4, 5], "store": [0, 1], "separ": [0, 1, 9], "find": 0, "md": [0, 1], "aptli": 0, "push": 0, "updat": 0, "tar": 1, "tee": 1, "raw": 1, "html": [1, 4, 9], "m2r": 1, "p": 1, "align": 1, "center": [1, 4], "img": 1, "alt": 1, "astarteslogo": 1, "src": 1, "http": [1, 4, 9], "githubusercont": 1, "astartes_logo": 1, "png": 1, "statu": 1, "badg": 1, "usag": 1, "releas": 1, "nice": 1, "render": 1, "version": [1, 7], "readm": [1, 9], "along": 1, "tutori": 1, "sklearn": [1, 6, 7, 12], "read": 1, "within": [1, 4, 17], "virtual": 1, "environ": [1, 9], "either": [1, 6], "venv": 1, "simplifi": 1, "manag": 1, "9": 1, "10": [1, 4, 5], "11": [1, 4], "12": 1, "support": [1, 6, 9, 17], "text": 1, "avail": [1, 9], "pypi": 1, "featur": [1, 2, 4, 6, 7, 11], "fewer": 1, "readili": 1, "forg": 1, "command": 1, "aimsim": [1, 2, 9], "download": 1, "backend": 1, "molecular": [1, 2, 4, 9, 11], 
"section": 1, "machin": [1, 9], "learn": [1, 9], "spark": 1, "explos": 1, "progress": 1, "kinet": 1, "materi": 1, "scienc": 1, "mani": [1, 6, 9], "field": 1, "research": 1, "driven": 1, "acceler": 1, "step": [1, 4, 7, 11], "tradit": 1, "error": [1, 6, 11, 17], "toler": 1, "facilit": 1, "adopt": 1, "task": [1, 17], "select": [1, 4, 6], "held": [1, 9], "measur": [1, 4], "unseen": 1, "both": 1, "futur": 1, "address": 1, "function": [1, 4, 6, 7, 8, 9, 10, 15, 16, 17], "technic": 1, "detail": 1, "companion": 1, "journal": 1, "open": [1, 4], "softwar": 1, "dataset": [1, 2, 4, 6], "demo": 1, "fast": 1, "food": 1, "menu": 1, "publish": 1, "state": 1, "engin": 1, "confer": 1, "gener": [1, 9, 17], "involv": 1, "discoveri": 1, "infer": 1, "There": [1, 9], "cheminformat": [1, 4, 9], "numer": 1, "drop": 1, "replac": 1, "switch": [1, 9], "model_select": [1, 9], "iter": 1, "object": [1, 2, 6, 9, 11, 15], "convert": [1, 4, 6], "numpi": [1, 2, 6, 9], "intern": 1, "oper": 1, "panda": [1, 2, 6], "datafram": [1, 2, 6, 17], "seri": [1, 2, 6, 17], "cast": [1, 6, 17], "back": 1, "index": [1, 7], "column": [1, 6], "The": [1, 2, 4, 5, 6, 7, 9], "handl": [1, 17], "convers": [1, 4], "explicitli": 1, "behind": [1, 9], "scene": [1, 9], "lead": 1, "unexpect": 1, "By": [1, 9], "default": [1, 2, 3, 6, 9], "randomli": 1, "addition": 1, "varieti": 1, "approach": 1, "sampler": [1, 2, 6, 7, 8, 10, 11, 13, 17], "argument": [1, 7], "tabl": [1, 6], "complet": [1, 9], "load_diabet": 1, "return_x_i": 1, "true": [1, 2, 9], "x_train": [1, 9], "x_test": [1, 9], "y_train": [1, 9], "y_test": [1, 9], "prefer": 1, "kennard_ston": [1, 9, 16], "valueerror": [1, 15], "too": 1, "valu": [1, 2, 3, 4, 6, 9, 11], "unpack": 1, "split_comparison": 1, "googl": 1, "colab": 1, "blob": 1, "ipynb": 1, "_": [1, 9], "full": [1, 4], "explan": 1, "That": [1, 9], "next": 1, "try": [1, 9, 11], "browser": 1, "click": 1, "taken": 1, "live": 1, "interact": 1, "local": 1, "navig": 1, "editor": 1, "do": [1, 6, 9, 17], "cell": 1, 
"prefix": 1, "captur": 1, "thei": 1, "present": [1, 3], "rigor": 1, "ml": 1, "dure": [1, 9], "never": [1, 4], "unlik": 1, "accur": 1, "With": [1, 9], "three": [1, 9], "x_val": [1, 9], "sphere_exclus": [1, 2, 3, 9, 15], "now": [1, 9], "visual": 1, "differ": [1, 5, 11], "distribut": 1, "aid": 1, "analyz": 1, "generate_regression_results_dict": [1, 2, 6], "techniqu": 1, "nest": [1, 6], "dictionari": [1, 4, 6, 17], "metric": [1, 4, 6], "score": 1, "displai": 1, "neatli": 1, "print_result": [1, 6], "svm": 1, "linearsvr": 1, "util": [1, 2, 8, 10, 13], "grrd": 1, "sklearn_model": [1, 6], "results_dict": 1, "val": [1, 2, 6], "mae": [1, 6], "41522": 1, "13435": 1, "17091": 1, "rmse": [1, 6], "03062": 1, "73721": 1, "40041": 1, "r2": [1, 6], "90745": 1, "80787": 1, "78412": 1, "additional_metr": [1, 6], "map": [1, 4, 6, 15], "string": [1, 2, 3, 4, 6, 9], "itself": 1, "mean_absolute_percentage_error": 1, "add_met": 1, "mape": 1, "docstr": [1, 11], "whose": 1, "distance_metr": 1, "effect": 1, "co": 1, "opt": 1, "encod": [1, 9], "choic": 1, "jaccard_scor": 1, "distanc": [1, 4, 5, 6], "did": 1, "incept": 1, "though": [1, 4, 9], "adapt": [1, 4], "interest": [1, 9], "ad": [1, 4, 7], "In": [1, 9], "kennard": [1, 6, 9, 16], "stone": [1, 6, 9, 16], "retriev": 1, "kennardston": [1, 2, 3, 16], "4": [1, 4, 7], "5": [1, 4, 7], "6": [1, 4], "first_2_sampl": 1, "get_sample_idx": [1, 2, 3], "constructor": 1, "greedili": 1, "get_sampler_idx": 1, "get_cluster_idx": 1, "respect": 1, "implementaiton": 1, "motiv": 1, "comprehens": 1, "walkthrough": 1, "freeli": 1, "host": 1, "here": [1, 4], "public": 1, "wherev": 1, "lock": 1, "paywal": 1, "denot": 1, "small_blue_diamond": 1, "instead": [1, 2, 4, 6, 9], "suggest": [1, 4, 11], "absolut": 1, "attempt": [1, 6], "bypass": 1, "much": 1, "done": 1, "between": [1, 4], "similar": [1, 4], "divid": 1, "sphere": [1, 4, 9], "exclus": [1, 4, 9], "tropsha": 1, "et": [1, 4], "al": [1, 4], "optisim": [1, 2, 3, 15], "appli": 1, "chemoinformat": 1, "opportun": 1, 
"incorpor": 1, "dbscan": [1, 2, 3, 15], "broad": 1, "categori": 1, "former": 1, "forc": 1, "predict": 1, "creat": [1, 4], "challeng": 1, "current": [1, 9], "understand": 1, "random_st": [1, 2, 3, 6, 9, 11], "defin": 1, "keyword": [1, 7], "even": 1, "overwritten": 1, "direct": 1, "euclidian": [1, 4], "describ": [1, 4, 5], "joint": 1, "spxy": [1, 2, 3, 16], "saldhana": 1, "extens": 1, "mahalanobi": 1, "mdk": 1, "deriv": 1, "saptoro": 1, "include_chir": [1, 4], "bemi": [1, 4, 6], "murcko": [1, 4, 6], "rdkit": [1, 2, 4, 6, 9, 11], "smile": [1, 2, 4, 9, 15], "distance_cutoff": [1, 9], "custom": 1, "variat": 1, "vector": [1, 2, 4, 6, 9], "time_bas": [1, 2, 3, 15], "chen": 1, "sheridan": 1, "feinberg": 1, "strubl": 1, "date": [1, 15], "datetim": [1, 15], "optimiz": [1, 4], "k": [1, 4], "dissimilar": [1, 4], "n_cluster": [1, 9], "max_subsample_s": 1, "kmean": [1, 2, 3, 9, 12, 15], "n_init": 1, "scikit": 1, "org": [1, 4], "stabl": 1, "densiti": 1, "spatial": [1, 4], "nois": 1, "ep": 1, "min_sampl": 1, "leaf_siz": 1, "minimum": 1, "mtsd": 1, "upcom": 1, "v1": [1, 12], "restrict": 1, "boltzmann": 1, "rbm": 1, "kohonen": 1, "organ": 1, "som": 1, "new": [1, 3, 7], "enorm": 1, "chemistri": 1, "relat": 1, "due": 1, "high": 1, "dimension": 1, "space": 1, "accuraci": 1, "process": [1, 4], "pre": [1, 4], "common": 1, "train_test_split_molecul": [1, 2, 8, 9], "ident": [1, 9], "control": [1, 9], "fingerprint": [1, 2, 11], "daylight_fingerprint": [1, 11], "fprints_hopt": [1, 2], "minpath": 1, "maxpath": 1, "fpsize": 1, "200": 1, "bitsperhash": 1, "useh": 1, "tgtdensiti": 1, "minsiz": 1, "64": 1, "42": [1, 2, 9], "brief": 1, "scheme": [1, 7, 9], "found": [1, 5], "most": [1, 4, 9], "shown": 1, "abov": [1, 2, 5], "aim": 1, "alwai": 1, "end": 1, "seed": [1, 2, 6], "debian": 1, "ubuntu": 1, "intel": 1, "mac": 1, "through": [1, 9, 11], "inevit": 1, "extern": 1, "catch": 1, "initi": [1, 4, 6], "affect": 1, "given": [1, 4, 9], "abil": [1, 11, 15], "m1": 1, "manual": 1, "reproducbl": 1, "case": 
[1, 4], "occasion": 1, "appl": 1, "silicon": 1, "still": [1, 9], "appar": 1, "button": 1, "instruct": 1, "guidanc": 1, "abstractsampl": [2, 3, 4, 5], "get_clust": [2, 3, 6], "get_config": [2, 3], "get_sorted_cluster_count": [2, 3], "array_type_help": [2, 8], "convert_to_arrai": [2, 6], "panda_handla": [2, 6], "return_help": [2, 6], "invalidconfigurationerror": [2, 6], "invalidmodeltypeerror": [2, 6], "moleculesnotinstallederror": [2, 6], "samplernotimplementederror": [2, 6], "uncastableinputerror": [2, 6], "fast_kennard_ston": [2, 8], "sampler_factori": [2, 8], "samplerfactori": [2, 6, 17], "get_sampl": [2, 6], "user_util": [2, 8], "display_results_as_t": [2, 6], "conversionwarn": [2, 6], "imperfectsplittingwarn": [2, 6], "nomatchingscaffold": [2, 6], "normalizationwarn": [2, 6], "int": [2, 6], "return_indic": [2, 6], "bool": [2, 6, 9], "fals": [2, 4, 6, 9], "determinist": 2, "paramet": [2, 3, 6], "target": [2, 6, 7], "must": [2, 6], "fraction": [2, 6], "test": [2, 4, 5, 6, 7, 8, 9], "implemented_int": 2, "extrapolation_sampl": 2, "throughout": [2, 6], "indic": [2, 4, 6, 9, 11], "train_val_test_split": [2, 7, 8, 11], "pd": [2, 6], "morgan_fingerprint": 2, "repres": 2, "reaction": 2, "train_val_test_split_molecul": [2, 8, 11], "get_dist": [3, 4], "move_item": [3, 4], "rchoos": [3, 4], "generate_bemis_murcko_scaffold": [3, 4], "scaffold_to_smil": [3, 4], "str_to_mol": [3, 4], "sphereexclus": [3, 4], "timebas": [3, 4, 12, 15], "sampl": [3, 4, 5, 7, 9], "config": [3, 4, 5], "abc": 3, "copi": [3, 6], "getter": 3, "kei": [3, 6], "_config": 3, "els": 3, "n_sampl": 3, "idx": 3, "max_shufflable_s": 3, "contain": [3, 4, 17], "cluster_id": 3, "member": [3, 4], "sort": 3, "ascend": 3, "accord": [3, 9], "algorithm": [4, 5, 6, 7], "clark": 4, "pub": 4, "ac": 4, "doi": [4, 5], "1021": 4, "ci970282v": 4, "treat": 4, "remain": 4, "candid": 4, "empti": [4, 6], "recycl": 4, "bin": 4, "subsampl": 4, "remov": 4, "greater": 4, "cutoff": 4, "otherwis": [4, 6], "until": [4, 9], "condit": 
4, "met": 4, "reach": 4, "determin": 4, "maximum": 4, "exhaust": 4, "b": 4, "quit": 4, "alreadi": [4, 9], "identifi": 4, "pick": 4, "rel": [4, 12], "suffici": 4, "As": 4, "assign": 4, "element": 4, "belong": 4, "implement": [4, 5, 6, 9], "scipi": 4, "cdist": 4, "seem": 4, "might": 4, "infinit": 4, "loop": [4, 6], "fill": [4, 6], "cannot": [4, 6, 15], "unless": 4, "empyt": 4, "somehow": 4, "fix": 4, "partial": 4, "sinc": 4, "probabl": 4, "reject": 4, "check": [4, 7, 9, 11, 12], "exit": 4, "j": [4, 5], "calcul": [4, 15], "pdist": 4, "item": [4, 6], "source_set": 4, "destintation_set": 4, "destination_set": 4, "choos": 4, "_rng": 4, "g": 4, "w": 4, "m": 4, "A": [4, 5], "properti": 4, "known": [4, 7], "drug": 4, "med": 4, "chem": 4, "1996": 4, "39": 4, "2887": 4, "2893": 4, "landrum": 4, "2006": 4, "www": 4, "goal": 4, "share": 4, "later": [4, 9, 11, 12, 15, 16], "mol": 4, "comput": 4, "param": 4, "whether": [4, 6], "chiral": [4, 15], "uniqu": 4, "inchi": [4, 15], "re": [4, 16, 17], "draw": 4, "blog": 4, "post": 4, "blogspot": 4, "2020": 4, "daylight": 4, "whitepap": 4, "clusteringwhitepap": 4, "pdf": 4, "But": [4, 9], "tanimoto": 4, "domain": [4, 7], "zero": [4, 11], "enabl": 4, "join": 5, "saldanha": 5, "cowork": 5, "calibr": 5, "subset": 5, "1016": 5, "talanta": 5, "2005": 5, "03": 5, "025": 5, "against": 5, "reflect": 5, "expect": [5, 11], "implemen": 5, "break": 5, "ti": 5, "compar": 5, "minor": 5, "inconsequenti": 5, "obj": 6, "human": 6, "readabl": 6, "helper": 6, "deal": 6, "sampler_inst": 6, "train_idx": 6, "val_idx": 6, "test_idx": 6, "output_is_panda": 6, "conveni": [6, 11, 12, 15, 16], "instanc": 6, "about": 6, "note": [6, 7, 11], "past": 6, "could": [6, 9], "prone": 6, "long": 6, "prettiest": 6, "definit": 6, "what": 6, "want": [6, 9], "messag": 6, "runtimeerror": 6, "model": [6, 7, 9], "invalid": 6, "instal": [6, 7], "non": 6, "ks_distanc": 6, "ndarrai": 6, "lowercas": 6, "desir": 6, "rais": [6, 11, 15, 17], "yet": 6, "error_dict": 6, "neat": 6, "tabul": 
6, "samplers_hopt": 6, "func": 6, "those": 6, "runtimewarn": 6, "match": 6, "onlin": 7, "conda": [7, 9], "statement": 7, "audienc": 7, "quick": 7, "withhold": 7, "evalu": 7, "impact": 7, "categor": 7, "access": 7, "directli": [7, 9, 15, 16], "theori": 7, "applic": 7, "ration": 7, "limit": 7, "cite": 7, "contribut": 7, "develop": 7, "philosophi": 7, "joss": 7, "branch": 7, "transit": 7, "subpackag": 8, "submodul": [8, 10, 13, 14], "content": 8, "test_astart": [8, 10], "test_molecul": [8, 10], "test_regress": [8, 10], "reli": 9, "becaus": 9, "part": 9, "vlachosgroup": 9, "io": 9, "pars": 9, "graph": [9, 11], "singl": 9, "place": 9, "represent": 9, "chemic": 9, "invit": 9, "explor": 9, "descriptor": [9, 11], "made": 9, "first": 9, "script": 9, "were": 9, "becom": 9, "interoper": 9, "real": 9, "them": [9, 17], "labels_train": 9, "labels_test": 9, "tunabl": 9, "fine": 9, "15": 9, "circumst": 9, "larg": 9, "memori": 9, "intens": 9, "themselv": 9, "manipul": 9, "indices_train": 9, "indices_test": 9, "benefici": 9, "usual": 9, "y_val": 9, "truli": 9, "veri": 9, "final": 9, "sens": 9, "better": 9, "wors": 9, "author": 9, "believ": 9, "event": [9, 11], "mathemat": [9, 16], "dimens": 9, "50": 9, "101": 9, "runtim": 9, "occur": 9, "quietli": 9, "felt": 9, "prudent": 9, "enter": 9, "normal": 9, "hopefulli": 9, "prevent": 9, "head": 9, "scratch": 9, "hour": 9, "debug": 9, "setupclass": [10, 11, 12, 13, 14, 15, 16, 17], "test_close_mispelling_sampl": [10, 11], "test_extrapolative_shuffl": [10, 11], "test_inconsistent_input_length": [10, 11], "test_insufficient_dataset_test": [10, 11], "test_insufficient_dataset_train": [10, 11], "test_insufficient_dataset_v": [10, 11], "test_not_implemented_sampl": [10, 11], "test_return_indic": [10, 11], "test_return_indices_with_valid": [10, 11], "test_split_valid": [10, 11], "test_train_test_split": [10, 11, 13, 17], "test_train_val_test_split": [10, 11], "test_train_val_test_split_extrpolation_shuffl": [10, 11], "test_fingerprint": [10, 11], 
"test_fprint_hopt": [10, 11], "test_maximum_cal": [10, 11], "test_molecules_with_rdkit": [10, 11], "test_molecules_with_troublesome_smil": [10, 11], "test_sampler_hopt": [10, 11], "test_validation_split_molecul": [10, 11], "test_extrapolation_regress": [10, 12], "test_interpolation_regress": [10, 12], "test_kmeans_regression_sklearn_v12": [10, 12], "test_kmeans_regression_sklearn_v13": [10, 12], "test_timebased_regress": [10, 12], "test_convert_to_arrai": [10, 13], "test_sampler_factori": [10, 13], "test_util": [10, 13], "methodnam": [11, 12, 15, 16, 17], "runtest": [11, 12, 15, 16, 17], "testcas": [11, 12, 15, 16, 17], "classmethod": [11, 12, 15, 16, 17], "typo": 11, "length": 11, "round": 11, "funat": 11, "imperfect": 11, "inhomogen": 11, "variabl": 11, "save": [12, 16, 17], "static": 12, "earlier": 12, "test_dbscan": [13, 14], "test_scaffold": [13, 14], "test_kmean": [13, 14], "test_optisim": [13, 14], "test_sphere_exclus": [13, 14], "test_time_bas": [13, 14], "test_kennard_ston": [13, 14], "test_random": [13, 14], "test_spxi": [13, 14], "test_bad_type_cast": [13, 17], "test_convertable_input": [13, 17], "test_panda_handla": [13, 17], "test_unconvertable_input": [13, 17], "test_generate_regression_results_dict": [13, 17], "test_dbscan_sampl": [14, 15], "test_include_chir": [14, 15], "test_incorrect_input": [14, 15], "test_mol_from_inchi": [14, 15], "test_no_scaffold_found_warn": [14, 15], "test_remove_atom_map": [14, 15], "test_scaffold_sampl": [14, 15], "test_kmeans_sampling_v12": [14, 15], "test_kmeans_sampling_v13": [14, 15], "test_optisim_sampl": [14, 15], "test_sphereexclus": [14, 15], "test_sphereexclusion_sampl": [14, 15], "test_mising_label": [14, 15], "test_time_based_d": [14, 15], "test_time_based_datetim": [14, 15], "test_time_based_sampl": [14, 15], "test_kennard_stone_sampl": [14, 16], "test_kennard_stone_sample_no_warn": [14, 16], "test_random_sampl": [14, 16], "test_random_sample_no_warn": [14, 16], "test_missing_i": [14, 16], "test_spxy_sampl": 
[14, 16], "typeerror": 15, "load": 15, "atom": 15, "neither": 15, "nor": 15, "Not": 15, "tt": 16, "complain": 16, "fail": 17, "factori": 17}, "objects": {"": [[2, 0, 0, "-", "astartes"], [10, 0, 0, "-", "test"]], "astartes": [[2, 0, 0, "-", "main"], [2, 0, 0, "-", "molecules"], [3, 0, 0, "-", "samplers"], [6, 0, 0, "-", "utils"]], "astartes.main": [[2, 1, 1, "", "train_test_split"], [2, 1, 1, "", "train_val_test_split"]], "astartes.molecules": [[2, 1, 1, "", "train_test_split_molecules"], [2, 1, 1, "", "train_val_test_split_molecules"]], "astartes.samplers": [[3, 0, 0, "-", "abstract_sampler"], [4, 0, 0, "-", "extrapolation"], [5, 0, 0, "-", "interpolation"]], "astartes.samplers.abstract_sampler": [[3, 2, 1, "", "AbstractSampler"]], "astartes.samplers.abstract_sampler.AbstractSampler": [[3, 3, 1, "", "__init__"], [3, 3, 1, "", "get_clusters"], [3, 3, 1, "", "get_config"], [3, 3, 1, "", "get_sample_idxs"], [3, 3, 1, "", "get_sorted_cluster_counter"]], "astartes.samplers.extrapolation": [[4, 0, 0, "-", "dbscan"], [4, 0, 0, "-", "kmeans"], [4, 0, 0, "-", "optisim"], [4, 0, 0, "-", "scaffold"], [4, 0, 0, "-", "sphere_exclusion"], [4, 0, 0, "-", "time_based"]], "astartes.samplers.extrapolation.dbscan": [[4, 2, 1, "", "DBSCAN"]], "astartes.samplers.extrapolation.kmeans": [[4, 2, 1, "", "KMeans"]], "astartes.samplers.extrapolation.optisim": [[4, 2, 1, "", "OptiSim"]], "astartes.samplers.extrapolation.optisim.OptiSim": [[4, 3, 1, "", "get_dist"], [4, 3, 1, "", "move_item"], [4, 3, 1, "", "rchoose"]], "astartes.samplers.extrapolation.scaffold": [[4, 2, 1, "", "Scaffold"]], "astartes.samplers.extrapolation.scaffold.Scaffold": [[4, 3, 1, "", "generate_bemis_murcko_scaffold"], [4, 3, 1, "", "scaffold_to_smiles"], [4, 3, 1, "", "str_to_mol"]], "astartes.samplers.extrapolation.sphere_exclusion": [[4, 2, 1, "", "SphereExclusion"]], "astartes.samplers.extrapolation.time_based": [[4, 2, 1, "", "TimeBased"]], "astartes.samplers.interpolation": [[5, 0, 0, "-", "kennardstone"], [5, 0, 
0, "-", "random_split"], [5, 0, 0, "-", "spxy"]], "astartes.samplers.interpolation.kennardstone": [[5, 2, 1, "", "KennardStone"]], "astartes.samplers.interpolation.random_split": [[5, 2, 1, "", "Random"]], "astartes.samplers.interpolation.spxy": [[5, 2, 1, "", "SPXY"]], "astartes.utils": [[6, 0, 0, "-", "array_type_helpers"], [6, 0, 0, "-", "exceptions"], [6, 0, 0, "-", "fast_kennard_stone"], [6, 1, 1, "", "generate_regression_results_dict"], [6, 0, 0, "-", "sampler_factory"], [6, 0, 0, "-", "user_utils"], [6, 0, 0, "-", "warnings"]], "astartes.utils.array_type_helpers": [[6, 1, 1, "", "convert_to_array"], [6, 1, 1, "", "panda_handla"], [6, 1, 1, "", "return_helper"]], "astartes.utils.exceptions": [[6, 4, 1, "", "InvalidConfigurationError"], [6, 4, 1, "", "InvalidModelTypeError"], [6, 4, 1, "", "MoleculesNotInstalledError"], [6, 4, 1, "", "SamplerNotImplementedError"], [6, 4, 1, "", "UncastableInputError"]], "astartes.utils.exceptions.InvalidConfigurationError": [[6, 3, 1, "", "__init__"]], "astartes.utils.exceptions.InvalidModelTypeError": [[6, 3, 1, "", "__init__"]], "astartes.utils.exceptions.MoleculesNotInstalledError": [[6, 3, 1, "", "__init__"]], "astartes.utils.exceptions.SamplerNotImplementedError": [[6, 3, 1, "", "__init__"]], "astartes.utils.exceptions.UncastableInputError": [[6, 3, 1, "", "__init__"]], "astartes.utils.fast_kennard_stone": [[6, 1, 1, "", "fast_kennard_stone"]], "astartes.utils.sampler_factory": [[6, 2, 1, "", "SamplerFactory"]], "astartes.utils.sampler_factory.SamplerFactory": [[6, 3, 1, "", "__init__"], [6, 3, 1, "", "get_sampler"]], "astartes.utils.user_utils": [[6, 1, 1, "", "display_results_as_table"], [6, 1, 1, "", "generate_regression_results_dict"]], "astartes.utils.warnings": [[6, 4, 1, "", "ConversionWarning"], [6, 4, 1, "", "ImperfectSplittingWarning"], [6, 4, 1, "", "NoMatchingScaffold"], [6, 4, 1, "", "NormalizationWarning"]], "astartes.utils.warnings.ConversionWarning": [[6, 3, 1, "", "__init__"]], 
"astartes.utils.warnings.ImperfectSplittingWarning": [[6, 3, 1, "", "__init__"]], "astartes.utils.warnings.NoMatchingScaffold": [[6, 3, 1, "", "__init__"]], "astartes.utils.warnings.NormalizationWarning": [[6, 3, 1, "", "__init__"]], "test": [[11, 0, 0, "-", "functional"], [12, 0, 0, "-", "regression"], [13, 0, 0, "-", "unit"]], "test.functional": [[11, 0, 0, "-", "test_astartes"], [11, 0, 0, "-", "test_molecules"]], "test.functional.test_astartes": [[11, 2, 1, "", "Test_astartes"]], "test.functional.test_astartes.Test_astartes": [[11, 3, 1, "", "setUpClass"], [11, 3, 1, "", "test_close_mispelling_sampler"], [11, 3, 1, "", "test_extrapolative_shuffling"], [11, 3, 1, "", "test_inconsistent_input_lengths"], [11, 3, 1, "", "test_insufficient_dataset_test"], [11, 3, 1, "", "test_insufficient_dataset_train"], [11, 3, 1, "", "test_insufficient_dataset_val"], [11, 3, 1, "", "test_not_implemented_sampler"], [11, 3, 1, "", "test_return_indices"], [11, 3, 1, "", "test_return_indices_with_validation"], [11, 3, 1, "", "test_split_validation"], [11, 3, 1, "", "test_train_test_split"], [11, 3, 1, "", "test_train_val_test_split"], [11, 3, 1, "", "test_train_val_test_split_extrpolation_shuffling"]], "test.functional.test_molecules": [[11, 2, 1, "", "Test_molecules"]], "test.functional.test_molecules.Test_molecules": [[11, 3, 1, "", "setUpClass"], [11, 3, 1, "", "test_fingerprints"], [11, 3, 1, "", "test_fprint_hopts"], [11, 3, 1, "", "test_maximum_call"], [11, 3, 1, "", "test_molecules"], [11, 3, 1, "", "test_molecules_with_rdkit"], [11, 3, 1, "", "test_molecules_with_troublesome_smiles"], [11, 3, 1, "", "test_sampler_hopts"], [11, 3, 1, "", "test_validation_split_molecules"]], "test.regression": [[12, 0, 0, "-", "test_regression"]], "test.regression.test_regression": [[12, 2, 1, "", "Test_regression"]], "test.regression.test_regression.Test_regression": [[12, 3, 1, "", "setUpClass"], [12, 3, 1, "", "test_extrapolation_regression"], [12, 3, 1, "", "test_interpolation_regression"], 
[12, 3, 1, "", "test_kmeans_regression_sklearn_v12"], [12, 3, 1, "", "test_kmeans_regression_sklearn_v13"], [12, 3, 1, "", "test_timebased_regression"]], "test.unit": [[14, 0, 0, "-", "samplers"], [17, 0, 0, "-", "utils"]], "test.unit.samplers": [[15, 0, 0, "-", "extrapolative"], [16, 0, 0, "-", "interpolative"]], "test.unit.samplers.extrapolative": [[15, 0, 0, "-", "test_DBSCAN"], [15, 0, 0, "-", "test_Scaffold"], [15, 0, 0, "-", "test_kmeans"], [15, 0, 0, "-", "test_optisim"], [15, 0, 0, "-", "test_sphere_exclusion"], [15, 0, 0, "-", "test_time_based"]], "test.unit.samplers.extrapolative.test_DBSCAN": [[15, 2, 1, "", "Test_DBSCAN"]], "test.unit.samplers.extrapolative.test_DBSCAN.Test_DBSCAN": [[15, 3, 1, "", "setUpClass"], [15, 3, 1, "", "test_dbscan"], [15, 3, 1, "", "test_dbscan_sampling"]], "test.unit.samplers.extrapolative.test_Scaffold": [[15, 2, 1, "", "Test_scaffold"]], "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold": [[15, 3, 1, "", "setUpClass"], [15, 3, 1, "", "test_include_chirality"], [15, 3, 1, "", "test_incorrect_input"], [15, 3, 1, "", "test_mol_from_inchi"], [15, 3, 1, "", "test_no_scaffold_found_warning"], [15, 3, 1, "", "test_remove_atom_map"], [15, 3, 1, "", "test_scaffold"], [15, 3, 1, "", "test_scaffold_sampling"]], "test.unit.samplers.extrapolative.test_kmeans": [[15, 2, 1, "", "Test_kmeans"]], "test.unit.samplers.extrapolative.test_kmeans.Test_kmeans": [[15, 3, 1, "", "setUpClass"], [15, 3, 1, "", "test_kmeans"], [15, 3, 1, "", "test_kmeans_sampling_v12"], [15, 3, 1, "", "test_kmeans_sampling_v13"]], "test.unit.samplers.extrapolative.test_optisim": [[15, 2, 1, "", "Test_optisim"]], "test.unit.samplers.extrapolative.test_optisim.Test_optisim": [[15, 3, 1, "", "setUpClass"], [15, 3, 1, "", "test_optisim"], [15, 3, 1, "", "test_optisim_sampling"]], "test.unit.samplers.extrapolative.test_sphere_exclusion": [[15, 2, 1, "", "Test_sphere_exclusion"]], "test.unit.samplers.extrapolative.test_sphere_exclusion.Test_sphere_exclusion": 
[[15, 3, 1, "", "setUpClass"], [15, 3, 1, "", "test_sphereexclusion"], [15, 3, 1, "", "test_sphereexclusion_sampling"]], "test.unit.samplers.extrapolative.test_time_based": [[15, 2, 1, "", "Test_time_based"]], "test.unit.samplers.extrapolative.test_time_based.Test_time_based": [[15, 3, 1, "", "setUpClass"], [15, 3, 1, "", "test_incorrect_input"], [15, 3, 1, "", "test_mising_labels"], [15, 3, 1, "", "test_time_based_date"], [15, 3, 1, "", "test_time_based_datetime"], [15, 3, 1, "", "test_time_based_sampling"]], "test.unit.samplers.interpolative": [[16, 0, 0, "-", "test_kennard_stone"], [16, 0, 0, "-", "test_random"], [16, 0, 0, "-", "test_spxy"]], "test.unit.samplers.interpolative.test_kennard_stone": [[16, 2, 1, "", "Test_kennard_stone"]], "test.unit.samplers.interpolative.test_kennard_stone.Test_kennard_stone": [[16, 3, 1, "", "setUpClass"], [16, 3, 1, "", "test_kennard_stone"], [16, 3, 1, "", "test_kennard_stone_sample"], [16, 3, 1, "", "test_kennard_stone_sample_no_warning"]], "test.unit.samplers.interpolative.test_random": [[16, 2, 1, "", "Test_random"]], "test.unit.samplers.interpolative.test_random.Test_random": [[16, 3, 1, "", "setUpClass"], [16, 3, 1, "", "test_random"], [16, 3, 1, "", "test_random_sample"], [16, 3, 1, "", "test_random_sample_no_warning"]], "test.unit.samplers.interpolative.test_spxy": [[16, 2, 1, "", "Test_SPXY"]], "test.unit.samplers.interpolative.test_spxy.Test_SPXY": [[16, 3, 1, "", "setUpClass"], [16, 3, 1, "", "test_missing_y"], [16, 3, 1, "", "test_spxy"], [16, 3, 1, "", "test_spxy_sampling"]], "test.unit.utils": [[17, 0, 0, "-", "test_convert_to_array"], [17, 0, 0, "-", "test_sampler_factory"], [17, 0, 0, "-", "test_utils"]], "test.unit.utils.test_convert_to_array": [[17, 2, 1, "", "Test_convert_to_array"]], "test.unit.utils.test_convert_to_array.Test_convert_to_array": [[17, 3, 1, "", "test_bad_type_cast"], [17, 3, 1, "", "test_convertable_input"], [17, 3, 1, "", "test_panda_handla"], [17, 3, 1, "", "test_unconvertable_input"]], 
"test.unit.utils.test_sampler_factory": [[17, 2, 1, "", "Test_sampler_factory"]], "test.unit.utils.test_sampler_factory.Test_sampler_factory": [[17, 3, 1, "", "setUpClass"], [17, 3, 1, "", "test_train_test_split"]], "test.unit.utils.test_utils": [[17, 2, 1, "", "Test_utils"]], "test.unit.utils.test_utils.Test_utils": [[17, 3, 1, "", "setUpClass"], [17, 3, 1, "", "test_generate_regression_results_dict"]]}, "objtypes": {"0": "py:module", "1": "py:function", "2": "py:class", "3": "py:method", "4": "py:exception"}, "objnames": {"0": ["py", "module", "Python module"], "1": ["py", "function", "Python function"], "2": ["py", "class", "Python class"], "3": ["py", "method", "Python method"], "4": ["py", "exception", "Python exception"]}, "titleterms": {"contribut": [0, 1], "develop": [0, 1], "note": [0, 1], "instal": [0, 1, 9], "version": 0, "check": 0, "test": [0, 1, 10, 11, 12, 13, 14, 15, 16, 17], "ad": 0, "new": 0, "sampler": [0, 3, 4, 5, 9, 14, 15, 16], "Not": 0, "implement": [0, 1], "sampl": [0, 1], "algorithm": [0, 1, 9], "featur": [0, 9], "scheme": 0, "The": 0, "train_val_test_split": [0, 1, 9], "function": [0, 11], "philosophi": 0, "joss": 0, "branch": 0, "onlin": 1, "document": [1, 7], "astart": [1, 2, 3, 4, 5, 6, 7, 8, 9], "pip": 1, "conda": 1, "sourc": 1, "statement": [1, 9], "need": 1, "target": 1, "audienc": 1, "quick": 1, "start": 1, "exampl": 1, "notebook": 1, "withhold": 1, "data": 1, "evalu": 1, "impact": 1, "split": 1, "regress": [1, 12], "model": 1, "us": [1, 9], "categor": 1, "access": 1, "directli": 1, "theori": 1, "applic": 1, "ration": 1, "domain": 1, "specif": 1, "chemic": 1, "molecul": [1, 2], "subpackag": [1, 2, 3, 10, 13, 14], "reproduc": 1, "known": 1, "limit": 1, "how": 1, "cite": 1, "packag": [2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16, 17], "submodul": [2, 3, 4, 5, 6, 11, 12, 15, 16, 17], "main": 2, "modul": [2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16, 17], "content": [2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17], "abstract_sampl": 3, 
"extrapol": [4, 15], "dbscan": 4, "kmean": 4, "optisim": 4, "scaffold": 4, "sphere_exclus": 4, "time_bas": 4, "interpol": [5, 16], "kennardston": 5, "random_split": 5, "spxy": 5, "util": [6, 17], "array_type_help": 6, "except": 6, "fast_kennard_ston": 6, "sampler_factori": 6, "user_util": 6, "warn": [6, 9], "indic": 7, "tabl": 7, "transit": 9, "from": 9, "sklearn": 9, "step": 9, "1": 9, "2": 9, "chang": 9, "import": 9, "3": 9, "specifi": 9, "an": 9, "4": 9, "pass": 9, "keyword": 9, "argument": 9, "5": 9, "return_indic": 9, "improv": 9, "code": 9, "clariti": 9, "more": 9, "rigor": 9, "ml": 9, "custom": 9, "imperfectsplittingwarn": 9, "normalizationwarn": 9, "test_astart": 11, "test_molecul": 11, "test_regress": 12, "unit": [13, 14, 15, 16, 17], "test_dbscan": 15, "test_scaffold": 15, "test_kmean": 15, "test_optisim": 15, "test_sphere_exclus": 15, "test_time_bas": 15, "test_kennard_ston": 16, "test_random": 16, "test_spxi": 16, "test_convert_to_arrai": 17, "test_sampler_factori": 17, "test_util": 17}, "envversion": {"sphinx.domains.c": 3, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 9, "sphinx.domains.index": 1, "sphinx.domains.javascript": 3, "sphinx.domains.math": 2, "sphinx.domains.python": 4, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx": 60}, "alltitles": {"Contributing & Developer Notes": [[0, "contributing-developer-notes"], [1, "id7"]], "Developer Install": [[0, "developer-install"]], "Version Checking": [[0, "version-checking"]], "Testing": [[0, "testing"]], "Adding New Samplers": [[0, "adding-new-samplers"]], "Not Implemented Sampling Algorithms": [[0, "not-implemented-sampling-algorithms"]], "Adding New Featurization Schemes": [[0, "adding-new-featurization-schemes"]], "The train_val_test_split Function": [[0, "the-train-val-test-split-function"]], "Development Philosophy": [[0, "development-philosophy"]], "JOSS Branch": [[0, "joss-branch"]], "Online Documentation": [[1, "online-documentation"]], 
"Installing astartes": [[1, "installing-astartes"]], "pip": [[1, "pip"]], "conda": [[1, "conda"]], "Source": [[1, "source"]], "Statement of Need": [[1, "statement-of-need"]], "Target Audience": [[1, "target-audience"]], "Quick Start": [[1, "quick-start"]], "Example Notebooks": [[1, "example-notebooks"]], "Withhold Testing Data with train_val_test_split": [[1, "withhold-testing-data-with-train-val-test-split"]], "Evaluate the Impact of Splitting Algorithms on Regression Models": [[1, "evaluate-the-impact-of-splitting-algorithms-on-regression-models"]], "Using astartes with Categorical Data": [[1, "using-astartes-with-categorical-data"]], "Access Sampling Algorithms Directly": [[1, "access-sampling-algorithms-directly"]], "Theory and Application of astartes": [[1, "theory-and-application-of-astartes"]], "Rational Splitting Algorithms": [[1, "rational-splitting-algorithms"]], "Implemented Sampling Algorithms": [[1, "implemented-sampling-algorithms"]], "Domain-Specific Applications": [[1, "domain-specific-applications"]], "Chemical Data and the astartes.molecules Subpackage": [[1, "chemical-data-and-the-astartes-molecules-subpackage"]], "Reproducibility": [[1, "reproducibility"]], "Known Reproducibility Limitations": [[1, "known-reproducibility-limitations"]], "How to Cite": [[1, "how-to-cite"]], "astartes package": [[2, "astartes-package"]], "Subpackages": [[2, "subpackages"], [3, "subpackages"], [10, "subpackages"], [13, "subpackages"], [14, "subpackages"]], "Submodules": [[2, "submodules"], [3, "submodules"], [4, "submodules"], [5, "submodules"], [6, "submodules"], [11, "submodules"], [12, "submodules"], [15, "submodules"], [16, "submodules"], [17, "submodules"]], "astartes.main module": [[2, "module-astartes.main"]], "astartes.molecules module": [[2, "module-astartes.molecules"]], "Module contents": [[2, "module-astartes"], [3, "module-astartes.samplers"], [4, "module-astartes.samplers.extrapolation"], [5, "module-astartes.samplers.interpolation"], [6, 
"module-astartes.utils"], [10, "module-test"], [11, "module-test.functional"], [12, "module-test.regression"], [13, "module-test.unit"], [14, "module-test.unit.samplers"], [15, "module-test.unit.samplers.extrapolative"], [16, "module-test.unit.samplers.interpolative"], [17, "module-test.unit.utils"]], "astartes.samplers package": [[3, "astartes-samplers-package"]], "astartes.samplers.abstract_sampler module": [[3, "module-astartes.samplers.abstract_sampler"]], "astartes.samplers.extrapolation package": [[4, "astartes-samplers-extrapolation-package"]], "astartes.samplers.extrapolation.dbscan module": [[4, "module-astartes.samplers.extrapolation.dbscan"]], "astartes.samplers.extrapolation.kmeans module": [[4, "module-astartes.samplers.extrapolation.kmeans"]], "astartes.samplers.extrapolation.optisim module": [[4, "module-astartes.samplers.extrapolation.optisim"]], "astartes.samplers.extrapolation.scaffold module": [[4, "module-astartes.samplers.extrapolation.scaffold"]], "astartes.samplers.extrapolation.sphere_exclusion module": [[4, "module-astartes.samplers.extrapolation.sphere_exclusion"]], "astartes.samplers.extrapolation.time_based module": [[4, "module-astartes.samplers.extrapolation.time_based"]], "astartes.samplers.interpolation package": [[5, "astartes-samplers-interpolation-package"]], "astartes.samplers.interpolation.kennardstone module": [[5, "module-astartes.samplers.interpolation.kennardstone"]], "astartes.samplers.interpolation.random_split module": [[5, "module-astartes.samplers.interpolation.random_split"]], "astartes.samplers.interpolation.spxy module": [[5, "module-astartes.samplers.interpolation.spxy"]], "astartes.utils package": [[6, "astartes-utils-package"]], "astartes.utils.array_type_helpers module": [[6, "module-astartes.utils.array_type_helpers"]], "astartes.utils.exceptions module": [[6, "module-astartes.utils.exceptions"]], "astartes.utils.fast_kennard_stone module": [[6, "module-astartes.utils.fast_kennard_stone"]], 
"astartes.utils.sampler_factory module": [[6, "module-astartes.utils.sampler_factory"]], "astartes.utils.user_utils module": [[6, "module-astartes.utils.user_utils"]], "astartes.utils.warnings module": [[6, "module-astartes.utils.warnings"]], "astartes documentation": [[7, "astartes-documentation"]], "Contents:": [[7, null]], "Indices and tables": [[7, "indices-and-tables"]], "astartes": [[8, "astartes"]], "Transitioning from sklearn to astartes": [[9, "transitioning-from-sklearn-to-astartes"]], "Step 1. Installation": [[9, "step-1-installation"]], "Step 2. Changing the import Statement": [[9, "step-2-changing-the-import-statement"]], "Step 3. Specifying an Algorithmic Sampler": [[9, "step-3-specifying-an-algorithmic-sampler"]], "Step 4. Passing Keyword Arguments": [[9, "step-4-passing-keyword-arguments"]], "Step 5. Useful astartes Features": [[9, "step-5-useful-astartes-features"]], "return_indices: Improve Code Clarity": [[9, "return-indices-improve-code-clarity"]], "train_val_test_split: More Rigorous ML": [[9, "train-val-test-split-more-rigorous-ml"]], "Custom Warnings: ImperfectSplittingWarning and NormalizationWarning": [[9, "custom-warnings-imperfectsplittingwarning-and-normalizationwarning"]], "test package": [[10, "test-package"]], "test.functional package": [[11, "test-functional-package"]], "test.functional.test_astartes module": [[11, "module-test.functional.test_astartes"]], "test.functional.test_molecules module": [[11, "module-test.functional.test_molecules"]], "test.regression package": [[12, "test-regression-package"]], "test.regression.test_regression module": [[12, "module-test.regression.test_regression"]], "test.unit package": [[13, "test-unit-package"]], "test.unit.samplers package": [[14, "test-unit-samplers-package"]], "test.unit.samplers.extrapolative package": [[15, "test-unit-samplers-extrapolative-package"]], "test.unit.samplers.extrapolative.test_DBSCAN module": [[15, "module-test.unit.samplers.extrapolative.test_DBSCAN"]], 
"test.unit.samplers.extrapolative.test_Scaffold module": [[15, "module-test.unit.samplers.extrapolative.test_Scaffold"]], "test.unit.samplers.extrapolative.test_kmeans module": [[15, "module-test.unit.samplers.extrapolative.test_kmeans"]], "test.unit.samplers.extrapolative.test_optisim module": [[15, "module-test.unit.samplers.extrapolative.test_optisim"]], "test.unit.samplers.extrapolative.test_sphere_exclusion module": [[15, "module-test.unit.samplers.extrapolative.test_sphere_exclusion"]], "test.unit.samplers.extrapolative.test_time_based module": [[15, "module-test.unit.samplers.extrapolative.test_time_based"]], "test.unit.samplers.interpolative package": [[16, "test-unit-samplers-interpolative-package"]], "test.unit.samplers.interpolative.test_kennard_stone module": [[16, "module-test.unit.samplers.interpolative.test_kennard_stone"]], "test.unit.samplers.interpolative.test_random module": [[16, "module-test.unit.samplers.interpolative.test_random"]], "test.unit.samplers.interpolative.test_spxy module": [[16, "module-test.unit.samplers.interpolative.test_spxy"]], "test.unit.utils package": [[17, "test-unit-utils-package"]], "test.unit.utils.test_convert_to_array module": [[17, "module-test.unit.utils.test_convert_to_array"]], "test.unit.utils.test_sampler_factory module": [[17, "module-test.unit.utils.test_sampler_factory"]], "test.unit.utils.test_utils module": [[17, "module-test.unit.utils.test_utils"]]}, "indexentries": {"astartes": [[2, "module-astartes"]], "astartes.main": [[2, "module-astartes.main"]], "astartes.molecules": [[2, "module-astartes.molecules"]], "module": [[2, "module-astartes"], [2, "module-astartes.main"], [2, "module-astartes.molecules"], [3, "module-astartes.samplers"], [3, "module-astartes.samplers.abstract_sampler"], [4, "module-astartes.samplers.extrapolation"], [4, "module-astartes.samplers.extrapolation.dbscan"], [4, "module-astartes.samplers.extrapolation.kmeans"], [4, "module-astartes.samplers.extrapolation.optisim"], [4, 
"module-astartes.samplers.extrapolation.scaffold"], [4, "module-astartes.samplers.extrapolation.sphere_exclusion"], [4, "module-astartes.samplers.extrapolation.time_based"], [5, "module-astartes.samplers.interpolation"], [5, "module-astartes.samplers.interpolation.kennardstone"], [5, "module-astartes.samplers.interpolation.random_split"], [5, "module-astartes.samplers.interpolation.spxy"], [6, "module-astartes.utils"], [6, "module-astartes.utils.array_type_helpers"], [6, "module-astartes.utils.exceptions"], [6, "module-astartes.utils.fast_kennard_stone"], [6, "module-astartes.utils.sampler_factory"], [6, "module-astartes.utils.user_utils"], [6, "module-astartes.utils.warnings"], [10, "module-test"], [11, "module-test.functional"], [11, "module-test.functional.test_astartes"], [11, "module-test.functional.test_molecules"], [12, "module-test.regression"], [12, "module-test.regression.test_regression"], [13, "module-test.unit"], [14, "module-test.unit.samplers"], [15, "module-test.unit.samplers.extrapolative"], [15, "module-test.unit.samplers.extrapolative.test_DBSCAN"], [15, "module-test.unit.samplers.extrapolative.test_Scaffold"], [15, "module-test.unit.samplers.extrapolative.test_kmeans"], [15, "module-test.unit.samplers.extrapolative.test_optisim"], [15, "module-test.unit.samplers.extrapolative.test_sphere_exclusion"], [15, "module-test.unit.samplers.extrapolative.test_time_based"], [16, "module-test.unit.samplers.interpolative"], [16, "module-test.unit.samplers.interpolative.test_kennard_stone"], [16, "module-test.unit.samplers.interpolative.test_random"], [16, "module-test.unit.samplers.interpolative.test_spxy"], [17, "module-test.unit.utils"], [17, "module-test.unit.utils.test_convert_to_array"], [17, "module-test.unit.utils.test_sampler_factory"], [17, "module-test.unit.utils.test_utils"]], "train_test_split() (in module astartes.main)": [[2, "astartes.main.train_test_split"]], "train_test_split_molecules() (in module astartes.molecules)": [[2, 
"astartes.molecules.train_test_split_molecules"]], "train_val_test_split() (in module astartes.main)": [[2, "astartes.main.train_val_test_split"]], "train_val_test_split_molecules() (in module astartes.molecules)": [[2, "astartes.molecules.train_val_test_split_molecules"]], "abstractsampler (class in astartes.samplers.abstract_sampler)": [[3, "astartes.samplers.abstract_sampler.AbstractSampler"]], "__init__() (astartes.samplers.abstract_sampler.abstractsampler method)": [[3, "astartes.samplers.abstract_sampler.AbstractSampler.__init__"]], "astartes.samplers": [[3, "module-astartes.samplers"]], "astartes.samplers.abstract_sampler": [[3, "module-astartes.samplers.abstract_sampler"]], "get_clusters() (astartes.samplers.abstract_sampler.abstractsampler method)": [[3, "astartes.samplers.abstract_sampler.AbstractSampler.get_clusters"]], "get_config() (astartes.samplers.abstract_sampler.abstractsampler method)": [[3, "astartes.samplers.abstract_sampler.AbstractSampler.get_config"]], "get_sample_idxs() (astartes.samplers.abstract_sampler.abstractsampler method)": [[3, "astartes.samplers.abstract_sampler.AbstractSampler.get_sample_idxs"]], "get_sorted_cluster_counter() (astartes.samplers.abstract_sampler.abstractsampler method)": [[3, "astartes.samplers.abstract_sampler.AbstractSampler.get_sorted_cluster_counter"]], "dbscan (class in astartes.samplers.extrapolation.dbscan)": [[4, "astartes.samplers.extrapolation.dbscan.DBSCAN"]], "kmeans (class in astartes.samplers.extrapolation.kmeans)": [[4, "astartes.samplers.extrapolation.kmeans.KMeans"]], "optisim (class in astartes.samplers.extrapolation.optisim)": [[4, "astartes.samplers.extrapolation.optisim.OptiSim"]], "scaffold (class in astartes.samplers.extrapolation.scaffold)": [[4, "astartes.samplers.extrapolation.scaffold.Scaffold"]], "sphereexclusion (class in astartes.samplers.extrapolation.sphere_exclusion)": [[4, "astartes.samplers.extrapolation.sphere_exclusion.SphereExclusion"]], "timebased (class in 
astartes.samplers.extrapolation.time_based)": [[4, "astartes.samplers.extrapolation.time_based.TimeBased"]], "astartes.samplers.extrapolation": [[4, "module-astartes.samplers.extrapolation"]], "astartes.samplers.extrapolation.dbscan": [[4, "module-astartes.samplers.extrapolation.dbscan"]], "astartes.samplers.extrapolation.kmeans": [[4, "module-astartes.samplers.extrapolation.kmeans"]], "astartes.samplers.extrapolation.optisim": [[4, "module-astartes.samplers.extrapolation.optisim"]], "astartes.samplers.extrapolation.scaffold": [[4, "module-astartes.samplers.extrapolation.scaffold"]], "astartes.samplers.extrapolation.sphere_exclusion": [[4, "module-astartes.samplers.extrapolation.sphere_exclusion"]], "astartes.samplers.extrapolation.time_based": [[4, "module-astartes.samplers.extrapolation.time_based"]], "generate_bemis_murcko_scaffold() (astartes.samplers.extrapolation.scaffold.scaffold method)": [[4, "astartes.samplers.extrapolation.scaffold.Scaffold.generate_bemis_murcko_scaffold"]], "get_dist() (astartes.samplers.extrapolation.optisim.optisim method)": [[4, "astartes.samplers.extrapolation.optisim.OptiSim.get_dist"]], "move_item() (astartes.samplers.extrapolation.optisim.optisim method)": [[4, "astartes.samplers.extrapolation.optisim.OptiSim.move_item"]], "rchoose() (astartes.samplers.extrapolation.optisim.optisim method)": [[4, "astartes.samplers.extrapolation.optisim.OptiSim.rchoose"]], "scaffold_to_smiles() (astartes.samplers.extrapolation.scaffold.scaffold method)": [[4, "astartes.samplers.extrapolation.scaffold.Scaffold.scaffold_to_smiles"]], "str_to_mol() (astartes.samplers.extrapolation.scaffold.scaffold method)": [[4, "astartes.samplers.extrapolation.scaffold.Scaffold.str_to_mol"]], "kennardstone (class in astartes.samplers.interpolation.kennardstone)": [[5, "astartes.samplers.interpolation.kennardstone.KennardStone"]], "random (class in astartes.samplers.interpolation.random_split)": [[5, "astartes.samplers.interpolation.random_split.Random"]], "spxy 
(class in astartes.samplers.interpolation.spxy)": [[5, "astartes.samplers.interpolation.spxy.SPXY"]], "astartes.samplers.interpolation": [[5, "module-astartes.samplers.interpolation"]], "astartes.samplers.interpolation.kennardstone": [[5, "module-astartes.samplers.interpolation.kennardstone"]], "astartes.samplers.interpolation.random_split": [[5, "module-astartes.samplers.interpolation.random_split"]], "astartes.samplers.interpolation.spxy": [[5, "module-astartes.samplers.interpolation.spxy"]], "conversionwarning": [[6, "astartes.utils.warnings.ConversionWarning"]], "imperfectsplittingwarning": [[6, "astartes.utils.warnings.ImperfectSplittingWarning"]], "invalidconfigurationerror": [[6, "astartes.utils.exceptions.InvalidConfigurationError"]], "invalidmodeltypeerror": [[6, "astartes.utils.exceptions.InvalidModelTypeError"]], "moleculesnotinstallederror": [[6, "astartes.utils.exceptions.MoleculesNotInstalledError"]], "nomatchingscaffold": [[6, "astartes.utils.warnings.NoMatchingScaffold"]], "normalizationwarning": [[6, "astartes.utils.warnings.NormalizationWarning"]], "samplerfactory (class in astartes.utils.sampler_factory)": [[6, "astartes.utils.sampler_factory.SamplerFactory"]], "samplernotimplementederror": [[6, "astartes.utils.exceptions.SamplerNotImplementedError"]], "uncastableinputerror": [[6, "astartes.utils.exceptions.UncastableInputError"]], "__init__() (astartes.utils.exceptions.invalidconfigurationerror method)": [[6, "astartes.utils.exceptions.InvalidConfigurationError.__init__"]], "__init__() (astartes.utils.exceptions.invalidmodeltypeerror method)": [[6, "astartes.utils.exceptions.InvalidModelTypeError.__init__"]], "__init__() (astartes.utils.exceptions.moleculesnotinstallederror method)": [[6, "astartes.utils.exceptions.MoleculesNotInstalledError.__init__"]], "__init__() (astartes.utils.exceptions.samplernotimplementederror method)": [[6, "astartes.utils.exceptions.SamplerNotImplementedError.__init__"]], "__init__() 
(astartes.utils.exceptions.uncastableinputerror method)": [[6, "astartes.utils.exceptions.UncastableInputError.__init__"]], "__init__() (astartes.utils.sampler_factory.samplerfactory method)": [[6, "astartes.utils.sampler_factory.SamplerFactory.__init__"]], "__init__() (astartes.utils.warnings.conversionwarning method)": [[6, "astartes.utils.warnings.ConversionWarning.__init__"]], "__init__() (astartes.utils.warnings.imperfectsplittingwarning method)": [[6, "astartes.utils.warnings.ImperfectSplittingWarning.__init__"]], "__init__() (astartes.utils.warnings.nomatchingscaffold method)": [[6, "astartes.utils.warnings.NoMatchingScaffold.__init__"]], "__init__() (astartes.utils.warnings.normalizationwarning method)": [[6, "astartes.utils.warnings.NormalizationWarning.__init__"]], "astartes.utils": [[6, "module-astartes.utils"]], "astartes.utils.array_type_helpers": [[6, "module-astartes.utils.array_type_helpers"]], "astartes.utils.exceptions": [[6, "module-astartes.utils.exceptions"]], "astartes.utils.fast_kennard_stone": [[6, "module-astartes.utils.fast_kennard_stone"]], "astartes.utils.sampler_factory": [[6, "module-astartes.utils.sampler_factory"]], "astartes.utils.user_utils": [[6, "module-astartes.utils.user_utils"]], "astartes.utils.warnings": [[6, "module-astartes.utils.warnings"]], "convert_to_array() (in module astartes.utils.array_type_helpers)": [[6, "astartes.utils.array_type_helpers.convert_to_array"]], "display_results_as_table() (in module astartes.utils.user_utils)": [[6, "astartes.utils.user_utils.display_results_as_table"]], "fast_kennard_stone() (in module astartes.utils.fast_kennard_stone)": [[6, "astartes.utils.fast_kennard_stone.fast_kennard_stone"]], "generate_regression_results_dict() (in module astartes.utils)": [[6, "astartes.utils.generate_regression_results_dict"]], "generate_regression_results_dict() (in module astartes.utils.user_utils)": [[6, "astartes.utils.user_utils.generate_regression_results_dict"]], "get_sampler() 
(astartes.utils.sampler_factory.samplerfactory method)": [[6, "astartes.utils.sampler_factory.SamplerFactory.get_sampler"]], "panda_handla() (in module astartes.utils.array_type_helpers)": [[6, "astartes.utils.array_type_helpers.panda_handla"]], "return_helper() (in module astartes.utils.array_type_helpers)": [[6, "astartes.utils.array_type_helpers.return_helper"]], "test": [[10, "module-test"]], "test_astartes (class in test.functional.test_astartes)": [[11, "test.functional.test_astartes.Test_astartes"]], "test_molecules (class in test.functional.test_molecules)": [[11, "test.functional.test_molecules.Test_molecules"]], "setupclass() (test.functional.test_astartes.test_astartes class method)": [[11, "test.functional.test_astartes.Test_astartes.setUpClass"]], "setupclass() (test.functional.test_molecules.test_molecules class method)": [[11, "test.functional.test_molecules.Test_molecules.setUpClass"]], "test.functional": [[11, "module-test.functional"]], "test.functional.test_astartes": [[11, "module-test.functional.test_astartes"]], "test.functional.test_molecules": [[11, "module-test.functional.test_molecules"]], "test_close_mispelling_sampler() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_close_mispelling_sampler"]], "test_extrapolative_shuffling() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_extrapolative_shuffling"]], "test_fingerprints() (test.functional.test_molecules.test_molecules method)": [[11, "test.functional.test_molecules.Test_molecules.test_fingerprints"]], "test_fprint_hopts() (test.functional.test_molecules.test_molecules method)": [[11, "test.functional.test_molecules.Test_molecules.test_fprint_hopts"]], "test_inconsistent_input_lengths() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_inconsistent_input_lengths"]], "test_insufficient_dataset_test() 
(test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_insufficient_dataset_test"]], "test_insufficient_dataset_train() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_insufficient_dataset_train"]], "test_insufficient_dataset_val() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_insufficient_dataset_val"]], "test_maximum_call() (test.functional.test_molecules.test_molecules method)": [[11, "test.functional.test_molecules.Test_molecules.test_maximum_call"]], "test_molecules() (test.functional.test_molecules.test_molecules method)": [[11, "test.functional.test_molecules.Test_molecules.test_molecules"]], "test_molecules_with_rdkit() (test.functional.test_molecules.test_molecules method)": [[11, "test.functional.test_molecules.Test_molecules.test_molecules_with_rdkit"]], "test_molecules_with_troublesome_smiles() (test.functional.test_molecules.test_molecules method)": [[11, "test.functional.test_molecules.Test_molecules.test_molecules_with_troublesome_smiles"]], "test_not_implemented_sampler() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_not_implemented_sampler"]], "test_return_indices() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_return_indices"]], "test_return_indices_with_validation() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_return_indices_with_validation"]], "test_sampler_hopts() (test.functional.test_molecules.test_molecules method)": [[11, "test.functional.test_molecules.Test_molecules.test_sampler_hopts"]], "test_split_validation() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_split_validation"]], 
"test_train_test_split() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_train_test_split"]], "test_train_val_test_split() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_train_val_test_split"]], "test_train_val_test_split_extrpolation_shuffling() (test.functional.test_astartes.test_astartes method)": [[11, "test.functional.test_astartes.Test_astartes.test_train_val_test_split_extrpolation_shuffling"]], "test_validation_split_molecules() (test.functional.test_molecules.test_molecules method)": [[11, "test.functional.test_molecules.Test_molecules.test_validation_split_molecules"]], "test_regression (class in test.regression.test_regression)": [[12, "test.regression.test_regression.Test_regression"]], "setupclass() (test.regression.test_regression.test_regression class method)": [[12, "test.regression.test_regression.Test_regression.setUpClass"]], "test.regression": [[12, "module-test.regression"]], "test.regression.test_regression": [[12, "module-test.regression.test_regression"]], "test_extrapolation_regression() (test.regression.test_regression.test_regression method)": [[12, "test.regression.test_regression.Test_regression.test_extrapolation_regression"]], "test_interpolation_regression() (test.regression.test_regression.test_regression method)": [[12, "test.regression.test_regression.Test_regression.test_interpolation_regression"]], "test_kmeans_regression_sklearn_v12() (test.regression.test_regression.test_regression method)": [[12, "test.regression.test_regression.Test_regression.test_kmeans_regression_sklearn_v12"]], "test_kmeans_regression_sklearn_v13() (test.regression.test_regression.test_regression method)": [[12, "test.regression.test_regression.Test_regression.test_kmeans_regression_sklearn_v13"]], "test_timebased_regression() (test.regression.test_regression.test_regression method)": [[12, 
"test.regression.test_regression.Test_regression.test_timebased_regression"]], "test.unit": [[13, "module-test.unit"]], "test.unit.samplers": [[14, "module-test.unit.samplers"]], "test_dbscan (class in test.unit.samplers.extrapolative.test_dbscan)": [[15, "test.unit.samplers.extrapolative.test_DBSCAN.Test_DBSCAN"]], "test_kmeans (class in test.unit.samplers.extrapolative.test_kmeans)": [[15, "test.unit.samplers.extrapolative.test_kmeans.Test_kmeans"]], "test_optisim (class in test.unit.samplers.extrapolative.test_optisim)": [[15, "test.unit.samplers.extrapolative.test_optisim.Test_optisim"]], "test_scaffold (class in test.unit.samplers.extrapolative.test_scaffold)": [[15, "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold"]], "test_sphere_exclusion (class in test.unit.samplers.extrapolative.test_sphere_exclusion)": [[15, "test.unit.samplers.extrapolative.test_sphere_exclusion.Test_sphere_exclusion"]], "test_time_based (class in test.unit.samplers.extrapolative.test_time_based)": [[15, "test.unit.samplers.extrapolative.test_time_based.Test_time_based"]], "setupclass() (test.unit.samplers.extrapolative.test_dbscan.test_dbscan class method)": [[15, "test.unit.samplers.extrapolative.test_DBSCAN.Test_DBSCAN.setUpClass"]], "setupclass() (test.unit.samplers.extrapolative.test_scaffold.test_scaffold class method)": [[15, "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold.setUpClass"]], "setupclass() (test.unit.samplers.extrapolative.test_kmeans.test_kmeans class method)": [[15, "test.unit.samplers.extrapolative.test_kmeans.Test_kmeans.setUpClass"]], "setupclass() (test.unit.samplers.extrapolative.test_optisim.test_optisim class method)": [[15, "test.unit.samplers.extrapolative.test_optisim.Test_optisim.setUpClass"]], "setupclass() (test.unit.samplers.extrapolative.test_sphere_exclusion.test_sphere_exclusion class method)": [[15, "test.unit.samplers.extrapolative.test_sphere_exclusion.Test_sphere_exclusion.setUpClass"]], "setupclass() 
(test.unit.samplers.extrapolative.test_time_based.test_time_based class method)": [[15, "test.unit.samplers.extrapolative.test_time_based.Test_time_based.setUpClass"]], "test.unit.samplers.extrapolative": [[15, "module-test.unit.samplers.extrapolative"]], "test.unit.samplers.extrapolative.test_dbscan": [[15, "module-test.unit.samplers.extrapolative.test_DBSCAN"]], "test.unit.samplers.extrapolative.test_scaffold": [[15, "module-test.unit.samplers.extrapolative.test_Scaffold"]], "test.unit.samplers.extrapolative.test_kmeans": [[15, "module-test.unit.samplers.extrapolative.test_kmeans"]], "test.unit.samplers.extrapolative.test_optisim": [[15, "module-test.unit.samplers.extrapolative.test_optisim"]], "test.unit.samplers.extrapolative.test_sphere_exclusion": [[15, "module-test.unit.samplers.extrapolative.test_sphere_exclusion"]], "test.unit.samplers.extrapolative.test_time_based": [[15, "module-test.unit.samplers.extrapolative.test_time_based"]], "test_dbscan() (test.unit.samplers.extrapolative.test_dbscan.test_dbscan method)": [[15, "test.unit.samplers.extrapolative.test_DBSCAN.Test_DBSCAN.test_dbscan"]], "test_dbscan_sampling() (test.unit.samplers.extrapolative.test_dbscan.test_dbscan method)": [[15, "test.unit.samplers.extrapolative.test_DBSCAN.Test_DBSCAN.test_dbscan_sampling"]], "test_include_chirality() (test.unit.samplers.extrapolative.test_scaffold.test_scaffold method)": [[15, "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold.test_include_chirality"]], "test_incorrect_input() (test.unit.samplers.extrapolative.test_scaffold.test_scaffold method)": [[15, "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold.test_incorrect_input"]], "test_incorrect_input() (test.unit.samplers.extrapolative.test_time_based.test_time_based method)": [[15, "test.unit.samplers.extrapolative.test_time_based.Test_time_based.test_incorrect_input"]], "test_kmeans() (test.unit.samplers.extrapolative.test_kmeans.test_kmeans method)": [[15, 
"test.unit.samplers.extrapolative.test_kmeans.Test_kmeans.test_kmeans"]], "test_kmeans_sampling_v12() (test.unit.samplers.extrapolative.test_kmeans.test_kmeans method)": [[15, "test.unit.samplers.extrapolative.test_kmeans.Test_kmeans.test_kmeans_sampling_v12"]], "test_kmeans_sampling_v13() (test.unit.samplers.extrapolative.test_kmeans.test_kmeans method)": [[15, "test.unit.samplers.extrapolative.test_kmeans.Test_kmeans.test_kmeans_sampling_v13"]], "test_mising_labels() (test.unit.samplers.extrapolative.test_time_based.test_time_based method)": [[15, "test.unit.samplers.extrapolative.test_time_based.Test_time_based.test_mising_labels"]], "test_mol_from_inchi() (test.unit.samplers.extrapolative.test_scaffold.test_scaffold method)": [[15, "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold.test_mol_from_inchi"]], "test_no_scaffold_found_warning() (test.unit.samplers.extrapolative.test_scaffold.test_scaffold method)": [[15, "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold.test_no_scaffold_found_warning"]], "test_optisim() (test.unit.samplers.extrapolative.test_optisim.test_optisim method)": [[15, "test.unit.samplers.extrapolative.test_optisim.Test_optisim.test_optisim"]], "test_optisim_sampling() (test.unit.samplers.extrapolative.test_optisim.test_optisim method)": [[15, "test.unit.samplers.extrapolative.test_optisim.Test_optisim.test_optisim_sampling"]], "test_remove_atom_map() (test.unit.samplers.extrapolative.test_scaffold.test_scaffold method)": [[15, "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold.test_remove_atom_map"]], "test_scaffold() (test.unit.samplers.extrapolative.test_scaffold.test_scaffold method)": [[15, "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold.test_scaffold"]], "test_scaffold_sampling() (test.unit.samplers.extrapolative.test_scaffold.test_scaffold method)": [[15, "test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold.test_scaffold_sampling"]], "test_sphereexclusion() 
(test.unit.samplers.extrapolative.test_sphere_exclusion.test_sphere_exclusion method)": [[15, "test.unit.samplers.extrapolative.test_sphere_exclusion.Test_sphere_exclusion.test_sphereexclusion"]], "test_sphereexclusion_sampling() (test.unit.samplers.extrapolative.test_sphere_exclusion.test_sphere_exclusion method)": [[15, "test.unit.samplers.extrapolative.test_sphere_exclusion.Test_sphere_exclusion.test_sphereexclusion_sampling"]], "test_time_based_date() (test.unit.samplers.extrapolative.test_time_based.test_time_based method)": [[15, "test.unit.samplers.extrapolative.test_time_based.Test_time_based.test_time_based_date"]], "test_time_based_datetime() (test.unit.samplers.extrapolative.test_time_based.test_time_based method)": [[15, "test.unit.samplers.extrapolative.test_time_based.Test_time_based.test_time_based_datetime"]], "test_time_based_sampling() (test.unit.samplers.extrapolative.test_time_based.test_time_based method)": [[15, "test.unit.samplers.extrapolative.test_time_based.Test_time_based.test_time_based_sampling"]], "test_spxy (class in test.unit.samplers.interpolative.test_spxy)": [[16, "test.unit.samplers.interpolative.test_spxy.Test_SPXY"]], "test_kennard_stone (class in test.unit.samplers.interpolative.test_kennard_stone)": [[16, "test.unit.samplers.interpolative.test_kennard_stone.Test_kennard_stone"]], "test_random (class in test.unit.samplers.interpolative.test_random)": [[16, "test.unit.samplers.interpolative.test_random.Test_random"]], "setupclass() (test.unit.samplers.interpolative.test_kennard_stone.test_kennard_stone class method)": [[16, "test.unit.samplers.interpolative.test_kennard_stone.Test_kennard_stone.setUpClass"]], "setupclass() (test.unit.samplers.interpolative.test_random.test_random class method)": [[16, "test.unit.samplers.interpolative.test_random.Test_random.setUpClass"]], "setupclass() (test.unit.samplers.interpolative.test_spxy.test_spxy class method)": [[16, 
"test.unit.samplers.interpolative.test_spxy.Test_SPXY.setUpClass"]], "test.unit.samplers.interpolative": [[16, "module-test.unit.samplers.interpolative"]], "test.unit.samplers.interpolative.test_kennard_stone": [[16, "module-test.unit.samplers.interpolative.test_kennard_stone"]], "test.unit.samplers.interpolative.test_random": [[16, "module-test.unit.samplers.interpolative.test_random"]], "test.unit.samplers.interpolative.test_spxy": [[16, "module-test.unit.samplers.interpolative.test_spxy"]], "test_kennard_stone() (test.unit.samplers.interpolative.test_kennard_stone.test_kennard_stone method)": [[16, "test.unit.samplers.interpolative.test_kennard_stone.Test_kennard_stone.test_kennard_stone"]], "test_kennard_stone_sample() (test.unit.samplers.interpolative.test_kennard_stone.test_kennard_stone method)": [[16, "test.unit.samplers.interpolative.test_kennard_stone.Test_kennard_stone.test_kennard_stone_sample"]], "test_kennard_stone_sample_no_warning() (test.unit.samplers.interpolative.test_kennard_stone.test_kennard_stone method)": [[16, "test.unit.samplers.interpolative.test_kennard_stone.Test_kennard_stone.test_kennard_stone_sample_no_warning"]], "test_missing_y() (test.unit.samplers.interpolative.test_spxy.test_spxy method)": [[16, "test.unit.samplers.interpolative.test_spxy.Test_SPXY.test_missing_y"]], "test_random() (test.unit.samplers.interpolative.test_random.test_random method)": [[16, "test.unit.samplers.interpolative.test_random.Test_random.test_random"]], "test_random_sample() (test.unit.samplers.interpolative.test_random.test_random method)": [[16, "test.unit.samplers.interpolative.test_random.Test_random.test_random_sample"]], "test_random_sample_no_warning() (test.unit.samplers.interpolative.test_random.test_random method)": [[16, "test.unit.samplers.interpolative.test_random.Test_random.test_random_sample_no_warning"]], "test_spxy() (test.unit.samplers.interpolative.test_spxy.test_spxy method)": [[16, 
"test.unit.samplers.interpolative.test_spxy.Test_SPXY.test_spxy"]], "test_spxy_sampling() (test.unit.samplers.interpolative.test_spxy.test_spxy method)": [[16, "test.unit.samplers.interpolative.test_spxy.Test_SPXY.test_spxy_sampling"]], "test_convert_to_array (class in test.unit.utils.test_convert_to_array)": [[17, "test.unit.utils.test_convert_to_array.Test_convert_to_array"]], "test_sampler_factory (class in test.unit.utils.test_sampler_factory)": [[17, "test.unit.utils.test_sampler_factory.Test_sampler_factory"]], "test_utils (class in test.unit.utils.test_utils)": [[17, "test.unit.utils.test_utils.Test_utils"]], "setupclass() (test.unit.utils.test_sampler_factory.test_sampler_factory class method)": [[17, "test.unit.utils.test_sampler_factory.Test_sampler_factory.setUpClass"]], "setupclass() (test.unit.utils.test_utils.test_utils class method)": [[17, "test.unit.utils.test_utils.Test_utils.setUpClass"]], "test.unit.utils": [[17, "module-test.unit.utils"]], "test.unit.utils.test_convert_to_array": [[17, "module-test.unit.utils.test_convert_to_array"]], "test.unit.utils.test_sampler_factory": [[17, "module-test.unit.utils.test_sampler_factory"]], "test.unit.utils.test_utils": [[17, "module-test.unit.utils.test_utils"]], "test_bad_type_cast() (test.unit.utils.test_convert_to_array.test_convert_to_array method)": [[17, "test.unit.utils.test_convert_to_array.Test_convert_to_array.test_bad_type_cast"]], "test_convertable_input() (test.unit.utils.test_convert_to_array.test_convert_to_array method)": [[17, "test.unit.utils.test_convert_to_array.Test_convert_to_array.test_convertable_input"]], "test_generate_regression_results_dict() (test.unit.utils.test_utils.test_utils method)": [[17, "test.unit.utils.test_utils.Test_utils.test_generate_regression_results_dict"]], "test_panda_handla() (test.unit.utils.test_convert_to_array.test_convert_to_array method)": [[17, "test.unit.utils.test_convert_to_array.Test_convert_to_array.test_panda_handla"]], "test_train_test_split() 
(test.unit.utils.test_sampler_factory.test_sampler_factory method)": [[17, "test.unit.utils.test_sampler_factory.Test_sampler_factory.test_train_test_split"]], "test_unconvertable_input() (test.unit.utils.test_convert_to_array.test_convert_to_array method)": [[17, "test.unit.utils.test_convert_to_array.Test_convert_to_array.test_unconvertable_input"]]}}) \ No newline at end of file diff --git a/docs/sklearn_to_astartes.doctree b/docs/sklearn_to_astartes.doctree new file mode 100644 index 00000000..f249addc Binary files /dev/null and b/docs/sklearn_to_astartes.doctree differ diff --git a/docs/sklearn_to_astartes.html b/docs/sklearn_to_astartes.html new file mode 100644 index 00000000..fc41f724 --- /dev/null +++ b/docs/sklearn_to_astartes.html @@ -0,0 +1,278 @@ + + + + + + + Transitioning from sklearn to astartes — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Transitioning from sklearn to astartes

+
+

Step 1. Installation

+

astartes has been designed to rely on (1) as few packages as possible and (2) packages which are already likely to be installed in a Machine Learning (ML) Python workflow (e.g. NumPy and scikit-learn). Because of this, astartes should be compatible with your existing workflow, such as a conda environment.

+

To install astartes for general ML use (the sampling of arbitrary vectors): **pip install astartes**

+

For users in cheminformatics, astartes has an optional add-on that includes featurization as part of the sampling. To install, type **pip install 'astartes[molecules]'**. With this extra install, astartes uses AIMSim (https://vlachosgroup.github.io/AIMSim/README.html) to encode SMILES strings as feature vectors. The SMILES strings are parsed into molecular graphs using RDKit and then sampled with a single function call: train_test_split_molecules.

+
    +
  • If your workflow already has a featurization scheme in place (i.e. you already have a vector representation of your chemical of interest), you can directly use train_test_split (though we invite you to explore the many molecular descriptors made available through AIMSim).

  • +
+
+
+

Step 2. Changing the import Statement

+

In one of the first few lines of your Python script, you likely have the line from sklearn.model_selection import train_test_split. To switch to using astartes, change this line to from astartes import train_test_split.

+

That’s it! You are now using astartes.

+

If you were just calling train_test_split(X, y), your script should now work in the exact same way as sklearn with no changes required.

+
X_train, X_test, y_train, y_test = train_test_split(
+    X,
+    y,
+    random_state=42,
+)
+
+
+

becomes

+
X_train, X_test, y_train, y_test = train_test_split(
+    X,
+    y,
+    random_state=42,
+)
+
+
+

But we encourage you to try one of our many other samplers (see below)!

+
+
+

Step 3. Specifying an Algorithmic Sampler

+

By default (for interoperability), astartes will use a random sampler to produce train/test splits, but the real value of astartes is in the algorithmic samplers it implements. Check out the README for a complete list of available algorithms and how to call and customize them.

+

If your existing call to train_test_split looks like this:

+
X_train, X_test, y_train, y_test = train_test_split(
+    X,
+    y,
+)
+
+
+

and you want to try out using Kennard-Stone sampling, switch it to this:

+
X_train, X_test, y_train, y_test = train_test_split(
+    X,
+    y,
+    sampler="kennard_stone",
+)
+
+
+

That’s it!
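
To give a feel for what an algorithmic sampler does, here is a minimal pure-Python sketch of the Kennard-Stone idea: start from the two points that are farthest apart, then repeatedly pick the point whose nearest already-selected neighbor is farthest away. This is an illustration only and is not the astartes implementation; the helper name kennard_stone_order and the toy points are assumptions made for this example.

```python
from itertools import combinations
from math import dist


def kennard_stone_order(points):
    """Return the indices of `points` in Kennard-Stone selection order.

    Hypothetical helper for illustration; not part of astartes.
    """
    n = len(points)
    if n < 2:
        return list(range(n))
    # Seed the selection with the two points that are farthest apart.
    first, second = max(
        combinations(range(n), 2),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]
    # Repeatedly add the point whose closest selected neighbor is farthest away.
    while remaining:
        best = max(
            remaining,
            key=lambda i: min(dist(points[i], points[j]) for j in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy one-dimensional feature vectors (made-up values).
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
order = kennard_stone_order(points)

# The earliest selections form the training set; the rest form the test set.
train_idx, test_idx = order[:3], order[3:]
print(order, train_idx, test_idx)
```

Because the earliest selections span the extremes of the data, the training set covers the feature space and the test set falls inside it, which is why this kind of sampler is described as interpolative.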

+
+
+

Step 4. Passing Keyword Arguments

+

All of the arguments to sklearn's train_test_split can still be passed to astartes' train_test_split:

+
X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split(
+    X,
+    y,
+    labels,
+    train_size = 0.75,
+    test_size = 0.25,
+    sampler = "kmeans",
+    hopts = {"n_clusters": 4},
+)
+
+
+

Some samplers have tunable hyperparameters that allow you to more finely control their behavior. To do this with Sphere Exclusion, for example, switch your call to this:

+
X_train, X_test, y_train, y_test = train_test_split(
+    X,
+    y,
+    sampler="sphere_exclusion",
+    hopts={"distance_cutoff":0.15},
+)
+
+
+
+
+

Step 5. Useful astartes Features

+
+

return_indices: Improve Code Clarity

+

There are circumstances where the indices of the train/test data can be useful (for example, if y or labels are large, memory-intensive objects), and there is no way to directly return these indices in sklearn. astartes will return the sampling splits themselves by default, but it can also return the indices for the user to manipulate according to their needs:

+
X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split(
+    X,
+    y,
+    labels,
+    return_indices = False,
+)
+
+
+

could instead be

+
X_train, X_test, y_train, y_test, labels_train, labels_test, indices_train, indices_test = train_test_split(
+    X,
+    y,
+    labels,
+    return_indices = True,
+)
+
+
+

If y or labels are large, memory-intensive objects, it can be beneficial not to pass them to train_test_split and instead split the existing lists later using the returned indices.
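
As a rough sketch of that pattern, the snippet below splits a list that was never handed to the sampler. The index values and label names are hypothetical placeholders; in practice the indices would be the last two values returned by train_test_split(..., return_indices=True).

```python
# Hypothetical index arrays standing in for the values returned by
# train_test_split(..., return_indices=True).
indices_train = [0, 2, 3]
indices_test = [1, 4]

# A (potentially large) object that was never passed to train_test_split.
labels = ["mol_a", "mol_b", "mol_c", "mol_d", "mol_e"]

# Separate the existing list after the fact using the returned indices.
labels_train = [labels[i] for i in indices_train]
labels_test = [labels[i] for i in indices_test]
print(labels_train, labels_test)
```

If the object is a NumPy array rather than a list, indexing it directly with the index arrays should achieve the same thing.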

+
+
+

train_val_test_split: More Rigorous ML

+

Behind the scenes, train_test_split is actually just a one-line function that calls the real workhorse of astartes - train_val_test_split:

+
def train_test_split(
+    X: np.array,
+    ...
+    return_indices: bool = False,
+):
+    return train_val_test_split(
+        X, y, labels, train_size, 0, test_size, sampler, hopts, return_indices
+    )
+
+
+

The function call to train_val_test_split is identical to train_test_split and supports all the same samplers and hyperparameters, except for one additional keyword argument val_size:

+
def train_val_test_split(
+    X: np.array,
+    y: np.array = None,
+    labels: np.array = None,
+    train_size: float = 0.8,
+    val_size: float = 0.1,
+    test_size: float = 0.1,
+    sampler: str = "random",
+    hopts: dict = {},
+    return_indices: bool = False,
+):
+
+
+

When called, this will return three arrays from X, y, and labels (or three arrays of indices, if return_indices=True) rather than the usual two, according to the values given for train_size, val_size, and test_size in the function call.

+
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
+    X,
+    y,
+    train_size=0.8,
+    val_size=0.1,
+    test_size=0.1,
+)
+
+
+

For truly rigorous ML modeling, the validation set should be used for hyperparameter tuning and the test set held out until the very final change has been made to the model to get a true sense of its performance. For better or for worse, this is not the current standard for ML modeling, but the authors believe it should be.
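
The workflow described above can be sketched with a toy model. Everything here is an assumption made for illustration: the data are made up, the three sets are written out by hand rather than produced by train_val_test_split, and fit and mse are hypothetical helpers for a one-parameter ridge-style regression.

```python
def fit(pairs, lam):
    """Fit y = w * x with an L2 penalty `lam`; returns the weight w."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / (sxx + lam)


def mse(w, pairs):
    """Mean squared error of the model y = w * x on `pairs`."""
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)


# Toy data following y = 2x, already separated into three sets.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
val = [(4.0, 8.0)]
test = [(5.0, 10.0)]

# Tune the hyperparameter using only the training and validation sets.
candidates = [0.0, 1.0, 10.0]
best_lam = min(candidates, key=lambda lam: mse(fit(train, lam), val))

# Touch the test set exactly once, after every modeling decision is final.
test_error = mse(fit(train, best_lam), test)
print(best_lam, test_error)
```

The point of the structure is that the test set plays no part in choosing best_lam, so test_error is not biased by the tuning step.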

+
+
+

Custom Warnings: ImperfectSplittingWarning and NormalizationWarning

+

In the event that your requested train/validation/test split is not mathematically possible given the dimensions of the input data (e.g. you request 50/25/25 but have 101 data points), astartes will warn you at runtime with an ImperfectSplittingWarning. sklearn simply moves on quietly, and while this is fine most of the time, the authors felt it prudent to warn the user. +When you specify a train/validation/test split, astartes will also check that the sizes are normalized and normalize them if they are not, raising a NormalizationWarning at runtime. This will hopefully help prevent head-scratching hours of debugging.
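
The arithmetic behind both warnings can be sketched in a few lines. This is not how astartes implements its checks; normalize_sizes and split_counts are hypothetical helpers, and the rounding rule (round validation and test down, give the remainder to training) is an assumption chosen only to make the example concrete.

```python
def normalize_sizes(train_size, val_size, test_size):
    """Rescale the three sizes so that they sum to one."""
    total = train_size + val_size + test_size
    return tuple(size / total for size in (train_size, val_size, test_size))


def split_counts(n_samples, fractions):
    """Return integer set sizes and whether the request was exactly achievable."""
    exact = [n_samples * fraction for fraction in fractions]
    is_perfect = all(float(value).is_integer() for value in exact)
    _, val, test = exact
    n_val = int(val)  # assumption: round validation and test down
    n_test = int(test)
    n_train = n_samples - n_val - n_test  # training set takes the remainder
    return (n_train, n_val, n_test), is_perfect


# Sizes given as 2/1/1 are not normalized; rescaling yields 50/25/25.
fractions = normalize_sizes(2, 1, 1)

# 101 points cannot be divided exactly 50/25/25, so the split is imperfect.
counts, is_perfect = split_counts(101, fractions)
print(fractions, counts, is_perfect)
```

With 100 data points the same request divides evenly, which is the situation in which no imperfect-split warning would be expected.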

+
+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/sklearn_to_astartes.rst b/docs/sklearn_to_astartes.rst new file mode 100644 index 00000000..fa5248f3 --- /dev/null +++ b/docs/sklearn_to_astartes.rst @@ -0,0 +1,180 @@ + +Transitioning from ``sklearn`` to ``astartes`` +====================================================== + +Step 1. Installation +-------------------- + +``astartes`` has been designed to rely on (1) as few packages as possible and (2) packages which are already likely to be installed in a Machine Learning (ML) Python workflow (i.e. Numpy and Sklearn). Because of this, ``astartes`` should be compatible with your *existing* workflow such as a conda environment. + +To install ``astartes`` for general ML use (the sampling of arbitrary vectors): **\ ``pip install astartes``\ ** + +For users in cheminformatics, ``astartes`` has an optional add-on that includes featurization as part of the sampling. To install, type **\ ``pip install 'astartes[molecules]'``\ **. With this extra install, ``astartes`` uses `\ ``AIMSim`` `_ to encode SMILES strings as feature vectors. The SMILES strings are parsed into molecular graphs using RDKit and then sampled with a single function call: ``train_test_split_molecules``. + + +* If your workflow already has a featurization scheme in place (i.e. you already have a vector representation of your chemical of interest), you can directly use ``train_test_split`` (though we invite you to explore the many molecular descriptors made available through AIMSim). + +Step 2. Changing the ``import`` Statement +--------------------------------------------- + +In one of the first few lines of your Python script, you have the line ``from sklearn.model_selection import train_test_split``. To switch to using ``astartes`` change this line to ``from astartes import train_test_split``. + +That's it! You are now using ``astartes``. 
+ +If you were just calling ``train_test_split(X, y)``\ , your script should now work in the exact same way as ``sklearn`` with no changes required. + +.. code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + random_state=42, + ) + +*becomes* + +.. code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + random_state=42, + ) + +But we encourage you to try one of our many other samplers (see below)! + +Step 3. Specifying an Algorithmic Sampler +----------------------------------------- + +By default (for interoperability), ``astartes`` will use a random sampler to produce train/test splits - but the real value of ``astartes`` is in the algorithmic sampling algorithms it implements. Check out the `README for a complete list of available algorithms `_ and how to call and customize them. + +If you existing call to ``train_test_split`` looks like this: + +.. code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + ) + +and you want to try out using Kennard-Stone sampling, switch it to this: + +.. code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + sampler="kennard_stone", + ) + +That's it! + +Step 4. Passing Keyword Arguments +--------------------------------- + +All of the arguments to the ``sklearn``\ 's ``train_test_split`` can still be passed to ``astartes``\ ' ``train_test_split``\ : + +.. code-block:: python + + X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split( + X, + y, + labels, + train_size = 0.75, + test_size = 0.25, + sampler = "kmeans", + hopts = {"n_clusters": 4}, + ) + +Some samplers have tunable hyperparameters that allow you to more finely control their behavior. To do this with Sphere Exclusion, for example, switch your call to this: + +.. 
code-block:: python + + X_train, X_test, y_train, y_test = train_test_split( + X, + y, + sampler="sphere_exclusion", + hopts={"distance_cutoff":0.15}, + ) + +Step 5. Useful ``astartes`` Features +---------------------------------------- + +``return_indices``\ : Improve Code Clarity +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There are circumstances where the indices of the train/test data can be useful (for example, if ``y`` or ``labels`` are large, memory-intense objects), and there is no way to directly return these indices in ``sklearn``. ``astartes`` will return the sampling splits themselves by default, but it can also return the indices for the user to manipulate according to their needs: + +.. code-block:: python + + X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split( + X, + y, + labels, + return_indices = False, + ) + +*could instead be* + +.. code-block:: python + + X_train, X_test, y_train, y_test, labels_train, labels_test, indices_train, indices_test = train_test_split( + X, + y, + labels, + return_indices = True, + ) + +If ``y`` or ``labels`` were large, memory-intense objects it could be beneficial to *not* pass them in to ``train_test_split`` and instead separate the existing lists later using the returned indices. + +``train_val_test_split``\ : More Rigorous ML +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Behind the scenes, ``train_test_split`` is actually just a one-line function that calls the real workhorse of ``astartes`` - ``train_val_test_split``\ : + +.. code-block:: python + + def train_test_split( + X: np.array, + ... + return_indices: bool = False, + ): + return train_val_test_split( + X, y, labels, train_size, 0, test_size, sampler, hopts, return_indices + ) + +The function call to ``train_val_test_split`` is identical to ``train_test_split`` and supports all the same samplers and hyperparameters, except for one additional keyword argument ``val_size``\ : + +.. 
code-block:: python + + def train_val_test_split( + X: np.array, + y: np.array = None, + labels: np.array = None, + train_size: float = 0.8, + val_size: float = 0.1, + test_size: float = 0.1, + sampler: str = "random", + hopts: dict = {}, + return_indices: bool = False, + ): + +When called, this will return *three* arrays from ``X``\ , ``y``\ , and ``labels`` (or three arrays of indices, if ``return_indices=True``\ ) rather than the usual two, according to the values given for ``train_size``\ , ``val_size``\ , and ``test_size`` in the function call. + +.. code-block:: python + + X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split( + X, + y, + train_size: float = 0.8, + val_size: float = 0.1, + test_size: float = 0.1, + ) + +For truly rigorous ML modeling, the validation set should be used for hyperparameter tuning and the test set held out until the *very final* change has been made to the model to get a true sense of its performance. For better or for worse, this is *not* the current standard for ML modeling, but the authors believe it should be. + +Custom Warnings: ``ImperfectSplittingWarning`` and ``NormalizationWarning`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In the event that your requested train/validation/test split is not mathematically possible given the dimensions of the input data (i.e. you request 50/25/25 but have 101 data points), ``astartes`` will warn you during runtime that it has occurred. ``sklearn`` simply moves on quietly, and while this is fine *most* of the time, the authors felt it prudent to warn the user. +When entering a train/validation/test split, ``astartes`` will check that it is normalized and make it so if not, warning the user during runtime. This will hopefully help prevent head-scratching hours of debugging. 
diff --git a/docs/test.doctree b/docs/test.doctree new file mode 100644 index 00000000..fb5d912a Binary files /dev/null and b/docs/test.doctree differ diff --git a/docs/test.functional.doctree b/docs/test.functional.doctree new file mode 100644 index 00000000..269d3fcd Binary files /dev/null and b/docs/test.functional.doctree differ diff --git a/docs/test.functional.html b/docs/test.functional.html new file mode 100644 index 00000000..6d932449 --- /dev/null +++ b/docs/test.functional.html @@ -0,0 +1,306 @@ + + + + + + + test.functional package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

test.functional package

+
+

Submodules

+
+
+

test.functional.test_astartes module

+
+
+class test.functional.test_astartes.Test_astartes(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of astartes.

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_close_mispelling_sampler()
+

Astartes should be helpful in the event of a typo.

+
+ +
+
+test_extrapolative_shuffling()
+

Extrapolative samplers should split data differently with different random_state values.

+
+ +
+
+test_inconsistent_input_lengths()
+

Different length X, y, and labels should raise an exception at start.

+
+ +
+
+test_insufficient_dataset_test()
+

If the user requests a split that would result in rounding down the size of the +test set to zero, a helpful exception should be raised.

+
+ +
+
+test_insufficient_dataset_train()
+

If the user requests a split that would result in rounding down the size of the +training set to zero, a helpful exception should be raised.

+
+ +
+
+test_insufficient_dataset_val()
+

If the user requests a split that would result in rounding down the size of the +validation set to zero, a helpful exception should be raised.

+
+ +
+
+test_not_implemented_sampler()
+

Astartes should suggest checking the docstring.

+
+ +
+
+test_return_indices()
+

Test the ability to return the indices and the values.

+
+ +
+
+test_return_indices_with_validation()
+

Test the ability to return indices in train_val_test_split

+
+ +
+
+test_split_validation()
+

Tests of the input split validation.

+
+ +
+
+test_train_test_split()
+

Functional test of train_test_split with imperfect splitting.

+
+ +
+
+test_train_val_test_split()
+

Split data into training, validation, and test sets.

+
+ +
+
+test_train_val_test_split_extrpolation_shuffling()
+

Split data into training, validation, and test sets with shuffling.

+
+ +
+ +
+
+

test.functional.test_molecules module

+
+
+class test.functional.test_molecules.Test_molecules(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of molecules.

+

Note: daylight_fingerprint is not compatible because it produces inhomogeneous arrays +(variable-length descriptor)

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_fingerprints()
+

Test using different fingerprints with the molecular featurization.

+
+ +
+
+test_fprint_hopts()
+

Test specifying hyperparameters for the molecular featurization step.

+
+ +
+
+test_maximum_call()
+

Specify ALL the optional hyperparameters!

+
+ +
+
+test_molecules()
+

Try train_test_split molecules with every interpolative sampler.

+
+ +
+
+test_molecules_with_rdkit()
+

Try train_test_split molecules, every sampler, passing rdkit objects.

+
+ +
+
+test_molecules_with_troublesome_smiles()
+

Helpful errors when rdkit graphs can’t be featurized.

+
+ +
+
+test_sampler_hopts()
+

Test ability to pass through sampler hopts with molecules interface, expecting no warnings.

+
+ +
+
+test_validation_split_molecules()
+

Try train_val_test_split_molecules with every extrapolative sampler.

+
+ +
+ +
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/test.functional.rst b/docs/test.functional.rst new file mode 100644 index 00000000..0c1189a2 --- /dev/null +++ b/docs/test.functional.rst @@ -0,0 +1,29 @@ +test.functional package +======================= + +Submodules +---------- + +test.functional.test\_astartes module +------------------------------------- + +.. automodule:: test.functional.test_astartes + :members: + :undoc-members: + :show-inheritance: + +test.functional.test\_molecules module +-------------------------------------- + +.. automodule:: test.functional.test_molecules + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: test.functional + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/test.html b/docs/test.html new file mode 100644 index 00000000..711a9a07 --- /dev/null +++ b/docs/test.html @@ -0,0 +1,223 @@ + + + + + + + test package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

test package

+
+

Subpackages

+
+ +
+
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/test.regression.doctree b/docs/test.regression.doctree new file mode 100644 index 00000000..b2442a3f Binary files /dev/null and b/docs/test.regression.doctree differ diff --git a/docs/test.regression.html b/docs/test.regression.html new file mode 100644 index 00000000..d58d57ce --- /dev/null +++ b/docs/test.regression.html @@ -0,0 +1,189 @@ + + + + + + + test.regression package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

test.regression package

+
+

Submodules

+
+
+

test.regression.test_regression module

+
+
+class test.regression.test_regression.Test_regression(methodName='runTest')
+

Bases: TestCase

+

Test for regression relative to saved reference splits.

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_extrapolation_regression()
+

Regression testing of extrapolative methods relative to static results.

+
+ +
+
+test_interpolation_regression()
+

Regression testing of interpolative methods relative to static results.

+
+ +
+
+test_kmeans_regression_sklearn_v12()
+

Regression testing of KMeans in sklearn v1.2 or earlier.

+
+ +
+
+test_kmeans_regression_sklearn_v13()
+

Regression testing of KMeans in sklearn v1.3 or newer.

+
+ +
+
+test_timebased_regression()
+

Regression test TimeBased, which has labels to check as well.

+
+ +
+ +
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/test.regression.rst b/docs/test.regression.rst new file mode 100644 index 00000000..c3cb03bb --- /dev/null +++ b/docs/test.regression.rst @@ -0,0 +1,21 @@ +test.regression package +======================= + +Submodules +---------- + +test.regression.test\_regression module +--------------------------------------- + +.. automodule:: test.regression.test_regression + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: test.regression + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/test.rst b/docs/test.rst new file mode 100644 index 00000000..9fca9975 --- /dev/null +++ b/docs/test.rst @@ -0,0 +1,20 @@ +test package +============ + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + test.functional + test.regression + test.unit + +Module contents +--------------- + +.. automodule:: test + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/test.unit.doctree b/docs/test.unit.doctree new file mode 100644 index 00000000..d21cd75c Binary files /dev/null and b/docs/test.unit.doctree differ diff --git a/docs/test.unit.html b/docs/test.unit.html new file mode 100644 index 00000000..9bf8b511 --- /dev/null +++ b/docs/test.unit.html @@ -0,0 +1,204 @@ + + + + + + + test.unit package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+ + +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/test.unit.rst b/docs/test.unit.rst new file mode 100644 index 00000000..d11aa2eb --- /dev/null +++ b/docs/test.unit.rst @@ -0,0 +1,19 @@ +test.unit package +================= + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + test.unit.samplers + test.unit.utils + +Module contents +--------------- + +.. automodule:: test.unit + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/test.unit.samplers.doctree b/docs/test.unit.samplers.doctree new file mode 100644 index 00000000..485bbee4 Binary files /dev/null and b/docs/test.unit.samplers.doctree differ diff --git a/docs/test.unit.samplers.extrapolative.doctree b/docs/test.unit.samplers.extrapolative.doctree new file mode 100644 index 00000000..a65c525f Binary files /dev/null and b/docs/test.unit.samplers.extrapolative.doctree differ diff --git a/docs/test.unit.samplers.extrapolative.html b/docs/test.unit.samplers.extrapolative.html new file mode 100644 index 00000000..9a05fa1e --- /dev/null +++ b/docs/test.unit.samplers.extrapolative.html @@ -0,0 +1,367 @@ + + + + + + + test.unit.samplers.extrapolative package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

test.unit.samplers.extrapolative package

+
+

Submodules

+
+
+

test.unit.samplers.extrapolative.test_DBSCAN module

+
+
+class test.unit.samplers.extrapolative.test_DBSCAN.Test_DBSCAN(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of dbscan.

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_dbscan()
+

Directly instantiate and test DBSCAN.

+
+ +
+
+test_dbscan_sampling()
+

Use dbscan in the train_test_split and verify results.

+
+ +
+ +
+
+

test.unit.samplers.extrapolative.test_Scaffold module

+
+
+class test.unit.samplers.extrapolative.test_Scaffold.Test_scaffold(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of Scaffold.

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_include_chirality()
+

Include chirality in scaffold calculation

+
+ +
+
+test_incorrect_input()
+

Calling with something other than SMILES should raise TypeError

+
+ +
+
+test_mol_from_inchi()
+

Ability to load data from InChI inputs

+
+ +
+
+test_no_scaffold_found_warning()
+

Molecules that cannot be scaffolded should raise a warning

+
+ +
+
+test_remove_atom_map()
+

Scaffolds should not include atom map numbers

+
+ +
+
+test_scaffold()
+

Directly instantiate and test Scaffold.

+
+ +
+
+test_scaffold_sampling()
+

Use Scaffold in the train_test_split and verify results.

+
+ +
+ +
+
+

test.unit.samplers.extrapolative.test_kmeans module

+
+
+class test.unit.samplers.extrapolative.test_kmeans.Test_kmeans(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of kmeans.

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_kmeans()
+

Directly instantiate and test KMeans.

+
+ +
+
+test_kmeans_sampling_v12()
+

Use kmeans in the train_test_split and verify results.

+
+ +
+
+test_kmeans_sampling_v13()
+

Use kmeans in the train_test_split and verify results.

+
+ +
+ +
+
+

test.unit.samplers.extrapolative.test_optisim module

+
+
+class test.unit.samplers.extrapolative.test_optisim.Test_optisim(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of optisim.

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_optisim()
+

Directly instantiate and test OptiSim

+
+ +
+
+test_optisim_sampling()
+

Use optisim in the train_test_split and verify results.

+
+ +
+ +
+
+

test.unit.samplers.extrapolative.test_sphere_exclusion module

+
+
+class test.unit.samplers.extrapolative.test_sphere_exclusion.Test_sphere_exclusion(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of sphere_exclusion.

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_sphereexclusion()
+

Directly instantiate and test SphereExclusion.

+
+ +
+
+test_sphereexclusion_sampling()
+

Use sphere_exclusion in the train_test_split and verify results.

+
+ +
+ +
+
+

test.unit.samplers.extrapolative.test_time_based module

+
+
+class test.unit.samplers.extrapolative.test_time_based.Test_time_based(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of TimeBased.

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_incorrect_input()
+

Specifying labels as neither date nor datetime object should raise TypeError

+
+ +
+
+test_mising_labels()
+

Not specifying labels should raise ValueError

+
+ +
+
+test_time_based_date()
+

Directly instantiate and test TimeBased.

+
+ +
+
+test_time_based_datetime()
+

Directly instantiate and test TimeBased.

+
+ +
+
+test_time_based_sampling()
+

Use time_based in the train_test_split and verify results.

+
+ +
+ +
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/test.unit.samplers.extrapolative.rst b/docs/test.unit.samplers.extrapolative.rst new file mode 100644 index 00000000..cd6b17ef --- /dev/null +++ b/docs/test.unit.samplers.extrapolative.rst @@ -0,0 +1,61 @@ +test.unit.samplers.extrapolative package +======================================== + +Submodules +---------- + +test.unit.samplers.extrapolative.test\_DBSCAN module +---------------------------------------------------- + +.. automodule:: test.unit.samplers.extrapolative.test_DBSCAN + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_Scaffold module +------------------------------------------------------ + +.. automodule:: test.unit.samplers.extrapolative.test_Scaffold + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_kmeans module +---------------------------------------------------- + +.. automodule:: test.unit.samplers.extrapolative.test_kmeans + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_optisim module +----------------------------------------------------- + +.. automodule:: test.unit.samplers.extrapolative.test_optisim + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_sphere\_exclusion module +--------------------------------------------------------------- + +.. automodule:: test.unit.samplers.extrapolative.test_sphere_exclusion + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.extrapolative.test\_time\_based module +--------------------------------------------------------- + +.. automodule:: test.unit.samplers.extrapolative.test_time_based + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. 
automodule:: test.unit.samplers.extrapolative + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/test.unit.samplers.html b/docs/test.unit.samplers.html new file mode 100644 index 00000000..f89c0b10 --- /dev/null +++ b/docs/test.unit.samplers.html @@ -0,0 +1,251 @@ + + + + + + + test.unit.samplers package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

test.unit.samplers package

+
+

Subpackages

+
+ +
+
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/test.unit.samplers.interpolative.doctree b/docs/test.unit.samplers.interpolative.doctree new file mode 100644 index 00000000..e54d79a5 Binary files /dev/null and b/docs/test.unit.samplers.interpolative.doctree differ diff --git a/docs/test.unit.samplers.interpolative.html b/docs/test.unit.samplers.interpolative.html new file mode 100644 index 00000000..25dc1106 --- /dev/null +++ b/docs/test.unit.samplers.interpolative.html @@ -0,0 +1,247 @@ + + + + + + + test.unit.samplers.interpolative package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

test.unit.samplers.interpolative package

+
+

Submodules

+
+
+

test.unit.samplers.interpolative.test_kennard_stone module

+
+
+class test.unit.samplers.interpolative.test_kennard_stone.Test_kennard_stone(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of kennard_stone.

+
+
+classmethod setUpClass()
+

Save re-used arrays as class attributes.

+
+ +
+
+test_kennard_stone()
+

Directly instantiate and test KennardStone.

+
+ +
+
+test_kennard_stone_sample()
+

Use kennard_stone in train_test_split and verify results.

+
+ +
+
+test_kennard_stone_sample_no_warning()
+

Use kennard stone with a mathematically possible split requested

+
+ +
+ +
+
+

test.unit.samplers.interpolative.test_random module

+
+
+class test.unit.samplers.interpolative.test_random.Test_random(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of Random.

+
+
+classmethod setUpClass()
+

Save re-used arrays as class attributes.

+
+ +
+
+test_random()
+

Directly instantiate and test Random.

+
+ +
+
+test_random_sample()
+

Use random in train_test_split and verify results.

+
+ +
+
+test_random_sample_no_warning()
+

Use random with a mathematically possible split requested

+
+ +
+ +
+
+

test.unit.samplers.interpolative.test_spxy module

+
+
+class test.unit.samplers.interpolative.test_spxy.Test_SPXY(methodName='runTest')
+

Bases: TestCase

+

Test the various functionalities of SPXY.

+
+
+classmethod setUpClass()
+

Convenience attributes for later tests.

+
+ +
+
+test_missing_y()
+

SPXY requires a y array and should complain when one is not provided.

+
+ +
+
+test_spxy()
+

Directly instantiate and test SPXY.

+
+ +
+
+test_spxy_sampling()
+

Use spxy in the train_test_split and verify results.

+
+ +
+ +
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/test.unit.samplers.interpolative.rst b/docs/test.unit.samplers.interpolative.rst new file mode 100644 index 00000000..812af5a9 --- /dev/null +++ b/docs/test.unit.samplers.interpolative.rst @@ -0,0 +1,37 @@ +test.unit.samplers.interpolative package +======================================== + +Submodules +---------- + +test.unit.samplers.interpolative.test\_kennard\_stone module +------------------------------------------------------------ + +.. automodule:: test.unit.samplers.interpolative.test_kennard_stone + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.interpolative.test\_random module +---------------------------------------------------- + +.. automodule:: test.unit.samplers.interpolative.test_random + :members: + :undoc-members: + :show-inheritance: + +test.unit.samplers.interpolative.test\_spxy module +-------------------------------------------------- + +.. automodule:: test.unit.samplers.interpolative.test_spxy + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: test.unit.samplers.interpolative + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/test.unit.samplers.rst b/docs/test.unit.samplers.rst new file mode 100644 index 00000000..dc112797 --- /dev/null +++ b/docs/test.unit.samplers.rst @@ -0,0 +1,19 @@ +test.unit.samplers package +========================== + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + test.unit.samplers.extrapolative + test.unit.samplers.interpolative + +Module contents +--------------- + +.. 
automodule:: test.unit.samplers + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/test.unit.utils.doctree b/docs/test.unit.utils.doctree new file mode 100644 index 00000000..72e0d034 Binary files /dev/null and b/docs/test.unit.utils.doctree differ diff --git a/docs/test.unit.utils.html b/docs/test.unit.utils.html new file mode 100644 index 00000000..f72f5bb8 --- /dev/null +++ b/docs/test.unit.utils.html @@ -0,0 +1,220 @@ + + + + + + + test.unit.utils package — astartes astartes.__version__ documentation + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

test.unit.utils package

+
+

Submodules

+
+
+

test.unit.utils.test_convert_to_array module

+
+
+class test.unit.utils.test_convert_to_array.Test_convert_to_array(methodName='runTest')
+

Bases: TestCase

+

Test array type handling.

+
+
+test_bad_type_cast()
+

Raise error when casting arrays that do not contain supported types.

+
+ +
+
+test_convertable_input()
+

Raise warning when casting.

+
+ +
+
+test_panda_handla()
+

Splitting pandas DataFrames and Series should return DataFrames and Series, respectively.

+
+ +
+
+test_unconvertable_input()
+

Raise error when casting fails.

+
+ +
+ +
+
+

test.unit.utils.test_sampler_factory module

+
+
+class test.unit.utils.test_sampler_factory.Test_sampler_factory(methodName='runTest')
+

Bases: TestCase

+

Test SamplerFactory functions on all samplers.

+
+
+classmethod setUpClass()
+

Save re-used arrays as class attributes.

+
+ +
+
+test_train_test_split()
+

Call sampler factory on all inputs.

+
+ +
+ +
+
+

test.unit.utils.test_utils module

+
+
+class test.unit.utils.test_utils.Test_utils(methodName='runTest')
+

Bases: TestCase

+

Test functions within utils.py.

+
+
+classmethod setUpClass()
+

Save re-used arrays as class attributes.

+
+ +
+
+test_generate_regression_results_dict()
+

Generate results dictionary for simple regression task.

+
+ +
+ +
+
+

Module contents

+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/test.unit.utils.rst b/docs/test.unit.utils.rst new file mode 100644 index 00000000..d4b96cf4 --- /dev/null +++ b/docs/test.unit.utils.rst @@ -0,0 +1,37 @@ +test.unit.utils package +======================= + +Submodules +---------- + +test.unit.utils.test\_convert\_to\_array module +----------------------------------------------- + +.. automodule:: test.unit.utils.test_convert_to_array + :members: + :undoc-members: + :show-inheritance: + +test.unit.utils.test\_sampler\_factory module +--------------------------------------------- + +.. automodule:: test.unit.utils.test_sampler_factory + :members: + :undoc-members: + :show-inheritance: + +test.unit.utils.test\_utils module +---------------------------------- + +.. automodule:: test.unit.utils.test_utils + :members: + :undoc-members: + :show-inheritance: + +Module contents +--------------- + +.. automodule:: test.unit.utils + :members: + :undoc-members: + :show-inheritance: