# Transitioning from sklearn to astartes

## Step 1. Installation

astartes has been designed to rely on (1) as few packages as possible and (2) packages that are already likely to be installed in a Machine Learning (ML) Python workflow (i.e. NumPy and scikit-learn). Because of this, astartes should be compatible with your existing workflow, such as an existing conda environment.

To install astartes for general ML use (the sampling of arbitrary vectors): `pip install astartes`

For users in cheminformatics, astartes has an optional add-on that includes featurization as part of the sampling. To install, type `pip install 'astartes[molecules]'`. With this extra install, astartes uses AIMSim to encode SMILES strings as feature vectors. The SMILES strings are parsed into molecular graphs using RDKit and then sampled with a single function call: `train_test_split_molecules` (see the sketch below).

- If your workflow already has a featurization scheme in place (i.e. you already have a vector representation of your chemicals of interest), you can directly use `train_test_split` (though we invite you to explore the many molecular descriptors made available through AIMSim).
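For illustration, here is a minimal sketch of the molecules workflow. It assumes `train_test_split_molecules` is importable from `astartes.molecules` and accepts an array of SMILES strings plus the same sampler keywords as `train_test_split`; consult the README for the authoritative signature.

```python
import numpy as np

from astartes.molecules import train_test_split_molecules  # requires the [molecules] extra

# A handful of SMILES strings; in practice these come from your dataset.
smiles = np.array(["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O", "CCOCC"])
y = np.array([0.5, 0.7, 0.3, 1.2, 0.9, 0.6])

# Featurization (via AIMSim/RDKit) and sampling happen in one call.
X_train, X_test, y_train, y_test = train_test_split_molecules(
    smiles,
    y,
    sampler="kennard_stone",
)
```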

## Step 2. Changing the Import Statement

In one of the first few lines of your Python script, you have the line `from sklearn.model_selection import train_test_split`. To switch to using astartes, change this line to `from astartes import train_test_split`.

That's it! You are now using astartes.

If you were just calling `train_test_split(X, y)`, your script should now work in the exact same way as sklearn with no changes required.

```python
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=42,
)
```

becomes

```python
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=42,
)
```

But we encourage you to try one of our many other samplers (see below)!

## Step 3. Specifying an Algorithmic Sampler

By default (for interoperability), astartes will use a random sampler to produce train/test splits, but the real value of astartes is in the algorithmic samplers it implements. Check out the README for a complete list of available algorithms and how to call and customize them.

If your existing call to `train_test_split` looks like this:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
)
```

and you want to try out using Kennard-Stone sampling, switch it to this:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    sampler="kennard_stone",
)
```

That's it!
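For context, the call above slots into a complete script like the following sketch (synthetic NumPy data stands in for a real dataset):

```python
import numpy as np

from astartes import train_test_split

# Synthetic stand-ins for a real dataset.
X = np.random.rand(100, 8)
y = np.random.rand(100)

# Kennard-Stone picks maximally dissimilar points for the training set first,
# giving a deterministic, diversity-based split.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    sampler="kennard_stone",
)
```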

## Step 4. Passing Keyword Arguments

All of the arguments to sklearn's `train_test_split` can still be passed to astartes' `train_test_split`:

```python
X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split(
    X,
    y,
    labels,
    train_size=0.75,
    test_size=0.25,
    sampler="kmeans",
    hopts={"n_clusters": 4},
)
```

Some samplers have tunable hyperparameters that allow you to more finely control their behavior. To do this with Sphere Exclusion, for example, switch your call to this:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    sampler="sphere_exclusion",
    hopts={"distance_cutoff": 0.15},
)
```

## Step 5. Useful astartes Features

### `return_indices`: Improve Code Clarity

There are circumstances where the indices of the train/test data can be useful (for example, if `y` or `labels` are large, memory-intensive objects), and there is no way to directly return these indices in sklearn. astartes will return the sampled splits themselves by default, but it can also return the indices for the user to manipulate according to their needs:

```python
X_train, X_test, y_train, y_test, labels_train, labels_test = train_test_split(
    X,
    y,
    labels,
    return_indices=False,
)
```

could instead be

```python
X_train, X_test, y_train, y_test, labels_train, labels_test, indices_train, indices_test = train_test_split(
    X,
    y,
    labels,
    return_indices=True,
)
```

If `y` or `labels` were large, memory-intensive objects, it could be beneficial not to pass them into `train_test_split` at all and instead slice the existing arrays later using the returned indices.
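For instance, a minimal sketch of that pattern, assuming that when only `X` is passed with `return_indices=True` the splits of `X` are returned first and the index arrays last, mirroring the ordering shown above:

```python
import numpy as np

from astartes import train_test_split

X = np.random.rand(1000, 16)
y_large = np.random.rand(1000, 10_000)  # stand-in for a memory-heavy target array

# Keep the heavy object out of the call entirely; ask for indices as well.
X_train, X_test, indices_train, indices_test = train_test_split(
    X,
    sampler="kennard_stone",
    return_indices=True,
)

# Slice the heavy object yourself using the returned indices.
y_train, y_test = y_large[indices_train], y_large[indices_test]
```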

### `train_val_test_split`: More Rigorous ML

Behind the scenes, `train_test_split` is actually just a one-line function that calls the real workhorse of astartes, `train_val_test_split`:

```python
def train_test_split(
    X: np.array,
    ...
    return_indices: bool = False,
):
    return train_val_test_split(
        X, y, labels, train_size, 0, test_size, sampler, hopts, return_indices
    )
```

The signature of `train_val_test_split` is identical to `train_test_split`, and it supports all the same samplers and hyperparameters, except for one additional keyword argument, `val_size`:

```python
def train_val_test_split(
    X: np.array,
    y: np.array = None,
    labels: np.array = None,
    train_size: float = 0.8,
    val_size: float = 0.1,
    test_size: float = 0.1,
    sampler: str = "random",
    hopts: dict = {},
    return_indices: bool = False,
):
```

When called, this will return three splits of each of `X`, `y`, and `labels` (or three arrays of indices, if `return_indices=True`) rather than the usual two, according to the values given for `train_size`, `val_size`, and `test_size` in the function call.

```python
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X,
    y,
    train_size=0.8,
    val_size=0.1,
    test_size=0.1,
)
```

For truly rigorous ML modeling, the validation set should be used for hyperparameter tuning and the test set held out until the very final change has been made to the model to get a true sense of its performance. For better or for worse, this is not the current standard for ML modeling, but the authors believe it should be.
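As an illustration of that workflow, here is a minimal sketch pairing astartes with a scikit-learn model (the `Ridge` regressor and the candidate `alpha` values are arbitrary stand-ins, not part of astartes):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

from astartes import train_val_test_split

# Synthetic stand-ins for a real dataset.
X = np.random.rand(200, 8)
y = np.random.rand(200)

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X,
    y,
    sampler="kennard_stone",
)

# Tune hyperparameters against the validation set only.
best_alpha, best_mse = None, np.inf
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# Touch the held-out test set exactly once, after all decisions are made.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, final_model.predict(X_test)))
```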

### Custom Warnings: `ImperfectSplittingWarning` and `NormalizationWarning`

In the event that your requested train/validation/test split is not mathematically possible given the dimensions of the input data (e.g. you request a 50/25/25 split but have 101 data points), astartes will warn you at runtime with an `ImperfectSplittingWarning`. sklearn simply moves on quietly, and while this is fine most of the time, the authors felt it prudent to warn the user. Similarly, when you enter a train/validation/test split, astartes will check that it is normalized, normalize it if it is not, and warn the user at runtime with a `NormalizationWarning`. This will hopefully prevent head-scratching hours of debugging.
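To see this in action, here is a minimal sketch using Python's standard `warnings` module to capture the warning from the 50/25/25-on-101-points scenario described above:

```python
import warnings

import numpy as np

from astartes import train_val_test_split

X = np.random.rand(101, 4)  # 101 points cannot be split exactly 50/25/25

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    splits = train_val_test_split(
        X,
        train_size=0.50,
        val_size=0.25,
        test_size=0.25,
    )

for w in caught:
    print(w.category.__name__, "-", w.message)  # expect an ImperfectSplittingWarning
```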