deepchem.splits package

Submodules

deepchem.splits.splitters module

Contains an abstract base class that supports chemically aware data splits.

class deepchem.splits.splitters.ButinaSplitter(verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for doing data splits based on the Butina clustering of a bulk Tanimoto fingerprint matrix.

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, frac_train=None, frac_valid=None, frac_test=None, log_every_n=1000, cutoff=0.18)[source]

Splits internal compounds into train and validation sets based on the Butina clustering algorithm. This splitting algorithm has an O(N^2) run time, where N is the number of elements in the dataset. The dataset is expected to be a classification dataset.

This algorithm is designed to generate validation data that are novel chemotypes.

Note that this function entirely disregards the ratios for frac_train, frac_valid, and frac_test. Furthermore, it does not generate a test set, only a train and valid set.

Setting a small cutoff value will generate smaller, finer clusters of high similarity, whereas setting a large cutoff value will generate larger, coarser clusters of low similarity.
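The effect of the cutoff can be illustrated with a minimal pure-Python sketch of Butina-style greedy clustering over a precomputed distance matrix. This is an illustration under stated assumptions, not DeepChem's implementation: `butina_cluster` and the toy matrix are hypothetical, and the centroid choice is recomputed on every pass for simplicity.

```python
def butina_cluster(dist, cutoff):
    """Greedy Butina-style clustering on a symmetric distance matrix."""
    n = len(dist)
    # For each point, collect all points within the cutoff (including itself).
    neighbors = [
        {j for j in range(n) if dist[i][j] <= cutoff} for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Centroid: the unassigned point with the most unassigned neighbors
        # (ties go to the lowest index).
        centroid = max(
            sorted(unassigned),
            key=lambda i: len(neighbors[i] & unassigned),
        )
        members = neighbors[centroid] & unassigned
        clusters.append(sorted(members))
        unassigned -= members
    return clusters


# Hypothetical distances (1 - similarity): points 0-2 are close, 3 is an outlier.
toy = [
    [0.0, 0.1, 0.3, 0.9],
    [0.1, 0.0, 0.3, 0.9],
    [0.3, 0.3, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
fine = butina_cluster(toy, cutoff=0.18)   # [[0, 1], [2], [3]]
coarse = butina_cluster(toy, cutoff=0.5)  # [[0, 1, 2], [3]]
```

With the smaller cutoff, point 2 no longer joins the cluster around points 0 and 1, so the clustering is finer.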

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

deepchem.splits.splitters.ClusterFps(fps, cutoff=0.2)[source]

class deepchem.splits.splitters.FingerprintSplitter(verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for doing data splits based on the fingerprints of small molecules. This is an O(N**2) algorithm.

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=1000)[source]

Splits internal compounds into train/validation/test by fingerprint.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

update_distances(last_selected, cur_distances, distance_matrix, dont_update)[source]

class deepchem.splits.splitters.IndexSplitter(verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for simple order based splits.

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits internal compounds into train/validation/test in provided order.
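An order-based split can be sketched as two cut points applied to the index range. This is a minimal illustration, not the library code; `index_split` is a hypothetical helper, and sending the remainder to the test set is an assumption.

```python
def index_split(n, frac_train=0.8, frac_valid=0.1):
    """Return (train, valid, test) index lists that preserve dataset order."""
    train_cutoff = int(frac_train * n)
    valid_cutoff = int((frac_train + frac_valid) * n)
    indices = list(range(n))
    # Whatever is left after the validation cut becomes the test set.
    return (indices[:train_cutoff],
            indices[train_cutoff:valid_cutoff],
            indices[valid_cutoff:])


train, valid, test = index_split(10)
# train == [0, ..., 7], valid == [8], test == [9]
```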

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.IndiceSplitter(verbose=False, valid_indices=None, test_indices=None)[source]

Bases: deepchem.splits.splitters.Splitter

Class for splits based on input order.

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits internal compounds into train/validation/test in designated order.
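A split driven by the valid_indices and test_indices constructor arguments can be sketched as follows. The sketch assumes that every index not named in either list falls into the training set; `indice_split` is a hypothetical stand-in, not the class itself.

```python
def indice_split(n, valid_indices=None, test_indices=None):
    """Assign the given indices to valid/test; everything else goes to train."""
    valid = list(valid_indices or [])
    test = list(test_indices or [])
    held_out = set(valid) | set(test)
    train = [i for i in range(n) if i not in held_out]
    return train, valid, test


train, valid, test = indice_split(6, valid_indices=[4], test_indices=[1, 5])
# train == [0, 2, 3]
```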

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.MaxMinSplitter(verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for doing splits based on the MaxMin diversity algorithm. Intuitively, the test set is comprised of the most diverse compounds of the entire dataset. Furthermore, the validation set is comprised of diverse compounds under the test set.
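The MaxMin idea can be sketched in a few lines: each new pick is the item whose distance to its nearest already-picked item is largest. This is a generic illustration over a precomputed distance matrix, with a hypothetical `maxmin_pick` helper and toy data; how DeepChem computes distances and chooses the first pick is not shown here.

```python
def maxmin_pick(dist, n_pick, first=0):
    """Greedily pick items, each maximizing its minimum distance to prior picks."""
    n_pick = min(n_pick, len(dist))
    picked = [first]
    # min_dist[i] is the distance from item i to its nearest picked item.
    min_dist = list(dist[first])
    while len(picked) < n_pick:
        candidates = [i for i in range(len(dist)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        # Refresh nearest-pick distances with the newly picked item.
        min_dist = [min(a, b) for a, b in zip(min_dist, dist[nxt])]
    return picked


# Toy items on a line at positions 0, 1, 2, 10; distance is absolute difference.
points = [0, 1, 2, 10]
dist = [[abs(a - b) for b in points] for a in points]
picks = maxmin_pick(dist, n_pick=3)  # [0, 3, 2]
```

Starting from item 0, the far outlier (index 3) is picked first, followed by the item farthest from both picks (index 2).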

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits internal compounds randomly into train/validation/test.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.MolecularWeightSplitter(verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for doing data splits by molecular weight.

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits internal compounds into train/validation/test using the MW calculated by SMILES string.
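One plausible reading of a weight-based split is to sort compounds by molecular weight and cut the sorted order by the requested fractions. The sketch below assumes exactly that, with hypothetical precomputed weights standing in for values derived from SMILES strings; `split_by_weight` is not a DeepChem function.

```python
def split_by_weight(weights, frac_train=0.8, frac_valid=0.1):
    """Sort indices by ascending weight, then cut into train/valid/test."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    train_cutoff = int(frac_train * len(order))
    valid_cutoff = int((frac_train + frac_valid) * len(order))
    return (order[:train_cutoff],
            order[train_cutoff:valid_cutoff],
            order[valid_cutoff:])


# Hypothetical molecular weights for ten compounds.
weights = [180.2, 46.1, 342.3, 78.1, 151.2, 18.0, 206.3, 60.1, 122.1, 94.1]
train, valid, test = split_by_weight(weights)
# The two heaviest compounds (indices 6 and 2) end up in valid and test.
```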

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.RandomGroupSplitter(groups, *args, **kwargs)[source]

Bases: deepchem.splits.splitters.Splitter

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.RandomSplitter(verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for doing random data splits.

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits internal compounds randomly into train/validation/test.
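A seeded random split can be sketched with the standard library alone. This is an illustrative stand-in (`random_split` is hypothetical, and DeepChem's own random number handling may differ); it shows why passing the same seed reproduces the same partition.

```python
import random


def random_split(n, seed=None, frac_train=0.8, frac_valid=0.1):
    """Shuffle indices 0..n-1 with a seeded RNG, then cut into three parts."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    train_cutoff = int(frac_train * n)
    valid_cutoff = int((frac_train + frac_valid) * n)
    return (indices[:train_cutoff],
            indices[train_cutoff:valid_cutoff],
            indices[valid_cutoff:])


train, valid, test = random_split(100, seed=123)
# Sizes are 80/10/10, and the same seed always yields the same partition.
```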

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.RandomStratifiedSplitter(verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

RandomStratified Splitter class.

For sparse multitask datasets, a standard split offers no guarantees that the splits will have any active compounds. This class guarantees that each task will have a proportional split of the actives in a split. To do this, a ragged split is performed with different numbers of compounds taken from each task. Thus, the length of the split arrays may exceed the split of the original array. That said, no datapoint is copied to more than one split, so correctness is still ensured.

Note that this splitter is only valid for boolean label data.
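The guarantee of a proportional share of actives can be illustrated for a single task with boolean labels. The sketch below is a simplification and an assumption on my part: it splits each label class separately, and it does not reproduce the ragged multitask handling described above. `stratified_split` is a hypothetical helper.

```python
import random


def stratified_split(labels, frac_split, seed=None):
    """Split indices so each label class contributes frac_split to the first part."""
    rng = random.Random(seed)
    first, second = [], []
    for value in (True, False):
        members = [i for i, label in enumerate(labels) if label == value]
        rng.shuffle(members)
        cut = int(round(frac_split * len(members)))
        first.extend(members[:cut])
        second.extend(members[cut:])
    return sorted(first), sorted(second)


# A sparse task: 10 actives among 100 compounds.
labels = [True] * 10 + [False] * 90
first, second = stratified_split(labels, frac_split=0.8, seed=0)
# first holds 8 of the 10 actives; second holds the remaining 2.
```

A plain random 80/20 split of the same data could, by chance, leave the smaller part with no actives at all; splitting per class rules that out.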

TODO(rbharath): This splitter should be refactored to match style of other splitter classes.

get_task_split_indices(y, w, frac_split)[source]

Returns num datapoints needed per task to split properly.

k_fold_split(dataset, k, directories=None, **kwargs)[source]

Needs custom implementation due to ragged splits for stratification.

split(dataset, frac_split, split_dirs=None)[source]

Method that does bulk of splitting dataset.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000)[source]

Custom split due to raggedness in original split.

class deepchem.splits.splitters.ScaffoldSplitter(verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for doing data splits based on the scaffold of small molecules.

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=1000)[source]

Splits internal compounds into train/validation/test by scaffold.
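A scaffold split keeps all molecules that share a scaffold in the same subset. The sketch below assumes a common strategy (largest scaffold groups fill train first, then valid, then test) and uses hypothetical string labels in place of real scaffold SMILES; `scaffold_split` is illustrative and not the library method.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train, valid, or test."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by first occurrence.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold label per molecule.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole", "furan"]
train, valid, test = scaffold_split(scaffolds)
# train holds every benzene and pyridine molecule; valid == [8]; test == [9]
```

Because groups are never broken up, the validation and test sets contain scaffolds the model has not seen during training.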

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.SingletaskStratifiedSplitter(task_number=0, verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class for doing data splits by stratification on a single task.

Example:

>>> import numpy as np
>>> from deepchem.data import DiskDataset
>>> from deepchem.splits import SingletaskStratifiedSplitter
>>> n_samples = 100
>>> n_features = 10
>>> n_tasks = 10
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones_like(y)
>>> dataset = DiskDataset.from_numpy(X, y, w, verbose=False)
>>> splitter = SingletaskStratifiedSplitter(task_number=5, verbose=False)
>>> train_dataset, test_dataset = splitter.train_test_split(dataset)

k_fold_split(dataset, k, directories=None, seed=None, log_every_n=None, **kwargs)[source]

Splits compounds into k-folds using stratified sampling. Overriding base class k_fold_split.

Parameters:
  • dataset (dc.data.Dataset object) – Dataset.
  • k (int) – Number of folds.
  • seed (int (Optional, Default None)) – Random seed.
  • log_every_n (int (Optional, Default None)) – Log every n examples (not currently used).
Returns:

fold_datasets – List containing dc.data.Dataset objects

Return type:

List

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits compounds into train/validation/test using stratified sampling.

Parameters:
  • dataset (dc.data.Dataset object) – Dataset.
  • seed (int (Optional, Default None)) – Random seed.
  • frac_train (float (Optional, Default .8)) – Fraction of dataset put into training data.
  • frac_valid (float (Optional, Default .1)) – Fraction of dataset put into validation data.
  • frac_test (float (Optional, Default .1)) – Fraction of dataset put into test data.
  • log_every_n (int (Optional, Default None)) – Log every n examples (not currently used).
Returns:

retval – Tuple containing train indices, valid indices, and test indices

Return type:

Tuple

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.SpecifiedSplitter(input_file, split_field, verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

Class that splits data according to user specification.

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=1000)[source]

Splits internal compounds into train/validation/test by user-specification.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.Splitter(verbose=False)[source]

Bases: object

Abstract base class for chemically aware splits.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list
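The shape of the return value, k pairs of (train, cv), can be sketched at the index level. This is a generic k-fold illustration, not the base class implementation: `k_fold_indices` is hypothetical, and how DeepChem distributes a remainder across folds is not stated here, so the choice below is an assumption.

```python
def k_fold_indices(n, k):
    """Return k (train, cv) index pairs; each index appears in cv exactly once."""
    indices = list(range(n))
    fold_size, remainder = divmod(n, k)
    folds = []
    start = 0
    for fold in range(k):
        # Assumption: spread any remainder over the first folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        cv = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, cv))
        start = stop
    return folds


folds = k_fold_indices(7, 3)
# cv parts are [0, 1, 2], [3, 4], [5, 6]; each train part is the complement.
```

Each fold yields two datasets, which is consistent with the directories argument expecting 2*k paths.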

split(dataset, frac_train=None, frac_valid=None, frac_test=None, log_every_n=None, verbose=False)[source]

Stub to be filled in by child classes.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)[source]

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)[source]

Splits self into train/validation/test sets.

Returns Dataset objects.

class deepchem.splits.splitters.TimeSplitterPDBbind(ids, year_file=None, verbose=False)[source]

Bases: deepchem.splits.splitters.Splitter

k_fold_split(dataset, k, directories=None, **kwargs)
Parameters:
  • dataset (Dataset) – Dataset to do a k-fold split.
  • k (int) – Number of folds.
  • directories (list of str) – List of length 2*k filepaths to save the result disk-datasets.
  • kwargs
Returns:

List of k (train, cv) tuples.

Return type:

list

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits protein-ligand pairs in PDBbind into train/validation/test in time order.
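A time-ordered split can be sketched by sorting entries by year and cutting the sorted order, so that training data always predates validation and test data. The years below are made up for illustration, `time_split` is a hypothetical helper, and how the class reads years from year_file is not shown.

```python
def time_split(years, frac_train=0.8, frac_valid=0.1):
    """Sort indices chronologically, then cut into train/valid/test."""
    order = sorted(range(len(years)), key=lambda i: years[i])
    train_cutoff = int(frac_train * len(order))
    valid_cutoff = int((frac_train + frac_valid) * len(order))
    return (order[:train_cutoff],
            order[train_cutoff:valid_cutoff],
            order[valid_cutoff:])


# Hypothetical release years for ten protein-ligand complexes.
years = [2009, 2001, 2015, 2004, 2012, 1998, 2007, 2003, 2011, 2006]
train, valid, test = time_split(years)
# The newest entries (2012 and 2015) land in valid and test respectively.
```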

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, verbose=True)

Splits self into train/validation/test sets.

Returns Dataset objects.

deepchem.splits.splitters.generate_scaffold(smiles, include_chirality=False)[source]

Compute the Bemis-Murcko scaffold for a SMILES string.

deepchem.splits.splitters.randomize_arrays(array_list)[source]

deepchem.splits.task_splitter module

Contains an abstract base class that supports chemically aware data splits.

class deepchem.splits.task_splitter.TaskSplitter[source]

Bases: deepchem.splits.splitters.Splitter

Provides a simple interface for splitting datasets task-wise.

For some learning problems, the training and test datasets should have different tasks entirely. This is a different paradigm from the usual Splitter, which ensures that split datasets have different datapoints, not different tasks.

k_fold_split(dataset, K)[source]

Performs a K-fold split of the tasks for dataset.

If split is uneven, spillover goes to last fold.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to be split
  • K (int) – Number of splits to be made
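The spillover rule for an uneven task-wise K-fold split can be sketched on task indices. This is a minimal illustration assuming contiguous blocks of tasks per fold; `task_k_fold` is a hypothetical helper.

```python
def task_k_fold(n_tasks, K):
    """Partition task indices into K contiguous folds; leftovers go to the last."""
    per_fold = n_tasks // K
    folds = []
    for fold in range(K):
        start = fold * per_fold
        # The last fold absorbs any tasks left over by integer division.
        stop = n_tasks if fold == K - 1 else start + per_fold
        folds.append(list(range(start, stop)))
    return folds


folds = task_k_fold(10, 3)
# [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```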
split(dataset, frac_train=None, frac_valid=None, frac_test=None, log_every_n=None, verbose=False)

Stub to be filled in by child classes.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, verbose=True)

Splits self into train/test sets. Returns Dataset objects.

train_valid_test_split(dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1)[source]

Performs a train/valid/test split of the tasks for dataset.

If split is uneven, spillover goes to test.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to be split
  • frac_train (float, optional) – Proportion of tasks to be put into train. Rounded to nearest int.
  • frac_valid (float, optional) – Proportion of tasks to be put into valid. Rounded to nearest int.
  • frac_test (float, optional) – Proportion of tasks to be put into test. Rounded to nearest int.
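The rounding and spillover behavior described above can be sketched as follows. The helper `task_train_valid_test` is hypothetical and works on task indices only. Note that Python's `round` rounds exact halves to the nearest even integer, which may differ from the library's rounding in edge cases.

```python
def task_train_valid_test(n_tasks, frac_train=0.8, frac_valid=0.1):
    """Split task indices; train and valid counts are rounded, test gets the rest."""
    n_train = int(round(frac_train * n_tasks))
    n_valid = int(round(frac_valid * n_tasks))
    tasks = list(range(n_tasks))
    train = tasks[:n_train]
    valid = tasks[n_train:n_train + n_valid]
    # Spillover: every remaining task goes to the test set.
    test = tasks[n_train + n_valid:]
    return train, valid, test


train_tasks, valid_tasks, test_tasks = task_train_valid_test(12)
# 0.8 * 12 = 9.6 rounds to 10 train tasks, 0.1 * 12 = 1.2 rounds to 1 valid task,
# and the single remaining task goes to test.
```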
deepchem.splits.task_splitter.merge_fold_datasets(fold_datasets)[source]

Merges fold datasets together.

Assumes that fold_datasets were outputted from k_fold_split. Specifically, assumes that each dataset contains the same datapoints, listed in the same ordering.

Module contents

Gathers all splitters in one place for convenient imports.