deepchem.molnet package

Subpackages

Submodules

deepchem.molnet.check_availability module

deepchem.molnet.dnasim module

deepchem.molnet.dnasim.get_distribution(GC_fraction)[source]
deepchem.molnet.dnasim.motif_density(motif_name, seq_length, num_seqs, min_counts, max_counts, GC_fraction, central_bp=None)[source]

Returns sequences with motif density, along with embeddings array.

deepchem.molnet.dnasim.simple_motif_embedding(motif_name, seq_length, num_seqs, GC_fraction)[source]

Simulates sequences with a motif embedded anywhere in the sequence.

Parameters:
  • motif_name (str) – ENCODE motif name
  • seq_length (int) – length of sequence
  • num_seqs (int) – number of sequences
  • GC_fraction (float) – GC basepair fraction in background sequence
Returns:

  • sequence_arr (1darray) – Array with sequence strings.
  • embedding_arr (1darray) – Array of embedding objects.
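
As a rough illustration of what "a motif embedded anywhere in the sequence" means, the sketch below draws a background sequence with a given GC fraction and overwrites a random window with a motif. It is not the dnasim implementation: the helper names and the fixed motif string are hypothetical, and the sketch does not use any motif database.

```python
import random


def sample_background(seq_length, gc_fraction, rng):
    """Draw a sequence in which G and C together have probability gc_fraction."""
    at = (1.0 - gc_fraction) / 2.0
    gc = gc_fraction / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=seq_length))


def embed_motif(background, motif, rng):
    """Overwrite a uniformly chosen window of the background with the motif."""
    start = rng.randrange(len(background) - len(motif) + 1)
    embedded = background[:start] + motif + background[start + len(motif):]
    return embedded, start


rng = random.Random(0)
motif = "TGACTCA"  # hypothetical motif instance, chosen only for illustration

sequences, positions = [], []
for _ in range(5):
    seq, pos = embed_motif(sample_background(50, 0.4, rng), motif, rng)
    sequences.append(seq)
    positions.append(pos)
```

Each entry of positions plays the role of a very reduced "embedding object": it records where the motif was placed in the matching sequence.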

deepchem.molnet.dnasim.simulate_differential_accessibility(pos_motif_names, neg_motif_names, seq_length, min_num_motifs, max_num_motifs, num_pos, num_neg, GC_fraction)[source]

Generates data for differential accessibility task.

Parameters:
  • pos_motif_names (list) – List of strings.
  • neg_motif_names (list) – List of strings.
  • seq_length (int) –
  • min_num_motifs (int) –
  • max_num_motifs (int) –
  • num_pos (int) –
  • num_neg (int) –
  • GC_fraction (float) –
Returns:

  • sequence_arr (1darray) – Contains sequence strings.
  • y (1darray) – Contains labels.
  • embedding_arr (1darray) – Array of embedding objects.

deepchem.molnet.dnasim.simulate_heterodimer_grammar(motif1, motif2, seq_length, min_spacing, max_spacing, num_pos, num_neg, GC_fraction)[source]
Simulates two classes of sequences with motif1 and motif2:
  • Positive class sequences with motif1 and motif2 positioned between min_spacing and max_spacing apart
  • Negative class sequences with independent motif1 and motif2 positioned anywhere in the sequence, not as a heterodimer grammar

Parameters:
  • seq_length (int) – length of sequence
  • GC_fraction (float) – GC fraction in background sequence
  • num_pos (int) – number of positive class sequences
  • num_neg (int) – number of negative class sequences
  • motif1 (str) – ENCODE motif name
  • motif2 (str) – ENCODE motif name
  • min_spacing (int) – minimum inter-motif spacing
  • max_spacing (int) – maximum inter-motif spacing
Returns:

  • sequence_arr (1darray) – Array with sequence strings.
  • y (1darray) – Array with positive/negative class labels.
  • embedding_arr (list) – List of embedding objects.
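
The spacing constraint of the positive class can be sketched as follows. This is a simplified illustration, not the library code: place_pair is a hypothetical helper, and it assumes the spacing is the gap between the end of motif1 and the start of motif2, drawn uniformly from min_spacing to max_spacing inclusive.

```python
import random


def place_pair(seq_length, len1, len2, min_spacing, max_spacing, rng):
    """Return start positions of two motifs separated by a gap in [min_spacing, max_spacing]."""
    gap = rng.randint(min_spacing, max_spacing)
    span = len1 + gap + len2
    if span > seq_length:
        raise ValueError("motif pair does not fit in the sequence")
    start1 = rng.randrange(seq_length - span + 1)
    return start1, start1 + len1 + gap


rng = random.Random(1)
len1, len2 = 8, 10  # hypothetical motif lengths
pairs = [place_pair(100, len1, len2, 2, 5, rng) for _ in range(20)]
gaps = [start2 - (start1 + len1) for start1, start2 in pairs]
```

For the negative class, the two start positions would instead be drawn independently of each other.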

deepchem.molnet.dnasim.simulate_motif_counting(motif_name, seq_length, pos_counts, neg_counts, num_pos, num_neg, GC_fraction)[source]

Generates data for motif counting task.

Parameters:
  • motif_name (str) –
  • seq_length (int) –
  • pos_counts (list) – (min_counts, max_counts) for positive set.
  • neg_counts (list) – (min_counts, max_counts) for negative set.
  • num_pos (int) –
  • num_neg (int) –
  • GC_fraction (float) –
Returns:

  • sequence_arr (1darray) – Contains sequence strings.
  • y (1darray) – Contains labels.
  • embedding_arr (1darray) – Array of embedding objects.

deepchem.molnet.dnasim.simulate_motif_density_localization(motif_name, seq_length, center_size, min_motif_counts, max_motif_counts, num_pos, num_neg, GC_fraction)[source]
Simulates two classes of sequences:
  • Positive class sequences with multiple motif instances in center of the sequence.
  • Negative class sequences with multiple motif instances anywhere in the sequence.

The number of motif instances is uniformly sampled between minimum and maximum motif counts.

Parameters:
  • motif_name (str) – ENCODE motif name
  • seq_length (int) – length of sequence
  • center_size (int) – length of central part of the sequence where motifs can be positioned
  • min_motif_counts (int) – minimum number of motif instances
  • max_motif_counts (int) – maximum number of motif instances
  • num_pos (int) – number of positive class sequences
  • num_neg (int) – number of negative class sequences
  • GC_fraction (float) – GC fraction in background sequence
Returns:

  • sequence_arr (1darray) – Contains sequence strings.
  • y (1darray) – Contains labels.
  • embedding_arr (1darray) – Array of embedding objects.
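
The difference between the two classes comes down to where motif start positions may fall. The sketch below assumes the central region is a window of center_size positions centered in the sequence, which this page does not state explicitly. The helper names are hypothetical, and overlap between motif instances is ignored.

```python
import random


def central_window(seq_length, center_size):
    """Return [start, end) bounds of a window of center_size centered in the sequence."""
    start = (seq_length - center_size) // 2
    return start, start + center_size


def sample_motif_starts(motif_len, min_count, max_count, lo, hi, rng):
    """Pick a uniform motif count, then a start in [lo, hi - motif_len] per instance."""
    count = rng.randint(min_count, max_count)
    return [rng.randrange(lo, hi - motif_len + 1) for _ in range(count)]


rng = random.Random(2)
seq_length, center_size, motif_len = 1000, 150, 10
lo, hi = central_window(seq_length, center_size)

# Positive class: motifs restricted to the central window.
positive_starts = sample_motif_starts(motif_len, 2, 4, lo, hi, rng)
# Negative class: motifs allowed anywhere in the sequence.
negative_starts = sample_motif_starts(motif_len, 2, 4, 0, seq_length, rng)
```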

deepchem.molnet.dnasim.simulate_multi_motif_embedding(motif_names, seq_length, min_num_motifs, max_num_motifs, num_seqs, GC_fraction)[source]

Generates data for multi motif recognition task.

Parameters:
  • motif_names (list) – List of strings.
  • seq_length (int) –
  • min_num_motifs (int) –
  • max_num_motifs (int) –
  • num_seqs (int) –
  • GC_fraction (float) –
Returns:

  • sequence_arr (1darray) – Contains sequence strings.
  • y (ndarray) – Contains labels for each motif.
  • embedding_arr (1darray) – Array of embedding objects.
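
One way to picture the label array y is as one row per sequence and one column per entry of motif_names. The sketch below builds such a matrix under the assumption that each sequence receives between min_num_motifs and max_num_motifs distinct motifs. The library may sample differently, and the motif names here are placeholders.

```python
import random


def multi_label_rows(motif_names, min_num_motifs, max_num_motifs, num_seqs, rng):
    """Return one boolean row per sequence marking which motifs were chosen."""
    rows = []
    for _ in range(num_seqs):
        k = rng.randint(min_num_motifs, max_num_motifs)
        chosen = set(rng.sample(motif_names, k))
        rows.append([name in chosen for name in motif_names])
    return rows


rng = random.Random(3)
motif_names = ["motif_a", "motif_b", "motif_c"]  # placeholder names
y = multi_label_rows(motif_names, 1, 2, 6, rng)
```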

deepchem.molnet.dnasim.simulate_single_motif_detection(motif_name, seq_length, num_pos, num_neg, GC_fraction)[source]
Simulates two classes of sequences:
  • Positive class sequence with a motif embedded anywhere in the sequence
  • Negative class sequence without the motif
Parameters:
  • motif_name (str) – ENCODE motif name
  • seq_length (int) – length of sequence
  • num_pos (int) – number of positive class sequences
  • num_neg (int) – number of negative class sequences
  • GC_fraction (float) – GC fraction in background sequence
Returns:

  • sequence_arr (1darray) – Array with sequence strings.
  • y (1darray) – Array with positive/negative class labels.
  • embedding_arr (1darray) – Array of embedding objects.

deepchem.molnet.preset_hyper_parameters module

Created on Tue Mar 7 00:07:10 2017

@author: zqwu

deepchem.molnet.run_benchmark module

Created on Mon Mar 06 14:25:40 2017

@author: Zhenqin Wu

deepchem.molnet.run_benchmark.benchmark_model(model, all_dataset, transformers, metric, test=False)[source]

Benchmark custom model.

Parameters:
  • model (user-defined model structure) – a user-defined model, which should include the functions fit and evaluate
  • all_dataset (tuple) – (train, test, val) data tuple, as returned by the load_dataset function
  • transformers –
  • metric (string) – choice of evaluation metrics
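
A user-defined model only needs to expose fit and evaluate. The sketch below is a minimal stand-in showing that shape. The call signatures (fit taking a dataset; evaluate taking a dataset, metrics and transformers) are assumptions not confirmed by this page, and ToyDataset, MeanBaseline and mean_absolute_error are hypothetical names rather than DeepChem classes.

```python
from collections import namedtuple

# Hypothetical stand-in for a dataset; not DeepChem's dataset class.
ToyDataset = namedtuple("ToyDataset", ["X", "y"])


class MeanBaseline:
    """Toy regression model that always predicts the training mean."""

    def __init__(self):
        self.mean_ = None

    def fit(self, dataset):
        self.mean_ = sum(dataset.y) / len(dataset.y)

    def predict(self, dataset):
        return [self.mean_ for _ in dataset.X]

    def evaluate(self, dataset, metrics, transformers=()):
        # transformers are accepted for interface compatibility but ignored here
        preds = self.predict(dataset)
        return {name: fn(dataset.y, preds) for name, fn in metrics.items()}


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


train = ToyDataset(X=[[0.0], [1.0], [2.0]], y=[1.0, 2.0, 3.0])
valid = ToyDataset(X=[[3.0]], y=[4.0])

model = MeanBaseline()
model.fit(train)
scores = model.evaluate(valid, {"mae": mean_absolute_error})
```
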
deepchem.molnet.run_benchmark.load_dataset(dataset, featurizer, split='random')[source]

Load specific dataset for benchmark.

Parameters:
  • dataset (string) – choice of which datasets to use, should be: tox21, muv, sider, toxcast, pcba, delaney, kaggle, nci, clintox, hiv, pcba_128, pcba_146, pdbbind, chembl, qm7, qm7b, qm9, sampl
  • featurizer (string or dc.feat.Featurizer) – choice of featurization.
  • split (string, optional (default='random')) – choice of splitter function, None = using the default splitter
deepchem.molnet.run_benchmark.run_benchmark(datasets, model, split=None, metric=None, direction=True, featurizer=None, n_features=0, out_path='.', hyper_parameters=None, hyper_param_search=False, max_iter=20, search_range=2, test=False, reload=True, seed=123)[source]

Run benchmark test on designated datasets with a deepchem (or user-defined) model

Parameters:
  • datasets (list of string) – choice of which datasets to use, should be: bace_c, bace_r, bbbp, chembl, clearance, clintox, delaney, hiv, hopv, kaggle, lipo, muv, nci, pcba, pdbbind, ppb, qm7, qm7b, qm8, qm9, sampl, sider, tox21, toxcast
  • model (string or user-defined model structure) – choice of which model to use; deepchem provides implementations of logistic regression, random forest, multitask network, bypass multitask network, irv and graph convolution. A user-defined model should include the functions fit and evaluate
  • split (string, optional (default=None)) – choice of splitter function, None = using the default splitter
  • metric (string, optional (default=None)) – choice of evaluation metrics, None = using the default metrics (AUC & R2)
  • direction (bool, optional (default=True)) – optimization direction when doing hyperparameter search: maximization (True) or minimization (False)
  • featurizer (string or dc.feat.Featurizer, optional (default=None)) – choice of featurization, None = using the default corresponding to model (string only applicable to deepchem models)
  • n_features (int, optional (default=0)) – number of features, which depends on the featurizer; it is redefined when using deepchem featurizers and needs to be specified for user-defined featurizers (if using deepchem models)
  • out_path (string, optional(default='.')) – path of result file
  • hyper_parameters (dict, optional (default=None)) – hyper parameters for designated model, None = use preset values
  • hyper_param_search (bool, optional(default=False)) – whether to perform hyper parameter search, using gaussian process by default
  • max_iter (int, optional(default=20)) – number of optimization trials
  • search_range (int or float, optional (default=2)) – optimization on [initial values / search_range, initial values * search_range]
  • test (boolean, optional(default=False)) – whether to evaluate on test set
  • reload (boolean, optional(default=True)) – whether to save and reload featurized datasets
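
The interval described for search_range is simple arithmetic on the initial hyperparameter values. As a sketch, with a hypothetical helper and made-up hyperparameter names and values (not DeepChem presets):

```python
def search_bounds(initial_values, search_range):
    """Map each numeric hyperparameter to (value / search_range, value * search_range)."""
    return {
        name: (value / search_range, value * search_range)
        for name, value in initial_values.items()
    }


# Hypothetical initial values, chosen only for illustration.
initial = {"dropout": 0.25, "batch_size": 64}
bounds = search_bounds(initial, search_range=2)
```

With search_range=2, a dropout of 0.25 is searched over [0.125, 0.5]. A larger search_range widens the interval multiplicatively around each initial value.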

deepchem.molnet.run_benchmark_low_data module

Created on Mon Mar 06 14:25:40 2017

@author: Zhenqin Wu

deepchem.molnet.run_benchmark_models module

Created on Mon Mar 6 23:41:26 2017

@author: zqwu

deepchem.molnet.run_benchmark_models.benchmark_classification(train_dataset, valid_dataset, test_dataset, tasks, transformers, n_features, metric, model, test=False, hyper_parameters=None, seed=123)[source]

Calculate performance of different models on the specific dataset & tasks

Parameters:
  • train_dataset (dataset struct) – dataset used for model training and evaluation
  • valid_dataset (dataset struct) – dataset only used for model evaluation (and hyperparameter tuning)
  • test_dataset (dataset struct) – dataset only used for model evaluation
  • tasks (list of string) – list of targets (tasks, datasets)
  • transformers (dc.trans.Transformer struct) – transformer used for model evaluation
  • n_features (integer) – number of features, or length of binary fingerprints
  • metric (list of dc.metrics.Metric objects) – metrics used for evaluation
  • model (string, optional) – choice of model ‘rf’, ‘tf’, ‘tf_robust’, ‘logreg’, ‘irv’, ‘graphconv’, ‘dag’, ‘xgb’, ‘weave’, ‘kernelsvm’, ‘textcnn’, ‘mpnn’
  • test (boolean, optional) – whether to calculate test_set performance
  • hyper_parameters (dict, optional (default=None)) – hyper parameters for designated model, None = use preset values
Returns:

  • train_scores (dict) – prediction results (AUC) on the training set
  • valid_scores (dict) – prediction results (AUC) on the validation set
  • test_scores (dict) – prediction results (AUC) on the test set

deepchem.molnet.run_benchmark_models.benchmark_regression(train_dataset, valid_dataset, test_dataset, tasks, transformers, n_features, metric, model, test=False, hyper_parameters=None, seed=123)[source]

Calculate performance of different models on the specific dataset & tasks

Parameters:
  • train_dataset (dataset struct) – dataset used for model training and evaluation
  • valid_dataset (dataset struct) – dataset only used for model evaluation (and hyperparameter tuning)
  • test_dataset (dataset struct) – dataset only used for model evaluation
  • tasks (list of string) – list of targets (tasks, datasets)
  • transformers (dc.trans.Transformer struct) – transformer used for model evaluation
  • n_features (integer) – number of features, or length of binary fingerprints
  • metric (list of dc.metrics.Metric objects) – metrics used for evaluation
  • model (string, optional) – choice of model ‘tf_regression’, ‘tf_regression_ft’, ‘rf_regression’, ‘graphconvreg’, ‘dtnn’, ‘dag_regression’, ‘xgb_regression’, ‘weave_regression’, ‘textcnn_regression’, ‘krr’, ‘ani’, ‘krr_ft’, ‘mpnn’
  • test (boolean, optional) – whether to calculate test_set performance
  • hyper_parameters (dict, optional (default=None)) – hyper parameters for designated model, None = use preset values
Returns:

  • train_scores (dict) – prediction results (R2) on the training set
  • valid_scores (dict) – prediction results (R2) on the validation set
  • test_scores (dict) – prediction results (R2) on the test set

Module contents