deepchem.rl package

Submodules

deepchem.rl.a3c module

Asynchronous Advantage Actor-Critic (A3C) algorithm for reinforcement learning.

class deepchem.rl.a3c.A3C(env, policy, max_rollout_length=20, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)[source]

Bases: object

Implements the Asynchronous Advantage Actor-Critic (A3C) algorithm for reinforcement learning.

The algorithm is described in Mnih et al., “Asynchronous Methods for Deep Reinforcement Learning” (https://arxiv.org/abs/1602.01783). This class supports environments with both discrete and continuous action spaces. For discrete action spaces, the “action” argument passed to the environment is an integer giving the index of the action to perform. The policy must output a vector called “action_prob” giving the probability of taking each action. For continuous action spaces, the action is an array where each element is chosen independently from a normal distribution. The policy must output two arrays of the same shape: “action_mean” gives the mean value for each element, and “action_std” gives the standard deviation for each element. In either case, the policy must also output a scalar called “value” which is an estimate of the value function for the current state.
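
For concreteness, here is a minimal sketch (not taken from the DeepChem source) of a Policy whose create_layers() returns the “action_prob” and “value” outputs A3C expects for a discrete action space. The Dense, Flatten and SoftMax layer names and arguments are assumptions about deepchem.models.tensorgraph.layers, and the hidden width and four output actions are arbitrary illustrative choices.

import tensorflow as tf
import deepchem as dc
from deepchem.models.tensorgraph.layers import Dense, Flatten, SoftMax

class DiscretePolicy(dc.rl.Policy):

    def create_layers(self, state, **kwargs):
        # 'state' is a list of Feature layers, one per array in the environment state.
        flat = Flatten(in_layers=state)
        hidden = Dense(out_channels=64, in_layers=[flat], activation_fn=tf.nn.relu)
        # Probability of each of the (here, four) discrete actions.
        action_prob = SoftMax(in_layers=[Dense(out_channels=4, in_layers=[hidden])])
        # Scalar estimate of the value function for the current state.
        value = Dense(out_channels=1, in_layers=[hidden])
        return {'action_prob': action_prob, 'value': value}

For a continuous action space the returned dict would instead contain “action_mean”, “action_std” and “value”, as described above.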

The algorithm optimizes all outputs at once using a loss that is the sum of three terms:

  1. The policy loss, which seeks to maximize the discounted reward for each action.
  2. The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.
  3. An entropy term to encourage exploration.

This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergence. Use the advantage_lambda parameter to adjust the tradeoff.
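
As a reference for what advantage_lambda controls, the generalized advantage estimate from the cited paper can be computed with the following standalone NumPy sketch; this is illustrative and not the code this class runs internally.

import numpy as np

def gae_advantages(rewards, values, discount_factor=0.99, advantage_lambda=0.98):
    # 'rewards' holds one reward per step of a rollout; 'values' holds the value
    # estimate for each state plus one extra entry for the state after the last step.
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + discount_factor * values[t + 1] - values[t]
        # Exponentially weighted sum of TD errors; advantage_lambda trades bias for variance.
        running = delta + discount_factor * advantage_lambda * running
        advantages[t] = running
    return advantages

With advantage_lambda=0 this reduces to the one-step TD error; with advantage_lambda=1 it becomes the full discounted return minus the value estimate.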

This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.

To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:

def apply_hindsight(self, states, actions, goal):
    ...
    return new_states, rewards

The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states.
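
Below is a hedged sketch of how such an environment might implement apply_hindsight() for a goal-reaching task whose state is the concatenation of an observation and a goal vector. The obs_size attribute, the state layout, and the 0/1 reward rule are illustrative assumptions, not part of DeepChem.

import numpy as np
import deepchem as dc

class GoalEnvironment(dc.rl.Environment):
    # Hypothetical environment whose state is np.concatenate([observation, goal]);
    # self.obs_size is assumed to be set in __init__ (not shown), along with reset()/step().

    def apply_hindsight(self, states, actions, goal):
        new_goal = goal[self.obs_size:]  # the goal portion of the substituted goal state
        new_states = []
        rewards = []
        for state in states:
            observation = state[:self.obs_size]
            # Same observation as before, but relabeled with the new goal.
            new_states.append(np.concatenate([observation, new_goal]))
            # Illustrative reward rule: 1 when the relabeled goal has been reached.
            reached = np.allclose(observation[:len(new_goal)], new_goal)
            rewards.append(1.0 if reached else 0.0)
        return new_states, rewards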

fit(total_steps, max_checkpoints_to_keep=5, checkpoint_interval=600, restore=False)[source]

Train the policy.

Parameters:
  • total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads
  • max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted.
  • checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
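
A hypothetical end-to-end training call, assuming the gym package provides the 'CartPole-v0' environment and DiscretePolicy is a Policy subclass like the sketch shown earlier; the step counts are arbitrary.

import deepchem as dc

env = dc.rl.GymEnvironment('CartPole-v0')
policy = DiscretePolicy()
a3c = dc.rl.a3c.A3C(env, policy, max_rollout_length=20, discount_factor=0.99)
a3c.fit(total_steps=100000, checkpoint_interval=600)
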
predict(state, use_saved_states=True, save_states=True)[source]

Compute the policy’s output predictions for a state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array) – the state of the environment for which to generate predictions
  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to all zeros before computing the predictions.
  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.
Returns:
  the array of action probabilities, and the estimated value function

restore()[source]

Reload the model parameters from the most recent checkpoint file.

select_action(state, deterministic=False, use_saved_states=True, save_states=True)[source]

Select an action to perform based on the environment’s state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array) – the state of the environment for which to select an action
  • deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities.
  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to all zeros before computing the predictions.
  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.
Returns:
  the index of the selected action
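
A brief sketch of running one greedy episode with a trained A3C instance, reusing the env and a3c objects from the earlier example and only the methods documented on this page.

env.reset()
total_reward = 0.0
while not env.terminated:
    # Pick the highest-probability action and apply it to the environment.
    action = a3c.select_action(env.state, deterministic=True)
    total_reward += env.step(action)
print(total_reward)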

class deepchem.rl.a3c.A3CLossContinuous(value_weight, entropy_weight, **kwargs)[source]

Bases: deepchem.models.tensorgraph.layers.Layer

This layer computes the loss function for A3C with continuous action spaces.

add_summary_to_tg()

Can only be called after self.create_layer to guarantee that name is not None.

clone(in_layers)

Create a copy of this layer with different inputs.

copy(replacements={}, variables_graph=None, shared=False)

Duplicate this Layer and all its inputs.

This is similar to clone(), but instead of only cloning one layer, it also recursively calls copy() on all of this layer’s inputs to clone the entire hierarchy of layers. In the process, you can optionally tell it to replace particular layers with specific existing ones. For example, you can clone a stack of layers, while connecting the topmost ones to different inputs.

For example, consider a stack of dense layers that depend on an input:

>>> input = Feature(shape=(None, 100))
>>> dense1 = Dense(100, in_layers=input)
>>> dense2 = Dense(100, in_layers=dense1)
>>> dense3 = Dense(100, in_layers=dense2)

The following will clone all three dense layers, but not the input layer. Instead, the input to the first dense layer will be a different layer specified in the replacements map.

>>> new_input = Feature(shape=(None, 100))
>>> replacements = {input: new_input}
>>> dense3_copy = dense3.copy(replacements)
Parameters:
  • replacements (map) – specifies existing layers, and the layers to replace them with (instead of cloning them). This argument serves two purposes. First, you can pass in a list of replacements to control which layers get cloned. In addition, as each layer is cloned, it is added to this map. On exit, it therefore contains a complete record of all layers that were copied, and a reference to the copy of each one.
  • variables_graph (TensorGraph) – an optional TensorGraph from which to take variables. If this is specified, the current value of each variable in each layer is recorded, and the copy has that value specified as its initial value. This allows a piece of a pre-trained model to be copied to another model.
  • shared (bool) – if True, create new layers by calling shared() on the input layers. This means the newly created layers will share variables with the original ones.
create_tensor(**kwargs)[source]
layer_number_dict = {}
none_tensors()
set_summary(summary_op, summary_description=None, collections=None)

Annotates a tensor with a tf.summary operation. It collects data from self.out_tensor by default, but this can be changed by setting self.tb_input to another tensor in create_tensor.

Parameters:
  • summary_op (str) – summary operation to annotate node
  • summary_description (object, optional) – Optional summary_pb2.SummaryDescription()
  • collections (list of graph collections keys, optional) – New summary op is added to these collections. Defaults to [GraphKeys.SUMMARIES]
set_tensors(tensor)
set_variable_initial_values(values)

Set the initial values of all variables.

This takes a list containing the initial values to use for all of this layer’s variables (in the same order returned by TensorGraph.get_layer_variables()). When this layer is used in a TensorGraph, it will automatically initialize each variable to the value specified in the list. Note that some layers also have separate mechanisms for specifying variable initializers; this method overrides them. The purpose of this method is to let a Layer object represent a pre-trained layer, complete with trained values for its variables.

shape

Get the shape of this Layer’s output.

shared(in_layers)

Create a copy of this layer that shares variables with it.

This is similar to clone(), but where clone() creates two independent layers, this causes the layers to share variables with each other.

Parameters:
  • in_layers (list) – the input tensors for the shared layer
Return type:
  Layer

class deepchem.rl.a3c.A3CLossDiscrete(value_weight, entropy_weight, **kwargs)[source]

Bases: deepchem.models.tensorgraph.layers.Layer

This layer computes the loss function for A3C with discrete action spaces.

add_summary_to_tg()

Can only be called after self.create_layer to guarantee that name is not None.

clone(in_layers)

Create a copy of this layer with different inputs.

copy(replacements={}, variables_graph=None, shared=False)

Duplicate this Layer and all its inputs.

This is similar to clone(), but instead of only cloning one layer, it also recursively calls copy() on all of this layer’s inputs to clone the entire hierarchy of layers. In the process, you can optionally tell it to replace particular layers with specific existing ones. For example, you can clone a stack of layers, while connecting the topmost ones to different inputs.

For example, consider a stack of dense layers that depend on an input:

>>> input = Feature(shape=(None, 100))
>>> dense1 = Dense(100, in_layers=input)
>>> dense2 = Dense(100, in_layers=dense1)
>>> dense3 = Dense(100, in_layers=dense2)

The following will clone all three dense layers, but not the input layer. Instead, the input to the first dense layer will be a different layer specified in the replacements map.

>>> new_input = Feature(shape=(None, 100))
>>> replacements = {input: new_input}
>>> dense3_copy = dense3.copy(replacements)
Parameters:
  • replacements (map) – specifies existing layers, and the layers to replace them with (instead of cloning them). This argument serves two purposes. First, you can pass in a list of replacements to control which layers get cloned. In addition, as each layer is cloned, it is added to this map. On exit, it therefore contains a complete record of all layers that were copied, and a reference to the copy of each one.
  • variables_graph (TensorGraph) – an optional TensorGraph from which to take variables. If this is specified, the current value of each variable in each layer is recorded, and the copy has that value specified as its initial value. This allows a piece of a pre-trained model to be copied to another model.
  • shared (bool) – if True, create new layers by calling shared() on the input layers. This means the newly created layers will share variables with the original ones.
create_tensor(**kwargs)[source]
layer_number_dict = {}
none_tensors()
set_summary(summary_op, summary_description=None, collections=None)

Annotates a tensor with a tf.summary operation. It collects data from self.out_tensor by default, but this can be changed by setting self.tb_input to another tensor in create_tensor.

Parameters:
  • summary_op (str) – summary operation to annotate node
  • summary_description (object, optional) – Optional summary_pb2.SummaryDescription()
  • collections (list of graph collections keys, optional) – New summary op is added to these collections. Defaults to [GraphKeys.SUMMARIES]
set_tensors(tensor)
set_variable_initial_values(values)

Set the initial values of all variables.

This takes a list containing the initial values to use for all of this layer’s variables (in the same order returned by TensorGraph.get_layer_variables()). When this layer is used in a TensorGraph, it will automatically initialize each variable to the value specified in the list. Note that some layers also have separate mechanisms for specifying variable initializers; this method overrides them. The purpose of this method is to let a Layer object represent a pre-trained layer, complete with trained values for its variables.

shape

Get the shape of this Layer’s output.

shared(in_layers)

Create a copy of this layer that shares variables with it.

This is similar to clone(), but where clone() creates two independent layers, this causes the layers to share variables with each other.

Parameters:
  • in_layers (list) – the input tensors for the shared layer
Return type:
  Layer

deepchem.rl.mcts module

Monte Carlo tree search algorithm for reinforcement learning.

class deepchem.rl.mcts.MCTS(env, policy, max_search_depth=100, n_search_episodes=1000, discount_factor=0.99, value_weight=1.0, optimizer=<deepchem.models.tensorgraph.optimizers.Adam object>, model_dir=None)[source]

Bases: object

Implements a Monte Carlo tree search algorithm for reinforcement learning.

This is adapted from Silver et al., “Mastering the game of Go without human knowledge” (https://www.nature.com/articles/nature24270). The methods described in that paper rely on features of Go that are not generally true of all reinforcement learning problems. To transform it into a more generally useful RL algorithm, it has been necessary to change some aspects of the method. The overall approach used in this implementation is still the same, although some of the details differ.

This class requires the policy to output two quantities: a vector giving the probability of taking each action, and an estimate of the value function for the current state. At every step of simulating an episode, it performs an expensive tree search to explore the consequences of many possible actions. Based on that search, it computes much better estimates for the value function of the current state and the desired action probabilities. It then tries to optimize the policy to make its outputs match the result of the tree search.

Optimization proceeds through a series of iterations. Each iteration consists of two stages:

  1. Simulate many episodes. At every step perform a tree search to determine targets for the probabilities and value function, and store them into a buffer.
  2. Optimize the policy using batches drawn from the buffer generated in step 1.

The tree search involves repeatedly selecting actions starting from the current state. This is done by using deepcopy() to clone the environment. It is essential that this produce a deterministic sequence of states: performing an action on the cloned environment must always lead to the same state as performing that action on the original environment. For environments whose state transitions are deterministic, this is not a problem. For ones whose state transitions are stochastic, it is essential that the random number generator used to select new states be stored as part of the environment and be properly cloned by deepcopy().
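
The following standalone snippet (generic Python, not DeepChem-specific) illustrates the pattern described above: storing the generator on the environment object means deepcopy() clones its internal state, so the copy reproduces exactly the same “random” draws.

import copy
import numpy as np

class StochasticStateHolder(object):
    # Stand-in for an Environment that keeps its own random number generator.
    def __init__(self):
        self.rng = np.random.RandomState(0)

original = StochasticStateHolder()
clone = copy.deepcopy(original)
# Both generators now produce identical sequences, so cloned rollouts are deterministic.
assert original.rng.normal() == clone.rng.normal()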

This class does not support policies that include recurrent layers.

fit(iterations, steps_per_iteration=10000, epochs_per_iteration=10, temperature=0.5, puct_scale=None, max_checkpoints_to_keep=5, checkpoint_interval=600, restore=False)[source]

Train the policy.

Parameters:
  • iterations (int) – the total number of iterations (simulation followed by optimization) to perform
  • steps_per_iteration (int) – the total number of steps to simulate in each iteration. Every step consists of a tree search, followed by selecting an action based on the results of the search.
  • epochs_per_iteration (int) – the number of epochs of optimization to perform for each iteration. Each epoch involves randomly ordering all the steps that were just simulated in the current iteration, splitting them into batches, and looping over the batches.
  • temperature (float) – the temperature factor to use when selecting a move for each step of simulation. Larger values produce a broader probability distribution and hence more exploration. Smaller values produce a stronger preference for whatever action did best in the tree search.
  • puct_scale (float) – the scale of the PUCT term in the expression for selecting actions during tree search. This should be roughly similar in magnitude to the rewards given by the environment, since the PUCT term is added to the mean discounted reward. This may be None, in which case a value is adaptively selected that tries to match the mean absolute value of the discounted reward.
  • max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted.
  • checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
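
A hypothetical usage sketch, reusing an Environment and a Policy such as those sketched elsewhere on this page (the policy must output “action_prob” and “value”); all parameter values are illustrative.

import deepchem as dc

mcts = dc.rl.mcts.MCTS(env, policy, max_search_depth=100, n_search_episodes=1000)
mcts.fit(iterations=10, steps_per_iteration=10000, epochs_per_iteration=10,
         temperature=0.5, puct_scale=None)
action = mcts.select_action(env.state, deterministic=True)
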
predict(state)[source]

Compute the policy’s output predictions for a state.

Parameters:
  • state (array) – the state of the environment for which to generate predictions
Returns:
  the array of action probabilities, and the estimated value function
restore()[source]

Reload the model parameters from the most recent checkpoint file.

select_action(state, deterministic=False)[source]

Select an action to perform based on the environment’s state.

Parameters:
  • state (array) – the state of the environment for which to select an action
  • deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities.
Returns:
  the index of the selected action

class deepchem.rl.mcts.MCTSLoss(value_weight, **kwargs)[source]

Bases: deepchem.models.tensorgraph.layers.Layer

This layer computes the loss function for MCTS.

add_summary_to_tg()

Can only be called after self.create_layer to guarantee that name is not None.

clone(in_layers)

Create a copy of this layer with different inputs.

copy(replacements={}, variables_graph=None, shared=False)

Duplicate this Layer and all its inputs.

This is similar to clone(), but instead of only cloning one layer, it also recursively calls copy() on all of this layer’s inputs to clone the entire hierarchy of layers. In the process, you can optionally tell it to replace particular layers with specific existing ones. For example, you can clone a stack of layers, while connecting the topmost ones to different inputs.

For example, consider a stack of dense layers that depend on an input:

>>> input = Feature(shape=(None, 100))
>>> dense1 = Dense(100, in_layers=input)
>>> dense2 = Dense(100, in_layers=dense1)
>>> dense3 = Dense(100, in_layers=dense2)

The following will clone all three dense layers, but not the input layer. Instead, the input to the first dense layer will be a different layer specified in the replacements map.

>>> new_input = Feature(shape=(None, 100))
>>> replacements = {input: new_input}
>>> dense3_copy = dense3.copy(replacements)
Parameters:
  • replacements (map) – specifies existing layers, and the layers to replace them with (instead of cloning them). This argument serves two purposes. First, you can pass in a list of replacements to control which layers get cloned. In addition, as each layer is cloned, it is added to this map. On exit, it therefore contains a complete record of all layers that were copied, and a reference to the copy of each one.
  • variables_graph (TensorGraph) – an optional TensorGraph from which to take variables. If this is specified, the current value of each variable in each layer is recorded, and the copy has that value specified as its initial value. This allows a piece of a pre-trained model to be copied to another model.
  • shared (bool) – if True, create new layers by calling shared() on the input layers. This means the newly created layers will share variables with the original ones.
create_tensor(**kwargs)[source]
layer_number_dict = {}
none_tensors()
set_summary(summary_op, summary_description=None, collections=None)

Annotates a tensor with a tf.summary operation. It collects data from self.out_tensor by default, but this can be changed by setting self.tb_input to another tensor in create_tensor.

Parameters:
  • summary_op (str) – summary operation to annotate node
  • summary_description (object, optional) – Optional summary_pb2.SummaryDescription()
  • collections (list of graph collections keys, optional) – New summary op is added to these collections. Defaults to [GraphKeys.SUMMARIES]
set_tensors(tensor)
set_variable_initial_values(values)

Set the initial values of all variables.

This takes a list containing the initial values to use for all of this layer’s variables (in the same order returned by TensorGraph.get_layer_variables()). When this layer is used in a TensorGraph, it will automatically initialize each variable to the value specified in the list. Note that some layers also have separate mechanisms for specifying variable initializers; this method overrides them. The purpose of this method is to let a Layer object represent a pre-trained layer, complete with trained values for its variables.

shape

Get the shape of this Layer’s output.

shared(in_layers)

Create a copy of this layer that shares variables with it.

This is similar to clone(), but where clone() creates two independent layers, this causes the layers to share variables with each other.

Parameters:
  • in_layers (list) – the input tensors for the shared layer
Return type:
  Layer

class deepchem.rl.mcts.TreeSearchNode(prior_prob)[source]

Bases: object

Represents a node in the Monte Carlo tree search.

deepchem.rl.ppo module

Proximal Policy Optimization (PPO) algorithm for reinforcement learning.

class deepchem.rl.ppo.PPO(env, policy, max_rollout_length=20, optimization_rollouts=8, optimization_epochs=4, batch_size=64, clipping_width=0.2, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)[source]

Bases: object

Implements the Proximal Policy Optimization (PPO) algorithm for reinforcement learning.

The algorithm is described in Schulman et al., “Proximal Policy Optimization Algorithms” (https://openai-public.s3-us-west-2.amazonaws.com/blog/2017-07/ppo/ppo-arxiv.pdf). This class requires the policy to output two quantities: a vector giving the probability of taking each action, and an estimate of the value function for the current state. It optimizes both outputs at once using a loss that is the sum of three terms:

  1. The policy loss, which seeks to maximize the discounted reward for each action.
  2. The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.
  3. An entropy term to encourage exploration.

This class only supports environments with discrete action spaces, not continuous ones. The “action” argument passed to the environment is an integer, giving the index of the action to perform.
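
The policy loss in item 1 above uses the clipped surrogate objective from the cited paper, with the clipping controlled by the clipping_width constructor argument. The following NumPy schematic is illustrative only, not the code this class executes; 'ratio' is the probability of the chosen action under the current policy divided by its probability under the policy that generated the rollout.

import numpy as np

def clipped_policy_objective(ratio, advantage, clipping_width=0.2):
    # Clip the probability ratio to [1 - clipping_width, 1 + clipping_width].
    clipped = np.clip(ratio, 1.0 - clipping_width, 1.0 + clipping_width)
    # Take the more pessimistic of the clipped and unclipped estimates, then average.
    return np.minimum(ratio * advantage, clipped * advantage).mean()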

This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergence. Use the advantage_lambda parameter to adjust the tradeoff.

This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.

To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:

def apply_hindsight(self, states, actions, goal):
    ...
    return new_states, rewards

The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states.

fit(total_steps, max_checkpoints_to_keep=5, checkpoint_interval=600, restore=False)[source]

Train the policy.

Parameters:
  • total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads
  • max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted.
  • checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
predict(state, use_saved_states=True, save_states=True)[source]

Compute the policy’s output predictions for a state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array) – the state of the environment for which to generate predictions
  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to all zeros before computing the predictions.
  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.
Returns:
  the array of action probabilities, and the estimated value function

restore()[source]

Reload the model parameters from the most recent checkpoint file.

select_action(state, deterministic=False, use_saved_states=True, save_states=True)[source]

Select an action to perform based on the environment’s state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array) – the state of the environment for which to select an action
  • deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities.
  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to all zeros before computing the predictions.
  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.
Returns:
  the index of the selected action

class deepchem.rl.ppo.PPOLoss(value_weight, entropy_weight, clipping_width, **kwargs)[source]

Bases: deepchem.models.tensorgraph.layers.Layer

This layer computes the loss function for PPO.

add_summary_to_tg()

Can only be called after self.create_layer to guarantee that name is not None.

clone(in_layers)

Create a copy of this layer with different inputs.

copy(replacements={}, variables_graph=None, shared=False)

Duplicate this Layer and all its inputs.

This is similar to clone(), but instead of only cloning one layer, it also recursively calls copy() on all of this layer’s inputs to clone the entire hierarchy of layers. In the process, you can optionally tell it to replace particular layers with specific existing ones. For example, you can clone a stack of layers, while connecting the topmost ones to different inputs.

For example, consider a stack of dense layers that depend on an input:

>>> input = Feature(shape=(None, 100))
>>> dense1 = Dense(100, in_layers=input)
>>> dense2 = Dense(100, in_layers=dense1)
>>> dense3 = Dense(100, in_layers=dense2)

The following will clone all three dense layers, but not the input layer. Instead, the input to the first dense layer will be a different layer specified in the replacements map.

>>> new_input = Feature(shape=(None, 100))
>>> replacements = {input: new_input}
>>> dense3_copy = dense3.copy(replacements)
Parameters:
  • replacements (map) – specifies existing layers, and the layers to replace them with (instead of cloning them). This argument serves two purposes. First, you can pass in a list of replacements to control which layers get cloned. In addition, as each layer is cloned, it is added to this map. On exit, it therefore contains a complete record of all layers that were copied, and a reference to the copy of each one.
  • variables_graph (TensorGraph) – an optional TensorGraph from which to take variables. If this is specified, the current value of each variable in each layer is recorded, and the copy has that value specified as its initial value. This allows a piece of a pre-trained model to be copied to another model.
  • shared (bool) – if True, create new layers by calling shared() on the input layers. This means the newly created layers will share variables with the original ones.
create_tensor(**kwargs)[source]
layer_number_dict = {}
none_tensors()
set_summary(summary_op, summary_description=None, collections=None)

Annotates a tensor with a tf.summary operation. It collects data from self.out_tensor by default, but this can be changed by setting self.tb_input to another tensor in create_tensor.

Parameters:
  • summary_op (str) – summary operation to annotate node
  • summary_description (object, optional) – Optional summary_pb2.SummaryDescription()
  • collections (list of graph collections keys, optional) – New summary op is added to these collections. Defaults to [GraphKeys.SUMMARIES]
set_tensors(tensor)
set_variable_initial_values(values)

Set the initial values of all variables.

This takes a list containing the initial values to use for all of this layer’s variables (in the same order returned by TensorGraph.get_layer_variables()). When this layer is used in a TensorGraph, it will automatically initialize each variable to the value specified in the list. Note that some layers also have separate mechanisms for specifying variable initializers; this method overrides them. The purpose of this method is to let a Layer object represent a pre-trained layer, complete with trained values for its variables.

shape

Get the shape of this Layer’s output.

shared(in_layers)

Create a copy of this layer that shares variables with it.

This is similar to clone(), but where clone() creates two independent layers, this causes the layers to share variables with each other.

Parameters:
  • in_layers (list) – the input tensors for the shared layer
Return type:
  Layer

Module contents

Interface for reinforcement learning.

class deepchem.rl.Environment(state_shape, n_actions=None, state_dtype=None, action_shape=None)[source]

Bases: object

An environment in which an actor performs actions to accomplish a task.

An environment has a current state, which is represented as either a single NumPy array, or optionally a list of NumPy arrays. When an action is taken, that causes the state to be updated. The environment also computes a reward for each action, and reports when the task has been terminated (meaning that no more actions may be taken).

Two types of actions are supported. For environments with discrete action spaces, the action is an integer specifying the index of the action to perform (out of a fixed list of possible actions). For environments with continuous action spaces, the action is a NumPy array.

Environment objects should be written to support pickle and deepcopy operations. Many algorithms involve creating multiple copies of the Environment, possibly running in different processes or even on different computers.
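
Here is a hedged sketch of a minimal custom Environment with a discrete action space. The task, the reward rule, and the use of the underscore-prefixed _state and _terminated attributes (assumed to back the state and terminated properties) are illustrative choices rather than requirements documented here.

import numpy as np
import deepchem as dc

class RandomWalkEnvironment(dc.rl.Environment):
    # Hypothetical 1-D task: reach position +5 by stepping left (action 0) or right (action 1).

    def __init__(self):
        super(RandomWalkEnvironment, self).__init__(state_shape=(1,), n_actions=2)

    def reset(self):
        self._state = np.zeros(1)
        self._terminated = False

    def step(self, action):
        self._state = self._state + (1.0 if action == 1 else -1.0)
        self._terminated = bool(self._state[0] >= 5.0)
        # The reward is a floating point number; higher values are better.
        return 1.0 if self._terminated else 0.0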

action_shape

The expected shape of NumPy arrays representing actions.

If the environment uses a discrete action space, this returns None.

n_actions

The number of possible actions that can be performed in this Environment.

If the environment uses a continuous action space, this returns None.

reset()[source]

Initialize the environment in preparation for doing calculations with it.

This must be called before calling step() or querying the state. You can call it again later to reset the environment back to its original state.

state

The current state of the environment, represented as either a NumPy array or list of arrays.

If reset() has not yet been called at least once, this is undefined.

state_dtype

The dtypes of the arrays that describe a state.

If the state is a single array, this returns the dtype of that array. If the state is a list of arrays, this returns a list containing the dtypes of the arrays.

state_shape

The shape of the arrays that describe a state.

If the state is a single array, this returns a tuple giving the shape of that array. If the state is a list of arrays, this returns a list of tuples where each tuple is the shape of one array.

step(action)[source]

Take a time step by performing an action.

This causes the “state” and “terminated” properties to be updated.

Parameters:
  • action (object) – an object describing the action to take
Returns:
  the reward earned by taking the action, represented as a floating point number (higher values are better)
terminated

Whether the task has reached its end.

If reset() has not yet been called at least once, this is undefined.

class deepchem.rl.GymEnvironment(name)[source]

Bases: deepchem.rl.Environment

This is a convenience class for working with environments from OpenAI Gym.
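
A quick usage sketch, assuming the gym package and its 'CartPole-v0' environment are installed; any registered Gym id can be passed as the name.

import deepchem as dc

env = dc.rl.GymEnvironment('CartPole-v0')
env.reset()
print(env.state_shape, env.n_actions)  # shapes and action count depend on the chosen Gym environment
reward = env.step(0)                   # perform the action with index 0
print(reward, env.terminated)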

action_shape

The expected shape of NumPy arrays representing actions.

If the environment uses a discrete action space, this returns None.

n_actions

The number of possible actions that can be performed in this Environment.

If the environment uses a continuous action space, this returns None.

reset()[source]
state

The current state of the environment, represented as either a NumPy array or list of arrays.

If reset() has not yet been called at least once, this is undefined.

state_dtype

The dtypes of the arrays that describe a state.

If the state is a single array, this returns the dtype of that array. If the state is a list of arrays, this returns a list containing the dtypes of the arrays.

state_shape

The shape of the arrays that describe a state.

If the state is a single array, this returns a tuple giving the shape of that array. If the state is a list of arrays, this returns a list of tuples where each tuple is the shape of one array.

step(action)[source]
terminated

Whether the task has reached its end.

If reset() has not yet been called at least once, this is undefined.

class deepchem.rl.Policy[source]

Bases: object

A policy for taking actions within an environment.

A policy is defined by a set of TensorGraph Layer objects that perform the necessary calculations. There are many algorithms for reinforcement learning, and they differ in what values they require a policy to compute. That makes it impossible to define a single interface allowing any policy to be optimized with any algorithm. Instead, this interface just tries to be as flexible and generic as possible. Each algorithm must document what values it expects create_layers() to return.

Policy objects should be written to support pickling. Many algorithms involve creating multiple copies of the Policy, possibly running in different processes or even on different computers.

create_layers(state, **kwargs)[source]

Create the TensorGraph Layers that define the policy.

The arguments always include a list of Feature layers representing the current state of the environment (one layer for each array in the state). Depending on the algorithm being used, other arguments might get passed as well. It is up to each algorithm to document that.

This method should construct and return a dict that maps strings to Layer objects. Each algorithm must document what Layers it expects the policy to create. If this method is called multiple times, it should create a new set of Layers every time.
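
To complement the discrete-action example shown with A3C above, here is a hedged sketch of a continuous-action policy returning the “action_mean”, “action_std” and “value” outputs A3C expects. The layer names and arguments are assumptions about deepchem.models.tensorgraph.layers, and the two-element action space is an arbitrary illustrative choice.

import tensorflow as tf
import deepchem as dc
from deepchem.models.tensorgraph.layers import Dense, Flatten

class ContinuousPolicy(dc.rl.Policy):

    def create_layers(self, state, **kwargs):
        flat = Flatten(in_layers=state)
        hidden = Dense(out_channels=64, in_layers=[flat], activation_fn=tf.nn.relu)
        # One mean and one (positive) standard deviation per action element.
        action_mean = Dense(out_channels=2, in_layers=[hidden])
        action_std = Dense(out_channels=2, in_layers=[hidden], activation_fn=tf.nn.softplus)
        # Scalar estimate of the value function for the current state.
        value = Dense(out_channels=1, in_layers=[hidden])
        return {'action_mean': action_mean, 'action_std': action_std, 'value': value}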