# deepchem.rl package¶

## deepchem.rl.a3c module¶

Asynchronous Advantage Actor-Critic (A3C) algorithm for reinforcement learning.

class deepchem.rl.a3c.A3C(env, policy, max_rollout_length=20, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)[source]

Bases: object

Implements the Asynchronous Advantage Actor-Critic (A3C) algorithm for reinforcement learning.

The algorithm is described in Mnih et al, “Asynchronous Methods for Deep Reinforcement Learning” (https://arxiv.org/abs/1602.01783). This class requires the policy to output two quantities: a vector giving the probability of taking each action, and an estimate of the value function for the current state. It optimizes both outputs at once using a loss that is the sum of three terms:

1. The policy loss, which seeks to maximize the discounted reward for each action.
2. The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.
3. An entropy term to encourage exploration.

This class only supports environments with discrete action spaces, not continuous ones. The “action” argument passed to the environment is an integer, giving the index of the action to perform.

This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergance. Use the advantage_lambda parameter to adjust the tradeoff.

This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.

To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:

def apply_hindsight(self, states, actions, goal):
... return new_states, rewards

The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states.

fit(total_steps, max_checkpoints_to_keep=5, checkpoint_interval=600, restore=False)[source]

Train the policy.

Parameters: total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted. checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
predict(state, use_saved_states=True, save_states=True)[source]

Compute the policy’s output predictions for a state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters: state (array) – the state of the environment for which to generate predictions use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to all zeros before computing the predictions. save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept. the array of action probabilities, and the estimated value function
restore()[source]

Reload the model parameters from the most recent checkpoint file.

select_action(state, deterministic=False, use_saved_states=True, save_states=True)[source]

Select an action to perform based on the environment’s state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters: state (array) – the state of the environment for which to select an action deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities. use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to all zeros before computing the predictions. save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept. the index of the selected action
class deepchem.rl.a3c.A3CLoss(value_weight, entropy_weight, **kwargs)[source]

This layer computes the loss function for A3C.

add_summary_to_tg()

Can only be called after self.create_layer to gaurentee that name is not none

clone(in_layers)

Create a copy of this layer with different inputs.

copy(replacements={}, variables_graph=None, shared=False)

Duplicate this Layer and all its inputs.

This is similar to clone(), but instead of only cloning one layer, it also recursively calls copy() on all of this layer’s inputs to clone the entire hierarchy of layers. In the process, you can optionally tell it to replace particular layers with specific existing ones. For example, you can clone a stack of layers, while connecting the topmost ones to different inputs.

For example, consider a stack of dense layers that depend on an input:

>>> input = Feature(shape=(None, 100))
>>> dense1 = Dense(100, in_layers=input)
>>> dense2 = Dense(100, in_layers=dense1)
>>> dense3 = Dense(100, in_layers=dense2)


The following will clone all three dense layers, but not the input layer. Instead, the input to the first dense layer will be a different layer specified in the replacements map.

>>> replacements = {input: new_input}
>>> dense3_copy = dense3.copy(replacements)

Parameters: replacements (map) – specifies existing layers, and the layers to replace them with (instead of cloning them). This argument serves two purposes. First, you can pass in a list of replacements to control which layers get cloned. In addition, as each layer is cloned, it is added to this map. On exit, it therefore contains a complete record of all layers that were copied, and a reference to the copy of each one. variables_graph (TensorGraph) – an optional TensorGraph from which to take variables. If this is specified, the current value of each variable in each layer is recorded, and the copy has that value specified as its initial value. This allows a piece of a pre-trained model to be copied to another model. shared (bool) – if True, create new layers by calling shared() on the input layers. This means the newly created layers will share variables with the original ones.
create_tensor(**kwargs)[source]
layer_number_dict = {}
none_tensors()
set_summary(summary_op, summary_description=None, collections=None)

Annotates a tensor with a tf.summary operation Collects data from self.out_tensor by default but can be changed by setting self.tb_input to another tensor in create_tensor

Parameters: summary_op (str) – summary operation to annotate node summary_description (object, optional) – Optional summary_pb2.SummaryDescription() collections (list of graph collections keys, optional) – New summary op is added to these collections. Defaults to [GraphKeys.SUMMARIES]
set_tensors(tensor)
set_variable_initial_values(values)

Set the initial values of all variables.

This takes a list, which contains the initial values to use for all of this layer’s values (in the same order retured by TensorGraph.get_layer_variables()). When this layer is used in a TensorGraph, it will automatically initialize each variable to the value specified in the list. Note that some layers also have separate mechanisms for specifying variable initializers; this method overrides them. The purpose of this method is to let a Layer object represent a pre-trained layer, complete with trained values for its variables.

shape

Get the shape of this Layer’s output.

shared(in_layers)

Create a copy of this layer that shares variables with it.

This is similar to clone(), but where clone() creates two independent layers, this causes the layers to share variables with each other.

Parameters: in_layers (list tensor) – in tensors for the shared layer (List) – Layer

## deepchem.rl.ppo module¶

Proximal Policy Optimization (PPO) algorithm for reinforcement learning.

class deepchem.rl.ppo.PPO(env, policy, max_rollout_length=20, optimization_rollouts=8, optimization_epochs=4, batch_size=64, clipping_width=0.2, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)[source]

Bases: object

Implements the Proximal Policy Optimization (PPO) algorithm for reinforcement learning.

The algorithm is described in Schulman et al, “Proximal Policy Optimization Algorithms” (https://openai-public.s3-us-west-2.amazonaws.com/blog/2017-07/ppo/ppo-arxiv.pdf). This class requires the policy to output two quantities: a vector giving the probability of taking each action, and an estimate of the value function for the current state. It optimizes both outputs at once using a loss that is the sum of three terms:

1. The policy loss, which seeks to maximize the discounted reward for each action.
2. The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.
3. An entropy term to encourage exploration.

This class only supports environments with discrete action spaces, not continuous ones. The “action” argument passed to the environment is an integer, giving the index of the action to perform.

This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergance. Use the advantage_lambda parameter to adjust the tradeoff.

This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.

To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:

def apply_hindsight(self, states, actions, goal):
... return new_states, rewards

The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states.

fit(total_steps, max_checkpoints_to_keep=5, checkpoint_interval=600, restore=False)[source]

Train the policy.

Parameters: total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted. checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
predict(state, use_saved_states=True, save_states=True)[source]

Compute the policy’s output predictions for a state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters: state (array) – the state of the environment for which to generate predictions use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to all zeros before computing the predictions. save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept. the array of action probabilities, and the estimated value function
restore()[source]

Reload the model parameters from the most recent checkpoint file.

select_action(state, deterministic=False, use_saved_states=True, save_states=True)[source]

Select an action to perform based on the environment’s state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters: state (array) – the state of the environment for which to select an action deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities. use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to all zeros before computing the predictions. save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept. the index of the selected action
class deepchem.rl.ppo.PPOLoss(value_weight, entropy_weight, clipping_width, **kwargs)[source]

This layer computes the loss function for PPO.

add_summary_to_tg()

Can only be called after self.create_layer to gaurentee that name is not none

clone(in_layers)

Create a copy of this layer with different inputs.

copy(replacements={}, variables_graph=None, shared=False)

Duplicate this Layer and all its inputs.

This is similar to clone(), but instead of only cloning one layer, it also recursively calls copy() on all of this layer’s inputs to clone the entire hierarchy of layers. In the process, you can optionally tell it to replace particular layers with specific existing ones. For example, you can clone a stack of layers, while connecting the topmost ones to different inputs.

For example, consider a stack of dense layers that depend on an input:

>>> input = Feature(shape=(None, 100))
>>> dense1 = Dense(100, in_layers=input)
>>> dense2 = Dense(100, in_layers=dense1)
>>> dense3 = Dense(100, in_layers=dense2)


The following will clone all three dense layers, but not the input layer. Instead, the input to the first dense layer will be a different layer specified in the replacements map.

>>> replacements = {input: new_input}
>>> dense3_copy = dense3.copy(replacements)

Parameters: replacements (map) – specifies existing layers, and the layers to replace them with (instead of cloning them). This argument serves two purposes. First, you can pass in a list of replacements to control which layers get cloned. In addition, as each layer is cloned, it is added to this map. On exit, it therefore contains a complete record of all layers that were copied, and a reference to the copy of each one. variables_graph (TensorGraph) – an optional TensorGraph from which to take variables. If this is specified, the current value of each variable in each layer is recorded, and the copy has that value specified as its initial value. This allows a piece of a pre-trained model to be copied to another model. shared (bool) – if True, create new layers by calling shared() on the input layers. This means the newly created layers will share variables with the original ones.
create_tensor(**kwargs)[source]
layer_number_dict = {}
none_tensors()
set_summary(summary_op, summary_description=None, collections=None)

Annotates a tensor with a tf.summary operation Collects data from self.out_tensor by default but can be changed by setting self.tb_input to another tensor in create_tensor

Parameters: summary_op (str) – summary operation to annotate node summary_description (object, optional) – Optional summary_pb2.SummaryDescription() collections (list of graph collections keys, optional) – New summary op is added to these collections. Defaults to [GraphKeys.SUMMARIES]
set_tensors(tensor)
set_variable_initial_values(values)

Set the initial values of all variables.

This takes a list, which contains the initial values to use for all of this layer’s values (in the same order retured by TensorGraph.get_layer_variables()). When this layer is used in a TensorGraph, it will automatically initialize each variable to the value specified in the list. Note that some layers also have separate mechanisms for specifying variable initializers; this method overrides them. The purpose of this method is to let a Layer object represent a pre-trained layer, complete with trained values for its variables.

shape

Get the shape of this Layer’s output.

shared(in_layers)

Create a copy of this layer that shares variables with it.

This is similar to clone(), but where clone() creates two independent layers, this causes the layers to share variables with each other.

Parameters: in_layers (list tensor) – in tensors for the shared layer (List) – Layer

## Module contents¶

Interface for reinforcement learning.

class deepchem.rl.Environment(state_shape, n_actions, state_dtype=None)[source]

Bases: object

An environment in which an actor performs actions to accomplish a task.

An environment has a current state, which is represented as either a single NumPy array, or optionally a list of NumPy arrays. When an action is taken, that causes the state to be updated. Exactly what is meant by an “action” is defined by each subclass. As far as this interface is concerned, it is simply an arbitrary object. The environment also computes a reward for each action, and reports when the task has been terminated (meaning that no more actions may be taken).

Environment objects should be written to support pickle and deepcopy operations. Many algorithms involve creating multiple copies of the Environment, possibly running in different processes or even on different computers.

n_actions

The number of possible actions that can be performed in this Environment.

reset()[source]

Initialize the environment in preparation for doing calculations with it.

This must be called before calling step() or querying the state. You can call it again later to reset the environment back to its original state.

state

The current state of the environment, represented as either a NumPy array or list of arrays.

If reset() has not yet been called at least once, this is undefined.

state_dtype

The dtypes of the arrays that describe a state.

If the state is a single array, this returns the dtype of that array. If the state is a list of arrays, this returns a list containing the dtypes of the arrays.

state_shape

The shape of the arrays that describe a state.

If the state is a single array, this returns a tuple giving the shape of that array. If the state is a list of arrays, this returns a list of tuples where each tuple is the shape of one array.

step(action)[source]

Take a time step by performing an action.

This causes the “state” and “terminated” properties to be updated.

Parameters: action (object) – an object describing the action to take the reward earned by taking the action, represented as a floating point number (higher values are better)
terminated

Whether the task has reached its end.

If reset() has not yet been called at least once, this is undefined.

class deepchem.rl.GymEnvironment(name)[source]

This is a convenience class for working with environments from OpenAI Gym.

n_actions

The number of possible actions that can be performed in this Environment.

reset()[source]
state

The current state of the environment, represented as either a NumPy array or list of arrays.

If reset() has not yet been called at least once, this is undefined.

state_dtype

The dtypes of the arrays that describe a state.

If the state is a single array, this returns the dtype of that array. If the state is a list of arrays, this returns a list containing the dtypes of the arrays.

state_shape

The shape of the arrays that describe a state.

If the state is a single array, this returns a tuple giving the shape of that array. If the state is a list of arrays, this returns a list of tuples where each tuple is the shape of one array.

step(action)[source]
terminated

Whether the task has reached its end.

If reset() has not yet been called at least once, this is undefined.

class deepchem.rl.Policy[source]

Bases: object

A policy for taking actions within an environment.

A policy is defined by a set of TensorGraph Layer objects that perform the necessary calculations. There are many algorithms for reinforcement learning, and they differ in what values they require a policy to compute. That makes it impossible to define a single interface allowing any policy to be optimized with any algorithm. Instead, this interface just tries to be as flexible and generic as possible. Each algorithm must document what values it expects create_layers() to return.

Policy objects should be written to support pickling. Many algorithms involve creating multiple copies of the Policy, possibly running in different processes or even on different computers.

create_layers(state, **kwargs)[source]

Create the TensorGraph Layers that define the policy.

The arguments always include a list of Feature layers representing the current state of the environment (one layer for each array in the state). Depending on the algorithm being used, other arguments might get passed as well. It is up to each algorithm to document that.

This method should construct and return a dict that maps strings to Layer objects. Each algorithm must document what Layers it expects the policy to create. If this method is called multiple times, it should create a new set of Layers every time.