btgym.algorithms.policy.base module

class btgym.algorithms.policy.base.BaseAacPolicy(ob_space, ac_space, rp_sequence_size, lstm_class=<class 'tensorflow.python.ops.rnn_cell_impl.BasicLSTMCell'>, lstm_layers=(256, ), action_dp_alpha=200.0, aux_estimate=False, **kwargs)[source]

Base advantage actor-critic Convolution-LSTM policy estimator with auxiliary control tasks for discrete or nested discrete action spaces.

Papers:

https://arxiv.org/abs/1602.01783

https://arxiv.org/abs/1611.05397

Defines [partially shared] on/off-policy networks for estimating action-logits, value function, reward and state ‘pixel_change’ predictions. Expects multi-modal observation as array of shape ob_space.

Parameters:
  • ob_space – instance of btgym.spaces.DictSpace
  • ac_space – instance of btgym.spaces.ActionDictSpace
  • rp_sequence_size – reward prediction sample length
  • lstm_class – TensorFlow LSTM cell class to use (default: BasicLSTMCell)
  • lstm_layers – tuple of LSTM layers sizes
  • aux_estimate – bool, if True - add auxiliary tasks estimations to self.callbacks dictionary
  • time_flat – bool, if True - use static rnn, dynamic otherwise
  • **kwargs – not used
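A minimal construction sketch, assuming a configured BTgym environment env whose observation and action spaces are compatible DictSpace / ActionDictSpace instances (the env handle and the argument values shown are illustrative assumptions, not prescribed defaults; in practice the trainer builds the policy inside its TensorFlow graph):

    from btgym.algorithms.policy.base import BaseAacPolicy

    # Illustrative sketch only: `env` is assumed to be an already configured
    # BTgym environment exposing compatible DictSpace / ActionDictSpace objects.
    policy = BaseAacPolicy(
        ob_space=env.observation_space,   # btgym.spaces.DictSpace
        ac_space=env.action_space,        # btgym.spaces.ActionDictSpace
        rp_sequence_size=4,               # reward prediction sample length
        lstm_layers=(256,),               # single LSTM layer of size 256
        aux_estimate=True,                # register auxiliary task estimations
    )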
get_initial_features(**kwargs)[source]

Returns initial context.

Returns: LSTM zero-state tuple.
act(observation, lstm_state, last_action, last_reward, deterministic=False)[source]

Emits action.

Parameters:
  • observation – dictionary containing single observation
  • lstm_state – lstm context value
  • last_action – action value from previous step
  • last_reward – reward value from previous step
  • deterministic – bool, if True - act deterministically, otherwise use random sampling (default); effective for discrete action space only (TODO: continuous)
Returns:

Action as dictionary of several action encodings, action logits, V-fn value, and output RNN state
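A hedged single-step usage sketch, assuming policy and env were built as in the constructor example above and a TensorFlow session is active (the initial_action placeholder is a hypothetical stand-in for a valid action encoding):

    # Illustrative rollout step only.
    context = policy.get_initial_features()      # LSTM zero-state tuple
    observation = env.reset()
    last_action, last_reward = initial_action, 0.0

    step_output = policy.act(observation, context, last_action, last_reward)
    # step_output bundles the action encodings, action logits, V-fn value and
    # the output RNN state described above; the returned RNN state is fed back
    # into the next act() call.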

get_value(observation, lstm_state, last_action, last_reward)[source]

Estimates policy V-function.

Parameters:
  • observation – single observation value
  • lstm_state – lstm context value
  • last_action – action value from previous step
  • last_reward – reward value from previous step
Returns:

V-function value
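For completeness, a one-line usage sketch under the same assumptions as the act() example above, e.g. for bootstrapping the value of the last rollout state:

    # Hypothetical usage: estimate V(s) for the current observation and context.
    value = policy.get_value(observation, context, last_action, last_reward)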

get_pc_target(state, last_state, **kwargs)[source]

Estimates pixel-control task target.

Parameters:
  • state – single observation value
  • last_state – single observation value
  • **kwargs – not used
Returns:

Estimated absolute difference between two subsampled states.
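The NumPy sketch below illustrates the idea of this target (absolute change between consecutive frames, averaged over channels and mean-pooled over small spatial blocks); the block size and cropping are assumptions for illustration, not the policy's actual subsampling settings:

    import numpy as np

    def pc_target_sketch(state, last_state, block=4):
        # Illustrative only: per-pixel absolute change between two frames,
        # channel-averaged and mean-pooled over block x block regions.
        diff = np.abs(np.asarray(state, dtype=np.float32)
                      - np.asarray(last_state, dtype=np.float32))
        if diff.ndim == 3:
            diff = diff.mean(axis=-1)          # collapse channel axis
        h = diff.shape[0] // block * block
        w = diff.shape[1] // block * block
        diff = diff[:h, :w]                    # crop to whole blocks
        return diff.reshape(h // block, block, w // block, block).mean(axis=(1, 3))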

static get_sample_config(*args, **kwargs)[source]

Dummy implementation.

Returns: default data sample configuration dictionary, btgym.datafeed.base.EnvResetConfig
class btgym.algorithms.policy.base.Aac1dPolicy(ob_space, ac_space, rp_sequence_size, lstm_class=<class 'tensorflow.python.ops.rnn_cell_impl.BasicLSTMCell'>, lstm_layers=(256, ), action_dp_alpha=200.0, aux_estimate=True, **kwargs)[source]

AAC policy for one-dimensional signal obs. state.

Defines [partially shared] on/off-policy networks for estimating action-logits, value function, reward and state ‘pixel_change’ predictions. Expects bi-modal observation as dict: external, internal.

Parameters:
  • ob_space – dictionary of observation state shapes
  • ac_space – discrete action space shape (length)
  • rp_sequence_size – reward prediction sample length
  • lstm_class – tf.nn.lstm class
  • lstm_layers – tuple of LSTM layers sizes
  • aux_estimate – (bool), if True - add auxiliary tasks estimations to self.callbacks dictionary.
  • **kwargs – not used

btgym.algorithms.policy.stacked_lstm module

class btgym.algorithms.policy.stacked_lstm.StackedLstmPolicy(ob_space, ac_space, rp_sequence_size, state_encoder_class_ref=<function conv_2d_network>, lstm_class_ref=<class 'tensorflow.contrib.rnn.python.ops.rnn_cell.LayerNormBasicLSTMCell'>, lstm_layers=(256, 256), linear_layer_ref=<function noisy_linear>, share_encoder_params=False, dropout_keep_prob=1.0, action_dp_alpha=200.0, aux_estimate=False, encode_internal_state=False, static_rnn=True, shared_p_v=False, **kwargs)[source]

Conv.-Stacked_LSTM policy, based on NAV A3C agent architecture from

LEARNING TO NAVIGATE IN COMPLEX ENVIRONMENTS by Mirowski et al. and

LEARNING TO REINFORCEMENT LEARN by JX Wang et al.

Papers:

https://arxiv.org/pdf/1611.03673.pdf

https://arxiv.org/pdf/1611.05763.pdf

Defines [partially shared] on/off-policy networks for estimating action-logits, value function, reward and state ‘pixel_change’ predictions. Expects multi-modal observation as array of shape ob_space.

Parameters:
  • ob_space – instance of btgym.spaces.DictSpace
  • ac_space – instance of btgym.spaces.ActionDictSpace
  • rp_sequence_size – reward prediction sample length
  • lstm_class_ref – TensorFlow LSTM cell class to use (default: LayerNormBasicLSTMCell)
  • lstm_layers – tuple of LSTM layers sizes
  • linear_layer_ref – linear layer class to use
  • share_encoder_params – bool, whether to share encoder parameters for every ‘external’ data stream
  • dropout_keep_prob – dropout regularisation parameter, in (0, 1]
  • action_dp_alpha
  • aux_estimate – (bool), if True - add auxiliary tasks estimations to self.callbacks dictionary
  • encode_internal_state – use encoder over ‘internal’ part of observation space
  • static_rnn – (bool), if True - use static rnn graph, dynamic otherwise
  • **kwargs – not used
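A construction sketch under the same assumptions as the BaseAacPolicy example above (an env handle providing compatible observation and action spaces; argument values are illustrative, not recommended settings):

    from btgym.algorithms.policy.stacked_lstm import StackedLstmPolicy

    # Illustrative sketch only; normally the trainer instantiates the policy.
    policy = StackedLstmPolicy(
        ob_space=env.observation_space,
        ac_space=env.action_space,
        rp_sequence_size=4,
        lstm_layers=(256, 256),        # two stacked LSTM levels
        share_encoder_params=False,    # separate encoder weights per 'external' stream
        dropout_keep_prob=1.0,         # no dropout
        encode_internal_state=False,   # do not encode the 'internal' observation part
        static_rnn=True,               # build a static RNN graph
    )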
class btgym.algorithms.policy.stacked_lstm.AacStackedRL2Policy(lstm_2_init_period=50, **kwargs)[source]

Attempt to implement two-level RL^2. This policy class, in conjunction with DataDomain classes from btgym.datafeed, is aimed at implementing the RL^2 algorithm by Duan et al.

Paper: FAST REINFORCEMENT LEARNING VIA SLOW REINFORCEMENT LEARNING, https://arxiv.org/pdf/1611.02779.pdf

The only difference from the base policy is the get_initial_features() method, which has been changed to either reset the RNN context to zero-state or return the context from the end of the previous episode, depending on the episode metadata received and the `lstm_2_init_period` parameter.

get_initial_features(state, context=None)[source]

Returns RNN initial context. RNN_1 (lower) context is reset at every call.

RNN_2 (upper) context is reset:
  • every `lstm_2_init_period` episodes;
  • when the episode initial state trial_num metadata has changed from the last call (a new train trial has started);
  • when the episode metadata type is non-zero (test episode);
  • when no context arg is provided (initial episode of training);
  • ...else the context is carried over to the new episode.

Episode metadata are provided by DataTrialIterator, which shapes the Trial data distribution in this case, and are delivered through env.strategy as a separate key in the observation dictionary.

Parameters:
  • state – initial episode state (result of env.reset())
  • context – last previous episode RNN state (last_context of runner)
Returns:

Stacked (two-level) RNN initial context tuple.

Raises:

KeyError – if `metadata` [`trial_num`, `type`] keys are not found
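A simplified sketch of the reset rule described above; the function name, argument layout and metadata field access are illustrative assumptions, not the actual method body:

    def should_reset_rnn_2(state, context, episode_count, last_trial_num,
                           lstm_2_init_period=50):
        # Hypothetical helper mirroring the reset conditions listed above.
        metadata = state['metadata']                  # raises KeyError if missing
        if context is None:                           # initial episode of training
            return True
        if metadata['type'] != 0:                     # test episode
            return True
        if metadata['trial_num'] != last_trial_num:   # new train trial started
            return True
        if episode_count % lstm_2_init_period == 0:   # periodic reset
            return True
        return False                                  # else carry the context over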