btgym.algorithms.policy.base module
class btgym.algorithms.policy.base.BaseAacPolicy(ob_space, ac_space, rp_sequence_size, lstm_class=<class 'tensorflow.python.ops.rnn_cell_impl.BasicLSTMCell'>, lstm_layers=(256,), action_dp_alpha=200.0, aux_estimate=False, **kwargs)[source]

Base advantage actor-critic Convolution-LSTM policy estimator with auxiliary control tasks, for discrete or nested discrete action spaces.
Defines [partially shared] on/off-policy networks for estimating action logits, value function, reward and state ‘pixel_change’ predictions. Expects a multi-modal observation as an array of shape ob_space.
Parameters: - ob_space – instance of btgym.spaces.DictSpace
- ac_space – instance of btgym.spaces.ActionDictSpace
- rp_sequence_size – reward prediction sample length
- lstm_class – tf.nn.lstm class
- lstm_layers – tuple of LSTM layers sizes
- aux_estimate – bool, if True - add auxiliary tasks estimations to self.callbacks dictionary
- time_flat – bool, if True - use static rnn, dynamic otherwise
- **kwargs – not used
act(observation, lstm_state, last_action, last_reward, deterministic=False)[source]

Emits action.
Parameters: - observation – dictionary containing single observation
- lstm_state – lstm context value
- last_action – action value from previous step
- last_reward – reward value from previous step
- deterministic – bool, if True - act deterministically, use random sampling otherwise (default); effective for discrete action space only (TODO: continuous)
Returns: Action as a dictionary of several action encodings, action logits, V-fn value, output RNN state
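For illustration, a single forward pass might look like the sketch below; the helper name is arbitrary, the four-way unpacking of the returned bundle is an assumption inferred from the Returns description above, and all surrounding runner machinery is omitted:

    # Hedged sketch of one step through BaseAacPolicy.act(); the 4-tuple
    # unpacking is an assumption based on the documented Returns entry.
    def policy_step(policy, observation, lstm_state, last_action, last_reward):
        action, logits, value, next_lstm_state = policy.act(
            observation=observation,   # dict containing a single observation
            lstm_state=lstm_state,     # current RNN context
            last_action=last_action,   # action taken on the previous step
            last_reward=last_reward,   # reward received on the previous step
            deterministic=False,       # sample stochastically while training
        )
        return action, value, next_lstm_state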
get_value(observation, lstm_state, last_action, last_reward)[source]

Estimates policy V-function.
Parameters: - observation – single observation value
- lstm_state – lstm context value
- last_action – action value from previous step
- last_reward – reward value from previous step
Returns: V-function value
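A typical consumer of get_value() is the bootstrap at the end of a truncated rollout; a minimal sketch, assuming terminal states bootstrap with zero:

    # Hedged sketch: value used to bootstrap discounted returns of a rollout.
    def bootstrap_value(policy, terminal, observation, lstm_state, last_action, last_reward):
        if terminal:
            return 0.0  # no future return after a terminal state
        return policy.get_value(
            observation=observation,
            lstm_state=lstm_state,
            last_action=last_action,
            last_reward=last_reward,
        )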
class btgym.algorithms.policy.base.Aac1dPolicy(ob_space, ac_space, rp_sequence_size, lstm_class=<class 'tensorflow.python.ops.rnn_cell_impl.BasicLSTMCell'>, lstm_layers=(256,), action_dp_alpha=200.0, aux_estimate=True, **kwargs)[source]

AAC policy for one-dimensional signal observation state.
Defines [partially shared] on/off-policy networks for estimating action logits, value function, reward and state ‘pixel_change’ predictions. Expects a bi-modal observation as a dict with keys: external, internal.
Parameters: - ob_space – dictionary of observation state shapes
- ac_space – discrete action space shape (length)
- rp_sequence_size – reward prediction sample length
- lstm_class – tf.nn.lstm class
- lstm_layers – tuple of LSTM layers sizes
- aux_estimate – (bool), if True - add auxiliary tasks estimations to self.callbacks dictionary.
- **kwargs – not used
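In practice the policy class is usually handed to a trainer by reference; the sketch below follows the class_ref/kwargs configuration pattern used in the btgym examples and assumes that ob_space, ac_space and rp_sequence_size are filled in by the trainer at build time:

    # Illustrative policy_config sketch; the wrapping convention is assumed
    # from btgym examples, only keyword arguments documented above are set.
    from btgym.algorithms.policy.base import Aac1dPolicy

    policy_config = dict(
        class_ref=Aac1dPolicy,   # policy class the trainer will instantiate
        kwargs=dict(
            lstm_layers=(256,),  # single LSTM layer of size 256 (the default)
            aux_estimate=True,   # register auxiliary task estimations in self.callbacks
        ),
    )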
btgym.algorithms.policy.stacked_lstm module
class btgym.algorithms.policy.stacked_lstm.StackedLstmPolicy(ob_space, ac_space, rp_sequence_size, state_encoder_class_ref=<function conv_2d_network>, lstm_class_ref=<class 'tensorflow.contrib.rnn.python.ops.rnn_cell.LayerNormBasicLSTMCell'>, lstm_layers=(256, 256), linear_layer_ref=<function noisy_linear>, share_encoder_params=False, dropout_keep_prob=1.0, action_dp_alpha=200.0, aux_estimate=False, encode_internal_state=False, static_rnn=True, shared_p_v=False, **kwargs)[source]

Convolutional Stacked-LSTM policy, based on the NAV A3C agent architecture from
LEARNING TO NAVIGATE IN COMPLEX ENVIRONMENTS by Mirowski et al. and
LEARNING TO REINFORCEMENT LEARN by JX Wang et al.
Papers:
https://arxiv.org/pdf/1611.03673.pdf
https://arxiv.org/pdf/1611.05763.pdf
Defines [partially shared] on/off-policy networks for estimating action logits, value function, reward and state ‘pixel_change’ predictions. Expects a multi-modal observation as an array of shape ob_space.
Parameters: - ob_space – instance of btgym.spaces.DictSpace
- ac_space – instance of btgym.spaces.ActionDictSpace
- rp_sequence_size – reward prediction sample length
- lstm_class_ref – tf.nn.lstm class to use
- lstm_layers – tuple of LSTM layers sizes
- linear_layer_ref – linear layer class to use
- share_encoder_params – bool, whether to share encoder parameters for every ‘external’ data stream
- dropout_keep_prob – in (0, 1] dropout regularisation parameter
- action_dp_alpha –
- aux_estimate – (bool), if True - add auxiliary tasks estimations to self.callbacks dictionary
- encode_internal_state – use encoder over ‘internal’ part of observation space
- static_rnn – (bool), if True - use static rnn graph, dynamic otherwise
- **kwargs – not used
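A hedged configuration sketch for this class, restricted to keyword arguments documented above and using the same assumed class_ref/kwargs wrapping as in the previous example; the values shown are the documented defaults, not a recommended setting:

    from btgym.algorithms.policy.stacked_lstm import StackedLstmPolicy

    policy_config = dict(
        class_ref=StackedLstmPolicy,
        kwargs=dict(
            lstm_layers=(256, 256),       # two stacked LSTM layers
            share_encoder_params=False,   # separate encoder per 'external' data stream
            dropout_keep_prob=1.0,        # value in (0, 1]; 1.0 disables dropout
            encode_internal_state=False,  # do not run the encoder over the 'internal' modality
            static_rnn=True,              # static RNN graph (dynamic otherwise)
        ),
    )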
class btgym.algorithms.policy.stacked_lstm.AacStackedRL2Policy(lstm_2_init_period=50, **kwargs)[source]

Attempt to implement two-level RL^2. This policy class, in conjunction with DataDomain classes from btgym.datafeed, is aimed at implementing the RL^2 algorithm by Duan et al.

Paper: FAST REINFORCEMENT LEARNING VIA SLOW REINFORCEMENT LEARNING, https://arxiv.org/pdf/1611.02779.pdf

The only difference from the base policy is the get_initial_features() method, which has been changed to either reset the RNN context to zero-state or return the context from the end of the previous episode, depending on the episode metadata received and the `lstm_2_init_period` parameter.
get_initial_features(state, context=None)[source]

Returns RNN initial context. RNN_1 (lower) context is reset at every call.

RNN_2 (upper) context is reset if:
- `lstm_2_init_period` episodes have passed since the last reset;
- episode initial state `trial_num` metadata has changed from the last call (new train trial started);
- episode metadata `type` is non-zero (test episode);
- no context arg is provided (initial episode of training);
… else the context is carried on to the new episode.
Episode metadata are provided by DataTrialIterator, which shapes the Trial data distribution in this case, and are delivered through env.strategy as a separate key in the observation dictionary.
Parameters: - state – initial episode state (result of env.reset())
- context – last previous episode RNN state (last_context of runner)
Returns: 2_RNN zero-state tuple.
Raises: KeyError – if [`metadata`]: [`trial_num`, `type`] keys not found
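The reset rules above can be summarised by the following usage sketch; it is not the library's runner code, only an illustration of where the call sits at an episode boundary:

    # Hedged sketch: obtaining RNN context at the start of a new episode.
    # `policy` is an AacStackedRL2Policy, `env` a BTgym environment whose
    # initial observation carries the ['metadata'] keys mentioned above.
    def start_episode(policy, env, last_context=None):
        state = env.reset()
        # RNN_1 context is always zeroed; RNN_2 context is either zeroed or
        # carried over from `last_context`, per the rules listed above.
        context = policy.get_initial_features(state=state, context=last_context)
        return state, context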