btgym.algorithms.policy.base module
class btgym.algorithms.policy.base.BaseAacPolicy(ob_space, ac_space, rp_sequence_size, lstm_class=<class 'tensorflow.python.ops.rnn_cell_impl.BasicLSTMCell'>, lstm_layers=(256,), action_dp_alpha=200.0, aux_estimate=False, **kwargs)[source]

Base advantage actor-critic Convolution-LSTM policy estimator with auxiliary control tasks, for discrete or nested discrete action spaces.
Defines [partially shared] on/off-policy networks for estimating action logits, value function, reward and state ‘pixel_change’ predictions. Expects a multi-modal observation as an array of shape ob_space.
Parameters: - ob_space – instance of btgym.spaces.DictSpace
- ac_space – instance of btgym.spaces.ActionDictSpace
- rp_sequence_size – reward prediction sample length
- lstm_class – tf.nn.lstm class
- lstm_layers – tuple of LSTM layers sizes
- aux_estimate – bool, if True - add auxiliary tasks estimations to self.callbacks dictionary
- time_flat – bool, if True - use static rnn, dynamic otherwise
- **kwargs – not used
act(observation, lstm_state, last_action, last_reward, deterministic=False)[source]

Emits action.
Parameters: - observation – dictionary containing single observation
- lstm_state – lstm context value
- last_action – action value from previous step
- last_reward – reward value from previous step
- deterministic – bool, if True - act deterministically, use random sampling otherwise (default); effective for discrete action space only (TODO: continuous)
Returns: Action as a dictionary of several action encodings, action logits, V-fn value, output RNN state
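For illustration, a single forward pass might look like the sketch below; the helper name is arbitrary, the four-way unpacking of the returned bundle is an assumption inferred from the Returns description above, and all surrounding runner machinery is omitted:

    # Hedged sketch of one step through BaseAacPolicy.act(); the 4-tuple
    # unpacking is an assumption based on the documented Returns entry.
    def policy_step(policy, observation, lstm_state, last_action, last_reward):
        action, logits, value, next_lstm_state = policy.act(
            observation=observation,   # dict containing a single observation
            lstm_state=lstm_state,     # current RNN context
            last_action=last_action,   # action taken on the previous step
            last_reward=last_reward,   # reward received on the previous step
            deterministic=False,       # sample stochastically while training
        )
        return action, value, next_lstm_state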
get_value(observation, lstm_state, last_action, last_reward)[source]

Estimates policy V-function.
Parameters: - observation – single observation value
- lstm_state – lstm context value
- last_action – action value from previous step
- last_reward – reward value from previous step
Returns: V-function value
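A typical consumer of get_value() is the bootstrap at the end of a truncated rollout; a minimal sketch, assuming terminal states bootstrap with zero:

    # Hedged sketch: value used to bootstrap discounted returns of a rollout.
    def bootstrap_value(policy, terminal, observation, lstm_state, last_action, last_reward):
        if terminal:
            return 0.0  # no future return after a terminal state
        return policy.get_value(
            observation=observation,
            lstm_state=lstm_state,
            last_action=last_action,
            last_reward=last_reward,
        )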
class btgym.algorithms.policy.base.Aac1dPolicy(ob_space, ac_space, rp_sequence_size, lstm_class=<class 'tensorflow.python.ops.rnn_cell_impl.BasicLSTMCell'>, lstm_layers=(256,), action_dp_alpha=200.0, aux_estimate=True, **kwargs)[source]

AAC policy for one-dimensional signal observation state.
Defines [partially shared] on/off-policy networks for estimating action logits, value function, reward and state ‘pixel_change’ predictions. Expects a bi-modal observation as a dict with keys: external, internal.
Parameters: - ob_space – dictionary of observation state shapes
- ac_space – discrete action space shape (length)
- rp_sequence_size – reward prediction sample length
- lstm_class – tf.nn.lstm class
- lstm_layers – tuple of LSTM layers sizes
- aux_estimate – (bool), if True - add auxiliary tasks estimations to self.callbacks dictionary.
- **kwargs – not used
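In practice the policy class is usually handed to a trainer by reference; the sketch below follows the class_ref/kwargs configuration pattern used in the btgym examples and assumes that ob_space, ac_space and rp_sequence_size are filled in by the trainer at build time:

    # Illustrative policy_config sketch; the wrapping convention is assumed
    # from btgym examples, only keyword arguments documented above are set.
    from btgym.algorithms.policy.base import Aac1dPolicy

    policy_config = dict(
        class_ref=Aac1dPolicy,   # policy class the trainer will instantiate
        kwargs=dict(
            lstm_layers=(256,),  # single LSTM layer of size 256 (the default)
            aux_estimate=True,   # register auxiliary task estimations in self.callbacks
        ),
    )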
btgym.algorithms.policy.stacked_lstm module
class btgym.algorithms.policy.stacked_lstm.StackedLstmPolicy(ob_space, ac_space, rp_sequence_size, state_encoder_class_ref=<function conv_2d_network>, lstm_class_ref=<class 'tensorflow.contrib.rnn.python.ops.rnn_cell.LayerNormBasicLSTMCell'>, lstm_layers=(256, 256), linear_layer_ref=<function noisy_linear>, share_encoder_params=False, dropout_keep_prob=1.0, action_dp_alpha=200.0, aux_estimate=False, encode_internal_state=False, static_rnn=True, shared_p_v=False, **kwargs)[source]

Convolutional Stacked-LSTM policy, based on the NAV A3C agent architecture from
LEARNING TO NAVIGATE IN COMPLEX ENVIRONMENTS by Mirowski et al. and
LEARNING TO REINFORCEMENT LEARN by JX Wang et al.
Papers:
https://arxiv.org/pdf/1611.03673.pdf
https://arxiv.org/pdf/1611.05763.pdf
Defines [partially shared] on/off-policy networks for estimating action logits, value function, reward and state ‘pixel_change’ predictions. Expects a multi-modal observation as an array of shape ob_space.
Parameters: - ob_space – instance of btgym.spaces.DictSpace
- ac_space – instance of btgym.spaces.ActionDictSpace
- rp_sequence_size – reward prediction sample length
- lstm_class_ref – tf.nn.lstm class to use
- lstm_layers – tuple of LSTM layers sizes
- linear_layer_ref – linear layer class to use
- share_encoder_params – bool, whether to share encoder parameters for every ‘external’ data stream
- dropout_keep_prob – in (0, 1] dropout regularisation parameter
- action_dp_alpha –
- aux_estimate – (bool), if True - add auxiliary tasks estimations to self.callbacks dictionary
- encode_internal_state – use encoder over ‘internal’ part of observation space
- static_rnn – (bool), if True - use static rnn graph, dynamic otherwise
- **kwargs – not used
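A hedged configuration sketch for this class, restricted to keyword arguments documented above and using the same assumed class_ref/kwargs wrapping as in the previous example; the values shown are the documented defaults, not a recommended setting:

    from btgym.algorithms.policy.stacked_lstm import StackedLstmPolicy

    policy_config = dict(
        class_ref=StackedLstmPolicy,
        kwargs=dict(
            lstm_layers=(256, 256),       # two stacked LSTM layers
            share_encoder_params=False,   # separate encoder per 'external' data stream
            dropout_keep_prob=1.0,        # value in (0, 1]; 1.0 disables dropout
            encode_internal_state=False,  # do not run the encoder over the 'internal' modality
            static_rnn=True,              # static RNN graph (dynamic otherwise)
        ),
    )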
class btgym.algorithms.policy.stacked_lstm.AacStackedRL2Policy(lstm_2_init_period=50, **kwargs)[source]

Attempt to implement two-level RL^2. This policy class, in conjunction with DataDomain classes from btgym.datafeed, is aimed at implementing the RL^2 algorithm by Duan et al.

Paper: FAST REINFORCEMENT LEARNING VIA SLOW REINFORCEMENT LEARNING, https://arxiv.org/pdf/1611.02779.pdf

The only difference from the base policy is the get_initial_features() method, which has been changed to either reset the RNN context to zero-state or return the context from the end of the previous episode, depending on the episode metadata received and the `lstm_2_init_period` parameter.
get_initial_features(state, context=None)[source]

Returns RNN initial context. RNN_1 (lower) context is reset at every call.

RNN_2 (upper) context is reset if:
- `lstm_2_init_period` episodes have passed since the last reset;
- episode initial state `trial_num` metadata has changed from the last call (new train trial started);
- episode metadata `type` is non-zero (test episode);
- no context arg is provided (initial episode of training);
… else the context is carried on to the new episode.
Episode metadata are provided by DataTrialIterator, which shapes the Trial data distribution in this case, and are delivered through env.strategy as a separate key in the observation dictionary.
Parameters: - state – initial episode state (result of env.reset())
- context – last previous episode RNN state (last_context of runner)
Returns: 2_RNN zero-state tuple.
Raises: KeyError – if [`metadata`]: [`trial_num`, `type`] keys not found
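The reset rules above can be summarised by the following usage sketch; it is not the library's runner code, only an illustration of where the call sits at an episode boundary:

    # Hedged sketch: obtaining RNN context at the start of a new episode.
    # `policy` is an AacStackedRL2Policy, `env` a BTgym environment whose
    # initial observation carries the ['metadata'] keys mentioned above.
    def start_episode(policy, env, last_context=None):
        state = env.reset()
        # RNN_1 context is always zeroed; RNN_2 context is either zeroed or
        # carried over from `last_context`, per the rules listed above.
        context = policy.get_initial_features(state=state, context=last_context)
        return state, context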