btgym.research.gps.aac module¶
class btgym.research.gps.aac.GuidedAAC(expert_loss=<function guided_aac_loss_def_0_3>, aac_lambda=1.0, guided_lambda=1.0, guided_decay_steps=None, runner_config=None, aux_render_modes=None, name='GuidedA3C', **kwargs)[source]¶
Actor-critic framework augmented with expert actions imitation loss: L_gps = aac_lambda * L_a3c + guided_lambda * L_im.
This implementation is loosely referred to as 'guided policy search' after the algorithm described by S. Levine and P. Abbeel in 'Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics',
in the sense that it exploits the idea of fitting a 'local' (here: single-episode) oracle for an environment with generally unknown dynamics, and uses the actions demonstrated by that oracle to optimize the trajectory distribution of the agent being trained.
Note that this particular implementation of the expert does not provide a complete action-state space trajectory for the agent to follow. Instead, it estimates an advised categorical distribution over actions conditioned on external (i.e. price dynamics) state observations only.
- Papers:
- Levine et al., ‘Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics’
- https://people.eecs.berkeley.edu/~svlevine/papers/mfcgps.pdf
- Brys et al., ‘Reinforcement Learning from Demonstration through Shaping’
- https://www.ijcai.org/Proceedings/15/Papers/472.pdf
- Wiewiora et al., ‘Principled Methods for Advising Reinforcement Learning Agents’
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6412&rep=rep1&type=pdf
Parameters: - expert_loss – callable returning tensor holding on_policy imitation loss graph and summaries
- aac_lambda – float, main on_policy a3c loss lambda
- guided_lambda – float, imitation loss lambda
- guided_decay_steps – number of steps over which guided_lambda is annealed to zero
- name – str, name scope
- **kwargs – see BaseAAC kwargs
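For orientation, here is a minimal plain-Python sketch (not btgym internals; the linear annealing schedule is an assumption) of how aac_lambda, guided_lambda and guided_decay_steps combine the two loss terms named above:

def guided_loss(l_a3c, l_im, aac_lambda=1.0, guided_lambda=1.0,
                guided_decay_steps=None, global_step=0):
    """L_gps = aac_lambda * L_a3c + lambda_t * L_im, with lambda_t annealed towards zero."""
    if guided_decay_steps is not None:
        # Assumed schedule: linearly decay the imitation weight over guided_decay_steps.
        lambda_t = guided_lambda * max(0.0, 1.0 - global_step / guided_decay_steps)
    else:
        lambda_t = guided_lambda
    return aac_lambda * l_a3c + lambda_t * l_im

print(guided_loss(0.5, 0.2, guided_decay_steps=10000, global_step=0))      # 0.7: expert advice active
print(guided_loss(0.5, 0.2, guided_decay_steps=10000, global_step=10000))  # 0.5: pure A3C objective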
btgym.research.gps.policy module¶
class btgym.research.gps.policy.GuidedPolicy_0_0(conv_2d_layer_config=((32, (3, 1), (2, 1)), (32, (3, 1), (2, 1)), (64, (3, 1), (2, 1)), (64, (3, 1), (2, 1))), lstm_class_ref=<class 'tensorflow.contrib.rnn.python.ops.rnn_cell.LayerNormBasicLSTMCell'>, lstm_layers=(256, 256), lstm_2_init_period=50, linear_layer_ref=<function noisy_linear>, **kwargs)[source]¶
Guided policy: simple configuration wrapper around Stacked LSTM architecture.
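A hedged configuration example (argument names are taken from the signature above; interpreting each conv_2d_layer_config entry as (filters, kernel, stride) is an assumption, and wiring the policy into a trainer is omitted):

from btgym.research.gps.policy import GuidedPolicy_0_0

# Sketch only: a shallower conv stack and smaller LSTM layers than the defaults.
policy_class_ref = GuidedPolicy_0_0
policy_kwargs = dict(
    conv_2d_layer_config=(
        (32, (3, 1), (2, 1)),  # (filters, kernel, stride) per layer -- assumed meaning
        (64, (3, 1), (2, 1)),
    ),
    lstm_layers=(128, 128),
    lstm_2_init_period=50,
)
# Both would typically be handed to the trainer configuration; the exact trainer
# argument names are not covered by this page.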
btgym.research.gps.loss module¶
btgym.research.gps.loss.guided_aac_loss_def_0_0(pi_actions, expert_actions, name='on_policy/aac', verbose=False, **kwargs)[source]¶
Cross-entropy imitation loss on expert actions.
Parameters: - pi_actions – tensor holding policy action logits
- expert_actions – tensor holding the expert action probability distribution
- name – loss op name scope
Returns: tensor holding estimated imitation loss; list of related tensorboard summaries.
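A minimal TensorFlow 1.x sketch of a cross-entropy imitation loss of this shape (illustrative only, not the btgym source; tensor names follow the parameters above):

import tensorflow as tf

def cross_entropy_imitation_loss(pi_actions, expert_actions, name='on_policy/aac'):
    # Penalize divergence of the policy's action distribution from the expert-advised one.
    with tf.name_scope(name + '/guided'):
        ce = tf.nn.softmax_cross_entropy_with_logits_v2(
            labels=tf.stop_gradient(expert_actions),  # expert advice is a fixed target
            logits=pi_actions,
        )
        loss = tf.reduce_mean(ce)
        summaries = [tf.summary.scalar('imitation_loss', loss)]
    return loss, summaries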
btgym.research.gps.loss.guided_aac_loss_def_0_1(pi_actions, expert_actions, name='on_policy/aac', verbose=False, **kwargs)[source]¶
Cross-entropy imitation loss on {buy, sell} subset of expert actions.
Parameters: - pi_actions – tensor holding policy action logits
- expert_actions – tensor holding the expert action probability distribution
- name – loss op name scope
Returns: tensor holding estimated imitation loss; list of related tensorboard summaries.
btgym.research.gps.loss.guided_aac_loss_def_0_3(pi_actions, expert_actions, name='on_policy/aac', verbose=False, **kwargs)[source]¶
MSE imitation loss on {buy, sell} subset of expert actions.
Parameters: - pi_actions – tensor holding policy action logits
- expert_actions – tensor holding the expert action probability distribution
- name – loss op name scope
Returns: tensor holding estimated imitation loss; list of related tensorboard summaries.
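For comparison with the cross-entropy variants, a hedged TensorFlow 1.x sketch of an MSE imitation loss restricted to the buy/sell components (action indices 1 and 2 follow the Oracle convention below; illustrative only, not the btgym source):

import tensorflow as tf

def mse_imitation_loss_buy_sell(pi_actions, expert_actions, name='on_policy/aac'):
    with tf.name_scope(name + '/guided'):
        pi_probs = tf.nn.softmax(pi_actions)
        # Compare only the buy (1) and sell (2) components of the two distributions:
        diff = pi_probs[..., 1:3] - tf.stop_gradient(expert_actions[..., 1:3])
        loss = tf.reduce_mean(tf.square(diff))
        summaries = [tf.summary.scalar('imitation_mse_loss', loss)]
    return loss, summaries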
btgym.research.gps.strategy module¶
btgym.research.gps.oracle module¶
class btgym.research.gps.oracle.Oracle(action_space=(0, 1, 2, 3), time_threshold=5, pips_threshold=10, pips_scale=0.0001, kernel_size=5, kernel_stddev=1)[source]¶
Irresponsible financial adviser.
Parameters: - action_space – actions to advise: 0 - hold, 1 - buy, 2 - sell, 3 - close
- time_threshold – how many points (in environment timesteps) on each side to use for the comparison when deciding whether comparator(n, n+x) is True
- pips_threshold – int, minimal peak difference in pips required to consider comparator(n, n+x) to be True
- pips_scale – actual single pip value wrt signal value
- kernel_size – gaussian convolution kernel size (used to compute distribution over actions)
- kernel_stddev – gaussian kernel standard deviation
filter_by_margine(lst, threshold)[source]¶
Filters out peaks whose value difference falls within the given tolerance. Filtering proceeds from first to last index: every succeeding element of the list is removed if its value differs from the currently kept value by less than the given threshold.
Parameters: - lst – list of tuples; each tuple is (value, index)
- threshold – value filtering threshold
Returns: filtered list of tuples
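The filtering rule restated as a plain-Python sketch (hypothetical helper, not the library source):

def filter_by_margin_sketch(lst, threshold):
    # Walk the (value, index) tuples left to right; drop every succeeding peak
    # whose value is closer than `threshold` to the last kept one.
    if not lst:
        return []
    kept = [lst[0]]
    for value, index in lst[1:]:
        if abs(value - kept[-1][0]) >= threshold:
            kept.append((value, index))
    return kept

print(filter_by_margin_sketch([(10, 0), (12, 3), (20, 7)], 5))  # [(10, 0), (20, 7)]: 12 is too close to 10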
estimate_actions(episode_data)[source]¶
Estimates hold/buy/sell signals based on local peaks filtered by time horizon and signal amplitude.
Parameters: episode_data – 1D np.array of unscaled [but possibly resampled] price values in OHL[CV] format
Returns: 1D vector of signals of same length as episode_data
adjust_signals(signal)[source]¶
Adds a simple heuristic (based on examining the learnt policy's action distribution): repeat the same buy or sell signal kernel_size - 1 times.
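The heuristic restated as a numpy sketch (assumed semantics: the original signal plus kernel_size - 1 repeats; not the library source):

import numpy as np

def adjust_signals_sketch(signal, kernel_size=5, buy=1, sell=2):
    out = np.asarray(signal).copy()
    for i, s in enumerate(signal):
        if s in (buy, sell):
            # Hold the buy/sell advice for kernel_size steps in total.
            out[i:i + kernel_size] = s
    return out

print(adjust_signals_sketch([0, 1, 0, 0, 0, 0, 0, 2, 0, 0], kernel_size=3))
# -> [0 1 1 1 0 0 0 2 2 2]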
fit(episode_data, resampling_factor=1)[source]¶
Estimates the advised action probability distribution based on the data received.
Parameters: - episode_data – 1D np.array of unscaled price values in OHL[CV] format
- resampling_factor – factor by which to resample given data by taking min/max values inside every resampled bar
Returns: np.array of shape [resampled_data_size, action_space_size] holding probabilities of advised actions, where resampled_data_size = int(len(episode_data) / resampling_factor) + 1 or + 0 (depending on whether the division leaves a remainder)
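A usage sketch following the signatures above (the synthetic random-walk series is purely illustrative, and passing a plain 1-D price array is an assumption based on the episode_data description):

import numpy as np
from btgym.research.gps.oracle import Oracle

oracle = Oracle(action_space=(0, 1, 2, 3), time_threshold=5,
                pips_threshold=10, pips_scale=0.0001, kernel_size=5)

# Stand-in for an episode of unscaled price observations:
prices = 1.10 + 0.0001 * np.cumsum(np.random.randn(2000))

advice = oracle.fit(episode_data=prices, resampling_factor=4)
# Expected shape: [~len(prices) / 4, len(action_space)], each row a distribution
# over hold/buy/sell/close.
print(advice.shape)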
resample_data(episode_data, factor=1)[source]¶
Resamples raw observations by the given factor (corresponding to the skip_frame value) and estimates the mean value of each newly formed bar.
Parameters: - episode_data – np.array of shape [episode_length, values]
- factor – scalar
Returns: np.array of median Hi/Lo observations of size [int(episode_length/skip_frame) + 1, 1]
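An illustrative numpy re-statement of this kind of resampling, keeping the per-bar mean of high/low values (hypothetical helper, not the library source):

import numpy as np

def resample_sketch(episode_data, factor):
    data = np.asarray(episode_data, dtype=float).ravel()
    n_bars = int(np.ceil(len(data) / factor))
    out = np.empty((n_bars, 1))
    for i in range(n_bars):
        bar = data[i * factor:(i + 1) * factor]
        out[i, 0] = (bar.max() + bar.min()) / 2.0  # mean of Hi/Lo within the bar
    return out

print(resample_sketch(np.arange(10), factor=4).ravel())  # [1.5 5.5 8.5]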
class btgym.research.gps.oracle.Oracle2(action_space=(0, 1, 2, 3), gamma=1.0, **kwargs)[source]¶
[Less] irresponsible financial adviser.
Parameters: - action_space – actions to advise: 0 - hold, 1 - buy, 2 - sell, 3 - close
- gamma – price movement horizon discount, in (0, 1]
fit(episode_data, resampling_factor=1)[source]¶
Estimates the advised action probability distribution based on the data received.
Parameters: - episode_data – 1D np.array of unscaled price values in OHL[CV] format
- resampling_factor – factor by which to resample given data by taking min/max values inside every resampled bar
Returns: np.array of shape [resampled_data_size, action_space_size] holding probabilities of advised actions, where resampled_data_size = int(len(episode_data) / resampling_factor) + 1 or + 0 (depending on whether the division leaves a remainder)
static resample_data(episode_data, factor=1)[source]¶
Resamples raw observations by the given factor (corresponding to the skip_frame value) and estimates the mean value of each newly formed bar.
Parameters: - episode_data – np.array of shape [episode_length, values]
- factor – scalar
Returns: np.array of median Hi/Lo observations of size [int(episode_length/skip_frame) + 1, 1]