btgym.algorithms package¶
btgym.algorithms.nn subpackage¶
btgym.algorithms.policy subpackage¶
btgym.algorithms.runner subpackage¶
btgym.algorithms.launcher module¶
btgym.algorithms.worker module¶
class btgym.algorithms.worker.FastSaver(var_list=None, reshape=False, sharded=False, max_to_keep=5, keep_checkpoint_every_n_hours=10000.0, name=None, restore_sequentially=False, saver_def=None, builder=None, defer_build=False, allow_empty=False, write_version=2, pad_step_number=False, save_relative_paths=False, filename=None)
Disables write_meta_graph argument, which freezes entire process and is mostly useless.
Creates a Saver.
The constructor adds ops to save and restore variables.
var_list specifies the variables that will be saved and restored. It can be passed as a dict or a list:
- A dict of names to variables: The keys are the names that will be used to save or restore the variables in the checkpoint files.
- A list of variables: The variables will be keyed with their op name in the checkpoint files.
For example:
```python
v1 = tf.Variable(..., name='v1')
v2 = tf.Variable(..., name='v2')

# Pass the variables as a dict:
saver = tf.train.Saver({'v1': v1, 'v2': v2})

# Or pass them as a list.
saver = tf.train.Saver([v1, v2])

# Passing a list is equivalent to passing a dict with the variable op names
# as keys:
saver = tf.train.Saver({v.op.name: v for v in [v1, v2]})
```
The optional reshape argument, if True, allows restoring a variable from a save file where the variable had a different shape, but the same number of elements and type. This is useful if you have reshaped a variable and want to reload it from an older checkpoint.
The optional sharded argument, if True, instructs the saver to shard checkpoints per device.
Parameters: - var_list – A list of Variable/SaveableObject, or a dictionary mapping names to SaveableObjects. If None, defaults to the list of all saveable objects.
- reshape – If True, allows restoring parameters from a checkpoint where the variables have a different shape.
- sharded – If True, shard the checkpoints, one per device.
- max_to_keep – Maximum number of recent checkpoints to keep. Defaults to 5.
- keep_checkpoint_every_n_hours – How often to keep checkpoints. Defaults to 10,000 hours.
- name – String. Optional name to use as a prefix when adding operations.
- restore_sequentially – A Bool, which if true, causes restore of different variables to happen sequentially within each device. This can lower memory usage when restoring very large models.
- saver_def – Optional SaverDef proto to use instead of running the builder. This is only useful for specialty code that wants to recreate a Saver object for a previously built Graph that had a Saver. The saver_def proto should be the one returned by the as_saver_def() call of the Saver that was created for that Graph.
- builder – Optional SaverBuilder to use if a saver_def was not provided. Defaults to BulkSaverBuilder().
- defer_build – If True, defer adding the save and restore ops to the build() call. In that case build() should be called before finalizing the graph or using the saver.
- allow_empty – If False (default) raise an error if there are no variables in the graph. Otherwise, construct the saver anyway and make it a no-op.
- write_version – controls what format to use when saving checkpoints. It also affects certain filepath matching logic. The V2 format is the recommended choice: it is much more optimized than V1 in terms of memory required and latency incurred during restore. Regardless of this flag, the Saver is able to restore from both V2 and V1 checkpoints.
- pad_step_number – if True, pads the global step number in the checkpoint filepaths to some fixed width (8 by default). This is turned off by default.
- save_relative_paths – If True, will write relative paths to the checkpoint state file. This is needed if the user wants to copy the checkpoint directory and reload from the copied directory.
- filename – If known at graph construction time, filename used for variable loading/saving.
Raises: - TypeError – If var_list is invalid.
- ValueError – If any of the keys or values in var_list are not unique.
- RuntimeError – If eager execution is enabled and var_list does not specify a list of variables to save.
Eager compatibility: when eager execution is enabled, var_list must specify a list or dict of variables to save; otherwise a RuntimeError will be raised.
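Since FastSaver only changes checkpoint-writing behaviour, it can be used like a regular tf.train.Saver. A minimal, hedged sketch (TF1 graph mode; variables and checkpoint path are illustrative, not from the library docs):
```python
# Minimal sketch: FastSaver is constructed like tf.train.Saver, but save()
# skips writing the meta graph.
import tensorflow as tf
from btgym.algorithms.worker import FastSaver

v1 = tf.Variable(0.0, name='v1')
v2 = tf.Variable(1.0, name='v2')
saver = FastSaver(var_list=[v1, v2], max_to_keep=5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Illustrative path; no .meta file is expected alongside the checkpoint.
    saver.save(sess, '/tmp/btgym_ckpt/model', global_step=0)
```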
class btgym.algorithms.worker.Worker(env_config, policy_config, trainer_config, cluster_spec, job_name, task, log_dir, log_ckpt_subdir, initial_ckpt_dir, save_secs, log_level, max_env_steps, random_seed=None, render_last_env=False, test_mode=False)
Distributed tf worker class.
Sets up environment, trainer and starts training process in supervised session.
Parameters: - env_config – environment class_config_dict.
- policy_config – model policy estimator class_config_dict.
- trainer_config – algorithm class_config_dict.
- cluster_spec – tf.cluster specification.
- job_name – worker or parameter server.
- task – integer number, 0 is chief worker.
- log_dir – path for tb summaries and current checkpoints.
- log_ckpt_subdir – log_dir subdirectory to store current checkpoints
- initial_ckpt_dir – path for checkpoint to load as pre-trained model.
- save_secs – int, save model checkpoint every N secs.
- log_level – int, logbook.level
- max_env_steps – number of environment steps to run training on
- random_seed – int or None
- render_last_env – bool, if True, rendering is enabled for the last environment in the list; otherwise for the first.
- test_mode – if True - use Atari mode, BTGym otherwise.
Note:
Conventional self.global_step refers to the number of environment steps summarized over all environment instances, not to the number of policy optimizer train steps.
Every worker can run several environments in parallel, as specified by cluster_config['num_envs']. If 4 workers are used and num_envs=4, the total number of environments is 16. Every env instance has its own ThreadRunner process.
When using replay memory, keep in mind that every ThreadRunner keeps its own replay memory. If memory_size=2000, num_workers=4 and num_envs=4, the total replay memory size equals 32,000 frames (see the arithmetic check below).
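A back-of-the-envelope check of the sizes mentioned in the note (plain arithmetic, not btgym API; the names are descriptive only):
```python
# Plain arithmetic illustrating the note above.
num_workers = 4
num_envs = 4         # environments per worker, from cluster_config['num_envs']
memory_size = 2000   # replay memory size per ThreadRunner, in frames

total_envs = num_workers * num_envs              # 16 environment instances
total_replay_frames = total_envs * memory_size   # 32,000 frames held overall
print(total_envs, total_replay_frames)
```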
btgym.algorithms.aac module¶
class btgym.algorithms.aac.BaseAAC(env, task, policy_config, log_level, name='AAC', on_policy_loss=<function aac_loss_def>, off_policy_loss=<function aac_loss_def>, vr_loss=<function value_fn_loss_def>, rp_loss=<function rp_loss_def>, pc_loss=<function pc_loss_def>, runner_config=None, runner_fn_ref=<function BaseEnvRunnerFn>, cluster_spec=None, random_seed=None, model_gamma=0.99, model_gae_lambda=1.0, model_beta=0.01, opt_max_env_steps=10000000, opt_decay_steps=None, opt_end_learn_rate=None, opt_learn_rate=0.0001, opt_decay=0.99, opt_momentum=0.0, opt_epsilon=1e-08, rollout_length=20, time_flat=False, episode_train_test_cycle=(1, 0), episode_summary_freq=2, env_render_freq=10, model_summary_freq=100, test_mode=False, replay_memory_size=2000, replay_batch_size=None, replay_rollout_length=None, use_off_policy_aac=False, use_reward_prediction=False, use_pixel_control=False, use_value_replay=False, rp_lambda=1.0, pc_lambda=1.0, vr_lambda=1.0, off_aac_lambda=1, gamma_pc=0.9, rp_reward_threshold=0.1, rp_sequence_size=3, clip_epsilon=0.1, num_epochs=1, pi_prime_update_period=1, global_step_op=None, global_episode_op=None, inc_episode_op=None, _use_global_network=True, _use_target_policy=False, _use_local_memory=False, aux_render_modes=None, **kwargs)
Base Asynchronous Advantage Actor Critic algorithm framework class with auxiliary control tasks and an option to run several environment instances per worker in vectorized, PAAC-like fashion. Can be configured to run with different losses and policies.
Auxiliary tasks implementation borrows heavily from Kosuke Miyoshi code, under Apache License 2.0: https://miyosuda.github.io/ https://github.com/miyosuda/unreal
Original A3C code comes from OpenAI repository under MIT licence: https://github.com/openai/universe-starter-agent
Papers: https://arxiv.org/abs/1602.01783 https://arxiv.org/abs/1611.05397
Parameters: - env – environment instance or list of instances
- task – int, parent worker id
- policy_config – policy estimator class and configuration dictionary
- log_level – int, logbook.level
- name – str, class-wide name-scope
- on_policy_loss – callable returning tensor holding on_policy training loss graph and summaries
- off_policy_loss – callable returning tensor holding off_policy training loss graph and summaries
- vr_loss – callable returning tensor holding value replay loss graph and summaries
- rp_loss – callable returning tensor holding reward prediction loss graph and summaries
- pc_loss – callable returning tensor holding pixel_control loss graph and summaries
- runner_config – runner class and configuration dictionary,
- runner_fn_ref – callable defining environment runner execution logic, valid only if no ‘runner_config’ arg is provided
- cluster_spec – dict, full training cluster spec (may be used by meta-trainer)
- random_seed – int or None
- model_gamma – scalar, gamma discount factor
- model_gae_lambda – scalar, GAE lambda
- model_beta – entropy regularization beta, scalar or [high_bound, low_bound] for log_uniform.
- opt_max_env_steps – int, total number of environment steps to run training on.
- opt_decay_steps – int, learn ratio decay steps, in number of environment steps.
- opt_end_learn_rate – scalar, final learn rate
- opt_learn_rate – start learn rate, scalar or [high_bound, low_bound] for log_uniform distr.
- opt_decay – scalar, optimizer decay, if applicable.
- opt_momentum – scalar, optimizer momentum, if applicable.
- opt_epsilon – scalar, optimizer epsilon
- rollout_length – int, on-policy rollout length
- time_flat – bool, flatten rnn time-steps in rollouts while training - see Notes below
- episode_train_test_cycle – tuple or list as (train_number, test_number), def=(1,0): enables infinite loop such as: run train_number of train data episodes, then test_number of test data episodes, repeat. Should be consistent with provided dataset parameters (test data should exist if test_number > 0)
- episode_summary_freq – int, write episode summary for every i’th episode
- env_render_freq – int, write environment rendering summary for every i’th train step
- model_summary_freq – int, write model summary for every i’th train step
- test_mode – bool, True: Atari, False: BTGym
- replay_memory_size – int, in number of experiences
- replay_batch_size – int, mini-batch size for off-policy training, def = 1
- replay_rollout_length – int, off-policy rollout length; by default equals the on-policy rollout_length
- use_off_policy_aac – bool, use full AAC off-policy loss instead of Value-replay
- use_reward_prediction – bool, use aux. off-policy reward prediction task
- use_pixel_control – bool, use aux. off-policy pixel control task
- use_value_replay – bool, use aux. off-policy value replay task (not used if use_off_policy_aac=True)
- rp_lambda – reward prediction loss weight, scalar or [high, low] for log_uniform distr.
- pc_lambda – pixel control loss weight, scalar or [high, low] for log_uniform distr.
- vr_lambda – value replay loss weight, scalar or [high, low] for log_uniform distr.
- off_aac_lambda – off-policy AAC loss weight, scalar or [high, low] for log_uniform distr.
- gamma_pc – NOT USED
- rp_reward_threshold – scalar, reward prediction classification threshold, above which reward is ‘non-zero’
- rp_sequence_size – int, reward prediction sample size, in number of experiences
- clip_epsilon – scalar, PPO: surrogate L^clip epsilon
- num_epochs – int, num. of SGD runs for every train step, val. > 1 should be used with caution.
- pi_prime_update_period – int, PPO: pi to pi_old update period in number of train steps, def: 1
- global_step_op – external tf.variable holding global step counter
- global_episode_op – external tf.variable holding global episode counter
- inc_episode_op – external tf.op incrementing global step counter
- _use_global_network – bool, whether to use parameter server policy instance
- _use_target_policy – bool, PPO: use target policy (aka pi_old), delayed by pi_prime_update_period delay
- _use_local_memory – bool: use in-process replay memory instead of runner-based one
- aux_render_modes – additional visualisations to include in per-episode rendering summary
Note
On time_flat arg:
There are two alternatives to run the RNN part of the policy estimator:
- Feed initial RNN state for every experience frame in a rollout (those are stored anyway if we want random memory replay sampling) and do a single time-step RNN advance for all experiences in a batch; this is when time_flat=True.
- Reshape the incoming batch after the convolution part of the network in time-wise fashion, one entry per rollout, i.e. batch_size=number_of_rollouts and rnn_timesteps=max_rollout_length. In this case we need to feed initial rnn_states for rollouts only. There is a little extra work to pad rollouts to max_time_size and feed true rollout lengths to the rnn. Thus, when time_flat=False, we unroll the RNN for the specified number of time-steps for every rollout.
Both options have pros and cons (see the shape sketch below):
- Unrolling the dynamic RNN is computationally more expensive but gives clearly faster convergence, [possibly] due to the fact that RNN states for the 2nd, 3rd, … frames of a rollout are computed using the updated policy estimator, which is supposed to be closer to the optimal one. When time-flattened, every time-step uses RNN states computed when the rollout was collected (i.e. by the behavioral policy estimator with older parameters).
- Nevertheless, time_flat:
  - allows use of static RNN;
  - lets one safely shuffle the training batch or mix on-policy and off-policy data in a single mini-batch, ensuring the iid property;
  - allows second-order derivatives, which is impossible in the current tf dynamic RNN implementation as it uses tf.while_loop internally;
  - is computationally cheaper.
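A minimal numpy sketch of the two batch layouts, with hypothetical sizes, mirroring the Rollout.process() output shapes documented later on this page (illustrative only, not btgym code):
```python
# Illustrative shapes only.
import numpy as np

time_size, depth, context_depth = 20, 64, 128

# time_flat=False: one batch entry per rollout, unrolled over (padded) time,
# with a single initial RNN context for the whole trajectory.
unrolled_batch = np.zeros((1, time_size, depth))
unrolled_context = np.zeros((1, context_depth))

# time_flat=True: the time dimension collapses to one step; every frame becomes
# a batch entry and carries the RNN context stored when it was collected.
flat_batch = np.zeros((time_size, 1, depth))
flat_context = np.zeros((time_size, context_depth))
```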
get_data(**kwargs)
Collect rollouts from every environment.
Returns: dictionary of lists of data streams collected from every runner
get_sample_config(_new_trial=True, **kwargs)
WARNING: _new_trial=True is a quick fix, TODO: fix it properly! Returns environment configuration parameters for the next episode to sample. By default this is a simple stateful iterator that works correctly with the DTGymDataset data class, repeating the cycle:
- sample num_train_episodes from train data,
- sample num_test_episodes from test data.
Convention: supposed to override the dummy method of the local policy instance; see inside the ._make_policy() method.
Returns: configuration dictionary of type btgym.datafeed.base.EnvResetConfig
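A hedged sketch of the cycling behaviour described above (not the actual implementation; the 0 = train / 1 = test convention is an assumption for illustration):
```python
# Sketch only: alternate num_train_episodes train episodes with
# num_test_episodes test episodes, forever.
import itertools

def make_sample_type_cycle(num_train_episodes=1, num_test_episodes=0):
    # 0 marks a train episode, 1 marks a test episode (assumed convention).
    pattern = [0] * num_train_episodes + [1] * num_test_episodes
    return itertools.cycle(pattern)

cycle = make_sample_type_cycle(2, 1)
print([next(cycle) for _ in range(6)])  # [0, 0, 1, 0, 0, 1]
```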
start(sess, summary_writer, **kwargs)
Executes all initializing operations, starts environment runner[s]. Supposed to be called by parent worker just before training loop starts.
Parameters: - sess – tf session object.
- kwargs – not used by default.
process_data(sess, data, is_train, pi, pi_prime=None)
Processes data and composes the train step feed dictionary.
Parameters: - sess – tf session obj.
- data – dict, data dictionary
- is_train – bool, whether the provided data is train or test
- pi – policy to feed
- pi_prime – optional policy to feed
Returns: feed_dict – train step feed dictionary. Return type: dict
process_summary(sess, data, model_data=None, step=None, episode=None)
Fetches and writes summary data from data and model_data.
Parameters: - sess – tf session obj.
- data – dict, thread_runner rollouts and metadata
- model_data – dict, model summary data
- step – int, global step or None
- episode – int, global episode number or None
class btgym.algorithms.aac.Unreal(**kwargs)
Unreal: Asynchronous Advantage Actor Critic with auxiliary control tasks.
Auxiliary tasks implementation borrows heavily from Kosuke Miyoshi code, under Apache License 2.0: https://miyosuda.github.io/ https://github.com/miyosuda/unreal
Original A3C code comes from OpenAI repository under MIT licence: https://github.com/openai/universe-starter-agent
Papers: https://arxiv.org/abs/1602.01783 https://arxiv.org/abs/1611.05397
See BaseAAC class args for details:
Parameters: - env – environment instance or list of instances
- task – int, parent worker id
- policy_config – policy estimator class and configuration dictionary
- log_level – int, logbook.level
- on_policy_loss – callable returning tensor holding on_policy training loss graph and summaries
- off_policy_loss – callable returning tensor holding off_policy training loss graph and summaries
- vr_loss – callable returning tensor holding value replay loss graph and summaries
- rp_loss – callable returning tensor holding reward prediction loss graph and summaries
- pc_loss – callable returning tensor holding pixel_control loss graph and summaries
- random_seed – int or None
- model_gamma – scalar, gamma discount factor
- model_gae_lambda – scalar, GAE lambda
- model_beta – entropy regularization beta, scalar or [high_bound, low_bound] for log_uniform.
- opt_max_env_steps – int, total number of environment steps to run training on.
- opt_decay_steps – int, learn ratio decay steps, in number of environment steps.
- opt_end_learn_rate – scalar, final learn rate
- opt_learn_rate – start learn rate, scalar or [high_bound, low_bound] for log_uniform distr.
- opt_decay – scalar, optimizer decay, if applicable.
- opt_momentum – scalar, optimizer momentum, if applicable.
- opt_epsilon – scalar, optimizer epsilon
- rollout_length – int, on-policy rollout length
- time_flat – bool, flatten rnn time-steps in rollouts while training - see Notes below
- episode_train_test_cycle – tuple or list as (train_number, test_number), def=(1,0): enables infinite loop such as: run train_number of train data episodes, then test_number of test data episodes, repeat. Should be consistent with provided dataset parameters (test data should exist if test_number > 0)
- episode_summary_freq – int, write episode summary for every i’th episode
- env_render_freq – int, write environment rendering summary for every i’th train step
- model_summary_freq – int, write model summary for every i’th train step
- test_mode – bool, True: Atari, False: BTGym
- replay_memory_size – int, in number of experiences
- replay_batch_size – int, mini-batch size for off-policy training, def = 1
- replay_rollout_length – int, off-policy rollout length; by default equals the on-policy rollout_length
- use_off_policy_aac – bool, use full AAC off-policy loss instead of Value-replay
- use_reward_prediction – bool, use aux. off-policy reward prediction task
- use_pixel_control – bool, use aux. off-policy pixel control task
- use_value_replay – bool, use aux. off-policy value replay task (not used if use_off_policy_aac=True)
- rp_lambda – reward prediction loss weight, scalar or [high, low] for log_uniform distr.
- pc_lambda – pixel control loss weight, scalar or [high, low] for log_uniform distr.
- vr_lambda – value replay loss weight, scalar or [high, low] for log_uniform distr.
- off_aac_lambda – off-policy AAC loss weight, scalar or [high, low] for log_uniform distr.
- gamma_pc – NOT USED
- rp_reward_threshold – scalar, reward prediction classification threshold, above which reward is ‘non-zero’
- rp_sequence_size – int, reward prediction sample size, in number of experiences
- clip_epsilon – scalar, PPO: surrogate L^clip epsilon
- num_epochs – int, num. of SGD runs for every train step, val. > 1 should be used with caution.
- pi_prime_update_period – int, PPO: pi to pi_old update period in number of train steps, def: 1
- _use_target_policy – bool, PPO: use target policy (aka pi_old), delayed by pi_prime_update_period delay
Note
On time_flat arg:
There are two alternatives to run the RNN part of the policy estimator:
- Feed initial RNN state for every experience frame in a rollout (those are stored anyway if we want random memory replay sampling) and do a single time-step RNN advance for all experiences in a batch; this is when time_flat=True.
- Reshape the incoming batch after the convolution part of the network in time-wise fashion, one entry per rollout, i.e. batch_size=number_of_rollouts and rnn_timesteps=max_rollout_length. In this case we need to feed initial rnn_states for rollouts only. There is a little extra work to pad rollouts to max_time_size and feed true rollout lengths to the rnn. Thus, when time_flat=False, we unroll the RNN for the specified number of time-steps for every rollout.
Both options have pros and cons:
- Unrolling the dynamic RNN is computationally more expensive but gives clearly faster convergence, [possibly] due to the fact that RNN states for the 2nd, 3rd, … frames of a rollout are computed using the updated policy estimator, which is supposed to be closer to the optimal one. When time-flattened, every time-step uses RNN states computed when the rollout was collected (i.e. by the behavioral policy estimator with older parameters).
- Nevertheless, time-flattening can be useful because one can safely shuffle the training batch or mix on-policy and off-policy data in a single mini-batch, ensuring the iid property and allowing, say, proper batch normalisation (this has yet to be tested).
class btgym.algorithms.aac.A3C(**kwargs)
Vanilla Asynchronous Advantage Actor Critic algorithm.
Based on original code taken from OpenAI repository under MIT licence: https://github.com/openai/universe-starter-agent
Paper: https://arxiv.org/abs/1602.01783
A3C args are a subset of BaseAAC arguments; see the BaseAAC class for descriptions.
Parameters: - env –
- task –
- policy_config –
- log –
- random_seed –
- model_gamma –
- model_gae_lambda –
- model_beta –
- opt_max_env_steps –
- opt_decay_steps –
- opt_end_learn_rate –
- opt_learn_rate –
- opt_decay –
- opt_momentum –
- opt_epsilon –
- rollout_length –
- episode_summary_freq –
- env_render_freq –
- model_summary_freq –
- test_mode –
class btgym.algorithms.aac.PPO(**kwargs)
AAC with Proximal Policy Optimization surrogate L^Clip loss, optionally augmented with auxiliary control tasks.
paper: https://arxiv.org/pdf/1707.06347.pdf
Based on PPO-SGD code from OpenAI Baselines repository under MIT licence: https://github.com/openai/baselines
Async. framework code comes from OpenAI repository under MIT licence: https://github.com/openai/universe-starter-agent
PPO args are a subset of BaseAAC arguments; see the BaseAAC class for descriptions.
Parameters: - env –
- task –
- policy_config –
- log_level –
- vr_loss –
- rp_loss –
- pc_loss –
- random_seed –
- model_gamma –
- model_gae_lambda –
- model_beta –
- opt_max_env_steps –
- opt_decay_steps –
- opt_end_learn_rate –
- opt_learn_rate –
- opt_decay –
- opt_momentum –
- opt_epsilon –
- rollout_length –
- episode_summary_freq –
- env_render_freq –
- model_summary_freq –
- test_mode –
- replay_memory_size –
- replay_rollout_length –
- use_off_policy_aac –
- use_reward_prediction –
- use_pixel_control –
- use_value_replay –
- rp_lambda –
- pc_lambda –
- vr_lambda –
- off_aac_lambda –
- rp_reward_threshold –
- rp_sequence_size –
- clip_epsilon –
- num_epochs –
- pi_prime_update_period –
btgym.algorithms.rollout module¶
btgym.algorithms.rollout.make_data_getter(queue)
Data stream getter constructor.
Parameters: queue – instance of Queue class to get rollouts from. Returns: callable, returning dictionary of data.
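A minimal sketch of the idea (assumed behaviour, not the verbatim source): wrap a queue so the trainer can pull ready rollouts through a plain callable. The payload keys are hypothetical.
```python
from queue import Queue

def make_data_getter(queue):
    def get_data():
        # Blocks until a runner has put a rollout dictionary on the queue.
        return queue.get()
    return get_data

q = Queue()
q.put({'on_policy': [], 'terminal': False})  # hypothetical rollout payload
pull = make_data_getter(q)
print(pull())
```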
class btgym.algorithms.rollout.Rollout
Experience rollout as [nested] dictionary of lists of ndarrays, tuples and rnn states.
add(values, _struct=None)
Adds single experience frame to rollout.
Parameters: values – [nested] dictionary of values.
add_memory_sample(sample)
Given a replay memory sample as a list of experience dictionaries, converts it to a rollout of the same length.
process(gamma, gae_lambda=1.0, size=None, time_flat=False)
Converts a single-trajectory rollout of experiences to a dictionary of ready-to-feed arrays. Computes rollout returns and advantages. Pads with zeroes to the desired length if the size arg is given.
Parameters: - gamma – discount factor
- gae_lambda – GAE lambda
- size – if given and time_flat=False, pads outputs with zeroes along the time dim. to exact size.
- time_flat – reduce time dimension to 1 step by stacking all experiences along batch dimension.
Returns: batch as a [nested] dictionary of np.arrays, tuples and LSTMStateTuples, of size:
- [1, time_size, depth] (or [1, size, depth] if size is given) when time_flat=False, with a single context entry for the entire trajectory, i.e. of size [1, context_depth];
- [batch_size, 1, depth] when time_flat=True, with batch_size = time_size and a context entry for every experience frame, i.e. of size [batch_size, context_depth].
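A hedged usage sketch of the Rollout API documented above. The experience field names are illustrative; the real schema may require additional keys (e.g. bootstrap values or rnn context).
```python
from btgym.algorithms.rollout import Rollout
import numpy as np

rollout = Rollout()
for t in range(20):
    # Hypothetical experience frame layout.
    rollout.add(
        {
            'state': {'external': np.zeros((42, 42, 1))},
            'action': 0,
            'reward': 0.0,
            'value': 0.0,
            'terminal': t == 19,
        }
    )
# Compute returns and advantages for feeding the train step.
batch = rollout.process(gamma=0.99, gae_lambda=1.0, time_flat=False)
```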
process_rp(reward_threshold=0.1)
Processes the rollout similarly to process() and estimates reward prediction targets for the first n-1 frames.
Parameters: reward_threshold – reward values with |r| > reward_threshold are classified as negative or positive. Returns: Processed batch with size reduced by one and with an extra rp_target key holding one-hot encodings for classes {zero, positive, negative}.
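A sketch of the reward-class encoding implied above (not library code; the class ordering is an assumption):
```python
import numpy as np

def rp_one_hot(reward, reward_threshold=0.1):
    # Assumed class ordering: [zero, positive, negative].
    target = np.zeros(3)
    if reward > reward_threshold:
        target[1] = 1.0
    elif reward < -reward_threshold:
        target[2] = 1.0
    else:
        target[0] = 1.0
    return target

print(rp_one_hot(0.05), rp_one_hot(0.5), rp_one_hot(-0.5))
```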
btgym.algorithms.memory module¶
class btgym.algorithms.memory.Memory(history_size, max_sample_size, priority_sample_size, log_level=13, rollout_provider=None, task=-1, reward_threshold=0.1, use_priority_sampling=False)
Replay memory with rebalanced replay based on reward value.
Note
The memory must be filled up before calling sampling methods.
Parameters: - history_size – number of experiences stored;
- max_sample_size – maximum allowed sample size (e.g. off-policy rollout length);
- priority_sample_size – sample size of priority_sample() method
- log_level – int, logbook.level;
- rollout_provider – callable returning list of Rollouts NOT USED
- task – parent worker id;
- reward_threshold – if |experience.reward| > reward_threshold: experience is saved as ‘prioritized’;
add(frame)
Appends single experience frame to memory.
Parameters: frame – dictionary of values.
add_rollout(rollout)
Adds frames from given rollout to memory with respect to episode continuation.
Parameters: rollout – Rollout instance.
fill()
Fills replay memory with initial experiences. NOT USED. Supposed to be called by parent worker() just before training begins.
Parameters: rollout_getter – callable, returning list of Rollouts.
sample_uniform(sequence_size)
Uniformly samples sequence of successive frames of size sequence_size or less (~off-policy rollout).
Parameters: sequence_size – maximum sample size. Returns: instance of Rollout of size <= sequence_size.
_sample_priority(size=None, exact_size=False, skewness=2, sample_attempts=100)
Implements rebalanced replay. Samples a sequence of successive frames from a distribution skewed by the reward of the last sample frame.
Parameters: - size – sample size, must be <= self.max_sample_size;
- exact_size – whether accept sample with size less than ‘size’ or re-sample to get sample of exact size (used for reward prediction task);
- skewness – int >= 1, sampling probability denominator, such that the probability of sampling a sequence whose last frame has non-zero reward is p[non_zero] = 1/skewness;
- sample_attempts – if exact_size=True, sets the number of re-sampling attempts to get a sample of continuous experiences (no Terminal frames inside except the last one); if this number is reached, the sample is returned as is.
Returns: instance of Rollout().
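A hedged usage sketch of the Memory API documented above (rollout contents and sizes are illustrative; the memory must be filled before sampling):
```python
from btgym.algorithms.memory import Memory

memory = Memory(
    history_size=2000,        # experiences stored
    max_sample_size=20,       # e.g. off-policy rollout length
    priority_sample_size=3,   # e.g. reward prediction sequence size
    reward_threshold=0.1,
)
# ... after enough memory.add_rollout(rollout) calls to fill the memory up:
off_policy_rollout = memory.sample_uniform(sequence_size=20)
rp_rollout = memory._sample_priority(size=3, exact_size=True)
```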
btgym.algorithms.envs module¶
class btgym.algorithms.envs.AtariRescale42x42(env_id=None)
Gym wrapper that pipes Atari into BTgym algorithms, as the latter expect observations to be a DictSpace. Makes the Atari environment return state as a dictionary with a single key 'external' holding a grayscale 42x42 visual output normalized to [0, 1].
Parameters: env_id – conventional Gym id.
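A hedged usage sketch (the env id is just an example; the observation layout follows the description above):
```python
from btgym.algorithms.envs import AtariRescale42x42

env = AtariRescale42x42('BreakoutDeterministic-v4')
obs = env.reset()
frame = obs['external']   # 42x42 grayscale frame, values normalized to [0, 1]
```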