btgym.algorithms package¶
btgym.algorithms.nn subpackage¶
btgym.algorithms.policy subpackage¶
btgym.algorithms.runner subpackage¶
btgym.algorithms.launcher module¶
btgym.algorithms.worker module¶
class btgym.algorithms.worker.FastSaver(var_list=None, reshape=False, sharded=False, max_to_keep=5, keep_checkpoint_every_n_hours=10000.0, name=None, restore_sequentially=False, saver_def=None, builder=None, defer_build=False, allow_empty=False, write_version=2, pad_step_number=False, save_relative_paths=False, filename=None)
Disables write_meta_graph argument, which freezes entire process and is mostly useless.
Creates a Saver.
The constructor adds ops to save and restore variables.
var_list specifies the variables that will be saved and restored. It can be passed as a dict or a list:
- A dict of names to variables: The keys are the names that will be used to save or restore the variables in the checkpoint files.
- A list of variables: The variables will be keyed with their op name in the checkpoint files.
For example:
```python
v1 = tf.Variable(..., name='v1')
v2 = tf.Variable(..., name='v2')

# Pass the variables as a dict:
saver = tf.train.Saver({'v1': v1, 'v2': v2})

# Or pass them as a list.
saver = tf.train.Saver([v1, v2])

# Passing a list is equivalent to passing a dict with the variable op names
# as keys:
saver = tf.train.Saver({v.op.name: v for v in [v1, v2]})
```
The optional reshape argument, if True, allows restoring a variable from a save file where the variable had a different shape, but the same number of elements and type. This is useful if you have reshaped a variable and want to reload it from an older checkpoint.
The optional sharded argument, if True, instructs the saver to shard checkpoints per device.
Parameters: - var_list – A list of Variable/SaveableObject, or a dictionary mapping names to SaveableObjects. If None, defaults to the list of all saveable objects.
- reshape – If True, allows restoring parameters from a checkpoint where the variables have a different shape.
- sharded – If True, shard the checkpoints, one per device.
- max_to_keep – Maximum number of recent checkpoints to keep. Defaults to 5.
- keep_checkpoint_every_n_hours – How often to keep checkpoints. Defaults to 10,000 hours.
- name – String. Optional name to use as a prefix when adding operations.
- restore_sequentially – A Bool, which if true, causes restore of different variables to happen sequentially within each device. This can lower memory usage when restoring very large models.
- saver_def – Optional SaverDef proto to use instead of running the builder. This is only useful for specialty code that wants to recreate a Saver object for a previously built Graph that had a Saver. The saver_def proto should be the one returned by the as_saver_def() call of the Saver that was created for that Graph.
- builder – Optional SaverBuilder to use if a saver_def was not provided. Defaults to BulkSaverBuilder().
- defer_build – If True, defer adding the save and restore ops to the build() call. In that case build() should be called before finalizing the graph or using the saver.
- allow_empty – If False (default) raise an error if there are no variables in the graph. Otherwise, construct the saver anyway and make it a no-op.
- write_version – controls what format to use when saving checkpoints. It also affects certain filepath matching logic. The V2 format is the recommended choice: it is much more optimized than V1 in terms of memory required and latency incurred during restore. Regardless of this flag, the Saver is able to restore from both V2 and V1 checkpoints.
- pad_step_number – if True, pads the global step number in the checkpoint filepaths to some fixed width (8 by default). This is turned off by default.
- save_relative_paths – If True, will write relative paths to the checkpoint state file. This is needed if the user wants to copy the checkpoint directory and reload from the copied directory.
- filename – If known at graph construction time, filename used for variable loading/saving.
Raises: - TypeError – If var_list is invalid.
- ValueError – If any of the keys or values in var_list are not unique.
- RuntimeError – If eager execution is enabled and var_list does not specify a list of variables to save.
Eager compatibility: when eager execution is enabled, var_list must specify a list or dict of variables to save; otherwise a RuntimeError will be raised.
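Since FastSaver only changes checkpoint-writing behaviour, it can be used like a regular tf.train.Saver. A minimal, hedged sketch (TF1 graph mode; variables and checkpoint path are illustrative, not from the library docs):
```python
# Minimal sketch: FastSaver is constructed like tf.train.Saver, but save()
# skips writing the meta graph.
import tensorflow as tf
from btgym.algorithms.worker import FastSaver

v1 = tf.Variable(0.0, name='v1')
v2 = tf.Variable(1.0, name='v2')
saver = FastSaver(var_list=[v1, v2], max_to_keep=5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Illustrative path; no .meta file is expected alongside the checkpoint.
    saver.save(sess, '/tmp/btgym_ckpt/model', global_step=0)
```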
class btgym.algorithms.worker.Worker(env_config, policy_config, trainer_config, cluster_spec, job_name, task, log_dir, log_ckpt_subdir, initial_ckpt_dir, save_secs, log_level, max_env_steps, random_seed=None, render_last_env=False, test_mode=False)
Distributed tf worker class.
Sets up environment, trainer and starts training process in supervised session.
Parameters: - env_config – environment class_config_dict.
- policy_config – model policy estimator class_config_dict.
- trainer_config – algorithm class_config_dict.
- cluster_spec – tf.cluster specification.
- job_name – worker or parameter server.
- task – integer number, 0 is chief worker.
- log_dir – path for tb summaries and current checkpoints.
- log_ckpt_subdir – log_dir subdirectory to store current checkpoints
- initial_ckpt_dir – path for checkpoint to load as pre-trained model.
- save_secs – int, save model checkpoint every N secs.
- log_level – int, logbook.level
- max_env_steps – number of environment steps to run training on
- random_seed – int or None
- render_last_env – bool, if True, rendering is enabled for the last environment in the list; otherwise for the first.
- test_mode – if True - use Atari mode, BTGym otherwise.
Note:
Conventional self.global_step refers to the number of environment steps summarized over all environment instances, not to the number of policy optimizer train steps.
Every worker can run several environments in parallel, as specified by cluster_config['num_envs']. If 4 workers are used and num_envs=4, the total number of environments is 16. Every env instance has its own ThreadRunner process.
When using replay memory, keep in mind that every ThreadRunner keeps its own replay memory. If memory_size=2000, num_workers=4 and num_envs=4, the total replay memory size equals 32,000 frames (see the arithmetic check below).
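A back-of-the-envelope check of the sizes mentioned in the note (plain arithmetic, not btgym API; the names are descriptive only):
```python
# Plain arithmetic illustrating the note above.
num_workers = 4
num_envs = 4         # environments per worker, from cluster_config['num_envs']
memory_size = 2000   # replay memory size per ThreadRunner, in frames

total_envs = num_workers * num_envs              # 16 environment instances
total_replay_frames = total_envs * memory_size   # 32,000 frames held overall
print(total_envs, total_replay_frames)
```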
btgym.algorithms.aac module¶
class btgym.algorithms.aac.BaseAAC(env, task, policy_config, log_level, name='AAC', on_policy_loss=<function aac_loss_def>, off_policy_loss=<function aac_loss_def>, vr_loss=<function value_fn_loss_def>, rp_loss=<function rp_loss_def>, pc_loss=<function pc_loss_def>, runner_config=None, runner_fn_ref=<function BaseEnvRunnerFn>, cluster_spec=None, random_seed=None, model_gamma=0.99, model_gae_lambda=1.0, model_beta=0.01, opt_max_env_steps=10000000, opt_decay_steps=None, opt_end_learn_rate=None, opt_learn_rate=0.0001, opt_decay=0.99, opt_momentum=0.0, opt_epsilon=1e-08, rollout_length=20, time_flat=False, episode_train_test_cycle=(1, 0), episode_summary_freq=2, env_render_freq=10, model_summary_freq=100, test_mode=False, replay_memory_size=2000, replay_batch_size=None, replay_rollout_length=None, use_off_policy_aac=False, use_reward_prediction=False, use_pixel_control=False, use_value_replay=False, rp_lambda=1.0, pc_lambda=1.0, vr_lambda=1.0, off_aac_lambda=1, gamma_pc=0.9, rp_reward_threshold=0.1, rp_sequence_size=3, clip_epsilon=0.1, num_epochs=1, pi_prime_update_period=1, global_step_op=None, global_episode_op=None, inc_episode_op=None, _use_global_network=True, _use_target_policy=False, _use_local_memory=False, aux_render_modes=None, **kwargs)
Base Asynchronous Advantage Actor Critic algorithm framework class with auxiliary control tasks and an option to run several environment instances per worker in vectorized, PAAC-like fashion. Can be configured to run with different losses and policies.
Auxiliary tasks implementation borrows heavily from Kosuke Miyoshi code, under Apache License 2.0: https://miyosuda.github.io/ https://github.com/miyosuda/unreal
Original A3C code comes from OpenAI repository under MIT licence: https://github.com/openai/universe-starter-agent
Papers: https://arxiv.org/abs/1602.01783 https://arxiv.org/abs/1611.05397
Parameters: - env – environment instance or list of instances
- task – int, parent worker id
- policy_config – policy estimator class and configuration dictionary
- log_level – int, logbook.level
- name – str, class-wide name-scope
- on_policy_loss – callable returning tensor holding on_policy training loss graph and summaries
- off_policy_loss – callable returning tensor holding off_policy training loss graph and summaries
- vr_loss – callable returning tensor holding value replay loss graph and summaries
- rp_loss – callable returning tensor holding reward prediction loss graph and summaries
- pc_loss – callable returning tensor holding pixel_control loss graph and summaries
- runner_config – runner class and configuration dictionary,
- runner_fn_ref – callable defining environment runner execution logic, valid only if no ‘runner_config’ arg is provided
- cluster_spec – dict, full training cluster spec (may be used by meta-trainer)
- random_seed – int or None
- model_gamma – scalar, gamma discount factor
- model_gae_lambda – scalar, GAE lambda
- model_beta – entropy regularization beta, scalar or [high_bound, low_bound] for log_uniform.
- opt_max_env_steps – int, total number of environment steps to run training on.
- opt_decay_steps – int, learn ratio decay steps, in number of environment steps.
- opt_end_learn_rate – scalar, final learn rate
- opt_learn_rate – start learn rate, scalar or [high_bound, low_bound] for log_uniform distr.
- opt_decay – scalar, optimizer decay, if applicable.
- opt_momentum – scalar, optimizer momentum, if applicable.
- opt_epsilon – scalar, optimizer epsilon
- rollout_length – int, on-policy rollout length
- time_flat – bool, flatten rnn time-steps in rollouts while training - see Notes below
- episode_train_test_cycle – tuple or list as (train_number, test_number), def=(1,0): enables infinite loop such as: run train_number of train data episodes, then test_number of test data episodes, repeat. Should be consistent with provided dataset parameters (test data should exist if test_number > 0)
- episode_summary_freq – int, write episode summary for every i’th episode
- env_render_freq – int, write environment rendering summary for every i’th train step
- model_summary_freq – int, write model summary for every i’th train step
- test_mode – bool, True: Atari, False: BTGym
- replay_memory_size – int, in number of experiences
- replay_batch_size – int, mini-batch size for off-policy training, def = 1
- replay_rollout_length – int, off-policy rollout length; by default equals the on-policy rollout_length
- use_off_policy_aac – bool, use full AAC off-policy loss instead of Value-replay
- use_reward_prediction – bool, use aux. off-policy reward prediction task
- use_pixel_control – bool, use aux. off-policy pixel control task
- use_value_replay – bool, use aux. off-policy value replay task (not used if use_off_policy_aac=True)
- rp_lambda – reward prediction loss weight, scalar or [high, low] for log_uniform distr.
- pc_lambda – pixel control loss weight, scalar or [high, low] for log_uniform distr.
- vr_lambda – value replay loss weight, scalar or [high, low] for log_uniform distr.
- off_aac_lambda – off-policy AAC loss weight, scalar or [high, low] for log_uniform distr.
- gamma_pc – NOT USED
- rp_reward_threshold – scalar, reward prediction classification threshold, above which reward is ‘non-zero’
- rp_sequence_size – int, reward prediction sample size, in number of experiences
- clip_epsilon – scalar, PPO: surrogate L^clip epsilon
- num_epochs – int, num. of SGD runs for every train step, val. > 1 should be used with caution.
- pi_prime_update_period – int, PPO: pi to pi_old update period in number of train steps, def: 1
- global_step_op – external tf.variable holding global step counter
- global_episode_op – external tf.variable holding global episode counter
- inc_episode_op – external tf.op incrementing global step counter
- _use_global_network – bool, whether to use parameter server policy instance
- _use_target_policy – bool, PPO: use target policy (aka pi_old), delayed by pi_prime_update_period delay
- _use_local_memory – bool: use in-process replay memory instead of runner-based one
- aux_render_modes – additional visualisations to include in per-episode rendering summary
Note
On time_flat arg:
There are two alternatives to run the RNN part of the policy estimator:
- Feed initial RNN state for every experience frame in a rollout (those are stored anyway if we want random memory replay sampling) and do a single time-step RNN advance for all experiences in a batch; this is when time_flat=True.
- Reshape the incoming batch after the convolution part of the network in time-wise fashion, one entry per rollout, i.e. batch_size=number_of_rollouts and rnn_timesteps=max_rollout_length. In this case we need to feed initial rnn_states for rollouts only. There is a little extra work to pad rollouts to max_time_size and feed true rollout lengths to the rnn. Thus, when time_flat=False, we unroll the RNN for the specified number of time-steps for every rollout.
Both options have pros and cons (see the shape sketch below):
- Unrolling the dynamic RNN is computationally more expensive but gives clearly faster convergence, [possibly] due to the fact that RNN states for the 2nd, 3rd, … frames of a rollout are computed using the updated policy estimator, which is supposed to be closer to the optimal one. When time-flattened, every time-step uses RNN states computed when the rollout was collected (i.e. by the behavioral policy estimator with older parameters).
- Nevertheless, time_flat:
  - allows use of static RNN;
  - lets one safely shuffle the training batch or mix on-policy and off-policy data in a single mini-batch, ensuring the iid property;
  - allows second-order derivatives, which is impossible in the current tf dynamic RNN implementation as it uses tf.while_loop internally;
  - is computationally cheaper.
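A minimal numpy sketch of the two batch layouts, with hypothetical sizes, mirroring the Rollout.process() output shapes documented later on this page (illustrative only, not btgym code):
```python
# Illustrative shapes only.
import numpy as np

time_size, depth, context_depth = 20, 64, 128

# time_flat=False: one batch entry per rollout, unrolled over (padded) time,
# with a single initial RNN context for the whole trajectory.
unrolled_batch = np.zeros((1, time_size, depth))
unrolled_context = np.zeros((1, context_depth))

# time_flat=True: the time dimension collapses to one step; every frame becomes
# a batch entry and carries the RNN context stored when it was collected.
flat_batch = np.zeros((time_size, 1, depth))
flat_context = np.zeros((time_size, context_depth))
```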
get_data(**kwargs)
Collect rollouts from every environment.
Returns: dictionary of lists of data streams collected from every runner
get_sample_config(_new_trial=True, **kwargs)
WARNING: _new_trial=True is a quick fix, TODO: fix it properly! Returns environment configuration parameters for the next episode to sample. By default this is a simple stateful iterator that works correctly with the DTGymDataset data class, repeating the cycle:
- sample num_train_episodes from train data,
- sample num_test_episodes from test data.
Convention: supposed to override the dummy method of the local policy instance; see inside the ._make_policy() method.
Returns: configuration dictionary of type btgym.datafeed.base.EnvResetConfig
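A hedged sketch of the cycling behaviour described above (not the actual implementation; the 0 = train / 1 = test convention is an assumption for illustration):
```python
# Sketch only: alternate num_train_episodes train episodes with
# num_test_episodes test episodes, forever.
import itertools

def make_sample_type_cycle(num_train_episodes=1, num_test_episodes=0):
    # 0 marks a train episode, 1 marks a test episode (assumed convention).
    pattern = [0] * num_train_episodes + [1] * num_test_episodes
    return itertools.cycle(pattern)

cycle = make_sample_type_cycle(2, 1)
print([next(cycle) for _ in range(6)])  # [0, 0, 1, 0, 0, 1]
```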
start(sess, summary_writer, **kwargs)
Executes all initializing operations, starts environment runner[s]. Supposed to be called by parent worker just before training loop starts.
Parameters: - sess – tf session object.
- kwargs – not used by default.
process_data(sess, data, is_train, pi, pi_prime=None)
Processes data and composes the train step feed dictionary.
Parameters: - sess – tf session obj.
- data – dict, data dictionary
- is_train – bool, whether the provided data is train or test
- pi – policy to feed
- pi_prime – optional policy to feed
Returns: feed_dict – train step feed dictionary. Return type: dict
process_summary(sess, data, model_data=None, step=None, episode=None)
Fetches and writes summary data from data and model_data.
Parameters: - sess – tf session obj.
- data – dict, thread_runner rollouts and metadata
- model_data – dict, model summary data
- step – int, global step or None
- episode – int, global episode number or None
class btgym.algorithms.aac.Unreal(**kwargs)
Unreal: Asynchronous Advantage Actor Critic with auxiliary control tasks.
Auxiliary tasks implementation borrows heavily from Kosuke Miyoshi code, under Apache License 2.0: https://miyosuda.github.io/ https://github.com/miyosuda/unreal
Original A3C code comes from OpenAI repository under MIT licence: https://github.com/openai/universe-starter-agent
Papers: https://arxiv.org/abs/1602.01783 https://arxiv.org/abs/1611.05397
See BaseAAC class args for details:
Parameters: - env – environment instance or list of instances
- task – int, parent worker id
- policy_config – policy estimator class and configuration dictionary
- log_level – int, logbook.level
- on_policy_loss – callable returning tensor holding on_policy training loss graph and summaries
- off_policy_loss – callable returning tensor holding off_policy training loss graph and summaries
- vr_loss – callable returning tensor holding value replay loss graph and summaries
- rp_loss – callable returning tensor holding reward prediction loss graph and summaries
- pc_loss – callable returning tensor holding pixel_control loss graph and summaries
- random_seed – int or None
- model_gamma – scalar, gamma discount factor
- model_gae_lambda – scalar, GAE lambda
- model_beta – entropy regularization beta, scalar or [high_bound, low_bound] for log_uniform.
- opt_max_env_steps – int, total number of environment steps to run training on.
- opt_decay_steps – int, learn ratio decay steps, in number of environment steps.
- opt_end_learn_rate – scalar, final learn rate
- opt_learn_rate – start learn rate, scalar or [high_bound, low_bound] for log_uniform distr.
- opt_decay – scalar, optimizer decay, if applicable.
- opt_momentum – scalar, optimizer momentum, if applicable.
- opt_epsilon – scalar, optimizer epsilon
- rollout_length – int, on-policy rollout length
- time_flat – bool, flatten rnn time-steps in rollouts while training - see Notes below
- episode_train_test_cycle – tuple or list as (train_number, test_number), def=(1,0): enables infinite loop such as: run train_number of train data episodes, then test_number of test data episodes, repeat. Should be consistent with provided dataset parameters (test data should exist if test_number > 0)
- episode_summary_freq – int, write episode summary for every i’th episode
- env_render_freq – int, write environment rendering summary for every i’th train step
- model_summary_freq – int, write model summary for every i’th train step
- test_mode – bool, True: Atari, False: BTGym
- replay_memory_size – int, in number of experiences
- replay_batch_size – int, mini-batch size for off-policy training, def = 1
- replay_rollout_length – int, off-policy rollout length; by default equals the on-policy rollout_length
- use_off_policy_aac – bool, use full AAC off-policy loss instead of Value-replay
- use_reward_prediction – bool, use aux. off-policy reward prediction task
- use_pixel_control – bool, use aux. off-policy pixel control task
- use_value_replay – bool, use aux. off-policy value replay task (not used if use_off_policy_aac=True)
- rp_lambda – reward prediction loss weight, scalar or [high, low] for log_uniform distr.
- pc_lambda – pixel control loss weight, scalar or [high, low] for log_uniform distr.
- vr_lambda – value replay loss weight, scalar or [high, low] for log_uniform distr.
- off_aac_lambda – off-policy AAC loss weight, scalar or [high, low] for log_uniform distr.
- gamma_pc – NOT USED
- rp_reward_threshold – scalar, reward prediction classification threshold, above which reward is ‘non-zero’
- rp_sequence_size – int, reward prediction sample size, in number of experiences
- clip_epsilon – scalar, PPO: surrogate L^clip epsilon
- num_epochs – int, num. of SGD runs for every train step, val. > 1 should be used with caution.
- pi_prime_update_period – int, PPO: pi to pi_old update period in number of train steps, def: 1
- _use_target_policy – bool, PPO: use target policy (aka pi_old), delayed by pi_prime_update_period delay
Note
On time_flat arg:
There are two alternatives to run the RNN part of the policy estimator:
- Feed initial RNN state for every experience frame in a rollout (those are stored anyway if we want random memory replay sampling) and do a single time-step RNN advance for all experiences in a batch; this is when time_flat=True.
- Reshape the incoming batch after the convolution part of the network in time-wise fashion, one entry per rollout, i.e. batch_size=number_of_rollouts and rnn_timesteps=max_rollout_length. In this case we need to feed initial rnn_states for rollouts only. There is a little extra work to pad rollouts to max_time_size and feed true rollout lengths to the rnn. Thus, when time_flat=False, we unroll the RNN for the specified number of time-steps for every rollout.
Both options have pros and cons:
- Unrolling the dynamic RNN is computationally more expensive but gives clearly faster convergence, [possibly] due to the fact that RNN states for the 2nd, 3rd, … frames of a rollout are computed using the updated policy estimator, which is supposed to be closer to the optimal one. When time-flattened, every time-step uses RNN states computed when the rollout was collected (i.e. by the behavioral policy estimator with older parameters).
- Nevertheless, time-flattening can be useful because one can safely shuffle the training batch or mix on-policy and off-policy data in a single mini-batch, ensuring the iid property and allowing, say, proper batch normalisation (this has yet to be tested).
class btgym.algorithms.aac.A3C(**kwargs)
Vanilla Asynchronous Advantage Actor Critic algorithm.
Based on original code taken from OpenAI repository under MIT licence: https://github.com/openai/universe-starter-agent
Paper: https://arxiv.org/abs/1602.01783
A3C args are a subset of BaseAAC arguments; see the BaseAAC class for descriptions.
Parameters: - env –
- task –
- policy_config –
- log –
- random_seed –
- model_gamma –
- model_gae_lambda –
- model_beta –
- opt_max_env_steps –
- opt_decay_steps –
- opt_end_learn_rate –
- opt_learn_rate –
- opt_decay –
- opt_momentum –
- opt_epsilon –
- rollout_length –
- episode_summary_freq –
- env_render_freq –
- model_summary_freq –
- test_mode –
class btgym.algorithms.aac.PPO(**kwargs)
AAC with Proximal Policy Optimization surrogate L^Clip loss, optionally augmented with auxiliary control tasks.
paper: https://arxiv.org/pdf/1707.06347.pdf
Based on PPO-SGD code from OpenAI Baselines repository under MIT licence: https://github.com/openai/baselines
Async. framework code comes from OpenAI repository under MIT licence: https://github.com/openai/universe-starter-agent
PPO args are a subset of BaseAAC arguments; see the BaseAAC class for descriptions.
Parameters: - env –
- task –
- policy_config –
- log_level –
- vr_loss –
- rp_loss –
- pc_loss –
- random_seed –
- model_gamma –
- model_gae_lambda –
- model_beta –
- opt_max_env_steps –
- opt_decay_steps –
- opt_end_learn_rate –
- opt_learn_rate –
- opt_decay –
- opt_momentum –
- opt_epsilon –
- rollout_length –
- episode_summary_freq –
- env_render_freq –
- model_summary_freq –
- test_mode –
- replay_memory_size –
- replay_rollout_length –
- use_off_policy_aac –
- use_reward_prediction –
- use_pixel_control –
- use_value_replay –
- rp_lambda –
- pc_lambda –
- vr_lambda –
- off_aac_lambda –
- rp_reward_threshold –
- rp_sequence_size –
- clip_epsilon –
- num_epochs –
- pi_prime_update_period –
btgym.algorithms.rollout module¶
btgym.algorithms.rollout.make_data_getter(queue)
Data stream getter constructor.
Parameters: queue – instance of Queue class to get rollouts from. Returns: callable, returning dictionary of data.
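A minimal sketch of the idea (assumed behaviour, not the verbatim source): wrap a queue so the trainer can pull ready rollouts through a plain callable. The payload keys are hypothetical.
```python
from queue import Queue

def make_data_getter(queue):
    def get_data():
        # Blocks until a runner has put a rollout dictionary on the queue.
        return queue.get()
    return get_data

q = Queue()
q.put({'on_policy': [], 'terminal': False})  # hypothetical rollout payload
pull = make_data_getter(q)
print(pull())
```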
class btgym.algorithms.rollout.Rollout
Experience rollout as [nested] dictionary of lists of ndarrays, tuples and rnn states.
add(values, _struct=None)
Adds single experience frame to rollout.
Parameters: values – [nested] dictionary of values.
add_memory_sample(sample)
Given a replay memory sample as a list of experience dictionaries, converts it to a rollout of the same length.
process(gamma, gae_lambda=1.0, size=None, time_flat=False)
Converts a single-trajectory rollout of experiences to a dictionary of ready-to-feed arrays. Computes rollout returns and advantages. Pads with zeroes to the desired length if the size arg is given.
Parameters: - gamma – discount factor
- gae_lambda – GAE lambda
- size – if given and time_flat=False, pads outputs with zeroes along the time dim. to exact size.
- time_flat – reduce time dimension to 1 step by stacking all experiences along batch dimension.
Returns: batch as a [nested] dictionary of np.arrays, tuples and LSTMStateTuples, of size:
- [1, time_size, depth] (or [1, size, depth] if size is given) when time_flat=False, with a single context entry for the entire trajectory, i.e. of size [1, context_depth];
- [batch_size, 1, depth] when time_flat=True, with batch_size = time_size and a context entry for every experience frame, i.e. of size [batch_size, context_depth].
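A hedged usage sketch of the Rollout API documented above. The experience field names are illustrative; the real schema may require additional keys (e.g. bootstrap values or rnn context).
```python
from btgym.algorithms.rollout import Rollout
import numpy as np

rollout = Rollout()
for t in range(20):
    # Hypothetical experience frame layout.
    rollout.add(
        {
            'state': {'external': np.zeros((42, 42, 1))},
            'action': 0,
            'reward': 0.0,
            'value': 0.0,
            'terminal': t == 19,
        }
    )
# Compute returns and advantages for feeding the train step.
batch = rollout.process(gamma=0.99, gae_lambda=1.0, time_flat=False)
```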
process_rp(reward_threshold=0.1)
Processes the rollout similarly to process() and estimates reward prediction targets for the first n-1 frames.
Parameters: reward_threshold – reward values with |r| > reward_threshold are classified as negative or positive. Returns: Processed batch with size reduced by one and with an extra rp_target key holding one-hot encodings for classes {zero, positive, negative}.
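A sketch of the reward-class encoding implied above (not library code; the class ordering is an assumption):
```python
import numpy as np

def rp_one_hot(reward, reward_threshold=0.1):
    # Assumed class ordering: [zero, positive, negative].
    target = np.zeros(3)
    if reward > reward_threshold:
        target[1] = 1.0
    elif reward < -reward_threshold:
        target[2] = 1.0
    else:
        target[0] = 1.0
    return target

print(rp_one_hot(0.05), rp_one_hot(0.5), rp_one_hot(-0.5))
```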
btgym.algorithms.memory module¶
class btgym.algorithms.memory.Memory(history_size, max_sample_size, priority_sample_size, log_level=13, rollout_provider=None, task=-1, reward_threshold=0.1, use_priority_sampling=False)
Replay memory with rebalanced replay based on reward value.
Note
The memory must be filled up before calling sampling methods.
Parameters: - history_size – number of experiences stored;
- max_sample_size – maximum allowed sample size (e.g. off-policy rollout length);
- priority_sample_size – sample size of priority_sample() method
- log_level – int, logbook.level;
- rollout_provider – callable returning list of Rollouts NOT USED
- task – parent worker id;
- reward_threshold – if |experience.reward| > reward_threshold: experience is saved as ‘prioritized’;
add(frame)
Appends single experience frame to memory.
Parameters: frame – dictionary of values.
add_rollout(rollout)
Adds frames from given rollout to memory with respect to episode continuation.
Parameters: rollout – Rollout instance.
fill()
Fills replay memory with initial experiences. NOT USED. Supposed to be called by parent worker() just before training begins.
Parameters: rollout_getter – callable, returning list of Rollouts.
sample_uniform(sequence_size)
Uniformly samples sequence of successive frames of size sequence_size or less (~off-policy rollout).
Parameters: sequence_size – maximum sample size. Returns: instance of Rollout of size <= sequence_size.
_sample_priority(size=None, exact_size=False, skewness=2, sample_attempts=100)
Implements rebalanced replay. Samples a sequence of successive frames from a distribution skewed by the reward of the last sample frame.
Parameters: - size – sample size, must be <= self.max_sample_size;
- exact_size – whether accept sample with size less than ‘size’ or re-sample to get sample of exact size (used for reward prediction task);
- skewness – int >= 1, sampling probability denominator, such that the probability of sampling a sequence whose last frame has non-zero reward is p[non_zero] = 1/skewness;
- sample_attempts – if exact_size=True, sets the number of re-sampling attempts to get a sample of continuous experiences (no Terminal frames inside except the last one); if this number is reached, the sample is returned as is.
Returns: instance of Rollout().
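A hedged usage sketch of the Memory API documented above (rollout contents and sizes are illustrative; the memory must be filled before sampling):
```python
from btgym.algorithms.memory import Memory

memory = Memory(
    history_size=2000,        # experiences stored
    max_sample_size=20,       # e.g. off-policy rollout length
    priority_sample_size=3,   # e.g. reward prediction sequence size
    reward_threshold=0.1,
)
# ... after enough memory.add_rollout(rollout) calls to fill the memory up:
off_policy_rollout = memory.sample_uniform(sequence_size=20)
rp_rollout = memory._sample_priority(size=3, exact_size=True)
```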
btgym.algorithms.envs module¶
class btgym.algorithms.envs.AtariRescale42x42(env_id=None)
Gym wrapper that pipes Atari into BTgym algorithms, as the latter expect observations to be a DictSpace. Makes the Atari environment return state as a dictionary with a single key 'external' holding a grayscale 42x42 visual output normalized to [0, 1].
Parameters: env_id – conventional Gym id.
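A hedged usage sketch (the env id is just an example; the observation layout follows the description above):
```python
from btgym.algorithms.envs import AtariRescale42x42

env = AtariRescale42x42('BreakoutDeterministic-v4')
obs = env.reset()
frame = obs['external']   # 42x42 grayscale frame, values normalized to [0, 1]
```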