btgym.algorithms.runner.base module¶
-
btgym.algorithms.runner.base.BaseEnvRunnerFn(sess, env, policy, task, rollout_length, summary_writer, episode_summary_freq, env_render_freq, atari_test, ep_summary, memory_config, log, **kwargs)[source]¶ Default function defining the runtime logic of the thread runner. In brief, it continuously runs the policy and, once a rollout reaches the specified length, appends the collected data to the queue. A minimal sketch of this pattern follows the parameter list below.
Parameters: - env – environment instance
- policy – policy instance
- task – int
- rollout_length – int
- episode_summary_freq – int
- env_render_freq – int
- atari_test – bool, Atari or BTgym
- ep_summary – dict of tf.summary ops and placeholders
- memory_config – replay memory configuration dictionary
- log – logbook logger
- Yields:
- collected data as a dictionary of on_policy, off_policy rollouts and episode statistics.
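A minimal, hypothetical sketch of the generator pattern such a runner function follows; the policy interface (policy.act), the environment step signature and the dictionary keys are illustrative assumptions, not the exact BTgym API::

    def runner_fn_sketch(sess, env, policy, task, rollout_length, **kwargs):
        """Continuously steps the environment and yields fixed-length rollouts."""
        state = env.reset()
        while True:
            rollout = []
            for _ in range(rollout_length):
                action = policy.act(state)                       # assumed policy call
                next_state, reward, done, info = env.step(action)
                rollout.append((state, action, reward, done))
                state = env.reset() if done else next_state
                if done:
                    break
            # Collected data is handed to the consuming thread (here via yield):
            yield {'on_policy': rollout, 'off_policy': [], 'ep_summary': {}}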
btgym.algorithms.runner.threadrunner module¶
-
class btgym.algorithms.runner.threadrunner.RunnerThread(env, policy, task, rollout_length, episode_summary_freq, env_render_freq, test, ep_summary, runner_fn_ref=<function BaseEnvRunnerFn>, memory_config=None, log_level=13, **kwargs)[source]¶ Asynchronous framework code comes from the OpenAI universe-starter-agent repository under MIT licence: https://github.com/openai/universe-starter-agent
Despite the fact that BTgym is not a real-time environment [yet], the thread-runner approach is kept here. From the original universe-starter-agent: …One of the key distinctions between a normal environment and a universe environment is that a universe environment is _real time_. This means that there should be a thread that would constantly interact with the environment and tell it what to do. This thread is here.
Another idea is to treat the RunnerThread as an all-in-one data provider, so the data distribution fed to the estimator is shaped in a single place. Hence the replay memory also lives here, along with some service functions (collecting summary data). A usage sketch follows the parameter list below.
Parameters: - env – environment instance
- policy – policy instance
- task – int
- rollout_length – int
- episode_summary_freq – int
- env_render_freq – int
- test – bool, Atari or BTgym
- ep_summary – dict of tf.summary ops and placeholders
- runner_fn_ref – callable defining runner execution logic
- memory_config – replay memory configuration dictionary
- log_level – int, logbook.level
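A hedged usage sketch: start_runner() and the queue attribute follow the universe-starter-agent convention and may differ between BTgym versions; env, policy and ep_summary are assumed to be built elsewhere::

    import tensorflow as tf
    from btgym.algorithms.runner.threadrunner import RunnerThread

    runner = RunnerThread(
        env=env,                      # assumed: BTgym environment instance
        policy=policy,                # assumed: policy instance
        task=0,
        rollout_length=20,
        episode_summary_freq=20,
        env_render_freq=10000,
        test=False,
        ep_summary=ep_summary,        # assumed: dict of tf.summary ops and placeholders
    )
    with tf.Session() as sess:
        summary_writer = tf.summary.FileWriter('./logs', sess.graph)
        runner.start_runner(sess, summary_writer)   # assumed launcher method name
        data = runner.queue.get(timeout=600)        # blocks until a rollout is ready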
btgym.algorithms.runner.synchro module¶
-
class btgym.algorithms.runner.synchro.BaseSynchroRunner(env, task, rollout_length, episode_summary_freq, env_render_freq, ep_summary, test=False, policy=None, data_sample_config=None, memory_config=None, test_conditions=None, test_deterministic=True, slowdown_steps=0, global_step_op=None, aux_render_modes=None, _implemented_aux_render_modes=None, name='synchro', log_level=13, **kwargs)[source]¶ Experience provider class. Interacts with the environment and outputs data in the form of rollouts augmented with relevant summaries and metadata. This runner is synchronous in the sense that data collection is in-process and every rollout is collected by an explicit call to the get_data() method [unlike the 'async' thread-runner version found earlier in this package, which, once started, runs on its own and cannot be moderated]. This makes precise control over the policy being executed possible. Does not support 'atari' mode. A construction sketch follows the parameter list below.
Parameters: - env – BTgym environment instance
- task – int, runner task id
- rollout_length – int
- episode_summary_freq – int
- env_render_freq – int
- test – legacy, not used
- ep_summary – legacy, not used
- policy – policy instance to execute
- data_sample_config – dict, data sampling configuration dictionary
- memory_config – dict, replay memory configuration
- test_conditions – dict or None, dictionary of single experience conditions to check to mark it as test one.
- test_deterministic – bool, if True - act deterministically for test episodes
- slowdown_time – time to sleep between steps
- aux_render_modes – iterable of str, additional summaries to compute
- _implemented_aux_render_modes – iterable of str, implemented additional summaries
- name – str, name scope
- log_level – int, logbook.level
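A hedged construction sketch; env and policy are assumed to be created elsewhere, and only a subset of the keyword arguments listed above is shown::

    from btgym.algorithms.runner.synchro import BaseSynchroRunner

    runner = BaseSynchroRunner(
        env=env,                  # assumed: BTgym environment instance
        task=0,
        rollout_length=20,
        episode_summary_freq=10,
        env_render_freq=100,
        ep_summary=None,          # legacy, not used
        policy=policy,            # assumed: policy instance to execute
        data_sample_config=None,
        memory_config=None,
        name='my_synchro',
    )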
-
start(sess, summary_writer, init_context=None, data_sample_config=None)[source]¶ Executes initial sequence; fills initial replay memory if any.
-
get_init_experience(policy, policy_sync_op=None, init_context=None, data_sample_config=None)[source]¶ Starts new environment episode.
Parameters: - policy – policy to execute.
- policy_sync_op – operation copying local behavioural policy params from global one
- init_context – initial policy context for new episode.
- data_sample_config – configuration dictionary of type btgym.datafeed.base.EnvResetConfig
Returns: incomplete initial experience of the episode as a dictionary (missing the bootstrapped R value), next_state, next policy RNN context, action_reward
-
get_experience(policy, state, context, action, reward, policy_sync_op=None)[source]¶ Get single experience (possibly terminal).
Returns: incomplete experience as a dictionary (missing the bootstrapped R value), next_state, next policy RNN context, action_reward
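A rough sketch of manual stepping with these two methods, assuming the return order stated above and an action_reward pair of (action, reward); the exact unpacking and the 'terminal' key should be verified against the installed BTgym version::

    # Start a new episode and obtain the initial (incomplete) experience:
    experience, state, context, action_reward = runner.get_init_experience(
        policy=policy,
        init_context=None,
    )
    # Step the environment one experience at a time:
    for _ in range(20):                           # illustrative rollout length
        experience, state, context, action_reward = runner.get_experience(
            policy=policy,
            state=state,
            context=context,
            action=action_reward[0],
            reward=action_reward[1],
        )
        if experience.get('terminal'):            # assumed key marking episode end
            break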
-
get_train_stat(is_test=False)[source]¶ Updates and computes average statistics for train episodes.
Parameters: is_test – bool, current episode type
Returns: dict of stats
-
get_test_stat(is_test=False)[source]¶ Updates and computes statistics for single test episode.
Parameters: is_test – bool, current episode type
Returns: dict of stats
-
get_ep_render(is_test=False)[source]¶ Collects environment renderings. Relies on environment renderer class methods, so it is only valid when environment rendering is enabled (typically true for the master runner).
Returns: dictionary of images as rgb arrays
-
get_data(policy=None, policy_sync_op=None, init_context=None, data_sample_config=None, rollout_length=None, force_new_episode=False)[source]¶ Collects a single trajectory rollout and a bunch of summaries using the specified policy. Updates episode statistics and replay memory. A usage sketch follows the parameter list below.
Parameters: - policy – policy to execute
- policy_sync_op – operation copying local behavioural policy params from global one
- init_context – if specified, overrides the initial episode context provided by self.context (valid only if a new episode is started within this rollout).
- data_sample_config – environment configuration parameters for the next episode to sample: configuration dictionary of type btgym.datafeed.base.EnvResetConfig
- rollout_length – length of rollout to collect, if specified - overrides self.rollout_length attr
- force_new_episode – bool, if True - resets the environment
Returns: data dictionary
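A hedged sketch of the intended synchronous collection loop: start() primes the runner, then every rollout is pulled by an explicit get_data() call, so the training loop fully controls when data is gathered; sess, summary_writer, num_train_steps, sync_op and process_rollout are illustrative placeholders assumed to exist::

    runner.start(sess, summary_writer)       # runs initial sequence, fills replay memory if any
    for step in range(num_train_steps):      # illustrative loop bound
        data = runner.get_data(policy=policy, policy_sync_op=sync_op)
        process_rollout(data)                # assumed user-defined training update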
-
get_batch(size, policy=None, policy_sync_op=None, require_terminal=True, same_trial=True, init_context=None, data_sample_config=None)[source]¶ Returns a batch as a list of 'size' or more rollouts collected under the specified policy. Rollouts can be collected from several consecutive episodes; there may be more rollouts than the requested 'size' if it is necessary to collect at least one terminal rollout. A usage sketch follows the parameter list below.
Parameters: - size – int, number of rollouts to collect
- policy – policy to use
- policy_sync_op – operation copying local behavioural policy params from global one
- require_terminal – bool, if True - require at least one terminal rollout to be present.
- same_trial – bool, if True - all episodes are sampled from same trial
- init_context – if specified, overrides the initial episode context provided by self.context
- data_sample_config – environment configuration parameters for all episodes in batch: configuration dictionary of type btgym.datafeed.base.EnvResetConfig
Returns: 'data' key holding a list of data dictionaries; 'terminal_context' key holding a list of terminal output contexts. If require_terminal=True, this list is guaranteed to hold at least one element.
Return type: dict
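A hedged get_batch() sketch using the keyword arguments listed above; runner and policy are assumed to be built as shown earlier::

    batch = runner.get_batch(
        size=4,
        policy=policy,
        require_terminal=True,
        same_trial=True,
    )
    rollouts = batch['data']                        # list of at least 4 rollout dictionaries
    terminal_contexts = batch['terminal_context']   # non-empty when require_terminal=True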
-
class btgym.algorithms.runner.synchro.VerboseSynchroRunner(name='verbose_synchro', aux_render_modes=('action_prob', 'value_fn', 'lstm_1_h', 'lstm_2_h'), **kwargs)[source]¶ Extends the BaseSynchroRunner class with additional visualisation summaries at some expense of running speed.