btgym.datafeed package
btgym.datafeed.base module
-
btgym.datafeed.base.
DataSampleConfig
= {'get_new': True, 'sample_type': 0, 'timestamp': None, 'b_alpha': 1, 'b_beta': 1} dict – Conventional sampling configuration template to pass to the data class sample() method:
`sample = my_data.sample(**DataSampleConfig)`
-
btgym.datafeed.base.
EnvResetConfig
= {'episode_config': {'get_new': True, 'sample_type': 0, 'timestamp': None, 'b_alpha': 1, 'b_beta': 1}, 'trial_config': {'get_new': True, 'sample_type': 0, 'timestamp': None, 'b_alpha': 1, 'b_beta': 1}} dict – Conventional reset configuration template to pass to the environment reset() method:
`observation = env.reset(**EnvResetConfig)`
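A minimal sketch of how these templates might be customised before use. The dictionaries below are local copies of the values listed above, so the snippet runs without btgym installed; `make_reset_config` is a hypothetical helper and not part of the library.

```python
import copy

# Local copies of the documented templates (assumption: values as listed above).
DataSampleConfig = dict(get_new=True, sample_type=0, timestamp=None, b_alpha=1, b_beta=1)
EnvResetConfig = dict(
    episode_config=copy.deepcopy(DataSampleConfig),
    trial_config=copy.deepcopy(DataSampleConfig),
)


def make_reset_config(episode_overrides=None, trial_overrides=None):
    """Hypothetical helper: return a modified deep copy of EnvResetConfig."""
    config = copy.deepcopy(EnvResetConfig)  # never mutate the shared template
    config['episode_config'].update(episode_overrides or {})
    config['trial_config'].update(trial_overrides or {})
    return config


# Request a test episode (sample_type=1) while keeping the trial config at its defaults.
test_reset = make_reset_config(episode_overrides={'sample_type': 1})
```

Copying the template first keeps the module-level defaults intact for later calls.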
-
class
btgym.datafeed.base.
BTgymBaseData
(filename=None, dataframe=None, parsing_params=None, sampling_params=None, name='base_data', data_names=('default_asset', ), task=0, frozen_time_split=None, log_level=13, _config_stack=None, **kwargs) Base BTgym data provider class. Provides core data loading, sampling, splitting and converting functionality. Do not use directly.
Enables Pipe:
CSV[source data]-->pandas[for efficient sampling]-->bt.feeds
Parameters: - filename – Str or iterable of str, filenames holding csv data; should be given either here or when calling read_csv(), see Notes
- dataframe – pd.dataframe holding data; if this arg is given, it overrides the `filename` arg.
- CSV to Pandas parsing (specific_params) –
- sep – ‘;’
- header – 0
- index_col – 0
- parse_dates – True
- names – [‘open’, ‘high’, ‘low’, ‘close’, ‘volume’]
- Pandas to BT.feeds conversion (specific_params) –
- timeframe=1 – 1 minute.
- datetime – 0
- open – 1
- high – 2
- low – 3
- close – 4
- volume – -1
- openinterest – -1
- Sampling (specific_params) –
- sample_class_ref – None - if not None, then sample() method will return an instance of the specified class, which itself must be a subclass of BaseBTgymDataset; else returns an instance of the base data class.
- start_weekdays – [0, 1, 2, 3, ] - Only weekdays from the list will be used for sample start.
- start_00 – True - sample start time will be set to first record of the day (usually 00:00).
- sample_duration – {‘days’: 1, ‘hours’: 23, ‘minutes’: 55} - Maximum sample time duration in days, hours, minutes
- time_gap – {‘days’: 0, ‘hours’: 5, ‘minutes’: 0} - Data omission threshold: maximum no-data time gap allowed within sample in days, hours, minutes. Consequently, if set to less than 1 day, samples containing weekend and holiday gaps will be rejected.
- test_period – {‘days’: 0, ‘hours’: 0, ‘minutes’: 0} - setting this param to non-zero duration forces instance.data split to train / test subsets with test subset duration equal to test_period with time_gap tolerance. Train data always precedes test one: [0_record<-train_data->split_point_record<-test_data->last_record].
- sample_expanding – None, reserved for child classes.
Note
- CSV file can contain duplicate records, checks will be performed and all duplicates will be removed;
- CSV file should be properly sorted by date_time in ascending order, no sorting checks performed.
- When supplying list of file_names, all files should be also listed ascending by their time period, no correct sampling will be possible otherwise.
- Default parameters are source-specific and made to correctly parse 1 minute Forex generic ASCII data files from www.HistData.com. Tune according to your data source.
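The effect of the time_gap threshold can be sketched with the standard library alone. This is an illustration under stated assumptions and not the library's own check: `max_gap_ok` is a hypothetical helper, and the timestamps are made up.

```python
from datetime import datetime, timedelta


def max_gap_ok(timestamps, time_gap):
    """Hypothetical check: True if no gap between consecutive records exceeds time_gap."""
    limit = timedelta(**time_gap)
    return all(later - earlier <= limit for earlier, later in zip(timestamps, timestamps[1:]))


time_gap = {'days': 0, 'hours': 5, 'minutes': 0}

# One-minute bars within a single trading day: gaps are far below the threshold.
intraday = [datetime(2017, 1, 2, 0, 0) + timedelta(minutes=i) for i in range(10)]

# Friday evening followed by Sunday evening: roughly a two-day weekend gap.
across_weekend = [datetime(2017, 1, 6, 16, 59), datetime(2017, 1, 8, 17, 0)]

intraday_ok = max_gap_ok(intraday, time_gap)
weekend_ok = max_gap_ok(across_weekend, time_gap)
```

With a 5-hour threshold the intraday sample passes and the weekend-spanning one is rejected, which matches the behaviour described for time_gap values below 1 day.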
-
set_params
(params_dict) Batch attribute setter.
Parameters: params_dict – dictionary of parameters to be set as instance attributes.
-
set_logger
(level=None, task=None) Sets logbook logger.
Parameters: - level – logbook.level, int
- task – task id, int
-
reset
(data_filename=None, **kwargs) Gets instance ready.
Parameters: - data_filename – [opt] string or list of strings.
- kwargs – not used.
-
read_csv
(data_filename=None, force_reload=False) Populates instance by loading data: CSV file --> pandas dataframe.
Parameters: - data_filename – [opt] csv data filename as string or list of such strings.
- force_reload – if True, ignore already loaded data and reload.
-
describe
() Returns summary dataset statistics as pandas dataframe:
- records count,
- data mean,
- data std dev,
- min value,
- 25% percentile,
- 50% percentile,
- 75% percentile,
- max value
for every data column.
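The library returns these statistics as a pandas dataframe. The sketch below uses only the standard library to show what each row means for a single column; `describe_column` and the sample values are hypothetical, and the quartiles use linear interpolation between data points.

```python
import statistics


def describe_column(values):
    """Summary statistics for one column, in the order listed above."""
    q25, q50, q75 = statistics.quantiles(values, n=4, method='inclusive')
    return {
        'count': len(values),
        'mean': statistics.mean(values),
        'std': statistics.stdev(values),  # sample standard deviation
        'min': min(values),
        '25%': q25,
        '50%': q50,
        '75%': q75,
        'max': max(values),
    }


close_prices = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = describe_column(close_prices)
```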
-
to_btfeed
() Performs BTgymData --> bt.feed conversion.
Returns: {data_line_name: bt.datafeed instance}.
Return type: dict
-
_sample
(get_new=True, sample_type=0, b_alpha=1.0, b_beta=1.0, force_interval=False, interval=None, **kwargs) Samples continuous subset of data.
Parameters: - get_new (bool) – sample new (True) or reuse (False) last made sample;
- sample_type (int or bool) – 0 (train) or 1 (test) - get sample from train or test data subsets respectively.
- b_alpha (float) – beta-distribution sampling alpha > 0, valid for train episodes.
- b_beta (float) – beta-distribution sampling beta > 0, valid for train episodes.
- force_interval (bool) – use exact sampling interval (the interval arg must then be given)
- interval (iterable of int, len2) – exact interval to sample from when force_interval=True
Returns: - if sample_class_ref param has not been set: BTgymDataset instance with number of records ~ max_episode_len, where ~ tolerance is set by time_gap param;
- else: sample_class_ref instance with the same number of records as above.
Note
Train sample start position within interval is drawn from a beta-distribution with default parameters b_alpha=1, b_beta=1, i.e. a uniform one. The beta-distribution makes skewed sampling possible, for example giving recent episodes a higher probability of being sampled with b_alpha=10, b_beta=0.8. Test samples are always drawn uniformly.
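The skewing effect can be sketched with the standard library's `random.betavariate`. This is an illustration only: `sample_start` is a hypothetical helper, and the way the drawn value is mapped to a row number is an assumption, not the library's code.

```python
import random


def sample_start(interval, episode_len, b_alpha=1.0, b_beta=1.0, rng=random):
    """Hypothetical helper: draw an episode start row inside [lower, upper - episode_len]."""
    lower, upper = interval
    span = upper - lower - episode_len
    if span < 0:
        raise ValueError('interval is shorter than the episode')
    return lower + int(rng.betavariate(b_alpha, b_beta) * span)


rng = random.Random(0)  # fixed seed for reproducibility
interval, episode_len = (0, 10_000), 1_000

uniform = [sample_start(interval, episode_len, rng=rng) for _ in range(2000)]
skewed = [sample_start(interval, episode_len, 10, 0.8, rng=rng) for _ in range(2000)]

uniform_mean = sum(uniform) / len(uniform)  # close to the middle of the valid range
skewed_mean = sum(skewed) / len(skewed)     # pushed toward the end (recent data)
```

With b_alpha=10, b_beta=0.8 the distribution mean is 10 / 10.8, so most start positions fall in the last part of the interval.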
-
_sample_random
(sample_type=0, timestamp=None, name='random_sample_', interval=None, force_interval=False, **kwargs) Randomly samples continuous subset of data.
Parameters: name – str, sample filename id
Returns: BTgymDataset instance with number of records ~ max_episode_len, where ~ tolerance is set by time_gap param.
-
_sample_interval
(interval, b_alpha=1.0, b_beta=1.0, name='interval_sample_', force_interval=False, **kwargs) Samples continuous subset of data, such that all episode records lie within positions specified by interval. Episode start position within interval is drawn from a beta-distribution parametrised by b_alpha, b_beta. By default the distribution is uniform.
Parameters: - interval – tuple, list or 1d-array of integers of length 2: [lower_row_number, upper_row_number];
- b_alpha – float > 0, sampling B-distribution alpha param, def=1;
- b_beta – float > 0, sampling B-distribution beta param, def=1;
- name – str, sample filename id
- force_interval – bool, if true: force exact interval sampling
Returns: - BTgymDataset instance such that: 1. number of records ~ max_episode_len, subj. to time_gap param; 2. actual episode start position is sampled from interval;
- False if it is not possible to sample instance with set args.
-
_sample_aligned_interval
(interval, align_left=False, b_alpha=1.0, b_beta=1.0, name='interval_sample_', force_interval=False, **kwargs) Samples continuous subset of data, such that all episode records lie within positions specified by interval. Episode start position within interval is drawn from a beta-distribution parametrised by b_alpha, b_beta. By default the distribution is uniform.
Parameters: - interval – tuple, list or 1d-array of integers of length 2: [lower_row_number, upper_row_number];
- align_left – if True - try to align sample to beginning of interval;
- b_alpha – float > 0, sampling B-distribution alpha param, def=1;
- b_beta – float > 0, sampling B-distribution beta param, def=1;
- name – str, sample filename id
- force_interval – bool, if true: force exact interval sampling
Returns: - BTgymDataset instance such that: 1. number of records ~ max_episode_len, subj. to time_gap param; 2. actual episode start position is sampled from interval;
- False if it is not possible to sample instance with set args.
-
btgym.datafeed.base.
random_beta
() beta(a, b, size=None)
Draw samples from a Beta distribution.
The Beta distribution is a special case of the Dirichlet distribution, and is related to the Gamma distribution. It has the probability distribution function
\[f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},\]
where \(\alpha\) and \(\beta\) correspond to the arguments a and b, and the normalisation, B, is the beta function,
\[B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt.\]
It is often seen in Bayesian inference and order statistics.
Parameters: - a (float or array_like of floats) – Alpha, non-negative.
- b (float or array_like of floats) – Beta, non-negative.
- size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if a and b are both scalars. Otherwise, np.broadcast(a, b).size samples are drawn.
Returns: out – Drawn samples from the parameterized beta distribution.
Return type: ndarray or scalar
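The density above can be evaluated directly with the standard identity B(a, b) = Γ(a)Γ(b) / Γ(a + b). The sketch below is a plain-Python illustration of the formula, not the NumPy implementation; `beta_function` and `beta_pdf` are names chosen here.

```python
import math


def beta_function(a, b):
    """B(a, b) expressed through the gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def beta_pdf(x, a, b):
    """Density f(x; a, b) of the Beta distribution for 0 < x < 1."""
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_function(a, b)


flat = beta_pdf(0.5, 1, 1)    # a = b = 1 is the uniform density
peaked = beta_pdf(0.5, 2, 2)  # symmetric, peaked at 0.5
```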
btgym.datafeed.derivative module
-
class
btgym.datafeed.derivative.
BTgymEpisode
(filename=None, parsing_params=None, sampling_params=None, name=None, data_names=('default_asset', ), task=0, log_level=13, _config_stack=None) Low-level data class. Implements Episode object containing a single episode data sequence. Does not allow further sampling or data loading. Supposed to be converted to bt.datafeed object via .to_btfeed() method. Do not use directly.
-
class
btgym.datafeed.derivative.
BTgymDataTrial
(filename=None, parsing_params=None, sampling_params=None, name=None, data_names=('default_asset', ), frozen_time_split=None, task=0, log_level=13, _config_stack=None) Intermediate-level data class. Implements the concept of a Trial object. Supports data train/test separation. Do not use directly.
Parameters: - filename – not used;
- sampling_params – dict, sample retrieving options, see base class description for details;
- task – int, optional;
- parsing_params – csv parsing options, see base class description for details;
- log_level – int, optional, logbook.level;
- _config_stack – dict, holding configuration for nested child samples;
-
class
btgym.datafeed.derivative.
BTgymRandomDataDomain
(trial_params, episode_params, filename=None, dataframe=None, parsing_params=None, target_period=None, use_target_backshift=False, frozen_time_split=None, name='RndDataDomain', task=0, data_names=('default_asset', ), log_level=13) Top-level data class. Implements one way data domains can be defined, namely when the source domain precedes the target one. Implements pipe:
Domain.sample() --> Trial.sample() --> Episode.to_btfeed() --> bt.Strategy
This particular class randomly samples Trials from provided dataset.
Parameters: - filename – Str or list of str, file_names containing CSV historic data;
- dataframe – pd.dataframe or iterable of pd.dataframes containing historic data;
- parsing_params – csv parsing options, see base class description for details;
- trial_params – dict, describes trial parameters, should contain keys: {sample_duration, time_gap, start_00, start_weekdays, test_period, expanding};
- episode_params – dict, describes episode parameters, should contain keys: {sample_duration, time_gap, start_00, start_weekdays};
- target_period – dict, None or Int, domain target period, def={‘days’: 0, ‘hours’: 0, ‘minutes’: 0}; setting this param to non-zero duration forces separation to source/target domains (which can be thought of as creating top-level train/test subsets) with target data duration equal to target_period; if set to None - no target period assumed; if set to -1 - no source period assumed; Source data always precedes target one.
- use_target_backshift – bool, if true - target domain is shifted back by the duration of trial train period, thus allowing training on part of target domain data, namely train part of the trial closest to source/target break point.
- name – str, optional
- task – int, optional
- log_level – int, logbook.level
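A sketch of what the two parameter dictionaries might look like, using the key names listed above. All values are illustrative, and `missing_keys` is a hypothetical helper added here to check that the required keys are present; it is not part of btgym.

```python
TRIAL_KEYS = {'sample_duration', 'time_gap', 'start_00', 'start_weekdays', 'test_period', 'expanding'}
EPISODE_KEYS = {'sample_duration', 'time_gap', 'start_00', 'start_weekdays'}

# Illustrative values only; tune to your data.
trial_params = dict(
    sample_duration={'days': 30, 'hours': 0, 'minutes': 0},
    time_gap={'days': 3, 'hours': 0, 'minutes': 0},
    start_00=True,
    start_weekdays=[0, 1, 2, 3, 4],
    test_period={'days': 5, 'hours': 0, 'minutes': 0},
    expanding=False,
)
episode_params = dict(
    sample_duration={'days': 1, 'hours': 23, 'minutes': 55},
    time_gap={'days': 0, 'hours': 5, 'minutes': 0},
    start_00=False,
    start_weekdays=[0, 1, 2, 3],
)


def missing_keys(params, required):
    """Hypothetical helper: sorted names of required keys absent from params."""
    return sorted(required - set(params))


trial_missing = missing_keys(trial_params, TRIAL_KEYS)
episode_missing = missing_keys(episode_params, EPISODE_KEYS)
```

Both dictionaries would then be passed as the trial_params and episode_params arguments of the domain class.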
-
trial_class_ref
alias of
BTgymDataTrial
-
episode_class_ref
alias of
BTgymEpisode
-
class
btgym.datafeed.derivative.
BTgymDataset
(filename=None, episode_duration=None, time_gap=None, start_00=False, start_weekdays=None, parsing_params=None, target_period=None, name='SimpleDataSet', data_names=('default_asset', ), log_level=13, **kwargs) Simple top-level data class; implements direct random episode sampling from the data set induced by a csv file, i.e. it is a special case for Trial=def=Episode. Supports source and target data domain separation with one caveat, see Note.
Note
Due to current implementation sampling test episode actually requires sampling test TRIAL. To be improved.
Parameters: - filename – Str or list of str, file_names containing CSV historic data;
- episode_duration – dict, maximum episode duration in d:h:m, def={‘days’: 0, ‘hours’: 23, ‘minutes’: 55}, alias for sample_duration;
- time_gap – dict, data time gap allowed within sample in d:h:m, def={‘days’: 0, ‘hours’: 6};
- start_00 – bool, episode start point will be shifted back to first record of the day (usually 00:00), def=False;
- start_weekdays – list, only weekdays from the list will be used for sample start, def=[0, 1, 2, 3, 4, 5, 6];
- target_period – domain test(aka target) period. def={‘days’: 0, ‘hours’: 0, ‘minutes’: 0}; setting this param to non-zero duration forces data separation to train/test subsets. Train data always precedes test one.
- parsing_params – csv parsing options, see base class description for details;
- name – str, instance name;
- log_level – int, logbook.level;
- **kwargs – deprecated kwargs;
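The train/test separation controlled by target_period can be sketched as a cut taken from the end of the data. This is an assumption-laden illustration: `split_by_target_period` is a hypothetical helper, the records are made up, and the library additionally applies time_gap tolerance that is not modelled here.

```python
from datetime import datetime, timedelta


def split_by_target_period(timestamps, target_period):
    """Hypothetical split: the last target_period of data becomes the test subset."""
    cutoff = timestamps[-1] - timedelta(**target_period)
    train = [t for t in timestamps if t <= cutoff]
    test = [t for t in timestamps if t > cutoff]
    return train, test


# Ten days of daily records, sorted ascending as the data classes expect.
records = [datetime(2017, 1, 1) + timedelta(days=i) for i in range(10)]

train, test = split_by_target_period(records, {'days': 3, 'hours': 0, 'minutes': 0})
no_split_train, no_split_test = split_by_target_period(records, {'days': 0, 'hours': 0, 'minutes': 0})
```

Train data always precedes test data, and a zero-length period leaves the test subset empty.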
-
class
BTgymSimpleTrial
(filename=None, parsing_params=None, sampling_params=None, name=None, data_names=('default_asset', ), frozen_time_split=None, task=0, log_level=13, _config_stack=None) Truncated Trial without test period: always samples from train; the sampled episode inherits the train/test metadata of the parent trial.
Parameters: - filename – not used;
- sampling_params – dict, sample retrieving options, see base class description for details;
- task – int, optional;
- parsing_params – csv parsing options, see base class description for details;
- log_level – int, optional, logbook.level;
- _config_stack – dict, holding configuration for nested child samples;
-
trial_class_ref
alias of
BTgymSimpleTrial
-
class
btgym.datafeed.derivative.
BTgymDataset2
(filename=None, dataframe=None, episode_duration=None, time_gap=None, start_00=False, start_weekdays=None, parsing_params=None, target_period=None, name='SimpleDataSet2', data_names=('default_asset', ), log_level=13, **kwargs) Simple top-level data class; implements direct random episode sampling from the data set induced by a csv file, i.e. it is a special case for Trial=def=Episode.
Parameters: - filename – Str or list of str, file_names containing CSV historic data;
- dataframe – pd.dataframe or iterable of pd.dataframes containing historic data;
- episode_duration – dict, maximum episode duration in d:h:m, def={‘days’: 0, ‘hours’: 23, ‘minutes’: 55}, alias for sample_duration;
- time_gap – dict, data time gap allowed within sample in d:h:m, def={‘days’: 0, ‘hours’: 6};
- start_00 – bool, episode start point will be shifted back to first record of the day (usually 00:00), def=False;
- start_weekdays – list, only weekdays from the list will be used for sample start, def=[0, 1, 2, 3, 4, 5, 6];
- target_period – domain test(aka target) period. def={‘days’: 0, ‘hours’: 0, ‘minutes’: 0}; setting this param to non-zero duration forces data separation to train/test subsets. Train data always precedes test one.
- parsing_params – csv parsing options, see base class description for details;
- name – str, instance name;
- log_level – int, logbook.level;
- **kwargs –
btgym.datafeed.casual module
-
class
btgym.datafeed.casual.
BTgymCasualTrial
(name='TimeTrial', **kwargs) Intermediate-level data class. Implements the concept of a Trial object. Supports exact data train/test separation by means of global_time. Do not use directly.
Parameters: - filename – not used;
- sampling_params – dict, sample retrieving options, see base class description for details;
- task – int, optional;
- parsing_params – csv parsing options, see base class description for details;
- log_level – int, optional, logbook.level;
- _config_stack – dict, holding configuration for nested child samples;
-
set_global_timestamp
(timestamp) Performs validity checks and sets current global_time.
Parameters: timestamp – POSIX timestamp
-
get_intervals
() Estimates exact sampling intervals such that the test episode starts as close to the current global time point as data consistency allows, but no earlier.
Returns: dict of train and test sampling intervals for current global_time point
-
sample
(get_new=True, sample_type=0, timestamp=None, align_left=True, b_alpha=1.0, b_beta=1.0, **kwargs) Samples continuous subset of data.
Parameters: - get_new (bool) – sample new (True) or reuse (False) last made sample;
- sample_type (int or bool) – 0 (train) or 1 (test) - get sample from train or test data subsets respectively.
- timestamp – POSIX timestamp.
- align_left – bool, if True: set test interval as close to current timepoint as possible.
- b_alpha (float) – beta-distribution sampling alpha > 0, valid for train episodes.
- b_beta (float) – beta-distribution sampling beta > 0, valid for train episodes.
-
class
btgym.datafeed.casual.
BTgymCasualDataDomain
(filename, trial_params, episode_params, frozen_time_split=None, name='TimeDataDomain', data_names=('default_asset', ), **kwargs) Imitates an online data stream by implementing the concept of a sliding current time point and enabling sampling control according to it.
Objective is to enable proper train/evaluation/test data split and prevent data leakage by allowing training on known, past data only and testing on unknown, future data, providing a realistic training cycle.
The source trial set is defined as all trials starting somewhere in the past and ending no later than the current time point. The target trial set is defined as the trials whose train period starts somewhere in the past and ends at the current time point, and whose test period starts from now on, for all time points in the available dataset range.
Sampling control is defined by:
- current time point is set arbitrarily and is stateful in the sense that it can only be increased (no backward time);
- source trials can be sampled from past (known) data multiple times;
- target trial can only be sampled once according to current time point or later (unknown data);
- as any sampled target trial is being evaluated by outer algorithm, current time should be incremented either by providing ‘timestamp’ arg. to sample() method or calling set_global_timestamp() method, to match last evaluated record (marking all evaluated data as already known and making it available for training);
Parameters: - filename – Str or list of str, file_names containing CSV historic data;
- parsing_params – csv parsing options, see base class description for details;
- trial_params – dict, describes trial parameters, should contain keys: {sample_duration, time_gap, start_00, start_weekdays, test_period, expanding};
- episode_params – dict, describes episode parameters, should contain keys: {sample_duration, time_gap, start_00, start_weekdays};
- name – str, optional
- task – int, optional
- log_level – int, logbook.level
-
trial_class_ref
alias of
BTgymCasualTrial
-
episode_class_ref
alias of
BTgymEpisode
-
set_global_timestamp
(timestamp) Performs validity checks and sets current global_time.
Parameters: timestamp – POSIX timestamp
-
get_intervals
() Estimates exact sampling intervals such that the train period of the target trial overlaps with known, up-to-date data.
Returns: dict of train and test sampling intervals for current global_time point
-
sample
(get_new=True, sample_type=0, timestamp=None, b_alpha=1.0, b_beta=1.0, **kwargs) Samples from sequence of Trials.
Parameters: - get_new (bool) – sample new (True) or reuse (False) last made sample; n/a for target trials
- sample_type (int or bool) – 0 (train) or 1 (test) - get sample from source or target data subsets respectively;
- timestamp – POSIX timestamp indicating current global time of training loop
- b_alpha (float) – beta-distribution sampling alpha > 0, valid for train episodes.
- b_beta (float) – beta-distribution sampling beta > 0, valid for train episodes.
Returns: Trial as BTgymBaseDataTrial instance; None, if the trial sequence is exhausted (global time is up).
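The forward-only time rule described for this class can be sketched as a small stateful object. Everything here is hypothetical: `GlobalTime` is not a btgym class, and raising ValueError on a backward move is a choice made for this sketch; the library's own validity checks may respond differently.

```python
class GlobalTime:
    """Hypothetical stand-in for the domain's stateful current time point."""

    def __init__(self, timestamp=0):
        self.timestamp = timestamp

    def set_global_timestamp(self, timestamp):
        # Time may only move forward; this sketch rejects backward moves with an error.
        if timestamp < self.timestamp:
            raise ValueError('global time cannot be decreased')
        self.timestamp = timestamp

    def is_source(self, trial_end):
        # A source (train) trial must end no later than the current time point.
        return trial_end <= self.timestamp


clock = GlobalTime(timestamp=1_500_000_000)
clock.set_global_timestamp(1_500_086_400)  # advance by one day after evaluation

try:
    clock.set_global_timestamp(1_400_000_000)  # attempt to move back in time
    rejected = False
except ValueError:
    rejected = True
```

Advancing the clock after each evaluated target trial is what marks that data as known and therefore available for later training.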
btgym.datafeed.multi module
-
class
btgym.datafeed.multi.
BTgymMultiData
(data_class_ref=None, data_config=None, name='multi_data', data_names=None, task=0, log_level=13, **kwargs) Multiple data streams wrapper.
Parameters: - data_class_ref – one of BTgym single-stream datafeed classes
- data_config – nested dictionary of individual data streams sources, see notes below.
- kwargs – shared parameters for all data streams, see base dataclass
Notes
Data_config specifies all data sources consumed by strategy:
data_config = {
    data_line_name_0: {
        filename: [source csv filename string or list of strings],
        [config: {optional dict of individual stream config. params},]
    },
    ...,
    data_line_name_n : {...}
}
Example:
data_config = {
    'usd': {'filename': '.../DAT_ASCII_EURUSD_M1_2017.csv'},
    'gbp': {'filename': '.../DAT_ASCII_EURGBP_M1_2017.csv'},
    'jpy': {'filename': '.../DAT_ASCII_EURJPY_M1_2017.csv'},
    'chf': {'filename': '.../DAT_ASCII_EURCHF_M1_2017.csv'},
}
It is the user's responsibility to correctly choose historic data conversion rates with respect to the cash currency (here, EUR).
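A short sketch of checking a data_config dictionary before passing it on. `check_data_config` is a hypothetical helper, the 'config' contents are illustrative, and the elided file paths are kept exactly as in the example above.

```python
def check_data_config(data_config):
    """Hypothetical validator: return names of streams lacking a 'filename' entry."""
    return [name for name, spec in data_config.items() if 'filename' not in spec]


data_config = {
    'usd': {'filename': '.../DAT_ASCII_EURUSD_M1_2017.csv'},
    'gbp': {
        'filename': '.../DAT_ASCII_EURGBP_M1_2017.csv',
        # Optional per-stream settings; the key inside is illustrative.
        'config': {'timeframe': 1},
    },
    'jpy': {},  # deliberately incomplete to show the check
}

incomplete = check_data_config(data_config)
```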
-
btgym.datafeed.multi.
random_beta
() beta(a, b, size=None)
Draw samples from a Beta distribution.
The Beta distribution is a special case of the Dirichlet distribution, and is related to the Gamma distribution. It has the probability distribution function
\[f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},\]
where \(\alpha\) and \(\beta\) correspond to the arguments a and b, and the normalisation, B, is the beta function,
\[B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt.\]
It is often seen in Bayesian inference and order statistics.
Parameters: - a (float or array_like of floats) – Alpha, non-negative.
- b (float or array_like of floats) – Beta, non-negative.
- size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if a and b are both scalars. Otherwise, np.broadcast(a, b).size samples are drawn.
Returns: out – Drawn samples from the parameterized beta distribution.
Return type: ndarray or scalar