SR Dataset
SRToolkit.dataset.sr_dataset
SR_dataset
SR_dataset(X: ndarray, symbol_library: SymbolLibrary, ranking_function: str = 'rmse', y: Optional[ndarray] = None, max_evaluations: int = -1, ground_truth: Optional[Union[List[str], Node, ndarray]] = None, original_equation: Optional[str] = None, success_threshold: Optional[float] = None, result_augmenters: Optional[List[ResultAugmenter]] = None, seed: Optional[int] = None, dataset_metadata: Optional[dict] = None, **kwargs)
Initializes an instance of the SR_dataset class.
Examples:
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> dataset = SR_dataset(X, SymbolLibrary.default_symbols(2), ground_truth=["X_0", "+", "X_1"],
... y=np.array([3, 7, 11]), max_evaluations=10000, original_equation="z = x + y", success_threshold=1e-6)
>>> evaluator = dataset.create_evaluator()
>>> evaluator.evaluate_expr(["sin", "(", "X_0", ")"]) < dataset.success_threshold
False
>>> evaluator.evaluate_expr(["u-", "C", "*", "X_1", "+", "X_0"]) < dataset.success_threshold
True
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | The input data used in the calculation of the error/ranking function. We assume that X is a 2D array with the shape (n_samples, n_features). | *required* |
| `symbol_library` | `SymbolLibrary` | The symbol library to use. | *required* |
| `ranking_function` | `str` | The ranking function to use. Currently, "rmse" and "bed" are supported. RMSE is the standard ranking function in symbolic regression, calculating the error between the ground-truth values and the outputs of expressions with fitted free parameters. BED is a stochastic measure that calculates the behavioral distance between two expressions that may contain free parameters. Its advantage is that expressions with many parameters are less likely to overfit, so the measure focuses more on structure identification. | `'rmse'` |
| `y` | `Optional[ndarray]` | The target values used in parameter estimation when the ranking function is "rmse". | `None` |
| `max_evaluations` | `int` | The maximum number of expressions to evaluate. A value less than 0 means no limit. | `-1` |
| `ground_truth` | `Optional[Union[List[str], Node, ndarray]]` | The ground-truth expression, represented as a list of tokens (strings) in infix notation, a SRToolkit.utils.Node object, or a numpy array representing behavior (see SRToolkit.utils.create_behavior_matrix for more details). | `None` |
| `original_equation` | `Optional[str]` | The original equation from which the ground-truth expression was generated. | `None` |
| `success_threshold` | `Optional[float]` | The threshold for determining whether an expression is successful or not. | `None` |
| `result_augmenters` | `Optional[List[ResultAugmenter]]` | Optional list of objects that augment the results returned by the `get_results` function. | `None` |
| `seed` | `Optional[int]` | The seed used for random number generation/reproducibility. Default is None, which means no seed is used. | `None` |
| `dataset_metadata` | `Optional[dict]` | An optional dictionary containing metadata about this evaluation, such as the name of the dataset, a citation for the dataset, the number of variables, etc. | `None` |
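The "rmse" ranking with fitted free parameters can be illustrated outside the toolkit. Below is a minimal sketch using numpy and scipy, not SRToolkit's actual implementation; the candidate expression `X_0 + C * X_1` with one free constant is an assumption chosen to match the doctest data above:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
y = np.array([3.0, 7.0, 11.0])

def candidate(C, X):
    # Hypothetical candidate expression X_0 + C[0] * X_1 with one free constant
    return X[:, 0] + C[0] * X[:, 1]

def rmse(C):
    # Root mean squared error between the candidate's output and the targets
    return np.sqrt(np.mean((candidate(C, X) - y) ** 2))

# Fit the constant with L-BFGS-B, mirroring the minimization defaults below
res = minimize(rmse, x0=np.array([0.0]), method="L-BFGS-B",
               tol=1e-6, bounds=[(-5, 5)])
print(res.fun)  # RMSE after fitting; C[0] close to 1 recovers y = X_0 + X_1
```

The fitted RMSE is what the ranking function compares across candidate expressions; expressions whose fitted error falls below `success_threshold` are deemed successful.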
Other Parameters:

| Name | Type | Description |
|---|---|---|
| `method` | `str` | The method used for minimization. Currently, only "L-BFGS-B" is supported/tested. Default is "L-BFGS-B". |
| `tol` | `float` | The tolerance for termination. Default is 1e-6. |
| `gtol` | `float` | The tolerance for the gradient norm. Default is 1e-3. |
| `max_iter` | `int` | The maximum number of iterations. Default is 100. |
| `constant_bounds` | `Tuple[float, float]` | A tuple of two elements specifying the lower and upper bounds for the constant values. Default is (-5, 5). |
| `initialization` | `str` | The method used for initializing the constant values. Currently, only "random" and "mean" are supported: "random" creates a vector of random values sampled within the bounds, while "mean" creates a vector where all values equal (lower_bound + upper_bound)/2. Default is "random". |
| `max_constants` | `int` | The maximum number of constants allowed in an expression. Default is 8. |
| `max_expr_length` | `int` | The maximum length of an expression. Default is -1 (no limit). |
| `num_points_sampled` | `int` | The number of points sampled when estimating the behavior of an expression. If num_points_sampled == -1, the number of points sampled equals the number of points in the dataset. Default is 64. |
| `bed_X` | `Optional[ndarray]` | Points used for BED evaluation. If None and domain_bounds is given, points are sampled from the domain; if None and domain_bounds is not given, points are randomly selected from X. Default is None. |
| `num_consts_sampled` | `int` | The number of constants sampled for BED evaluation. Default is 32. |
| `domain_bounds` | `Optional[List[Tuple[float, float]]]` | Bounds of the domain used to sample random points when bed_X is None. Default is None. |
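The behavior-based comparison behind "bed" can be sketched in plain numpy. This is a simplified illustration of the idea under the defaults listed above (num_points_sampled, num_consts_sampled, constant_bounds), not the toolkit's exact BED computation; the domain bounds and the candidate expression are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
lo, hi = -5, 5     # constant_bounds default
num_points = 64    # num_points_sampled default
num_consts = 32    # num_consts_sampled default

# Sample evaluation points from an assumed 2D domain (as when domain_bounds is given)
pts = rng.uniform(-1, 1, size=(num_points, 2))

def f_truth(X):
    return X[:, 0] + X[:, 1]        # ground truth: X_0 + X_1

def f_cand(X, c):
    return X[:, 0] + c * X[:, 1]    # candidate with one free constant C

# Sample constants within the bounds and measure how close the candidate's
# behavior at the sampled points can get to the ground-truth behavior
consts = rng.uniform(lo, hi, size=num_consts)
dists = [np.sqrt(np.mean((f_cand(pts, c) - f_truth(pts)) ** 2)) for c in consts]
print(min(dists))  # small when some sampled constant makes the behaviors match
```

Because the distance is taken over sampled constants rather than fitted ones, heavily parameterized expressions gain less from overfitting, which is the structural-identification advantage described above.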
Source code in SRToolkit/dataset/sr_dataset.py
evaluate_approach
evaluate_approach(sr_approach: SR_approach, num_experiments: int = 1, top_k: int = 20, initial_seed: int = None, results: Optional[SR_results] = None) -> SR_results
Evaluates an SR_approach on this dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sr_approach` | `SR_approach` | An instance of SR_approach that will be evaluated on this dataset. | *required* |
| `num_experiments` | `int` | The number of times the approach should be evaluated on this dataset. | `1` |
| `top_k` | `int` | The number of best expressions presented in the results. | `20` |
| `initial_seed` | `int` | The seed used for random number generation. If None, the seed from the dataset is used. | `None` |
| `results` | `Optional[SR_results]` | An optional SR_results object to which the results of the evaluation will be added. If None, a new SR_results object is created. | `None` |
Returns:

| Type | Description |
|---|---|
| `SR_results` | The results of the evaluation. |
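The repeated-experiment pattern that `evaluate_approach` describes can be sketched generically. The stand-in `run_experiment` function below is hypothetical (the real call takes an `SR_approach` instance), and deriving per-repetition seeds from the initial seed is an assumption for illustration:

```python
import heapq
import random

def run_experiment(seed):
    # Hypothetical stand-in for one SR run: returns (error, expression) pairs
    rng = random.Random(seed)
    return [(rng.random(), f"expr_{seed}_{i}") for i in range(100)]

initial_seed, num_experiments, top_k = 0, 3, 20
all_results = []
for i in range(num_experiments):
    # Each repetition gets its own seed derived from the initial one
    all_results.extend(run_experiment(initial_seed + i))

# Keep only the top_k best (lowest-error) expressions for the results object
best = heapq.nsmallest(top_k, all_results)
print(len(best))  # 20
```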
Source code in SRToolkit/dataset/sr_dataset.py
create_evaluator
Creates an instance of the SR_evaluator class from this dataset.
Examples:
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> dataset = SR_dataset(X, SymbolLibrary.default_symbols(2), ground_truth=["X_0", "+", "X_1"],
... y=np.array([3, 7, 11]), max_evaluations=10000, original_equation="z = x + y", success_threshold=1e-6)
>>> evaluator = dataset.create_evaluator()
>>> evaluator.evaluate_expr(["sin", "(", "X_0", ")"])
8.056453977203414
>>> evaluator.evaluate_expr(["X_1", "+", "X_0"])
0.0
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `dict` | An optional dictionary containing metadata about this evaluation, such as the dataset used, the model used, the seed, etc. | `None` |
| `seed` | `int` | An optional seed for the random number generator. If None, the seed from the dataset is used. | `None` |
Returns:

| Type | Description |
|---|---|
| `SR_evaluator` | An instance of the SR_evaluator class. |

Raises:

| Type | Description |
|---|---|
| `Exception` | If an error occurs when creating the evaluator. |
Source code in SRToolkit/dataset/sr_dataset.py
__str__
Returns a string describing this dataset.
The string describes the target expression, symbols that should be used, and the success threshold. It also includes any constraints that should be followed when evaluating a model on this dataset. These constraints include the maximum number of expressions to evaluate, the maximum length of the expression, and the maximum number of constants allowed in the expression. If the symbol library contains a symbol for constants, the string also includes the range of constants.
For other metadata, please refer to the attribute self.dataset_metadata.
Examples:
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> dataset = SR_dataset(X, SymbolLibrary.default_symbols(2), ground_truth=["X_0", "+", "X_1"],
... y=np.array([3, 7, 11]), max_evaluations=10000, original_equation="z = x + y", success_threshold=1e-6)
>>> str(dataset)
'Dataset for target expression z = x + y. When evaluating your model on this dataset, you should limit your generative model to only produce expressions using the following symbols: +, -, *, /, ^, u-, sqrt, sin, cos, exp, tan, arcsin, arccos, arctan, sinh, cosh, tanh, floor, ceil, ln, log, ^-1, ^2, ^3, ^4, ^5, pi, e, C, X_0, X_1.\nExpressions will be ranked based on the RMSE ranking function.\nExpressions are deemed successful if the root mean squared error is less than 1e-06. However, we advise that you check the best performing expressions manually to ensure they are correct.\nDataset uses the default limitations (extra arguments) from the SR_evaluator.The expressions in the dataset can contain constants/free parameters.\nFor other metadata, please refer to the attribute self.dataset_metadata.'
Returns:

| Type | Description |
|---|---|
| `str` | A string describing this dataset. |
Source code in SRToolkit/dataset/sr_dataset.py
to_dict
Creates a dictionary representation of this dataset. This is mainly used for saving the dataset to disk.
Examples:
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> dataset = SR_dataset(X, SymbolLibrary.default_symbols(2), ground_truth=["X_0", "+", "X_1"],
... y=np.array([3, 7, 11]), max_evaluations=10000, original_equation="z = x + y", success_threshold=1e-6)
>>> dataset.to_dict("data/example_ds", "test_dataset")
{'symbol_library': {'type': 'SymbolLibrary', 'symbols': {'+': {'symbol': '+', 'type': 'op', 'precedence': 0, 'np_fn': '{} = {} + {}', 'latex_str': '{} + {}'}, '-': {'symbol': '-', 'type': 'op', 'precedence': 0, 'np_fn': '{} = {} - {}', 'latex_str': '{} - {}'}, '*': {'symbol': '*', 'type': 'op', 'precedence': 1, 'np_fn': '{} = {} * {}', 'latex_str': '{} \\cdot {}'}, '/': {'symbol': '/', 'type': 'op', 'precedence': 1, 'np_fn': '{} = {} / {}', 'latex_str': '\\frac{{{}}}{{{}}}'}, '^': {'symbol': '^', 'type': 'op', 'precedence': 2, 'np_fn': '{} = np.power({},{})', 'latex_str': '{}^{{{}}}'}, 'u-': {'symbol': 'u-', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = -{}', 'latex_str': '- {}'}, 'sqrt': {'symbol': 'sqrt', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.sqrt({})', 'latex_str': '\\sqrt {{{}}}'}, 'sin': {'symbol': 'sin', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.sin({})', 'latex_str': '\\sin {}'}, 'cos': {'symbol': 'cos', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.cos({})', 'latex_str': '\\cos {}'}, 'exp': {'symbol': 'exp', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.exp({})', 'latex_str': 'e^{{{}}}'}, 'tan': {'symbol': 'tan', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.tan({})', 'latex_str': '\\tan {}'}, 'arcsin': {'symbol': 'arcsin', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.arcsin({})', 'latex_str': '\\arcsin {}'}, 'arccos': {'symbol': 'arccos', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.arccos({})', 'latex_str': '\\arccos {}'}, 'arctan': {'symbol': 'arctan', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.arctan({})', 'latex_str': '\\arctan {}'}, 'sinh': {'symbol': 'sinh', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.sinh({})', 'latex_str': '\\sinh {}'}, 'cosh': {'symbol': 'cosh', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.cosh({})', 'latex_str': '\\cosh {}'}, 'tanh': {'symbol': 'tanh', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.tanh({})', 'latex_str': '\\tanh {}'}, 'floor': {'symbol': 'floor', 'type': 'fn', 
'precedence': 5, 'np_fn': '{} = np.floor({})', 'latex_str': '\\lfloor {} \\rfloor'}, 'ceil': {'symbol': 'ceil', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.ceil({})', 'latex_str': '\\lceil {} \\rceil'}, 'ln': {'symbol': 'ln', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.log({})', 'latex_str': '\\ln {}'}, 'log': {'symbol': 'log', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.log10({})', 'latex_str': '\\log_{{10}} {}'}, '^-1': {'symbol': '^-1', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = 1/{}', 'latex_str': '{}^{{-1}}'}, '^2': {'symbol': '^2', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = {}**2', 'latex_str': '{}^2'}, '^3': {'symbol': '^3', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = {}**3', 'latex_str': '{}^3'}, '^4': {'symbol': '^4', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = {}**4', 'latex_str': '{}^4'}, '^5': {'symbol': '^5', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = {}**5', 'latex_str': '{}^5'}, 'pi': {'symbol': 'pi', 'type': 'lit', 'precedence': 5, 'np_fn': 'np.full(X.shape[0], np.pi)', 'latex_str': '\\pi'}, 'e': {'symbol': 'e', 'type': 'lit', 'precedence': 5, 'np_fn': 'np.full(X.shape[0], np.e)', 'latex_str': 'e'}, 'C': {'symbol': 'C', 'type': 'const', 'precedence': 5, 'np_fn': 'np.full(X.shape[0], C[{}])', 'latex_str': 'C_{{{}}}'}, 'X_0': {'symbol': 'X_0', 'type': 'var', 'precedence': 5, 'np_fn': 'X[:, 0]', 'latex_str': 'X_{0}'}, 'X_1': {'symbol': 'X_1', 'type': 'var', 'precedence': 5, 'np_fn': 'X[:, 1]', 'latex_str': 'X_{1}'}}, 'preamble': ['import numpy as np'], 'num_variables': 2}, 'ranking_function': 'rmse', 'max_evaluations': 10000, 'success_threshold': 1e-06, 'original_equation': 'z = x + y', 'seed': None, 'dataset_metadata': None, 'kwargs': {}, 'result_augmenters': None, 'ground_truth': ['X_0', '+', 'X_1'], 'dataset_path': 'data/example_ds/test_dataset.npz'}
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `base_path` | `str` | The path to the directory where the data in the dataset should be saved. | *required* |
| `name` | `str` | The name of the dataset. This will be used to name the files containing the dataset data. | *required* |
Returns:

| Type | Description |
|---|---|
| `dict` | A dictionary representation of this dataset. |
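The save/load split that `to_dict` and `from_dict` rely on (JSON-friendly metadata in a dict, numeric arrays in an `.npz` file referenced by `dataset_path`) can be sketched with plain numpy. The file name and dict keys here are illustrative, not the toolkit's exact schema:

```python
import json
import os
import tempfile
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([3, 7, 11])

base = tempfile.mkdtemp()
npz_path = os.path.join(base, "test_dataset.npz")

# Arrays go to disk; the dict keeps only the path plus serializable metadata
np.savez(npz_path, X=X, y=y)
d = {"ranking_function": "rmse", "success_threshold": 1e-6,
     "dataset_path": npz_path}
json.dumps(d)  # the dict itself stays JSON-serializable

# Loading reverses the split: read the dict, then the arrays it points to
arrays = np.load(d["dataset_path"])
print(arrays["X"].shape)  # (3, 2)
```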
Source code in SRToolkit/dataset/sr_dataset.py
from_dict
staticmethod
Creates an instance of the SR_dataset class from its dictionary representation. This is mainly used for loading the dataset from disk.
Examples:
>>> from SRToolkit.evaluation.result_augmentation import RESULT_AUGMENTERS
>>> dataset_dict = {'symbol_library': {'type': 'SymbolLibrary', 'symbols': {'+': {'symbol': '+', 'type': 'op', 'precedence': 0, 'np_fn': '{} = {} + {}', 'latex_str': '{} + {}'}, '-': {'symbol': '-', 'type': 'op', 'precedence': 0, 'np_fn': '{} = {} - {}', 'latex_str': '{} - {}'}, '*': {'symbol': '*', 'type': 'op', 'precedence': 1, 'np_fn': '{} = {} * {}', 'latex_str': '{} \cdot {}'}, '/': {'symbol': '/', 'type': 'op', 'precedence': 1, 'np_fn': '{} = {} / {}', 'latex_str': '\frac{{{}}}{{{}}}'}, '^': {'symbol': '^', 'type': 'op', 'precedence': 2, 'np_fn': '{} = np.power({},{})', 'latex_str': '{}^{{{}}}'}, 'u-': {'symbol': 'u-', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = -{}', 'latex_str': '- {}'}, 'sqrt': {'symbol': 'sqrt', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.sqrt({})', 'latex_str': '\sqrt {{{}}}'}, 'sin': {'symbol': 'sin', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.sin({})', 'latex_str': '\sin {}'}, 'cos': {'symbol': 'cos', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.cos({})', 'latex_str': '\cos {}'}, 'exp': {'symbol': 'exp', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.exp({})', 'latex_str': 'e^{{{}}}'}, 'tan': {'symbol': 'tan', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.tan({})', 'latex_str': '\tan {}'}, 'arcsin': {'symbol': 'arcsin', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.arcsin({})', 'latex_str': '\arcsin {}'}, 'arccos': {'symbol': 'arccos', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.arccos({})', 'latex_str': '\arccos {}'}, 'arctan': {'symbol': 'arctan', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.arctan({})', 'latex_str': '\arctan {}'}, 'sinh': {'symbol': 'sinh', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.sinh({})', 'latex_str': '\sinh {}'}, 'cosh': {'symbol': 'cosh', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.cosh({})', 'latex_str': '\cosh {}'}, 'tanh': {'symbol': 'tanh', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.tanh({})', 'latex_str': '\tanh {}'}, 'floor': {'symbol': 'floor', 'type': 
'fn', 'precedence': 5, 'np_fn': '{} = np.floor({})', 'latex_str': '\lfloor {} \rfloor'}, 'ceil': {'symbol': 'ceil', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.ceil({})', 'latex_str': '\lceil {} \rceil'}, 'ln': {'symbol': 'ln', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.log({})', 'latex_str': '\ln {}'}, 'log': {'symbol': 'log', 'type': 'fn', 'precedence': 5, 'np_fn': '{} = np.log10({})', 'latex_str': '\log_{{10}} {}'}, '^-1': {'symbol': '^-1', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = 1/{}', 'latex_str': '{}^{{-1}}'}, '^2': {'symbol': '^2', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = {}**2', 'latex_str': '{}^2'}, '^3': {'symbol': '^3', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = {}**3', 'latex_str': '{}^3'}, '^4': {'symbol': '^4', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = {}**4', 'latex_str': '{}^4'}, '^5': {'symbol': '^5', 'type': 'fn', 'precedence': -1, 'np_fn': '{} = {}**5', 'latex_str': '{}^5'}, 'pi': {'symbol': 'pi', 'type': 'lit', 'precedence': 5, 'np_fn': 'np.full(X.shape[0], np.pi)', 'latex_str': '\pi'}, 'e': {'symbol': 'e', 'type': 'lit', 'precedence': 5, 'np_fn': 'np.full(X.shape[0], np.e)', 'latex_str': 'e'}, 'C': {'symbol': 'C', 'type': 'const', 'precedence': 5, 'np_fn': 'np.full(X.shape[0], C[{}])', 'latex_str': 'C_{{{}}}'}, 'X_0': {'symbol': 'X_0', 'type': 'var', 'precedence': 5, 'np_fn': 'X[:, 0]', 'latex_str': 'X_{0}'}, 'X_1': {'symbol': 'X_1', 'type': 'var', 'precedence': 5, 'np_fn': 'X[:, 1]', 'latex_str': 'X_{1}'}}, 'preamble': ['import numpy as np'], 'num_variables': 2}, 'ranking_function': 'rmse', 'max_evaluations': 10000, 'success_threshold': 1e-06, 'original_equation': 'z = x + y', 'seed': None, 'dataset_metadata': None, 'kwargs': {}, 'result_augmenters': None, 'ground_truth': ['X_0', '+', 'X_1'], 'dataset_path': 'data/example_ds/test_dataset.npz'}
>>> dataset = SR_dataset.from_dict(dataset_dict)
>>> dataset.X.shape
(3, 2)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `d` | `dict` | The dictionary representation of the dataset. | *required* |
| `augmentation_map` | `Dict[str, Type[ResultAugmenter]]` | A dictionary mapping the names of the result-augmentation classes to their respective classes. When the default value (None) is used, the SRToolkit.evaluation.result_augmentation.RESULT_AUGMENTERS dictionary is used. | `None` |