
Evaluation Submodule

SRToolkit.evaluation

Classes and functions for evaluating symbolic regression approaches.

Modules:

Name Description
parameter_estimator

ParameterEstimator — fits free constants in expressions and ranks them by RMSE.

sr_evaluator

SR_evaluator and SR_results — expression evaluation and result management.

result_augmentation

ResultAugmenter implementations that post-process results with LaTeX, simplified forms, RMSE, BED, and R² scores.

callbacks

SRCallbacks and CallbackDispatcher — event-driven hooks for monitoring and early stopping during evaluation.

BestExpressionFound dataclass

BestExpressionFound(experiment_id: str, expression: str, error: float, evaluation_number: int)

Fired when a new best expression is found during evaluation.

Attributes:

Name Type Description
experiment_id str

Identifier of the current experiment.

expression str

String representation of the new best expression.

error float

Error value of the new best expression.

evaluation_number int

Total number of evaluate_expr calls made at the time this event is fired.

CallbackDispatcher

CallbackDispatcher(callbacks: Optional[List[SRCallbacks]] = None)

Manages multiple SRCallbacks instances and dispatches events to all of them.

Examples:

>>> dispatcher = CallbackDispatcher()
>>> dispatcher.add(EarlyStoppingCallback(threshold=1e-6))
>>> len(dispatcher._callbacks)
1
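
A minimal usage sketch of dispatching through the same mechanism, assuming the EarlyStoppingCallback documented below; a stop request from any registered callback propagates to the dispatcher's return value.

>>> dispatcher = CallbackDispatcher([EarlyStoppingCallback(threshold=1e-6)])
>>> dispatcher.on_best_expression(BestExpressionFound("demo", "X_0+C", 1e-7, 7))
False
>>> dispatcher.on_best_expression(BestExpressionFound("demo", "X_0+C", 1e-5, 8))
True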

Parameters:

Name Type Description Default
callbacks Optional[List[SRCallbacks]]

Initial list of callbacks. Defaults to an empty list.

None
Source code in SRToolkit/evaluation/callbacks.py
def __init__(self, callbacks: Optional[List[SRCallbacks]] = None):
    """
    Args:
        callbacks: Initial list of callbacks. Defaults to an empty list.
    """
    if callbacks is None:
        self._callbacks: List[SRCallbacks] = []
    else:
        self._callbacks = callbacks

get_callbacks

get_callbacks() -> List[SRCallbacks]

Returns the list of callbacks.

Returns:

Type Description
List[SRCallbacks]

A list of SRCallbacks instances in this dispatcher.

Source code in SRToolkit/evaluation/callbacks.py
def get_callbacks(self) -> List[SRCallbacks]:
    """
    Returns the list of callbacks.

    Returns:
        A list of [SRCallbacks][SRToolkit.evaluation.callbacks.SRCallbacks] instances in this dispatcher.
    """
    return self._callbacks

add

add(callback: SRCallbacks) -> None

Add a callback to the dispatcher.

Parameters:

Name Type Description Default
callback SRCallbacks

The SRCallbacks instance to add.

required
Source code in SRToolkit/evaluation/callbacks.py
def add(self, callback: SRCallbacks) -> None:
    """
    Add a callback to the dispatcher.

    Args:
        callback: The [SRCallbacks][SRToolkit.evaluation.callbacks.SRCallbacks] instance to add.
    """
    self._callbacks.append(callback)

remove

remove(callback: SRCallbacks) -> None

Remove a callback from the dispatcher.

Parameters:

Name Type Description Default
callback SRCallbacks

The SRCallbacks instance to remove.

required

Raises:

Type Description
ValueError

If callback is not currently registered.

Source code in SRToolkit/evaluation/callbacks.py
def remove(self, callback: SRCallbacks) -> None:
    """
    Remove a callback from the dispatcher.

    Args:
        callback: The [SRCallbacks][SRToolkit.evaluation.callbacks.SRCallbacks] instance to remove.

    Raises:
        ValueError: If ``callback`` is not currently registered.
    """
    self._callbacks.remove(callback)

on_expr_evaluated

on_expr_evaluated(event: ExprEvaluated) -> bool

Dispatch to all callbacks and aggregate the stop signal.

Parameters:

Name Type Description Default
event ExprEvaluated

Data about the evaluated expression.

required

Returns:

Type Description
bool

False if any callback returned False (requesting early stop), True otherwise.

Source code in SRToolkit/evaluation/callbacks.py
def on_expr_evaluated(self, event: ExprEvaluated) -> bool:
    """
    Dispatch to all callbacks and aggregate the stop signal.

    Args:
        event: Data about the evaluated expression.

    Returns:
        ``False`` if any callback returned ``False`` (requesting early stop), ``True`` otherwise.
    """
    should_continue = True
    for cb in self._callbacks:
        cont = cb.on_expr_evaluated(event)
        if isinstance(cont, bool) and not cont:
            should_continue = False
    return should_continue

on_best_expression

on_best_expression(event: BestExpressionFound) -> bool

Dispatch to all callbacks and aggregate the stop signal.

Parameters:

Name Type Description Default
event BestExpressionFound

Data about the new best expression.

required

Returns:

Type Description
bool

False if any callback returned False (requesting early stop), True otherwise.

Source code in SRToolkit/evaluation/callbacks.py
def on_best_expression(self, event: BestExpressionFound) -> bool:
    """
    Dispatch to all callbacks and aggregate the stop signal.

    Args:
        event: Data about the new best expression.

    Returns:
        ``False`` if any callback returned ``False`` (requesting early stop), ``True`` otherwise.
    """
    should_continue = True
    for cb in self._callbacks:
        cont = cb.on_best_expression(event)
        if isinstance(cont, bool) and not cont:
            should_continue = False
    return should_continue

on_experiment_start

on_experiment_start(event: ExperimentEvent) -> None

Dispatch to all callbacks.

Parameters:

Name Type Description Default
event ExperimentEvent

Data about the experiment that is about to begin.

required
Source code in SRToolkit/evaluation/callbacks.py
def on_experiment_start(self, event: ExperimentEvent) -> None:
    """
    Dispatch to all callbacks.

    Args:
        event: Data about the experiment that is about to begin.
    """
    for cb in self._callbacks:
        cb.on_experiment_start(event)

on_experiment_end

on_experiment_end(event: ExperimentEvent, results: EvalResult) -> None

Dispatch to all callbacks.

Parameters:

Name Type Description Default
event ExperimentEvent

Data about the experiment that just ended.

required
results EvalResult

Final EvalResult for this experiment.

required
Source code in SRToolkit/evaluation/callbacks.py
def on_experiment_end(self, event: ExperimentEvent, results: EvalResult) -> None:
    """
    Dispatch to all callbacks.

    Args:
        event: Data about the experiment that just ended.
        results: Final [EvalResult][SRToolkit.utils.types.EvalResult] for this experiment.
    """
    for cb in self._callbacks:
        cb.on_experiment_end(event, results)

EarlyStoppingCallback

EarlyStoppingCallback(threshold: Optional[float], max_evaluations: Optional[int] = None)

Bases: SRCallbacks

Stops the search when the best expression error falls below a threshold.

Examples:

>>> cb = EarlyStoppingCallback(threshold=1e-6)
>>> cb.on_best_expression(BestExpressionFound("", "X_0", 1e-7, 42))
False
>>> cb.on_best_expression(BestExpressionFound("", "X_0", 1e-5, 43))
True

Parameters:

Name Type Description Default
threshold Optional[float]

Error value below which the search is stopped.

required
max_evaluations Optional[int]

Maximum number of evaluations after which the search is stopped. Defaults to None (no limit).

None
Source code in SRToolkit/evaluation/callbacks.py
def __init__(self, threshold: Optional[float], max_evaluations: Optional[int] = None):
    """
    Args:
        threshold: Error value below which the search is stopped.
        max_evaluations: Maximum number of evaluations after which the search
            is stopped. Defaults to ``None`` (no limit).
    """
    self.threshold = threshold
    self.max_evaluations = max_evaluations
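
The stopping logic itself is not shown here; the following is only a sketch of an on_best_expression override consistent with the documented behaviour. The max_evaluations check is an assumption inferred from the constructor parameter, not the library's verified implementation.

def on_best_expression(self, event: BestExpressionFound) -> Optional[bool]:
    # Request an early stop once the best error drops below the threshold.
    if self.threshold is not None and event.error < self.threshold:
        return False
    # Assumed: also stop once the evaluation budget is exhausted.
    if self.max_evaluations is not None and event.evaluation_number >= self.max_evaluations:
        return False
    return True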

ExperimentEvent dataclass

ExperimentEvent(dataset_name: str, approach_name: str, max_evaluations: Optional[int], success_threshold: Optional[float], seed: Optional[int])

Fired at experiment start and end.

Attributes:

Name Type Description
dataset_name str

Name of the dataset being evaluated.

approach_name str

Name of the SR approach being run.

max_evaluations Optional[int]

Maximum number of evaluations allowed for this experiment.

success_threshold Optional[float]

Error threshold for success, or None if not set.

seed Optional[int]

Random seed used for this experiment, or None if not set.

ExprEvaluated dataclass

ExprEvaluated(expression: str, error: float, evaluation_number: int, experiment_id: str, is_new_best: bool)

Fired after each expression is evaluated by evaluate_expr.

Attributes:

Name Type Description
expression str

String representation of the evaluated expression.

error float

Error value returned by the ranking function (RMSE or BED).

evaluation_number int

Total number of evaluate_expr calls made so far, including cache hits.

experiment_id str

Identifier of the current experiment.

is_new_best bool

True if this expression achieved a lower error than all previous ones.

LoggingCallback

LoggingCallback(log_file: Optional[str] = None)

Bases: SRCallbacks

Logs each new best expression to stdout or a file.

log_file may contain placeholders that are resolved at experiment start using fields from ExperimentEvent. Available placeholders: {dataset_name}, {approach_name}, {seed}. Using per-experiment placeholders (e.g. {seed}) gives each job its own file, which is the recommended approach for parallel execution.

When multiple jobs share the same resolved file path, writes are protected by fcntl.flock (POSIX advisory locking) so that concurrent processes on Linux / macOS do not corrupt each other's output. On Windows, or on network filesystems where flock is unavailable, the lock is silently skipped.

Examples:

>>> cb = LoggingCallback()
>>> cb.on_best_expression(BestExpressionFound("Nguyen-1_ProGED_42", "X_0+C", 0.001, 10))
[Experiment Nguyen-1_ProGED_42] New best: X_0+C (error=1.000000e-03)
>>> cb = LoggingCallback(log_file="logs/{dataset_name}_{seed}.log")
>>> cb.on_experiment_start(ExperimentEvent(dataset_name="test", max_evaluations=10, seed=1,
...                                        success_threshold=0, approach_name="ta"))
>>> cb._resolved_log_file
'logs/test_1.log'

Parameters:

Name Type Description Default
log_file Optional[str]

Destination for log messages. If None, messages are printed to stdout. May be a plain path or a template string with placeholders {dataset_name}, {approach_name}, {seed} that are resolved when the experiment starts.

None
Source code in SRToolkit/evaluation/callbacks.py
def __init__(self, log_file: Optional[str] = None):
    """
    Args:
        log_file: Destination for log messages.  If ``None``, messages are
            printed to stdout.  May be a plain path or a template string with
            placeholders ``{dataset_name}``, ``{approach_name}``, ``{seed}``
            that are resolved when the experiment starts.
    """
    self.log_file = log_file
    self._resolved_log_file: Optional[str] = log_file
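
Placeholder resolution happens in on_experiment_start, which is not shown above. A minimal sketch consistent with the documented example, assuming str.format-style substitution of the ExperimentEvent fields:

def on_experiment_start(self, event: ExperimentEvent) -> None:
    # Fill {dataset_name}, {approach_name} and {seed} from the event.
    if self.log_file is not None:
        self._resolved_log_file = self.log_file.format(
            dataset_name=event.dataset_name,
            approach_name=event.approach_name,
            seed=event.seed,
        )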

ProgressBarCallback

ProgressBarCallback(desc: Optional[str] = None)

Bases: SRCallbacks

Displays a tqdm progress bar that updates after each expression evaluation.

Examples:

>>> cb = ProgressBarCallback(desc="My search")
>>> cb.desc
'My search'

Parameters:

Name Type Description Default
desc Optional[str]

Description label shown on the progress bar. If None, the label is auto-generated as "<approach> on <dataset>" when the experiment starts.

None
Source code in SRToolkit/evaluation/callbacks.py
def __init__(self, desc: Optional[str] = None):
    """
    Args:
        desc: Description label shown on the progress bar. If ``None``, the label
            is auto-generated as ``"<approach> on <dataset>"`` when the experiment starts.
    """
    self.pbar = None
    self.desc = desc
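
The bar itself is created lazily at experiment start. A sketch of how the auto-generated label could be produced when desc is None, assuming a plain tqdm bar; the library's actual construction may differ.

from tqdm import tqdm

def on_experiment_start(self, event: ExperimentEvent) -> None:
    # Use the user-supplied label or fall back to "<approach> on <dataset>".
    label = self.desc if self.desc is not None else f"{event.approach_name} on {event.dataset_name}"
    # max_evaluations may be None, in which case tqdm shows an open-ended bar.
    self.pbar = tqdm(total=event.max_evaluations, desc=label)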

SRCallbacks

Bases: ABC

Abstract base class for SR evaluation callbacks.

Implement only the methods you need. Return False from on_expr_evaluated or on_best_expression to request early stopping; return True or None to continue.

Examples:

>>> class PrintBestCallback(SRCallbacks):
...     def on_best_expression(self, event):
...         print(f"New best: {event.expression} (error={event.error:.4g})")
>>> cb = PrintBestCallback()
>>> cb.on_best_expression(BestExpressionFound("", "X_0+C", 0.01, 5))
New best: X_0+C (error=0.01)
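
Because the hooks may also return False, the same base class supports custom stopping rules. A sketch of a budget-based callback (the class name and its budget attribute are illustrative, not part of the library):

>>> class BudgetCallback(SRCallbacks):
...     def __init__(self, budget):
...         self.budget = budget
...     def on_expr_evaluated(self, event):
...         # Continue only while the evaluation counter is below the budget.
...         return event.evaluation_number < self.budget
>>> cb = BudgetCallback(budget=100)
>>> cb.on_expr_evaluated(ExprEvaluated("X_0", 0.5, 100, "demo", False))
False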

on_expr_evaluated

on_expr_evaluated(event: ExprEvaluated) -> Optional[bool]

Called after each expression is evaluated.

Parameters:

Name Type Description Default
event ExprEvaluated

Data about the evaluated expression.

required

Returns:

Type Description
Optional[bool]

False to stop the search early, True or None to continue.

Source code in SRToolkit/evaluation/callbacks.py
def on_expr_evaluated(self, event: ExprEvaluated) -> Optional[bool]:
    """
    Called after each expression is evaluated.

    Args:
        event: Data about the evaluated expression.

    Returns:
        ``False`` to stop the search early, ``True`` or ``None`` to continue.
    """
    return None

on_best_expression

on_best_expression(event: BestExpressionFound) -> Optional[bool]

Called when a new best expression is found.

Parameters:

Name Type Description Default
event BestExpressionFound

Data about the new best expression.

required

Returns:

Type Description
Optional[bool]

False to stop the search early, True or None to continue.

Source code in SRToolkit/evaluation/callbacks.py
def on_best_expression(self, event: BestExpressionFound) -> Optional[bool]:
    """
    Called when a new best expression is found.

    Args:
        event: Data about the new best expression.

    Returns:
        ``False`` to stop the search early, ``True`` or ``None`` to continue.
    """
    return None

on_experiment_start

on_experiment_start(event: ExperimentEvent) -> None

Called before an experiment starts.

Parameters:

Name Type Description Default
event ExperimentEvent

Data about the experiment that is about to begin.

required
Source code in SRToolkit/evaluation/callbacks.py
def on_experiment_start(self, event: ExperimentEvent) -> None:
    """
    Called before an experiment starts.

    Args:
        event: Data about the experiment that is about to begin.
    """
    pass

on_experiment_end

on_experiment_end(event: ExperimentEvent, results: EvalResult) -> None

Called after an experiment completes.

Parameters:

Name Type Description Default
event ExperimentEvent

Data about the experiment that just ended.

required
results EvalResult

Final EvalResult for this experiment.

required
Source code in SRToolkit/evaluation/callbacks.py
def on_experiment_end(self, event: ExperimentEvent, results: EvalResult) -> None:
    """
    Called after an experiment completes.

    Args:
        event: Data about the experiment that just ended.
        results: Final [EvalResult][SRToolkit.utils.types.EvalResult] for this experiment.
    """
    pass

to_dict

to_dict() -> dict

Serialise this callback to a JSON-safe dictionary.

The default implementation stores only the fully-qualified class path. Override in subclasses to include constructor parameters so that from_dict can reconstruct a functionally identical instance.

Returns:

Type Description
dict

A JSON-safe dict with at least a "callback_class" key.

Source code in SRToolkit/evaluation/callbacks.py
def to_dict(self) -> dict:
    """
    Serialise this callback to a JSON-safe dictionary.

    The default implementation stores only the fully-qualified class path.
    Override in subclasses to include constructor parameters so that
    [from_dict][SRToolkit.evaluation.callbacks.SRCallbacks.from_dict] can
    reconstruct a functionally identical instance.

    Returns:
        A JSON-safe dict with at least a ``"callback_class"`` key.
    """
    return {"callback_class": f"{self.__class__.__module__}.{self.__class__.__qualname__}"}

from_dict classmethod

from_dict(d: dict) -> SRCallbacks

Reconstruct a callback from a serialised dictionary.

The default implementation calls cls() with no arguments. Override in subclasses that require constructor parameters.

Parameters:

Name Type Description Default
d dict

Dictionary produced by to_dict.

required

Returns:

Type Description
SRCallbacks

A new instance of this callback class.

Source code in SRToolkit/evaluation/callbacks.py
@classmethod
def from_dict(cls, d: dict) -> "SRCallbacks":
    """
    Reconstruct a callback from a serialised dictionary.

    The default implementation calls ``cls()`` with no arguments. Override in
    subclasses that require constructor parameters.

    Args:
        d: Dictionary produced by
            [to_dict][SRToolkit.evaluation.callbacks.SRCallbacks.to_dict].

    Returns:
        A new instance of this callback class.
    """
    return cls()
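
A sketch of how a subclass with constructor parameters could override both methods so that serialisation round-trips; the ThresholdCallback class and its threshold field are purely illustrative.

class ThresholdCallback(SRCallbacks):
    def __init__(self, threshold: float):
        self.threshold = threshold

    def to_dict(self) -> dict:
        d = super().to_dict()  # adds the "callback_class" key
        d["threshold"] = self.threshold
        return d

    @classmethod
    def from_dict(cls, d: dict) -> "ThresholdCallback":
        return cls(threshold=d["threshold"])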

ParameterEstimator

ParameterEstimator(X: ndarray, y: ndarray, symbol_library: SymbolLibrary = SymbolLibrary.default_symbols(), seed: Optional[int] = None, **kwargs: Unpack[EstimationSettings])

Fits free constants in symbolic expressions by minimizing RMSE against target values.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
>>> y = np.array([3, 0, 3, 11])
>>> pe = ParameterEstimator(X, y)
>>> rmse, constants = pe.estimate_parameters(["C", "*", "X_1", "-", "X_0"])
>>> print(rmse < 1e-6)
True
>>> print(1.99 < constants[0] < 2.01)
True

Parameters:

Name Type Description Default
X ndarray

Input data of shape (n_samples, n_features) used to evaluate expressions.

required
y ndarray

Target values of shape (n_samples,).

required
symbol_library SymbolLibrary

Symbol library defining the token vocabulary. Defaults to SymbolLibrary.default_symbols.

default_symbols()
seed Optional[int]

Random seed for reproducible constant initialization. Default None.

None
**kwargs Unpack[EstimationSettings]

Optional estimation settings from EstimationSettings. Supported keys: method, tol, gtol, max_iter, constant_bounds, initialization, max_constants, max_expr_length.

{}

Attributes:

Name Type Description
symbol_library

The symbol library used.

X

Input data.

y

Target values.

seed

Random seed.

estimation_settings

Active settings dict, merged from defaults and **kwargs.

Source code in SRToolkit/evaluation/parameter_estimator.py
def __init__(
    self,
    X: np.ndarray,
    y: np.ndarray,
    symbol_library: SymbolLibrary = SymbolLibrary.default_symbols(),
    seed: Optional[int] = None,
    **kwargs: Unpack[EstimationSettings],
) -> None:
    """
    Fits free constants in symbolic expressions by minimizing RMSE against target values.

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
        >>> y = np.array([3, 0, 3, 11])
        >>> pe = ParameterEstimator(X, y)
        >>> rmse, constants = pe.estimate_parameters(["C", "*", "X_1", "-", "X_0"])
        >>> print(rmse < 1e-6)
        True
        >>> print(1.99 < constants[0] < 2.01)
        True

    Args:
        X: Input data of shape ``(n_samples, n_features)`` used to evaluate expressions.
        y: Target values of shape ``(n_samples,)``.
        symbol_library: Symbol library defining the token vocabulary.
            Defaults to [SymbolLibrary.default_symbols][SRToolkit.utils.symbol_library.SymbolLibrary.default_symbols].
        seed: Random seed for reproducible constant initialization. Default ``None``.
        **kwargs: Optional estimation settings from
            [EstimationSettings][SRToolkit.utils.types.EstimationSettings].
            Supported keys: ``method``, ``tol``, ``gtol``, ``max_iter``,
            ``constant_bounds``, ``initialization``, ``max_constants``,
            ``max_expr_length``.

    Attributes:
        symbol_library: The symbol library used.
        X: Input data.
        y: Target values.
        seed: Random seed.
        estimation_settings: Active settings dict, merged from defaults and ``**kwargs``.
    """
    self.symbol_library = symbol_library
    self.X = X
    self.y = y
    self.seed = seed

    self.estimation_settings = {
        "method": "L-BFGS-B",
        "tol": 1e-6,
        "gtol": 1e-3,
        "max_iter": 100,
        "constant_bounds": (-5, 5),
        "initialization": "random",  # random, mean
        "max_constants": 8,
        "max_expr_length": -1,
    }

    if kwargs:
        for k in self.estimation_settings.keys():
            if k in kwargs:
                self.estimation_settings[k] = kwargs[k]  # type: ignore[literal-required]

    self._rng = np.random.default_rng(self.seed)
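
A short usage sketch showing how the documented EstimationSettings keys can be overridden through **kwargs, reusing X and y from the example above; the chosen values are arbitrary.

>>> pe = ParameterEstimator(X, y, seed=0, max_constants=3, constant_bounds=(-10, 10), initialization="mean")
>>> pe.estimation_settings["max_constants"]
3
>>> pe.estimation_settings["constant_bounds"]
(-10, 10)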

estimate_parameters

estimate_parameters(expr: Union[List[str], Node]) -> Tuple[float, np.ndarray]

Fit free constants in expr by minimizing RMSE against the target values.

Expressions that exceed max_constants or max_expr_length immediately return (NaN, []). Expressions with no free constants are evaluated directly without running the optimizer.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
>>> y = np.array([3, 0, 3, 11])
>>> pe = ParameterEstimator(X, y)
>>> rmse, constants = pe.estimate_parameters(["C", "*", "X_1", "-", "X_0"])
>>> print(rmse < 1e-6)
True
>>> print(1.99 < constants[0] < 2.01)
True
>>> # Constant-free expressions are evaluated directly
>>> rmse, constants = pe.estimate_parameters(["X_1", "-", "X_0"])
>>> constants.size
0

Parameters:

Name Type Description Default
expr Union[List[str], Node]

Expression as a token list in infix notation or a Node tree.

required

Returns:

Type Description
Tuple[float, ndarray]

A 2-tuple (rmse, parameters) where rmse is the root-mean-square error of the fitted expression and parameters is a 1-D array of optimized constant values. Returns (NaN, []) if the expression violates max_constants or max_expr_length.

Source code in SRToolkit/evaluation/parameter_estimator.py
def estimate_parameters(self, expr: Union[List[str], Node]) -> Tuple[float, np.ndarray]:
    """
    Fit free constants in *expr* by minimizing RMSE against the target values.

    Expressions that exceed ``max_constants`` or ``max_expr_length`` immediately
    return ``(NaN, [])``. Expressions with no free constants are evaluated directly
    without running the optimizer.

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
        >>> y = np.array([3, 0, 3, 11])
        >>> pe = ParameterEstimator(X, y)
        >>> rmse, constants = pe.estimate_parameters(["C", "*", "X_1", "-", "X_0"])
        >>> print(rmse < 1e-6)
        True
        >>> print(1.99 < constants[0] < 2.01)
        True
        >>> # Constant-free expressions are evaluated directly
        >>> rmse, constants = pe.estimate_parameters(["X_1", "-", "X_0"])
        >>> constants.size
        0

    Args:
        expr: Expression as a token list in infix notation or a
            [Node][SRToolkit.utils.expression_tree.Node] tree.

    Returns:
        A 2-tuple ``(rmse, parameters)`` where ``rmse`` is the root-mean-square error of the fitted expression and ``parameters`` is a 1-D array of optimized constant values. Returns ``(NaN, [])`` if the expression violates ``max_constants`` or ``max_expr_length``.
    """
    if isinstance(expr, Node):
        expr_str = expr.to_list(notation="prefix")
        num_constants = sum([1 for t in expr_str if self.symbol_library.get_type(t) == "const"])
    else:
        num_constants = sum([1 for t in expr if self.symbol_library.get_type(t) == "const"])
    if (
        isinstance(self.estimation_settings["max_constants"], int)
        and 0 <= self.estimation_settings["max_constants"] < num_constants
    ):
        return np.nan, np.array([])

    if isinstance(self.estimation_settings["max_expr_length"], int) and 0 <= self.estimation_settings[
        "max_expr_length"
    ] < len(expr):
        return np.nan, np.array([])

    executable_error_fn = expr_to_error_function(expr, self.symbol_library)

    if num_constants == 0:
        rmse = executable_error_fn(self.X, np.array([]), self.y)
        return rmse, np.array([])
    else:
        return self._optimize_parameters(executable_error_fn, num_constants)
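
The guard clauses above reject over-parameterised expressions without running the optimizer. A brief illustration, reusing X and y from the examples above and allowing zero free constants:

>>> pe = ParameterEstimator(X, y, max_constants=0)
>>> rmse, constants = pe.estimate_parameters(["C", "*", "X_1", "-", "X_0"])
>>> print(np.isnan(rmse))
True
>>> constants.size
0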

BED

BED(evaluator: SR_evaluator, scope: str = 'top', name: str = 'BED')

Bases: ResultAugmenter

Computes BED for the top models using a separate evaluator (e.g. a held-out test set).

Parameters:

Name Type Description Default
evaluator SR_evaluator

SR_evaluator used to score the models. Must be initialized with ranking_function="bed".

required
scope str

Which expressions to score.

  • "best": only the best expression.
  • "top": the best expression and all top-k models.
  • "all": everything in "top" plus all evaluated expressions.
'top'
name str

Key used in augmentations dict of EvalResult and ModelResult. Default "BED".

'BED'

Raises:

Type Description
Exception

If evaluator.ranking_function != "bed".

Source code in SRToolkit/evaluation/result_augmentation.py
def __init__(self, evaluator: SR_evaluator, scope: str = "top", name: str = "BED") -> None:  # noqa: F821
    """
    Computes BED for the top models using a separate evaluator (e.g. a held-out test set).

    Args:
        evaluator: [SR_evaluator][SRToolkit.evaluation.sr_evaluator.SR_evaluator] used to
            score the models. Must be initialized with ``ranking_function="bed"``.
        scope: Which expressions to score.

            - ``"best"``: only the best expression.
            - ``"top"``: the best expression and all top-k models.
            - ``"all"``: everything in ``"top"`` plus all evaluated expressions.
        name: Key used in
            ``augmentations`` dict of [EvalResult][SRToolkit.utils.types.EvalResult] and
            [ModelResult][SRToolkit.utils.types.ModelResult].
            Default ``"BED"``.

    Raises:
        Exception: If ``evaluator.ranking_function != "bed"``.
    """
    super().__init__(name)
    self.evaluator = evaluator

    if scope not in ["best", "top", "all"]:
        raise Exception(f"[BED augmenter] Invalid scope: {scope}. Must be one of 'best', 'top', 'all'.")
    self.scope = scope

    if self.evaluator.ranking_function != "bed":
        raise Exception("[BED augmenter] Ranking function of the evaluator must be set to 'bed' to compute BED.")

write_results

write_results(results: EvalResult) -> None

Write BED scores into results and its models.

Stores {"best_expr_bed": ...} in EvalResult augmentations and {"bed": ...} in each model's augmentations when scope is "top" or "all".

Parameters:

Name Type Description Default
results EvalResult

The EvalResult to augment.

required
Source code in SRToolkit/evaluation/result_augmentation.py
def write_results(
    self,
    results: EvalResult,
) -> None:
    """
    Write BED scores into *results* and its models.

    Stores ``{"best_expr_bed": ...}`` in
    [EvalResult][SRToolkit.utils.types.EvalResult] ``augmentations`` and
    ``{"bed": ...}`` in each model's augmentations when ``scope`` is ``"top"`` or ``"all"``.

    Args:
        results: The [EvalResult][SRToolkit.utils.types.EvalResult] to augment.
    """
    eval_data: Dict[str, Any] = {"best_expr_bed": self.evaluator.evaluate_expr(results.top_models[0].expr)}
    results.add_augmentation(self.name, eval_data, self._type)

    if self.scope == "top" or self.scope == "all":
        for model in results.top_models:
            top_model_data: Dict[str, Any] = {"bed": self.evaluator.evaluate_expr(model.expr)}
            model.add_augmentation(self.name, top_model_data, self._type)

    if self.scope == "all":
        for model in results.all_models:
            all_model_data: Dict[str, Any] = {"bed": self.evaluator.evaluate_expr(model.expr)}
            model.add_augmentation(self.name, all_model_data, self._type)

format_eval_result classmethod

format_eval_result(data: Dict[str, Any]) -> str

Format experiment-level BED data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "best_expr_bed".

required

Returns:

Type Description
str

A human-readable string, or empty string if no data is present.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_eval_result(cls, data: Dict[str, Any]) -> str:
    """
    Format experiment-level BED data for display.

    Args:
        data: Augmentation dict containing ``"best_expr_bed"``.

    Returns:
        A human-readable string, or empty string if no data is present.
    """
    val = data.get("best_expr_bed", "")
    return f"Test BED: {val}" if val != "" else ""

format_model_result classmethod

format_model_result(data: Dict[str, Any]) -> str

Format per-model BED data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "bed".

required

Returns:

Type Description
str

A human-readable string, or empty string if no data is present.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_model_result(cls, data: Dict[str, Any]) -> str:
    """
    Format per-model BED data for display.

    Args:
        data: Augmentation dict containing ``"bed"``.

    Returns:
        A human-readable string, or empty string if no data is present.
    """
    val = data.get("bed", "")
    return f"BED={val}" if val != "" else ""

to_dict

to_dict(base_path: str, name: str) -> dict

Creates a dictionary representation of the BED augmenter.

Parameters:

Name Type Description Default
base_path str

Used to save the data of the evaluator to disk.

required
name str

Used to save the data of the evaluator to disk.

required

Returns:

Type Description
dict

A dictionary containing the necessary information to recreate the augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
def to_dict(self, base_path: str, name: str) -> dict:
    """
    Creates a dictionary representation of the BED augmenter.

    Args:
        base_path: Used to save the data of the evaluator to disk.
        name: Used to save the data of the evaluator to disk.

    Returns:
        A dictionary containing the necessary information to recreate the augmenter.
    """
    return {
        "format_version": 1,
        "name": self.name,
        "type": "BED",
        "scope": self.scope,
        "evaluator": self.evaluator.to_dict(base_path, name + "_BED_augmenter"),
    }

from_dict staticmethod

from_dict(data: dict) -> BED

Creates an instance of the BED augmenter from a dictionary.

Parameters:

Name Type Description Default
data dict

A dictionary containing the necessary information to recreate the augmenter.

required

Returns:

Type Description
BED

An instance of the BED augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
@staticmethod
def from_dict(data: dict) -> "BED":
    """
    Creates an instance of the BED augmenter from a dictionary.

    Args:
        data: A dictionary containing the necessary information to recreate the augmenter.

    Returns:
        An instance of the BED augmenter.
    """
    if data.get("format_version", 1) != 1:
        raise ValueError(f"[BED.from_dict] Unsupported format_version: {data.get('format_version')!r}. Expected 1.")
    evaluator = SR_evaluator.from_dict(data["evaluator"])
    return BED(evaluator, scope=data["scope"], name=data["name"])

R2

R2(evaluator: SR_evaluator, scope: str = 'top', name: str = 'R2')

Bases: ResultAugmenter

Computes R² for the top models using a separate evaluator (e.g. a held-out test set).

The same evaluator instance can be shared with RMSE to avoid loading test data twice.

Parameters:

Name Type Description Default
evaluator SR_evaluator

SR_evaluator used to score the models. Must be initialized with ranking_function="rmse" and a non-None y.

required
scope str

Which expressions to score.

  • "best": only the best expression.
  • "top": the best expression and all top-k models.
  • "all": everything in "top" plus all evaluated expressions.
'top'
name str

Key used in augmentations dict of EvalResult and ModelResult. Default "R2".

'R2'

Raises:

Type Description
Exception

If evaluator.ranking_function != "rmse" or evaluator.y is None.

Source code in SRToolkit/evaluation/result_augmentation.py
def __init__(self, evaluator: SR_evaluator, scope: str = "top", name: str = "R2") -> None:  # noqa: F821
    """
    Computes R² for the top models using a separate evaluator (e.g. a held-out test set).

    The same evaluator instance can be shared with
    [RMSE][SRToolkit.evaluation.result_augmentation.RMSE] to avoid loading test data twice.

    Args:
        evaluator: [SR_evaluator][SRToolkit.evaluation.sr_evaluator.SR_evaluator] used to
            score the models. Must be initialized with ``ranking_function="rmse"`` and a
            non-``None`` ``y``.
        scope: Which expressions to score.

            - ``"best"``: only the best expression.
            - ``"top"``: the best expression and all top-k models.
            - ``"all"``: everything in ``"top"`` plus all evaluated expressions.
        name: Key used in
            ``augmentations`` dict of [EvalResult][SRToolkit.utils.types.EvalResult] and
            [ModelResult][SRToolkit.utils.types.ModelResult].
            Default ``"R2"``.

    Raises:
        Exception: If ``evaluator.ranking_function != "rmse"`` or ``evaluator.y is None``.
    """
    super().__init__(name)

    if scope not in ["best", "top", "all"]:
        raise Exception(f"[R2 augmenter] Invalid scope: {scope}. Must be one of 'best', 'top', 'all'.")
    self.scope = scope

    self.evaluator = evaluator
    if self.evaluator.ranking_function != "rmse":
        raise Exception("[R2 augmenter] Ranking function of the evaluator must be set to 'rmse' to compute R^2.")
    if self.evaluator.y is None:
        raise Exception("[R2 augmenter] y in the evaluator must not be None to compute R^2.")
    self.ss_tot = np.sum((self.evaluator.y - np.mean(self.evaluator.y)) ** 2)
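
_compute_r2 itself is not shown in this excerpt; the following is only an assumed reconstruction of what it computes, based on the evaluator returning an RMSE and on the ss_tot precomputed above.

def _compute_r2(self, model) -> float:
    # Assumed: RMSE of the model on the augmenter's own (e.g. test) data.
    rmse = self.evaluator.evaluate_expr(model.expr)
    n = len(self.evaluator.y)
    ss_res = n * rmse ** 2  # sum of squared residuals recovered from the RMSE
    return 1.0 - ss_res / self.ss_tot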

write_results

write_results(results: EvalResult) -> None

Write R² scores into results and its models.

Stores {"best_expr_r^2": ...} in EvalResult augmentations and {"r^2": ..., "parameters_r^2": ...} in each model's augmentations when scope is "top" or "all".

Parameters:

Name Type Description Default
results EvalResult

The EvalResult to augment.

required
Source code in SRToolkit/evaluation/result_augmentation.py
def write_results(self, results: EvalResult) -> None:
    """
    Write R² scores into *results* and its models.

    Stores ``{"best_expr_r^2": ...}`` in
    [EvalResult][SRToolkit.utils.types.EvalResult] ``augmentations`` and
    ``{"r^2": ..., "parameters_r^2": ...}`` in each model's augmentations when ``scope``
    is ``"top"`` or ``"all"``.

    Args:
        results: The [EvalResult][SRToolkit.utils.types.EvalResult] to augment.
    """
    eval_data: Dict[str, Any] = {"best_expr_r^2": self._compute_r2(results.top_models[0])}
    results.add_augmentation(self.name, eval_data, self._type)

    if self.scope == "top" or self.scope == "all":
        for model in results.top_models:
            key = "".join(model.expr)
            top_model_data: Dict[str, Any] = {
                "r^2": self._compute_r2(model),
                "parameters_r^2": self.evaluator.models[key].parameters,
            }
            model.add_augmentation(self.name, top_model_data, self._type)

    if self.scope == "all":
        for model in results.all_models:
            key = "".join(model.expr)
            all_model_data: Dict[str, Any] = {
                "r^2": self._compute_r2(model),
                "parameters_r^2": self.evaluator.models[key].parameters,
            }
            model.add_augmentation(self.name, all_model_data, self._type)

format_eval_result classmethod

format_eval_result(data: Dict[str, Any]) -> str

Format experiment-level R² data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "best_expr_r^2".

required

Returns:

Type Description
str

A human-readable string, or empty string if no data is present.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_eval_result(cls, data: Dict[str, Any]) -> str:
    """
    Format experiment-level R² data for display.

    Args:
        data: Augmentation dict containing ``"best_expr_r^2"``.

    Returns:
        A human-readable string, or empty string if no data is present.
    """
    val = data.get("best_expr_r^2", "")
    return f"Test R²: {val}" if val != "" else ""

format_model_result classmethod

format_model_result(data: Dict[str, Any]) -> str

Format per-model R² data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "r^2" and optionally "parameters_r^2".

required

Returns:

Type Description
str

A human-readable string with R² and fitted parameters.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_model_result(cls, data: Dict[str, Any]) -> str:
    """
    Format per-model R² data for display.

    Args:
        data: Augmentation dict containing ``"r^2"`` and optionally ``"parameters_r^2"``.

    Returns:
        A human-readable string with R² and fitted parameters.
    """
    parts = [f"R²={data['r^2']:.4g}"]
    if "parameters_r^2" in data and data["parameters_r^2"] is not None:
        parts.append(f"params={np.round(data['parameters_r^2'], 4).tolist()}")
    return ", ".join(parts)

to_dict

to_dict(base_path: str, name: str) -> dict

Creates a dictionary representation of the R2 augmenter.

Parameters:

Name Type Description Default
base_path str

Used to save the data of the evaluator to disk.

required
name str

Used to save the data of the evaluator to disk.

required

Returns:

Type Description
dict

A dictionary containing the necessary information to recreate the augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
def to_dict(self, base_path: str, name: str) -> dict:
    """
    Creates a dictionary representation of the R2 augmenter.

    Args:
        base_path: Used to save the data of the evaluator to disk.
        name: Used to save the data of the evaluator to disk.

    Returns:
        A dictionary containing the necessary information to recreate the augmenter.
    """
    return {
        "format_version": 1,
        "name": self.name,
        "type": "R2",
        "scope": self.scope,
        "evaluator": self.evaluator.to_dict(base_path, name + "_R2_augmenter"),
    }

from_dict staticmethod

from_dict(data: dict) -> R2

Creates an instance of the R2 augmenter from a dictionary.

Parameters:

Name Type Description Default
data dict

A dictionary containing the necessary information to recreate the augmenter.

required

Returns:

Type Description
R2

An instance of the R2 augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
@staticmethod
def from_dict(data: dict) -> "R2":
    """
    Creates an instance of the R2 augmenter from a dictionary.

    Args:
        data: A dictionary containing the necessary information to recreate the augmenter.

    Returns:
        An instance of the R2 augmenter.
    """
    if data.get("format_version", 1) != 1:
        raise ValueError(f"[R2.from_dict] Unsupported format_version: {data.get('format_version')!r}. Expected 1.")
    evaluator = SR_evaluator.from_dict(data["evaluator"])
    return R2(evaluator, scope=data["scope"], name=data["name"])

RMSE

RMSE(evaluator: SR_evaluator, scope: str = 'top', name: str = 'RMSE')

Bases: ResultAugmenter

Computes RMSE for the top models using a separate evaluator (e.g. a held-out test set).

Parameters:

Name Type Description Default
evaluator SR_evaluator

SR_evaluator used to score the models. Must be initialized with ranking_function="rmse" and a non-None y.

required
scope str

Which expressions to score.

  • "best": only the best expression.
  • "top": the best expression and all top-k models.
  • "all": everything in "top" plus all evaluated expressions.
'top'
name str

Key used in augmentations dict of EvalResult and ModelResult. Default "RMSE".

'RMSE'

Raises:

Type Description
Exception

If evaluator.ranking_function != "rmse" or evaluator.y is None.

Source code in SRToolkit/evaluation/result_augmentation.py
def __init__(self, evaluator: SR_evaluator, scope: str = "top", name: str = "RMSE") -> None:  # noqa: F821
    """
    Computes RMSE for the top models using a separate evaluator (e.g. a held-out test set).

    Args:
        evaluator: [SR_evaluator][SRToolkit.evaluation.sr_evaluator.SR_evaluator] used to
            score the models. Must be initialized with ``ranking_function="rmse"`` and a
            non-``None`` ``y``.
        scope: Which expressions to score.

            - ``"best"``: only the best expression.
            - ``"top"``: the best expression and all top-k models.
            - ``"all"``: everything in ``"top"`` plus all evaluated expressions.
        name: Key used in
            ``augmentations`` dict of [EvalResult][SRToolkit.utils.types.EvalResult] and
            [ModelResult][SRToolkit.utils.types.ModelResult].
            Default ``"RMSE"``.

    Raises:
        Exception: If ``evaluator.ranking_function != "rmse"`` or ``evaluator.y is None``.
    """
    super().__init__(name)
    self.evaluator = evaluator

    if scope not in ["best", "top", "all"]:
        raise Exception(f"[RMSE augmenter] Invalid scope: {scope}. Must be one of 'best', 'top', 'all'.")
    self.scope = scope

    if self.evaluator.ranking_function != "rmse":
        raise Exception("[RMSE augmenter] Ranking function of the evaluator must be set to 'rmse' to compute RMSE.")
    if self.evaluator.y is None:
        raise Exception("[RMSE augmenter] y in the evaluator must not be None to compute RMSE.")

write_results

write_results(results: EvalResult) -> None

Write RMSE scores into results and its models.

Stores {"min_error": ...} in EvalResult augmentations and {"error": ..., "parameters": ...} in each model's augmentations when scope is "top" or "all".

Parameters:

Name Type Description Default
results EvalResult

The EvalResult to augment.

required
Source code in SRToolkit/evaluation/result_augmentation.py
def write_results(self, results: EvalResult) -> None:
    """
    Write RMSE scores into *results* and its models.

    Stores ``{"min_error": ...}`` in
    [EvalResult][SRToolkit.utils.types.EvalResult] ``augmentations`` and
    ``{"error": ..., "parameters": ...}`` in each model's augmentations when ``scope``
    is ``"top"`` or ``"all"``.

    Args:
        results: The [EvalResult][SRToolkit.utils.types.EvalResult] to augment.
    """
    eval_data: Dict[str, Any] = {"min_error": self.evaluator.evaluate_expr(results.top_models[0].expr)}
    results.add_augmentation(self.name, eval_data, self._type)

    if self.scope == "top" or self.scope == "all":
        for model in results.top_models:
            key = "".join(model.expr)
            top_model_data: Dict[str, Any] = {
                "error": self.evaluator.evaluate_expr(model.expr),
                "parameters": self.evaluator.models[key].parameters,
            }
            model.add_augmentation(self.name, top_model_data, self._type)

    if self.scope == "all":
        for model in results.all_models:
            key = "".join(model.expr)
            all_model_data: Dict[str, Any] = {
                "error": self.evaluator.evaluate_expr(model.expr),
                "parameters": self.evaluator.models[key].parameters,
            }
            model.add_augmentation(self.name, all_model_data, self._type)

format_eval_result classmethod

format_eval_result(data: Dict[str, Any]) -> str

Format experiment-level RMSE data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "min_error".

required

Returns:

Type Description
str

A human-readable string, or empty string if no data is present.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_eval_result(cls, data: Dict[str, Any]) -> str:
    """
    Format experiment-level RMSE data for display.

    Args:
        data: Augmentation dict containing ``"min_error"``.

    Returns:
        A human-readable string, or empty string if no data is present.
    """
    val = data.get("min_error", "")
    return f"Test RMSE: {val}" if val != "" else ""

format_model_result classmethod

format_model_result(data: Dict[str, Any]) -> str

Format per-model RMSE data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "error" and optionally "parameters".

required

Returns:

Type Description
str

A human-readable string with RMSE and fitted parameters.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_model_result(cls, data: Dict[str, Any]) -> str:
    """
    Format per-model RMSE data for display.

    Args:
        data: Augmentation dict containing ``"error"`` and optionally ``"parameters"``.

    Returns:
        A human-readable string with RMSE and fitted parameters.
    """
    parts = [f"RMSE={data['error']:.6g}"]
    if "parameters" in data and data["parameters"] is not None:
        parts.append(f"params={np.round(data['parameters'], 4).tolist()}")
    return ", ".join(parts)

to_dict

to_dict(base_path: str, name: str) -> dict

Creates a dictionary representation of the RMSE augmenter.

Parameters:

Name Type Description Default
base_path str

Used to save the data of the evaluator to disk.

required
name str

Used to save the data of the evaluator to disk.

required

Returns:

Type Description
dict

A dictionary containing the necessary information to recreate the augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
def to_dict(self, base_path: str, name: str) -> dict:
    """
    Creates a dictionary representation of the RMSE augmenter.

    Args:
        base_path: Used to save the data of the evaluator to disk.
        name: Used to save the data of the evaluator to disk.

    Returns:
        A dictionary containing the necessary information to recreate the augmenter.
    """
    return {
        "format_version": 1,
        "name": self.name,
        "type": "RMSE",
        "scope": self.scope,
        "evaluator": self.evaluator.to_dict(base_path, name + "_RMSE_augmenter"),
    }

from_dict staticmethod

from_dict(data: dict) -> RMSE

Creates an instance of the RMSE augmenter from a dictionary.

Parameters:

Name Type Description Default
data dict

A dictionary containing the necessary information to recreate the augmenter.

required

Returns:

Type Description
RMSE

An instance of the RMSE augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
@staticmethod
def from_dict(data: dict) -> "RMSE":
    """
    Creates an instance of the RMSE augmenter from a dictionary.

    Args:
        data: A dictionary containing the necessary information to recreate the augmenter.

    Returns:
        An instance of the RMSE augmenter.
    """
    if data.get("format_version", 1) != 1:
        raise ValueError(
            f"[RMSE.from_dict] Unsupported format_version: {data.get('format_version')!r}. Expected 1."
        )
    evaluator = SR_evaluator.from_dict(data["evaluator"])
    return RMSE(evaluator, scope=data["scope"], name=data["name"])

EvalResult dataclass

EvalResult(min_error: float, best_expr: str, num_evaluated: int, evaluation_calls: int, top_models: List[ModelResult], all_models: List[ModelResult], approach_name: str, success: bool, dataset_name: Optional[str] = None, metadata: Optional[dict] = None, augmentations: Dict[str, Dict[str, Any]] = dict())

Result for a single SR experiment, as returned by SR_results[i].

Examples:

>>> model = ModelResult(expr=["X_0"], error=0.05)
>>> result = EvalResult(
...     min_error=0.05,
...     best_expr="X_0",
...     num_evaluated=500,
...     evaluation_calls=612,
...     top_models=[model],
...     all_models=[model],
...     approach_name="MyApproach",
...     success=True,
... )
>>> result.min_error
0.05
>>> result.success
True
>>> result.dataset_name is None
True

Attributes:

Name Type Description
min_error float

Lowest error achieved across all evaluated expressions.

best_expr str

String representation of the best expression found.

num_evaluated int

Number of unique expressions evaluated.

evaluation_calls int

Number of times evaluate_expr was called (includes cache hits).

top_models List[ModelResult]

Top-k models sorted by error.

all_models List[ModelResult]

All evaluated models sorted by error.

approach_name str

Name of the SR approach, or empty string if not provided.

success bool

Whether min_error is below the configured success_threshold.

dataset_name Optional[str]

Name of the dataset, extracted from metadata. None if not provided.

metadata Optional[dict]

Remaining metadata dict after dataset_name is popped. None if empty.

augmentations Dict[str, Dict[str, Any]]

Per-augmenter data keyed by augmenter name. Populated by ResultAugmenter subclasses via add_augmentation.

add_augmentation

add_augmentation(name: str, data: Dict[str, Any], aug_type: str) -> None

Attach augmentation data produced by a ResultAugmenter to this result.

If name is already present in augmentations, a numeric suffix is appended (name_1, name_2, …) to avoid overwriting existing data.

Examples:

>>> model = ModelResult(expr=["X_0"], error=0.05)
>>> result = EvalResult(
...     min_error=0.05, best_expr="X_0", num_evaluated=10,
...     evaluation_calls=10, top_models=[model], all_models=[model],
...     approach_name="MyApproach", success=True,
... )
>>> result.add_augmentation("complexity", {"value": 3}, "ComplexityAugmenter")
>>> result.augmentations["complexity"]["value"]
3
>>> result.add_augmentation("complexity", {"value": 5}, "ComplexityAugmenter")
>>> "complexity_1" in result.augmentations
True

Parameters:

Name Type Description Default
name str

Key under which the augmentation is stored in augmentations. A suffix is added automatically if the key already exists.

required
data Dict[str, Any]

Arbitrary dict of augmentation data. A "_type" key is injected automatically and should not be included.

required
aug_type str

Augmenter class name, stored as data["_type"].

required
Source code in SRToolkit/utils/types.py
def add_augmentation(self, name: str, data: Dict[str, Any], aug_type: str) -> None:
    """
    Attach augmentation data produced by a :class:`ResultAugmenter` to this result.

    If ``name`` is already present in :attr:`augmentations`, a numeric suffix is
    appended (``name_1``, ``name_2``, …) to avoid overwriting existing data.

    Examples:
        >>> model = ModelResult(expr=["X_0"], error=0.05)
        >>> result = EvalResult(
        ...     min_error=0.05, best_expr="X_0", num_evaluated=10,
        ...     evaluation_calls=10, top_models=[model], all_models=[model],
        ...     approach_name="MyApproach", success=True,
        ... )
        >>> result.add_augmentation("complexity", {"value": 3}, "ComplexityAugmenter")
        >>> result.augmentations["complexity"]["value"]
        3
        >>> result.add_augmentation("complexity", {"value": 5}, "ComplexityAugmenter")
        >>> "complexity_1" in result.augmentations
        True

    Args:
        name: Key under which the augmentation is stored in :attr:`augmentations`.
            A suffix is added automatically if the key already exists.
        data: Arbitrary dict of augmentation data. A ``"_type"`` key is injected
            automatically and should not be included.
        aug_type: Augmenter class name, stored as ``data["_type"]``.
    """
    resolved = name
    counter = 1
    while resolved in self.augmentations:
        resolved = f"{name}_{counter}"
        counter += 1
    data["_type"] = aug_type
    self.augmentations[resolved] = data

to_dict

to_dict() -> dict

Serialize this evaluation result to a JSON-safe dictionary.

NumPy arrays and scalars within nested ModelResult entries are converted to native Python types so the result can be passed directly to json.dump.

Examples:

>>> model = ModelResult(expr=["X_0"], error=0.05)
>>> result = EvalResult(
...     min_error=0.05, best_expr="X_0", num_evaluated=10,
...     evaluation_calls=10, top_models=[model], all_models=[model],
...     approach_name="MyApproach", success=True,
... )
>>> d = result.to_dict()
>>> d["min_error"]
0.05
>>> d["approach_name"]
'MyApproach'
>>> len(d["top_models"])
1

Returns:

Type Description
dict

A JSON-safe dictionary suitable for passing to from_dict.

Source code in SRToolkit/utils/types.py
def to_dict(self) -> dict:
    """
    Serialize this evaluation result to a JSON-safe dictionary.

    NumPy arrays and scalars within nested :class:`ModelResult` entries are
    converted to native Python types so the result can be passed directly
    to ``json.dump``.

    Examples:
        >>> model = ModelResult(expr=["X_0"], error=0.05)
        >>> result = EvalResult(
        ...     min_error=0.05, best_expr="X_0", num_evaluated=10,
        ...     evaluation_calls=10, top_models=[model], all_models=[model],
        ...     approach_name="MyApproach", success=True,
        ... )
        >>> d = result.to_dict()
        >>> d["min_error"]
        0.05
        >>> d["approach_name"]
        'MyApproach'
        >>> len(d["top_models"])
        1

    Returns:
        A JSON-safe dictionary suitable for passing to :meth:`from_dict`.
    """
    return {
        "min_error": float(self.min_error),
        "best_expr": self.best_expr,
        "num_evaluated": int(self.num_evaluated),
        "evaluation_calls": int(self.evaluation_calls),
        "top_models": [m.to_dict() for m in self.top_models],
        "all_models": [m.to_dict() for m in self.all_models],
        "approach_name": self.approach_name,
        "success": bool(self.success),
        "dataset_name": self.dataset_name,
        "metadata": self.metadata,
        "augmentations": _to_json_safe(self.augmentations),
    }

from_dict staticmethod

from_dict(data: dict) -> EvalResult

Reconstruct an EvalResult from a dictionary produced by to_dict.

Examples:

>>> model = ModelResult(expr=["X_0"], error=0.05)
>>> result = EvalResult(
...     min_error=0.05, best_expr="X_0", num_evaluated=10,
...     evaluation_calls=10, top_models=[model], all_models=[model],
...     approach_name="MyApproach", success=True,
... )
>>> result2 = EvalResult.from_dict(result.to_dict())
>>> result2.min_error
0.05
>>> result2.best_expr
'X_0'
>>> len(result2.top_models)
1

Parameters:

Name Type Description Default
data dict

Dictionary representation of an EvalResult, as produced by to_dict.

required

Returns:

Type Description
EvalResult

The reconstructed EvalResult.

Source code in SRToolkit/utils/types.py
@staticmethod
def from_dict(data: dict) -> "EvalResult":
    """
    Reconstruct an :class:`EvalResult` from a dictionary produced by :meth:`to_dict`.

    Examples:
        >>> model = ModelResult(expr=["X_0"], error=0.05)
        >>> result = EvalResult(
        ...     min_error=0.05, best_expr="X_0", num_evaluated=10,
        ...     evaluation_calls=10, top_models=[model], all_models=[model],
        ...     approach_name="MyApproach", success=True,
        ... )
        >>> result2 = EvalResult.from_dict(result.to_dict())
        >>> result2.min_error
        0.05
        >>> result2.best_expr
        'X_0'
        >>> len(result2.top_models)
        1

    Args:
        data: Dictionary representation of an :class:`EvalResult`, as produced
            by :meth:`to_dict`.

    Returns:
        The reconstructed :class:`EvalResult`.
    """
    return EvalResult(
        min_error=data["min_error"],
        best_expr=data["best_expr"],
        num_evaluated=data["num_evaluated"],
        evaluation_calls=data["evaluation_calls"],
        top_models=[ModelResult.from_dict(m) for m in data["top_models"]],
        all_models=[ModelResult.from_dict(m) for m in data["all_models"]],
        approach_name=data["approach_name"],
        success=data["success"],
        dataset_name=data.get("dataset_name"),
        metadata=data.get("metadata"),
        augmentations=_from_json_safe(data["augmentations"]),
    )
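
Because to_dict produces JSON-safe output, a result can be persisted and restored with the standard json module; a small sketch, assuming result is the EvalResult constructed in the examples above:

>>> import json
>>> payload = json.dumps(result.to_dict())
>>> restored = EvalResult.from_dict(json.loads(payload))
>>> restored.best_expr
'X_0'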

ExpressionSimplifier

ExpressionSimplifier(symbol_library: SymbolLibrary, scope: str = 'top', verbose: bool = False, name: str = 'ExpressionSimplifier')

Bases: ResultAugmenter

Algebraically simplifies expressions inside the results using SymPy.

Parameters:

Name Type Description Default
symbol_library SymbolLibrary

Symbol library used by the simplifier to resolve token types.

required
scope str

Which expressions to simplify.

  • "best": only the best expression.
  • "top": the best expression and all top-k models.
  • "all": everything in "top" plus all evaluated expressions.
'top'
verbose bool

If True, emits a warning when simplification fails for an expression. Default False.

False
name str

Key used in augmentations dict of EvalResult and ModelResult. Default "ExpressionSimplifier".

'ExpressionSimplifier'
Source code in SRToolkit/evaluation/result_augmentation.py
def __init__(
    self,
    symbol_library: SymbolLibrary,
    scope: str = "top",
    verbose: bool = False,
    name: str = "ExpressionSimplifier",
) -> None:
    """
    Algebraically simplifies expressions inside the results using SymPy.

    Args:
        symbol_library: Symbol library used by the simplifier to resolve token types.
        scope: Which expressions to simplify.

            - ``"best"``: only the best expression.
            - ``"top"``: the best expression and all top-k models.
            - ``"all"``: everything in ``"top"`` plus all evaluated expressions.
        verbose: If ``True``, emits a warning when simplification fails for an expression.
            Default ``False``.
        name: Key used in
            ``augmentations`` dict of [EvalResult][SRToolkit.utils.types.EvalResult] and
            [ModelResult][SRToolkit.utils.types.ModelResult].
            Default ``"ExpressionSimplifier"``.
    """
    super().__init__(name)
    self.symbol_library = symbol_library

    if scope not in ["best", "top", "all"]:
        raise Exception(f"[ExpressionSimplifier] Invalid scope: {scope}. Must be one of 'best', 'top', 'all'.")
    self.scope = scope

    self.verbose = verbose
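
A minimal usage sketch, assuming ModelResult and EvalResult from SRToolkit.utils.types and the default symbol library. Whether a simplified form is stored depends on SymPy succeeding for the given tokens, so only the presence of the augmentation keys is checked here:

>>> from SRToolkit.utils.symbol_library import SymbolLibrary
>>> model = ModelResult(expr=["X_0", "+", "X_0"], error=0.1)
>>> result = EvalResult(
...     min_error=0.1, best_expr="X_0+X_0", num_evaluated=1,
...     evaluation_calls=1, top_models=[model], all_models=[model],
...     approach_name="demo", success=True,
... )
>>> simplifier = ExpressionSimplifier(SymbolLibrary.default_symbols())
>>> simplifier.write_results(result)
>>> "ExpressionSimplifier" in result.augmentations
True
>>> "ExpressionSimplifier" in result.top_models[0].augmentations
True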

write_results

write_results(results: EvalResult) -> None

Write simplified expressions into results and its models.

Stores {"simplified_best_expr": ...} in EvalResult augmentations if simplification succeeds. Also stores {"simplified_expr": ...} in each model's augmentations when scope is "top" or "all".

Parameters:

Name Type Description Default
results EvalResult

The EvalResult to augment.

required
Source code in SRToolkit/evaluation/result_augmentation.py
def write_results(self, results: EvalResult) -> None:
    """
    Write simplified expressions into *results* and its models.

    Stores ``{"simplified_best_expr": ...}`` in
    [EvalResult][SRToolkit.utils.types.EvalResult] ``augmentations`` if
    simplification succeeds. Also stores ``{"simplified_expr": ...}`` in each model's
    augmentations when ``scope`` is ``"top"`` or ``"all"``.

    Args:
        results: The [EvalResult][SRToolkit.utils.types.EvalResult] to augment.
    """
    eval_data: Dict[str, Any] = {}
    try:
        simplified_expr = simplify(results.top_models[0].expr, self.symbol_library)
        if isinstance(simplified_expr, list):
            eval_data["simplified_best_expr"] = "".join(simplified_expr)
        elif isinstance(simplified_expr, Node):
            eval_data["simplified_best_expr"] = "".join(simplified_expr.to_list(self.symbol_library))
        else:
            raise Exception(f"Simplified expression is not a list or Node: {simplified_expr}")
    except Exception as e:
        if self.verbose:
            warnings.warn(f"Unable to simplify {results.best_expr}: {e}")
    results.add_augmentation(self.name, eval_data, self._type)

    if self.scope == "top" or self.scope == "all":
        for model in results.top_models:
            top_model_data: Dict[str, Any] = {}
            try:
                simplified_expr = simplify(model.expr, self.symbol_library)
                if isinstance(simplified_expr, list):
                    top_model_data["simplified_expr"] = "".join(simplified_expr)
                elif isinstance(simplified_expr, Node):
                    top_model_data["simplified_expr"] = "".join(simplified_expr.to_list(self.symbol_library))
                else:
                    raise Exception(f"Simplified expression is not a list or Node: {simplified_expr}")
            except Exception as e:
                if self.verbose:
                    warnings.warn(f"Unable to simplify {''.join(model.expr)}: {e}")
            model.add_augmentation(self.name, top_model_data, self._type)

    if self.scope == "all":
        for model in results.all_models:
            all_model_data: Dict[str, Any] = {}
            try:
                simplified_expr = simplify(model.expr, self.symbol_library)
                if isinstance(simplified_expr, list):
                    all_model_data["simplified_expr"] = "".join(simplified_expr)
                elif isinstance(simplified_expr, Node):
                    all_model_data["simplified_expr"] = "".join(simplified_expr.to_list(self.symbol_library))
                else:
                    raise Exception(f"Simplified expression is not a list or Node: {simplified_expr}")
            except Exception as e:
                if self.verbose:
                    warnings.warn(f"Unable to simplify {''.join(model.expr)}: {e}")
            model.add_augmentation(self.name, all_model_data, self._type)

format_eval_result classmethod

format_eval_result(data: Dict[str, Any]) -> str

Format experiment-level simplification data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "simplified_best_expr".

required

Returns:

Type Description
str

A human-readable string, or empty string if no data is present.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_eval_result(cls, data: Dict[str, Any]) -> str:
    """
    Format experiment-level simplification data for display.

    Args:
        data: Augmentation dict containing ``"simplified_best_expr"``.

    Returns:
        A human-readable string, or empty string if no data is present.
    """
    simplified = data.get("simplified_best_expr", "")
    return f"Simplified: {simplified}" if simplified else ""
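
For example, with augmentation data shaped like the output of write_results (hypothetical values):

>>> ExpressionSimplifier.format_eval_result({"simplified_best_expr": "2*X_0", "_type": "ExpressionSimplifier"})
'Simplified: 2*X_0'
>>> ExpressionSimplifier.format_eval_result({"_type": "ExpressionSimplifier"})
''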

format_model_result classmethod

format_model_result(data: Dict[str, Any]) -> str

Format per-model simplification data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "simplified_expr".

required

Returns:

Type Description
str

A human-readable string, or empty string if no data is present.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_model_result(cls, data: Dict[str, Any]) -> str:
    """
    Format per-model simplification data for display.

    Args:
        data: Augmentation dict containing ``"simplified_expr"``.

    Returns:
        A human-readable string, or empty string if no data is present.
    """
    simplified = data.get("simplified_expr", "")
    return f"Simplified: {simplified}" if simplified else ""

to_dict

to_dict(base_path: str, name: str) -> dict

Creates a dictionary representation of the ExpressionSimplifier augmenter.

Parameters:

Name Type Description Default
base_path str

Unused and ignored.

required
name str

Unused and ignored.

required

Returns:

Type Description
dict

A dictionary containing the necessary information to recreate the augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
def to_dict(self, base_path: str, name: str) -> dict:
    """
    Creates a dictionary representation of the ExpressionSimplifier augmenter.

    Args:
        base_path: Unused and ignored
        name: Unused and ignored

    Returns:
        A dictionary containing the necessary information to recreate the augmenter.
    """
    return {
        "format_version": 1,
        "type": "ExpressionSimplifier",
        "name": self.name,
        "symbol_library": self.symbol_library.to_dict(),
        "scope": self.scope,
        "verbose": self.verbose,
    }

from_dict staticmethod

from_dict(data: dict) -> ExpressionSimplifier

Creates an instance of the ExpressionSimplifier augmenter from a dictionary.

Parameters:

Name Type Description Default
data dict

A dictionary containing the necessary information to recreate the augmenter.

required

Returns:

Type Description
ExpressionSimplifier

An instance of the ExpressionSimplifier augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
@staticmethod
def from_dict(data: dict) -> "ExpressionSimplifier":
    """
    Creates an instance of the ExpressionSimplifier augmenter from a dictionary.

    Args:
        data: A dictionary containing the necessary information to recreate the augmenter.

    Returns:
        An instance of the ExpressionSimplifier augmenter.
    """
    if data.get("format_version", 1) != 1:
        raise ValueError(
            f"[ExpressionSimplifier.from_dict] Unsupported format_version: {data.get('format_version')!r}. Expected 1."
        )
    return ExpressionSimplifier(
        symbol_library=data["symbol_library"],
        scope=data["scope"],
        verbose=data["verbose"],
        name=data["name"],
    )

ExpressionToLatex

ExpressionToLatex(symbol_library: SymbolLibrary, scope: str = 'top', verbose: bool = False, name: str = 'ExpressionToLatex')

Bases: ResultAugmenter

Converts expressions inside the results to LaTeX strings.

Parameters:

Name Type Description Default
symbol_library SymbolLibrary

Symbol library used to produce LaTeX templates for each token.

required
scope str

Which expressions to convert.

  • "best": only the best expression.
  • "top": the best expression and all top-k models.
  • "all": everything in "top" plus all evaluated expressions.
'top'
verbose bool

If True, emits a warning when LaTeX conversion fails for an expression. Default False.

False
name str

Key used in augmentations dict of EvalResult and ModelResult. Default "ExpressionToLatex".

'ExpressionToLatex'
Source code in SRToolkit/evaluation/result_augmentation.py
def __init__(
    self,
    symbol_library: SymbolLibrary,
    scope: str = "top",
    verbose: bool = False,
    name: str = "ExpressionToLatex",
) -> None:
    """
    Converts expressions inside the results to LaTeX strings.

    Args:
        symbol_library: Symbol library used to produce LaTeX templates for each token.
        scope: Which expressions to convert.

            - ``"best"``: only the best expression.
            - ``"top"``: the best expression and all top-k models.
            - ``"all"``: everything in ``"top"`` plus all evaluated expressions.
        verbose: If ``True``, emits a warning when LaTeX conversion fails for an expression.
            Default ``False``.
        name: Key used in
            ``augmentations`` dict of [EvalResult][SRToolkit.utils.types.EvalResult] and
            [ModelResult][SRToolkit.utils.types.ModelResult].
            Default ``"ExpressionToLatex"``.
    """
    super().__init__(name)
    self.symbol_library = symbol_library

    if scope not in ["best", "top", "all"]:
        raise Exception(f"[ExpressionToLatex] Invalid scope: {scope}. Must be one of 'best', 'top', 'all'.")
    self.scope = scope

    self.verbose = verbose
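
A minimal usage sketch, assuming ModelResult and EvalResult from SRToolkit.utils.types and the default symbol library. The experiment-level augmentation key is always present; the "best_expr_latex" field is only filled when conversion succeeds:

>>> from SRToolkit.utils.symbol_library import SymbolLibrary
>>> model = ModelResult(expr=["X_0", "*", "X_1"], error=0.1)
>>> result = EvalResult(
...     min_error=0.1, best_expr="X_0*X_1", num_evaluated=1,
...     evaluation_calls=1, top_models=[model], all_models=[model],
...     approach_name="demo", success=True,
... )
>>> latexer = ExpressionToLatex(SymbolLibrary.default_symbols())
>>> latexer.write_results(result)
>>> "ExpressionToLatex" in result.augmentations
True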

write_results

write_results(results: EvalResult) -> None

Write LaTeX representations into results and its models.

Stores {"best_expr_latex": ...} in EvalResult augmentations. Also stores {"expr_latex": ...} in each model's augmentations when scope is "top" or "all".

Parameters:

Name Type Description Default
results EvalResult

The EvalResult to augment.

required
Source code in SRToolkit/evaluation/result_augmentation.py
def write_results(self, results: EvalResult) -> None:
    """
    Write LaTeX representations into *results* and its models.

    Stores ``{"best_expr_latex": ...}`` in
    [EvalResult][SRToolkit.utils.types.EvalResult] ``augmentations``.
    Also stores ``{"expr_latex": ...}`` in each model's augmentations when
    ``scope`` is ``"top"`` or ``"all"``.

    Args:
        results: The [EvalResult][SRToolkit.utils.types.EvalResult] to augment.
    """
    eval_data: Dict[str, Any] = {}
    try:
        eval_data["best_expr_latex"] = tokens_to_tree(results.top_models[0].expr, self.symbol_library).to_latex(
            self.symbol_library
        )
    except Exception as e:
        if self.verbose:
            warnings.warn(f"Unable to convert best expression to LaTeX: {e}")
    results.add_augmentation(self.name, eval_data, self._type)

    if self.scope == "top" or self.scope == "all":
        for model in results.top_models:
            try:
                model.add_augmentation(
                    self.name,
                    {"expr_latex": tokens_to_tree(model.expr, self.symbol_library).to_latex(self.symbol_library)},
                    self._type,
                )
            except Exception as e:
                if self.verbose:
                    warnings.warn(f"Unable to convert expression {''.join(model.expr)} to LaTeX: {e}")

    if self.scope == "all":
        for model in results.all_models:
            try:
                model.add_augmentation(
                    self.name,
                    {"expr_latex": tokens_to_tree(model.expr, self.symbol_library).to_latex(self.symbol_library)},
                    self._type,
                )
            except Exception as e:
                if self.verbose:
                    warnings.warn(f"Unable to convert expression {''.join(model.expr)} to LaTeX: {e}")

format_eval_result classmethod

format_eval_result(data: Dict[str, Any]) -> str

Format experiment-level LaTeX augmentation data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "best_expr_latex".

required

Returns:

Type Description
str

A human-readable string, or empty string if no data is present.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_eval_result(cls, data: Dict[str, Any]) -> str:
    """
    Format experiment-level LaTeX augmentation data for display.

    Args:
        data: Augmentation dict containing ``"best_expr_latex"``.

    Returns:
        A human-readable string, or empty string if no data is present.
    """
    latex = data.get("best_expr_latex", "")
    return f"LaTeX of the best expression: {latex}" if latex else ""

format_model_result classmethod

format_model_result(data: Dict[str, Any]) -> str

Format per-model LaTeX augmentation data for display.

Parameters:

Name Type Description Default
data Dict[str, Any]

Augmentation dict containing "expr_latex".

required

Returns:

Type Description
str

A human-readable string, or empty string if no data is present.

Source code in SRToolkit/evaluation/result_augmentation.py
@classmethod
def format_model_result(cls, data: Dict[str, Any]) -> str:
    """
    Format per-model LaTeX augmentation data for display.

    Args:
        data: Augmentation dict containing ``"expr_latex"``.

    Returns:
        A human-readable string, or empty string if no data is present.
    """
    latex = data.get("expr_latex", "")
    return f"LaTeX: {latex}" if latex else ""

to_dict

to_dict(base_path: str, name: str) -> dict

Creates a dictionary representation of the ExpressionToLatex augmenter.

Parameters:

Name Type Description Default
base_path str

Unused and ignored.

required
name str

Unused and ignored.

required

Returns:

Type Description
dict

A dictionary containing the necessary information to recreate the augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
def to_dict(self, base_path: str, name: str) -> dict:
    """
    Creates a dictionary representation of the ExpressionToLatex augmenter.

    Args:
        base_path: Unused and ignored
        name: Unused and ignored

    Returns:
        A dictionary containing the necessary information to recreate the augmenter.
    """
    return {
        "format_version": 1,
        "type": "ExpressionToLatex",
        "name": self.name,
        "symbol_library": self.symbol_library.to_dict(),
        "scope": self.scope,
        "verbose": self.verbose,
    }

from_dict staticmethod

from_dict(data: dict) -> ExpressionToLatex

Creates an instance of the ExpressionToLatex augmenter from a dictionary.

Parameters:

Name Type Description Default
data dict

A dictionary containing the necessary information to recreate the augmenter.

required

Returns:

Type Description
ExpressionToLatex

An instance of the ExpressionToLatex augmenter.

Source code in SRToolkit/evaluation/result_augmentation.py
@staticmethod
def from_dict(data: dict) -> "ExpressionToLatex":
    """
    Creates an instance of the ExpressionToLatex augmenter from a dictionary.

    Args:
        data: A dictionary containing the necessary information to recreate the augmenter.

    Returns:
        An instance of the ExpressionToLatex augmenter.
    """
    if data.get("format_version", 1) != 1:
        raise ValueError(
            f"[ExpressionToLatex.from_dict] Unsupported format_version: {data.get('format_version')!r}. Expected 1."
        )
    return ExpressionToLatex(
        symbol_library=data["symbol_library"],
        scope=data["scope"],
        verbose=data["verbose"],
        name=data["name"],
    )

ModelResult dataclass

ModelResult(expr: List[str], error: float, parameters: Optional[ndarray] = None, augmentations: Dict[str, Dict[str, Any]] = dict())

A single model entry in EvalResult.top_models and EvalResult.all_models.

Examples:

>>> result = ModelResult(expr=["C", "*", "X_0"], error=0.42)
>>> result.expr
['C', '*', 'X_0']
>>> result.error
0.42
>>> result.parameters is None
True

Attributes:

Name Type Description
expr List[str]

Token list representing the expression, e.g. ["C", "*", "X_0"].

error float

Numeric error under the ranking function (RMSE or BED).

parameters Optional[ndarray]

Fitted constant values. Present for RMSE ranking only, None otherwise.

augmentations Dict[str, Dict[str, Any]]

Per-augmenter data keyed by augmenter name. Populated by ResultAugmenter subclasses via add_augmentation.

add_augmentation

add_augmentation(name: str, data: Dict[str, Any], aug_type: str) -> None

Attach augmentation data produced by a ResultAugmenter to this result.

If name is already present in augmentations, a numeric suffix is appended (name_1, name_2, …) to avoid overwriting existing data.

Examples:

>>> result = ModelResult(expr=["X_0"], error=0.1)
>>> result.add_augmentation("latex", {"value": "$X_0$"}, "LaTeXAugmenter")
>>> result.augmentations["latex"]["value"]
'$X_0$'
>>> result.add_augmentation("latex", {"value": "$X_0$"}, "LaTeXAugmenter")
>>> "latex_1" in result.augmentations
True

Parameters:

Name Type Description Default
name str

Key under which the augmentation is stored in augmentations. A suffix is added automatically if the key already exists.

required
data Dict[str, Any]

Arbitrary dict of augmentation data. A "_type" key is injected automatically and should not be included.

required
aug_type str

Augmenter class name, stored as data["_type"].

required
Source code in SRToolkit/utils/types.py
def add_augmentation(self, name: str, data: Dict[str, Any], aug_type: str) -> None:
    """
    Attach augmentation data produced by a :class:`ResultAugmenter` to this result.

    If ``name`` is already present in :attr:`augmentations`, a numeric suffix is
    appended (``name_1``, ``name_2``, …) to avoid overwriting existing data.

    Examples:
        >>> result = ModelResult(expr=["X_0"], error=0.1)
        >>> result.add_augmentation("latex", {"value": "$X_0$"}, "LaTeXAugmenter")
        >>> result.augmentations["latex"]["value"]
        '$X_0$'
        >>> result.add_augmentation("latex", {"value": "$X_0$"}, "LaTeXAugmenter")
        >>> "latex_1" in result.augmentations
        True

    Args:
        name: Key under which the augmentation is stored in :attr:`augmentations`.
            A suffix is added automatically if the key already exists.
        data: Arbitrary dict of augmentation data. A ``"_type"`` key is injected
            automatically and should not be included.
        aug_type: Augmenter class name, stored as ``data["_type"]``.
    """
    resolved = name
    counter = 1
    while resolved in self.augmentations:
        resolved = f"{name}_{counter}"
        counter += 1
    data["_type"] = aug_type
    self.augmentations[resolved] = data

to_dict

to_dict() -> dict

Serialize this model result to a JSON-safe dictionary.

NumPy arrays and scalars are converted to native Python types so the result can be passed directly to json.dump.

Examples:

>>> result = ModelResult(expr=["X_0", "+", "C"], error=0.25)
>>> d = result.to_dict()
>>> d["expr"]
['X_0', '+', 'C']
>>> d["error"]
0.25
>>> d["parameters"] is None
True

Returns:

Type Description
dict

A JSON-safe dictionary suitable for passing to from_dict.

Source code in SRToolkit/utils/types.py
def to_dict(self) -> dict:
    """
    Serialize this model result to a JSON-safe dictionary.

    NumPy arrays and scalars are converted to native Python types so the
    result can be passed directly to ``json.dump``.

    Examples:
        >>> result = ModelResult(expr=["X_0", "+", "C"], error=0.25)
        >>> d = result.to_dict()
        >>> d["expr"]
        ['X_0', '+', 'C']
        >>> d["error"]
        0.25
        >>> d["parameters"] is None
        True

    Returns:
        A JSON-safe dictionary suitable for passing to :meth:`from_dict`.
    """
    return {
        "expr": self.expr,
        "error": float(self.error),
        "parameters": _to_json_safe(self.parameters),
        "augmentations": _to_json_safe(self.augmentations),
    }

from_dict staticmethod

from_dict(data: dict) -> ModelResult

Reconstruct a ModelResult from a dictionary produced by to_dict.

Examples:

>>> result = ModelResult(expr=["X_0", "+", "C"], error=0.25)
>>> result2 = ModelResult.from_dict(result.to_dict())
>>> result2.expr
['X_0', '+', 'C']
>>> result2.error
0.25

Parameters:

Name Type Description Default
data dict

Dictionary representation of a ModelResult, as produced by to_dict.

required

Returns:

Type Description
ModelResult

The reconstructed ModelResult.

Source code in SRToolkit/utils/types.py
@staticmethod
def from_dict(data: dict) -> "ModelResult":
    """
    Reconstruct a :class:`ModelResult` from a dictionary produced by :meth:`to_dict`.

    Examples:
        >>> result = ModelResult(expr=["X_0", "+", "C"], error=0.25)
        >>> result2 = ModelResult.from_dict(result.to_dict())
        >>> result2.expr
        ['X_0', '+', 'C']
        >>> result2.error
        0.25

    Args:
        data: Dictionary representation of a :class:`ModelResult`, as produced
            by :meth:`to_dict`.

    Returns:
        The reconstructed :class:`ModelResult`.
    """
    return ModelResult(
        expr=data["expr"],
        error=data["error"],
        parameters=_from_json_safe(data["parameters"]),
        augmentations=_from_json_safe(data["augmentations"]),
    )

ResultAugmenter

ResultAugmenter(name: str)

Bases: ABC

Base class for result augmenters. Subclasses implement write_results to compute and store additional data in an EvalResult via add_augmentation.

For concrete implementations, see result_augmentation.

Parameters:

Name Type Description Default
name str

Identifier used as the key in augmentations dict of EvalResult and ModelResult. If two augmenters share the same name, add_augmentation appends a numeric suffix automatically.

required
Source code in SRToolkit/evaluation/sr_evaluator.py
def __init__(self, name: str):
    """
    Base class for result augmenters. Subclasses implement
    [write_results][SRToolkit.evaluation.sr_evaluator.ResultAugmenter.write_results] to compute
    and store additional data in an [EvalResult][SRToolkit.utils.types.EvalResult] via
    [add_augmentation][SRToolkit.utils.types.EvalResult.add_augmentation].

    For concrete implementations, see
    [result_augmentation][SRToolkit.evaluation.result_augmentation].

    Args:
        name: Identifier used as the key in
            ``augmentations`` dict of [EvalResult][SRToolkit.utils.types.EvalResult] and
            [ModelResult][SRToolkit.utils.types.ModelResult].
            If two augmenters share the same name,
            [add_augmentation][SRToolkit.utils.types.EvalResult.add_augmentation] appends a
            numeric suffix automatically.
    """
    self.name = name
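
A minimal sketch of a custom augmenter built only on the documented interface (write_results, to_dict, and add_augmentation). The class name, the stored "length" field, and the getattr fallback for the _type attribute are illustrative assumptions rather than part of the library:

class ExpressionLengthAugmenter(ResultAugmenter):
    """Stores the token count of the best expression (illustrative only)."""

    def __init__(self, name: str = "ExpressionLength"):
        super().__init__(name)

    def write_results(self, results: "EvalResult") -> None:
        if results.top_models:
            # _type is documented as the augmenter class name; fall back to it if the base class does not set it.
            aug_type = getattr(self, "_type", type(self).__name__)
            results.add_augmentation(self.name, {"length": len(results.top_models[0].expr)}, aug_type)

    def to_dict(self, base_path: str, name: str) -> dict:
        return {"format_version": 1, "type": "ExpressionLengthAugmenter", "name": self.name}

    @staticmethod
    def from_dict(data: dict) -> "ExpressionLengthAugmenter":
        return ExpressionLengthAugmenter(name=data["name"])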

write_results abstractmethod

write_results(results: EvalResult) -> None

Compute and write augmentation data into results and its models.

Call results.add_augmentation(self.name, data, self._type) for experiment-level data and model.add_augmentation(self.name, data, self._type) for per-model data.

Parameters:

Name Type Description Default
results EvalResult

The EvalResult to augment.

required
Source code in SRToolkit/evaluation/sr_evaluator.py
@abstractmethod
def write_results(
    self,
    results: "EvalResult",
) -> None:
    """
    Compute and write augmentation data into *results* and its models.

    Call ``results.add_augmentation(self.name, data, self._type)`` for experiment-level
    data and ``model.add_augmentation(self.name, data, self._type)`` for per-model data.

    Args:
        results: The [EvalResult][SRToolkit.utils.types.EvalResult] to augment.
    """

to_dict abstractmethod

to_dict(base_path: str, name: str) -> dict

Transforms the augmenter into a dictionary. This is used for saving the augmenter to disk.

Parameters:

Name Type Description Default
base_path str

The base path used for saving the data inside the augmenter, if needed.

required
name str

The name/identifier used by the augmenter for saving.

required

Returns:

Type Description
dict

A dictionary containing the necessary information to recreate the augmenter.

Source code in SRToolkit/evaluation/sr_evaluator.py
@abstractmethod
def to_dict(self, base_path: str, name: str) -> dict:
    """
    Transforms the augmenter into a dictionary. This is used for saving the augmenter to disk.

    Args:
        base_path: The base path used for saving the data inside the augmenter, if needed.
        name: The name/identifier used by the augmenter for saving.

    Returns:
        A dictionary containing the necessary information to recreate the augmenter.
    """

format_eval_result classmethod

format_eval_result(data: Dict[str, Any]) -> str

Returns a formatted string for experiment-level augmentation data.

Subclasses override this for custom formatting. The data dict is the inner augmentation dictionary (includes _type).

Parameters:

Name Type Description Default
data Dict[str, Any]

The augmentation data dictionary.

required

Returns:

Type Description
str

A formatted string, or empty string if no relevant data exists.

Source code in SRToolkit/evaluation/sr_evaluator.py
@classmethod
def format_eval_result(cls, data: Dict[str, Any]) -> str:
    """
    Returns a formatted string for experiment-level augmentation data.

    Subclasses override this for custom formatting. The *data* dict is the inner
    augmentation dictionary (includes ``_type``).

    Args:
        data: The augmentation data dictionary.

    Returns:
        A formatted string, or empty string if no relevant data exists.
    """
    return "\n".join(f"  {k}: {v}" for k, v in data.items() if k != "_type")
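
For example, the default experiment-level formatting turns each key/value pair into an indented line (hypothetical data):

>>> ResultAugmenter.format_eval_result({"rmse": 0.01, "_type": "RMSE"})
'  rmse: 0.01'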

format_model_result classmethod

format_model_result(data: Dict[str, Any]) -> str

Returns a formatted string for a single model's augmentation data.

Subclasses override this for custom formatting. The data dict is the inner augmentation dictionary (includes _type).

Parameters:

Name Type Description Default
data Dict[str, Any]

The augmentation data dictionary.

required

Returns:

Type Description
str

A formatted string, or empty string if no relevant data exists.

Source code in SRToolkit/evaluation/sr_evaluator.py
@classmethod
def format_model_result(cls, data: Dict[str, Any]) -> str:
    """
    Returns a formatted string for a single model's augmentation data.

    Subclasses override this for custom formatting. The *data* dict is the inner
    augmentation dictionary (includes ``_type``).

    Args:
        data: The augmentation data dictionary.

    Returns:
        A formatted string, or empty string if no relevant data exists.
    """
    parts = [f"{k}={v}" for k, v in data.items() if k != "_type"]
    return ", ".join(parts)

from_dict staticmethod

from_dict(data: dict) -> ResultAugmenter

Creates an instance of the ResultAugmenter class from the dictionary with the relevant data.

Subclasses should override this method if they support serialization. The default implementation raises NotImplementedError, allowing custom augmenters to skip serialization if not needed.

Parameters:

Name Type Description Default
data dict

The dictionary containing the data needed to recreate the augmenter.

required

Returns:

Type Description
ResultAugmenter

An instance of the ResultAugmenter class with the same configuration as in the data dictionary.

Raises:

Type Description
NotImplementedError

If the subclass does not implement this method.

Source code in SRToolkit/evaluation/sr_evaluator.py
@staticmethod
def from_dict(data: dict) -> "ResultAugmenter":
    """
    Creates an instance of the ResultAugmenter class from the dictionary with the relevant data.

    Subclasses should override this method if they support serialization. The default
    implementation raises ``NotImplementedError``, allowing custom augmenters to skip
    serialization if not needed.

    Args:
        data: the dictionary containing the data needed to recreate the augmenter.

    Returns:
        An instance of the ResultAugmenter class with the same configuration as in the data dictionary.

    Raises:
        NotImplementedError: If the subclass does not implement this method.
    """
    raise NotImplementedError(
        "from_dict is not implemented for this augmenter. "
        "Override this method if your augmenter supports serialization."
    )

SR_evaluator

SR_evaluator(X: ndarray, y: Optional[ndarray] = None, symbol_library: SymbolLibrary = SymbolLibrary.default_symbols(), max_evaluations: int = -1, success_threshold: Optional[float] = None, ranking_function: str = 'rmse', ground_truth: Optional[Union[List[str], Node, ndarray]] = None, seed: Optional[int] = None, metadata: Optional[dict] = None, **kwargs: Unpack[EstimationSettings])

Evaluates symbolic regression expressions and ranks them by RMSE or Behavioral Expression Distance (BED).

Previously evaluated expressions are cached so repeated calls with the same expression are free. Results are collected via get_results.

Note

Determining whether two expressions are semantically equivalent is undecidable. Random sampling, parameter fitting, and numerical errors all make the success_threshold only a proxy for success — we recommend inspecting the best expression manually.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y)
>>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
>>> print(rmse < 1e-6)
True

Parameters:

Name Type Description Default
X ndarray

Input data of shape (n_samples, n_features).

required
y Optional[ndarray]

Target values of shape (n_samples,). Required when ranking_function="rmse".

None
symbol_library SymbolLibrary

Symbol library defining the token vocabulary. Defaults to SymbolLibrary.default_symbols.

default_symbols()
max_evaluations int

Maximum number of expressions to evaluate. -1 means no limit. Default -1.

-1
success_threshold Optional[float]

Error value below which an expression is considered successful. If None, defaults to 1e-7 for RMSE and is auto-calculated for BED by evaluating the ground truth against itself 100 times and taking max(distances) * 1.1. If less than 0, no threshold is used.

None
ranking_function str

"rmse" or "bed". Default "rmse".

'rmse'
ground_truth Optional[Union[List[str], Node, ndarray]]

Required when ranking_function="bed". The target expression as a token list, a Node tree, or a pre-computed behavior matrix (see create_behavior_matrix).

None
seed Optional[int]

Random seed for reproducible sampling. Default None.

None
metadata Optional[dict]

Optional dict with information about this evaluation (e.g. dataset name, seed). If a "dataset_name" key is present it is extracted into EvalResult dataset_name.

None
**kwargs Unpack[EstimationSettings]

Optional settings from EstimationSettings. Supported keys: method, tol, gtol, max_iter, constant_bounds, initialization, max_constants, max_expr_length, num_points_sampled, bed_X, num_consts_sampled, domain_bounds.

{}

Attributes:

Name Type Description
models

Cached ModelResult for every evaluated expression, keyed by the concatenated token string.

invalid

Token strings of expressions that raised an exception during evaluation.

ground_truth

The target expression passed at construction (BED mode).

gt_behavior

Pre-computed behavior matrix for the ground truth (BED mode).

max_evaluations

Maximum number of expressions to evaluate.

bed_evaluation_parameters

Active BED evaluation settings dict.

metadata

Metadata dict passed at construction.

symbol_library

The symbol library used.

total_evaluations

Number of times evaluate_expr has been called, including cache hits.

seed

Random seed.

parameter_estimator

ParameterEstimator instance used in RMSE mode.

ranking_function

Active ranking function ("rmse" or "bed").

success_threshold

Error threshold for determining success.

Source code in SRToolkit/evaluation/sr_evaluator.py
def __init__(
    self,
    X: np.ndarray,
    y: Optional[np.ndarray] = None,
    symbol_library: SymbolLibrary = SymbolLibrary.default_symbols(),
    max_evaluations: int = -1,
    success_threshold: Optional[float] = None,
    ranking_function: str = "rmse",
    ground_truth: Optional[Union[List[str], Node, np.ndarray]] = None,
    seed: Optional[int] = None,
    metadata: Optional[dict] = None,
    **kwargs: Unpack[EstimationSettings],
):
    """
    Evaluates symbolic regression expressions and ranks them by RMSE or Behavioral Expression Distance (BED).

    Previously evaluated expressions are cached so repeated calls with the same expression
    are free. Results are collected via
    [get_results][SRToolkit.evaluation.sr_evaluator.SR_evaluator.get_results].

    Note:
        Determining whether two expressions are semantically equivalent is undecidable.
        Random sampling, parameter fitting, and numerical errors all make the
        ``success_threshold`` only a proxy for success — we recommend inspecting the best
        expression manually.

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y)
        >>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
        >>> print(rmse < 1e-6)
        True

    Args:
        X: Input data of shape ``(n_samples, n_features)``.
        y: Target values of shape ``(n_samples,)``. Required when ``ranking_function="rmse"``.
        symbol_library: Symbol library defining the token vocabulary.
            Defaults to [SymbolLibrary.default_symbols][SRToolkit.utils.symbol_library.SymbolLibrary.default_symbols].
        max_evaluations: Maximum number of expressions to evaluate. ``-1`` means no limit.
            Default ``-1``.
        success_threshold: Error value below which an expression is considered successful.
            If ``None``, defaults to ``1e-7`` for RMSE and is auto-calculated for BED by
            evaluating the ground truth against itself 100 times and taking
            ``max(distances) * 1.1``. If less than 0, no threshold is used.
        ranking_function: ``"rmse"`` or ``"bed"``. Default ``"rmse"``.
        ground_truth: Required when ``ranking_function="bed"``. The target expression as a
            token list, a [Node][SRToolkit.utils.expression_tree.Node] tree, or a pre-computed
            behavior matrix (see
            [create_behavior_matrix][SRToolkit.utils.measures.create_behavior_matrix]).
        seed: Random seed for reproducible sampling. Default ``None``.
        metadata: Optional dict with information about this evaluation (e.g. dataset name,
            seed). If a ``"dataset_name"`` key is present it is extracted into
            [EvalResult][SRToolkit.utils.types.EvalResult] ``dataset_name``.
        **kwargs: Optional settings from
            [EstimationSettings][SRToolkit.utils.types.EstimationSettings].
            Supported keys: ``method``, ``tol``, ``gtol``, ``max_iter``,
            ``constant_bounds``, ``initialization``, ``max_constants``,
            ``max_expr_length``, ``num_points_sampled``, ``bed_X``,
            ``num_consts_sampled``, ``domain_bounds``.

    Attributes:
        models: Cached [ModelResult][SRToolkit.utils.types.ModelResult] for every evaluated expression,
            keyed by the concatenated token string.
        invalid: Token strings of expressions that raised an exception during evaluation.
        ground_truth: The target expression passed at construction (BED mode).
        gt_behavior: Pre-computed behavior matrix for the ground truth (BED mode).
        max_evaluations: Maximum number of expressions to evaluate.
        bed_evaluation_parameters: Active BED evaluation settings dict.
        metadata: Metadata dict passed at construction.
        symbol_library: The symbol library used.
        total_evaluations: Number of times
            [evaluate_expr][SRToolkit.evaluation.sr_evaluator.SR_evaluator.evaluate_expr]
            has been called, including cache hits.
        seed: Random seed.
        parameter_estimator: [ParameterEstimator][SRToolkit.evaluation.parameter_estimator.ParameterEstimator]
            instance used in RMSE mode.
        ranking_function: Active ranking function (``"rmse"`` or ``"bed"``).
        success_threshold: Error threshold for determining success.
    """
    self.kwargs = kwargs
    self.models: Dict[str, ModelResult] = dict()
    self.invalid: List[str] = list()
    self.success_threshold = success_threshold
    self.metadata = metadata
    self.ground_truth = ground_truth
    self.gt_behavior = None
    self._callbacks: Optional[Union[CallbackDispatcher, SRCallbacks]] = None
    self._experiment_id: str = ""
    self.should_stop = False
    self._current_best_error = float("inf")
    self.bed_evaluation_parameters: Dict[str, Any] = {
        "bed_X": None,
        "num_consts_sampled": 32,
        "num_points_sampled": 64,
        "domain_bounds": None,
        "constant_bounds": (-5, 5),
    }
    if kwargs:
        for k in self.bed_evaluation_parameters.keys():
            if k in kwargs:
                self.bed_evaluation_parameters[k] = kwargs[k]  # type: ignore[literal-required]
    if self.bed_evaluation_parameters["num_points_sampled"] == -1:
        self.bed_evaluation_parameters["num_points_sampled"] = X.shape[0]

    self.symbol_library = symbol_library
    self.max_evaluations = max_evaluations
    self.total_evaluations = 0
    self.seed = seed
    if seed is not None:
        np.random.seed(seed)

    if ranking_function not in ["rmse", "bed"]:
        warnings.warn(f"ranking_function '{ranking_function}' not supported. Using rmse instead.")
        ranking_function = "rmse"
    self.ranking_function = ranking_function

    if ranking_function == "rmse":
        if y is None:
            raise ValueError("Target values must be provided for RMSE ranking function.")
        self.parameter_estimator = ParameterEstimator(X, y, symbol_library=symbol_library, seed=seed, **kwargs)

        if self.success_threshold is None:
            self.success_threshold = 1e-7

    elif ranking_function == "bed":
        if ground_truth is None:
            raise ValueError(
                "Ground truth must be provided for bed ranking function. The ground truth must be "
                "provided as a list of tokens, a Node object, or a numpy array representing behavior. "
                "The behavior matrix is a matrix representing the distribution of outputs of an "
                "expression with free parameters at different points in the domain. This matrix "
                "should be of size (num_points_sampled, num_consts_sampled). See "
                "SRToolkit.utils.create_behavior_matrix for more details."
            )
        else:
            if self.bed_evaluation_parameters["bed_X"] is None:
                if self.bed_evaluation_parameters["domain_bounds"] is not None:
                    db = self.bed_evaluation_parameters["domain_bounds"]
                    assert isinstance(db, List), "Domain bounds should be a list of tuples."
                    interval_length = np.array([ub - lb for (lb, ub) in db])
                    lower_bound = np.array([lb for (lb, ub) in db])
                    lho = LatinHypercube(len(db), optimization="random-cd", seed=seed)
                    self.bed_evaluation_parameters["bed_X"] = (
                        lho.random(self.bed_evaluation_parameters["num_points_sampled"]) * interval_length
                        + lower_bound
                    )
                else:
                    indices = np.random.choice(
                        X.shape[0],
                        size=self.bed_evaluation_parameters["num_points_sampled"],
                    )
                    self.bed_evaluation_parameters["bed_X"] = X[indices, :]

        if isinstance(ground_truth, (list, Node)):
            self.gt_behavior = create_behavior_matrix(
                ground_truth,
                self.bed_evaluation_parameters["bed_X"],
                num_consts_sampled=self.bed_evaluation_parameters["num_consts_sampled"],
                consts_bounds=self.bed_evaluation_parameters["constant_bounds"],
                symbol_library=self.symbol_library,
                seed=self.seed,
            )
        elif isinstance(ground_truth, np.ndarray):
            self.gt_behavior = ground_truth
        else:
            raise ValueError(
                "Ground truth must be provided as a list of tokens, a Node object, or a numpy array representing behavior."
            )

        if self.success_threshold is None:
            assert self.ground_truth is not None, "Ground truth must be provided for BED ranking function."
            distances = [
                bed(
                    self.ground_truth,
                    self.gt_behavior,
                    self.bed_evaluation_parameters["bed_X"],
                    num_consts_sampled=self.bed_evaluation_parameters["num_consts_sampled"],
                    num_points_sampled=self.bed_evaluation_parameters["num_points_sampled"],
                    domain_bounds=self.bed_evaluation_parameters["domain_bounds"],
                    consts_bounds=self.bed_evaluation_parameters["constant_bounds"],
                    symbol_library=self.symbol_library,
                )
                for i in range(100)
            ]
            self.success_threshold = np.max(distances) * 1.1

    self._callbacks = CallbackDispatcher(
        callbacks=[EarlyStoppingCallback(threshold=self.success_threshold, max_evaluations=max_evaluations)]
    )
    self.X = X
    self.y = y
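
Estimation settings listed above can be passed directly as keyword arguments. A small sketch using constant_bounds and max_constants (the values are chosen purely for illustration):

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y, seed=0, constant_bounds=(-3, 3), max_constants=4)
>>> se.ranking_function
'rmse'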

set_callbacks

set_callbacks(callbacks: Optional[Union[SRCallbacks, CallbackDispatcher]] = None) -> None

Register callbacks for monitoring and early stopping.

A single SRCallbacks instance is automatically wrapped in a CallbackDispatcher.

Examples:

>>> from SRToolkit.evaluation.callbacks import EarlyStoppingCallback
>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y)
>>> se.set_callbacks(EarlyStoppingCallback(threshold=1e-6))
>>> se._callbacks is not None
True

Parameters:

Name Type Description Default
callbacks Optional[Union[SRCallbacks, CallbackDispatcher]]

A CallbackDispatcher or a single SRCallbacks instance.

None
Source code in SRToolkit/evaluation/sr_evaluator.py
def set_callbacks(self, callbacks: Optional[Union[SRCallbacks, CallbackDispatcher]] = None) -> None:
    """
    Register callbacks for monitoring and early stopping.

    A single [SRCallbacks][SRToolkit.evaluation.callbacks.SRCallbacks] instance is
    automatically wrapped in a
    [CallbackDispatcher][SRToolkit.evaluation.callbacks.CallbackDispatcher].

    Examples:
        >>> from SRToolkit.evaluation.callbacks import EarlyStoppingCallback
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y)
        >>> se.set_callbacks(EarlyStoppingCallback(threshold=1e-6))
        >>> se._callbacks is not None
        True

    Args:
        callbacks: A [CallbackDispatcher][SRToolkit.evaluation.callbacks.CallbackDispatcher]
            or a single [SRCallbacks][SRToolkit.evaluation.callbacks.SRCallbacks] instance.
    """
    if isinstance(callbacks, CallbackDispatcher):
        # Preserve previously registered callbacks when swapping in a new dispatcher.
        if isinstance(self._callbacks, SRCallbacks):
            old_callbacks = [self._callbacks]
        elif isinstance(self._callbacks, CallbackDispatcher):
            old_callbacks = self._callbacks.get_callbacks()
        else:
            old_callbacks = []

        self._callbacks = callbacks
        for cb in old_callbacks:
            self._callbacks.add(cb)
    elif isinstance(callbacks, SRCallbacks):
        if isinstance(self._callbacks, CallbackDispatcher):
            self._callbacks.add(callbacks)
        elif isinstance(self._callbacks, SRCallbacks):
            self._callbacks = CallbackDispatcher(callbacks=[self._callbacks, callbacks])
        else:
            self._callbacks = CallbackDispatcher(callbacks=[callbacks])

evaluate_expr

evaluate_expr(expr: Union[List[str], Node], simplify_expr: bool = False, verbose: int = 0) -> float

Evaluates an expression in infix notation and stores the result in memory to prevent re-evaluation.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y, seed=42)
>>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
>>> print(rmse < 1e-6)
True
>>> X = np.array([[0, 1], [0, 2], [0, 3]])
>>> y = np.array([2, 3, 4])
>>> se = SR_evaluator(X, y, seed=42, success_threshold=-1)
>>> rmse = se.evaluate_expr(["C", "+", "C", "*", "C", "+", "X_0", "*", "X_1", "/", "X_0"], simplify_expr=True)
>>> print(rmse < 1e-6)
True
>>> list(se.models.keys())[0]
'C+X_1'
>>> print(0.99 < se.models["C+X_1"].parameters[0] < 1.01)
True
>>> # Evaluating invalid expression returns nan and adds it to invalid list
>>> print(se.evaluate_expr(["C", "*", "X_1", "X_0"]))
nan
>>> se.invalid
['C*X_1X_0']
>>> X = np.random.rand(10, 2) - 0.5
>>> gt = ["X_0", "+", "C"]
>>> se = SR_evaluator(X, ground_truth=gt, ranking_function="bed", seed=42)
>>> print(se.evaluate_expr(["C", "+", "X_1"]) < 1)
True
>>> # When evaluating using BED as the ranking function, the error depends on the scale of output of the
>>> # ground truth. Because of stochasticity of BED, error might be high even when expressions match exactly.
>>> print(se.evaluate_expr(["C", "+", "X_0"]) < 0.2)
True
>>> # X can also be sampled from a domain by providing domain_bounds
>>> se = SR_evaluator(X, ground_truth=gt, ranking_function="bed", domain_bounds=[(-1, 1), (-1, 1)], seed=42)
>>> print(se.evaluate_expr(["C", "+", "X_0"]) < 0.2)
True

Parameters:

Name Type Description Default
expr Union[List[str], Node]

Expression as a token list in infix notation or a Node tree.

required
simplify_expr bool

If True, simplifies the expression with SymPy before evaluating. Slows down evaluation; recommended only for post-hoc inspection of top results. Default False.

False
verbose int

0 — silent; 1 — logs expression, error, and fitted parameters; 2 — also surfaces NumPy warnings during evaluation. Default 0.

0

Returns:

Type Description
float

The error of the expression under the active ranking function: RMSE when ranking_function="rmse", BED when ranking_function="bed". Returns NaN if the expression is invalid or max_evaluations has been reached (a warning is emitted in the latter case). If the expression was already evaluated, the cached value is returned immediately.

Source code in SRToolkit/evaluation/sr_evaluator.py
def evaluate_expr(
    self,
    expr: Union[List[str], Node],
    simplify_expr: bool = False,
    verbose: int = 0,
) -> float:
    """
    Evaluates an expression in infix notation and stores the result in
    memory to prevent re-evaluation.

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y, seed=42)
        >>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
        >>> print(rmse < 1e-6)
        True
        >>> X = np.array([[0, 1], [0, 2], [0, 3]])
        >>> y = np.array([2, 3, 4])
        >>> se = SR_evaluator(X, y, seed=42, success_threshold=-1)
        >>> rmse = se.evaluate_expr(["C", "+", "C", "*", "C", "+", "X_0", "*", "X_1", "/", "X_0"], simplify_expr=True)
        >>> print(rmse < 1e-6)
        True
        >>> list(se.models.keys())[0]
        'C+X_1'
        >>> print(0.99 < se.models["C+X_1"].parameters[0] < 1.01)
        True
        >>> # Evaluating invalid expression returns nan and adds it to invalid list
        >>> print(se.evaluate_expr(["C", "*", "X_1", "X_0"]))
        nan
        >>> se.invalid
        ['C*X_1X_0']
        >>> X = np.random.rand(10, 2) - 0.5
        >>> gt = ["X_0", "+", "C"]
        >>> se = SR_evaluator(X, ground_truth=gt, ranking_function="bed", seed=42)
        >>> print(se.evaluate_expr(["C", "+", "X_1"]) < 1)
        True
        >>> # When evaluating using BED as the ranking function, the error depends on the scale of output of the
        >>> # ground truth. Because of stochasticity of BED, error might be high even when expressions match exactly.
        >>> print(se.evaluate_expr(["C", "+", "X_0"]) < 0.2)
        True
        >>> # X can also be sampled from a domain by providing domain_bounds
        >>> se = SR_evaluator(X, ground_truth=gt, ranking_function="bed", domain_bounds=[(-1, 1), (-1, 1)], seed=42)
        >>> print(se.evaluate_expr(["C", "+", "X_0"]) < 0.2)
        True

    Args:
        expr: Expression as a token list in infix notation or a
            [Node][SRToolkit.utils.expression_tree.Node] tree.
        simplify_expr: If ``True``, simplifies the expression with SymPy before evaluating.
            Slows down evaluation; recommended only for post-hoc inspection of top results.
            Default ``False``.
        verbose: ``0`` — silent; ``1`` — logs expression, error, and fitted parameters;
            ``2`` — also surfaces NumPy warnings during evaluation. Default ``0``.

    Returns:
        The error of the expression under the active ranking function: RMSE when ``ranking_function="rmse"``, BED when ``ranking_function="bed"``. Returns ``NaN`` if the expression is invalid or ``max_evaluations`` has been reached (a warning is emitted in the latter case). If the expression was already evaluated, the cached value is returned immediately.
    """
    self.total_evaluations += 1

    if self.should_stop:
        warnings.warn(
            f"Evaluation stopped because max_evaluations ({self.max_evaluations}) reached or an expression with error lower than success_threshold ({self.success_threshold}) was found. "
        )
        return np.nan
    else:
        if simplify_expr:
            try:
                expr = simplify(expr, self.symbol_library)
            except Exception as e:
                if isinstance(expr, Node):
                    expr_list = expr.to_list(symbol_library=self.symbol_library)
                else:
                    expr_list = expr
                warnings.warn(f"Unable to simplify: {''.join(expr_list)}, problems with subexpression {e}")

        if isinstance(expr, Node):
            expr_list = expr.to_list(symbol_library=self.symbol_library)
        else:
            expr_list = expr

        expr_str = "".join(expr_list)
        if expr_str in self.models:
            if verbose > 0:
                logger.debug("Already evaluated %s", expr_str)
            if self._callbacks is not None:
                event = ExprEvaluated(
                    expression=expr_str,
                    error=self.models[expr_str].error,
                    evaluation_number=self.total_evaluations,
                    experiment_id=self._experiment_id,
                    is_new_best=False,
                )
                if not self._callbacks.on_expr_evaluated(event):
                    self.should_stop = True
            return self.models[expr_str].error

        else:
            if self.ranking_function == "rmse":
                try:
                    with (
                        np.errstate(
                            divide="ignore",
                            invalid="ignore",
                            over="ignore",
                            under="ignore",
                        )
                        if verbose < 2
                        else nullcontext()
                    ):
                        error, parameters = self.parameter_estimator.estimate_parameters(expr)

                    if verbose > 0:
                        if parameters.size > 0:
                            parameter_string = (
                                f" Best parameters found are [{', '.join([str(round(p, 3)) for p in parameters])}]"
                            )
                        else:
                            parameter_string = ""
                        logger.debug("Evaluated expression %s with RMSE: %s.%s", expr_str, error, parameter_string)

                except Exception as e:
                    if verbose > 0:
                        logger.debug("Error evaluating expression %s: %s", expr_str, e)

                    self.invalid.append(expr_str)
                    error, parameters = np.nan, np.array([])

                self.models[expr_str] = ModelResult(
                    expr=expr_list,
                    error=error,
                    parameters=parameters,
                )

                if self._callbacks is not None:
                    is_new_best = error < self._current_best_error
                    if is_new_best:
                        self._current_best_error = error
                    event = ExprEvaluated(
                        expression=expr_str,
                        error=error,
                        evaluation_number=self.total_evaluations,
                        experiment_id=self._experiment_id,
                        is_new_best=is_new_best,
                    )
                    if not self._callbacks.on_expr_evaluated(event):
                        self.should_stop = True
                    if is_new_best:
                        best_event = BestExpressionFound(
                            experiment_id=self._experiment_id,
                            expression=expr_str,
                            error=error,
                            evaluation_number=self.total_evaluations,
                        )
                        if not self._callbacks.on_best_expression(best_event):
                            self.should_stop = True

            elif self.ranking_function == "bed":
                try:
                    with (
                        np.errstate(
                            divide="ignore",
                            invalid="ignore",
                            over="ignore",
                            under="ignore",
                        )
                        if verbose < 2
                        else nullcontext()
                    ):
                        assert self.gt_behavior is not None, (
                            "Ground truth must be provided for BED ranking function."
                        )
                        error = bed(
                            expr,
                            self.gt_behavior,
                            self.bed_evaluation_parameters["bed_X"],
                            num_consts_sampled=self.bed_evaluation_parameters["num_consts_sampled"],
                            num_points_sampled=self.bed_evaluation_parameters["num_points_sampled"],
                            domain_bounds=self.bed_evaluation_parameters["domain_bounds"],
                            consts_bounds=self.bed_evaluation_parameters["constant_bounds"],
                            symbol_library=self.symbol_library,
                            seed=self.seed,
                        )

                        if verbose > 0:
                            logger.debug("Evaluated expression %s with BED: %s.", expr_str, error)

                except Exception as e:
                    if verbose > 0:
                        logger.debug("Error evaluating expression %s: %s", expr_str, e)

                    self.invalid.append(expr_str)
                    error = np.nan

                self.models[expr_str] = ModelResult(
                    expr=expr_list,
                    error=error,
                )

                if self._callbacks is not None:
                    is_new_best = error < self._current_best_error
                    if is_new_best:
                        self._current_best_error = error
                    event = ExprEvaluated(
                        expression=expr_str,
                        error=error,
                        evaluation_number=self.total_evaluations,
                        experiment_id=self._experiment_id,
                        is_new_best=is_new_best,
                    )
                    if not self._callbacks.on_expr_evaluated(event):
                        self.should_stop = True
                    if is_new_best:
                        best_event = BestExpressionFound(
                            experiment_id=self._experiment_id,
                            expression=expr_str,
                            error=error,
                            evaluation_number=self.total_evaluations,
                        )
                        if not self._callbacks.on_best_expression(best_event):
                            self.should_stop = True

            else:
                raise ValueError(f"Ranking function {self.ranking_function} not supported.")

            return error

get_results

get_results(approach_name: str = '', top_k: int = 20, results: Optional[SR_results] = None) -> SR_results

Returns the results of the equation discovery / symbolic regression evaluation.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y)
>>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
>>> results = se.get_results(top_k=1)
>>> print(results[0].num_evaluated)
1
>>> print(results[0].evaluation_calls)
1
>>> print(results[0].best_expr)
C*X_1-X_0
>>> print(results[0].min_error < 1e-6)
True
>>> print(1.99 < results[0].top_models[0].parameters[0] < 2.01)
True

Parameters:

Name Type Description Default
approach_name str

The name of the approach used to discover the equations.

''
top_k int

The number of top results to include in the output. If top_k is greater than the number of evaluated expressions, all evaluated expressions are included. If top_k is less than 0, all evaluated expressions are included.

20
results Optional[SR_results]

An SR_results object containing the results of the previous evaluation. If provided, the results of the current evaluation are appended to the existing results. Otherwise, a new SR_results object is created.

None

Returns:

Type Description
SR_results

An instance of the SR_results object with the results of the evaluation.

Source code in SRToolkit/evaluation/sr_evaluator.py
def get_results(
    self, approach_name: str = "", top_k: int = 20, results: Optional["SR_results"] = None
) -> "SR_results":
    """
    Returns the results of the equation discovery / symbolic regression evaluation.

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y)
        >>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
        >>> results = se.get_results(top_k=1)
        >>> print(results[0].num_evaluated)
        1
        >>> print(results[0].evaluation_calls)
        1
        >>> print(results[0].best_expr)
        C*X_1-X_0
        >>> print(results[0].min_error < 1e-6)
        True
        >>> print(1.99 < results[0].top_models[0].parameters[0] < 2.01)
        True

    Args:
        approach_name: The name of the approach used to discover the equations.
        top_k: The number of top results to include in the output. If `top_k`
            is greater than the number of evaluated expressions, all
            evaluated expressions are included. If `top_k` is less than 0,
            all evaluated expressions are included.
        results: An SR_results object containing the results of the previous evaluation. If provided,
            the results of the current evaluation are appended to the existing results. Otherwise, a new SR_results
            object is created.

    Returns:
        An instance of the SR_results object with the results of the evaluation.
    """
    if top_k > len(self.models) or top_k < 0:
        top_k = len(self.models)

    if results is None:
        results = SR_results()

    results.add_results(
        self.models,
        top_k,
        self.total_evaluations,
        self.success_threshold,
        approach_name,
        self.metadata,
    )

    return results

to_dict

to_dict(base_path: str, name: str) -> dict

Creates a dictionary representation of the SR_evaluator.

Parameters:

Name Type Description Default
base_path str

Directory in which the evaluator's data arrays (X, y, and optionally the ground truth) are saved as .npy files.

required
name str

File-name prefix for the saved array files, e.g. {name}_X.npy.

required

Returns:

Type Description
dict

A dictionary containing the necessary information to recreate the evaluator from disk.

Source code in SRToolkit/evaluation/sr_evaluator.py
def to_dict(self, base_path: str, name: str) -> dict:
    """
    Creates a dictionary representation of the SR_evaluator.

    Args:
        base_path: Directory in which the evaluator's data arrays (X, y, and
            optionally the ground truth) are saved as .npy files.
        name: File-name prefix for the saved array files, e.g. ``{name}_X.npy``.

    Returns:
        A dictionary containing the necessary information to recreate the evaluator from disk.
    """
    output = {
        "format_version": 1,
        "type": "SR_evaluator",
        "metadata": self.metadata,
        "symbol_library": self.symbol_library.to_dict(),
        "max_evaluations": self.max_evaluations,
        "success_threshold": self.success_threshold,
        "ranking_function": self.ranking_function,
        "seed": self.seed,
        "kwargs": self.kwargs,
    }

    if not os.path.isdir(base_path):
        os.makedirs(base_path)

    X_path = f"{base_path}/{name}_X.npy"
    np.save(X_path, self.X)
    output["X"] = X_path

    if self.y is not None:
        y_path = f"{base_path}/{name}_y.npy"
        np.save(y_path, self.y)
        output["y"] = y_path
    else:
        output["y"] = None

    if self.ground_truth is None:
        output["ground_truth"] = None
    else:
        if isinstance(self.ground_truth, list):
            output["ground_truth"] = self.ground_truth
        elif isinstance(self.ground_truth, Node):
            output["ground_truth"] = self.ground_truth.to_list(self.symbol_library)
        else:
            gt_path = f"{base_path}/{name}_gt.npy"
            np.save(gt_path, self.ground_truth)
            output["ground_truth"] = gt_path

    return output
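
For example, a minimal serialization sketch; the import path SRToolkit.evaluation.sr_evaluator, the eval_state directory, and the JSON-serializability of the returned dictionary are assumptions made for illustration:

import json

import numpy as np

from SRToolkit.evaluation.sr_evaluator import SR_evaluator  # assumed import path

X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
y = np.array([3, 0, 3, 11])
se = SR_evaluator(X, y, seed=42)

# to_dict writes the data arrays to disk (eval_state/demo_X.npy and
# eval_state/demo_y.npy) and returns a dictionary describing the evaluator.
state = se.to_dict(base_path="eval_state", name="demo")

# The dictionary can then be persisted alongside the arrays, assuming all of
# its values (metadata, symbol library, extra keyword arguments) serialize to JSON.
with open("eval_state/demo_evaluator.json", "w") as f:
    json.dump(state, f, indent=2)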

from_dict staticmethod

from_dict(data: dict) -> SR_evaluator

Reconstruct an SR_evaluator from a dictionary produced by to_dict.

Parameters:

Name Type Description Default
data dict

Dictionary representation of the evaluator, as produced by to_dict.

required

Returns:

Type Description
SR_evaluator

The reconstructed SR_evaluator.

Raises:

Type Description
ValueError

If data["format_version"] is not 1 or if the numpy arrays for X, y, or ground_truth cannot be loaded from disk.

Source code in SRToolkit/evaluation/sr_evaluator.py
@staticmethod
def from_dict(data: dict) -> "SR_evaluator":
    """
    Reconstruct an [SR_evaluator][SRToolkit.evaluation.sr_evaluator.SR_evaluator] from a
    dictionary produced by [to_dict][SRToolkit.evaluation.sr_evaluator.SR_evaluator.to_dict].

    Args:
        data: Dictionary representation of the evaluator, as produced by
            [to_dict][SRToolkit.evaluation.sr_evaluator.SR_evaluator.to_dict].

    Returns:
        The reconstructed [SR_evaluator][SRToolkit.evaluation.sr_evaluator.SR_evaluator].

    Raises:
        ValueError: If ``data["format_version"]`` is not ``1`` or if the numpy arrays
            for ``X``, ``y``, or ``ground_truth`` cannot be loaded from disk.
    """
    if data.get("format_version", 1) != 1:
        raise ValueError(
            f"[SR_evaluator.from_dict] Unsupported format_version: {data.get('format_version')!r}. Expected 1."
        )

    try:
        X = np.load(data["X"])

        if data["y"] is not None:
            y = np.load(data["y"])
        else:
            y = None

        if data["ground_truth"] is None:
            gt = None
        else:
            if isinstance(data["ground_truth"], list):
                gt = data["ground_truth"]
            else:
                gt = np.load(data["ground_truth"])
    except Exception as e:
        raise ValueError(f"[SR_evaluator.from_dict] Unable to load data for X/y/ground truth due to {e}")

    symbol_library = SymbolLibrary.from_dict(data["symbol_library"])
    return SR_evaluator(
        X,
        y=y,
        ground_truth=gt,
        symbol_library=symbol_library,
        max_evaluations=data["max_evaluations"],
        success_threshold=data["success_threshold"],
        ranking_function=data["ranking_function"],
        seed=data["seed"],
        metadata=data["metadata"],
        **data["kwargs"],
    )
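
A matching reconstruction sketch, under the same assumptions about the import path and file locations:

import numpy as np

from SRToolkit.evaluation.sr_evaluator import SR_evaluator  # assumed import path

X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
y = np.array([3, 0, 3, 11])

original = SR_evaluator(X, y, seed=42)
state = original.to_dict(base_path="eval_state", name="demo")

# from_dict reloads the .npy arrays referenced in the dictionary and rebuilds
# an evaluator with the same symbol library, ranking function, and seed.
restored = SR_evaluator.from_dict(state)
rmse = restored.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
print(rmse < 1e-6)  # expected True, as in the get_results example above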

SR_results

SR_results()

Container for SR experiment results, typically obtained via SR_evaluator.get_results.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y, seed=42)
>>> _ = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
>>> results = se.get_results(top_k=1)
>>> print(results[0].best_expr)
C*X_1-X_0
>>> print(results[0].min_error < 1e-6)
True
>>> len(results)
1

Attributes:

Name Type Description
results

List of EvalResult instances, one per experiment.

Source code in SRToolkit/evaluation/sr_evaluator.py
def __init__(self):
    """
    Container for SR experiment results, typically obtained via
    [SR_evaluator.get_results][SRToolkit.evaluation.sr_evaluator.SR_evaluator.get_results].

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y, seed=42)
        >>> _ = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
        >>> results = se.get_results(top_k=1)
        >>> print(results[0].best_expr)
        C*X_1-X_0
        >>> print(results[0].min_error < 1e-6)
        True
        >>> len(results)
        1

    Attributes:
        results: List of [EvalResult][SRToolkit.utils.types.EvalResult] instances,
            one per experiment.
    """
    self.results = list()

add_results

add_results(models: Dict[str, ModelResult], top_k: int, total_evaluations: int, success_threshold: Optional[float], approach_name: str, metadata: Optional[dict] = None) -> None

Adds the results of an evaluation to the results object.

Parameters:

Name Type Description Default
models Dict[str, ModelResult]

A dictionary mapping expressions to their evaluation results.

required
top_k int

The number of top results to include in the output.

required
total_evaluations int

The total number of evaluations performed during the evaluation.

required
success_threshold Optional[float]

The success threshold used to determine whether the evaluation was successful.

required
approach_name str

The name of the approach used to discover the equations.

required
metadata Optional[dict]

A dictionary containing additional metadata about the evaluation.

None
Source code in SRToolkit/evaluation/sr_evaluator.py
def add_results(
    self,
    models: Dict[str, ModelResult],
    top_k: int,
    total_evaluations: int,
    success_threshold: Optional[float],
    approach_name: str,
    metadata: Optional[dict] = None,
) -> None:
    """
    Adds the results of an evaluation to the results object.

    Args:
        models: A dictionary mapping expressions to their evaluation results.
        top_k: The number of top results to include in the output.
        total_evaluations: The total number of evaluations performed during the evaluation.
        success_threshold: The success threshold used to determine whether the evaluation was successful.
        approach_name: The name of the approach used to discover the equations.
        metadata: A dictionary containing additional metadata about the evaluation.
    """
    models_list = list(models.values())
    sorted_indices = np.argsort([v.error for v in models_list])
    sorted_models = [models_list[i] for i in sorted_indices]

    dataset_name = None
    remaining_metadata = None
    if metadata is not None and "dataset_name" in metadata:
        dataset_name = metadata["dataset_name"]
        remaining_metadata = {key: value for key, value in metadata.items() if key != "dataset_name"}
        if len(remaining_metadata) == 0:
            remaining_metadata = None
    elif metadata is not None:
        remaining_metadata = metadata

    success = success_threshold is not None and sorted_models[0].error < success_threshold

    results_obj = EvalResult(
        min_error=sorted_models[0].error,
        best_expr="".join(sorted_models[0].expr),
        num_evaluated=len(models_list),
        evaluation_calls=total_evaluations,
        top_models=sorted_models[:top_k],
        all_models=models_list,
        approach_name=approach_name,
        success=success,
        dataset_name=dataset_name,
        metadata=remaining_metadata,
    )

    self.results.append(results_obj)
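
add_results is normally invoked for you by get_results. A sketch of the usual accumulation pattern, where the container returned by one experiment is passed back into get_results for the next (the import path SRToolkit.evaluation.sr_evaluator is assumed):

import numpy as np

from SRToolkit.evaluation.sr_evaluator import SR_evaluator  # assumed import path

X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
y = np.array([3, 0, 3, 11])

# First experiment.
se1 = SR_evaluator(X, y, seed=42)
_ = se1.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
results = se1.get_results(approach_name="run-1", top_k=1)

# Second experiment: passing the existing container makes get_results call
# add_results on it, so both experiments end up in the same SR_results.
se2 = SR_evaluator(X, y, seed=7)
_ = se2.evaluate_expr(["X_1", "+", "X_0"])
results = se2.get_results(approach_name="run-2", top_k=1, results=results)

print(len(results))  # 2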

print_results

print_results(experiment_number: Optional[int] = None, detailed: bool = False, model_scope: Literal['best', 'top', 'all'] = 'top', augmentations: Optional[List[str]] = None)

Prints the results of the SR_evaluator.

Displays the minimum error, best expression, evaluation counts, success status, metadata, and approach name. When detailed is True, also prints per-model information. Augmentation data is formatted by the corresponding ResultAugmenter subclass, looked up from the global registry via the _type field stored in each augmentation entry.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y, seed=42)
>>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
>>> results = se.get_results(top_k=1)
>>> results.print_results()
=== Experiment 1/1 ===
Best expression: C*X_1-X_0
Error: ...
Evaluated: 1 expressions | Calls: 1 | Success: ...

>>> results.print_results(detailed=True, experiment_number=0)
Best expression: C*X_1-X_0
Error: ...
Evaluated: 1 expressions | Calls: 1 | Success: ...

Models:
  C*X_1-X_0  (error=..., params=...)

Parameters:

Name Type Description Default
experiment_number Optional[int]

Number of the experiment to print. If None, prints all.

None
detailed bool

If True, prints per-model information.

False
model_scope Literal['best', 'top', 'all']

Which models to show when detailed is True. "best" shows only the top model, "top" shows the top-k, "all" shows all evaluated models.

'top'
augmentations Optional[List[str]]

Filter which augmenters to display by name. If None, all augmentations present in the data are shown.

None
Source code in SRToolkit/evaluation/sr_evaluator.py
def print_results(
    self,
    experiment_number: Optional[int] = None,
    detailed: bool = False,
    model_scope: Literal["best", "top", "all"] = "top",
    augmentations: Optional[List[str]] = None,
):
    r"""
    Prints the results of the SR_evaluator.

    Displays the minimum error, best expression, evaluation counts, success status,
    metadata, and approach name. When *detailed* is ``True``, also prints per-model
    information. Augmentation data is formatted by the corresponding
    [ResultAugmenter][SRToolkit.evaluation.sr_evaluator.ResultAugmenter] subclass,
    looked up from the global registry via the ``_type`` field stored in each
    augmentation entry.

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y, seed=42)
        >>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
        >>> results = se.get_results(top_k=1)
        >>> results.print_results()  # doctest: +ELLIPSIS
        === Experiment 1/1 ===
        Best expression: C*X_1-X_0
        Error: ...
        Evaluated: 1 expressions | Calls: 1 | Success: ...
        <BLANKLINE>
        >>> results.print_results(detailed=True, experiment_number=0)  # doctest: +ELLIPSIS
        Best expression: C*X_1-X_0
        Error: ...
        Evaluated: 1 expressions | Calls: 1 | Success: ...
        <BLANKLINE>
        Models:
          C*X_1-X_0  (error=..., params=...)
        <BLANKLINE>

    Args:
        experiment_number: Number of the experiment to print. If None, prints all.
        detailed: If True, prints per-model information.
        model_scope: Which models to show when *detailed* is True.
            ``"best"`` shows only the top model, ``"top"`` shows the top-k,
            ``"all"`` shows all evaluated models.
        augmentations: Filter which augmenters to display by name.
            If None, all augmentations present in the data are shown.
    """
    if experiment_number is None:
        for i, result in enumerate(self.results):
            print(f"=== Experiment {i + 1}/{len(self.results)} ===")
            SR_results._print_result_(result, detailed, model_scope, augmentations)
            print()
    else:
        assert experiment_number < len(self.results), "[SR_results.print_results] experiment number out of bounds"
        SR_results._print_result_(self.results[experiment_number], detailed, model_scope, augmentations)

augment

augment(augmenters: Union[List[ResultAugmenter], ResultAugmenter], experiment_number: Optional[int] = None) -> None

Applies the given ResultAugmenter instances to the stored results. Augmenters add post-hoc information such as LaTeX representations, simplified expressions, or R² scores.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y, seed=42)
>>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
>>> results = se.get_results(top_k=1)
>>> from SRToolkit.evaluation.result_augmentation import ExpressionToLatex
>>> results.augment([ExpressionToLatex(SymbolLibrary.default_symbols(2))])
>>> results[0].augmentations["ExpressionToLatex"]["best_expr_latex"]
'$C_{0} \\cdot X_{1} - X_{0}$'

Parameters:

Name Type Description Default
augmenters Union[List[ResultAugmenter], ResultAugmenter]

A ResultAugmenter or a list of ResultAugmenter objects to apply to the results.

required
experiment_number Optional[int]

If provided, apply augmenters only to this experiment's result. If None, apply to all results.

None
Source code in SRToolkit/evaluation/sr_evaluator.py
def augment(
    self, augmenters: Union[List[ResultAugmenter], ResultAugmenter], experiment_number: Optional[int] = None
) -> None:
    r"""
    Applies the given [ResultAugmenter][SRToolkit.evaluation.sr_evaluator.ResultAugmenter]
    instances to the stored results. Augmenters add post-hoc information such as LaTeX
    representations, simplified expressions, or R² scores.

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y, seed=42)
        >>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
        >>> results = se.get_results(top_k=1)
        >>> from SRToolkit.evaluation.result_augmentation import ExpressionToLatex
        >>> results.augment([ExpressionToLatex(SymbolLibrary.default_symbols(2))])
        >>> results[0].augmentations["ExpressionToLatex"]["best_expr_latex"]  # doctest: +ELLIPSIS
        '$C_{0} \\cdot X_{1} - X_{0}$'

    Args:
        augmenters: A [ResultAugmenter][SRToolkit.evaluation.sr_evaluator.ResultAugmenter] or a list of [ResultAugmenter][SRToolkit.evaluation.sr_evaluator.ResultAugmenter] objects to apply to the results.
        experiment_number: If provided, apply augmenters only to this experiment's result.
            If None, apply to all results.
    """
    if isinstance(augmenters, ResultAugmenter):
        augmenters = [augmenters]

    if experiment_number is not None:
        assert experiment_number < len(self.results), "[SR_results.augment] experiment number out of bounds"
        for augmenter in augmenters:
            try:
                augmenter.write_results(self.results[experiment_number])
            except Exception as e:
                warnings.warn(f"Error augmenting results with {augmenter.name}, skipping: {e}")
    else:
        for result in self.results:
            for augmenter in augmenters:
                try:
                    augmenter.write_results(result)
                except Exception as e:
                    warnings.warn(f"Error augmenting results with {augmenter.name}, skipping: {e}")

__add__

__add__(other: SR_results) -> SR_results

Returns a new SR_results object that is the concatenation of the current SR_results object with the other SR_results object.

Parameters:

Name Type Description Default
other SR_results

SR_results object to concatenate with the current SR_results object.

required

Returns:

Type Description
SR_results

A new SR_results object containing the concatenated results.

Source code in SRToolkit/evaluation/sr_evaluator.py
def __add__(self, other: "SR_results") -> "SR_results":
    """
    Returns a new SR_results object that is the concatenation of the current SR_results object with the other SR_results object.

    Args:
        other: SR_results object to concatenate with the current SR_results object.

    Returns:
        A new SR_results object containing the concatenated results.
    """
    new = SR_results()
    new.results = self.results + other.results
    return new

__iadd__

__iadd__(other: SR_results) -> SR_results

In-place concatenation of SR_results objects.

Parameters:

Name Type Description Default
other SR_results

SR_results object to concatenate with the current SR_results object.

required

Returns:

Type Description
SR_results

self

Source code in SRToolkit/evaluation/sr_evaluator.py
def __iadd__(self, other: "SR_results") -> "SR_results":
    """
    In-place concatenation of SR_results objects.

    Args:
        other: SR_results object to concatenate with the current SR_results object.

    Returns:
        self
    """
    self.results += other.results
    return self
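
A short sketch of merging two containers with + and += (the import path SRToolkit.evaluation.sr_evaluator is assumed):

import numpy as np

from SRToolkit.evaluation.sr_evaluator import SR_evaluator  # assumed import path

X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
y = np.array([3, 0, 3, 11])

se_a = SR_evaluator(X, y, seed=42)
_ = se_a.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
se_b = SR_evaluator(X, y, seed=7)
_ = se_b.evaluate_expr(["X_1", "+", "X_0"])

results_a = se_a.get_results(top_k=1)
results_b = se_b.get_results(top_k=1)

merged = results_a + results_b  # new container; results_a and results_b are unchanged
print(len(merged))  # 2

results_a += results_b  # in-place; results_a now holds both experiments
print(len(results_a))  # 2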

__getitem__

__getitem__(item: int) -> EvalResult

Returns the results of the experiment with the given index.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y)
>>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
>>> results = se.get_results(top_k=1)
>>> result_of_first_experiment = results[0]

Parameters:

Name Type Description Default
item int

the index of the experiment.

required

Returns:

Type Description
EvalResult

The results of the experiment with the given index.

Source code in SRToolkit/evaluation/sr_evaluator.py
def __getitem__(self, item: int) -> EvalResult:
    """
    Returns the results of the experiment with the given index.

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y)
        >>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
        >>> results = se.get_results(top_k=1)
        >>> result_of_first_experiment = results[0]

    Args:
        item: the index of the experiment.

    Returns:
        The results of the experiment with the given index.

    """
    assert isinstance(item, int), "[SR_Results.__getitem__] Item must be an integer."
    assert 0 <= item < len(self.results), "[SR_Results.__getitem__] Item out of bounds."
    return self.results[item]

__len__

__len__() -> int

Returns the number of results stored in the results object. Usually, each result corresponds to a single experiment.

Examples:

>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y)
>>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
>>> results = se.get_results(top_k=1)
>>> len(results)
1

Returns:

Type Description
int

The number of results stored in the results object.

Source code in SRToolkit/evaluation/sr_evaluator.py
def __len__(self) -> int:
    """
    Returns the number of results stored in the results object. Usually, each result corresponds to a single experiment.

    Examples:
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y)
        >>> rmse = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
        >>> results = se.get_results(top_k=1)
        >>> len(results)
        1

    Returns:
        The number of results stored in the results object.
    """
    return len(self.results)

save

save(path: str) -> None

Saves the results to a specific file or directory as JSON.

If path is an existing directory or has no file extension, results.json is written inside that directory (which is created if needed). Otherwise, path is treated as a file and must end with the .json extension.

Examples:

>>> import tempfile
>>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
>>> y = np.array([3, 0, 3, 11])
>>> se = SR_evaluator(X, y, seed=42)
>>> _ = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
>>> results = se.get_results(top_k=1)
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     results.save(tmpdir + "/my_results/results.json")
...     loaded = SR_results.load(tmpdir + "/my_results/results.json")
...     print(loaded[0].best_expr)
C*X_1-X_0

Parameters:

Name Type Description Default
path str

Directory path or specific .json file path.

required

Raises:

Type Description
ValueError

If the path is a file with an extension other than .json.

OSError

If the directory cannot be created.

Source code in SRToolkit/evaluation/sr_evaluator.py
def save(self, path: str) -> None:
    """
    Saves the results to a specific file or directory as JSON.

    If *path* is an existing directory or has no file extension, ``results.json``
    is written inside that directory (which is created if needed). If *path* is a
    file path, it must end with the ``.json`` extension.

    Examples:
        >>> import tempfile
        >>> X = np.array([[1, 2], [8, 4], [5, 4], [7, 9], ])
        >>> y = np.array([3, 0, 3, 11])
        >>> se = SR_evaluator(X, y, seed=42)
        >>> _ = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
        >>> results = se.get_results(top_k=1)
        >>> with tempfile.TemporaryDirectory() as tmpdir:
        ...     results.save(tmpdir + "/my_results/results.json")
        ...     loaded = SR_results.load(tmpdir + "/my_results/results.json")
        ...     print(loaded[0].best_expr)
        C*X_1-X_0

    Args:
        path: Directory path or specific .json file path.

    Raises:
        ValueError: If the path is a file with an extension other than .json.
        OSError: If the directory cannot be created.
    """
    if os.path.isdir(path):
        target_file = os.path.join(path, "results.json")
    else:
        _, extension = os.path.splitext(path)

        if extension == "":
            target_file = os.path.join(path, "results.json")
        elif extension.lower() == ".json":
            target_file = path
        else:
            raise ValueError(f"Invalid file extension '{extension}'. Results must be saved to a '.json' file.")

    target_dir = os.path.dirname(target_file)
    if target_dir and not os.path.isdir(target_dir):
        os.makedirs(target_dir)

    # Prepare the data to be serialized as JSON
    output = {
        "format_version": 1,
        "type": "SR_results",
        "results": [r.to_dict() for r in self.results],
    }

    with open(target_file, "w") as f:
        json.dump(output, f, indent=2)

load staticmethod

load(path: str) -> SR_results

Load results previously saved with save.

If path is a directory, it looks for results.json inside it. If path is a file, it must end with the .json extension.

Parameters:

Name Type Description Default
path str

Directory path containing results.json or path to a specific .json file.

required

Returns:

Type Description
SR_results

A new SR_results instance with the loaded data.

Raises:

Type Description
FileNotFoundError

If the specified file or directory does not exist.

ValueError

If the file extension is not .json or if format_version is not 1.

Source code in SRToolkit/evaluation/sr_evaluator.py
@staticmethod
def load(path: str) -> "SR_results":
    """
    Load results previously saved with [save][SRToolkit.evaluation.sr_evaluator.SR_results.save].

    If *path* is a directory, it looks for ``results.json`` inside it.
    If *path* is a file, it must end with the ``.json`` extension.

    Args:
        path: Directory path containing ``results.json`` or path to a specific .json file.

    Returns:
        A new [SR_results][SRToolkit.evaluation.sr_evaluator.SR_results] instance with the loaded data.

    Raises:
        FileNotFoundError: If the specified file or directory does not exist.
        ValueError: If the file extension is not .json or if ``format_version`` is not ``1``.
    """
    if os.path.isdir(path):
        results_path = os.path.join(path, "results.json")
    else:
        _, extension = os.path.splitext(path)
        if extension.lower() != ".json":
            raise ValueError(
                f"Invalid file extension '{extension}'. SR_results can only be loaded from '.json' files."
            )
        results_path = path

    if not os.path.exists(results_path):
        raise FileNotFoundError(f"Could not find results file at: {results_path}")

    with open(results_path, "r") as f:
        data = json.load(f)

    if data.get("format_version", 1) != 1:
        raise ValueError(
            f"[SR_results.load] Unsupported format_version: {data.get('format_version')!r}. Expected 1."
        )

    sr_results = SR_results()
    sr_results.results = [EvalResult.from_dict(r) for r in data["results"]]
    return sr_results
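
For example, saving to and loading from a directory rather than an explicit .json path (the import path SRToolkit.evaluation.sr_evaluator is assumed):

import tempfile

import numpy as np

from SRToolkit.evaluation.sr_evaluator import SR_evaluator, SR_results  # assumed import path

X = np.array([[1, 2], [8, 4], [5, 4], [7, 9]])
y = np.array([3, 0, 3, 11])
se = SR_evaluator(X, y, seed=42)
_ = se.evaluate_expr(["C", "*", "X_1", "-", "X_0"])
results = se.get_results(top_k=1)

with tempfile.TemporaryDirectory() as tmpdir:
    # Saving to an existing directory writes results.json inside it ...
    results.save(tmpdir)
    # ... and load accepts the same directory, looking for results.json there.
    loaded = SR_results.load(tmpdir)
    print(loaded[0].best_expr)  # C*X_1-X_0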