Nguyen

SRToolkit.dataset.nguyen

Nguyen symbolic regression benchmark.

Nguyen

Nguyen(dataset_directory: str = os.path.join(user_data_dir('SRToolkit'), 'nguyen'))

Bases: SR_benchmark

The Nguyen symbolic regression benchmark.

Contains 10 expressions without constant parameters: the first 4 are polynomials, the first 8 use one variable, and the last 2 use two variables. The benchmark ships with pre-generated data.

For more information about the Nguyen benchmark, see: https://doi.org/10.1007/s10710-010-9121-2

Examples:

>>> benchmark = Nguyen()
>>> len(benchmark.list_datasets(verbose=False))
10
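The structure described above (4 polynomials, 8 one-variable expressions, 2 two-variable expressions) can be illustrated with the Nguyen expressions as they are commonly listed in the literature. This is a standalone sketch, not SRToolkit's internal representation: the `NGUYEN_EXPRESSIONS` dictionary is hypothetical, and the names `NG-2` to `NG-10` are assumed by analogy with `NG-1`.

```python
import math

# Hypothetical lookup table: name -> (number of variables, callable).
# Expressions follow the usual literature listing of the Nguyen benchmark;
# the strings SRToolkit stores internally may be written differently.
NGUYEN_EXPRESSIONS = {
    "NG-1": (1, lambda x: x**3 + x**2 + x),
    "NG-2": (1, lambda x: x**4 + x**3 + x**2 + x),
    "NG-3": (1, lambda x: x**5 + x**4 + x**3 + x**2 + x),
    "NG-4": (1, lambda x: x**6 + x**5 + x**4 + x**3 + x**2 + x),
    "NG-5": (1, lambda x: math.sin(x**2) * math.cos(x) - 1),
    "NG-6": (1, lambda x: math.sin(x) + math.sin(x + x**2)),
    "NG-7": (1, lambda x: math.log(x + 1) + math.log(x**2 + 1)),
    "NG-8": (1, lambda x: math.sqrt(x)),
    "NG-9": (2, lambda x, y: math.sin(x) + math.sin(y**2)),
    "NG-10": (2, lambda x, y: 2 * math.sin(x) * math.cos(y)),
}

# Count expressions by number of input variables.
one_var = [name for name, (n_vars, _) in NGUYEN_EXPRESSIONS.items() if n_vars == 1]
two_var = [name for name, (n_vars, _) in NGUYEN_EXPRESSIONS.items() if n_vars == 2]
print(len(one_var), len(two_var))  # 8 2
```

None of the expressions contains a free constant to fit, which matches the "without constant parameters" description.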

Parameters:

Name	Type	Description	Default
`dataset_directory`	`str`	Directory where dataset files are stored or will be downloaded to. Defaults to the platform-appropriate user data directory (e.g. `~/.local/share/SRToolkit/nguyen` on Linux).	`os.path.join(user_data_dir('SRToolkit'), 'nguyen')`

Source code in SRToolkit/dataset/nguyen.py

def __init__(self, dataset_directory: str = os.path.join(user_data_dir("SRToolkit"), "nguyen")):
    super().__init__("Nguyen", dataset_directory)
    self._populate()

resample

resample(dataset_name: str, n: int, seed: Optional[int] = None) -> Tuple[np.ndarray, np.ndarray]

Generate fresh data for a dataset by sampling new inputs and evaluating the ground truth.

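Inputs are drawn uniformly within per-variable bounds taken from the module-level _BOUNDS mapping, and the targets are computed by evaluating the dataset's ground truth expression on those inputs.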

Examples:

>>> benchmark = Nguyen('data/nguyen/')
>>> X, y = benchmark.resample('NG-1', n=200, seed=42)
>>> X.shape
(200, 1)
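The sampling and evaluation steps can be sketched with NumPy alone, without SRToolkit. The bounds below and the `ground_truth` and `resample_sketch` helpers are hypothetical stand-ins: the real bounds live in _BOUNDS and are not reproduced here, and the real method compiles the stored expression rather than using a hand-written function.

```python
import numpy as np

# Hypothetical bounds: one (lower, upper) pair per input variable.
bounds = [(-1.0, 1.0)]


def ground_truth(X):
    # NG-1 as commonly defined, x^3 + x^2 + x, applied to the first column.
    x = X[:, 0]
    return x**3 + x**2 + x


def resample_sketch(n, seed=None):
    lb = np.array([b[0] for b in bounds], dtype=float)
    ub = np.array([b[1] for b in bounds], dtype=float)
    rng = np.random.default_rng(seed)
    # lb and ub broadcast across the rows, so each column gets its own range.
    X = rng.uniform(lb, ub, size=(n, len(bounds)))
    return X, ground_truth(X)


X, y = resample_sketch(200, seed=42)
print(X.shape, y.shape)  # (200, 1) (200,)
```

With two `(lower, upper)` pairs in `bounds`, the same call would produce an `X` with two columns, each sampled within its own range.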

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Name of the dataset to resample.	required
`n`	`int`	Number of new samples to generate.	required
`seed`	`Optional[int]`	Random seed for reproducibility.	`None`
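The effect of `seed` can be shown in isolation. The sketch below assumes only what the method's source shows, namely that a new `np.random.default_rng(seed)` generator is created on each call; the `draw` helper and its bounds are hypothetical.

```python
import numpy as np


def draw(seed=None, n=5):
    # A fresh Generator per call, seeded as in resample.
    rng = np.random.default_rng(seed)
    return rng.uniform(-1.0, 1.0, size=(n, 1))


same_a = draw(seed=42)
same_b = draw(seed=42)
other = draw(seed=7)

print(np.array_equal(same_a, same_b))  # True
print(np.array_equal(same_a, other))   # False
```

Leaving `seed` at `None` makes NumPy seed the generator from operating-system entropy, so repeated calls return different samples.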

Returns:

Type	Description
`Tuple[ndarray, ndarray]`	A tuple `(X, y)` of numpy arrays with shapes `(n, n_vars)` and `(n,)`.

Raises:

Type	Description
`ValueError`	If the dataset has no ground truth expression.

Source code in SRToolkit/dataset/nguyen.py

def resample(self, dataset_name: str, n: int, seed: Optional[int] = None) -> Tuple[np.ndarray, np.ndarray]:
    """
    Generate fresh data for a dataset by sampling new inputs and evaluating the ground truth.

    Variable bounds are taken from ``_BOUNDS``.

    Examples:
        >>> benchmark = Nguyen('data/nguyen/')
        >>> X, y = benchmark.resample('NG-1', n=200, seed=42)
        >>> X.shape
        (200, 1)

    Args:
        dataset_name: Name of the dataset to resample.
        n: Number of new samples to generate.
        seed: Random seed for reproducibility.

    Returns:
        A tuple ``(X, y)`` of numpy arrays with shapes ``(n, n_vars)`` and ``(n,)``.

    Raises:
        ValueError: If the dataset has no ground truth expression.
    """
    info = self.datasets[dataset_name]
    if info.get("ground_truth") is None:
        raise ValueError(f"Dataset '{dataset_name}' has no ground truth expression — cannot compute y.")
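    # One (lower, upper) pair per input variable of this dataset.
    bounds = _BOUNDS[dataset_name]
    lb = np.array([b[0] for b in bounds], dtype=float)
    ub = np.array([b[1] for b in bounds], dtype=float)
    # Draw n points uniformly; lb and ub broadcast across the rows of X_new.
    rng = np.random.default_rng(seed)
    X_new = rng.uniform(lb, ub, size=(n, len(bounds)))
    # Compile the ground truth into a callable and evaluate it on the new inputs.
    sl = SymbolLibrary.from_dict(info["symbol_library"])
    f = expr_to_executable_function(info["ground_truth"], sl)
    # The second argument is an empty array (presumably the constant values;
    # Nguyen expressions have none).
    y_new = f(X_new, np.array([]))
    return X_new, y_new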