Nguyen

SRToolkit.dataset.nguyen

Nguyen symbolic regression benchmark.

Nguyen

Nguyen(dataset_directory: str = os.path.join(user_data_dir('SRToolkit'), 'nguyen'))

Bases: SR_benchmark

The Nguyen symbolic regression benchmark.

Contains 10 expressions with no constant parameters: the first 8 use one variable (the first 4 of these are polynomials) and the last 2 use two variables. The benchmark ships with pre-generated data.

For more information about the Nguyen benchmark, see: https://doi.org/10.1007/s10710-010-9121-2

Examples:

>>> benchmark = Nguyen()
>>> len(benchmark.list_datasets(verbose=False))
10

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset_directory | str | Directory where dataset files are stored or will be downloaded to. Defaults to the platform-appropriate user data directory (e.g. ~/.local/share/SRToolkit/nguyen on Linux). | os.path.join(user_data_dir('SRToolkit'), 'nguyen') |
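
To make the Linux example path above concrete, the following is a minimal sketch of how such a default could be resolved. It assumes the default follows the XDG base-directory convention (`$XDG_DATA_HOME`, falling back to `~/.local/share`); `linux_default_dataset_dir` is a hypothetical helper written for illustration and is not part of SRToolkit, which delegates this to `user_data_dir`.

```python
import os
import posixpath


def linux_default_dataset_dir(env=None):
    """Hypothetical helper: approximate the Linux default dataset directory.

    Assumption: the location follows the XDG convention, i.e.
    $XDG_DATA_HOME if set and non-empty, otherwise ~/.local/share.
    """
    env = os.environ if env is None else env
    base = env.get("XDG_DATA_HOME", "").strip()
    if not base:
        # Fall back to the conventional per-user data directory.
        base = posixpath.join(posixpath.expanduser("~"), ".local", "share")
    return posixpath.join(base, "SRToolkit", "nguyen")


print(linux_default_dataset_dir({"XDG_DATA_HOME": "/tmp/xdg"}))
# → /tmp/xdg/SRToolkit/nguyen
```

On other platforms the base directory differs, so passing `dataset_directory` explicitly is the simplest way to get a predictable location.
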
Source code in SRToolkit/dataset/nguyen.py
def __init__(self, dataset_directory: str = os.path.join(user_data_dir("SRToolkit"), "nguyen")):
    super().__init__("Nguyen", dataset_directory)
    self._populate()

resample

resample(dataset_name: str, n: int, seed: Optional[int] = None) -> Tuple[np.ndarray, np.ndarray]

Generate fresh data for a dataset by sampling new inputs and evaluating the ground truth.

Variable bounds are taken from _BOUNDS.

Examples:

>>> benchmark = Nguyen('data/nguyen/')
>>> X, y = benchmark.resample('NG-1', n=200, seed=42)
>>> X.shape
(200, 1)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset_name | str | Name of the dataset to resample. | required |
| n | int | Number of new samples to generate. | required |
| seed | Optional[int] | Random seed for reproducibility. | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[ndarray, ndarray] | A tuple (X, y) of numpy arrays with shapes (n, n_vars) and (n,). |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the dataset has no ground truth expression. |
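
The sampling step can be illustrated in isolation. The sketch below mirrors the approach in the method's source (uniform draws between per-variable lower and upper bounds from a seeded NumPy generator), but the bounds and the `ground_truth` function are hypothetical stand-ins, not values from `_BOUNDS` or expressions from the benchmark, and `sample_uniform` is a name introduced only for this example.

```python
import numpy as np

# Hypothetical (lower, upper) bounds for two variables; the library's real
# values are stored in _BOUNDS and may differ.
bounds = [(-1.0, 1.0), (0.0, 2.0)]


def ground_truth(X):
    # Stand-in two-variable expression, not taken from the benchmark.
    return np.sin(X[:, 0]) + X[:, 1] ** 2


def sample_uniform(bounds, n, seed=None):
    """Draw n points uniformly within per-variable bounds."""
    lb = np.array([b[0] for b in bounds], dtype=float)
    ub = np.array([b[1] for b in bounds], dtype=float)
    rng = np.random.default_rng(seed)
    # lb and ub broadcast across rows, so each column gets its own range.
    return rng.uniform(lb, ub, size=(n, len(bounds)))


X = sample_uniform(bounds, n=200, seed=42)
y = ground_truth(X)
print(X.shape, y.shape)
# → (200, 2) (200,)
```

Because the generator is created from the seed on every call, repeating the call with the same integer seed returns the same `X`, while `seed=None` gives a different draw each time.
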

Source code in SRToolkit/dataset/nguyen.py
def resample(self, dataset_name: str, n: int, seed: Optional[int] = None) -> Tuple[np.ndarray, np.ndarray]:
    """
    Generate fresh data for a dataset by sampling new inputs and evaluating the ground truth.

    Variable bounds are taken from ``_BOUNDS``.

    Examples:
        >>> benchmark = Nguyen('data/nguyen/')
        >>> X, y = benchmark.resample('NG-1', n=200, seed=42)
        >>> X.shape
        (200, 1)

    Args:
        dataset_name: Name of the dataset to resample.
        n: Number of new samples to generate.
        seed: Random seed for reproducibility.

    Returns:
        A tuple ``(X, y)`` of numpy arrays with shapes ``(n, n_vars)`` and ``(n,)``.

    Raises:
        ValueError: If the dataset has no ground truth expression.
    """
    info = self.datasets[dataset_name]
    if info.get("ground_truth") is None:
        raise ValueError(f"Dataset '{dataset_name}' has no ground truth expression — cannot compute y.")
    bounds = _BOUNDS[dataset_name]
    lb = np.array([b[0] for b in bounds], dtype=float)
    ub = np.array([b[1] for b in bounds], dtype=float)
    rng = np.random.default_rng(seed)
    X_new = rng.uniform(lb, ub, size=(n, len(bounds)))
    sl = SymbolLibrary.from_dict(info["symbol_library"])
    f = expr_to_executable_function(info["ground_truth"], sl)
    y_new = f(X_new, np.array([]))
    return X_new, y_new