Measures
SRToolkit.utils.measures
This module contains measures for evaluating the similarity between two expressions.
edit_distance
edit_distance(expr1: Union[List[str], Node], expr2: Union[List[str], Node], notation: str = 'postfix', symbol_library: SymbolLibrary = SymbolLibrary.default_symbols()) -> int
Calculates the edit distance between two expressions.
Examples:
>>> edit_distance(["X_0", "+", "1"], ["X_0", "+", "1"])
0
>>> edit_distance(["X_0", "+", "1"], ["X_0", "-", "1"])
1
>>> edit_distance(tokens_to_tree(["X_0", "+", "1"], SymbolLibrary.default_symbols(1)), tokens_to_tree(["X_0", "-", "1"], SymbolLibrary.default_symbols(1)))
1
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `expr1` | `Union[List[str], Node]` | Expression given as a list of tokens in the infix notation or as an instance of SRToolkit.utils.expression_tree.Node. | required |
| `expr2` | `Union[List[str], Node]` | Expression given as a list of tokens in the infix notation or as an instance of SRToolkit.utils.expression_tree.Node. | required |
| `notation` | `str` | The notation in which the distance between the two expressions is computed. Can be one of "infix", "postfix", or "prefix". By default, "postfix" is used to avoid inconsistencies caused by parentheses. | `'postfix'` |
| `symbol_library` | `SymbolLibrary` | The symbol library to use when converting the expressions to lists of tokens and vice versa. Defaults to SymbolLibrary.default_symbols(). | `default_symbols()` |
Returns:
| Type | Description |
|---|---|
| `int` | The edit distance between the two expressions written in a given notation. |
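
To make the returned number concrete, the following is a minimal sketch of a token-level edit distance. It assumes the measure is the standard Levenshtein distance with unit costs; `token_edit_distance` is a hypothetical helper, not part of SRToolkit, and the library's actual implementation may differ. The postfix token lists are written by hand rather than produced by the library.

```python
from typing import List


def token_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs).

    Hypothetical helper for illustration; not SRToolkit's implementation.
    """
    # previous[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]  # deleting the first i tokens of a
        for j, token_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (token_a != token_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Postfix forms of X_0 + 1 and X_0 - 1, written by hand.
plus = ["X_0", "1", "+"]
minus = ["X_0", "1", "-"]

print(token_edit_distance(plus, plus))   # 0
print(token_edit_distance(plus, minus))  # 1
```

The two printed values match the doctest results for the same pair of expressions.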
Source code in SRToolkit/utils/measures.py
tree_edit_distance
tree_edit_distance(expr1: Union[Node, List[str]], expr2: Union[Node, List[str]], symbol_library: SymbolLibrary = SymbolLibrary.default_symbols()) -> int
Calculates the tree edit distance between two expressions.
Examples:
>>> tree_edit_distance(["X_0", "+", "1"], ["X_0", "+", "1"])
0
>>> tree_edit_distance(["X_0", "+", "1"], ["X_0", "-", "1"])
1
>>> tree_edit_distance(tokens_to_tree(["X_0", "+", "1"], SymbolLibrary.default_symbols(1)), tokens_to_tree(["X_0", "-", "1"], SymbolLibrary.default_symbols(1)))
1
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `expr1` | `Union[Node, List[str]]` | Expression given as a list of tokens in the infix notation or as an instance of SRToolkit.utils.expression_tree.Node. | required |
| `expr2` | `Union[Node, List[str]]` | Expression given as a list of tokens in the infix notation or as an instance of SRToolkit.utils.expression_tree.Node. | required |
| `symbol_library` | `SymbolLibrary` | Symbol library to use when converting the lists of tokens into an instance of SRToolkit.utils.expression_tree.Node. | `default_symbols()` |
Returns:
| Type | Description |
|---|---|
| `int` | The tree edit distance between the two expressions. |
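
As an illustration of what a tree edit distance counts, here is a small sketch of the classic ordered-tree recursion with unit costs for inserting, deleting, and relabeling a node. The nested-tuple tree format and the names `forest_size`, `forest_distance`, and `tree_distance` are assumptions made for this example; the algorithm SRToolkit uses internally is not stated here and may differ.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples are hashable, so results can be cached.


def forest_size(forest) -> int:
    """Number of nodes in a forest."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g) -> int:
    """Edit distance between two ordered forests with unit costs."""
    if not f:
        return forest_size(g)  # insert every node of g
    if not g:
        return forest_size(f)  # delete every node of f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete_v = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert_w = forest_distance(f, g[:-1] + w_children) + 1
    # Map the two rightmost roots onto each other (relabel if needed).
    match = (
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + (v_label != w_label)
    )
    return min(delete_v, insert_w, match)


def tree_distance(t1, t2) -> int:
    return forest_distance((t1,), (t2,))


x0 = ("X_0", ())
one = ("1", ())
plus = ("+", (x0, one))
minus = ("-", (x0, one))

print(tree_distance(plus, plus))   # 0
print(tree_distance(plus, minus))  # 1 (relabel the root)
print(tree_distance(x0, plus))     # 2 (insert "+" and "1")
```

This recursion is simple but not tuned for large trees; it is meant only to show which operations are counted.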
Source code in SRToolkit/utils/measures.py
create_behavior_matrix
create_behavior_matrix(expr: Union[Node, List[str]], X: ndarray, num_consts_sampled: int = 32, consts_bounds: Tuple[float, float] = (-5, 5), symbol_library: SymbolLibrary = SymbolLibrary.default_symbols(), seed: int = None) -> np.ndarray
Creates a behavior matrix from an expression with free parameters. The shape of the matrix is (X.shape[0], num_consts_sampled).
Examples:
>>> X = np.random.rand(10, 2) - 0.5
>>> create_behavior_matrix(["X_0", "+", "C"], X, num_consts_sampled=32).shape
(10, 32)
>>> mean_0_1 = np.mean(create_behavior_matrix(["X_0", "+", "C"], X, num_consts_sampled=32, consts_bounds=(0, 1)))
>>> mean_1_5 = np.mean(create_behavior_matrix(["X_0", "+", "C"], X, num_consts_sampled=32, consts_bounds=(1, 5)))
>>> mean_0_1 < mean_1_5
True
>>> # Deterministic expressions always produce the same behavior matrix
>>> bm1 = create_behavior_matrix(["X_0", "+", "X_1"], X)
>>> bm2 = create_behavior_matrix(["X_0", "+", "X_1"], X)
>>> np.array_equal(bm1, bm2)
True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `expr` | `Union[Node, List[str]]` | An expression given as a list of tokens in the infix notation or as an instance of SRToolkit.utils.expression_tree.Node. | required |
| `X` | `ndarray` | Points on which the expression is evaluated to determine its behavior. | required |
| `num_consts_sampled` | `int` | Number of sets of constants sampled. | `32` |
| `consts_bounds` | `Tuple[float, float]` | Lower and upper bounds between which constant values are sampled. | `(-5, 5)` |
| `symbol_library` | `SymbolLibrary` | Symbol library used to transform the expression into an executable function. | `default_symbols()` |
| `seed` | `int` | Random seed. If None, the sampling is not seeded and results vary between calls. | `None` |
Raises:
| Type | Description |
|---|---|
| `Exception` | If expr is not a list of tokens or an instance of SRToolkit.utils.expression_tree.Node. |
Returns:
| Type | Description |
|---|---|
| `ndarray` | A matrix of size (X.shape[0], num_consts_sampled) that represents the behavior of an expression. |
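
The structure of a behavior matrix can be sketched with plain NumPy: each column holds the outputs of the expression for one sampled set of constants, and each row corresponds to one point in X. The helper `behavior_matrix_sketch`, its `num_consts` argument, and the choice of uniform sampling are assumptions made for illustration; how SRToolkit samples constants internally is not stated here.

```python
import numpy as np


def behavior_matrix_sketch(f, X, num_consts=1, num_consts_sampled=32,
                           consts_bounds=(-5, 5), seed=None):
    """Evaluate f(X, constants) for several sampled sets of constants.

    Hypothetical helper: f is a Python callable standing in for a compiled
    expression, and constants are drawn uniformly (an assumption).
    """
    rng = np.random.default_rng(seed)
    low, high = consts_bounds
    constants = rng.uniform(low, high, size=(num_consts_sampled, num_consts))
    # One column per sampled set of constants, one row per point in X.
    return np.column_stack([f(X, c) for c in constants])


X = np.random.default_rng(0).random((10, 2)) - 0.5

# Stand-in for the expression X_0 + C.
bm = behavior_matrix_sketch(lambda X, c: X[:, 0] + c[0], X, seed=0)
print(bm.shape)  # (10, 32)
```

Passing the same `seed` twice yields identical matrices in this sketch, which mirrors the purpose of the `seed` parameter above.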
Source code in SRToolkit/utils/measures.py
bed
bed(expr1: Union[Node, List[str], ndarray], expr2: Union[Node, List[str], ndarray], X: Optional[ndarray] = None, num_consts_sampled: int = 32, num_points_sampled: int = 64, domain_bounds: Optional[List[Tuple[float, float]]] = None, consts_bounds: Tuple[float, float] = (-5, 5), symbol_library: SymbolLibrary = SymbolLibrary.default_symbols(), seed: int = None) -> float
Computes the Behavioral Embedding Distance (BED) between two expressions or behavior matrices over a given dataset or domain, using Wasserstein distance as a metric.
The BED is computed either by using precomputed behavior matrices or by sampling points from a domain and evaluating the expressions over them.
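
When X is not supplied, points have to be drawn from the domain described by `domain_bounds`. A minimal sketch of that step follows, assuming independent uniform sampling per feature; `sample_points` is a hypothetical helper and the sampling scheme SRToolkit actually uses may differ.

```python
import numpy as np


def sample_points(domain_bounds, num_points_sampled=64, seed=None):
    """Draw points uniformly from a box domain (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    lows = np.array([low for low, _ in domain_bounds], dtype=float)
    highs = np.array([high for _, high in domain_bounds], dtype=float)
    # One row per point, one column per feature.
    return rng.uniform(lows, highs, size=(num_points_sampled, len(domain_bounds)))


X = sample_points([(0, 1), (0, 2)], num_points_sampled=64, seed=0)
print(X.shape)  # (64, 2)
```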
Examples:
>>> X = np.random.rand(10, 2) - 0.5
>>> expr1 = ["X_0", "+", "C"] # instances of SRToolkit.utils.expression_tree.Node work as well
>>> expr2 = ["X_1", "+", "C"]
>>> bed(expr1, expr2, X) < 1
True
>>> # Changing the number of sampled constants
>>> bed(expr1, expr2, X, num_consts_sampled=128, consts_bounds=(-2, 2)) < 1
True
>>> # Sampling X instead of giving it directly by defining a domain
>>> bed(expr1, expr2, domain_bounds=[(0, 1), (0, 1)]) < 1
True
>>> bed(expr1, expr2, domain_bounds=[(0, 1), (0, 1)], num_points_sampled=128) < 1
True
>>> # You can use behavior matrices instead of expressions (this can save computation if the same expression is used multiple times)
>>> bm1 = create_behavior_matrix(expr1, X)
>>> bed(bm1, expr2, X) < 1
True
>>> bm2 = create_behavior_matrix(expr2, X)
>>> bed(bm1, bm2) < 1
True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `expr1` | `Union[Node, List[str], ndarray]` | The first expression or behavior matrix. An expression must be provided as a Node or a list of tokens. A behavior matrix must be a NumPy array of shape (num_points_sampled, num_consts_sampled). | required |
| `expr2` | `Union[Node, List[str], ndarray]` | The second expression or behavior matrix. As with expr1, it should be a Node, a list of tokens representing the expression, or a NumPy array representing the behavior matrix. | required |
| `X` | `Optional[ndarray]` | Array of points over which behavior is evaluated. If not provided, domain_bounds is used to sample points. | `None` |
| `num_consts_sampled` | `int` | Number of constants sampled for behavior evaluation if expressions are given as Nodes or lists rather than matrices. Default is 32. | `32` |
| `num_points_sampled` | `int` | Number of points sampled from the domain if X is not provided. Default is 64. | `64` |
| `domain_bounds` | `Optional[List[Tuple[float, float]]]` | The bounds of the domain for sampling points when X is not given. Each tuple holds the lower and upper bounds for one feature (e.g., [(0, 1), (0, 2)]). | `None` |
| `consts_bounds` | `Tuple[float, float]` | The lower and upper bounds for sampling constants when evaluating expressions. Default is (-5, 5). | `(-5, 5)` |
| `symbol_library` | `SymbolLibrary` | The library of symbols used to parse and evaluate expressions. Default is SymbolLibrary.default_symbols(). | `default_symbols()` |
| `seed` | `int` | Seed for random number generation during sampling, for deterministic results. Default is None. | `None` |
Returns:
| Type | Description |
|---|---|
| `float` | The mean Wasserstein distance computed between the behaviors of the two expressions or matrices over the sampled points. |
Raises:
| Type | Description |
|---|---|
| `Exception` | If X is not provided and domain_bounds is missing, this exception is raised to ensure proper sampling of points for behavior evaluation. |
| `AssertionError` | Raised when the shapes of the behavior matrices or sampling points do not match the expected dimensions. |
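
The "mean Wasserstein distance over the sampled points" can be sketched directly on two behavior matrices. For two equally sized, equally weighted one-dimensional samples, the Wasserstein-1 distance equals the mean absolute difference of the sorted values, so no extra library is needed here. `bed_sketch` is a hypothetical helper that assumes both matrices have the same shape; the library's own computation may handle more cases and may differ in detail.

```python
import numpy as np


def bed_sketch(bm1: np.ndarray, bm2: np.ndarray) -> float:
    """Mean row-wise 1D Wasserstein-1 distance between two behavior matrices.

    Illustrative only: assumes equal shapes and equally weighted samples.
    """
    assert bm1.shape == bm2.shape, "behavior matrices must have the same shape"
    # Row i holds the outputs at point i for all sampled constants.
    sorted1 = np.sort(bm1, axis=1)
    sorted2 = np.sort(bm2, axis=1)
    per_point = np.mean(np.abs(sorted1 - sorted2), axis=1)
    return float(np.mean(per_point))


rng = np.random.default_rng(0)
bm = rng.uniform(-5, 5, size=(10, 32))

print(bed_sketch(bm, bm))                         # 0.0
print(np.isclose(bed_sketch(bm, bm + 2.0), 2.0))  # True: a shift by 2 moves every sample by 2
```

Reusing precomputed matrices in this way avoids re-evaluating an expression each time it is compared against another one.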
Source code in SRToolkit/utils/measures.py