Measures
SRToolkit.utils.measures
Distance and similarity measures between symbolic expressions: edit distance, tree edit distance, and Behavioral Embedding Distance (BED).
edit_distance
edit_distance(expr1: Union[List[str], Node], expr2: Union[List[str], Node], notation: str = 'postfix', symbol_library: Optional[SymbolLibrary] = None) -> int
Compute the edit distance between two expressions.
Both expressions are first converted to the requested notation, so the result is independent of whether a token list or a Node tree is passed. Levenshtein distance is then computed on the serialised token sequences.
Examples:
>>> edit_distance(["X_0", "+", "1"], ["X_0", "+", "1"])
0
>>> edit_distance(["X_0", "+", "1"], ["X_0", "-", "1"])
1
>>> edit_distance(tokens_to_tree(["X_0", "+", "1"], SymbolLibrary.default_symbols(1)), tokens_to_tree(["X_0", "-", "1"], SymbolLibrary.default_symbols(1)))
1
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expr1
|
Union[List[str], Node]
|
First expression as a token list or a Node tree. |
required |
expr2
|
Union[List[str], Node]
|
Second expression as a token list or a Node tree. |
required |
notation
|
str
|
Notation used for comparison: |
'postfix'
|
symbol_library
|
Optional[SymbolLibrary]
|
Symbol library used when converting expressions to the target notation. Defaults to SymbolLibrary.default_symbols. |
None
|
Returns:
| Type | Description |
|---|---|
int
|
Integer edit distance between the two serialised expressions. |
Source code in SRToolkit/utils/measures.py
tree_edit_distance
tree_edit_distance(expr1: Union[Node, List[str]], expr2: Union[Node, List[str]], symbol_library: Optional[SymbolLibrary] = None) -> int
Compute the Zhang-Shasha tree edit distance between two expressions.
Unlike edit_distance, which operates on flattened token sequences, tree edit distance considers the expression's hierarchical structure. The cost is the minimum number of node insertions, deletions, and relabellings needed to transform one tree into the other.
Examples:
>>> tree_edit_distance(["X_0", "+", "1"], ["X_0", "+", "1"])
0
>>> tree_edit_distance(["X_0", "+", "1"], ["X_0", "-", "1"])
1
>>> tree_edit_distance(tokens_to_tree(["X_0", "+", "1"], SymbolLibrary.default_symbols(1)), tokens_to_tree(["X_0", "-", "1"], SymbolLibrary.default_symbols(1)))
1
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expr1
|
Union[Node, List[str]]
|
First expression as a token list or a Node tree. |
required |
expr2
|
Union[Node, List[str]]
|
Second expression as a token list or a Node tree. |
required |
symbol_library
|
Optional[SymbolLibrary]
|
Symbol library used when converting token lists to trees. Defaults to SymbolLibrary.default_symbols. |
None
|
Returns:
| Type | Description |
|---|---|
int
|
Integer tree edit distance. |
Source code in SRToolkit/utils/measures.py
create_behavior_matrix
create_behavior_matrix(expr: Union[Node, List[str]], X: ndarray, num_consts_sampled: int = 32, consts_bounds: Tuple[float, float] = (-5, 5), symbol_library: Optional[SymbolLibrary] = None, seed: Optional[int] = None) -> np.ndarray
Evaluate an expression over multiple constant samples to produce a behavior matrix.
For expressions with free constants, constants are drawn via Latin Hypercube Sampling
within consts_bounds. For constant-free expressions, all columns are identical.
Examples:
>>> X = np.random.rand(10, 2) - 0.5
>>> create_behavior_matrix(["X_0", "+", "C"], X, num_consts_sampled=32).shape
(10, 32)
>>> mean_0_1 = np.mean(create_behavior_matrix(["X_0", "+", "C"], X, num_consts_sampled=32, consts_bounds=(0, 1)))
>>> mean_1_5 = np.mean(create_behavior_matrix(["X_0", "+", "C"], X, num_consts_sampled=32, consts_bounds=(1, 5)))
>>> print(bool(mean_0_1 < mean_1_5))
True
>>> # Deterministic expressions always produce the same behavior matrix
>>> bm1 = create_behavior_matrix(["X_0", "+", "X_1"], X)
>>> bm2 = create_behavior_matrix(["X_0", "+", "X_1"], X)
>>> print(bool(np.array_equal(bm1, bm2)))
True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expr
|
Union[Node, List[str]]
|
Expression as a token list or a Node tree. |
required |
X
|
ndarray
|
Input data of shape |
required |
num_consts_sampled
|
int
|
Number of constant vectors to sample; sets the number of
output columns. Default |
32
|
consts_bounds
|
Tuple[float, float]
|
|
(-5, 5)
|
symbol_library
|
Optional[SymbolLibrary]
|
Symbol library used to compile the expression. Defaults to SymbolLibrary.default_symbols. |
None
|
seed
|
Optional[int]
|
Random seed for reproducible constant sampling. Default |
None
|
Returns:
| Type | Description |
|---|---|
ndarray
|
Behavior matrix of shape |
Raises:
| Type | Description |
|---|---|
Exception
|
If |
Source code in SRToolkit/utils/measures.py
bed
bed(expr1: Union[Node, List[str], ndarray], expr2: Union[Node, List[str], ndarray], X: Optional[ndarray] = None, num_consts_sampled: int = 32, num_points_sampled: int = 64, domain_bounds: Optional[List[Tuple[float, float]]] = None, consts_bounds: Tuple[float, float] = (-5, 5), symbol_library: Optional[SymbolLibrary] = None, seed: Optional[int] = None) -> float
Compute the Behavior-aware Expression Distance (BED) between two expressions.
BED measures how similarly two expressions behave over a domain by comparing their output distributions point-by-point using the Wasserstein distance. Free constants are marginalised by sampling multiple constant vectors via Latin Hypercube Sampling.
Either X or domain_bounds must be provided when expressions are given as
token lists or Node trees. Pre-computed behavior matrices can be passed
directly to avoid redundant evaluation.
Examples:
>>> X = np.random.rand(10, 2) - 0.5
>>> expr1 = ["X_0", "+", "C"] # instances of SRToolkit.utils.expression_tree.Node work as well
>>> expr2 = ["X_1", "+", "C"]
>>> bed(expr1, expr2, X) < 1
True
>>> # Changing the number of sampled constants
>>> bed(expr1, expr2, X, num_consts_sampled=128, consts_bounds=(-2, 2)) < 1
True
>>> # Sampling X instead of giving it directly by defining a domain
>>> bed(expr1, expr2, domain_bounds=[(0, 1), (0, 1)]) < 1
True
>>> bed(expr1, expr2, domain_bounds=[(0, 1), (0, 1)], num_points_sampled=128) < 1
True
>>> # You can use behavior matrices instead of expressions (this has potential computational advantages if same expression is used multiple times)
>>> bm1 = create_behavior_matrix(expr1, X)
>>> bed(bm1, expr2, X) < 1
True
>>> bm2 = create_behavior_matrix(expr2, X)
>>> bed(bm1, bm2) < 1
True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expr1
|
Union[Node, List[str], ndarray]
|
First expression as a token list, a Node tree, or a pre-computed
behavior matrix of shape |
required |
expr2
|
Union[Node, List[str], ndarray]
|
Second expression in the same format as |
required |
X
|
Optional[ndarray]
|
Evaluation points of shape |
None
|
num_consts_sampled
|
int
|
Number of constant vectors sampled per expression. Default |
32
|
num_points_sampled
|
int
|
Number of points sampled from |
64
|
domain_bounds
|
Optional[List[Tuple[float, float]]]
|
Per-variable |
None
|
consts_bounds
|
Tuple[float, float]
|
|
(-5, 5)
|
symbol_library
|
Optional[SymbolLibrary]
|
Symbol library used to compile expressions. Defaults to SymbolLibrary.default_symbols. |
None
|
seed
|
Optional[int]
|
Random seed for reproducible sampling. Default |
None
|
Returns:
| Type | Description |
|---|---|
float
|
BED between the expressions as a non-negative float. A value of |
float
|
indicates identical behavior over the sampled domain; larger values indicate |
float
|
greater behavioral dissimilarity. Returns |
float
|
produces finite outputs for one expression but not the other. |
Raises:
| Type | Description |
|---|---|
Exception
|
If |
Exception
|
If |
ValueError
|
If any entry in |
ValueError
|
If the two behavior matrices have different numbers of rows. |
ValueError
|
If the behavior matrices have zero rows. |
Source code in SRToolkit/utils/measures.py
255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 | |