Measures
SRToolkit.utils.measures
Distance and similarity measures between symbolic expressions: edit distance, tree edit distance, and Behavioral Embedding Distance (BED).
edit_distance
edit_distance(expr1: Union[List[str], Node], expr2: Union[List[str], Node], notation: str = 'postfix', symbol_library: SymbolLibrary = SymbolLibrary.default_symbols()) -> int
Compute the edit distance between two expressions.
Both expressions are normalised to the requested notation before computing Levenshtein distance, making the result independent of input serialisation.
Examples:
>>> edit_distance(["X_0", "+", "1"], ["X_0", "+", "1"])
0
>>> edit_distance(["X_0", "+", "1"], ["X_0", "-", "1"])
1
>>> edit_distance(tokens_to_tree(["X_0", "+", "1"], SymbolLibrary.default_symbols(1)), tokens_to_tree(["X_0", "-", "1"], SymbolLibrary.default_symbols(1)))
1
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expr1
|
Union[List[str], Node]
|
First expression as a token list or a Node tree. |
required |
expr2
|
Union[List[str], Node]
|
Second expression as a token list or a Node tree. |
required |
notation
|
str
|
Notation used for comparison: |
'postfix'
|
symbol_library
|
SymbolLibrary
|
Symbol library used when converting expressions to the target notation. Defaults to SymbolLibrary.default_symbols. |
default_symbols()
|
Returns:
| Type | Description |
|---|---|
int
|
Integer edit distance between the two serialised expressions. |
Source code in SRToolkit/utils/measures.py
tree_edit_distance
tree_edit_distance(expr1: Union[Node, List[str]], expr2: Union[Node, List[str]], symbol_library: SymbolLibrary = SymbolLibrary.default_symbols()) -> int
Compute the Zhang-Shasha tree edit distance between two expressions.
Examples:
>>> tree_edit_distance(["X_0", "+", "1"], ["X_0", "+", "1"])
0
>>> tree_edit_distance(["X_0", "+", "1"], ["X_0", "-", "1"])
1
>>> tree_edit_distance(tokens_to_tree(["X_0", "+", "1"], SymbolLibrary.default_symbols(1)), tokens_to_tree(["X_0", "-", "1"], SymbolLibrary.default_symbols(1)))
1
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expr1
|
Union[Node, List[str]]
|
First expression as a token list or a Node tree. |
required |
expr2
|
Union[Node, List[str]]
|
Second expression as a token list or a Node tree. |
required |
symbol_library
|
SymbolLibrary
|
Symbol library used when converting token lists to trees. Defaults to SymbolLibrary.default_symbols. |
default_symbols()
|
Returns:
| Type | Description |
|---|---|
int
|
Integer tree edit distance. |
Source code in SRToolkit/utils/measures.py
create_behavior_matrix
create_behavior_matrix(expr: Union[Node, List[str]], X: ndarray, num_consts_sampled: int = 32, consts_bounds: Tuple[float, float] = (-5, 5), symbol_library: SymbolLibrary = SymbolLibrary.default_symbols(), seed: Optional[int] = None) -> np.ndarray
Evaluate an expression over multiple constant samples to produce a behavior matrix.
For expressions with free constants, constants are drawn via Latin Hypercube Sampling
within consts_bounds. For constant-free expressions, all columns are identical.
Examples:
>>> X = np.random.rand(10, 2) - 0.5
>>> create_behavior_matrix(["X_0", "+", "C"], X, num_consts_sampled=32).shape
(10, 32)
>>> mean_0_1 = np.mean(create_behavior_matrix(["X_0", "+", "C"], X, num_consts_sampled=32, consts_bounds=(0, 1)))
>>> mean_1_5 = np.mean(create_behavior_matrix(["X_0", "+", "C"], X, num_consts_sampled=32, consts_bounds=(1, 5)))
>>> print(bool(mean_0_1 < mean_1_5))
True
>>> # Deterministic expressions always produce the same behavior matrix
>>> bm1 = create_behavior_matrix(["X_0", "+", "X_1"], X)
>>> bm2 = create_behavior_matrix(["X_0", "+", "X_1"], X)
>>> print(bool(np.array_equal(bm1, bm2)))
True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expr
|
Union[Node, List[str]]
|
Expression as a token list or a Node tree. |
required |
X
|
ndarray
|
Input data of shape |
required |
num_consts_sampled
|
int
|
Number of constant vectors to sample; sets the number of
output columns. Default |
32
|
consts_bounds
|
Tuple[float, float]
|
|
(-5, 5)
|
symbol_library
|
SymbolLibrary
|
Symbol library used to compile the expression. Defaults to SymbolLibrary.default_symbols. |
default_symbols()
|
seed
|
Optional[int]
|
Random seed for reproducible constant sampling. Default |
None
|
Returns:
| Type | Description |
|---|---|
ndarray
|
Behavior matrix of shape |
Raises:
| Type | Description |
|---|---|
Exception
|
If |
Source code in SRToolkit/utils/measures.py
bed
bed(expr1: Union[Node, List[str], ndarray], expr2: Union[Node, List[str], ndarray], X: Optional[ndarray] = None, num_consts_sampled: int = 32, num_points_sampled: int = 64, domain_bounds: Optional[List[Tuple[float, float]]] = None, consts_bounds: Tuple[float, float] = (-5, 5), symbol_library: SymbolLibrary = SymbolLibrary.default_symbols(), seed: Optional[int] = None) -> float
Compute the Behavior-aware Expression Distance (BED) between two expressions.
BED measures how similarly two expressions behave over a domain by comparing their output distributions point-by-point using the Wasserstein distance. Free constants are marginalised by sampling multiple constant vectors via Latin Hypercube Sampling.
Either X or domain_bounds must be provided when expressions are given as
token lists or Node trees. Pre-computed behavior matrices can be passed
directly to avoid redundant evaluation.
Examples:
>>> X = np.random.rand(10, 2) - 0.5
>>> expr1 = ["X_0", "+", "C"] # instances of SRToolkit.utils.expression_tree.Node work as well
>>> expr2 = ["X_1", "+", "C"]
>>> bed(expr1, expr2, X) < 1
True
>>> # Changing the number of sampled constants
>>> bed(expr1, expr2, X, num_consts_sampled=128, consts_bounds=(-2, 2)) < 1
True
>>> # Sampling X instead of giving it directly by defining a domain
>>> bed(expr1, expr2, domain_bounds=[(0, 1), (0, 1)]) < 1
True
>>> bed(expr1, expr2, domain_bounds=[(0, 1), (0, 1)], num_points_sampled=128) < 1
True
>>> # You can use behavior matrices instead of expressions (this has potential computational advantages if same expression is used multiple times)
>>> bm1 = create_behavior_matrix(expr1, X)
>>> bed(bm1, expr2, X) < 1
True
>>> bm2 = create_behavior_matrix(expr2, X)
>>> bed(bm1, bm2) < 1
True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expr1
|
Union[Node, List[str], ndarray]
|
First expression as a token list, a Node tree, or a pre-computed
behavior matrix of shape |
required |
expr2
|
Union[Node, List[str], ndarray]
|
Second expression in the same format as |
required |
X
|
Optional[ndarray]
|
Evaluation points of shape |
None
|
num_consts_sampled
|
int
|
Number of constant vectors sampled per expression. Default |
32
|
num_points_sampled
|
int
|
Number of points sampled from |
64
|
domain_bounds
|
Optional[List[Tuple[float, float]]]
|
Per-variable |
None
|
consts_bounds
|
Tuple[float, float]
|
|
(-5, 5)
|
symbol_library
|
SymbolLibrary
|
Symbol library used to compile expressions. Defaults to SymbolLibrary.default_symbols. |
default_symbols()
|
seed
|
Optional[int]
|
Random seed for reproducible sampling. Default |
None
|
Returns:
| Type | Description |
|---|---|
float
|
BED between the expressions, given as a float. |
Raises:
| Type | Description |
|---|---|
Exception
|
If |
Source code in SRToolkit/utils/measures.py
207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 | |