Grammar
SRToolkit.utils.grammar.grammar
Context-free and probabilistic context-free grammar (CFG/PCFG) representation.
Provides Rule, ParseTreeNode, ParseTree, and Grammar. For stateful expression generation see Derivation.
Rule
dataclass
A single production rule in a grammar.
A rule is a CFG production when weights remain 1 for all rules in the grammar; when weights differ across rules, the grammar is treated as a PCFG and productions are sampled proportionally. Weights are unnormalised — the grammar normalises them at sampling time.
Examples:
>>> r = Rule("E", ["E", "+", "F"], weight=0.4, name="E_add_+")
>>> r.lhs
'E'
>>> r.rhs
['E', '+', 'F']
>>> r.weight
0.4
>>> r.name
'E_add_+'
>>> Rule("E", ["F"]).weight
1.0
Attributes:
| Name | Type | Description |
|---|---|---|
lhs |
str
|
The non-terminal being expanded, e.g. |
rhs |
list[str]
|
Ordered sequence of symbols the non-terminal expands to. Each
element is either a terminal token (e.g. |
weight |
float
|
Unnormalised sampling weight. Defaults to |
name |
Optional[str]
|
Optional stable identifier for this rule. Used by constraints
for scoping and identification. |
from_line
classmethod
Parse one NLTK production line into a list of Rule objects.
The line must have the form::
LHS -> RHS_1 | RHS_2 | ...
where each alternative is a space-separated sequence of symbols.
Terminals are enclosed in single quotes; unquoted tokens are non-terminals.
An optional weight in square brackets at the end of an alternative is parsed
as a float; alternatives without a weight default to 1.0.
Examples:
>>> Rule.from_line("E -> E '+' F [0.4] | F [0.6]")
[Rule(lhs='E', rhs=['E', '+', 'F'], weight=0.4, name=None), Rule(lhs='E', rhs=['F'], weight=0.6, name=None)]
>>> Rule.from_line("F -> 'x'")
[Rule(lhs='F', rhs=['x'], weight=1.0, name=None)]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
line
|
str
|
A single non-empty, non-comment production line. |
required |
Returns:
| Type | Description |
|---|---|
list['Rule']
|
List of Rule objects, one per alternative. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
ValueError
|
If the left-hand side is empty. |
ValueError
|
If a weight token cannot be converted to |
ValueError
|
If no alternatives are parsed from the right-hand side. |
Source code in SRToolkit/utils/grammar/grammar.py
ParseTreeNode
dataclass
A node in a derivation parse tree.
Terminal leaves have rule_applied=None and children=[].
Internal nodes store the rule that expanded the non-terminal at this position.
Attributes:
| Name | Type | Description |
|---|---|---|
symbol |
str
|
The grammar symbol at this node (terminal or non-terminal name). |
rule_applied |
Optional[Rule]
|
The Rule used to expand this node.
|
children |
list[ParseTreeNode]
|
Ordered child nodes corresponding to the symbols in
|
ParseTree
Full derivation history of an expression under a grammar.
Unlike Node, which holds only terminal symbols for expression evaluation, a ParseTree retains every non-terminal and every production applied during the derivation.
Examples:
>>> r = Rule("E", ["x"])
>>> leaf = ParseTreeNode("x", None)
>>> root = ParseTreeNode("E", r, [leaf])
>>> pt = ParseTree(root)
>>> pt.to_token_list()
['x']
>>> pt.productions_used()
[Rule(lhs='E', rhs=['x'], weight=1.0, name=None)]
Attributes:
| Name | Type | Description |
|---|---|---|
root |
The root node of the parse tree. |
Source code in SRToolkit/utils/grammar/grammar.py
to_token_list
Collect all terminal tokens in left-to-right order.
Examples:
>>> leaf1 = ParseTreeNode("a", None)
>>> leaf2 = ParseTreeNode("+", None)
>>> leaf3 = ParseTreeNode("b", None)
>>> root = ParseTreeNode("E", Rule("E", ["a", "+", "b"]), [leaf1, leaf2, leaf3])
>>> ParseTree(root).to_token_list()
['a', '+', 'b']
Returns:
| Type | Description |
|---|---|
list[str]
|
Flat list of terminal tokens in left-to-right order. |
Source code in SRToolkit/utils/grammar/grammar.py
productions_used
Return all rules applied in the derivation, in pre-order.
Examples:
>>> leaf = ParseTreeNode("x", None)
>>> r = Rule("E", ["x"])
>>> root = ParseTreeNode("E", r, [leaf])
>>> ParseTree(root).productions_used()
[Rule(lhs='E', rhs=['x'], weight=1.0, name=None)]
Returns:
| Type | Description |
|---|---|
list[Rule]
|
List of Rule objects in pre-order traversal order. |
Source code in SRToolkit/utils/grammar/grammar.py
Grammar
A context-free grammar (CFG) or probabilistic context-free grammar (PCFG).
Rules are added via add_rule. The set of
non-terminals is derived automatically: a symbol is a non-terminal if and only
if it appears as the lhs of at least one rule.
A grammar is a CFG when every rule carries the default weight (1.0), making
sampling uniform within each group. It is a PCFG when any rule has a weight
that differs from 1.0.
Constraints are registered via add_constraint. During a derivation, options returns only rules that every constraint's allows accepts.
Examples:
>>> g = Grammar([
... Rule("E", ["E", "+", "F"], weight=0.4),
... Rule("E", ["F"], weight=0.6),
... Rule("F", ["x"]),
... ])
>>> "E" in g.nonterminals
True
>>> g.is_pcfg()
True
>>> len(g.rules_for("E"))
2
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rules
|
Optional[list[Rule]]
|
None
|
|
start
|
Optional[str]
|
Default start non-terminal used by
start_derivation
when no |
None
|
Source code in SRToolkit/utils/grammar/grammar.py
nonterminals
property
add_rule
Add a production rule to the grammar.
Examples:
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rule
|
Rule
|
The Rule to register. |
required |
Source code in SRToolkit/utils/grammar/grammar.py
add_constraint
Register a constraint applied at each derivation step.
A rule is offered as an option only when every registered constraint's
allows returns True
for it.
Examples:
>>> from SRToolkit.utils.grammar import MaxDepth
>>> g = Grammar([Rule("E", ["E", "+", "E"]), Rule("E", ["x"])])
>>> g.add_constraint(MaxDepth(0))
>>> d = g.start_derivation("E")
>>> [r.rhs for r in d.options()]
[['x']]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
constraint
|
'Constraint'
|
A Constraint instance — typically a built-in such as MaxDepth, MaxNodes, MaxOccurrences, NoNested, or DimensionalConsistency, or a user-defined subclass. |
required |
Source code in SRToolkit/utils/grammar/grammar.py
rules_for
Return all rules whose lhs matches nonterminal.
Examples:
>>> g = Grammar()
>>> g.add_rule(Rule("E", ["F"]))
>>> g.add_rule(Rule("E", ["x"]))
>>> len(g.rules_for("E"))
2
>>> g.rules_for("Z")
[]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
nonterminal
|
str
|
Non-terminal to look up. |
required |
Returns:
| Type | Description |
|---|---|
list[Rule]
|
List of matching Rule objects in insertion order. |
Source code in SRToolkit/utils/grammar/grammar.py
is_pcfg
Return True if any rule deviates from the default weight of 1.0.
Examples:
>>> g = Grammar()
>>> g.add_rule(Rule("E", ["x"]))
>>> g.add_rule(Rule("E", ["y"]))
>>> g.is_pcfg()
False
>>> g.add_rule(Rule("F", ["z"], weight=0.3))
>>> g.is_pcfg()
True
Returns:
| Type | Description |
|---|---|
bool
|
|
Source code in SRToolkit/utils/grammar/grammar.py
validate
Validate that parse_tree is structurally valid, uses only productions
from this grammar, and satisfies every registered constraint.
By default the root of the tree may be any non-terminal in the grammar,
not just self.start. This lets you validate sub-trees (e.g. an
F-rooted fragment) in addition to full expressions. Pass
require_start=True to additionally enforce that the root symbol equals
self.start (useful when validating complete, top-level expressions).
The check replays the parse tree through a fresh
Derivation rooted at
parse_tree.root.symbol, walking nodes in leftmost order (mirroring the
derivation frontier). At each internal node it checks that:
- The applied rule is permitted at the current frontier position — meaning it exists in the grammar and every registered constraint accepts it.
- The number of children equals
len(rule.rhs). - Each
children[i].symbolequalsrule.rhs[i].
Terminal leaves are accepted when their symbol is not a non-terminal in this grammar. A non-terminal symbol appearing as a leaf (unexpanded) is rejected.
Examples:
>>> g = Grammar(start="E")
>>> r = Rule("E", ["x"])
>>> g.add_rule(r)
>>> leaf = ParseTreeNode("x", None)
>>> root = ParseTreeNode("E", r, [leaf])
>>> g.validate(ParseTree(root))
True
>>> foreign = Rule("E", ["y"])
>>> root2 = ParseTreeNode("E", foreign, [ParseTreeNode("y", None)])
>>> g.validate(ParseTree(root2))
False
>>> g.add_rule(Rule("F", ["x"]))
>>> r_f = Rule("F", ["x"])
>>> g.add_rule(r_f)
>>> f_root = ParseTreeNode("F", r_f, [ParseTreeNode("x", None)])
>>> g.validate(ParseTree(f_root))
True
>>> g.validate(ParseTree(f_root), require_start=True)
False
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
parse_tree
|
ParseTree
|
The ParseTree to validate. |
required |
require_start
|
bool
|
When |
False
|
Returns:
| Type | Description |
|---|---|
bool
|
|
bool
|
in this grammar, and all constraints would have permitted the derivation |
bool
|
(and the root matches |
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
Source code in SRToolkit/utils/grammar/grammar.py
403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 | |
start_derivation
Begin a new derivation.
Examples:
>>> g = Grammar(start="E")
>>> g.add_rule(Rule("E", ["x"]))
>>> d = g.start_derivation()
>>> d.complete
False
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
start
|
Optional[str]
|
Start non-terminal. Defaults to
|
None
|
Returns:
| Type | Description |
|---|---|
'Derivation'
|
A Derivation at the first expansion step. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the resolved start symbol is not a non-terminal in this grammar. |
Source code in SRToolkit/utils/grammar/grammar.py
generate_one
generate_one(start: Optional[str] = None, max_steps: int = 1000, max_retries: int = 10) -> Optional[list[str]]
Generate a single expression by sampling the grammar to completion.
Convenience wrapper around start_derivation and Derivation.generate.
Each attempt runs a fresh derivation from scratch. When a derivation
exceeds max_steps rule applications (e.g. because random sampling
kept choosing recursive rules), the attempt is discarded and a new one
starts. None is returned only when every attempt fails, signalling
that the caller should either relax the grammar, add more liberal
constraints, or increase max_steps / max_retries.
Examples:
>>> from SRToolkit.utils.grammar import Grammar, Rule
>>> g = Grammar(start="E")
>>> g.add_rule(Rule("E", ["x"]))
>>> g.generate_one()
['x']
>>> g.generate_one(max_steps=0, max_retries=1) is None
True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
start
|
Optional[str]
|
Start non-terminal. Defaults to |
None
|
max_steps
|
int
|
Maximum rule applications per attempt. A negative number means unlimited (no retry logic applies). |
1000
|
max_retries
|
int
|
Maximum number of fresh attempts before returning
|
10
|
Returns:
| Type | Description |
|---|---|
Optional[list[str]]
|
A list of terminal tokens in left-to-right order, or |
Optional[list[str]]
|
every attempt exceeded |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the resolved start symbol is not a non-terminal in this grammar. |
Source code in SRToolkit/utils/grammar/grammar.py
to_dict
Serialise this grammar to a JSON-safe dictionary.
Constraints are serialised via their own to_dict method. User-defined
constraint subclasses must implement to_dict/from_dict to survive the
round-trip; built-in constraints are fully supported.
Returns:
| Type | Description |
|---|---|
dict
|
Dictionary with keys |
Source code in SRToolkit/utils/grammar/grammar.py
from_dict
classmethod
Reconstruct a Grammar from a dictionary produced by to_dict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
d
|
dict
|
Dictionary with keys |
required |
Returns:
| Type | Description |
|---|---|
'Grammar'
|
A new Grammar with all rules and |
'Grammar'
|
constraints registered. |
Source code in SRToolkit/utils/grammar/grammar.py
from_symbol_library
classmethod
from_symbol_library(symbol_library: Optional[SymbolLibrary] = None, start: Optional[str] = None) -> 'Grammar'
Build a PCFG from a SymbolLibrary using a generic operator-precedence non-terminal hierarchy.
One non-terminal is created per unique OP precedence level present in
the library. Known precedences map to readable names; custom precedences
fall back to L_{p}:
OP_ADDITIVE→EOP_MULTIPLICATIVE→FOP_POWER→B- custom precedence p →
L_{p}
The chain runs from the lowest-precedence level (the grammar start) down
to T, K, R, and V. Only levels that have at least one
operator are generated; there are no empty pass-through non-terminals.
Heuristic weights are calibrated against empirical expression distributions (e.g. Wikipedia mathematical formulae): E recursion ~30 %, F recursion ~40 %, B recursion ~5 %, custom-level recursion ~40 %.
Examples:
>>> sl = SymbolLibrary.from_symbol_list(["+", "-", "*", "sin", "^2"], 2)
>>> g = Grammar.from_symbol_library(sl)
>>> "E" in g.nonterminals and "V" in g.nonterminals
True
>>> g.is_pcfg()
True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
symbol_library
|
Optional[SymbolLibrary]
|
Token vocabulary. Falls back to the active library
from the context manager when |
None
|
start
|
Optional[str]
|
Override the name of the lowest-precedence non-terminal (and
|
None
|
Returns:
| Type | Description |
|---|---|
'Grammar'
|
A Grammar with heuristic PCFG weights. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the symbol library contains no variables, constants, or literals, making it impossible to generate any terminal expression. |
Source code in SRToolkit/utils/grammar/grammar.py
623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 | |
from_grammar_string
classmethod
Construct a Grammar from a string in NLTK production-rule notation.
Each non-empty, non-comment line must have the form::
LHS -> RHS_1 | RHS_2 | ...
where each alternative is a space-separated sequence of symbols.
Terminals are enclosed in single quotes (e.g. '+'); unquoted
tokens are treated as non-terminals. An optional weight in square
brackets at the end of an alternative is parsed as a
float (e.g. E '+' F [0.4]); alternatives without a weight
default to 1.0.
The start symbol may be embedded as a comment header emitted by to_grammar_string::
# start: E
The explicit start parameter takes precedence over this comment.
When start is None, only the first # start: comment encountered
is used; subsequent ones are ignored along with all other comment lines.
Rule names are not preserved in NLTK notation and are set to None
on all returned Rule objects.
Examples:
>>> g = Grammar.from_grammar_string("E -> E '+' F | F\nF -> 'x'", start="E")
>>> sorted(g.nonterminals)
['E', 'F']
>>> g.rules_for("F")[0].rhs
['x']
>>> g.start
'E'
>>> g2 = Grammar.from_grammar_string(
... "# start: E\nE -> E '+' F [0.4] | F [0.6]\nF -> 'x' [1.0]"
... )
>>> g2.start
'E'
>>> g2.is_pcfg()
True
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Grammar specification in NLTK notation, optionally with a
|
required |
start
|
Optional[str]
|
Start non-terminal stored on the returned grammar. Takes
precedence over a |
None
|
Returns:
| Type | Description |
|---|---|
'Grammar'
|
A new Grammar populated with |
'Grammar'
|
the parsed rules. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If a content line contains no |
ValueError
|
If a weight token cannot be converted to |
ValueError
|
If |
ValueError
|
If the start symbol cannot be determined from either
the parameter or the |
Source code in SRToolkit/utils/grammar/grammar.py
796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 | |
to_grammar_string
Serialise this grammar to a string in NLTK production-rule notation.
All rules sharing the same left-hand side are written on a single
line, separated by |. Symbols that are non-terminals (i.e.
appear as the lhs of at least one rule) are written unquoted;
all other symbols are enclosed in single quotes. When
is_pcfg returns True,
the probability of each alternative (its weight divided by the sum of
weights for that non-terminal) is appended in square brackets.
Rule names and registered constraints are not included in the output.
When self.start is set, a # start: <symbol> comment is
prepended so that
from_grammar_string
can reconstruct the grammar without requiring an explicit start
argument.
Examples:
>>> g = Grammar([
... Rule("E", ["E", "+", "F"]),
... Rule("E", ["F"]),
... Rule("F", ["x"]),
... ], start="E")
>>> print(g.to_grammar_string())
# start: E
E -> E '+' F | F
F -> 'x'
>>> gp = Grammar([
... Rule("E", ["x"], weight=1.0),
... Rule("E", ["y"], weight=3.0),
... ])
>>> print(gp.to_grammar_string())
E -> 'x' [0.25] | 'y' [0.75]
Returns:
| Type | Description |
|---|---|
str
|
Multi-line string in NLTK production-rule notation, optionally |
str
|
preceded by a |