Parsing a string of given format to build a binary decision tree
I'm trying to parse a string with parentheses of the form ((Question)(Left_Node)(right_node))
.
The question for example will be of the form "if segment size < 1.5,then choose left node, else right". The question can b开发者_C百科e a dictionary with a key and a value. The left and right node represent a complete left or right half tree, which will be traversed recursively until the leaf node is reached. In this manner we will build a binary decision tree.
This kind of syntax is really the target for pyparsing. The basic format is simple enough, and in pyparsing, it looks like:
decision = '(' + '('+question+')' + '('+action+')' + '('+action+')' + ')'
But once you start to add arithmetic expressions, and boolean operations and support for 'and' and 'or' operators, things get complicated. Also, this is a recursive grammar, since an action can itself be a nested decision.
Pyparsing has built-in support that simplifies definition of arithmetic and boolean expressions, including precedence of operations and grouping in parentheses, plus recursive expressions. Here is the pyparsing grammar in various pieces. First here are some of the basic parsing elements:
LPAR,RPAR = map(Suppress,"()")
# old pyparsing way
number = Regex(r"[+-]?\d+(\.\d*)?").setParseAction(lambda tokens:float(tokens[0]))
# new pyparsing way - parses many numeric formats, and uses a parse action
# to convert to float
number = pyparsing_common.fnumber()
varname = Word(alphas+'_', alphanums+'_')
The parse action attached to the number expression will automatically convert the parsed number to a float value at parse time. The Word class takes two strings: a string containing all valid leading characters, and a string of all valid body characters. This varname definition supports variable names similar to Python identifiers.
Pyparsing has the infixNotation
method that takes an expression for the basic operand definition, and a list of tuples to define each level of operators: an expression for the operators, an integer whether the operator is unary, binary, or ternary, and whether left- or right-associative. infixNotation
takes care of the recursive definition of arithmetic expressions nested in parentheses. This expression defines basic 4-function math:
operand = number | varname
arithExpr = infixNotation(operand,
[
(oneOf("+ -"), 1, opAssoc.RIGHT),
(oneOf("* /"), 2, opAssoc.LEFT),
(oneOf("+ -"), 2, opAssoc.LEFT),
])
Now here is the definition of a boolean condition (which we'll eventually use to define the decision question):
TRUE = CaselessKeyword("TRUE") | Keyword("T")
FALSE = CaselessKeyword("FALSE") | Keyword("F")
comparisonOp = oneOf("< > <= >= = !=")
boolLiteral = TRUE | FALSE
arithComparison = arithExpr + comparisonOp + arithExpr
boolOperand = boolLiteral | arithComparison
booleanExpr = infixNotation(boolOperand,
[
('not', 1, opAssoc.RIGHT),
('and', 2, opAssoc.LEFT),
('or', 2, opAssoc.LEFT),
])
Your definition of actions was a little sketchy, so I made up some possible statements: an assignment statement, a print
statement, and since this is a speech application, a say
statement which is very similar to print
.
rhs = booleanExpr | arithExpr
assignment = varname("assign_var") + '=' + Group(rhs)("assign_value")
print_cmd = Keyword("print") + Group(delimitedList(rhs | quotedString))
say_cmd = Keyword("say") + Group(delimitedList(rhs | quotedString))
cmd = print_cmd | say_cmd | assignment
In addition to the expression definitions, you'll notice that some expressions are followed by a quoted string, as if the expression is a function being called with that string. In fact, this "call" actually returns a copy of the expression, and the matched tokens are tagged with that name. These results names are very helpful at post-parsing time at picking out the individual matching elements (similar to named groups in regular expressions).
Finally, to put these pieces together into your decision expression, here are the question and action expressions:
IF = CaselessKeyword("IF")
question = LPAR + IF + Group(booleanExpr)("condition") + RPAR
action = LPAR + cmd + RPAR | Group(decision)
Note that the action definition can include a decision, but action is used to define decision. To break this chicken-and-egg dependency, we preface this section with defining decision
, but with a placeholder expression using the pyparsing Forward
class:
decision = Forward()
Then after question
and action
are defined, we use the '<<=' operator to "insert" the actual expression definition in the existing decision
variable:
decision <<= (LPAR
+ question
+ Group(action)("ifAction")
+ Optional(Group(action)("elseAction"))
+ RPAR)
Again, I took liberties with your defined format, thinking that an optional else-clause might be useful. If you don't want this to be optional, just remove the Optional
wrapper around the Group(action)("elseAction")
.
That defines the grammar, now here are some test cases. Using dump() on the results returned by parseString
is a nice way to print out the tokens, and any names attached to them.
tests = """\
((if TRUE)(a=99))
((if TRUE)(a = (a=99)))
((if a<100)(a = a + 1))
((if a<100)(a = a + 1)(a = a-1))
(
(if a<100)
(print a, "is too small")
( (if a>100) (print a,"is too big") (print a, "is just right") )
)
(
(if a<100)
(say a, "is too small!")
( (if a>100) (say a,"is too big!") (say a, "is just right!") )
)
"""
for d in decision.searchString(tests):
print d.dump()
print
Here's the output:
['IF', ['TRUE'], ['a', '=', [99.0]]]
- condition: ['TRUE']
- ifAction: ['a', '=', [99.0]]
- assign_value: [99.0]
- assign_var: a
['IF', ['TRUE'], ['a', '=', ['a', '=', 99.0]]]
- condition: ['TRUE']
- ifAction: ['a', '=', ['a', '=', 99.0]]
- assign_value: ['a', '=', 99.0]
- assign_var: a
['IF', ['a', '<', 100.0], ['a', '=', [['a', '+', 1.0]]]]
- condition: ['a', '<', 100.0]
- ifAction: ['a', '=', [['a', '+', 1.0]]]
- assign_value: [['a', '+', 1.0]]
- assign_var: a
['IF', ['a', '<', 100.0], ['a', '=', [['a', '+', 1.0]]],
['a', '=', [['a', '-', 1.0]]]]
- condition: ['a', '<', 100.0]
- elseAction: ['a', '=', [['a', '-', 1.0]]]
- assign_value: [['a', '-', 1.0]]
- assign_var: a
- ifAction: ['a', '=', [['a', '+', 1.0]]]
- assign_value: [['a', '+', 1.0]]
- assign_var: a
['IF', ['a', '<', 100.0], ['print', ['a', '"is too small"']],
[['IF', ['a', '>', 100.0], ['print', ['a', '"is too big"']],
['print', ['a', '"is just right"']]]]]
- condition: ['a', '<', 100.0]
- elseAction: [['IF', ['a', '>', 100.0], ['print', ['a', '"is too big"']],
['print', ['a', '"is just right"']]]]
- ifAction: ['print', ['a', '"is too small"']]
['IF', ['a', '<', 100.0], ['say', ['a', '"is too small!"']],
[['IF', ['a', '>', 100.0], ['say', ['a', '"is too big!"']],
['say', ['a', '"is just right!"']]]]]
- condition: ['a', '<', 100.0]
- elseAction: [['IF', ['a', '>', 100.0], ['say', ['a', '"is too big!"']],
['say', ['a', '"is just right!"']]]]
- ifAction: ['say', ['a', '"is too small!"']]
Here is a link to the full parsing program - http://pastebin.com/DnaNrx7j .
This is just the first stage, parsing the input. The next step is actually evaluating the expression by processing the returned tokens. The pyparsing example SimpleBool.py (https://github.com/pyparsing/pyparsing/blob/master/examples/simpleBool.py) includes an example of attaching parse actions to convert parsed tokens into callable class instances that simplify evaluating the results.
Hope this helps.
If you can decide on the details of the input format, then use raw python source code. For example you can store your tree in a python dictionary of nodes:
tree = {
"root": ("a", "b")
"a": ("c", "d"),
"b": ("e", "f"),
"c": (None, None), #No children, a leaf.
"d": (None, None),
"e": (None, None),
"f": (None, None),
}
Now you can parse this tree simply with the python parser.
from tree_data import tree #We're done parsing!
root_node = tree["root"]
left_child = tree[root_node[0]]
Well, if the input format is already specified, then we cannot help you if you don't share the details of that format.
精彩评论