Expanding a tree-like data structure
I am attempting to use Python to alter some text strings using the re module (i.e., re.sub). However, I think that my question is applicable to other languages that have regex implementations.
I have a number of strings that represent tree-like data structures. They look something like this:
(A,B)-C-D
A-B-(C,D)
A-(B,C开发者_开发百科,D-(E,F,G,H,I))
Each letter represents a branch or edge. Letters in parentheses represent branches coming into or out of another branch.
Everywhere that there is a 'plain' tuple of values (a tuple with only comma separated single letter), I would like to take the prefix (X-) or suffix (-X) of that tuple and apply it to each of the values in the tuple.
Under this transformation, the above strings would become
(A-C,B-C)-D
A-(B-C,B-D)
A-(B,C,(D-E,D-F,D-G,D-H,D-I))
Applying the methodology repeatedly would ultimately yield
(A-C-D,B-C-D)
(A-B-C,A-B-D)
(A-B,A-C,A-D-E,A-D-F,A-D-G,A-D-H,A-D-I)
The strings in these tuples then represent the paths through the tree starting at a root and ending at a leaf.
Any help accomplishing this task using regular expressions (or other approaches) would be greatly appreciated.
You could not do this with regular expressions, because you have to deal with nested structures. Instead you could use pyparsing's nestedExpr
The problem you are describing is one of enumerating paths within a graph.
You describe three graphs
A B
\ /
C
|
D
A
|
B
/ \
C D
A
/ | \
B C D
// | \\
E F G H I
and for each you want to enumerate paths. This involves distributing a value across an arbitrarily nested structure. If this could be done with regexes, and I am not certain that it can, it would have to be done, I believe, in several passes.
My sense of your problem though, is that it is best solved by parsing your string into a graph structure and then enumerating the paths. If you do not want to physically build the graph, you can probably generate strings within user-supplied actions to a parser generator.
A regex-based solution would have to know how to handle both
(A,B)-C
and
(A,B,C,D,E,F,G,H)-I
You can match these strings with
\([A-Z](,[A-Z])*\)-[A-Z]
but how would you "distribute" over all submatches without some logic? Since you need this logic anyway, you might as well perform it on a real graph structure. You can also do this on a string itself, but it would be better to do this under the auspices of a parser generator which can handle context-free or context-sensitive structures.
After posting my comment referring to pyparsing's invRegex example, I looked a little closer at your input, and it looked like you could interpret this as an infix notation, with ',' and '-' as binary operators. Pyparsing has a helper method awkwardly named operatorPrecedence
that parses expressions according to a precedence of operators, with grouping in parentheses. (This has a little more smarts to it than just using the nestedExpr
helper method, which matches expressions nested within grouping symbols.) So here is a getting-started version of a parser using operatorPrecedence
:
data = """\
(A,B)-C-D
A-B-(C,D)
A-(B,C,D-(E,F,G,H,I))""".splitlines()
from pyparsing import alphas, oneOf, operatorPrecedence, opAssoc
node = oneOf(list(alphas))
graphExpr = operatorPrecedence(node,
[
('-', 2, opAssoc.LEFT),
(',', 2, opAssoc.LEFT),
])
for d in data:
print graphExpr.parseString(d).asList()
Pyparsing actually returns a complex structure of type ParseResults which supports access to the parsed tokens as elements in a list, items in a dict, or attributes in an object. By calling asList
, we just get the elements in simple list form.
The output of the above shows that we look to be on the right track:
[[['A', ',', 'B'], '-', 'C', '-', 'D']]
[['A', '-', 'B', '-', ['C', ',', 'D']]]
[['A', '-', ['B', ',', 'C', ',', ['D', '-', ['E', ',', 'F', ',', 'G', ',', 'H', ',', 'I']]]]]
Pyparsing also allows you to attach callbacks or parse actions
to individual expressions, to be called at parse time. For instance, this parse action does parse-time conversion to integer:
def toInt(tokens):
return int(tokens[0])
integer = Word(nums).setParseAction(toInt)
When the value is returned in the ParseResults, it has already been converted to an integer.
Classes can also be specified as parse actions, and the ParseResults object is passed to the class's __init__
method and the resulting object returned. We can specify parse actions within operatorPrecedence by adding the parse action as a 4th element in each operator's descriptor tuple.
Here is base class for binary operators:
class BinOp(object):
def __init__(self, tokens):
self.tokens = tokens
def __str__(self):
return self.__class__.__name__ + str(self.tokens[0][::2])
__repr__ = __str__
From this base class, we can derive 2 subclasses, one for each operator -
and ,
:
class Path(BinOp):
pass
class Branch(BinOp):
pass
And add them to the operator definition tuples in operatorPrecedence:
node = oneOf(list(alphas))
graphExpr = operatorPrecedence(node,
[
('-', 2, opAssoc.LEFT, Path),
(',', 2, opAssoc.LEFT, Branch),
])
for d in data:
print graphExpr.parseString(d).asList()
This gives us a nested structure of objects for each input string:
[Path[Branch['A', 'B'], 'C', 'D']]
[Path['A', 'B', Branch['C', 'D']]]
[Path['A', Branch['B', 'C', Path['D', Branch['E', 'F', 'G', 'H', 'I']]]]]
The generation of paths from this structure is left as an exercise for the OP. (The pyparsing regex inverter does this using a tangle of generators - hopefully some simple recursion will be sufficient.)
精彩评论