How to make article spinner regex?
Let's sa开发者_C百科y I have teh following:
{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}
And I want that to turn into any of the following:
Hello world
Goodbye noobs
Hi earth
farewell n3wbz
// etc.
Paying attention to the way the "spinning" syntax is nested. It could be nested a billion layers deep for all we know.
I can do this easy, except once they're nested like the above example my regex messes up and the results are not correct.
Could someone show an example in either a .NET language or Python please?
A simple way with re.subn, which can also accept a function instead of a replacement string:
import re
from random import randint
def select(m):
choices = m.group(1).split('|')
return choices[randint(0, len(choices)-1)]
def spinner(s):
r = re.compile('{([^{}]*)}')
while True:
s, n = r.subn(select, s)
if n == 0: break
return s.strip()
It simply replaces all the deepest choices it meets, then iterates until no choice remains. subn
returns a tuple with the result and how many replacements were made, which is convenient to detect the end of the processing.
My version of select()
can be replaced by Bobince's that uses random.choice()
and is more elegant if you just want to stick to a random selector. If you want to build a choice tree, you could extend the above function, but you will need global variables to keep track of where you are, so moving the functions into a class would make sense. This is just a hint, I won't develop that idea since it was not really the orginial question.
Note finally that you should use r.subn(select, s, re.U)
if you need unicode strings (s = u"{...}"
)
Example:
>>> s = "{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}"
>>> print spinner(s)
'farewell n3wbz'
Edit: Replaced sub
by subn
to avoid infinite loop (thanks to Bobince to point it out) and make it more efficient, and replaced {([^{}]+)}
by {([^{}]*)}
to extract empty curly brackets as well. That should make it more robust to ill-formatted patterns.
For people who like to put as much as possible on one line (which I personally wouldn't encourage):
def spin(s):
while True:
s, n = re.subn('{([^{}]*)}',
lambda m: random.choice(m.group(1).split("|")),
s)
if n == 0: break
return s.strip()
Should be fairly simple, just disallow a brace set from including another, then repeatedly call doing replacements from the inner matches outwards:
def replacebrace(match):
return random.choice(match.group(1).split('|'))
def randomizebraces(s):
while True:
s1= re.sub(r'\{([^{}]*)\}', replacebrace, s)
if s1==s:
return s
s= s1
>>> randomizebraces('{{Hello|Hi|Hey} {world|earth}|{Goodbye|farewell} {noobs|n3wbz|n00blets}}')
'Hey world'
>>> randomizebraces('{{Hello|Hi|Hey} {world|earth}|{Goodbye|farewell} {noobs|n3wbz|n00blets}}')
'Goodbye noobs'
This regex inverter uses pyparsing to generate matching strings (with some restrictions - unlimited repetition symbols like + and * are not allowed). If you replace {}'s with ()'s to make your original string into a regex, the inverter generates this list:
Helloworld
Helloearth
Hiworld
Hiearth
Heyworld
Heyearth
Goodbyenoobs
Goodbyen3wbz
Goodbyen00blets
farewellnoobs
farewelln3wbz
farewelln00blets
(I know the spaces are collapsed out, but maybe this code will give you some ideas on how to attack this problem.)
I would use re.finditer and build a basic parse tree to determine the nesting level. To do it, I would use the span attribute of the regex match object:
text = '{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}'
import re
re_bracks = re.compile(r'{.+?}')
# subclass list for a basic tree datatype
class bracks(list):
def __init__(self, m):
self.m = m
# icky procedure to create the parse tree
# I hate these but don't know how else to do it
parse_tree = []
for m in re_bracks.finditer(text):
if not this_element:
# this first match
parse_tree.extend(element(m))
else:
# ... and all the rest
this_element = bracks(m)
this_start, this_end = m.span()
# if this match is nested in the old one ...
if this_start < previous_start and this_end > previous_end:
# nest it inside the previous one
previous_element.extend(this_element)
else:
# otherwise make it a child of the parse_tree
parse_tree.extend(element(m))
previous_element = this_element
previous_start, previous_end = this_start, this_end
This would give you the nesting depth of the bracketed expressions. Add some similar logic for the pipes and you'd be well on your way to solving the problem.
I'd recommend taking a look at the dada engine for inspiration.
I've done an implementation of something inspired by this in scheme and leveraged scheme's AST to express my needs.
Specifically, I'd recommend strongly against trying to use a regex as a parser in general.
精彩评论