How to make article spinner regex?

Posted: 2022-12-12 22:58

Let's say I have the following:

{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}

And I want that to turn into any of the following:

Hello world 
Goodbye noobs 
Hi earth
farewell n3wbz 
// etc.

Pay attention to the way the "spinning" syntax is nested: it could be nested arbitrarily deep for all we know.

I can do this easily for flat patterns, but once they are nested as in the example above, my regex breaks and the results are incorrect.

Could someone show an example in either a .NET language or Python please?

A simple way with re.subn, which can also accept a function instead of a replacement string:

import re
from random import randint

def select(m):
    choices = m.group(1).split('|')
    return choices[randint(0, len(choices)-1)]

def spinner(s):
    r = re.compile('{([^{}]*)}')
    while True:
        s, n = r.subn(select, s)
        if n == 0: break
    return s.strip()

It simply replaces all the deepest choices it meets, then iterates until no choice remains. subn returns a tuple with the result and how many replacements were made, which is convenient to detect the end of the processing.
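To make the innermost-first passes visible, here is a small sketch that records the string after every pass. The names spin_trace and first are made up for this illustration, and the selector always takes the first option so that the trace is deterministic.

```python
import re

BRACES = re.compile(r'\{([^{}]*)\}')  # a brace pair with no braces inside

def spin_trace(s, choose):
    """Return the input followed by the string after each replacement pass."""
    passes = [s]
    while True:
        s, n = BRACES.subn(lambda m: choose(m.group(1).split('|')), s)
        if n == 0:          # nothing left to replace: we are done
            return passes
        passes.append(s)

def first(options):
    """Deterministic stand-in for a random selector."""
    return options[0]

for step in spin_trace('{{Hello|Hi} {world|earth}|{Goodbye|farewell} noobs}', first):
    print(step)
# {{Hello|Hi} {world|earth}|{Goodbye|farewell} noobs}
# {Hello world|Goodbye noobs}
# Hello world
```

The first pass resolves the three inner groups, the second pass resolves the outer group that has now become flat, and the third pass finds nothing and stops.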

My version of select() can be replaced by Bobince's, which uses random.choice() and is more elegant if you just want a random selector. If you want to build a choice tree, you could extend the function above, but you would need global variables to keep track of where you are, so moving the functions into a class would make sense. This is only a hint; I won't develop the idea since it was not really the original question.
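As a rough idea of what the class-based variant might look like, the sketch below keeps the random generator and a log of the choices on the instance instead of in globals. The class name Spinner, the seed parameter and the history attribute are assumptions made for this example, and the log is a flat list rather than a real choice tree.

```python
import random
import re

class Spinner:
    PATTERN = re.compile(r'\{([^{}]*)\}')

    def __init__(self, seed=None):
        self.rng = random.Random(seed)  # per-instance generator, seedable
        self.history = []               # (options, chosen) pairs, innermost first

    def _select(self, m):
        options = m.group(1).split('|')
        chosen = self.rng.choice(options)
        self.history.append((options, chosen))
        return chosen

    def spin(self, s):
        self.history.clear()
        while True:
            s, n = self.PATTERN.subn(self._select, s)
            if n == 0:
                return s.strip()

text = '{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}'
spinner = Spinner(seed=1)
print(spinner.spin(text))
print(len(spinner.history))  # one entry per resolved group
```

Seeding makes a run reproducible, and history could serve as the starting point for rebuilding which branch was taken at each level.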

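Note finally that if you need a flag such as re.U (for Python 2 unicode strings, s = u"{...}"), you should pass it to re.compile, e.g. re.compile('{([^{}]*)}', re.U). The third positional argument of r.subn is count, not flags, so r.subn(select, s, re.U) would only cap the number of replacements per pass. In Python 3, str patterns are Unicode-aware by default.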

Example:

>>> s = "{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}"
>>> print(spinner(s))
farewell n3wbz

Edit: Replaced sub by subn to avoid an infinite loop (thanks to Bobince for pointing it out) and to make it more efficient, and replaced {([^{}]+)} by {([^{}]*)} to extract empty curly brackets as well. That should make it more robust to ill-formatted patterns.
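The difference between the two patterns shows up as soon as an empty pair of braces is nested inside another group. The helper resolve below is hypothetical and always takes the first option, purely to keep the comparison deterministic.

```python
import re

def resolve(pattern, s):
    """Innermost-first replacement that always picks the first option."""
    regex = re.compile(pattern)
    while True:
        s, n = regex.subn(lambda m: m.group(1).split('|')[0], s)
        if n == 0:
            return s

# With +, the empty {} never matches, so the outer group never becomes flat.
print(resolve(r'\{([^{}]+)\}', '{x|{}} done'))  # {x|{}} done
# With *, {} is replaced by an empty string and the outer group resolves.
print(resolve(r'\{([^{}]*)\}', '{x|{}} done'))  # x done
```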

For people who like to put as much as possible on one line (which I personally wouldn't encourage):

import random
import re

def spin(s):
    while True:
        # replace every innermost {a|b|c} group with one random option
        s, n = re.subn('{([^{}]*)}',
                       lambda m: random.choice(m.group(1).split("|")),
                       s)
        if n == 0: break
    return s.strip()

This should be fairly simple: disallow a brace set from containing another, then repeatedly run the replacement so that it works from the innermost matches outwards:

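import random
import re

def replacebrace(match):
    return random.choice(match.group(1).split('|'))

def randomizebraces(s):
    while True:
        # only brace pairs without nested braces can match
        s1 = re.sub(r'\{([^{}]*)\}', replacebrace, s)
        if s1 == s:
            # nothing changed, so no groups are left
            return s
        s = s1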

>>> randomizebraces('{{Hello|Hi|Hey} {world|earth}|{Goodbye|farewell} {noobs|n3wbz|n00blets}}')
'Hey world'
>>> randomizebraces('{{Hello|Hi|Hey} {world|earth}|{Goodbye|farewell} {noobs|n3wbz|n00blets}}')
'Goodbye noobs'

This regex inverter uses pyparsing to generate matching strings (with some restrictions - unlimited repetition symbols like + and * are not allowed). If you replace {}'s with ()'s to make your original string into a regex, the inverter generates this list:

Helloworld
Helloearth
Hiworld
Hiearth
Heyworld
Heyearth
Goodbyenoobs
Goodbyen3wbz
Goodbyen00blets
farewellnoobs
farewelln3wbz
farewelln00blets

(I know the spaces are collapsed out, but maybe this code will give you some ideas on how to attack this problem.)
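If the goal is the full list of variants with the spaces kept, one possible sketch stays with the original {} syntax and expands the first innermost group in every possible way, recursively. The name all_spins is made up for this example; the same result can be reached along several expansion paths, so the output is deduplicated with a set, and the amount of work grows quickly with the number of groups.

```python
import re

INNERMOST = re.compile(r'\{([^{}]*)\}')

def all_spins(s):
    """Yield every fully expanded variant of s (duplicates are possible)."""
    m = INNERMOST.search(s)
    if m is None:
        # no groups left: this is a finished variant
        yield s.strip()
        return
    for option in m.group(1).split('|'):
        # substitute one option for the group and keep expanding
        yield from all_spins(s[:m.start()] + option + s[m.end():])

text = '{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}'
for variant in sorted(set(all_spins(text))):
    print(variant)
```

For the string from the question this prints the 12 distinct variants, from "Goodbye n00blets" to "farewell noobs", each with its internal space intact.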

I would use re.finditer and build a basic parse tree to determine the nesting level. To do it, I would match each brace on its own, use the offsets of the regex match object to record where every group starts and ends, and keep a stack of the groups that are still open. (A pattern such as {.+?} cannot work here, because the non-overlapping matches returned by finditer never nest.)

text = '{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}'

import re
# match one brace at a time
re_brace = re.compile(r'[{}]')

# subclass list for a basic tree datatype
class bracks(list):
    def __init__(self, start):
        super().__init__()
        self.start = start   # offset of the opening brace
        self.end = None      # offset just past the closing brace
        self.depth = 0       # nesting level, 1 for top-level groups

# icky procedure to create the parse tree
# I hate these but don't know how else to do it
parse_tree = bracks(0)       # root node holding the top-level groups
stack = [parse_tree]         # groups that are currently open
for m in re_brace.finditer(text):
    if m.group() == '{':
        # opening brace: nest a new group inside the innermost open one
        this_element = bracks(m.start())
        this_element.depth = len(stack)
        stack[-1].append(this_element)
        stack.append(this_element)
    elif len(stack) > 1:
        # closing brace: finish the innermost open group
        # (unbalanced closing braces are ignored)
        stack.pop().end = m.end()

This would give you the nesting depth of the bracketed expressions. Add some similar logic for the pipes and you'd be well on your way to solving the problem.
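The "similar logic for the pipes" could be sketched as a depth counter that only treats a pipe as a separator when it sits outside any nested braces. The helper name split_top_level is hypothetical, and it expects the text found between one pair of braces.

```python
def split_top_level(body):
    """Split body on '|' characters that are not inside nested braces."""
    parts, current, depth = [], [], 0
    for ch in body:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
        if ch == '|' and depth == 0:
            # a separator at this level: close the current alternative
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts

print(split_top_level('{Hello|Hi} {world|earth}|{Goodbye|farewell} noobs'))
# ['{Hello|Hi} {world|earth}', '{Goodbye|farewell} noobs']
```

Each returned alternative still contains its own groups, so the same function can be applied again one level down.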

I'd recommend taking a look at the dada engine for inspiration.

I've implemented something inspired by it in Scheme, leveraging Scheme's AST to express what I needed.

Specifically, I'd recommend strongly against trying to use a regex as a parser in general.
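For comparison, a regex-free approach might look like the small recursive-descent sketch below. The function names and the rng parameter are assumptions made for this example; rng can be the random module, a random.Random instance, or anything else with a choice method. Braces and pipes outside any group are kept as literal text, and very deep nesting would eventually hit Python's recursion limit.

```python
import random

def spin(text, rng=random):
    """Pick one random variant of text written in {a|b|{c|d}} syntax."""
    result, _ = _parse_sequence(text, 0, rng, top=True)
    return result.strip()

def _parse_sequence(text, pos, rng, top=False):
    """Read literal text and nested groups until '|' or '}' (or the end)."""
    pieces = []
    while pos < len(text):
        ch = text[pos]
        if ch == '{':
            piece, pos = _parse_group(text, pos + 1, rng)
            pieces.append(piece)
        elif ch in '|}' and not top:
            break                      # the enclosing group handles this
        else:
            pieces.append(ch)
            pos += 1
    return ''.join(pieces), pos

def _parse_group(text, pos, rng):
    """Read the alternatives of a group whose '{' was just consumed."""
    alternatives = []
    while True:
        alternative, pos = _parse_sequence(text, pos, rng)
        alternatives.append(alternative)
        if pos < len(text) and text[pos] == '|':
            pos += 1                   # skip the separator, read the next one
            continue
        # at the closing '}' (or at the end of an unbalanced input)
        return rng.choice(alternatives), pos + 1

print(spin('{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}'))
```

This version makes a single pass over the input. It spins every alternative before choosing one, which is simple but does more work than strictly necessary.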
