Multi-line pattern matching in Python

2022-12-28 20:07

A periodic computer-generated message (simplified):

Hello user123,

- (604)7080900
- 152
- minutes

Regards

Using Python, how can I extract "(604)7080900", "152", and "minutes" (i.e. any text following a leading "- ") from between the two empty lines? The empty lines are the \n\n after "Hello user123," and the \n\n before "Regards". Even better if the resulting strings are stored in a list. Thanks!

Edit: the number of lines between the two blank lines is not fixed.

2nd edit:

e.g.

hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world

x1, x2 and x3 are good, because the whole run of lines is enclosed by two empty lines. x4 is good for the same reason. x6 is not good because no blank line follows it, and x7 is not good because no blank line precedes it. x2 is good (unlike x6 and x7) because the line before it is a good line and the line after it is also a good line.

These conditions may not have been clear when I posted the question:

a continuous run of good lines between 2 empty lines

good line must have leading "- "
good line must follow an empty line or follow another good line
good line must be followed by an empty line or followed by another good line

thanks

>>> import re
>>>
>>> x="""Hello user123,
...
... - (604)7080900
... - 152
... - minutes
...
... Regards
... """
>>>
>>> re.findall(r"\n+\n-\s*(.*)\n-\s*(.*)\n-\s*(minutes)\s*\n\n+",x)
[('(604)7080900', '152', 'minutes')]
>>>
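
The pattern above is tied to exactly three lines and to the literal word "minutes". Since the question's edit says the number of lines is not fixed, here is a minimal sketch of a more general variant, not the answerer's method: split the text on blank lines and keep only the paragraphs in which every line starts with "- ". The function name dash_blocks is hypothetical. The first and last paragraphs are skipped on the assumption that they lack a blank line on one side.

```python
import re


def dash_blocks(text):
    """Return a list of blocks; each block is a list of values without '- '."""
    # Split on one or more blank (empty or whitespace-only) lines.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", text)
    blocks = []
    # Skip the first and last paragraph: each has a blank line on one side only.
    for para in paragraphs[1:-1]:
        lines = para.splitlines()
        if lines and all(line.startswith("- ") for line in lines):
            # Drop the '- ' prefix and any trailing whitespace.
            blocks.append([line[2:].rstrip() for line in lines])
    return blocks


message = "Hello user123,\n\n- (604)7080900\n- 152\n- minutes\n\nRegards\n"
print(dash_blocks(message))  # [['(604)7080900', '152', 'minutes']]
```

Splitting into paragraphs first keeps the regular expression short, and the "good line" test becomes a plain startswith check.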

The simplest approach is to walk over the lines (assuming you have a list of lines, or a file, or have split the string into a list of lines) until you see a line that's just '\n'. From there, check that each line starts with '- ' (using the startswith string method), slice the prefix off and store the result, until you find another empty line. For example:

# If you have a single string s, split it into lines.
L = s.splitlines()
# If you (now) have a list of lines, grab an iterator so we can continue
# iteration where it left off.
it = iter(L)
# Alternatively, if you have a file, just use that directly:
# it = open(....)

# The extracted values.
data = []
# Find the first empty line:
for line in it:
    # Treat lines of just whitespace as empty lines too. If you don't want
    # that, do 'if line == ""'.
    if not line.strip():
        break
# Now starts data.
for line in it:
    if not line.rstrip():
        # End of data.
        break
    if line.startswith('- '):
        # Slice off the '- ' prefix and trailing whitespace.
        data.append(line[2:].rstrip())
    else:
        # Malformed data?
        raise ValueError("malformed line %r" % (line,))
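
The same two loops also work on a file object, because iterating over a file yields lines that still end in '\n', and strip() and rstrip() take care of that. A small sketch, using io.StringIO as a stand-in for an open file (an assumption made so the example runs without a file on disk):

```python
import io

# io.StringIO behaves like a text file opened for reading.
it = io.StringIO("Hello user123,\n\n- (604)7080900\n- 152\n- minutes\n\nRegards\n")

data = []
# Skip everything up to and including the first blank line.
for line in it:
    if not line.strip():
        break
# Collect '- ' lines until the next blank line.
for line in it:
    if not line.strip():
        break
    if not line.startswith("- "):
        raise ValueError("malformed line %r" % (line,))
    # line[2:] drops the prefix; rstrip() drops the trailing newline.
    data.append(line[2:].rstrip())

print(data)  # ['(604)7080900', '152', 'minutes']
```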

Edited: Since you elaborated on what you want to do, here's an updated version of the loops. It no longer loops twice. Instead it collects lines until it encounters a 'bad' line, and either saves or discards the collected lines when it reaches a block separator. It doesn't need an explicit iterator, because it never restarts iteration, so you can just pass it a list (or any iterable) of lines:

def getblocks(L):
    # The list of good blocks (as lists of lines.) You can also make this
    # a flat list if you prefer.
    data = []
    # The list of good lines encountered in the current block
    # (but the block may still become bad.)
    block = []
    # Whether the current block is bad.
    bad = 1
    for line in L:
        # Not in a 'good' block, and encountering the block separator.
        if bad and not line.rstrip():
            bad = 0
            block = []
            continue
        # In a 'good' block and encountering the block separator.
        if not bad and not line.rstrip():
            # Save 'good' data, skipping empty blocks (from consecutive
            # blank lines). Or, if you want a flat list of lines,
            # use 'extend' instead of 'append' (also below.)
            if block:
                data.append(block)
            block = []
            continue
        if not bad and line.startswith('- '):
            # A good line in a 'good' (not 'bad' yet) block; save the line,
            # minus the '- ' prefix and trailing whitespace.
            block.append(line[2:].rstrip())
            continue
        else:
            # A 'bad' line, invalidating the current block.
            bad = 1
    # Don't forget to handle the last block, if it's good
    # (and if you want to handle the last block.)
    if not bad and block:
        data.append(block)
    return data

And here it is in action:

>>> L = """hello
...
... - x1
... - x2
... - x3
...
... - x4
...
... - x6
... morning
... - x7
...
... world""".splitlines()
>>> print(getblocks(L))
[['x1', 'x2', 'x3'], ['x4']]

>>> s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards
"""
>>> import re
>>> re.findall(r'^- (.*)', s, re.M)
['(604)7080900', '152', 'minutes']
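
Note that this pattern captures every line that begins with "- ", whether or not blank lines surround the run. A quick check against the example from the question's second edit shows the difference:

```python
import re

text = """hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world"""

# re.M makes ^ match at the start of every line, so x6 and x7 are captured
# too, although the second edit of the question wants them excluded.
matches = re.findall(r"^- (.*)", text, re.M)
print(matches)  # ['x1', 'x2', 'x3', 'x4', 'x6', 'x7']
```

So this one-liner fits the original, simplified message, but not the stricter conditions added later.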

s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards  

Hello user124,

- (604)8576576
- 345
- minutes
- seconds
- bla

Regards"""

Do this:

result = []
for data in s.split('Regards'): 
    result.append([v.strip() for v in data.split('-')[1:]])
del result[-1] # remove empty list at end

And you get this:

>>> result
[['(604)7080900', '152', 'minutes'],
['(604)8576576', '345', 'minutes', 'seconds', 'bla']]
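
One caveat, offered as an observation rather than as part of the answer: splitting on '-' also cuts values that contain a hyphen themselves, such as a phone number written as 604-708-0900. A hedged sketch of a line-based variant follows. The names split_messages and terminator are hypothetical, and it still assumes every message ends with the word "Regards".

```python
def split_messages(text, terminator="Regards"):
    """Return one list of values per message, keeping hyphens inside values."""
    result = []
    for chunk in text.split(terminator):
        # Only lines that start with '- ' count; strip the prefix and whitespace.
        values = [line[2:].strip()
                  for line in chunk.splitlines()
                  if line.startswith("- ")]
        # Chunks without any '- ' line (e.g. the text after the last
        # terminator) are dropped, so no trailing 'del' is needed.
        if values:
            result.append(values)
    return result


sample = "Hello user123,\n\n- 604-708-0900\n- 152\n- minutes\n\nRegards\n"
print(split_messages(sample))  # [['604-708-0900', '152', 'minutes']]
```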
