Python regex: match a line whether it exists or not

I have a little problem with a regex.

Here is a sample of the text to parse:

output = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""

I want to parse this and return the country, the continent and, if it is present, the planet.

So I wrote this regex:

results = re.findall(
    r"""(?mx)
        ^country\s:\s*(.+)\s
        (?:^.+\s)*?
        ^continent\s:\s*(.+)\s
        (?:^.+\s)*?
        (?:^planet\s:\s*(.+)\s)*?
""",output)

but the result is:

[('USA', 'Americ', ''), ('China', 'Asia', ''), ('Izbud', 'Gladiora', '')]

The planet is never captured, and I don't know where my regex is wrong.

If anyone has an idea, thanks.


I found a pattern that seems to work:

r"""(?mx)
    ^country\s:\s*(.+)\s
    (?:^.+\s)*?
    ^continent\s:\s*(.+)\s
    (?:^.+\s)*?
    (?:^(?:planet\s:\s*(.+)\s|-+\s|\Z))
"""

Basically, I changed the last part so that it must match one of the following: the planet line, a line of dashes, or the end of the string. It's kind of ugly, but it was the only way I could find to make sure the planet line gets captured. One problem with my solution is that the string has to end with a newline (as in your example), or the last record won't be matched.
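
To see both the behaviour and the caveat, here is a minimal sketch (Python 3) that runs the pattern above on the sample from the question, once as given and once with the trailing newline removed. The names `sample`, `PATTERN`, `with_newline` and `without_newline` are mine, introduced only for illustration.

```python
import re

# Sample text from the question; note that it ends with a newline.
sample = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""

# Each record must end with a planet line, a line of dashes,
# or the end of the string.
PATTERN = re.compile(r"""(?mx)
    ^country\s:\s*(.+)\s
    (?:^.+\s)*?
    ^continent\s:\s*(.+)\s
    (?:^.+\s)*?
    (?:^(?:planet\s:\s*(.+)\s|-+\s|\Z))
""")

with_newline = PATTERN.findall(sample)
without_newline = PATTERN.findall(sample.rstrip("\n"))

print(with_newline)     # all three records
print(without_newline)  # the last record is missing
```

The last record is lost in the second call because every captured value must be followed by a whitespace character (`\s`), and without the final newline there is nothing after `Mars` to satisfy it.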

By the way, a partial fix is to change the last line of the OP's pattern so that it ends with ? rather than *?. However, it will then only capture a planet line that immediately follows the continent line. The reason it wasn't capturing anything before is that *? is lazy: it matches as few repetitions as possible, and since nothing after it in the pattern has to match, zero repetitions is always enough.
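
The difference between the two quantifiers is easy to reproduce. The sketch below (Python 3) appends either `*?` or `?` to the last group of the question's pattern and compares the results on a shortened sample. `BASE`, `sample`, `lazy` and `greedy` are illustrative names, not part of the original code.

```python
import re

# Shortened sample: China's planet directly follows its continent,
# Izbud's planet comes after one junk line.
sample = """
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
planet : Mars
"""

# The question's pattern, without a quantifier on the final group.
BASE = r"""(?mx)
    ^country\s:\s*(.+)\s
    (?:^.+\s)*?
    ^continent\s:\s*(.+)\s
    (?:^.+\s)*?
    (?:^planet\s:\s*(.+)\s)"""

lazy = re.findall(BASE + "*?", sample)   # original: lazy, settles for zero matches
greedy = re.findall(BASE + "?", sample)  # partial fix: tries the planet line once

print(lazy)
print(greedy)
```

With `*?` the planet group is never entered, so the third field is always empty. With `?` the group is tried once right after the continent line, which finds `Earth` but still misses `Mars`.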


I'm going to suggest what I would do, which is to avoid such complex regexes altogether and parse the text line by line. Probably something like:

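records = []
current = {}
for line in output.splitlines():  # or iterate over an open file
    line = line.strip()
    if line and set(line) == {'-'}:
        # Separator line: store the finished record and start a new one
        if current:
            records.append(current)
        current = {}
        continue
    parts = line.split()
    # A field line looks like "country : USA", i.e. ['country', ':', 'USA']
    if len(parts) >= 3 and parts[1] == ':' and parts[0] in ('country', 'continent', 'planet'):
        current[parts[0]] = ' '.join(parts[2:])
    # etc...
if current:
    # Store the last record, which has no separator after it
    records.append(current)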


Try this, which requires the planet line to be present:

r"""(?mx)
        ^country\s:\s*(.+)\s
        (?:^.+\s)*?
        ^continent\s:\s*(.+)\s
        (?:^.+\s)*?
        ^planet\s:\s*(.+)\s.*
"""

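A quick check against the sample from the question (Python 3; `sample`, `PATTERN` and `result` are illustrative names) suggests why the planet line cannot simply be made mandatory. The lazy junk-line group keeps expanding across the separator until it reaches a planet line, so the USA record ends up paired with China's planet and the China record disappears.

```python
import re

sample = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""

# Same pattern as above: the planet line is required, not optional.
PATTERN = re.compile(r"""(?mx)
        ^country\s:\s*(.+)\s
        (?:^.+\s)*?
        ^continent\s:\s*(.+)\s
        (?:^.+\s)*?
        ^planet\s:\s*(.+)\s.*
""")

result = PATTERN.findall(sample)
print(result)  # only two tuples; USA is paired with Earth
```

So this variant is only safe when every record is known to contain a planet line.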

# Splitting on '\n' turns the string into a list of lines
n_line = output.split('\n')

# Keep only the lines that contain a ':' separator
n_line = [n_text for n_text in n_line if ':' in n_text]

for l_text in n_line:
    # Split on the first ':' only, so values containing ':' stay intact
    n_split = l_text.split(':', 1)
    key = n_split[0].strip()
    if key in ('country', 'continent', 'planet'):
        print(n_split[1].strip())
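
The split-based code above prints the values one after another without grouping them per record. As a possible extension (a sketch in Python 3, not the answerer's code), the text can first be cut into blocks on the dashed separator lines and each block turned into a dict. `parse_records`, `sample` and `records` are hypothetical names, and the sketch assumes separators consist only of dashes.

```python
import re

sample = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zzzzzzz
continent : Asia
planet : Earth
-------
country : Izbud
zzzzzzz
continent : Gladiora
zzzzzzz
zzzzzzz
planet : Mars
"""

def parse_records(text):
    """Split text on dashed separator lines and collect key/value pairs."""
    records = []
    for block in re.split(r"(?m)^-+$", text):
        record = {}
        for line in block.splitlines():
            # partition() returns an empty separator when ':' is absent
            key, sep, value = line.partition(":")
            if sep and key.strip() in ("country", "continent", "planet"):
                record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records

records = parse_records(sample)
for record in records:
    # .get() returns None when a record has no planet line
    print(record.get("country"), record.get("continent"), record.get("planet"))
```

A missing planet shows up as an absent key rather than an empty string, which makes the optional field explicit.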


I strongly agree with everyone else who said that you shouldn't do this with a regexp. That said, you can make it work if you use a negative lookahead before consuming each "junk" line. E.g.:

import re

print(re.findall(r"""(?mx)
    ^country\s:\s*(.+)\s
    (?:^.+\s)*?
    ^continent\s:\s*(.+)\s
    (?:(?:(?!(?:planet|country|continent)\s:)^.+\s)*
       (?:^planet\s:\s*(.+)\s))?
""", output))

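The key piece is the negative lookahead in front of the junk-line group. In isolation it can be sketched like this (Python 3; the text and the names `with_lookahead` and `without_lookahead` are made up for illustration): the guarded group stops consuming lines as soon as the next line starts with a keyword, while the unguarded group runs through everything.

```python
import re

text = "aaa\nbbb\nplanet : Mars\nccc\n"

# Consume whole lines, but refuse any line that starts with "planet :".
with_lookahead = re.compile(r"(?m)(?:(?!planet\s:)^.+\n)*")

# Same group without the guard: every line counts as junk.
without_lookahead = re.compile(r"(?m)(?:^.+\n)*")

print(repr(with_lookahead.match(text).group(0)))     # stops before the planet line
print(repr(without_lookahead.match(text).group(0)))  # swallows the whole text
```

Because the guarded group can never run past a `country` or `continent` line, the optional planet group in the full pattern cannot borrow a planet from the next record.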

import re

# Raw string, so that \Z reaches the regex engine without an invalid-escape warning
pat = re.compile(r'country : (.+)\n.+\ncontinent : (.+)(?:\n.*)*?(?:\nplanet : (.+)|\n-+|\n?\Z)')

output1 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zziiiiiiiiiiiizz
continent : Asia
planet : Earth
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
planet : Mars """

output2 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
"""

output3 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug"""

output4 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
"""

output5 = """
country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora"""

for ch in (output1, output2, output3, output4, output5):
    print(ch)
    print()
    print(repr(ch))
    print()
    print('\n'.join(repr(u) for u in pat.findall(ch)))
    print('======================================================================')

Result:

country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : China
zziiiiiiiiiiiizz
continent : Asia
planet : Earth
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug
planet : Mars 

'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n------\ncountry : China\nzziiiiiiiiiiiizz\ncontinent : Asia\nplanet : Earth\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\nzzzzzzz\nuyututuug\nplanet : Mars '

('USA', 'Americ', '')
('China', 'Asia', 'Earth')
('Izbud', 'Gladiora', 'Mars ')
======================================================================

country : USA
zzzzzzz
continent : Americ
eeeeeee
------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug


'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\nzzzzzzz\nuyututuug\n'

('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================

country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora
zzzzzzz
uyututuug

'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\nzzzzzzz\nuyututuug'

('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================

country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora


'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora\n'

('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================

country : USA
zzzzzzz
continent : Americ
eeeeeee
-------
country : Izbud
zzuuuuuuuuuuuuz
continent : Gladiora

'\ncountry : USA\nzzzzzzz\ncontinent : Americ\neeeeeee\n-------\ncountry : Izbud\nzzuuuuuuuuuuuuz\ncontinent : Gladiora'

('USA', 'Americ', '')
('Izbud', 'Gladiora', '')
======================================================================