Creating Regular Expressions in Python
I'm trying to create a regular expression that extracts part of the following text fragment:
amd64 build of software 1:0.98.10-0.2svn20090909 in archive
What I want to extract is:
software 1:0.98.10-0.2svn20090909
How can I do this? This is what I have so far:
import re

p = re.compile(r'([a-zA-Z0-9\-\+\.]+)\ ([0-9\:\.\-]+)')
iterator = p.finditer("amd64 build of software 1:0.98.10-0.2svn20090909 in archive")
for match in iterator:
    print(match.group())
with the result:
software 1:0.98.10-0.2
(svn20090909 is missing)
Thanks a lot.
This will work:
import re

p = re.compile(r'([a-zA-Z0-9\-\+\.]+)\ ([0-9][0-9a-zA-Z\:\.\-]+)')
iterator = p.finditer("amd64 build of dvdrip software 1:0.98.10-0.2svn20090909 in archive")
for match in iterator:
    print(match.group())
# Prints: software 1:0.98.10-0.2svn20090909
That works by allowing the captured section to contain letters while still insisting that it starts with a number.
Without seeing all the other strings it needs to match, I can't be sure whether that's good enough.
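To make the difference concrete, here is a small side-by-side sketch that runs the pattern from the question and the pattern above on the sample line. The variable names (original, fixed, and so on) are mine, and this only checks the one string given in the question:

```python
import re

text = "amd64 build of software 1:0.98.10-0.2svn20090909 in archive"

# Pattern from the question: the version part allows only digits, ':', '.' and '-',
# so matching stops at the first letter of "svn".
original = re.compile(r'([a-zA-Z0-9\-\+\.]+)\ ([0-9\:\.\-]+)')

# Pattern from this answer: the version part must start with a digit,
# but may contain letters after that.
fixed = re.compile(r'([a-zA-Z0-9\-\+\.]+)\ ([0-9][0-9a-zA-Z\:\.\-]+)')

original_matches = [m.group() for m in original.finditer(text)]
fixed_matches = [m.group() for m in fixed.finditer(text)]

print(original_matches)
print(fixed_matches)
```

If other lines contain a word followed by a token that starts with a digit, the fixed pattern will match those too, so it is worth running it over a few real lines first.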
If your lines are consistent, that is, if each entry is on one line and the word you want always comes right before the version part (the 1:0.98 ... part), you don't need a regexp. Try this:
>>> s = 'amd64 build of software 1:0.98.10-0.2svn20090909 in archive'
>>> match = [s.split()[3], s.split()[4]]
>>> print(match)
['software', '1:0.98.10-0.2svn20090909']
>>> # alternatively
>>> match = s.split()[3:5] # for same result
What this does is the following: it first splits the line s at the spaces (using the string method split()) and selects the fourth and fifth elements of the resulting list; both are stored in the variable match.
Again, this only works if you have one entry per line and if the 'software' part always comes right before the 1:0.98.10-0.2svn20090909 part.
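If you want to guard against lines that don't have enough fields, the split approach could be wrapped roughly like this. The function name extract_name_and_version and the name_index parameter are my own invention, and the sketch assumes the name is always the fourth whitespace-separated field:

```python
def extract_name_and_version(line, name_index=3):
    """Return [name, version] from a whitespace-separated line.

    Returns None when the line has too few fields. name_index is the
    (assumed) position of the package name; the version is the next field.
    """
    fields = line.split()
    if len(fields) < name_index + 2:
        return None
    return fields[name_index:name_index + 2]


s = 'amd64 build of software 1:0.98.10-0.2svn20090909 in archive'
result = extract_name_and_version(s)
print(result)
```

Note that this does not check that the second field looks like a version; it only protects against an IndexError-style failure on short lines.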
I often avoid regexps when splitting the string into a list is enough. If the parsing becomes a nightmare, I use pyparsing.
Don't use a capturing group if you want everything in one piece.
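One way to see the effect is with re.findall, which returns tuples of the groups when the pattern has capturing groups and the whole matched string when it has none. This is only a sketch; the pattern is a simplified rewrite of the one above, and the variable names are mine:

```python
import re

text = "amd64 build of software 1:0.98.10-0.2svn20090909 in archive"

# With two capturing groups, findall returns a tuple per match.
with_groups = re.findall(r'([a-zA-Z0-9+.-]+) ([0-9][0-9a-zA-Z:.-]+)', text)

# Without groups, findall returns each match as a single string.
without_groups = re.findall(r'[a-zA-Z0-9+.-]+ [0-9][0-9a-zA-Z:.-]+', text)

print(with_groups)
print(without_groups)
```

With finditer, match.group() (or match.group(0)) already gives the whole match either way, so the groups only matter if you want the two parts separately.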