How to improve my Python regex syntax?
I am very new to Python, and fairly new to regex. (I have no Perl experience.)
I am able to use regular expressions in a way that works, but I'm not sure that my code is particularly Pythonic or concise.
For example, if I wanted to read in a text file and print out the text that appears directly between the words 'foo' and 'bar' in each line (presuming this occurs zero or one times per line), I would write the following:
fileList = open(inFile, 'r')
pattern = re.compile(r'(foo)(.*)(bar)')
for line in fileList:
    result = pattern.search(line)
    if (result != None):
        print result.groups()[1]
Is there a better way? The if is necessary to avoid calling groups() on None. But I suspect there is a more concise way to obtain the matching string when there is one, without throwing errors when there isn't.
I'm not hoping for Perl-like unreadability. I just want to accomplish this common task in the commonest and simplest way.
I think it's fine.
Some minor points:
- You can replace result.groups()[x] with result.group(x+1).
- If you don't need to capture foo and bar, just use r'foo(.*)bar'.
- If you're using Python 2.5+, try to use the with statement so that the file is closed properly even when an exception is raised.
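Putting those three points together, the loop might look roughly like this. It is only a sketch in Python 3 syntax (print is a function there); between_markers and the sample lines are names made up for illustration, not anything from the question.

```python
import re

# Only the text between the markers is captured, so it is group(1).
pattern = re.compile(r'foo(.*)bar')

def between_markers(lines):
    """Yield the text between 'foo' and 'bar' for each line that has both."""
    for line in lines:
        result = pattern.search(line)
        if result is not None:
            yield result.group(1)

# Hypothetical sample lines standing in for the file's contents.
sample = ['a foo one bar b\n', 'no markers here\n', 'foo two bar\n']
for text in between_markers(sample):
    print(text)
```

With a real file you would pass the open file object instead of the list, inside a with open(inFile) as fileList: block.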
BTW, as a 5-liner (not that I recommend this):
import re
pattern = re.compile(r'foo(.*)bar')
with open(inFile, 'r') as fileList:
    searchResults = (pattern.search(line) for line in fileList)
    groups = (result.group(1) for result in searchResults if result is not None)
    print '\n'.join(groups)
There are two tricks to be had: the first is the re.finditer regular expression function (and method). The second is the use of the mmap module.
From the documentation on re.DOTALL, we can note that . does not match newlines:
without this flag, '.' will match anything except a newline.
So if you look for all matches anywhere in the file (such as when read into a string using f.read()), you can pretend each line is an isolated substring (note: it's not quite true, though. If you want the ^ and $ assertions to work this way, use re.MULTILINE). Now, because you noted that we assume there are only zero or one occurrences per line, we don't have to worry about re.finditer() matching more than it should (because it would!). So right away, you could replace all that with iterating over finditer() instead:
import re
fileList = open(inFile, 'r')
pattern = re.compile(r'foo(.*)bar')
for result in pattern.finditer(fileList.read()):
    # group(1) is the text captured between foo and bar
    print result.group(1)
This isn't really nice though. The problem here is that the entire file is read into memory for your convenience. It'd be nice if there was a convenient way to do it without possibly breaking on larger files. And, well, there is! Enter the mmap module.
mmap lets you treat a file as if it were a string (a mutable string, no less!), and it doesn't load the whole thing into memory. The long and short of it is, you can use the following code instead:
import mmap
import re
fileList = open(inFile, 'r+b')
# A length of 0 maps the whole file
fileS = mmap.mmap(fileList.fileno(), 0)
pattern = re.compile(r'foo(.*)bar')
for result in pattern.finditer(fileS):
    print result.group(1)
and it will work just the same, but without consuming the whole file at once (hopefully).
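If you are on Python 3, the mmap version needs one adjustment: an mmap object exposes bytes, so the pattern has to be a bytes pattern as well. A rough sketch, where matches_in and matches_in_file are hypothetical helper names and UTF-8 is an assumed encoding:

```python
import mmap
import re

# A bytes pattern, because the mapped file is searched as bytes.
pattern = re.compile(rb'foo(.*)bar')

def matches_in(buffer):
    """Return the decoded text between 'foo' and 'bar' for each match."""
    # Assumption: the file is UTF-8 encoded.
    return [m.group(1).decode('utf-8') for m in pattern.finditer(buffer)]

def matches_in_file(path):
    """Search a file through mmap instead of reading it into a string."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; ACCESS_READ means read-only access
        # is enough. Note that mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return matches_in(mapped)
```

Because . does not match a newline byte either, each match still stays within a single line.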
You don't need regex. Split your string on "bar", iterate over the pieces, and for each piece that contains "foo", split it on "foo" and take what is to the right. Of course, you can use other string manipulation too, such as finding the index of each word.
>>> s="w1 w2 foo what i want bar w3 w4 foowhatiwantbar w5"
>>> for item in s.split("bar"):
...     if "foo" in item:
...         print item.split("foo")[1:]
...
[' what i want ']
['whatiwant']
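One way to do the "other string manipulation" mentioned above is str.partition, which splits at the first occurrence of a separator. A minimal sketch (Python 3 syntax; the function name between and its default arguments are just an illustration):

```python
def between(line, start='foo', end='bar'):
    """Return the text between the first start and the next end, or None."""
    # partition returns (before, separator, after); the separator slot
    # is an empty string when the separator was not found.
    _, found_start, rest = line.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    return middle if found_end else None

print(repr(between('w1 w2 foo what i want bar w3')))
print(repr(between('no markers here')))
```

Note that this stops at the first "bar" after the first "foo", so on a line with several pairs it returns only the first one.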
I have a few minor suggestions:
- Unless you're certain that foo and bar can occur no more than once per line, it's better to use .*? instead of .*
- If you need to make sure that foo and bar are only matched as entire words (as opposed to foonly and rebar), you should add \b anchors around them (\bfoo\b etc.)
- You could use lookaround to match only the match itself ((?<=\bfoo\b).*?(?=\bbar\b)), so now result.group(0) will contain the match. But that's not really more readable :)
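To see what each of these suggestions changes, here is a small sketch on made-up sample strings (Python 3 syntax; the variable names are only for illustration):

```python
import re

line = 'foo one bar and foo two bar'

# Greedy: .* runs on to the last 'bar' in the line, giving one long match.
greedy = re.findall(r'foo(.*)bar', line)

# Non-greedy: .*? stops at the first 'bar' after each 'foo'.
lazy = re.findall(r'foo(.*?)bar', line)

# Word boundaries: 'foonly' and 'rebar' no longer count as markers.
words = 'foonly x rebar, foo y bar'
whole_words = re.findall(r'\bfoo\b(.*?)\bbar\b', words)

# Lookaround: the markers are asserted but not consumed,
# so group(0) is already the text in between.
m = re.search(r'(?<=\bfoo\b).*?(?=\bbar\b)', words)

print(greedy)        # [' one bar and foo two ']
print(lazy)          # [' one ', ' two ']
print(whole_words)   # [' y ']
print(repr(m.group(0)))
```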