How to shorten this expression using regex
I have the following if statement:
if not fileName.startswith(".") and re.search("(.exe|.EXE)$", fileName) is not None and not fileName.endswith("-xyz.exe"):
    pass
Essentially, I would like to check that the filename does not start with a period and ends with either the .exe or the .EXE extension, but not with -xyz.exe. How can I get rid of the startswith and endswith calls and fold those two checks into the regex itself?
UPDATE: I ask because I want to learn more about regex. Depending on the readability, I will determine if that would be worth making it more concise or not.
UPDATE 2: I ran into this situation. I always look for opportunities to learn more about regex. This seems like a good opportunity, so I TRIED to do it MYSELF FIRST until I got stuck. Please do not give non-regex solution or echo Mark Pilgrim's statement about "now you have 2 problems", because anyone could have done that. Instead, prove it to me that now I have 2 problems, just like Mark Pilgrim continued with his lesson. Or show me that it's slick.
Learn to use elementary regular expressions before you start trying to "shorten" your code.
This piece, re.search("(.exe|.EXE)$", fileName), has THREE deficiencies:
(1) Should use raw strings by habit, even when it makes no difference, because then you (and your readers) don't need to spend time nutting out whether it matters or not.
(2) Unescaped . matches ANY character (except a newline, in the default case).
(3) $ matches before a newline at the end of a string; you should use \Z instead. If you don't, foo.exe\n (easy enough to get by mistake if your input was supplied by someone who didn't strip the \n) will match.
What you need is re.search(r"(\.exe|\.EXE)\Z", fileName)
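A minimal sketch of deficiencies (2) and (3), written for Python 3. The names loose and strict and the sample file names are made up for illustration:

```python
import re

loose = re.compile("(.exe|.EXE)$")        # pattern from the question
strict = re.compile(r"(\.exe|\.EXE)\Z")   # corrected pattern

# (2) The unescaped dot matches any character, so "Xexe" passes as an extension.
print(bool(loose.search("fooXexe")))      # True
print(bool(strict.search("fooXexe")))     # False

# (3) "$" also matches just before a trailing newline; "\Z" does not.
print(bool(loose.search("foo.exe\n")))    # True
print(bool(strict.search("foo.exe\n")))   # False

# Both patterns accept a genuine .exe name.
print(bool(strict.search("foo.exe")))     # True
```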
Update for the benefit of anyone who thinks that re.search("^blahblah", ...) is a good idea:
>\python27\python -mtimeit -s"import re;s='x'*100" "re.match(r'foo',s)"
1000000 loops, best of 3: 1.2 usec per loop
>\python27\python -mtimeit -s"import re;s='x'*100" "re.search(r'^foo',s)"
100000 loops, best of 3: 2.91 usec per loop
>\python27\python -mtimeit -s"import re;s='x'*1000" "re.match(r'foo',s)"
1000000 loops, best of 3: 1.2 usec per loop
>\python27\python -mtimeit -s"import re;s='x'*1000" "re.search(r'^foo',s)"
100000 loops, best of 3: 18.5 usec per loop
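The same comparison can be run from a script with the timeit module. This is only a sketch: the variable names are my own, and the numbers depend on the machine and the Python version, so no expected timings are shown.

```python
import re
import timeit

subject = "x" * 1000  # non-matching input, as in the command-line runs above

# re.match only ever tries position 0 of the subject.
t_match = timeit.timeit(lambda: re.match(r"foo", subject), number=10000)

# re.search with "^" expresses the same test through the search API.
t_search = timeit.timeit(lambda: re.search(r"^foo", subject), number=10000)

print("re.match : %.4f s" % t_match)
print("re.search: %.4f s" % t_search)
```

Either way the result is the same (no match); only the cost of arriving at it may differ.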
This one is pretty simple actually:
if re.search(
    r"""# Always use VERBOSE when composing non-trivial regex!
    ^                  # Anchor to start of string.
    # Apply multiple lookahead assertions from string start:
    (?!\.)             # Assert does NOT begin with dot.
    (?=.*\.exe$)       # Assert DOES end with .EXE
    (?!.*-xyz\.exe$)   # Assert does NOT end with -XYZ.EXE
    .*                 # Ok to match the filename (optional).
    """,
    subject, re.IGNORECASE | re.VERBOSE):
    pass  # Successful match
else:
    pass  # Match attempt failed
Edit: After reading your question a bit closer, it appears you are concerned with the case of the EXE. In that case the regex can easily handle that, too:
if re.search(
    r"""# Always use VERBOSE when composing non-trivial regex!
    ^                      # Anchor to start of string.
    # Apply multiple lookahead assertions from string start:
    (?!\.)                 # Assert does not begin with dot.
    (?=.*\.(?:exe|EXE)$)   # Assert DOES end with .EXE or .exe
    (?!.*-xyz\.exe$)       # Assert does NOT end with -xyz.exe
    .*                     # Ok to match the filename (optional).
    """,
    subject, re.VERBOSE):
    pass  # Successful match
else:
    pass  # Match attempt failed
Edit2: John Machin has pointed out that with Python, when you are looking for a match that can only occur at the start of a target string, using the ^ start-of-string assertion with the re.search method is much slower than using re.match (and is considered bad practice). With that in mind, here is an improved version:
if re.match(
    r"""# Always use VERBOSE when composing non-trivial regex!
    # Apply multiple lookahead assertions from string start:
    (?!\.)                 # Assert does not begin with dot.
    (?=.*\.(?:exe|EXE)$)   # Assert DOES end with .EXE or .exe
    (?!.*-xyz\.exe$)       # Assert does NOT end with -xyz.exe
    .*                     # Ok to match the filename (optional).
    """,
    subject, re.VERBOSE):
    pass  # Successful match
else:
    pass  # Match attempt failed
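One way to gain confidence in a lookahead pattern like this is to compare it against a plain string check over a handful of names. The sketch below is for Python 3; WANTED, regex_check, plain_check and the sample names are hypothetical helpers, not part of the answer above.

```python
import re

WANTED = re.compile(r"""
    (?!\.)                  # must not begin with a dot
    (?=.*\.(?:exe|EXE)$)    # must end with .exe or .EXE
    (?!.*-xyz\.exe$)        # must not end with -xyz.exe
    """, re.VERBOSE)

def regex_check(name):
    # The pattern consists only of lookaheads, so a successful match is empty.
    return WANTED.match(name) is not None

def plain_check(name):
    # The same three conditions written with string methods.
    return (not name.startswith(".")
            and name.endswith((".exe", ".EXE"))
            and not name.endswith("-xyz.exe"))

samples = ["123.exe", "123.EXE", "123.dat", ".123.exe",
           "123-xyz.exe", "123-xyz.EXE", "a.exe", "exe"]

for name in samples:
    assert regex_check(name) == plain_check(name), name
    print("%-12s %s" % (name, regex_check(name)))
```

Note that "123-xyz.EXE" is accepted by both checks, because only the lowercase -xyz.exe suffix is excluded.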
I wouldn't use a regexp, just wrap that onto multiple lines and make it a little smarter:
if not filename.startswith(".") \
        and filename.lower().endswith(".exe") \
        and not filename.endswith("-xyz.exe"):
    pass  # do stuff
Do note that this is slightly different in that *.eXe, *.eXE and other mixed-case versions of the extension would then be accepted as well, unlike in the original. But I'm betting that it doesn't really matter and that my test is better.
edit: fixed the ".exe" part because I had the condition flipped. But if you're trying to learn regular expressions, this is a weird, contrived example, and I think it's best not to shoehorn regular expressions into a problem where they are not a good solution.
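To see exactly where the lowercase test and the question's test part ways, both can be wrapped in functions and run over a few names. This is a sketch for Python 3; original_check, lowercase_check and the sample names are hypothetical, and original_check uses str.endswith in place of the question's regex.

```python
def original_check(name):
    # Only the exact spellings .exe and .EXE count as the extension.
    return (not name.startswith(".")
            and name.endswith((".exe", ".EXE"))
            and not name.endswith("-xyz.exe"))

def lowercase_check(name):
    # The multi-line condition above: any capitalisation of .exe counts.
    return (not name.startswith(".")
            and name.lower().endswith(".exe")
            and not name.endswith("-xyz.exe"))

for name in ["setup.exe", "setup.EXE", "setup.eXe", "setup-xyz.exe"]:
    print("%-14s original=%-5s lowercase=%s"
          % (name, original_check(name), lowercase_check(name)))
```

Only the mixed-case name gets a different answer from the two functions.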
You need to use a negative lookbehind assertion:
import re

regex = '[^.].*(?:(?<!-xyz).exe|.EXE)'

vectors = (
    '.123.dat',
    '.123.exe',
    '.123.EXE',
    '123.dat',
    '123.exe',
    '123.EXE',
    '.123-xyz.dat',
    '.123-xyz.exe',
    '.123-xyz.EXE',
    '123-xyz.dat',
    '123-xyz.exe',
    '123-xyz.EXE',
)

for v in vectors:
    # Expected result: the original condition from the question.
    expected = (not v.startswith(".") and
                re.search("(.exe|.EXE)$", v) is not None and
                not v.endswith("-xyz.exe"))
    if bool(re.match(regex, v)) == expected:
        print('%s: PASS' % v)
    else:
        print('%s: FAIL' % v)
import re

# Raw strings keep the backslashes intact for the regex engine.
pat = re.compile(r'(?!\.)'
                 r'.+'
                 r'\.'
                 r'(?:(?<!-xyz\.)exe|EXE)'
                 r'\Z')

names = ('.123.dat', '.123.exe', '.123.EXE',
         '123.dat', '123.exe', '123.EXE',
         '123-xyz.dat', '123-xyz.exe', '123-xyz.EXE', )

print('\n'.join(v.ljust(18) + str(bool(pat.match(v))) for v in names))
EDIT:
You are right, ridgerunner, [^.] is better than (?!\.): it's more readable, more logical and slightly faster (about 4 % in my tests).
I also compared '(?!\.).+?\.(?:EXE|(?<!-xyz\.)exe)\Z' (with .+? instead of .+).
With this RE, execution takes longer. The additional time depends on the number of dots in the tested names.
On names like '78999.abc.juty.123.dat' it is around 15 % longer, and on names like '123.dat' it's 3 % longer. I think the reason is that, after reading each character, the regex engine checks whether the next character is a dot.
By contrast, with '.+\.' the regex engine runs to the end of the string and then backtracks to find the last dot. I think this explanation is correct because, when the RE '(?!\.).+?\.(?:EXE|(?<!-xyz\.)exe)\Z' is tested on names like '123.teybertyhbeythbeytberyetynetynetnyetnydat', the time is longer again (+30 %).
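A rough way to repeat this comparison, assuming Python 3: the names greedy and lazy are my own, and the lookbehind is written with the trailing dot, as in the compiled pattern earlier in this answer. No timings are quoted, since they vary by machine and Python version.

```python
import re
import timeit

greedy = re.compile(r"(?!\.).+\.(?:EXE|(?<!-xyz\.)exe)\Z")
lazy = re.compile(r"(?!\.).+?\.(?:EXE|(?<!-xyz\.)exe)\Z")

names = ["123.dat", "78999.abc.juty.123.dat", "123.exe", "123-xyz.exe"]

# Both quantifiers must give the same verdict; only the search order differs.
for name in names:
    assert bool(greedy.match(name)) == bool(lazy.match(name)), name

for label, pat in (("greedy .+ ", greedy), ("lazy   .+?", lazy)):
    elapsed = timeit.timeit(lambda: [pat.match(n) for n in names], number=20000)
    print("%s %.4f s" % (label, elapsed))
```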
I realized that my RE is very similar to Ignacio's, and I wondered why I wrote it, as it seems to offer no particular advantage. In fact, my initial idea was to write '(?!\.).+?(?<=.EXE|(?<!-xyz).exe)\Z' and then I wrote a different string. By the way, with this abandoned RE, the execution time is 25 % longer on short names and 74 % longer on long names.
Finally, when I tested the execution times, Ignacio's solution was 25 % longer on short names ('123.dat') and 47 % longer on long names ('78999.abc.juty.123.dat').
The best regex is then
pat = re.compile(r'[^.]'
                 r'.*'
                 r'\.'
                 r'(?:(?<!-xyz\.)exe|EXE)'
                 r'\Z')
I use '.*' here rather than '.+': since '[^.]' already consumes the first character of the name, keeping '.+' would require at least two characters before the dot and would wrongly reject a name such as 'a.exe'.
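The choice between '.+' and '.*' after '[^.]' matters for names with a single character before the extension. A small sketch for Python 3 that puts the two variants side by side (with_plus, with_star and the sample names are hypothetical):

```python
import re

with_plus = re.compile(r"[^.].+\.(?:(?<!-xyz\.)exe|EXE)\Z")
with_star = re.compile(r"[^.].*\.(?:(?<!-xyz\.)exe|EXE)\Z")

for name in ["a.exe", "ab.exe", "a.EXE", ".exe", "a-xyz.exe"]:
    print("%-10s plus=%-5s star=%s"
          % (name, bool(with_plus.match(name)), bool(with_star.match(name))))
```

The two variants disagree only on 'a.exe' and 'a.EXE', which the '.+' form rejects because it needs a second character before the dot.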
That will involve accessing every file that doesn't start with '.' or ends with '-xyz.exe'. The regex module cannot parse stuff outside its namespace. I don't think it's possible, but have you tried checking the module's doc?