How to shorten this expression using regex

Posted 2023-02-18 09:15

I have the following if statement:

if not fileName.startswith(".") and re.search("(.exe|.EXE)$", fileName) is not None and not fileName.endswith("-xyz.exe"):
    pass

Essentially, I would like to check that the filename does not start with a period and ends with either the .exe or the .EXE extension, but not with -xyz.exe. How can I get rid of the startswith and endswith calls and fold those two checks into the regex itself?

UPDATE: I ask because I want to learn more about regex. Depending on the readability, I will determine if that would be worth making it more concise or not.

UPDATE 2: I ran into this situation. I always look for opportunities to learn more about regex. This seems like a good opportunity, so I TRIED to do it MYSELF FIRST until I got stuck. Please do not give non-regex solution or echo Mark Pilgrim's statement about "now you have 2 problems", because anyone could have done that. Instead, prove it to me that now I have 2 problems, just like Mark Pilgrim continued with his lesson. Or show me that it's slick.

Learn to use elementary regular expressions before you start trying to "shorten" your code.

This piece re.search("(.exe|.EXE)$", fileName) has THREE deficiencies:

(1) Should use raw strings by habit, even when it makes no difference, because then you (and your readers) don't need to spend time nutting out whether it matters or not.

(2) An unescaped . matches ANY character (except a newline, by default).

(3) $ matches before a newline at the end of a string; you should use \Z instead. If you don't, foo.exe\n (easy enough to get by mistake if your input was supplied by someone who didn't strip the \n) will match.

What you need is re.search(r"(\.exe|\.EXE)\Z", fileName)
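The points about the unescaped dot and `$` can be checked directly. A minimal sketch (Python 3; the sample names are made up for illustration, and `loose` and `strict` are just local names for the two patterns):

```python
import re

loose = re.compile("(.exe|.EXE)$")        # pattern from the question
strict = re.compile(r"(\.exe|\.EXE)\Z")   # pattern suggested above

samples = ["foo.exe", "foo.EXE", "fooxexe", "foo.exe\n", "foo.txt"]

for name in samples:
    # "fooxexe" shows the unescaped dot; "foo.exe\n" shows $ versus \Z.
    print(repr(name).ljust(12), bool(loose.search(name)), bool(strict.search(name)))
```

The two patterns should disagree only on `fooxexe` and on the name with the trailing newline.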

Update for the benefit of anyone who thinks that re.search("^blahblah", ...) is a good idea:

>\python27\python -mtimeit -s"import re;s='x'*100" "re.match(r'foo',s)"
1000000 loops, best of 3: 1.2 usec per loop

>\python27\python -mtimeit -s"import re;s='x'*100" "re.search(r'^foo',s)"
100000 loops, best of 3: 2.91 usec per loop

>\python27\python -mtimeit -s"import re;s='x'*1000" "re.match(r'foo',s)"
1000000 loops, best of 3: 1.2 usec per loop

>\python27\python -mtimeit -s"import re;s='x'*1000" "re.search(r'^foo',s)"
100000 loops, best of 3: 18.5 usec per loop
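To repeat the comparison from a script rather than the command line, something like the following should do. It is only a sketch: the repetition count is arbitrary, and the gap measured above on Python 2.7 may be smaller or absent on newer interpreters, so no expected numbers are given.

```python
import re
import timeit

s = "x" * 1000  # same non-matching subject as in the runs above

# Neither call finds a match; the question is how much work each does first.
t_match = timeit.timeit(lambda: re.match(r"foo", s), number=20000)
t_search = timeit.timeit(lambda: re.search(r"^foo", s), number=20000)

print("re.match(r'foo', s):   %.4f s" % t_match)
print("re.search(r'^foo', s): %.4f s" % t_search)
```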

This one is pretty simple actually:

if re.search(
    r"""# Always use VERBOSE when composing non-trivial regex!
    ^                 # Anchor to start of string.
    # Apply multiple lookahead assertions from string start:
    (?!\.)            # Assert does NOT begin with dot.
    (?=.*\.exe$)      # Assert DOES end with .EXE
    (?!.*-xyz\.exe$)  # Assert does NOT end with -XYZ.EXE
    .*                # Ok to match the filename (optional).
    """, 
    subject, re.IGNORECASE | re.VERBOSE):
    pass  # Successful match
else:
    pass  # Match attempt failed

Edit: After reading your question a bit closer, it appears you are concerned with the case of the EXE. In that case the regex can easily handle that, too:

if re.search(
    r"""# Always use VERBOSE when composing non-trivial regex!
    ^                     # Anchor to start of string.
    # Apply multiple lookahead assertions from string start:
    (?!\.)                # Assert does not begin with dot.
    (?=.*\.(?:exe|EXE)$)  # Assert DOES end with .EXE or .exe
    (?!.*-xyz\.exe$)      # Assert does NOT end with -xyz.exe
    .*                    # Ok to match the filename (optional).
    """, 
    subject, re.VERBOSE):
    pass  # Successful match
else:
    pass  # Match attempt failed

Edit2: John Machin has pointed out that with Python, when you are looking for a match that can only occur at the start of a target string, then using the ^ start of string assertion with the re.search method is much slower than using re.match (and is considered bad practice). With that in mind, here is an improved version:

if re.match(
    r"""# Always use VERBOSE when composing non-trivial regex!
    # Apply multiple lookahead assertions from string start:
    (?!\.)                # Assert does not begin with dot.
    (?=.*\.(?:exe|EXE)$)  # Assert DOES end with .EXE or .exe
    (?!.*-xyz\.exe$)      # Assert does NOT end with -xyz.exe
    .*                    # Ok to match the filename (optional).
    """, 
    subject, re.VERBOSE):
    pass  # Successful match
else:
    pass  # Match attempt failed
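For repeated use, the same lookaheads can be compiled once and wrapped in a function. This is a sketch under a few assumptions: `EXE_NAME` and `is_wanted` are hypothetical names, `\Z` is swapped in for `$` so that a trailing newline is not tolerated, and the final `.*` is dropped because only the success of the match matters here.

```python
import re

# Lookahead-only pattern: it consumes nothing, it just accepts or rejects.
EXE_NAME = re.compile(
    r"""
    (?!\.)                 # does not begin with a dot
    (?=.*\.(?:exe|EXE)\Z)  # ends with .exe or .EXE
    (?!.*-xyz\.exe\Z)      # does not end with -xyz.exe
    """,
    re.VERBOSE,
)

def is_wanted(name):
    """Return True if name passes all three checks (hypothetical helper)."""
    return EXE_NAME.match(name) is not None

for name in ["123.exe", "123.EXE", ".123.exe", "123-xyz.exe", "123-xyz.EXE", "123.dat"]:
    print(name.ljust(14), is_wanted(name))
```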

I wouldn't use a regexp, just wrap that onto multiple lines and make it a little smarter:

if not filename.startswith(".") \
   and filename.lower().endswith(".exe") \
   and not filename.endswith("-xyz.exe"):
    pass  # do stuff

Do note that this differs slightly from the original: *.eXe, *.eXE and other mixed-case spellings of the extension are now accepted as well. But I'm betting that it doesn't really matter and that my test is better.

Edit: fixed the ".exe" part because I had the condition flipped. If you're trying to learn regular expressions, though, this is a weird, contrived example, and I think it's best not to shoehorn regular expressions into a problem they don't suit.
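If the exact behaviour of the original matters (only .exe and .EXE, nothing mixed-case), str.endswith also accepts a tuple of suffixes, so the check can stay regex-free. A small sketch, with `is_wanted` as a hypothetical helper name; the last column shows what the lower() variant would say for comparison:

```python
def is_wanted(name):
    # Hypothetical helper; same three checks as in the question.
    return (
        not name.startswith(".")
        and name.endswith((".exe", ".EXE"))  # tuple of allowed suffixes
        and not name.endswith("-xyz.exe")
    )

for name in ["123.exe", "123.EXE", "123.eXe", ".123.exe", "123-xyz.exe"]:
    print(name.ljust(12), is_wanted(name), name.lower().endswith(".exe"))
```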

You need to use a negative lookbehind assertion:

import re

regex = r'[^.].*(?:(?<!-xyz)\.exe|\.EXE)\Z'

vectors = (
  '.123.dat',
  '.123.exe',
  '.123.EXE',
  '123.dat',
  '123.exe',
  '123.EXE',
  '.123-xyz.dat',
  '.123-xyz.exe',
  '.123-xyz.EXE',
  '123-xyz.dat',
  '123-xyz.exe',
  '123-xyz.EXE',
)

for v in vectors:
  # Expected result: the original three-part check from the question.
  expected = (not v.startswith(".") and
      re.search("(.exe|.EXE)$", v) is not None and
      not v.endswith("-xyz.exe"))
  status = 'PASS' if bool(re.match(regex, v)) == expected else 'FAIL'
  print("%s: %s" % (v, status))

import re

pat = re.compile(r'(?!\.)'
                 r'.+'
                 r'\.'
                 r'(?:(?<!-xyz\.)exe|EXE)'
                 r'\Z')

names = ('.123.dat', '.123.exe', '.123.EXE',
         '123.dat', '123.exe', '123.EXE',
         '123-xyz.dat', '123-xyz.exe', '123-xyz.EXE', )

print('\n'.join(v.ljust(18)+str(bool(pat.match(v))) for v in names))

EDIT:

You are right, ridgerunner: [^.] is better than (?!\.). It's more readable, more logical, and slightly faster (about 4 % in my test).

I also compared '(?!\.).+?\.(?:EXE|(?<!-xyz\.)exe)\Z' (with .+? instead of .+).

With this RE, execution takes longer. The additional time depends on the number of dots in the tested names.

On names like '78999.abc.juty.123.dat' it is around 15 % slower, and on names like '123.dat' about 3 % slower. I think the reason is that the regex engine stops after each character it reads to check whether a dot follows.

By contrast, with '.+\.' the regex engine runs to the end of the string and then backtracks to find the last dot. I think this explanation is correct because, when the RE '(?!\.).+?\.(?:EXE|(?<!-xyz\.)exe)\Z' is tested on names like '123.teybertyhbeythbeytberyetynetynetnyetnydat', the time is longer still (+30 %).
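A rough way to reproduce this kind of comparison is sketched below. The repetition count and the `time_pattern` helper are my own choices for illustration, and the percentages will vary with the interpreter and machine, so none are claimed here.

```python
import re
import timeit

greedy = re.compile(r'(?!\.).+\.(?:EXE|(?<!-xyz\.)exe)\Z')
lazy = re.compile(r'(?!\.).+?\.(?:EXE|(?<!-xyz\.)exe)\Z')

names = ['123.dat', '78999.abc.juty.123.dat', '123.exe', '123-xyz.exe']

# Both variants must agree on what matches before timing means anything.
for n in names:
    assert bool(greedy.match(n)) == bool(lazy.match(n))

def time_pattern(pat, name, number=50000):
    """Seconds taken by `number` match attempts (hypothetical helper)."""
    return timeit.timeit(lambda: pat.match(name), number=number)

for n in names:
    print(n.ljust(24), "greedy %.4f s" % time_pattern(greedy, n),
          "lazy %.4f s" % time_pattern(lazy, n))
```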

I realized that my RE is very similar to Ignacio's, and I wondered why I had written it, since it seems to offer nothing in particular. In fact, my initial idea was to write '(?!\.).+?(?<=.EXE|(?<!-xyz).exe)\Z', and then I ended up writing a different one. By the way, with this abandoned RE, execution is 25 % slower on short names and 74 % slower on long names.

Finally, when I measured execution times, Ignacio's solution was 25 % slower on short names ('123.dat') and 47 % slower on long names ('78999.abc.juty.123.dat').

The best regex is then

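pat = re.compile(r'[^.]'
                 r'.*'
                 r'\.'
                 r'(?:(?<!-xyz\.)exe|EXE)'
                 r'\Z')

Here '.+' has to become '.*': '[^.]' already consumes the first character, so keeping '.+' would demand at least two characters before the final dot and reject a name like 'a.exe'. In the lookahead version, '.+' was the right choice, because '(?!\.)' consumes nothing and the name needs at least one character before '.exe' or '.EXE'.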
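A quick way to gain confidence in a pattern like this is to compare it with the plain three-part test on a list of names. The sketch below uses '.*' after '[^.]' so that single-character base names such as 'a.exe' are accepted; `original_check` is a hypothetical helper that restates the question's condition with the dot escaped and `\Z` as the end anchor.

```python
import re

pat = re.compile(r'[^.]'
                 r'.*'
                 r'\.'
                 r'(?:(?<!-xyz\.)exe|EXE)'
                 r'\Z')

def original_check(name):
    # The question's condition, with an escaped dot and \Z as end anchor.
    return (not name.startswith(".")
            and re.search(r"(\.exe|\.EXE)\Z", name) is not None
            and not name.endswith("-xyz.exe"))

names = ['.123.exe', '123.dat', '123.exe', '123.EXE',
         '123-xyz.exe', '123-xyz.EXE', 'a.exe', '.exe']

for n in names:
    # The two boolean columns should agree on every row.
    print(n.ljust(14), bool(pat.match(n)), original_check(n))
```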

That will involve accessing every file that doesn't start with '.' or ends with '-xyz.exe'. The regex module cannot parse stuff outside its namespace. I don't think it's possible, but have you tried checking the module's doc?
