
Regular Expression and escape sequences

I have a file containing a list of regular expressions to look for in a database.

One such pattern is (/|\)cmd\.com$. When I use it with the re module, it throws the error below. If I write the pattern as (/|\\\\)cmd\.com$, it works.

So the question is: when I read the pattern from a file into a variable (for example, a), how do I convert it to a pattern with four backslashes so that it works with Python's re module?

Also, how do we escape such escape sequences when the pattern is assigned to a variable, as with "a" below?

Any help on this is appreciated.

import re
a='(/|\)cmd\.com$'
re.compile(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

Thx, Santhosh


First note that your original regex is invalid. It should be (/|\\)cmd\.com$. If such a string comes from a database (or any source other than a string literal in your code), then no additional manipulation is needed before the regex engine sees it -- the backslashes are already correct.
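As a rough illustration of that point, the sketch below writes the corrected pattern to a temporary file, reads it back, and compiles it without any extra escaping. The file layout (one pattern per line) and the sample strings are assumptions made for the example, not something taken from the question.

```python
import os
import re
import tempfile

# Hypothetical pattern file: one regex per line, stored exactly as the
# regex engine should see it, i.e. (/|\\)cmd\.com$ with no extra escaping.
pattern_text = r'(/|\\)cmd\.com$'
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(pattern_text + '\n')
    path = f.name

# Read the patterns back, dropping only the trailing newline.
with open(path) as f:
    patterns = [line.rstrip('\n') for line in f if line.strip()]
os.remove(path)

# No conversion step: the text from the file goes straight to re.compile.
compiled = [re.compile(p) for p in patterns]

# Sample strings (made up): a / or a \ must come right before cmd.com.
hits = [bool(compiled[0].search(s))
        for s in ('/bin/cmd.com', 'C:\\windows\\cmd.com', 'xcmd.com')]
print(hits)
```

The only cleanup applied to each line is removing the newline; the backslashes in the file are passed through untouched.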

Full details and explanation:

Backslashes are special in that they escape other characters and give them different meanings.

a = '(/|\)cmd\.com$'

In this regular expression, the ) is special, indicating the end of a grouping expression; the backslash escapes it to make it interpreted as a literal ) instead, which is not what you want (and why you get the error about mismatched parentheses).

You need to escape the backslash to make it be interpreted as a literal \; this can be done using yet another backslash:

a = '(/|\\)cmd\.com$'

However, even this will not work as a normal string literal, since in Python there are two levels of processing going on (and thus two levels of escaping are needed). First, the string literal is evaluated, and the backslashes are interpreted string-wise: \. is not a recognized escape, so it is left as \. , whereas \\ evaluates to a single \. The literal above therefore produces the text (/|\)cmd\.com$, which is the invalid pattern again. Then, when the regex engine gets the string, it interprets any backslashes left in that object regex-wise (e.g. \. makes the . literal instead of "any character"). So you end up with:

a = '(/|\\\\)cmd\\.com$'    # Escaped version of (/|\\)cmd\.com$ which is what regex engine will see
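To make the two levels visible, here is a small sketch comparing a literal with one level of escaping against one with two. The helper name compiles is made up for the example, and \\. is written instead of \. only to avoid the invalid-escape warning newer Python versions emit; both spellings produce the same text.

```python
import re

# One level of escaping: the literal evaluates to the text (/|\)cmd\.com$
one_level = '(/|\\)cmd\\.com$'
# Two levels of escaping: the literal evaluates to the text (/|\\)cmd\.com$
two_levels = '(/|\\\\)cmd\\.com$'

# Printing shows the text the regex engine actually receives.
print(one_level)
print(two_levels)

def compiles(pattern):
    """Hypothetical helper: report whether re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Only the two-level version reaches the engine as a valid pattern.
print(compiles(one_level), compiles(two_levels))
```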

Because this problem is so common, Python has a way of writing strings such that the backslash is not treated specially in the string-processing stage: "raw" string literals:

a = r'(/|\\)cmd\.com$'    # backslashes here will be interpreted as literal \ characters

The regex engine will still interpret the backslashes in the string specially (a raw string is just a way of writing the literal; it still results in a plain str object).
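A short check of that claim, as a sketch with a made-up sample path: the raw literal and the doubly escaped literal should be the same plain string, and the regex engine treats both identically.

```python
import re

raw = r'(/|\\)cmd\.com$'
escaped = '(/|\\\\)cmd\\.com$'

# Two spellings of the same literal, one resulting str object value.
print(raw == escaped)
print(type(raw) is str)

# The regex engine still interprets \\ as "one literal backslash".
m = re.search(raw, 'C:\\windows\\system32\\cmd.com')
print(m.group(0))
```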


In your example above, you need to make the regex pattern a Python "raw" string, like so:

  re.compile(r'put the pattern here')

If you post your code I might be able to help with your question about loading patterns from a file.
