Regular Expression and escape sequences
I have a file that contains a list of regular expressions to look for in a database.
One such pattern is (/|\)cmd\.com$. When I use it with the re module, it raises the error below. If I write the pattern as (/|\\\\)cmd\.com$, it works.
So the question is: when I read a pattern from a file into a variable (for example, a), how do I convert it into a pattern with four backslashes so that it works with Python's re module?
Also, how do we escape such escape sequences when the pattern is assigned to a variable in code, as with "a" below?
Any help is appreciated.
import re
a='(/|\)cmd\.com$'
re.compile(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis
Thx, Santhosh
First, note that your original regex is invalid. It should be (/|\\)cmd\.com$. If such a string comes from a database (or any source other than a string literal in your code), then no additional manipulation is needed before the regex engine sees it; the backslashes are already correct.
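To illustrate the point about patterns that come from outside your source code, here is a minimal sketch (written for Python 3, unlike the 2.6 traceback in the question). The io.StringIO object is only a stand-in for your pattern file, and the sample paths are made up for the example; the assumption is that the file holds the pattern with two backslashes, exactly as the regex engine should see it.

```python
import io
import re

# Stand-in for the pattern file (an assumption for this sketch). The raw
# literal means the "file" contains exactly this text: (/|\\)cmd\.com$
pattern_file = io.StringIO(r"(/|\\)cmd\.com$" + "\n")

# Strip the trailing newline and compile each line as-is:
# text read from a file goes through no string-literal escaping.
patterns = [re.compile(line.rstrip("\n")) for line in pattern_file if line.strip()]

# Hypothetical sample inputs.
print(bool(patterns[0].search(r"C:\windows\cmd.com")))  # True: backslash before cmd.com
print(bool(patterns[0].search("/bin/cmd.com")))         # True: forward slash before cmd.com
print(bool(patterns[0].search("xcmd.com")))             # False: no slash before cmd.com
```

With a real file you would replace the StringIO object with open(...) on your own path; nothing else should need to change.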
Full details and explanation:
Backslashes are special in that they escape other characters and give them different meanings.
a = '(/|\)cmd\.com$'
In this regular expression, the ) is special, indicating the end of a grouping expression; the backslash escapes it so that it is interpreted as a literal ) instead, which is not what you want (and is why you get the error about unbalanced parentheses).
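A quick way to check this, as a sketch in Python 3: the r'' prefix (raw strings are covered further down) is used here only so that the literal contains exactly one backslash before the ), and the variable names are made up for the example.

```python
import re

# Exactly the text (/|\)cmd\.com$ with a single backslash before the ')'.
broken = r'(/|\)cmd\.com$'

# On its own, \) is a regex that matches a literal ')' character.
print(re.fullmatch(r'\)', ')') is not None)  # True

# So the '(' in the full pattern never gets closed and compiling fails.
try:
    re.compile(broken)
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)
```

The exact wording of the error message differs between Python versions, so it is printed rather than compared.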
You need to escape the backslash so that it is interpreted as a literal \; this can be done using yet another backslash:
a = '(/|\\)cmd\.com$'
However, even this will not work, because in Python there are two levels of processing going on (and thus two levels of escaping are needed). First, the string literal is evaluated, and the backslashes are interpreted string-wise: \. is not a recognized string escape, so it evaluates to \. unchanged, whereas \\ evaluates to a single \. Then, when the regex engine gets the string, it interprets any literal backslashes in that object regex-wise (e.g. \. makes the . literal instead of "any character"). So you end up with:
a = '(/|\\\\)cmd\\.com$' # Escaped version of (/|\\)cmd\.com$ which is what regex engine will see
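You can see the two levels for yourself with a sketch like the following (Python 3; the test strings are made up for illustration). Printing the string shows what is left after the string-literal stage, which is exactly what the regex engine receives.

```python
import re

a = '(/|\\\\)cmd\\.com$'

# Level 1: string-literal evaluation turns each \\ into one backslash.
print(a)              # (/|\\)cmd\.com$
print(a.count('\\'))  # 3 backslashes remain in the str object

# Level 2: the regex engine reads \\ as a literal backslash and \. as a literal dot.
pattern = re.compile(a)
print(pattern.search('C:\\cmd.com') is not None)  # True: backslash before cmd.com
print(pattern.search('/cmd.com') is not None)     # True: forward slash before cmd.com
print(pattern.search('/cmdxcom') is not None)     # False: the dot is literal
```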
Because this problem is so common, Python has a way of writing strings so that the backslash is not treated specially in the string-processing stage, namely "raw" string literals:
a = r'(/|\\)cmd\.com$' # backslashes here will be interpreted as literal \ characters
The regex engine will still interpret the backslashes in the string specially (a raw string is just a way of writing the literal; it still results in a plain str object).
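As a quick check (a Python 3 sketch, with names made up for illustration), the raw literal and the fully escaped literal produce the same ordinary str, so the regex engine cannot tell them apart:

```python
import re

raw = r'(/|\\)cmd\.com$'
escaped = '(/|\\\\)cmd\\.com$'

print(raw == escaped)    # True: same characters, two spellings
print(type(raw) is str)  # True: there is no separate "raw string" type

# Both compile to the same pattern text.
print(re.compile(raw).pattern == re.compile(escaped).pattern)  # True
```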
In your example above, you need to make the regex pattern a Python "raw" string, like so:
re.compile(r'put the pattern here')
If you post your code I might be able to help with your question about loading patterns from a file.