Two questions about Python regular expressions
Q1. Why can't we use word boundaries and backreferences without putting r at the start of the regex? For example, '\b[a-z]{5}\d{3}\b' does not work, but r'\b[a-z]{5}\d{3}\b' does.
Q2. Why doesn't Python support variable-length negative lookbehind assertions, while it does support variable-length negative lookahead assertions? C# supports both, and I think variable-length negative lookbehind would be an excellent feature to have in Python as well.
Please clarify these two points. Thanks.
It does work without raw strings:
'\\b[a-z]{5}\\d{3}\\b'
You just need to escape every backslash by doubling it. Otherwise Python itself interprets the escape first: in an ordinary string literal, '\b' is a backspace character, not a word boundary.
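To make that concrete, here is a minimal sketch comparing the raw and the doubled-backslash spellings. The sample text and variable names are made up for illustration.

```python
import re

raw = r'\b[a-z]{5}\d{3}\b'         # raw string: backslashes reach the regex engine untouched
escaped = '\\b[a-z]{5}\\d{3}\\b'   # ordinary string with every backslash doubled

same = (raw == escaped)            # both literals produce the identical pattern text

text = 'id hello123 end'           # hypothetical sample input
m_raw = re.search(raw, text)       # matches 'hello123'

# In an ordinary literal, '\b' is the backspace character (code 8), so this
# pattern looks for a backspace followed by five letters and finds nothing.
m_backspace = re.search('\b[a-z]{5}', text)

print(same, m_raw.group(), m_backspace, ord('\b'), len(r'\b'))
```

The two spellings are interchangeable. The raw form is simply easier to read.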
Variable-length assertions are one of those features that some implementations support and some don't. Python's standard re module requires lookbehind patterns to have a fixed width. Check out the regex module on PyPI for a version with more features and better Unicode support, which may eventually replace the standard library re.
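A quick way to see the limitation, sketched with the standard library only. The patterns are made-up examples, and the behaviour shown is what I would expect from current CPython releases.

```python
import re

# Fixed-width negative lookbehind compiles and works.
fixed = re.compile(r'(?<!foo)bar')
fixed_hits = fixed.findall('foobar xbar')   # only the second 'bar' qualifies

# Variable-length negative lookahead is accepted.
ahead = re.compile(r'bar(?!fo+)')

# Variable-length negative lookbehind is rejected at compile time.
try:
    re.compile(r'(?<!fo+)bar')
    variable_ok = True
except re.error as exc:
    variable_ok = False
    reason = str(exc)                       # complains that the width is not fixed

print(fixed_hits, variable_ok)
```

If you need the variable-length form, the third-party regex module mentioned above is the usual suggestion. It is not imported here, so the sketch runs without extra dependencies.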
Edit: To make the version from your comment work without raw strings, use:
re.sub('[a-z]+(\\d+)', '\\1', string)
Again, Python interprets backslashes in ordinary string literals: it reads '\1' as an octal escape, i.e. the single character with code 1. If you actually mean a backslash followed by 1, you need to escape the backslash by writing \\1, or use a raw string.
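Here is a small sketch of the three spellings side by side. The input string is invented for the example.

```python
import re

string = 'abc123 def45'   # hypothetical input

# Raw strings: the replacement r'\1' is a backreference to group 1.
raw_result = re.sub(r'[a-z]+(\d+)', r'\1', string)

# Ordinary strings with doubled backslashes: identical behaviour.
escaped_result = re.sub('[a-z]+(\\d+)', '\\1', string)

# Ordinary string with a single backslash: '\1' is the character with code 1,
# so each match is replaced by that control character instead of the digits.
broken_result = re.sub(r'[a-z]+(\d+)', '\1', string)

print(repr(raw_result), repr(escaped_result), repr(broken_result))
# → '123 45' '123 45' '\x01 \x01'
```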
Edit 2: Adding the link from @Nate's comment to the list of Python escape sequences.
Regarding your first question: the r prefix designates a "raw string". Without it, your backslashes are interpreted as escape sequences. If you don't want to use raw strings, you can write '\\b[a-z]{5}\\d{3}\\b', although this is far less readable. You can read more detail about raw strings here.
Regarding your second question, take a look at this excellent question, which discusses the differences between the regular expression flavors used by different languages (namely C#, Java, and Python).
You can find almost all of this information in the tutorial, which is your best friend:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
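The literal-backslash case from the quote can be checked directly. This is a small sketch, with a made-up path as the sample text.

```python
import re

path = 'C:\\temp'                    # the 7-character string C:\temp

# The regex needed is two characters (\\); as an ordinary literal that takes
# four backslashes, as a raw literal only two.
plain_pattern = '\\\\'
raw_pattern = r'\\'
patterns_equal = (plain_pattern == raw_pattern)

m = re.search(raw_pattern, path)     # finds the single backslash at index 2

# The \n example from the quote: raw keeps two characters, plain is a newline.
raw_len, plain_len = len(r"\n"), len("\n")

print(patterns_equal, m.start(), raw_len, plain_len)
# → True 2 2 1
```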
It is quite hard to answer your second question. I think the authors implement only those features they consider necessary. They try to add code that is useful for most users, but it is impossible to implement every feature quickly.