
Python regex question: the ? operator

In the following examples, I am trying to craft a regex to find the last group of one or more consecutive digits in a line.

As far as I know, in Python 3, re.search() goes through the search string trying to match from left to right.

Does that explain the behavior in the examples below? Specifically, is that why '.*?' is needed before the capture group (when the regex is anchored to the front, as in the first two examples) for the group to capture both digits, while the '?' is optional when the regex is anchored to the end of the line (as in the last two examples)?

Python 3.1.2 (release31-maint, Sep 17 2010, 20:27:33) 
>>> import re
>>> a = "hi there in the morning {23)"
>>> R = re.compile('^.*(\d+)', re.IGNORECASE); print(R.search(a).group(1))
3 
>>> R = re.compile('^.*?(\d+)', re.IGNORECASE); print(R.search(a).group(1))
23
>>> R = re.compile('(\d+).*$', re.IGNORECASE); print(R.search(a).group(1))
23
>>> R = re.compile('(\d+).*?$', re.IGNORECASE); print(R.search(a).group(1))
23


  • ^.*(\d+) - Match from the start. .* will match all the way to the end of the line, and then \d+ will make .* backtrack (cancel previous matches), as little as needed, so \d+ will only match the last digit.
  • ^.*?(\d+) - Match from the start. .*? matches nothing at first. \d+ will then fail (if the first character isn't a digit), which makes .*? backtrack and match one more character at a time until it reaches the first digit, and then + will match all digits after it. For abc123edf567, the pattern will match 123, the first set of digits.
  • (\d+).*$ - \d+ will match the first set of digits. .*$ will always succeed and match all the way to the end.
  • (\d+).*?$ - This one is actually the same (though arguably marginally slower). \d+ will match the first set of digits. .*?$ will match nothing at first, but then match more and more characters until it reaches $. Keep in mind the *? is lazy to the left, but it doesn't mean the engine will take as few characters as needed from the right.
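The four cases above can be checked on the abc123edf567 string mentioned in the second bullet. This is just a small sketch; the names s, patterns and results are made up for the illustration.

```python
import re

s = "abc123edf567"

# The four patterns from the question, written as raw strings.
patterns = [r'^.*(\d+)', r'^.*?(\d+)', r'(\d+).*$', r'(\d+).*?$']

# Map each pattern to what its capture group ends up holding.
results = {p: re.search(p, s).group(1) for p in patterns}

for p, captured in results.items():
    print(p, '->', captured)
# ^.*(\d+)   -> 7    (greedy .* leaves only the last digit)
# ^.*?(\d+)  -> 123  (lazy .*? stops at the first digit)
# (\d+).*$   -> 123  (first run of digits)
# (\d+).*?$  -> 123  (same as above)
```

Note that none of the four captures 567, the last group of digits.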

What you're probably looking for is (\d+)\D*$ - Match a set of digits that is followed only by non-digits and then the end of the line. This will return the last set of digits.
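A minimal sketch of that suggestion wrapped in a helper. The names LAST_DIGITS and last_digits are hypothetical, and returning None when there are no digits is my own assumption about what you'd want.

```python
import re

# Digits, then only non-digits up to the end of the line.
LAST_DIGITS = re.compile(r'(\d+)\D*$')

def last_digits(line):
    """Return the last run of digits in line, or None if there is none."""
    m = LAST_DIGITS.search(line)
    return m.group(1) if m else None

print(last_digits("hi there in the morning {23)"))  # 23
print(last_digits("abc123edf567"))                  # 567
print(last_digits("no digits here"))                # None
```

If you don't need a single regex, re.findall(r'\d+', line) returns every run of digits in order, and you can take the last element of that list instead.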

See also: regular-expressions.info - Laziness Instead of Greediness


By itself, * is greedy; it will match as much as it can while allowing the regex as a whole to match, so .* will gobble up everything but the single digit necessary to match \d+.

*? is non-greedy: it matches as little as it can, so .*? only consumes the non-digits before the first digit and leaves both digits for \d+.


As Wooble said, .* is greedy. In your first example, the greediest match for .* that still lets the whole pattern succeed is hi there in the morning {2, since \d+ is satisfied by a single digit, 3.
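One way to see this for yourself is to put the .* part in its own group and print what it consumed. The extra parentheses are added only for this illustration and are not part of the original patterns.

```python
import re

a = "hi there in the morning {23)"

# Group 1 is the prefix eaten by .* or .*?, group 2 is the digits.
greedy = re.search(r'^(.*)(\d+)', a)
lazy = re.search(r'^(.*?)(\d+)', a)

print(repr(greedy.group(1)), greedy.group(2))  # 'hi there in the morning {2' 3
print(repr(lazy.group(1)), lazy.group(2))      # 'hi there in the morning {' 23
```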
