How do I find the shortest overlapping match using regular expressions?
I'm still relatively new to regex. I'm trying to开发者_运维技巧 find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a substring of a larger match. For example:
import re
string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'a.*?b.*?c'
my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string)
for match in matches:
print match
prints:
A|B|A|B|C
but I'd want it to return:
A|B|C
Is there a way to do this without having to loop over each match to see if it contains a substring that matches?
Contrary to most other answers here, this can be done in a single regex using a positive lookahead assertion with a capturing group:
>>> my_pattern = '(?=(a.*?b.*?c))'
>>> my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
>>> matches = my_regex.findall(string)
>>> print min(matches, key=len)
A|B|C
findall()
will return all possible matches, so you need min()
to get the shortest one.
How this works:
- We're not matching any text in this regex, just positions in the string (which the regex engine steps through during a match attempt).
- At each position, the regex engine looks ahead to see whether your regex would match at this position.
- If so, it will be captured by the capturing group.
- If not, it won't.
- In either case, the regex engine then steps ahead one character and repeats the process until the end of the string.
- Since the lookahead assertion doesn't consume any characters, all overlapping matches will be found.
No. Perl returns the longest, leftmost match, while obeying your non-greedy quantifiers. You'll have to loop, I'm afraid.
Edit: Yes, I realize I said Perl above, but I believe it is true for Python.
This might be a useful application of sexegers. Regular-expression matching is biased toward the longest, leftmost choice. Using non-greedy quantifiers such as in .*?
skirts the longest part, and reversing both the input and pattern can get around leftmost-matching semantics.
Consider the following program that outputs A|B|C
as desired:
#! /usr/bin/env python
import re
string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'c.*?b.*?a'
my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string[::-1])
for match in matches:
print match[::-1]
Another way is to make a stricter pattern. Say you don't want to allow repetitions of characters already seen:
my_pattern = 'a[^a]*?b[^ab]*?c'
Your example is generic and contrived, but if we had a better idea of the inputs you're working with, we could offer better, more helpful suggestions.
Another regex solution; it finds only the last occurence of .*a.*b.*c:
my_pattern = 'a(?!.*a.*b.*c).*b[^c]*c'
a(?!.*a.*?b.*?c)
ensures that there is no 'a.*?b.*?c'
after first 'A'
strings like A|A|B|C or A|B|A|B|C or A|B|C|A|B|C in results are eliminated
b[^c]*c
ensures that after 'B' there is only one 'C'
strings like A|B|C|B|C or A|B|C|C in results are eliminated
So you have the smallest matching 'a.*?b.*?c'
The regex engine starts searching from the beginning of the string till it finds a match and then exits. Thus if it finds a match before it even considers the smaller one, there is no way for you to force it to consider later matches in the same run - you will have to rerun the regex on substrings.
Setting the global flag and choosing the shortest matched string won't help as it is evident from your example - the shorter match might be a substring of another match (or partly included in it). I believe you will have to start subsequent searches from (1 + index of previous match) and go on like that.
I do not think that this task can be accomplished by a single regular expression. I have no proof that this is the case, but there are quite a lot of things that can't be done with regexes and I expected this problem to be one of them. Some good examples of the limitations of regexes are given in this blog post.
You might be able to write the regex in such a way that it can't contain smaller matches.
For your regex:
a.*?b.*?c
I think you can write this:
a[^ab]*b[^c]*c
It's tricky to get that correct, and I don't see any more general or more obviously correct way to do it. (Edit—earlier I suggested a negative lookahead assertion, but I don't see a way to make that work.)
A Python loop to look for the shortest match, by brute force testing each substring from left to right, picking the shortest:
shortest = None
for i in range(len(string)):
m = my_regex.match(string[i:])
if m:
mstr = m.group()
if shortest is None or len(mstr) < len(shortest):
shortest = mstr
print shortest
Another loop, this time letting re.findall do the hard work of searching for all possible matches, then brute force testing each match right-to-left looking for a shorter substring:
# find all matches using findall
matches = my_regex.findall(string)
# for each match, try to match right-hand substrings
shortest = None
for m in matches:
for i in range(-1,-len(m),-1):
mstr = m[i:]
if my_regex.match(mstr):
break
else:
mstr = m
if shortest is None or len(mstr) < len(shortest):
shortest = mstr
print shortest
No, there isn't in the Python regular expression engine.
My take for a custom function, though:
import re, itertools
# directly from itertools recipes
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = itertools.tee(iterable)
for elem in b:
break
return itertools.izip(a, b)
def find_matches(rex, text):
"Find all matches, even overlapping ones"
matches= list(rex.finditer(text))
# first produce typical matches
for match in matches:
yield match.group(0)
# next, run it for any patterns included in matches
for match1, match2 in pairwise(matches):
subtext= text[match1.start()+1:match2.end()+1]
for result in find_matches(rex, subtext):
yield result
# also test the last match, if there was at least one
if matches:
subtext= text[matches[-1].start()+1:matches[-1].end()+1]
# perhaps the previous "matches[-1].end()+1" can be omitted
for result in find_matches(rex, subtext):
yield result
def shortest_match(rex, text):
"Find the shortest match"
return min(find_matches(rex, text), key=len)
if __name__ == "__main__":
pattern= re.compile('a.*?b.*?c', re.I)
searched_text= "A|B|A|B|C|D|E|F|G"
print (shortest_match(pattern, searched_text))
精彩评论