Python regex: how to check if a character in a string is within the span of a regex matched substring?
I have a regex pattern, which I used on a large piece of text (a single string). Several discontiguous regions of the original text match the regexp. Now I'm trying to build a state machine that iterates over the text and does different things based on the char at a position and on whether that position is within the span of a regex match.
With RE.finditer(text), I can find all substrings and extract their spans, so I have a list of tuples to work with, e.g.
(1, 5) (10, 15) (20, 55), etc.
With this information, given the index of a character in my string, I can write an algorithm to see whether that character is part of a regex match. For example, given character 6, I can go through the list of spans and determine that it is not part of a matched substring.
Is there a better way of doing this?
Thanks in advance,
JW
EDIT: It sounds like you want to write your own parser FSM which (among other things) tokenizes comma characters, only when they are not escaped. The following regex works for an identifier, possibly containing escaped commas. You could use this with antlr/lex:
import re
text = r'aaaaa,bbbb/,ccccc,dddddd,'
pat = re.compile(r'((\w+|/,)+)')
for mat in pat.finditer(text):
    print(mat.group(0))  # ... do stuff with mat.group(0)
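To tie this back to the spans in the question, here is a small sketch that collects each token together with its span. It reuses the sample string and pattern above; the variable name `tokens` is just for illustration.

```python
import re

text = r'aaaaa,bbbb/,ccccc,dddddd,'
pat = re.compile(r'((\w+|/,)+)')

# Pair each matched token with its half-open (start, end) span.
tokens = [(m.span(), m.group(0)) for m in pat.finditer(text)]

for span, token in tokens:
    print(span, token)
```

Note that the end of each span is exclusive, so a span of (0, 5) covers indices 0 through 4.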
(Original answer: That could be a good solution, but you're not giving us enough context to tell.
Does the character occur once or multiple times? If it occurs once, you could just check whether the index from string.find(char) lies inside the spans of the regex matches.
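A minimal sketch of that check, assuming the character occurs at most once and the spans come from re.finditer. The sample text, the digit pattern, and the helper name `in_any_span` are made up for illustration.

```python
import re

def in_any_span(index, spans):
    """Return True if index falls inside any half-open (start, end) span."""
    return any(start <= index < end for start, end in spans)

text = "foo 123 bar 4567 baz"
spans = [m.span() for m in re.finditer(r"\d+", text)]

idx = text.find("5")  # str.find returns -1 when the character is absent
found_inside = idx != -1 and in_any_span(idx, spans)
print(idx, found_inside)
```

This is a linear scan over the spans, which is fine for a one-off lookup but gets slow if you repeat it for every character of a large text.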
Is the character arbitrary? Please give us a specific example. Why are you doing this on a per-character basis? Presumably you're not sequentially checking multiple chars?
Is your desired result boolean ('Yes, char was found inside the span of some regex match')? And what do you do in the case where the char was found OUTside a regex match?)
Edit
Here's a regex which will grab the text between , while ignoring escaped , :
(?<=,)(?:[^,]|(?<=/),)+(?=,)
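A quick way to try a pattern of this shape in Python. The version below is written with lookbehind assertions, (?<=...), and a + repetition; both are assumptions about the intended regex, and the sample string is borrowed from the other answer.

```python
import re

# Assumed form: a field starts right after a comma, may contain "/," (escaped comma),
# and ends right before an unescaped comma.
field = re.compile(r'(?<=,)(?:[^,]|(?<=/),)+(?=,)')

text = r'aaaaa,bbbb/,ccccc,dddddd,'
fields = [(m.span(), m.group(0)) for m in field.finditer(text)]
print(fields)
```

Because the pattern requires a comma on both sides, the first field ('aaaaa') is not matched here; it has no leading comma.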
Original Answer
Here is some Python code that should do what you're looking for:
import re

pattern = re.compile(...)  # your pattern goes here
pos = 0
for match in pattern.finditer(haystack):
    for i in range(pos, match.start()):
        pass  # haystack[i] is outside the match
    for i in range(match.start(), match.end()):
        pass  # haystack[i] is in the match
    pos = match.end()
# Finish with the rest of the chars not matched
for i in range(pos, len(haystack)):
    pass  # haystack[i] is outside the match
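If you do need random access ("is index i inside some match?") rather than a single left-to-right pass, one option is a binary search over the span starts. This is only a sketch, assuming the spans are sorted and non-overlapping (which is what finditer produces) and that each end is exclusive, as with match.span(). The helper names `build_lookup` and `in_match` are made up; the spans are the ones from the question.

```python
from bisect import bisect_right

def build_lookup(spans):
    """Split sorted, non-overlapping spans into parallel start and end lists."""
    starts = [start for start, _ in spans]
    ends = [end for _, end in spans]
    return starts, ends

def in_match(index, starts, ends):
    """Return True if index lies inside one of the half-open spans."""
    i = bisect_right(starts, index) - 1  # last span starting at or before index
    return i >= 0 and index < ends[i]

spans = [(1, 5), (10, 15), (20, 55)]
starts, ends = build_lookup(spans)

print(in_match(6, starts, ends))   # character 6 from the question
print(in_match(12, starts, ends))
```

Each lookup is O(log n) in the number of spans instead of a scan through the whole list.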