Getting all subgroups with a regex match
Given the string:
© 2010 Women’s Flat Track Derby Association (WFTDA)
I want:
2010 -- Women's -- Flat
Women's -- Flat -- Track
Track -- Derby -- Association
I'm using regex:
([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)
It's only returning:
s -- Flat -- Track
This problem isn't straightforward, but to understand why, you need to understand how the regular expression engine operates on your string.
Let's consider the pattern [a-z]{3} (match 3 successive characters between a and z) on the target string abcdef. The engine starts from the left side of the string (before the a), and sees that a matches [a-z], so it advances one position. Then, it sees that b matches [a-z] and advances again. Finally, it sees that c matches, advances again (to before d) and returns abc as a match.
If the engine is set up to return multiple matches, it will now try to match again, but it keeps its positional information (so, like above, it'll match and return def).
Because the engine has already moved past the b while matching abc, bcd will never be considered as a match. For this same reason, in your expression, once a group of words is matched, the engine will never consider words within the first match to be a part of the next one.
In order to get around this, you need to use capturing groups inside of lookaheads to collect matching words that appear later in the string:
var str = "2010 Women's Flat Track Derby Association",
regex = /([a-z0-9']+)(?=\s+([a-z0-9']+)\s+([a-z0-9']+))/ig,
match;
while (match = regex.exec(str))
{
var group1 = match[1], group2 = match[2], group3 = match[3];
console.log("Found match: " + group1 + " -- " + group2 + " -- " + group3);
}
This results in:
2010 -- Women's -- Flat
Women's -- Flat -- Track
Flat -- Track -- Derby
Track -- Derby -- Association
See this in action at http://jsfiddle.net/jRgXm/.
The regular expression searches for what you seem to be defining as a word ([a-z0-9']+) and captures it into subgroup 1. It then uses a lookahead (a zero-width assertion, so it doesn't advance the engine's cursor) that captures the next two words into subgroups 2 and 3.
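To make the "zero-width" part concrete, here is a rough sketch that records where the engine resumes after each match. The function name traceCursor and the property names in the returned objects are invented for this illustration; match.index and regex.lastIndex are the standard properties.

```javascript
// Hypothetical helper: record what each match consumed and where the
// engine will resume scanning afterwards.
function traceCursor(text) {
  const regex = /([a-z0-9']+)(?=\s+([a-z0-9']+)\s+([a-z0-9']+))/ig;
  const steps = [];
  let match;
  while ((match = regex.exec(text)) !== null) {
    steps.push({
      consumed: match[0],                 // only the first word is consumed
      lookedAhead: [match[2], match[3]],  // captured inside the lookahead
      start: match.index,
      resumeAt: regex.lastIndex,          // position right after the first word
    });
  }
  return steps;
}

console.log(traceCursor("2010 Women's Flat Track Derby Association"));
```

For the first match, consumed is just "2010" and resumeAt is 4, even though Women's and Flat were captured too. That is why the next match is free to start at Women's.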
Note that if you are using the actual JavaScript engine, you must use RegExp.exec and loop over the results (see this question for a discussion of why) or use the newer matchAll method (ES2020). I don't know how UltraEdit's engine is implemented, but hopefully it can do a global search and also collect subgroups.
Just for completeness, here's the example above using ES2020's matchAll (the first element in each returned array is the total match, then the subsequent elements are the capture groups):
const str = "2010 Women's Flat Track Derby Association";
const regex = /([a-z0-9']+)(?=\s+([a-z0-9']+)\s+([a-z0-9']+))/ig;
console.log([...str.matchAll(regex)]);
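If you want the same "a -- b -- c" lines as the exec loop rather than raw match arrays, something along these lines should work. The function name formatTriples is made up for this sketch; it relies only on String.prototype.matchAll and Array.from.

```javascript
// Hypothetical helper: turn each matchAll result into "first -- second -- third".
function formatTriples(text) {
  const regex = /([a-z0-9']+)(?=\s+([a-z0-9']+)\s+([a-z0-9']+))/ig;
  // Skip element 0 (the total match) and keep the three capture groups.
  return Array.from(
    text.matchAll(regex),
    ([, first, second, third]) => `${first} -- ${second} -- ${third}`
  );
}

console.log(formatTriples("2010 Women's Flat Track Derby Association").join("\n"));
```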
I'm using some generic regex tester, so I can't guarantee it will work for you, but...
([A-Z0-9][\w’]+)\s([A-Z][\w]+)\s([A-Z][\w]+)
Three words starting with a number or capital letter followed by letters/numbers or that funky apostrophe, separated by spaces. Works for me.
Edit: I assume you can loop through, repeating the matcher, in JS; I've never used it.
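For what it's worth, looping over this pattern in JavaScript might look like the sketch below (the function name loopTriples is made up here, and I'm assuming the g flag is added). Keep in mind that this pattern consumes all three words on each match, so overlapping triples are skipped, and \w does not match the curly apostrophe, so Women’s can only be the first word of a triple.

```javascript
// Hypothetical helper: repeat the matcher over the whole string.
function loopTriples(text) {
  const regex = /([A-Z0-9][\w’]+)\s([A-Z][\w]+)\s([A-Z][\w]+)/g;
  const triples = [];
  let match;
  while ((match = regex.exec(text)) !== null) {
    triples.push(match[1] + " -- " + match[2] + " -- " + match[3]);
  }
  return triples;
}

console.log(loopTriples("© 2010 Women’s Flat Track Derby Association (WFTDA)"));
```

On the string from the question, this finds only Women’s -- Flat -- Track, so you would still need a lookahead to get every triple.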