word boundary regex problem (overlap)
Given the following code:
var myList = new List<string> { "red shirt", "blue", "green", "red" };
Regex r = new Regex("\\b(" + string.Join("|", myList.ToArray()) + ")\\b");
MatchCollection m = r.Matches("Alfred has a red shirt and blue tie");
I want the result of m to include "red shirt", "blue", "red", since all of those are in the string, but I am only getting "red shirt", "blue". What can I do to include overlaps?
It seems to me that the regexp engine consumes the matched text as soon as the first valid match is found, so later matches cannot overlap it. I don't have a Windows compiler set up right now, so I can't give an apples-to-apples comparison, but I see similar results in Perl.
I think your regex would look something like this after being joined.
'\b(red shirt|blue|green|red)\b'
Testing this regexp, I see the same result: "red shirt", "blue". Moving "red shirt" to the end of the alternation list gives:
'\b(red|blue|green|red shirt)\b'
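I now see "red", "blue", because the alternatives are tried left to right and "red" matches before "red shirt" is ever attempted.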
With a slightly more complicated regexp you might be able to achieve the results you want.
\b(blue|green|(red) shirt)\b
This should match "red shirt" as the outer group and capture "red" as its own subgroup, returning "red shirt", "red", "blue".
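As a rough sketch of how the nested-group pattern might be read back in C#: "red" is not a separate match here, it comes from the second capture group of the "red shirt" match. The class name and the way results are collected are only illustrative. Note also that this pattern would not match a "red" that is not followed by " shirt".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class NestedGroupDemo
{
    static void Main()
    {
        string input = "Alfred has a red shirt and blue tie";

        // Group 1 is the whole alternative; group 2 is the nested (red).
        var r = new Regex(@"\b(blue|green|(red) shirt)\b");

        var found = new List<string>();
        foreach (Match m in r.Matches(input))
        {
            found.Add(m.Groups[1].Value);

            // Group 2 only participates when the "red shirt" branch matched.
            if (m.Groups[2].Success)
            {
                found.Add(m.Groups[2].Value);
            }
        }

        // Prints: red shirt, red, blue
        Console.WriteLine(string.Join(", ", found));
    }
}
```

The drawback is that the pattern has to be hand-built around the overlapping terms, so it does not drop straight out of string.Join over the original list.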
A simpler way would be to loop through your list of strings and match one at a time, if you are going to have many word groups that need multiple matches, like "red" and "red shirt".
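A minimal sketch of that loop, reusing the list and input string from the question. The class name is made up, and the Regex.Escape call is my own addition to guard against terms containing regex metacharacters; it is not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class PerTermDemo
{
    static void Main()
    {
        var myList = new List<string> { "red shirt", "blue", "green", "red" };
        string input = "Alfred has a red shirt and blue tie";

        var found = new List<string>();
        foreach (string term in myList)
        {
            // One regex per term, so a match for "red shirt" cannot
            // hide the "red" inside it from a later term.
            var r = new Regex(@"\b" + Regex.Escape(term) + @"\b");
            foreach (Match m in r.Matches(input))
            {
                found.Add(m.Value);
            }
        }

        // Prints: red shirt, blue, red
        Console.WriteLine(string.Join(", ", found));
    }
}
```

The results come back in list order rather than in order of position in the string; if position matters, m.Index could be stored alongside each value and the results sorted afterwards.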
Since there are so many ways to do regexp, I am probably missing an obvious and elegant solution.