Anything wrong with this RegEx?
I'm using a RegEx on an XML dump of a Wikipedia article.
The regex is: {{[a-zA-Z0-9_\(\)\|\?\s\-\,\/\=\[\]\:.]+}}
I want to detect all the text wrapped in {{ and }}.
But instead of the 56 matches I get from a simple search for {{, it only detects 45.
A sample block it doesn't detect is: {{cite journal | last = Heeks | first = Richard | year = 2008 | title = Meet Marty Cooper - the inventor of the mobile phone | journal = BBC | volume = 41 | issue = 6 | url = http://news.bbc.co.uk/2/hi/programmes/click_online/8639590.stm | pages = 26–33 | doi = 10.1109/MC.2008.192 }}
But it does detect: {{cite web | title = Of Cigarettes and Cellphones | last = Ulyseas | first = Mark | date = 2008-01-18 | url = http://www.thebalitimes.com/2008/01/18/of-cigarettes-and-cellphones/ | publisher = The Bali Times | accessdate = 2008-02-24 }}
Can anyone please help me find the problem?
Some of the escaping is superfluous, but I don't think that's the real problem.
I recommend trying \w instead of a-zA-Z0-9_, especially because in .NET regex \w also recognizes Unicode letters (unless it's in ECMAScript-compliant mode).
Another alternative: if the text part cannot contain } (which right now it can't anyway), you can simply use {{[^}]+}}.
The [^...] is a negated character class. [^}] matches anything but }.
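A quick way to check this suggestion is to run both patterns against the two templates from the question. The sketch below uses Python's re module rather than .NET (an assumption made purely for illustration; these particular patterns should behave the same way in both) and shortened versions of the sample templates. The helper names matches and unlisted_chars are hypothetical. It suggests that the en dash in 26–33 is the character the original class does not list, and since an en dash is not a word character, switching to \w alone would probably not fix that block.

```python
import re

# The pattern from the question, copied verbatim.
ORIGINAL = re.compile(r"{{[a-zA-Z0-9_\(\)\|\?\s\-\,\/\=\[\]\:.]+}}")
# The negated-class alternative: anything except a closing brace.
NEGATED = re.compile(r"{{[^}]+}}")

# Shortened versions of the two templates from the question.
# \u2013 is the en dash that appears in "pages = 26–33".
journal = "{{cite journal | pages = 26\u201333 | doi = 10.1109/MC.2008.192 }}"
web = "{{cite web | date = 2008-01-18 | publisher = The Bali Times }}"


def matches(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


def unlisted_chars(text):
    """Return the characters inside the braces that the original class rejects."""
    allowed = re.compile(r"[a-zA-Z0-9_\(\)\|\?\s\-\,\/\=\[\]\:.]")
    return sorted({ch for ch in text.strip("{}") if not allowed.match(ch)})


print(matches(ORIGINAL, web))       # True
print(matches(ORIGINAL, journal))   # False
print(matches(NEGATED, journal))    # True
print(unlisted_chars(journal))      # ['–']
```

The negated class avoids this kind of problem because it lists what to exclude instead of trying to enumerate every character a template might contain.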
References
- regular-expressions.info/Character Class
Related questions
- .Net regex: what is the word character \w?
Your character class is...special. For starters, everything you're matching is covered by the . at the end. Also, curly braces ({}) are special characters, so they should be escaped. Finally, you'll want to force it not to be greedy by adding a ? after that +, otherwise it will match curly braces.
EDIT: I won't try to go back on what I said, but I would like to note that I was mistaken about pretty much everything in this post (other than that braces should be escaped, which is just a matter of good practice).
The regex {{(.*?)}} works well for me in Perl. It captures everything between a pair of double braces.
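To see why the ? matters here, the sketch below compares the lazy pattern with a greedy variant on a string containing two templates. It is written in Python rather than Perl (an assumption for illustration), the sample text and the template_bodies helper are made up, and the re.DOTALL flag is an addition not present in the answer: without it, . does not match newlines, so templates that span several lines would be missed.

```python
import re

# Lazy version from the answer, plus DOTALL so "." also matches newlines
# (the flag is an assumption added for multi-line templates).
LAZY = re.compile(r"{{(.*?)}}", re.DOTALL)
# Greedy variant, shown only for comparison.
GREEDY = re.compile(r"{{(.*)}}", re.DOTALL)

# Made-up sample text with two templates, one containing an en dash.
text = "intro {{cite web | title = A}} middle {{cite journal | pages = 26\u201333}} end"


def template_bodies(pattern, source):
    """Return the captured text between each {{ and }} pair."""
    return pattern.findall(source)


for body in template_bodies(LAZY, text):
    print(body)

# The greedy pattern runs from the first {{ to the last }} in one match.
print(len(template_bodies(LAZY, text)), len(template_bodies(GREEDY, text)))  # 2 1
```

Neither this pattern nor {{[^}]+}} handles templates nested inside other templates, so counts may still differ from a plain search for {{ on articles that use nesting.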