
What regular expression pattern do I need to match everything between {{ and }}?

I'm trying to parse Wikipedia, but I'm ending up with orphan }} after running the regex code. Here's my PHP script.

<?php

$articleName='england';

$url = "http://en.wikipedia.org/wiki/Special:Export/" . $articleName;
ini_set('user_agent','custom agent'); //required so that Wikipedia allows our request.

$feed = file_get_contents($url);
$xml = new SimpleXmlElement($feed);

$wikicode = $xml->page->revision->text;



$wikicode=str_replace("[[", "", $wikicode);
$wikicode=str_replace("]]", "", $wikicode);
$wikicode=preg_replace('/\{\{([^}]*(?:\}[^}]+)*)\}\}/','',$wikicode);

print($wikicode);

?>

I think the problem is I have nested {{ and }} e.g.

{{ something {{ something else {{ something new }}{{ something old }} something blue }} something green }}


You can use:

\{\{(.*?)\}\}

Most regex flavors treat the brace { as a literal character, unless it is part of a repetition operator like {x,y} which is not the case here. So you do not need to escape it with a backslash, though doing it will give the same result.

So you can also use:

{{(.*?)}}

Sample:

$ echo {{StackOverflow}} | perl -pe 's/{{(.*?)}}/$1/'
StackOverflow

Also note that .*, which matches any character other than a newline, is used here in its non-greedy form (.*?), so it tries to match as little as possible.

Example:

In the string '{{stack}}{{overflow}}' it will match 'stack' and not 'stack}}{{overflow'.
If you want the latter behavior, you can change .*? to .*, making the match greedy.
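
To make the difference concrete, here is a minimal PHP sketch that runs the escaped, unescaped, and greedy patterns over the sample string from this answer. The variable names are mine, and it assumes PHP's PCRE-based preg functions.

```php
<?php
// Sample input taken from the answer above.
$input = '{{stack}}{{overflow}}';

// Lazy quantifier: each match stops at the first closing }}.
preg_match_all('/\{\{(.*?)\}\}/', $input, $lazy);

// Same pattern without backslashes; PCRE reads these braces literally
// because they do not form a valid {x,y} quantifier.
preg_match_all('/{{(.*?)}}/', $input, $unescaped);

// Greedy quantifier: a single match that runs to the last }}.
preg_match_all('/\{\{(.*)\}\}/', $input, $greedy);

print_r($lazy[1]);      // two captures: stack, overflow
print_r($unescaped[1]); // two captures: stack, overflow
print_r($greedy[1]);    // one capture: stack}}{{overflow
```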


Your edit shows that you're trying to do a recursive match, which is very different from the original question. If you weren't just deleting the matched text I would advise you not to use regexes at all, but this should do what you want:

$wikicode=preg_replace('~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s',
                       '', $wikicode);

After the first {{ matches an opening delimiter, (?:(?!{{|}}).)++ gobbles up everything until the next delimiter. If it's another opening delimiter, the (?R) takes over and applies the whole regex again, recursively.

(?R) is about as non-standard as regex features get. It comes from the PCRE library, which is what powers PHP's regex flavor, and Perl 5.10 and later accept it as well. Some other flavors have their own ways of matching recursive structures, all of them very different from each other.
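
As a rough check, the sketch below applies the recursive pattern to the nested sample from the question and, for contrast, a plain lazy pattern to a shorter nested string. The surrounding words ("before", "after", "x", "y") and the variable names are made up for illustration.

```php
<?php
// Recursive pattern from the answer above; (?R) re-enters the whole regex.
$recursivePattern = '~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s';

// Nested sample from the question, wrapped in hypothetical surrounding text.
$nested = 'before {{ something {{ something else {{ something new }}'
        . '{{ something old }} something blue }} something green }} after';

$stripped = preg_replace($recursivePattern, '', $nested);
var_dump($stripped); // string(13) "before  after"

// A shorter hypothetical input to compare the two approaches.
$short = 'x {{ a {{ b }} c }} y';

// The lazy pattern stops at the first }}, which leaves an orphan }} behind.
$orphaned = preg_replace('/\{\{(.*?)\}\}/', '', $short);
var_dump($orphaned); // string(9) "x  c }} y"

// The recursive pattern removes the whole balanced block.
$recursive = preg_replace($recursivePattern, '', $short);
var_dump($recursive); // string(4) "x  y"
```

The doubled space in the results is simply the whitespace that stood on either side of the removed block.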


Besides using the already mentioned non-greedy quantifier, you can also use this:

\{\{(([^}]|}[^}])*)}}

The inner ([^}]|}[^}])* is used to only match sequences of zero or more arbitrary characters that do not contain the sequence }}.
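
A small sketch of how that behaves in PHP, assuming a made-up input in which the first block contains a single } that must not end the match:

```php
<?php
// Pattern from the answer above, wrapped in ~ delimiters.
$pattern = '~\{\{(([^}]|}[^}])*)}}~';

// Hypothetical input: the first block contains a lone } inside it.
$input = '{{a}b}} and {{c}}';

preg_match_all($pattern, $input, $m);

// Group 1 holds the text between the delimiters.
print_r($m[1]); // two captures: a}b, c
```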


A greedy version to get the shortest match is

\{\{([^}]*(?:\}[^}]+)*)\}\}

(For comparison, with the string {{fd}sdfd}sf}x{dsf}}, the lazy version \{\{(.*?)\}\} takes 57 steps to match, while my version takes only 17 steps. This assumes the debug output of RegexBuddy can be trusted.)
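
PHP does not expose step counts, so the sketch below makes no attempt to reproduce those numbers. It only checks that the lazy pattern and the greedy one capture the same text on the string used above. Variable names are mine.

```php
<?php
// Test string from the comparison above.
$subject = '{{fd}sdfd}sf}x{dsf}}';

// Lazy version.
preg_match('/\{\{(.*?)\}\}/', $subject, $lazy);

// Greedy version built from negated character classes.
preg_match('/\{\{([^}]*(?:\}[^}]+)*)\}\}/', $subject, $unrolled);

var_dump($lazy[1] === $unrolled[1]); // bool(true)
echo $unrolled[1], "\n";             // fd}sdfd}sf}x{dsf
```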


\{{2}(.*)\}{2} or, cleaner, with lookarounds (?<=\{{2}).*(?=\}{2}), but only if your regex engine supports them.

If you want your match to stop at the first found }} (i.e. non-greedy) you should replace .* with .*?.
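
For illustration, here is the lookaround form in PHP, whose PCRE engine supports fixed-length lookbehind. The sample string is borrowed from an earlier answer and the variable names are mine. With lookarounds the braces are not part of the match, so the whole match (index 0) is already the inner text.

```php
<?php
$input = '{{stack}}{{overflow}}';

// Lazy lookaround version: each match is only the text between the braces.
preg_match_all('/(?<=\{{2}).*?(?=\}{2})/', $input, $lazyMatches);
print_r($lazyMatches[0]); // two matches: stack, overflow

// Greedy lookaround version: one match that runs up to the last }}.
preg_match_all('/(?<=\{{2}).*(?=\}{2})/', $input, $greedyMatches);
print_r($greedyMatches[0]); // one match: stack}}{{overflow
```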

Also take into account your engine's single-line (dot-all) setting: in most engines . does not match newline characters by default. You can either enable single-line mode or use [\s\S]* instead of .*.
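
A quick way to see the newline issue in PHP, where single-line mode is the s modifier. The two-line input is a made-up example, and the [\s\S] class is shown as one common way to match any character without changing the mode.

```php
<?php
// Hypothetical block that spans two lines.
$input = "{{line one\nline two}}";

// Without the s modifier, . stops at the newline, so nothing matches.
$plain = preg_match('/\{\{(.*?)\}\}/', $input, $m1);

// With the s modifier, . matches newlines too.
$dotall = preg_match('/\{\{(.*?)\}\}/s', $input, $m2);

// [\s\S] matches any character regardless of the mode.
$anyChar = preg_match('/\{\{([\s\S]*?)\}\}/', $input, $m3);

var_dump($plain);   // int(0)
var_dump($dotall);  // int(1)
var_dump($anyChar); // int(1)
```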
