PCRE: Lazy and Greedy at the same time (Possessive Quantifiers)
I am trying to match a series of text strings with PCRE in PHP, and am having trouble getting all the matches in between the first and second.
If anyone wonders why on Earth I would want to do this, it's because of Doc Comments. Oh, how I wish Zend would make native/plugin functions to read Doc Comments from a PHP file...
The following example (plain) text will be used for the problem. It will always be pure PHP code, with only one opening tag at the beginning of the file, no closing. You can assume that the syntax will always be correct.
<?php
class someClass extends someExample
{
function doSomething($someArg = 'someValue')
{
// Nested code blocks...
if($boolTest){}
}
private function killFurbies(){}
protected function runSomething(){}
}
abstract
class anotherClass
{
public function __construct(){}
abstract function saveTheWhales();
}
function globalFunc(){}
Problem
Trying to match all methods in a class; my RegEx does not find the method killFurbies() at all. Letting it be greedy means it only matches the last method in a class, and letting it be lazy means it only matches the first method.
$part = '.*'; // Greedy
$part = '.*?'; // Lazy
$regex = '%class(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff]*)'
. '.*?\{' . $part .'(?:(public|protected|private)(?:\\n|\\r|\\s)+)?'
. 'function(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff'
. ']*)(?:\\n|\\r|\\s)*\\(%ms';
preg_match_all($regex, file_get_contents(__EXAMPLE__), $matches, PREG_SET_ORDER);
var_dump($matches);
Results in:
// Lazy:
array(2) {
[0]=>
array(4) {
[0]=>
// Omitted.
[1]=>
string(9) "someClass"
[2]=>
string(0) ""
[3]=>
string(11) "doSomething"
}
[1]=>
array(4) {
[0]=>
// Omitted.
[1]=>
string(12) "anotherClass"
[2]=>
string(6) "public"
[3]=>
string(11) "__construct"
}
}
// Greedy:
array(2) {
[0]=>
array(4) {
[0]=>
// Omitted.
[1]=>
string(9) "someClass"
[2]=>
string(0) ""
[3]=>
string(13) "saveTheWhales"
}
[1]=>
array(4) {
[0]=>
// Omitted.
[1]=>
string(12) "anotherClass"
[2]=>
string(0) ""
[3]=>
string(13) "saveTheWhales"
}
}
How do I match all? :S
Any help would be gratefully appreciated, as I already feel this question is ridiculous as I'm typing it out. Anyone attempting to answer a question like this is braver than me!
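For anyone trying to reproduce the behaviour, here is a minimal sketch with a simplified pattern (an assumption, not the exact regex above) run against a trimmed copy of the sample. Because the pattern begins with `class`, each match has to start at a new class, so the lazy version can only ever return the first method of each class. A pattern that matches the `function` keyword on its own finds every method, but no longer knows which class it belongs to.

```php
<?php
// Trimmed-down copy of the sample file from the question.
$source = <<<'PHP'
class someClass extends someExample
{
    function doSomething($someArg = 'someValue') {}
    private function killFurbies(){}
    protected function runSomething(){}
}
class anotherClass
{
    public function __construct(){}
}
PHP;

// Simplified lazy pattern: every match must begin at the keyword "class".
$withClass = '%class\s+(\w+).*?\{.*?function\s+(\w+)\s*\(%s';
preg_match_all($withClass, $source, $perClass, PREG_SET_ORDER);

$firstOnly = array();
foreach ($perClass as $match) {
    // $match[1] is the class name, $match[2] the first method found in it.
    $firstOnly[] = $match[1] . '::' . $match[2];
}

// Matching the function keyword alone finds every method,
// at the cost of losing the link to the enclosing class.
preg_match_all('%function\s+(\w+)\s*\(%', $source, $allFunctions);
$names = $allFunctions[1];

var_dump($firstOnly, $names);
```

Getting both the class and all of its methods out of one pattern is where this gets awkward, since the class body would need to be matched with balanced braces.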
Better to use token_get_all to get the tokens of the PHP code and iterate over them. PHPDoc-style comment tokens can be identified by T_DOC_COMMENT.
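A minimal sketch of that idea, assuming the source is already in a string; `collect_doc_comments` is a hypothetical helper name, not part of PHP:

```php
<?php
// Hypothetical helper: returns the text of every /** ... */ comment in $code.
function collect_doc_comments($code)
{
    $docs = array();
    foreach (token_get_all($code) as $token) {
        // Array tokens are array(id, text, line); single characters are plain strings.
        if (is_array($token) && $token[0] === T_DOC_COMMENT) {
            $docs[] = $token[1];
        }
    }
    return $docs;
}

$code = "<?php\n"
      . "/** A class */\n"
      . "class Foo\n"
      . "{\n"
      . "    /** A method */\n"
      . "    function bar(){}\n"
      . "    // not a doc comment\n"
      . "    function baz(){}\n"
      . "}\n";

$docs = collect_doc_comments($code);
var_dump($docs);
```

Ordinary `//` and `/* */` comments come back as T_COMMENT, so they are skipped here.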
Err, can't you just parse the source using token_get_all and look for the tokens of type T_DOC_COMMENT (changed from T_COMMENT to T_DOC_COMMENT, see Gumnbo's post)?
An example of how to use the token_get_all function can be found here.
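As a rough illustration (a sketch under my own assumptions, not the linked example), the token stream can also be walked once to pair each doc comment with the function name that follows it. `doc_comments_by_function` is a hypothetical name, and the state handling is deliberately simple:

```php
<?php
// Hypothetical helper: maps function name => doc comment text (or false).
function doc_comments_by_function($code)
{
    $result = array();
    $pendingDoc = false;   // last doc comment seen and not yet consumed
    $expectName = false;   // true right after the "function" keyword

    foreach (token_get_all($code) as $token) {
        if (!is_array($token)) {
            if ($token === '(') {
                // Anonymous function: no name follows the keyword.
                $expectName = false;
            } elseif ($token === '{' || $token === '}' || $token === ';') {
                // A block or statement boundary: an earlier doc comment no longer applies.
                $pendingDoc = false;
                $expectName = false;
            }
            continue;
        }
        switch ($token[0]) {
            case T_DOC_COMMENT:
                $pendingDoc = $token[1];
                break;
            case T_FUNCTION:
                $expectName = true;
                break;
            case T_STRING:
                if ($expectName) {
                    $result[$token[1]] = $pendingDoc;
                    $pendingDoc = false;
                    $expectName = false;
                }
                break;
        }
    }
    return $result;
}

$code = "<?php\n"
      . "class Foo\n"
      . "{\n"
      . "    /** load method */\n"
      . "    public function load(\$path){}\n"
      . "    protected function get_tokens(){}\n"
      . "}\n";

$map = doc_comments_by_function($code);
var_dump($map);
```

This ignores which class a method belongs to, so same-named methods in different classes would overwrite each other.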
Solution
I've come up with a class to extract Doc Comments for classes and methods in a file. Thanks to all the people who answered this question, and the other on matching code blocks.
The average benchmark for the following example is between 0.00495 and 0.00505 seconds.
<?php
$file = 'path/to/libraries/tokenizer.php';
include $file;
$tokenizer = new Tokenizer;
// Start Benchmarking here.
$tokenizer->load($file);
// End Benchmarking here.
// The following will output 'bool(false)'.
var_dump($tokenizer->get_doc('Tokenizer', 'get_tokens'));
// The following will output 'string(18) "/** load method */"'.
var_dump($tokenizer->get_doc('Tokenizer', 'load'));
Tokenizer (yes, I still haven't thought of a better name for it...) Class:
<?php
class Tokenizer
{
private $compiled = false, $path = false, $tokens = false, $classes = array();
/** load method */
public function load($path)
{
$path = realpath($path);
if(!file_exists($path) || !function_exists('token_get_all'))
{
return false;
}
$this->compiled = false;
$this->classes = array();
$this->path = $path;
$this->tokens = false;
$this->get_tokens();
$this->get_classes();
$this->class_blocks();
$this->class_functions();
return true;
}
protected function get_tokens()
{
$tokens = token_get_all(file_get_contents($this->path));
$compiled = '';
foreach($tokens as $k => $t)
{
if(is_array($t) && $t[0] != T_WHITESPACE)
{
$compiled .= $k . ':' . $t[0] . ',';
}
else
{
if($t == '{' || $t == '}')
{
$compiled .= $t . ',';
}
}
}
$this->tokens = $tokens;
$this->compiled = trim($compiled, ',');
}
protected function get_classes()
{
if(!$this->compiled)
{
return false;
}
$regex = '%(?:(\\d+)\\:366,)?(?:\\d+\\:(?:345|344|353),)?\\d+\\:352,(\\d+)\\:307,(?:\\d+\\:(?:354|355),\\d+\\:307,)*{%';
preg_match_all($regex, $this->compiled, $classes, PREG_SET_ORDER);
if(is_array($classes))
{
foreach($classes as $class)
{
$this->classes[$this->tokens[$class[2]][1]] = array('token' => $class[2]);
$this->classes[$this->tokens[$class[2]][1]]['doc'] = isset($this->tokens[$class[1]][1]) ? $this->tokens[$class[1]][1] : false;
}
}
}
private function class_blocks()
{
if(!$this->compiled)
{
return false;
}
foreach($this->classes as $class_name => $class)
{
$this->classes[$class_name]['block'] = $this->get_block($class['token']);
}
}
protected function get_block($name_token)
{
if(!$this->compiled || ($pos = strpos($this->compiled, $name_token . ':')) === false)
{
return false;
}
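$section = substr($this->compiled, $pos);
$len = strlen($section);
$block = '';
// Start at 0 so the loop stops at the brace that closes this block;
// starting at 1 only stops at the closing brace of the enclosing block.
$opening = 0;
$closing = 0;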
for($i = 0; $i < $len; $i++)
{
if($section[$i] == '{')
{
$opening++;
}
elseif($section[$i] == '}')
{
$closing++;
if($closing == $opening)
{
break;
}
}
if($opening > 0)
{
$block .= $section[$i];
}
}
return trim($block, ',');
}
protected function class_functions()
{
if(!$this->compiled)
{
return false;
}
foreach($this->classes as $class_name => $class)
{
$regex = '%(?:(\d+)\:366,)?(?:\d+\:(?:344|345),)?(?:\d+\:(?:341|342|343),)?\d+\:333,(\d+)\:307,\{%';
preg_match_all($regex, $class['block'], $functions, PREG_SET_ORDER);
foreach($functions as $function)
{
$function_name = $this->tokens[$function[2]][1];
$this->classes[$class_name]['functions'][$function_name] = array('token' => $function[2]);
$this->classes[$class_name]['functions'][$function_name]['doc'] = isset($this->tokens[$function[1]][1]) ? $this->tokens[$function[1]][1] : false;
$this->classes[$class_name]['functions'][$function_name]['block'] = $this->get_block($function[2]);
}
}
}
public function get_doc($class, $function = false)
{
if(!is_string($class) || !isset($this->classes[$class]))
{
return false;
}
if(!is_string($function))
{
return $this->classes[$class]['doc'];
}
else
{
if(!isset($this->classes[$class]['functions'][$function]))
{
return false;
}
return $this->classes[$class]['functions'][$function]['doc'];
}
}
}
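One thing to watch: the numbers in the two regexes (366, 352, 307, 333 and so on) look like the integer token IDs of the PHP build I wrote this on, and those values are not guaranteed to stay the same between PHP versions. A minimal sketch of building the class pattern from the T_* constants instead, assuming the same "index:id" compiled format as above (`compile_tokens` is a hypothetical stand-alone version of get_tokens()):

```php
<?php
// Hypothetical stand-alone version of get_tokens(): "index:id" pairs plus braces.
function compile_tokens(array $tokens)
{
    $parts = array();
    foreach ($tokens as $index => $token) {
        if (is_array($token)) {
            if ($token[0] !== T_WHITESPACE) {
                $parts[] = $index . ':' . $token[0];
            }
        } elseif ($token === '{' || $token === '}') {
            $parts[] = $token;
        }
    }
    return implode(',', $parts);
}

$tokens   = token_get_all("<?php\n/** A class */\nclass Foo\n{\n}\n");
$compiled = compile_tokens($tokens);

// Same shape as the class regex, but the IDs come from the running PHP build.
$pattern = '%(?:(\d+):' . T_DOC_COMMENT . ',)?'
         . '\d+:' . T_CLASS . ','
         . '(\d+):' . T_STRING . ',\{%';

$className = false;
$classDoc  = false;
if (preg_match($pattern, $compiled, $m)) {
    $className = $tokens[(int) $m[2]][1];
    // Group 1 is an empty string when no doc comment precedes the class.
    $classDoc = $m[1] !== '' ? $tokens[(int) $m[1]][1] : false;
}

var_dump($className, $classDoc, token_name(T_DOC_COMMENT));
```

token_name() is handy while debugging, since it turns an ID back into a readable name.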
Any thoughts or comments on this? All criticism welcome!
Thanks, mniz.