Condition inside regex pattern

I would like to remove any extra whitespace from my code; I'm parsing a docblock. The problem is that I do not want to remove whitespace within a <code>code goes here</code> block.

For example, I use this to remove extra whitespace:

$string = preg_replace('/[ ]{2,}/', '', $string);

But I would like to keep the whitespace within <code></code>.

This input string:

This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

Should be transformed into:

This is some text
This is also some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

How can I do this?


You aren't really looking for a condition - you need a way to skip parts of the string so they are not replaced. This can be done rather easily using preg_replace, by inserting dummy groups and replacing each group with itself. In your case you only need one:

$str = preg_replace("~(<code>.*?</code>)|^ +| +$|( ) +~smi" , "$1$2", $str);

How does it work?

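  • (<code>.*?</code>) - Match a <code> block into the first group, $1. This assumes simple formatting and no nesting; the pattern can be made more elaborate if needed.
  • ^ + - Match and remove spaces at the beginning of lines.
  •  +$ - Match and remove spaces at the end of lines.
  • ( ) + - Match two or more spaces in the middle of a line, and capture the first one into the second group, $2.
  • The modifiers smi - s lets . match newlines so a <code> block can span several lines, m makes ^ and $ match at every line boundary, and i makes the tag match case-insensitive.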

The replacement string, $1$2, keeps <code> blocks and the first space if one was captured, and removes everything else the pattern matched.

Things to remember:

  • If $1 or $2 didn't capture anything, it is replaced with an empty string.
  • Alternations (a|b|c) are tried from left to right - as soon as one alternative matches, the engine is satisfied and does not try the remaining ones. That is why ^ +| +$ must come before ( ) +.
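
The effect of the ordering can be seen on a small input. This is a minimal sketch: the test string is made up, and both patterns are cut-down versions of the one above, differing only in which alternative comes first.

```php
<?php
// Illustrative input: two leading spaces, and two spaces between the words.
$line = "  a  b";

// Line-start alternative first: the leading run is removed outright.
$trimFirst = preg_replace('~^ +|( ) +~m', '$1', $line);

// Capturing alternative first: it claims the leading run and keeps one space.
$captureFirst = preg_replace('~( ) +|^ +~m', '$1', $line);

var_dump($trimFirst);    // string(3) "a b"
var_dump($captureFirst); // string(4) " a b"
```

With the capturing alternative first, the line keeps an unwanted leading space, because the engine never gets to try ^ + at that position.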

Working example: http://ideone.com/HxbaV
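
For readers who cannot open the link, here is a self-contained sketch that runs the same pattern on the sample from the question. The wrapper function name and the null check are my own additions for illustration, not part of the original answer.

```php
<?php
// Hypothetical wrapper around the one-line replacement shown above.
function collapse_spaces_outside_code(string $str): string
{
    $result = preg_replace(
        '~(<code>.*?</code>)|^ +| +$|( ) +~smi',
        '$1$2',
        $str
    );
    if ($result === null) {
        // preg_replace() returns null on failure, e.g. when the backtrack limit is hit.
        throw new RuntimeException('preg_replace() failed');
    }
    return $result;
}

// The sample input from the question.
$input = "This  is some  text\n"
       . "  This is also   some text\n"
       . "\n"
       . "<code>\n"
       . "User::setup(array(\n"
       . "    'key1' => 'value1',\n"
       . "    'key2' => 'value1'\n"
       . "));\n"
       . "</code>";

echo collapse_spaces_outside_code($input), "\n";
```

The two text lines come out with single spaces and no indentation, while everything between <code> and </code> is passed through unchanged.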


When parsing markup with PHP and regex, the preg_replace_callback() function combined with the recursive expressions (?R), (?1), (?2)... makes for a very powerful tool indeed. The following script handles your test data quite nicely:

<?php // test.php 20110312_2200

function clean_non_code(&$text) {
    $re = '%
    # Match and capture either CODE into $1 or non-CODE into $2.
      (                      # $1: CODE section (never empty).
        <code[^>]*>          # CODE opening tag
        (?R)+                # CODE contents w/nested CODE tags.
        </code\s*>           # CODE closing tag
      )                      # End $1: CODE section.
    |                        # Or...
      (                      # $2: Non-CODE section (may be empty).
        [^<]*+               # Zero or more non-< {normal*}
        (?:                  # Begin {(special normal*)*}
          (?!</?code\b)      # If not a code open or close tag,
          <                  # match non-code < {special}
          [^<]*+             # More {normal*}
        )*+                  # End {(special normal*)*}
      )                      # End $2: Non-CODE section
    %ix';

    $text = preg_replace_callback($re, '_my_callback', $text);
    if ($text === null) exit("PREG Error!\nTarget string too big.");
    return $text;
}

// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
    if ($matches[1]) {
        return $matches[1];
    }
    // Collapse multiple tabs and spaces into a single space.
    $matches[2] = preg_replace('/[ \t][ \t]++/S', ' ', $matches[2]);
    // Trim each line
    $matches[2] = preg_replace('/^ /m', '', $matches[2]);
    $matches[2] = preg_replace('/ $/m', '', $matches[2]);
    return $matches[2];
}

// Create some test data.
$data = "This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1'      => 'value1',
    'key2'      => 'value1',
    'key42'     => '<code>
        Pay no attention to this. It has been proven over and
        over again that it is <code>   unpossible   </code>
        to parse nested stuff with regex!           </code>'
));
</code>";

// Demonstrate that it works on one small test string.
echo("BEFORE:\n". $data ."\n\n");
echo("AFTER:\n". clean_non_code($data) ."\n\nTesting...");

// Build a large test string.
$bigdata = '';
for ($i = 0; $i < 30000; ++$i) $bigdata .= $data;
$size = strlen($bigdata);

// Measure how long it takes to process it.
$time = microtime(true);
$bigdata = clean_non_code($bigdata);
$time = microtime(true) - $time;

// Print benchmark results
printf("Done.\nTest string size: %d bytes. Time: %.3f sec. Speed: %.0f KB/s.\n",
    $size, $time, ($size / $time)/1024.);
?>

Here are the script benchmark results when run on my test box: WinXP32 PHP 5.2.14 (cli)

'Test string size: 10410000 bytes. Time: 1.219 sec. Speed: 8337 KB/s.'

Note that this solution does not handle CODE tags that have <> angle brackets inside their attributes (probably a very rare edge case), but the regex could easily be modified to handle these as well. Note also that the maximum string length depends on the composition of the string content (i.e., big CODE blocks reduce the maximum input string length).



What you will want is to parse it using some form of HTML parser.

For example, you could load the text with DOMDocument, iterate over its text nodes, skip those that belong to a code element, and strip the extra whitespace from the rest.
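
A rough sketch of that idea follows. It assumes the dom extension is available and that the docblock text is well-formed enough for libxml's HTML parser; the function name and the wrapper <div id="root"> are made up for illustration. Spaces are stripped only where they touch a newline, so a space next to an inline tag survives, and whitespace at the very start or end of the fragment is left alone.

```php
<?php
// Sketch only: hypothetical helper, not a drop-in solution.
function collapse_whitespace_with_dom(string $html): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    // The wrapper gives the fragment a single known parent element.
    $doc->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Every text node under the wrapper that has no <code> ancestor.
    $nodes = $xpath->query('//div[@id="root"]//text()[not(ancestor::code)]');
    foreach ($nodes as $node) {
        // Remove spaces and tabs directly after or before a newline.
        $text = preg_replace('/(?<=\n)[ \t]+|[ \t]+(?=\n)/', '', $node->data);
        // Collapse the remaining runs into a single space.
        $node->data = preg_replace('/[ \t]{2,}/', ' ', $text);
    }

    // Serialize only the wrapper's children, not the implied <html><body>.
    $root = $xpath->query('//div[@id="root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

One thing to be aware of: the serializer escapes special characters in text, so a > inside the code block comes back as &gt;. If the output must be byte-identical inside <code>, one of the string-based approaches is a better fit.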

Alternatively, read the file with file(), which gives you an array of lines, and step through the lines, stripping whitespace from each one that is outside a code element.

To determine whether you are inside a code element, look for the opening tag <code> and set a flag that says you are in code mode; skip lines while the flag is set, and reset it when you encounter </code>. You could take nesting into account by storing the state as an integer depth instead of a flag, even though nested code elements are not the smartest idea (why would you nest them?).
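
The line-by-line approach could look roughly like this. It is a minimal sketch with a made-up function name; it assumes the lines carry no trailing newline (as returned by file() with the FILE_IGNORE_NEW_LINES flag, or by explode() as in the demo), and it treats any line containing a code tag as belonging to the code block.

```php
<?php
// Hypothetical helper: cleans every line that lies outside <code>...</code>.
function strip_outside_code(array $lines): array
{
    $depth = 0; // integer state, so nested <code> elements are tolerated
    $out = [];
    foreach ($lines as $line) {
        $opens  = preg_match_all('~<code\b[^>]*>~i', $line);
        $closes = preg_match_all('~</code\s*>~i', $line);

        if ($depth > 0 || $opens > 0 || $closes > 0) {
            // Inside a code element, or on a line that opens or closes one: keep as is.
            $out[] = $line;
        } else {
            // Outside: collapse runs of spaces/tabs, then trim both ends.
            $out[] = trim(preg_replace('/[ \t]{2,}/', ' ', $line), " \t");
        }

        $depth = max(0, $depth + $opens - $closes);
    }
    return $out;
}

// Demo on the sample from the question.
$sample = "This  is some  text\n  This is also   some text\n\n<code>\n"
        . "User::setup(array(\n    'key1' => 'value1',\n    'key2' => 'value1'\n));\n</code>";
echo implode("\n", strip_outside_code(explode("\n", $sample))), "\n";
```

Text that shares a line with a code tag is left untouched by this sketch; handling that case would mean splitting the line at the tag first.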

Mario came up with this before me.


Parsing HTML with regexes is a bad idea.

See: RegEx match open tags except XHTML self-contained tags

Use something like Zend_Dom to parse the HTML and extract the parts of it in which you need to replace spaces.
