Convert recursive tags into array with regular expression

I have the following text

hello <?tag?> world <?tag2?> xx <?/tag2?> hello <?/tag?> world

And I need it converted into

array( 'hello ', array( ' world ', array( ' xx ' ), ' hello ' ), ' world' );

Tag names are alphanumeric. Each tag is closed either by its matching closing tag or by the short form <?/?>. Tags with the same name may repeat, but are never nested inside each other.

My question is which would be the most CPU-efficient way to go?

  • use recursive preg_replace with callback
  • use preg_match_all with PREG_OFFSET_CAPTURE
  • use preg_split to flatten all tags (PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) into a linear array, then walk through it and group the tags.

If you can also provide the expression, I would be really happy.
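The third option in the list above (split once, then walk the flat token list and group it) can be sketched roughly as follows. This is only an illustration, not a benchmarked answer: the function name parseTags() is hypothetical, it assumes tags look exactly like <?name?>, <?/name?> or <?/?>, and it leaves out PREG_SPLIT_NO_EMPTY so that text and tag tokens strictly alternate. Whether it is really the cheapest option on CPU would need to be measured on real data.

```php
<?php
// Hypothetical helper, not from the original post: one preg_split pass,
// then a linear walk that groups tokens with a stack.
function parseTags(string $text): array
{
    // The capture group returns each tag's inner part ("tag", "/tag" or "/")
    // as its own token, so the result alternates: text, tag, text, tag, ...
    $tokens = preg_split('/<\?(\/\w*|\w+)\?>/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $stack   = array(); // parent levels that are still open
    $names   = array(); // names of the currently open tags
    $current = array(); // the level being filled right now

    foreach ($tokens as $i => $token) {
        if ($i % 2 === 0) {
            // Even offsets are plain text between tags; skip empty pieces.
            if ($token !== '') {
                $current[] = $token;
            }
            continue;
        }
        if ($token[0] !== '/') {
            // Opening tag: park the parent level and start a new child level.
            $stack[] = $current;
            $names[] = $token;
            $current = array();
            continue;
        }
        // Closing tag: either named or the short form with an empty name.
        $closing = (string) substr($token, 1);
        $open    = array_pop($names);
        if ($open === null || ($closing !== '' && $closing !== $open)) {
            throw new InvalidArgumentException("Unbalanced closing tag at token $i");
        }
        // Attach the finished child level to its parent and continue there.
        $parent   = array_pop($stack);
        $parent[] = $current;
        $current  = $parent;
    }
    if ($names) {
        throw new InvalidArgumentException('Unclosed tag: ' . end($names));
    }
    return $current;
}

print_r(parseTags('hello <?tag?> world <?tag2?> xx <?/tag2?> hello <?/tag?> world'));
```

For the sample text in the question this returns the nested array asked for above. The stack holds plain array copies rather than references, which keeps the code simple at the cost of some copying on deep nesting.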


This turned out to be less straightforward than expected, but hopefully it helps others. The biggest complication was that the callback of preg_replace_callback can only return a string, so the nested arrays have to be collected elsewhere ($this->ret in the code below).

Thanks all who tried to help!

class Parser {
    public $ret=array();
    public $parsed_template=array();
    public $tags=array();
    function loadTemplateFromString($str){
        $this->parsed_template=$this->tags=array();
        if(!$str){
            return;
        }

        /* First expand self-closing tags <?$tag?> -> <?tag?><?/tag?> */
        $str=preg_replace('/<\?\$([\w]+)\?>/','<?\1?><?/\1?>',$str);

        /* Next expand short closing tags: <?tag?>  <?/?> -> <?tag?>  <?/tag?> */
        $x=preg_replace_callback('/.*?<\?\/\?>/',function($x){
                return preg_replace('/(.*<\?([^\/][\w]+)\?>)(.*?)(<\?\/?\?>)/',
                   '\1\3<?/\2?>',$x[0]);
                },$str);

        /* Finally recursively build tag structure */
        $this->recursiveReplace($x);
    }
    function recursiveReplace($x){

        if(is_array($x)){
            // Called recursively
            $tmp2=$this->ret;$this->ret=array();
        }else{
            $x=array(4=>$x);
            $tmp2=null;
        }
        $y=preg_replace_callback('/(.*?)(<\?([^\/$][\w]+)\?>)(.*?)(<\?\/(\3)?\?>)(.*?)/',
            array($this,'recursiveReplace'),$x[4]);
        $this->ret[]=$y;
        if($tmp2===null)return;
        $tmp=$this->ret;
        $this->ret=$tmp2;

        $this->ret[]=$x[1];
        $this->ret[]=$tmp;
        return '';
    }
}
$p=new Parser();
$p->loadTemplateFromString('bla <?name?> name <?/name?> bla bla <?$surname?> bla '.
    '<?middle?> mm <?/?> blah <?outer?> you <?inner?> are <?/?> inside <?/outer?>'.
    ' bobobo');
var_dump($p->ret);

This outputs:

array
  0 => string 'bla ' (length=4)
  1 => 
    array
      0 => string ' name ' (length=6)
  2 => string ' bla bla ' (length=9)
  3 => 
    array
      0 => string '' (length=0)
  4 => string ' bla ' (length=5)
  5 => 
    array
      0 => string ' mm ' (length=4)
  6 => string ' blah ' (length=6)
  7 => 
    array
      0 => string ' you ' (length=5)
      1 => 
        array
          0 => string ' are ' (length=5)
      2 => string ' inside ' (length=8)
  8 => string ' bobobo' (length=7)


How about converting <?tagN?> to <elemN> and then parsing it as XML?

Once you have a raw structure like the result you mentioned, you can verify it against your element structure (that is, ensure items are numerically inside each other, etc.).
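As a rough sketch of that idea, the snippet below rewrites the custom tags as XML elements and lets DOMDocument check the nesting. The names tagsToArray() and nodeToArray(), the "elem_" prefix and the <root> wrapper are all made up for illustration. It assumes the DOM extension is available and that every tag is closed by name; the short form <?/?> is not handled here.

```php
<?php
// Sketch only: tagsToArray() and nodeToArray() are hypothetical names.
// Assumes ext-dom is loaded and that every tag is closed by its name.

function tagsToArray(string $text): array
{
    // Escape everything first so any other markup or "&" becomes plain text.
    $escaped = htmlspecialchars($text, ENT_NOQUOTES | ENT_XML1, 'UTF-8');

    // Turn the (now escaped) custom tags into elements. The "elem_" prefix
    // keeps the names valid in XML even when a tag starts with a digit.
    $xml = preg_replace('/&lt;\?(\/?)(\w+)\?&gt;/', '<${1}elem_${2}>', $escaped);

    // Collect libxml errors quietly instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok  = $doc->loadXML('<root>' . $xml . '</root>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        throw new InvalidArgumentException('Tags are not properly nested');
    }
    return nodeToArray($doc->documentElement);
}

function nodeToArray(DOMNode $node): array
{
    $out = array();
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $out[] = $child->nodeValue;      // plain text between tags
        } elseif ($child instanceof DOMElement) {
            $out[] = nodeToArray($child);    // nested tag becomes a nested array
        }
    }
    return $out;
}

print_r(tagsToArray('hello <?tag?> world <?tag2?> xx <?/tag2?> hello <?/tag?> world'));
```

The XML parser does the matching of opening and closing names, so mismatched tags surface as a parse failure rather than a silently wrong structure.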

Just add in a document element and you are set with this stylesheet:


Edit: Once it came up that these tags are mixed with HTML, I decided to change my strategy. Please look at the following code before the description:

$data = '<b>H</b>ello <?tag?> <b>W</b>orld <?/tag?>';

$conv1 = array(
//  original => entity
    '<?tag'  => '%START-BEGIN%',
    '<?/tag' => '%START-END%',
    '?>'     => '%END-END%',
);

$conv2 = array(
//  entity          => xml
    '%START-BEGIN%' => '<element',
    '%START-END%'   => '</element',
    '%END-END%'     => '>',
);

// Swap the custom tags for placeholders so htmlentities() leaves them alone
$data = str_replace(array_keys($conv1), array_values($conv1), $data);

$data = htmlentities($data, ENT_QUOTES); // encode HTML characters

// Turn the placeholders into XML elements
$data = str_replace(array_keys($conv2), array_values($conv2), $data);

$xml = '<?xml version="1.0" encoding="UTF-8"?>'.$data;

// You must apply the following function to each output text
// html_entity_decode($data,ENT_QUOTES);