开发者

Complex edit xml file

For example, we have this xml:

<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>

and we need to remove words "[ID]", "[/ID]" and text between them (which we don't know, when parsing), of course without damage xml formatting.

The only solution i can think is that:

  1. Find in xml the text by using regex, for example: "/\[ID\].*?\[\/ID\]/". In our case, result will be "[ID]hello</y><y>world[/ID]"

  2. In result from prev step we need to find text without xml-tags by using this regex: "/(?<=^|>)[^><]+?(?=<|$)/", and delete this text. The result will be "</y><y>"

  3. Made changes in original xml by doing smth like this:

    str_replace($step1string,$step2string,$xml);

is this correct way to do this? I just think that this "str_replace"'s things it's not best way to edit xml, so maybe 开发者_开发技巧you know better solution?


Removing the specific string is simple:

<?php
$xml = '<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>';

$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[(contains(.,\'[ID]\') or contains(.,\'[/ID]\'))]') as $elm){
    $elm->nodeValue = preg_replace('/\[\/?ID\]/','',$elm->nodeValue);
}
var_dump($d->saveXML());
?>

When just removing textnodes in a specific tag, one could alter te preg_replace to these 2:

 $elm->nodeValue = preg_replace('/\[ID\].*$/','',$elm->nodeValue);
 $elm->nodeValue = preg_replace('/^.*\[/ID\]/','',$elm->nodeValue);

Resulting in for your example:

<x>
<y>some text</y>
<y></y>
<y></y>
<y>some text</y>
<y>some text</y>
</x>

However, removing tags in between without damaging well formed XML is quite tricky. Before venturing into lot of DOM actions, how would you like to handle:

An [/ID] higher in the DOM-tree:

<foo>[ID] foo
    <bar> lorem [/ID] ipsum </bar>
</foo>

An [/ID] lower in the DOM-tree

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    [/ID]
</foo>

And open/close spanning siblings, as per your example:

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
</foo>

And a real dealbreaker of a question: is nesting possible, is that nesting well formed, and what should it do?

<foo> foo
    <bar> lo  [ID] rem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
    [/ID]
</foo>

Without further knowledge how these case should be handled there is no real answer.


Edit, well futher information was given, the actual, fail-safe solution (i.e.: parse XML, don't use regexes) seems kind of long, but will work in 99.99% of cases (personal typos and brainfarts excluded of course :) ):

<?php
$xml = '<x>
    <y>some text</y>
    <y>
      <a> something </a>
      well [ID] hello
      <a> and then some</a>
    </y>
    <y>some text</y>
    <x>
      world
      <a> also </a>
        foobar [/ID] something
      <a> these nodes </a>
    </x>
    <y>some text</y>
    <y>some text</y>
</x>';
echo $xml;
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[contains(.,\'[ID]\')]') as $elm){
        //if this node also contains [/ID], replace and be done:
        if(($startpos = strpos($elm->nodeValue,'[ID]'))!==false && $endpos = strpos($elm->nodeValue,'[/ID]',$startpos)){
                $elm->replaceData($startpos, $endpos-$startpos + 5,'');
                var_dump($d->saveXML($elm));
                continue;
        }
        //delete all siblings of this textnode not being text and having [/ID]
        while($elm->nextSibling){
                if(!($elm->nextSibling instanceof DOMTEXT) || ($pos =strpos($elm->nodeValue,'[/ID]'))===false){
                        $elm->parentNode->removeChild($elm->nextSibling);
                } else {
                        //id found in same element, replace and go to next [ID]
                        $elm->parentNode->appendChild(new DOMTExt(substr($elm->nextSibling->nodeValue,$pos+5)));
                        $elm->parentNode->removeChild($elm->nextSibling);
                        continue 2;
                }
        }
        //siblings of textnode deleted, string truncated to before [ID], now let's delete intermediate nodes
        while($sibling = $elm->parentNode->nextSibling){ // in case of example: other <y> elements:
                //loop though childnodes and search a textnode with [/ID]
                while($child = $sibling->firstChild){
                        //delete if not a textnode
                        if(!($child instanceof DOMText)){
                                $sibling->removeChild($child);
                                continue;
                        }
                        //we have text, check for [/ID]
                        if(($pos = strpos($child->nodeValue,'[/ID]'))!==false){
                                //add remaining text in textnode:
                                $elm->appendData(substr($child->nodeValue,$pos+5));
                                //remove current textnode with match:
                                $sibling->removeChild($child);
                                //sanity check: [ID] was in <y>, is [/ID]?
                                if($sibling->tagName!= $elm->parentNode->tagname){
                                        trigger_error('[/ID] found in other tag then [/ID]: '.$sibling->tagName.'<>'.$elm->parentNode->tagName, E_USER_NOTICE);
                                }
                                //add remaining childs of sibling to parent of [ID]:
                                while($sibling->firstChild){
                                        $elm->parentNode->appendChild($sibling->firstChild);
                                }
                                //delete the sibling that was found to hold [/ID]
                                $sibling->parentNode->removeChild($sibling);
                                //done: end both whiles
                                break 2;
                        }
                        //textnode, but no [/ID], so remove:
                        $sibling->removeChild($child);
                }
                //no child, no text, so no [/ID], remove:
                $elm->parentNode->parentNode->removeChild($sibling);
        }
}
var_dump($d->saveXML());
?>


For your entertainment and edification, you may want to read this: RegEx match open tags except XHTML self-contained tags

The "correct" solution is to use an XML library and search through the nodes to perform the operation. However, it would probably be much easier to just use a str_replace, even if there's a chance of damaging the XML formatting. You have to gauge the likelihood of receiving something like <a href="[ID]"> and the importance of defending against such cases, and weigh those factors against development time.


The only other option I can think of is if you could format the xml differently.

<x>
  <y>
    <z>[ID]</z>
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜