Complex edit xml file

2023-01-04 04:28 Q&A author:

For example, we have this xml:

<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>

and we need to remove the words "[ID]" and "[/ID]" and the text between them (which we don't know when parsing), of course without damaging the XML formatting.

The only solution I can think of is this:

Find the text in the XML using a regex, for example: "/\[ID\].*?\[\/ID\]/". In our case, the result will be "[ID]hello</y><y>world[/ID]"
In the result from the previous step, find the text outside the XML tags using this regex: "/(?<=^|>)[^><]+?(?=<|$)/", and delete that text. The result will be "</y><y>"
Make the changes in the original XML by doing something like this:

str_replace($step1string,$step2string,$xml);

Is this the correct way to do this? I just think that this str_replace approach is not the best way to edit XML, so maybe you know a better solution?
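For reference, the three steps from the question could be sketched roughly as follows. The "s" modifier is an assumption added here so that "." can cross the line break between the two <y> elements, and the variable names are only illustrative.

```php
<?php
$xml = '<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
</x>';

// Step 1: grab everything from [ID] to [/ID], tags included.
// The "s" modifier (an assumption, not in the question) lets "." match newlines.
if (preg_match('/\[ID\].*?\[\/ID\]/s', $xml, $m)) {
    $step1string = $m[0];
    // Step 2: drop the text that sits outside the tags, keeping the tags themselves.
    $step2string = preg_replace('/(?<=^|>)[^><]+?(?=<|$)/', '', $step1string);
    // Step 3: put the stripped fragment back into the original XML string.
    $xml = str_replace($step1string, $step2string, $xml);
}
echo $xml;
```

This only handles the first [ID]...[/ID] pair and still treats the XML as a plain string, so markers inside attributes or comments would be affected too.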

Removing the specific string is simple:

<?php
$xml = '<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>';

$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[(contains(.,\'[ID]\') or contains(.,\'[/ID]\'))]') as $elm){
    $elm->nodeValue = preg_replace('/\[\/?ID\]/','',$elm->nodeValue);
}
var_dump($d->saveXML());
?>

When just removing text nodes in a specific tag, one could alter the preg_replace to these two:

 $elm->nodeValue = preg_replace('/\[ID\].*$/','',$elm->nodeValue);
 $elm->nodeValue = preg_replace('/^.*\[\/ID\]/','',$elm->nodeValue);

Resulting in for your example:

<x>
<y>some text</y>
<y></y>
<y></y>
<y>some text</y>
<y>some text</y>
</x>
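Put together, that variant could look like the sketch below. It assumes the opening and closing markers never share a text node (the first pattern would otherwise also cut away any text after [/ID]), and the "s" modifier is an addition so that the patterns also cover text nodes spanning several lines.

```php
<?php
$xml = '<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>';

$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);

// Visit only the text nodes that hold one of the markers.
foreach ($x->query('//text()[contains(., "[ID]") or contains(., "[/ID]")]') as $node) {
    // Cut everything from [ID] to the end of this text node.
    $text = preg_replace('/\[ID\].*$/s', '', $node->nodeValue);
    // Cut everything from the start of this text node up to and including [/ID].
    $text = preg_replace('/^.*\[\/ID\]/s', '', $text);
    $node->nodeValue = $text;
}
echo $d->saveXML();
```

Text nodes that sit between the two marked nodes without containing a marker themselves are left untouched by this version.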

However, removing the tags in between without damaging well-formed XML is quite tricky. Before venturing into a lot of DOM actions, how would you like to handle:

An [/ID] higher in the DOM-tree:

<foo>[ID] foo
    <bar> lorem [/ID] ipsum </bar>
</foo>

An [/ID] lower in the DOM-tree

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    [/ID]
</foo>

And open/close spanning siblings, as per your example:

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
</foo>

And a real dealbreaker of a question: is nesting possible, is that nesting well formed, and what should it do?

<foo> foo
    <bar> lo  [ID] rem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
    [/ID]
</foo>

Without further knowledge of how these cases should be handled, there is no real answer.
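To find out which of these situations actually occurs in a given document, a small check along these lines might help. The function name checkIdMarkers and its three return values are hypothetical; it only counts markers in document order and does not modify anything.

```php
<?php
// Hypothetical helper: reports whether the [ID]/[/ID] markers in a document
// are 'flat' (simple pairs), 'nested', or 'unbalanced'.
function checkIdMarkers(DOMDocument $doc): string
{
    $xpath = new DOMXPath($doc);
    $depth = 0;
    $nested = false;
    // The text nodes are visited in document order.
    foreach ($xpath->query('//text()') as $text) {
        preg_match_all('/\[(\/?)ID\]/', $text->nodeValue, $m);
        foreach ($m[1] as $slash) {
            if ($slash === '') {
                // Opening marker: a depth above 1 means nesting.
                $depth++;
                if ($depth > 1) {
                    $nested = true;
                }
            } else {
                // Closing marker without a matching opening marker.
                $depth--;
                if ($depth < 0) {
                    return 'unbalanced';
                }
            }
        }
    }
    if ($depth !== 0) {
        return 'unbalanced';
    }
    return $nested ? 'nested' : 'flat';
}

$doc = new DOMDocument();
$doc->loadXML('<foo> foo
    <bar> lo  [ID] rem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
    [/ID]
</foo>');
echo checkIdMarkers($doc), "\n"; // prints "nested"
```

Running a check like this first makes it possible to reject or log the awkward inputs before any removal logic touches the tree.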

Edit: well, further information was given. The actual fail-safe solution (i.e. parse the XML, don't use regexes) seems kind of long, but will work in 99.99% of cases (personal typos and brainfarts excluded, of course :) ):

<?php
$xml = '<x>
    <y>some text</y>
    <y>
      <a> something </a>
      well [ID] hello
      <a> and then some</a>
    </y>
    <y>some text</y>
    <x>
      world
      <a> also </a>
        foobar [/ID] something
      <a> these nodes </a>
    </x>
    <y>some text</y>
    <y>some text</y>
</x>';
echo $xml;
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[contains(.,\'[ID]\')]') as $elm){
        //if this node also contains [/ID], replace and be done:
        if(($startpos = strpos($elm->nodeValue,'[ID]'))!==false && ($endpos = strpos($elm->nodeValue,'[/ID]',$startpos))!==false){
                $elm->replaceData($startpos, $endpos-$startpos + 5,'');
                var_dump($d->saveXML($elm));
                continue;
        }
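        //no [/ID] in this node: cut its text off at [ID]
        $elm->nodeValue = substr($elm->nodeValue,0,$startpos);
        //delete following siblings until a textnode containing [/ID] is found
        while($elm->nextSibling){
                if(!($elm->nextSibling instanceof DOMText) || ($pos = strpos($elm->nextSibling->nodeValue,'[/ID]'))===false){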
                        $elm->parentNode->removeChild($elm->nextSibling);
                } else {
                        //[/ID] found in same element: keep the text after it and go to next [ID]
                        $elm->appendData(substr($elm->nextSibling->nodeValue,$pos+5));
                        $elm->parentNode->removeChild($elm->nextSibling);
                        continue 2;
                }
        }
        //siblings of textnode deleted, string truncated to before [ID], now let's delete intermediate nodes
        while($sibling = $elm->parentNode->nextSibling){ // in case of example: other <y> elements:
                //loop though childnodes and search a textnode with [/ID]
                while($child = $sibling->firstChild){
                        //delete if not a textnode
                        if(!($child instanceof DOMText)){
                                $sibling->removeChild($child);
                                continue;
                        }
                        //we have text, check for [/ID]
                        if(($pos = strpos($child->nodeValue,'[/ID]'))!==false){
                                //add remaining text in textnode:
                                $elm->appendData(substr($child->nodeValue,$pos+5));
                                //remove current textnode with match:
                                $sibling->removeChild($child);
                                //sanity check: [ID] was in <y>, is [/ID]?
                                if($sibling->tagName!= $elm->parentNode->tagName){
                                        trigger_error('[/ID] found in other tag than [ID]: '.$sibling->tagName.'<>'.$elm->parentNode->tagName, E_USER_NOTICE);
                                }
                                //add remaining childs of sibling to parent of [ID]:
                                while($sibling->firstChild){
                                        $elm->parentNode->appendChild($sibling->firstChild);
                                }
                                //delete the sibling that was found to hold [/ID]
                                $sibling->parentNode->removeChild($sibling);
                                //done: end both whiles
                                break 2;
                        }
                        //textnode, but no [/ID], so remove:
                        $sibling->removeChild($child);
                }
                //no child, no text, so no [/ID], remove:
                $elm->parentNode->parentNode->removeChild($sibling);
        }
}
var_dump($d->saveXML());
?>

For your entertainment and edification, you may want to read this: RegEx match open tags except XHTML self-contained tags

The "correct" solution is to use an XML library and search through the nodes to perform the operation. However, it would probably be much easier to just use a str_replace, even if there's a chance of damaging the XML formatting. You have to gauge the likelihood of receiving something like <a href="[ID]"> and the importance of defending against such cases, and weigh those factors against development time.
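As a rough idea of what searching through the nodes could look like, here is a minimal sketch. It is not taken from either answer: the function name stripMarkedText is hypothetical, and it assumes the markers are balanced, not nested, and only ever appear in text nodes. It leaves all elements in place and only empties the text between the markers.

```php
<?php
// Hypothetical helper: removes [ID], [/ID] and any text between them,
// walking the text nodes in document order and leaving elements alone.
function stripMarkedText(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);
    $inside = false; // true while we are between [ID] and [/ID]
    foreach ($xpath->query('//text()') as $text) {
        // Split the text around the markers, keeping the markers as parts.
        $parts = preg_split('/(\[\/?ID\])/', $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        $kept = '';
        foreach ($parts as $part) {
            if ($part === '[ID]') {
                $inside = true;
            } elseif ($part === '[/ID]') {
                $inside = false;
            } elseif (!$inside) {
                $kept .= $part;
            }
        }
        $text->nodeValue = $kept;
    }
}

$d = new DOMDocument();
$d->loadXML('<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>');
stripMarkedText($d);
echo $d->saveXML();
```

Whitespace-only text nodes that lie between the two markers are emptied as well, and an [ID] without a closing [/ID] would wipe all text up to the end of the document, so a balance check beforehand is advisable.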

The only other option I can think of is if you could format the xml differently.

<x>
  <y>
    <z>[ID]</z>
