Complex edit of an XML file
For example, we have this XML:
<x>
<y>some text</y>
<y>[ID] hello</y>
<y>world [/ID]</y>
<y>some text</y>
<y>some text</y>
</x>
and we need to remove the markers "[ID]" and "[/ID]" together with the text between them (which we don't know at parse time), of course without damaging the XML structure.
The only solution I can think of is this:
1. Find the text in the XML with a regex, for example:
"/\[ID\].*?\[\/ID\]/"
In our case the result will be "[ID] hello</y><y>world [/ID]" (line breaks omitted).
2. In the result of the previous step, find the text outside the XML tags with this regex:
"/(?<=^|>)[^><]+?(?=<|$)/"
and delete that text. The result will be "</y><y>".
3. Apply the change to the original XML with something like:
str_replace($step1string,$step2string,$xml);
Is this the correct way to do it? I suspect that this kind of str_replace juggling is not the best way to edit XML, so maybe you know a better solution?
Removing the specific string is simple:
<?php
$xml = '<x>
<y>some text</y>
<y>[ID] hello</y>
<y>world [/ID]</y>
<y>some text</y>
<y>some text</y>
</x>';
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[(contains(.,\'[ID]\') or contains(.,\'[/ID]\'))]') as $elm){
$elm->nodeValue = preg_replace('/\[\/?ID\]/','',$elm->nodeValue);
}
var_dump($d->saveXML());
?>
When just removing the text within a specific tag, one could change the preg_replace to these two:
$elm->nodeValue = preg_replace('/\[ID\].*$/','',$elm->nodeValue);
$elm->nodeValue = preg_replace('/^.*\[\/ID\]/','',$elm->nodeValue);
For your example, this results in:
<x>
<y>some text</y>
<y></y>
<y></y>
<y>some text</y>
<y>some text</y>
</x>
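Put together, the two-regex variant above could look like the sketch below. The function name stripMarkedText is made up for illustration, and the sketch assumes that [ID] and [/ID] sit in different text nodes: a node holding both markers would lose any text after [/ID] as well. The s modifier is added so that text nodes containing line breaks are handled too.

```php
<?php
// Sketch only: drops everything from [ID] to the end of its text node and
// everything up to and including [/ID] in the text node that closes it.
// Assumes the two markers never share a single text node.
function stripMarkedText(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    $query = "//text()[contains(., '[ID]') or contains(., '[/ID]')]";
    foreach ($xpath->query($query) as $node) {
        $text = $node->nodeValue;
        // From [ID] to the end of this text node.
        $text = preg_replace('/\[ID\].*$/s', '', $text);
        // From the start of this text node up to and including [/ID].
        $text = preg_replace('/^.*\[\/ID\]/s', '', $text);
        $node->nodeValue = $text;
    }

    // Serialize without the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

$xml = '<x><y>keep</y><y>[ID] hello</y><y>world [/ID]</y><y>keep too</y></x>';
echo stripMarkedText($xml), "\n";
```

Elements that sit between the two marked nodes are left alone here, which is exactly the part that gets tricky.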
However, removing the tags in between without damaging the well-formed XML is quite tricky. Before venturing into a lot of DOM manipulation, how would you like to handle the following cases?
An [/ID] higher in the DOM-tree:
<foo>[ID] foo
<bar> lorem [/ID] ipsum </bar>
</foo>
An [/ID] lower in the DOM-tree
<foo> foo
<bar> lorem [ID] ipsum </bar>
[/ID]
</foo>
And open/close spanning siblings, as per your example:
<foo> foo
<bar> lorem [ID] ipsum </bar>
<bar> lorem [/ID] ipsum </bar>
</foo>
And a real dealbreaker of a question: is nesting possible, is that nesting well formed, and what should it do?
<foo> foo
<bar> lo [ID] rem [ID] ipsum </bar>
<bar> lorem [/ID] ipsum </bar>
[/ID]
</foo>
Without further knowledge of how these cases should be handled, there is no real answer.
Edit: now that further information has been given, here is the actual, fail-safe solution (i.e. parse the XML, don't use regexes). It is rather long, but it will work in 99.99% of cases (personal typos and brainfarts excluded, of course :) ):
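To find out which of these cases actually occur in a given document, one could first scan the text nodes in document order and count the markers. The sketch below is only a diagnostic, not a fix; the function name inspectMarkers and the shape of its return value are made up for illustration.

```php
<?php
// Hypothetical helper: walks all text nodes in document order and reports
// whether [ID] / [/ID] markers are balanced and how deeply they nest.
function inspectMarkers(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    $depth = 0;
    $maxDepth = 0;
    $balanced = true;

    foreach ($xpath->query('//text()') as $node) {
        // Collect the markers of this text node in the order they appear.
        preg_match_all('/\[\/?ID\]/', $node->nodeValue, $matches);
        foreach ($matches[0] as $marker) {
            if ($marker === '[ID]') {
                $depth++;
                $maxDepth = max($maxDepth, $depth);
            } else {
                $depth--;
                if ($depth < 0) {
                    // A closing marker without a preceding opening marker.
                    $balanced = false;
                    $depth = 0;
                }
            }
        }
    }

    if ($depth !== 0) {
        // At least one [ID] was never closed.
        $balanced = false;
    }

    return ['balanced' => $balanced, 'maxDepth' => $maxDepth];
}
```

A maxDepth above 1 means nesting occurs, and balanced being false means there is an [ID] without a matching [/ID] or the other way around; both are cases where you would have to decide on the desired behaviour first.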
<?php
$xml = '<x>
<y>some text</y>
<y>
<a> something </a>
well [ID] hello
<a> and then some</a>
</y>
<y>some text</y>
<x>
world
<a> also </a>
foobar [/ID] something
<a> these nodes </a>
</x>
<y>some text</y>
<y>some text</y>
</x>';
echo $xml;
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[contains(.,\'[ID]\')]') as $elm){
//if this node also contains [/ID], replace and be done:
if(($startpos = strpos($elm->nodeValue,'[ID]'))!==false && ($endpos = strpos($elm->nodeValue,'[/ID]',$startpos))!==false){
$elm->replaceData($startpos, $endpos-$startpos + 5,'');
var_dump($d->saveXML($elm));
continue;
}
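//truncate this textnode so that everything from [ID] onwards is dropped:
$elm->deleteData($startpos, strlen($elm->nodeValue) - $startpos);
//delete all following siblings until a textnode containing [/ID] is found
while($elm->nextSibling){
if(!($elm->nextSibling instanceof DOMText) || ($pos = strpos($elm->nextSibling->nodeValue,'[/ID]'))===false){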
$elm->parentNode->removeChild($elm->nextSibling);
} else {
//id found in same element, replace and go to next [ID]
$elm->parentNode->appendChild(new DOMText(substr($elm->nextSibling->nodeValue,$pos+5)));
$elm->parentNode->removeChild($elm->nextSibling);
continue 2;
}
}
//siblings of textnode deleted, string truncated to before [ID], now let's delete intermediate nodes
while($sibling = $elm->parentNode->nextSibling){ // in case of example: other <y> elements:
//loop through childnodes and search a textnode with [/ID]
while($child = $sibling->firstChild){
//delete if not a textnode
if(!($child instanceof DOMText)){
$sibling->removeChild($child);
continue;
}
//we have text, check for [/ID]
if(($pos = strpos($child->nodeValue,'[/ID]'))!==false){
//add remaining text in textnode:
$elm->appendData(substr($child->nodeValue,$pos+5));
//remove current textnode with match:
$sibling->removeChild($child);
//sanity check: [ID] was in <y>, is [/ID]?
if($sibling->tagName != $elm->parentNode->tagName){
trigger_error('[/ID] found in other tag than [ID]: '.$sibling->tagName.'<>'.$elm->parentNode->tagName, E_USER_NOTICE);
}
//add remaining children of sibling to parent of [ID]:
while($sibling->firstChild){
$elm->parentNode->appendChild($sibling->firstChild);
}
//delete the sibling that was found to hold [/ID]
$sibling->parentNode->removeChild($sibling);
//done: end both whiles
break 2;
}
//textnode, but no [/ID], so remove:
$sibling->removeChild($child);
}
//no child, no text, so no [/ID], remove:
$elm->parentNode->parentNode->removeChild($sibling);
}
}
var_dump($d->saveXML());
?>
For your entertainment and edification, you may want to read this: RegEx match open tags except XHTML self-contained tags
The "correct" solution is to use an XML library and search through the nodes to perform the operation. However, it would probably be much easier to just use a str_replace, even if there's a chance of damaging the XML formatting. You have to gauge the likelihood of receiving something like <a href="[ID]"> and the importance of defending against such cases, and weigh those factors against development time.
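To make that trade-off concrete, here is a small sketch comparing the two approaches on an input where the marker also shows up inside an attribute. The sample XML and the variable names are made up for illustration.

```php
<?php
// Made-up sample: the marker appears both in an attribute and in text.
$xml = '<x><a href="[ID]">link [ID] text</a></x>';

// Plain string replacement touches every occurrence, attributes included.
$viaString = str_replace('[ID]', '', $xml);

// Restricting the replacement to text nodes leaves the attribute alone.
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
foreach ($xpath->query("//text()[contains(., '[ID]')]") as $node) {
    $node->nodeValue = str_replace('[ID]', '', $node->nodeValue);
}
$viaDom = $doc->saveXML($doc->documentElement);

echo $viaString, "\n";
echo $viaDom, "\n";
```

With str_replace the href ends up empty, while the DOM variant keeps href="[ID]" and only changes the link text. Whether that difference matters depends on the input you expect.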
The only other option I can think of is if you could format the xml differently.
<x>
<y>
<z>[ID]</z>