In PHP, why do I have to execute the removeChild method twice on a DOMNode from which I have just removed the style-attribute?
I have a script in PHP which removes empty paragraphs from an HTML file. The empty paragraphs are those <p></p>
elements without textContent.
HTML File with Empty Paragraphs:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!--
This page is used with remove_empty_paragraphs.php script.
This page contains empty paragraphs. The script removes the empty paragraphs and
writes a new HTML file.
-->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<p>This is a paragraph.</p>
<!-- Below is an empty paragraph. -->
<p><span></span></p>
<p>This is another paragraph.</p>
<!-- Below is another empty paragraph. -->
<p class=MsoNormal><b></b></p>
<p style=''></p>
<p>
<span lang=EN-US style='font-size:5.0pt;color:navy;mso-ansi-language:EN-US'></span>
</p>
</body>
</html>
First Attempt:
$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");
/* removeChild foreach-loop */
foreach ($pars as $par) {
if ($par->textContent == "") {
$par->parentNode->removeChild($par);
}
}
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This succeeds in:
- removing empty paragraphs without the style-attribute,
but fails to:
- remove empty paragraphs with the style-attribute.
So I insert the removeStyleAttribute foreach-loop before the removeChild foreach-loop. (I do not mind removing the style-attributes of nonempty paragraphs.)
Second Attempt:
$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");
/* removeStyleAttribute foreach-loop */
foreach ($pars as $par) {
if ($par->hasAttribute("style")) {
$par->removeAttribute("style");
}
}
/* removeChild foreach-loop */
foreach ($pars as $par) {
if ($par->textContent == "") {
$par->parentNode->removeChild($par);
}
}
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This succeeds in:
- removing the style-attributes from empty paragraphs which have the style attribute.
- removing empty paragraphs that do not have the style-attributes.
But it fails to:
- remove those empty paragraphs from which the style-attributes were removed.
So I have to have two removeChild foreach-loops, one after the other.
Third Attempt:
$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");
/* removeStyleAttribute foreach-loop */
foreach ($pars as $par) {
if ($par->hasAttribute("style")) {
$par->removeAttribute("style");
}
}
/* First removeChild foreach-loop */
foreach ($pars as $par) {
if ($par->textContent == "") {
$par->parentNode->removeChild($par);
}
}
/* Second removeChild foreach-loop, identical to the first removeChild foreach-loop */
foreach ($pars as $par) {
if ($par->textContent == "") {
$par->parentNode->removeChild($par);
}
}
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This works perfectly, but it is weird to have two identical loops, one right after the other.
I also tried to use only one loop for everything.
Fourth Attempt:
$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");
foreach ($pars as $par) {
if ($par->textContent == "") {
if ($par->hasAttribute("style")){
$par->removeAttribute("style");
}
$par->parentNode->removeChild($par);
}
}
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This succeeds in:
- removing empty paragraphs without the style-attribute,
but fails to:
- remove the style-attribute from empty paragraphs that have it.
- remove empty paragraphs with the style attribute.
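The list returned by getElementsByTagName is live: removing nodes from the document also removes them from the list. foreach doesn't know the list changed, so after a removal it moves on to what it thinks is the next item, which is actually two items down because the DOMNodeList has shifted up by one. Some of the <p> tags were just plain skipped. The style-attribute itself is irrelevant: in your sample file, <p style=''> directly follows the <p class=MsoNormal> that was just removed, so it is the one that gets skipped.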
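To see the live behaviour in isolation, here is a minimal sketch. The three-paragraph document is a made-up example, not the file from the question; it only shows that the same list object shrinks and shifts when a node is removed from the document.

```php
<?php
// Hypothetical three-paragraph document, used only for this demonstration.
$doc = new DOMDocument();
$doc->loadHTML("<html><body><p>one</p><p></p><p>three</p></body></html>");

$pars = $doc->getElementsByTagName("p");
$before = $pars->length;                 // 3 paragraphs in the list

// Remove the empty paragraph through the DOM, not through the list.
$empty = $pars->item(1);
$empty->parentNode->removeChild($empty);

$after = $pars->length;                  // the same list object now reports 2
$shifted = $pars->item(1)->textContent;  // index 1 now holds the former index 2

echo "$before -> $after, item(1) is now '$shifted'\n";
```

A foreach that has just processed index 1 continues with index 2, so the paragraph that moved into index 1 is never examined.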
Solution: use a for loop (with $pars->item(X) and $pars->length) instead of a foreach, but only increment the index if a node was not deleted. (Or always increment and step back by one if a node was deleted.)
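A sketch of that loop, under a few assumptions: removeEmptyParagraphs is a hypothetical helper name, the inline HTML is a shortened stand-in for the file from the question, and trim() is applied so that whitespace-only paragraphs count as empty.

```php
<?php
/**
 * Removes every <p> whose text content is empty after trimming.
 * removeEmptyParagraphs is a hypothetical helper, not part of the DOM extension.
 * Returns the number of paragraphs removed.
 */
function removeEmptyParagraphs(DOMDocument $doc): int
{
    $pars = $doc->getElementsByTagName("p");   // live list
    $removed = 0;

    // No increment clause: the index only advances when the node is kept.
    for ($i = 0; $i < $pars->length; ) {
        $par = $pars->item($i);
        if (trim($par->textContent) === "") {
            // After removal the next candidate slides into index $i.
            $par->parentNode->removeChild($par);
            $removed++;
        } else {
            $i++;
        }
    }
    return $removed;
}

// Shortened stand-in for the sample file in the question.
$doc = new DOMDocument();
$doc->loadHTML(
    "<html><body><p>Keep</p><p><span></span></p><p style=''></p>" .
    "<p>\n<span lang='EN-US'></span>\n</p><p>Keep too</p></body></html>"
);

$removed = removeEmptyParagraphs($doc);
$remaining = $doc->getElementsByTagName("p")->length;
echo "removed $removed, remaining $remaining\n";
```

With this shape a single pass is enough, and no separate removeStyleAttribute loop is needed.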
Separately: the last <p> (with the large <span>) wasn't deleted because of the whitespace around the <span>, which makes its textContent non-empty. Use trim() on textContent before comparing to get rid of it.
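The whitespace effect can be shown with a paragraph built by hand. This is only an illustration: the nodes are created through the DOM API so that the text content is known exactly, rather than parsed from the file in the question.

```php
<?php
// Hand-built stand-in for a <p> that holds a newline, an empty <span>, and a newline.
$doc = new DOMDocument();
$p = $doc->createElement("p");
$p->appendChild($doc->createTextNode("\n"));
$p->appendChild($doc->createElement("span"));
$p->appendChild($doc->createTextNode("\n"));

$raw = $p->textContent;                   // two newline characters
$looksEmpty = ($raw == "");               // false: whitespace is still content
$emptyAfterTrim = (trim($raw) === "");    // true: trim() strips the newlines

var_dump($looksEmpty, $emptyAfterTrim);
```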
See also my reply in http://forums.devnetwork.net/viewtopic.php?f=1&t=121114&p=623974.
As Tomalak says, it might have something to do with the whitespace. Try disabling "preserveWhiteSpace":
$html->preserveWhiteSpace = false;
Hmm, I'm new here. How do I post this as a comment rather than as an answer?