How can I remove empty paragraphs from an HTML file using simple_html_dom.php?
I want to remove empty paragraphs from an HTML document using simple_html_dom.php. I know how to do it with the DOMDocument class, but the HTML files I work with are prepared in MS Word, and DOMDocument's loadHTMLFile() function fails on them with the exception "Namespaces are not defined".
This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:
<?php
/* Using the DOMDocument class */
/* Create a new DOMDocument object. */
$html = new DOMDocument("1.0", "UTF-8");
/* Load HTML code from an HTML file into the DOMDocument. */
$html->loadHTMLFile("HTML File With Empty Paragraphs.html");
/* Assign all the <p> elements into the $pars DOMNodeList object. */
$pars = $html->getElementsByTagName("p");
echo "The initial number of paragraphs is " . $pars->length . ".<br />";
/* The trim() function is used to remove leading and trailing spaces as well as
* newline characters. */
for ($i = 0; $i < $pars->length; $i++){
if (trim($pars->item($i)->textContent) == ""){
$pars->item($i)->parentNode->removeChild($pars->item($i));
$i--;
}
}
echo "The final number of paragraphs is " . $pars->length . ".<br />";
// Write the HTML code back into an HTML file.
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
?>
This is the code I use with the simple_html_dom.php module for HTML files prepared in MS Word:
<?php
/* Using simple_html_dom.php */
include("simple_html_dom.php");
$html = file_get_html("HTML File With Empty Paragraphs.html");
$pars = $html->find("p");
for ($i = 0; $i < count($pars); $i++) {
if (trim($pars[$i]->plaintext) == "") {
unset($pars[$i]);
$i--;
}
}
$html->save("HTML File without Empty Paragraphs.html");
?>
It is almost the same, except that the $pars variable is a DOMNodeList when using DOMDocument and an array when using simple_html_dom.php. But this code does not work. First it runs for two minutes, and then it reports these errors: "Undefined offset: 1" and "Trying to get property of non-object" for this line: "if (trim($pars[$i]->plaintext) == "") {".
Does anyone know how I can fix this?
Thank you.
I also asked on php devnetwork.
Looking at the documentation for Simple HTML DOM Parser, I think this should do the trick:
include('simple_html_dom.php');
$html = file_get_html('HTML File With Empty Paragraphs.html');
$pars = $html->find('p');
foreach($pars as $par)
{
if(trim($par->plaintext) == '')
{
// To remove an element, set its outertext to an empty string
$par->outertext = '';
}
}
$html->save('HTML File without Empty Paragraphs.html');
I did a quick test and this works for me:
include('simple_html_dom.php');
$html = str_get_html('<html><body><h1>Test</h1><p></p><p>Test</p></body></html>');
$pars = $html->find("p");
foreach($pars as $par)
{
if(trim($par->plaintext) == '')
{
$par->outertext = '';
}
}
echo $html;
// Output: <html><body><h1>Test</h1><p>Test</p></body></html>
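As for why the loop in the question fails, my reading is that unset() on a PHP array removes the key without renumbering the remaining ones. The sketch below uses plain strings as a stand-in for the nodes returned by find(), so it only illustrates the array behaviour and does not involve simple_html_dom.php at all:

```php
<?php
// Stand-in for the array returned by find(): plain strings instead of nodes.
$pars = array("first", "", "third");

// Drop the "empty paragraph" at index 1, as the loop in the question does.
unset($pars[1]);

var_dump(count($pars));       // int(2): the count shrinks...
var_dump(array_keys($pars));  // ...but the remaining keys are still 0 and 2
var_dump(isset($pars[1]));    // bool(false): index 1 no longer exists
```

After the unset() and the $i--, the for loop comes back to index 1, which is now missing. That would explain the "Undefined offset: 1" notice, and since the missing entry trims to an empty string, the condition is true again and the loop appears to repeat the same step until the script times out. Note also that unset() only removes the entry from the local array; it does not touch the document itself, which is why the version above sets outertext instead.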
An empty paragraph looks like <p [attributes]> [spaces or newlines] </p> (case-insensitive). You can use preg_replace (or str_replace) to remove empty paragraphs.
The following will only work if an empty paragraph is exactly <p></p>:
$oldHtml = file_get_contents('File With Empty Paragraphs.html');
$newHtml = str_replace('<p></p>', '', $oldHtml);
// and write the new HTML to the file
$fh = fopen('File Without Empty Paragraphs.html', 'w');
fwrite($fh, $newHtml);
fclose($fh);
This will also work on paragraphs with attributes, like <p class="msoNormal"> </p>:
$oldHtml = file_get_contents('File With Empty Paragraphs.html');
$newHtml = preg_replace('#<p[^>]*>\s*</p>#i', '', $oldHtml);
// and write the new HTML to the file
$fh = fopen('File Without Empty Paragraphs.html', 'w');
fwrite($fh, $newHtml);
fclose($fh);
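A quick way to check the pattern without touching any files is to run it on a string. The sketch below is only an illustration: remove_empty_paragraphs() is a hypothetical helper name, and the pattern is a slightly extended variant of the one above. It adds \b so that tags such as <pre> are not mistaken for <p>, and it also treats &nbsp; as empty content, on the assumption that Word exports often fill empty paragraphs with that entity:

```php
<?php
// Hypothetical helper for illustration; not part of any library.
function remove_empty_paragraphs($html)
{
    // <p\b[^>]*>      an opening <p> tag, with or without attributes
    // (?:\s|&nbsp;)*  nothing but whitespace and/or &nbsp; entities
    // </p>            the closing tag; the i flag makes it all case-insensitive
    return preg_replace('#<p\b[^>]*>(?:\s|&nbsp;)*</p>#i', '', $html);
}

$sample = '<h1>Test</h1><p class="MsoNormal">&nbsp;</p><P> </P><p>Test</p>';
echo remove_empty_paragraphs($sample);
// Output: <h1>Test</h1><p>Test</p>
```

Like any regex-based approach, this will not catch paragraphs that are empty apart from nested tags such as <p><span></span></p>; for those a parser-based solution is the safer choice.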