Fix HTML fragment
I'm trying to learn how to use PHP's DOM functions. As an exercise, I want to repair an invalid HTML fragment. So far, I've been able to produce a full document:
<?php
$fragment = '<div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
<strong><em class=foo>luptate</strong></em>. Excepteur proident,
<div class="bar">sunt in culpa</div> officia est laborum.';
$doc = new DOMDocument;
libxml_use_internal_errors(TRUE);
$doc->loadHTML($fragment);
libxml_use_internal_errors(FALSE);
$doc->formatOutput = TRUE;
echo $doc->saveHTML();
?>
... which prints:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
<strong><em class="foo">luptate</em></strong>. Excepteur proident,
<div class="bar">sunt in culpa</div> officia est laborum.</div>
</div></body></html>
My questions:
- Is there a way to print only the HTML that corresponds to the original fragment?
- Is there a more appropriate built-in library for such a task?
This should work, but it's a bit ugly:
$doc->loadHTML($fragment);
echo simplexml_import_dom( $doc->getElementsByTagName('div')->item(0) )->asXML();
Output:
<div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
<strong><em class="foo">luptate</em></strong>. Excepteur proident,
<div class="bar">sunt in culpa</div> officia est laborum.</div></div>
Slightly more elegant:
$xpath = new DOMXPath($doc);
$query = '/html/body/*'; // loadHTML() always wraps the fragment in <html><body>
$entries = $xpath->query($query);
foreach ($entries as $entry)
{
    echo simplexml_import_dom($entry)->asXML();
}
It seems that the latest PHP versions finally implement this (see the question "How to return outer html of DOMDocument?"): as of PHP 5.3.6, saveHTML() accepts a node argument.
That way we can do this:
if( version_compare(PHP_VERSION, '5.3.6', '>=') ){
    // Assumes <body> is the first child of <html> (true when the fragment has no <head> content)
    $body = $doc->documentElement->firstChild;
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $doc->saveHTML($node);
        }
    }
}
... or this:
if( version_compare(PHP_VERSION, '5.3.6', '>=') ){
    // Look <body> up by name instead of relying on its position
    $body = $doc->getElementsByTagName('body')->item(0);
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $doc->saveHTML($node);
        }
    }
}
Too bad we still need an ugly workaround for older versions.
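For what it's worth, both paths can be wrapped in one function. The following is only a sketch: `repair_fragment` is a made-up name, not something from the thread, and the fallback for PHP older than 5.3.6 uses `saveXML($node)`, which serializes as XML rather than HTML (so empty elements may come out as `<br/>`).

```php
<?php
// Hypothetical helper (the name is an assumption, not part of PHP or the thread):
// repairs an HTML fragment and returns it without doctype/<html>/<body>.
function repair_fragment($fragment)
{
    $doc = new DOMDocument;

    // Collect parser complaints internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // saveHTML() only accepts a node argument as of PHP 5.3.6.
    $canSaveNode = version_compare(PHP_VERSION, '5.3.6', '>=');

    $html = '';
    foreach ($body->childNodes as $node) {
        // Older versions fall back to XML serialization of the node.
        $html .= $canSaveNode ? $doc->saveHTML($node) : $doc->saveXML($node);
    }
    return $html;
}

echo repair_fragment('<p>Hello <b>world</p>');
```

Iterating over `childNodes` rather than querying `/html/body/*` also keeps any text that sits directly inside `<body>`.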
You could strip the wrapper parts that always appear, for example:
$result = $doc->saveHTML();
// saveHTML() puts a line break after the doctype, so remove it separately from <html><body>
$result = preg_replace('~^<!DOCTYPE[^>]*>\s*~i', '', $result);
$result = str_replace(array('<html><body>', '</body></html>'), '', $result);
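If string surgery on the output feels fragile, newer builds can be told not to add the wrapper in the first place. This is a sketch under assumptions: it needs PHP 5.4 or later built against libxml2 2.7.8 or later (where the `LIBXML_HTML_NOIMPLIED` and `LIBXML_HTML_NODEFDTD` constants are available), the sample fragment is made up, and it behaves predictably only when the fragment has a single root element.

```php
<?php
// Sketch only: requires PHP >= 5.4 with libxml2 >= 2.7.8 for these flags.
$fragment = '<div class="bar">sunt in <em>culpa</div>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);
// NOIMPLIED: do not add implied <html>/<body> elements.
// NODEFDTD:  do not add a default doctype.
$doc->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors(false);

// saveHTML() appends a trailing line break, so trim it.
$result = trim($doc->saveHTML());
echo $result;
```

With several top-level elements the parser tends to nest the later ones inside the first, so for fragments like that, walking the children of `<body>` is the safer route.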
You could always try this class:
http://www.barattalo.it/html-fixer/
Usage is apparently as simple as this:
$dirty_html = ".....bad html here......";
$a = new HtmlFixer();
$clean_html = $a->getFixedHtml($dirty_html);
It all depends on what you will be doing with the information.
Well, PHP >= 5.1 apparently also has a DOMDocumentFragment, which has an appendXML function: http://php.net/manual/en/domdocumentfragment.appendxml.php. Maybe you can use that? I'm not sure if it has a string representation of itself, but who knows.
EDIT:
Well, that doesn't work :)
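The likely reason is that appendXML() expects well-formed XML, and the fragment in question is exactly the opposite. A minimal sketch of the difference (the two sample strings are made up for illustration):

```php
<?php
$doc  = new DOMDocument;
$frag = $doc->createDocumentFragment();

// Keep parser errors out of the output.
libxml_use_internal_errors(true);

// Unquoted attribute and mis-nested tags: not well-formed XML,
// so appendXML() is expected to refuse it.
$broken = @$frag->appendXML('<strong><em class=foo>luptate</strong></em>');

// The same markup, already repaired, is accepted.
$valid = $frag->appendXML('<strong><em class="foo">luptate</em></strong>');

libxml_clear_errors();
libxml_use_internal_errors(false);

var_dump($broken, $valid);
```

So the fragment API can hold markup that is already clean, but it won't do the repairing for you.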
What you could do, though, is use SimpleXML, either directly or by creating a DOMElement and then using simplexml_import_dom($domelement)->asXML(): http://php.net/manual/en/function.simplexml-import-dom.php. Good luck! :)