PHP DOMDocument not formatting output correctly
I'm currently working on the sitemaps for a website, and I'm using SimpleXML to import and do some checks on the original XML file. after this I use simplexml_load_file("small.xml");
to convert it to DOMDocument to make it easier to precisely add and manipulate XML elements. Below is the test XML sitemap that i'm working from:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:52:32-Orouke.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:23-castle technology.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:38-banana split.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:42-Waveney.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:55:12-pure orange.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:57:54-tau press.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:21-E.f.m.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:31-apple.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:45-townhouse communications.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
</urlset>
Now. here is the test code I'm using to modify:
<?php
$root = simplexml_load_file("small.xml");
$domRoot = dom_import_simplexml($root);
$dom = $domRoot->ownerDocument;
$urlElement = $dom->createElement("url");
$locElement = $dom->createElement("loc");
$locElement->appendChild($dom->createTextNode("www.google.co.uk"));
$urlElement->appendChild($locElement);
$lastmodElement = $dom->createElement("lastmod");
$lastmodElement->appendChild($dom->createTextNode("2011-08-02"));
$urlElement->appendChild($lastmodElement);
$domRoot->appendChild($urlElement);
$dom->formatOutput = true;
echo $dom->saveXML();
?>
The main problem is, that no matter where i place $dom->formatOutput = true;
the existing XML that was imported from SimpleXML is formatted correctly, but anything new is formatted in the "all one line" style, as follows:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:52:32-Orouke.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:23-castle technology.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:38-banana split.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:42-Waveney.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:55:12-pure orange.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:57:54-tau press.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
开发者_运维知识库<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:21-E.f.m.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:31-apple.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url>
<loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:45-townhouse communications.html</loc>
<lastmod>2011-08-23</lastmod>
</url>
<url><loc>www.google.co.uk</loc><lastmod>2011-08-02</lastmod></url></urlset>
If anyone has an idea why this is happening and how to fix it I would be very grateful.
There is a workaround. You can force reformatting by saving your new xml to string first, then load it again after setting the formatOutput property, e.g.:
$strXml = $dom->saveXML();
$dom->formatOutput = true;
$dom->loadXML($strXml);
echo $dom->saveXML();
To format output nicely, you need to set the preserveWhiteSpace
variable to false
before loading as stated in the documentation
Example:
$Xhtml = "<div><span></span></div>";
$doc = new DOMDocument('1.0','UTF-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML($Xhtml);
$formattedXhtml = $doc->saveXML($doc->documentElement, LIBXML_NOXMLDECL);
$expectedFormatting =<<<EOF
<div>
<span/>
</div>
EOF;
$this->assertEquals($expectedFormatting,$formattedXhtml,"The XHTML is formatted");
Just for the visitor that comes here as this was the first answer on Google Search.
I had this same problem using code like Simon's.
Turns out that when you disable errors (either with $doc->loadHTML(..., LIBXML_NOERROR)
or libxml_use_internal_errors(true);
), it won't format anymore (example: https://3v4l.org/ur76E).
The solution is to not disable errors and suppress them on the PHP side (with @
).
Ugly, but it works: https://3v4l.org/BSJVu
The final silver bullet function looks like:
function beautifyDoc(DOMDocument $doc): void
{
$previousLibXmlState = libxml_use_internal_errors(false);
$previousErrorHandler = set_error_handler(null);
try {
$html = $doc->saveHTML();
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
@$doc->loadHTML($html);
} finally {
libxml_use_internal_errors($previousLibXmlState);
set_error_handler($previousErrorHandler);
}
}
// usage
$doc = new DOMDocument();
// ...load html and do stuff...
beautifyDoc($doc);
echo $doc->saveHTML(); // done
(it also takes care of the php error handler, if already set)
精彩评论