DOMDocument appendXML with special characters
I am retrieving some HTML strings from my database and I would like to parse these strings into my DOMDocument. The problem is that DOMDocument gives warnings at special characters.
Warning: DOMDocumentFragment::appendXML() [domdocumentfragment.appendxml]: Entity: line 2: parser error : Entity 'nbsp' not defined in page.php on line 189
I wonder why this happens and how to solve it. These are some code fragments of my page. How can I fix these kinds of warnings?
$doc = new DOMDocument();
// .. create some elements first, like some divs and a h1 ..
while($row = mysql_fetch_array($result))
{
$messageEl = $doc->createDocumentFragment();
$messageEl->appendXML($row['message']); // gives its warnings here!
$otherElement->appendChild($messageEl);
}
echo $doc->saveHTML();
I also found something about validation, but when I apply that, my page won't load anymore. The code I tried for that was something like this.
$implementation = new DOMImplementation();
$dtd = $implementation->createDocumentType('html','-//W3C//DTD XHTML 1.0 Transitional//EN','http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd');
$doc = $implementation->createDocument('','',$dtd);
$doc->validateOnParse = true;
$doc->formatOutput = true;
// in the same while loop, I used the following:
$messageEl = $doc->createDocumentFragment();
$doc->validate(); // which stopped my code, without any error or warning.
$messageEl->appendXml($row['message']);
Thanks in advance!
There is no &nbsp; in XML. The only character entities that have an actual name defined (instead of using a numeric reference) are &amp;, &lt;, &gt;, &quot; and &apos;.
That means you have to use the numeric equivalent of a non-breaking space, which is &#160; or (in hex) &#xA0;.
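As a minimal sketch of that advice (the $message value is a hypothetical stand-in for $row['message'], and only &nbsp; is handled here), the named entity can be swapped for its numeric reference before the string reaches appendXML():

```php
<?php
// Hypothetical stand-in for $row['message'] from the question.
$message = '<p>Hello&nbsp;world</p>';

// &nbsp; is not defined in plain XML, so replace it with the numeric reference.
$xmlSafe = str_replace('&nbsp;', '&#160;', $message);

$doc = new DOMDocument('1.0', 'UTF-8');
$container = $doc->createElement('div');
$doc->appendChild($container);

$fragment = $doc->createDocumentFragment();
// appendXML() returns false (and emits a warning) if the string is not well-formed XML.
$ok = $fragment->appendXML($xmlSafe);
if ($ok) {
    $container->appendChild($fragment);
}

echo $doc->saveHTML();
```

Other named HTML entities in the stored strings would need the same treatment, and the markup itself still has to be well-formed XML.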
If you are trying to save HTML into an XML container, then save it as text. HTML and XML may look similar but they are very distinct. appendXML() expects well-formed XML as an argument. Use the nodeValue property instead; it will XML-encode your HTML string without any warnings.
// document fragment is completely unnecessary
$otherElement->nodeValue = $row['message'];
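A related way to store the string as plain text, sketched here under the assumption that escaped markup is really what you want, is an explicit text node. DOMDocument::createTextNode() escapes the markup characters on output; the variable names below are illustrative only:

```php
<?php
// Hypothetical stand-in for $row['message'].
$message = '<b>Hi&nbsp;there</b>';

$doc = new DOMDocument();
$otherElement = $doc->createElement('div');
$doc->appendChild($otherElement);

// The string becomes character data, so <, > and & are escaped when serialised.
$otherElement->appendChild($doc->createTextNode($message));

$output = $doc->saveHTML();
echo $output;
```

The browser will then display the tags literally instead of rendering them, which is the trade-off of storing HTML as text.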
That's a tricky one because it's actually multiple issues in one.
Like Tomalak points out, there is no &nbsp; in XML. So you did the right thing specifying a DOMImplementation, because in XHTML there is &nbsp;. But for DOM to know that the document is XHTML, you have to load and validate against the DTD. The DTD is located at http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd but because there are millions of requests to that page daily, the W3C decided to block access to the page unless a User-Agent is sent in the request. To supply a User-Agent you have to create a custom stream context.
In code:
// make sure DOM passes a User Agent when it fetches the DTD
libxml_set_streams_context(
stream_context_create(
array(
'http' => array(
'user_agent' => 'PHP libxml agent',
)
)
)
);
// specify the implementation
$imp = new DOMImplementation;
// create a DTD (here: for XHTML)
$dtd = $imp->createDocumentType(
'html',
'-//W3C//DTD XHTML 1.0 Transitional//EN',
'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
);
// then create a DOMDocument with the configured DTD
$dom = $imp->createDocument(NULL, "html", $dtd);
$dom->encoding = 'UTF-8';
$dom->validate();
$fragment = $dom->createDocumentFragment();
$fragment->appendXML('
<head><title>XHTML test</title></head>
<body><p>Some text with a &nbsp; entity</p></body>
'
);
$dom->documentElement->appendChild($fragment);
$dom->formatOutput = TRUE;
echo $dom->saveXml();
This still takes some time to complete (don't ask me why), but in the end you'll get (reformatted for SO):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>XHTML test</title>
</head>
<body>
<p>Some text with a &nbsp; entity</p>
</body>
</html>
Also see DOMDocument::validate() problem
I do see the problem in question, and also that the question has been answered, but if I may I'd like to suggest a thought from my past dealing with similar problems.
It just might be so that your task requires including tagged data from the database in the resulting XML, but may or may not require parsing. If it's merely data for inclusion, and not structured parts of your XML, you can place strings from the database in CDATA section(s), effectively bypassing all validation errors at this stage.
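A minimal sketch of the CDATA idea, assuming the message only needs to be carried along and not parsed (the element name and sample string are hypothetical):

```php
<?php
// Hypothetical stand-in for a message string fetched from the database.
$message = '<b>Hi&nbsp;there</b>';

$doc = new DOMDocument('1.0', 'UTF-8');
$messageEl = $doc->createElement('message');
$doc->appendChild($messageEl);

// The parser never looks inside a CDATA section, so &nbsp; raises no warning.
$messageEl->appendChild($doc->createCDATASection($message));

$xml = $doc->saveXML();
echo $xml;
```

The content is stored as character data, not as child elements, and a string that itself contains the sequence ]]> would need separate handling.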
Here's another approach, because we did not want possibly slow network requests (or any network requests at all resulting from user input):
<?php
$document = new \DOMDocument();
$document->loadHTML('<html><body></body></html>');
$html = '<b>test&nbsp;</b>';
$fragment = $document->createDocumentFragment();
$html = '<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document [
<!ENTITY nbsp "&#160;" >
]>
<document>'.$html.'</document>';
$newdom = new \DOMDocument();
$newdom->loadXML($html, LIBXML_HTML_NOIMPLIED | LIBXML_NOCDATA | LIBXML_NOENT | LIBXML_NONET | LIBXML_NOBLANKS);
foreach ($newdom->documentElement->childNodes as $childnode)
$fragment->appendChild($fragment->ownerDocument->importNode($childnode, TRUE));
$document->getElementsByTagName('body')[0]->appendChild($fragment);
echo $document->saveHTML();
Here we include the relevant part of the DTD, specifically the latin1 entity definitions as an internal DOCTYPE definition. Then the HTML content is wrapped in a document element to be able to process a sequence of child elements. The parsed nodes are then imported and added into the target DOM.
Our actual implementation uses file_get_contents to load the DTD containing all entity definitions from a local file.
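If no local DTD file is at hand, one possible substitute (not the implementation described above) is to generate the entity declarations from PHP's own translation table. The sketch below assumes the HTML 4.01 entity set is sufficient; buildEntityDeclarations() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: builds <!ENTITY ...> declarations for the HTML 4.01 named entities.
function buildEntityDeclarations(): string
{
    // Maps each character to its named entity, e.g. "\u{A0}" => "&nbsp;".
    $table = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');
    $declarations = '';
    foreach ($table as $char => $entity) {
        $name = substr($entity, 1, -1); // "&nbsp;" becomes "nbsp"
        if (in_array($name, ['amp', 'lt', 'gt'], true)) {
            continue; // already predefined in XML
        }
        $declarations .= '<!ENTITY ' . $name . ' "' . $char . '">' . "\n";
    }
    return $declarations;
}

$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . "<!DOCTYPE document [\n" . buildEntityDeclarations() . "]>\n"
     . '<document><b>caf&eacute;&nbsp;&copy;</b></document>';

$newdom = new DOMDocument();
// LIBXML_NOENT substitutes the declared entities; LIBXML_NONET forbids network access.
$newdom->loadXML($xml, LIBXML_NOENT | LIBXML_NONET);
$text = $newdom->documentElement->textContent;
echo $text;
```

The resulting nodes can then be imported into the target document with importNode() exactly as in the snippet above.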
While smarty might be a good bet (why reinvent the wheel for the 14th time?), etranger might have a point. There are situations in which you don't want to use something as heavyweight as a completely new (and unstudied) package, and you simply want to output some data from a database that happens to contain HTML an XML parser has issues with.
Warning: the following is a simple solution, but don't do it unless you're SURE you can get away with it! (I did this when I had about 2 hours before a deadline and didn't have time to study, let alone implement, something like smarty...)
Before passing the string to appendXML(), run it through preg_replace. For instance, replace all &nbsp; entities with [some_prefix]_nbsp. Then, on the page where you show the HTML, do it the other way around.
And Presto! =)
Example code: Code that puts text into a document fragment:
// add text tag to p tag.
// print("CCMSSelTextBody::getDOMObject: strText: ".$this->m_strText."<br>\n");
$this->m_strText = preg_replace("/&nbsp;/", "__nbsp__", $this->m_strText);
$domTextFragment = $domDoc->createDocumentFragment();
$domTextFragment->appendXML(utf8_encode($this->m_strText));
$p->appendChild($domTextFragment);
// $p->appendChild(new DOMText(utf8_encode($this->m_strText)));
Code that parses the string and writes the HTML:
// Instantiate template.
$pTemplate = new CTemplate($env, $pageID, $pUser, $strState);
// Parse tag-sets.
$pTemplate->parseTXTTags();
$pTemplate->parseCMSTags();
// present the html code.
$html = $pTemplate->getPageHTML();
$html = preg_replace("/__nbsp__/", "&nbsp;", $html);
print($html);
It's probably a good idea to think up a stronger replacement string. (If you insist on being thorough: take the md5 of a time() value and hardcode the result as a prefix.) So, in the first snippet:
$this->m_strText = preg_replace("/&nbsp;/", "4597ee308cd90d78aa4655e76bf46ee0_nbsp", $this->m_strText);
And in the second:
$html = preg_replace("/4597ee308cd90d78aa4655e76bf46ee0_nbsp/", "&nbsp;", $html);
Do the same for any other tags and stuff you need to circumvent.
This is a hack, and not good code by any stretch of the imagination. But it saved my life, and I wanted to share it with other people who run into this particular problem with minutes to spare.
Use the above at your own risk.