libxml2 fails to handle CDATA in HTML correctly
I'm using libxml2 2.7.3 to parse HTML pages, and I'm having difficulty getting it to work correctly with CDATA in HTML. Here's the code:
xmlDocPtr doc = htmlReadMemory(data, length, "", NULL, 0);
xmlBufferPtr buffer = xmlBufferCreate();
xmlNodeDump(buffer, doc, doc->children, 0, 0);
printf("%s", (char*)buffer->content);
and the HTML data:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<div>
<script type="text/javascript">
//<![CDATA[
document.write('</div>');
//]]>
</script>
</div>
</body></html>
The parser erroneously recognizes the </div> inside the quotes as a real html tag and prints out error messages as follows:
:8: HTML parser error : Unexpected end tag : script
</script>
^
:9: HTML parser error : Unexpected end tag : div
</div>
^
Both the printed result and debugging suggest that parsing went wrong:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html><body> <div> <script type="text/javascript"><![CDATA[ //<![CDATA[ document.write(']]></script></div>'); //]]> </body></html>
So the question is: is this a bug in libxml2, or am I doing something wrong?
Any insight would be greatly appreciated. Thanks!

In HTML, the <script> element contains CDATA by definition, so <![CDATA[ has no effect.
In short, the source document is broken.
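To see why the parser trips over the string literal, it may help to sketch the HTML 4.01 rule for script data (Appendix B.3.2 of the specification): the content ends at the first `</` that is followed by a letter, regardless of quotes or CDATA markers. The helper `find_script_end` below is hypothetical, not a libxml2 function, and only approximates that rule for illustration; it makes no claim about how libxml2 is implemented internally.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): returns a pointer to the first
 * "</" in `content` that is followed by an ASCII letter, which is where an
 * HTML 4 parser would consider the script data to end. Returns NULL if no
 * such sequence exists. */
const char *find_script_end(const char *content)
{
    const char *p;
    for (p = strstr(content, "</"); p != NULL; p = strstr(p + 2, "</")) {
        char c = p[2];
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            return p;
    }
    return NULL;
}
```

Applied to the script body from the question, this points at the `</div>` inside the quotes rather than at `</script>`, which is consistent with the error messages shown above.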
That section would be more properly written as follows, with the slash escaped so that the string no longer contains the sequence </:
<script type="text/javascript">
document.write('<\/div>');
</script>
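If the script body is generated by your own code, the same escaping can be applied mechanically before the HTML is written out. Below is a minimal sketch, assuming a hypothetical helper named `escape_end_tag_open` (it is not part of libxml2). It rewrites every `</` as `<\/`, which JavaScript treats as the same characters inside a string literal. Note that it escapes blindly, so it should be applied to the script content only, not to the surrounding `<script>` and `</script>` tags.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): returns a newly allocated copy
 * of `js` in which every "</" is rewritten as "<\/". The caller must free
 * the result. Returns NULL if allocation fails. */
char *escape_end_tag_open(const char *js)
{
    size_t len = strlen(js);
    /* Each "</" pair grows by one byte, and there are at most len / 2 pairs. */
    char *out = malloc(len + len / 2 + 1);
    char *dst = out;
    const char *src;

    if (out == NULL)
        return NULL;
    for (src = js; *src != '\0'; ++src) {
        *dst++ = *src;
        if (src[0] == '<' && src[1] == '/')
            *dst++ = '\\';  /* insert a backslash between '<' and '/' */
    }
    *dst = '\0';
    return out;
}
```

For example, passing `document.write('</div>');` yields `document.write('<\/div>');`, matching the corrected script above.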