libxml2 fails to handle CDATA in HTML correctly

I'm using libxml2 2.7.3 to parse HTML pages, and I'm having difficulty getting it to work correctly with CDATA in HTML. Here's the code:

xmlDocPtr doc = htmlReadMemory(data, length, "", NULL, 0);
xmlBufferPtr buffer = xmlBufferCreate();
xmlNodeDump(buffer, doc, doc->children, 0, 0);
printf("%s", (char*)buffer->content);

and the HTML data:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
  <div>
    <script type="text/javascript"> 
    //<![CDATA[
      document.write('</div>');
    //]]>
    </script>
  </div>
</body></html>

The parser erroneously recognizes the </div> inside the quotes as a real html tag and prints out error messages as follows:

:8: HTML parser error : Unexpected end tag : script
    </script>
             ^
:9: HTML parser error : Unexpected end tag : div
  </div>
        ^

The dumped result, and what I see when debugging, also suggest that parsing went wrong:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
  <div>
    <script type="text/javascript"><![CDATA[ 
    //<![CDATA[
      document.write(']]></script></div>');
    //]]>


</body></html>

So the question is: is this a bug in libxml2, or am I doing something wrong?

Any insight would be greatly appreciated. Thanks!


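In HTML 4, the content of a <script> element is CDATA by definition, so the <![CDATA[ marker has no special meaning there. Per the HTML 4 specification, script content ends at the first "</" followed by a letter, which is why the </div> inside the string is treated as an end tag.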

In short, the source document is broken.

That section would be more properly written as:

<script type="text/javascript"> 
  document.write('<\/div>');
</script>
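
If the script body is generated by a program, the same escaping can be applied before the markup is written. The following is only a minimal sketch: escape_end_tag_open is a hypothetical helper, not part of libxml2, and it assumes every "</" in the input sits inside a JavaScript string literal, where "<\/" evaluates to the same text.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a libxml2 function): returns a newly allocated
 * copy of src in which every "</" is rewritten as "<\/".
 * Assumes each "</" occurs inside a JavaScript string literal.
 * The caller frees the result; returns NULL if allocation fails. */
char *escape_end_tag_open(const char *src)
{
    size_t len = strlen(src);

    /* At most one backslash is added per '<', so 2 * len + 1 bytes
     * is always enough, including the terminating NUL. */
    char *out = malloc(2 * len + 1);
    if (out == NULL)
        return NULL;

    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        out[j++] = src[i];
        /* src[i + 1] is at worst the terminating NUL, so this read is safe. */
        if (src[i] == '<' && src[i + 1] == '/')
            out[j++] = '\\';
    }
    out[j] = '\0';
    return out;
}
```

Passing the script line from the question through this helper should yield the escaped form shown above, which no longer contains a "</" sequence for an HTML 4 parser to act on.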