How can I get the content of the HTML <body>?
Given this HTML:
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
How can I use a DOM parser in Java to get the content of body, including the tags:
text
<div>
text2
<div>
text3
</div>
</div>
The getTextContent method returns only: text text2 text3, so without tags.
It is possible with SAX, but is it possible with DOM, too?
getTextContent is behaving as I would expect: it returns the textual content of the HTML fragment. Can you check the API docs for the DOM parser and see if there's a similar method with a name like getHtmlContent?
You would need to parse the document into a DOM and serialise only the portion of the DOM you wanted. Using the DOM Level 3 LS interfaces you can serialise the outer-XML of a single node with:
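// implementation is a DOMImplementationLS, e.g. (DOMImplementationLS) document.getImplementation().getFeature("LS", "3.0")
LSSerializer serializer= implementation.createLSSerializer();
String html= serializer.writeToString(node);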
To get the inner-XML you would need to writeToString each child node in turn (e.g. into a StringBuffer).
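The child-by-child approach could be sketched roughly as follows. This is only an illustration under some assumptions: the input is well-formed enough to be parsed as XML by the JDK's built-in parser, the DOM implementation supports the "LS" 3.0 feature, and the class and method names (InnerXml, innerXml, bodyContent) are made up for this example. The "xml-declaration" parameter is switched off so that the serialiser does not prepend an XML declaration to each fragment.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
class InnerXml {

    // Serialises every child of the given node in turn and concatenates the results.
    static String innerXml(Node node) {
        Document doc = node.getOwnerDocument();
        // Ask the DOM implementation for its Load and Save (LS) support.
        DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        // Suppress the <?xml ...?> declaration on each serialised fragment.
        serializer.getDomConfig().setParameter("xml-declaration", false);

        StringBuilder sb = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            sb.append(serializer.writeToString(child));
        }
        return sb.toString();
    }

    // Parses the markup as XML and returns the inner-XML of the first <body> element.
    static String bodyContent(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            Node body = doc.getElementsByTagName("body").item(0);
            return innerXml(body);
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String markup =
                "<html><head></head><body>text<div>text2<div>text3</div></div></body></html>";
        System.out.println(bodyContent(markup));
    }
}
```

Real-world HTML that is not well-formed XML would be rejected by this parser, so an HTML-aware parser that produces a DOM would be needed in that case.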
Depending on what DOM implementation you are using there may be alternative non-standard methods. There may also be risks with serialising HTML as XML, if that's what you're doing, e.g. a standard XML serialiser may output a self-closing tag for an empty element, which can confuse browsers parsing the output as legacy HTML.