Getting cleaned HTML in text from HtmlCleaner
I want to see the cleaned HTML that we get from HtmlCleaner. I see there is a method called serialize on TagNode, but I don't know how to use it. Does anybody have any sample code for it?
Thanks, Nayn
Here's the sample code:
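HtmlCleaner htmlCleaner = new HtmlCleaner();
// url is assumed to be a java.net.URL; a String argument would be parsed as HTML content instead
TagNode root = htmlCleaner.clean(url);
// getInnerHtml() returns only the markup inside root, so wrap it in the root tag again
String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";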
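The last line rebuilds the root element by hand around its inner HTML. As a minimal stand-alone sketch of just that string assembly (wrapInTag is a hypothetical helper, not part of HtmlCleaner, and the input strings are made up rather than real cleaner output), it might look like this. Note that only the tag name is used, so any attributes on the root element are not reproduced.

```java
class WrapInTagDemo {
    // Hypothetical helper, not an HtmlCleaner API: rebuilds an element
    // from its tag name and its already-serialized inner HTML.
    static String wrapInTag(String name, String innerHtml) {
        return "<" + name + ">" + innerHtml + "</" + name + ">";
    }

    public static void main(String[] args) {
        // Made-up inner HTML standing in for htmlCleaner.getInnerHtml(root)
        String inner = "<head></head><body><p>hi</p></body>";
        // Prints: <html><head></head><body><p>hi</p></body></html>
        System.out.println(wrapInTag("html", inner));
    }
}
```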
Use a subclass of org.htmlcleaner.XmlSerializer, for example:
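import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XmlSerializer;

// get the element you want to serialize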
HtmlCleaner cleaner = new HtmlCleaner();
TagNode rootTagNode = cleaner.clean(url);
// set up properties for the serializer (optional, see online docs)
CleanerProperties cleanerProperties = cleaner.getProperties();
cleanerProperties.setOmitXmlDeclaration(true);
// use the getAsString method on an XmlSerializer class
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String html = xmlSerializer.getAsString(rootTagNode);
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String html = xmlSerializer.getAsString(rootTagNode);
The method above has a problem: it trims the text content inside an HTML tag. For example,
<p> this is paragraph1. </p>
will become
<p>this is paragraph1.</p>
It is the getSingleLineOfChildren function that does the trimming. This is a problem if we fetch data from a website and want to keep the format, like tuckunder.
PS: if an HTML tag has child tags, the parent tag's content will not be trimmed.
For example, <p> this is paragraph1. <a>www.xxxxx.com</a> </p>
will keep the whitespace before "this is paragraph1".
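To make the rule described above concrete, here is a toy model of it in plain Java. This is only a sketch of the behaviour as this answer describes it, not HtmlCleaner's actual code: Node, text, element and serialize are all hypothetical names, and the real PrettyXmlSerializer also adds indentation and line breaks that are left out here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TrimModelDemo {
    // Toy node type (hypothetical, unrelated to HtmlCleaner's TagNode/ContentNode).
    static final class Node {
        final String name;          // tag name; null for a text node
        final String text;          // text value; null for an element
        final List<Node> children;  // empty for a text node

        Node(String name, String text, List<Node> children) {
            this.name = name;
            this.text = text;
            this.children = children;
        }

        boolean isText() {
            return name == null;
        }
    }

    static Node text(String value) {
        return new Node(null, value, Collections.<Node>emptyList());
    }

    static Node element(String name, Node... children) {
        return new Node(name, null, Arrays.asList(children));
    }

    // Models the described rule: text is trimmed only when the element
    // has no child elements; otherwise whitespace is kept as-is.
    static String serialize(Node node) {
        if (node.isText()) {
            return node.text;
        }
        boolean textOnly = true;
        for (Node child : node.children) {
            if (!child.isText()) {
                textOnly = false;
            }
        }
        StringBuilder out = new StringBuilder("<").append(node.name).append(">");
        for (Node child : node.children) {
            String s = serialize(child);
            out.append(textOnly && child.isText() ? s.trim() : s);
        }
        return out.append("</").append(node.name).append(">").toString();
    }

    public static void main(String[] args) {
        // Text-only element: prints <p>this is paragraph1.</p>
        System.out.println(serialize(element("p", text(" this is paragraph1. "))));
        // Element with a child tag: prints <p> this is paragraph1. <a>www.xxxxx.com</a> </p>
        System.out.println(serialize(element("p",
                text(" this is paragraph1. "),
                element("a", text("www.xxxxx.com")),
                text(" "))));
    }
}
```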