
Getting cleaned HTML as text from HtmlCleaner

I want to see the cleaned HTML that we get from HtmlCleaner. I see there is a method called serialize on TagNode, but I don't know how to use it. Does anybody have any sample code for it?

Thanks Nayn


Here's the sample code:

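HtmlCleaner htmlCleaner = new HtmlCleaner();

// clean() parses the page at the given URL and returns the root node
TagNode root = htmlCleaner.clean(url);

// getInnerHtml() is an instance method and returns only the markup of the children,
// so wrap the result in the root tag yourself (attributes on the root are not included)
String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";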
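The concatenation above rebuilds the root tag from its name only, so any attributes on the root element are lost. A minimal sketch of a helper that also writes the attributes is shown here. `OuterHtml`, `wrap` and `escapeAttribute` are hypothetical names, not part of HtmlCleaner, and the sketch assumes you pass in the values of `root.getName()`, `root.getAttributes()` and `htmlCleaner.getInnerHtml(root)` yourself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of HtmlCleaner.
class OuterHtml {

    // Rebuilds <name attr="value">inner</name> from the pieces of a node.
    static String wrap(String tagName, Map<String, String> attributes, String innerHtml) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(tagName);
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            sb.append(' ').append(attr.getKey())
              .append("=\"").append(escapeAttribute(attr.getValue())).append('"');
        }
        sb.append('>').append(innerHtml);
        sb.append("</").append(tagName).append('>');
        return sb.toString();
    }

    // Escapes the characters that would break a double-quoted attribute value.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("lang", "en");
        // prints <html lang="en"><head></head><body><p>hi</p></body></html>
        System.out.println(wrap("html", attrs, "<head></head><body><p>hi</p></body>"));
    }
}
```

The inner HTML is appended unchanged, so whatever whitespace getInnerHtml returned is kept as it is.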


Use a subclass of org.htmlcleaner.XmlSerializer, for example:

// get the element you want to serialize
HtmlCleaner cleaner     = new HtmlCleaner();
TagNode     rootTagNode = cleaner.clean(url);

// set up properties for the serializer (optional, see online docs)
CleanerProperties cleanerProperties = cleaner.getProperties();
cleanerProperties.setOmitXmlDeclaration(true);

// use the getAsString method on an XmlSerializer class
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String        html          = xmlSerializer.getAsString(rootTagNode);


XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);

String html = xmlSerializer.getAsString(rootTagNode);

The method above has a problem: it trims the text content of an HTML element. For example,

<p> this is paragraph1. </p>

will become

<p>this is paragraph1.</p>

The trimming is done by the getSingleLineOfChildren function. This matters if we fetch data from a website and want to keep its formatting, such as tuckunder.

PS: if an HTML element has child elements, the parent element's content is not trimmed.

For example, <p> this is paragraph1. <a>www.xxxxx.com</a> </p> will keep the whitespace before "this is paragraph1".
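To make the rule described above concrete, here is a toy model of it in plain Java. It is only a sketch of the behaviour as this answer describes it, not HtmlCleaner's actual getSingleLineOfChildren code; `TrimRuleSketch`, `Child` and `serializeChildren` are hypothetical names, and the real serializer may also add line breaks and indentation that this model ignores.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the described rule; NOT HtmlCleaner's actual implementation.
class TrimRuleSketch {

    // A child is either a text run or a nested element (hypothetical type).
    static final class Child {
        final String text;    // raw text, or already-serialized markup for a tag
        final boolean isTag;  // true when the child is a nested element
        Child(String text, boolean isTag) { this.text = text; this.isTag = isTag; }
    }

    // Text-only content is trimmed; content with a nested element is kept as-is.
    static String serializeChildren(List<Child> children) {
        boolean textOnly = true;
        StringBuilder sb = new StringBuilder();
        for (Child c : children) {
            if (c.isTag) textOnly = false;
            sb.append(c.text);
        }
        return textOnly ? sb.toString().trim() : sb.toString();
    }

    public static void main(String[] args) {
        List<Child> textOnly = Arrays.asList(new Child(" this is paragraph1. ", false));
        List<Child> mixed = Arrays.asList(
                new Child(" this is paragraph1. ", false),
                new Child("<a>www.xxxxx.com</a>", true),
                new Child(" ", false));
        // prints [this is paragraph1.]
        System.out.println("[" + serializeChildren(textOnly) + "]");
        // prints [ this is paragraph1. <a>www.xxxxx.com</a> ]
        System.out.println("[" + serializeChildren(mixed) + "]");
    }
}
```

If the whitespace matters, it may be worth comparing the output of the other serializer classes the library ships against your input before settling on one.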
