Handling special entities like &nbsp;, &pound; in HtmlCleaner

2023-01-27 10:30

I am using the HtmlCleaner library for HTML content extraction. It works fairly well, but with a few limitations.

It is not able to handle special characters such as &pound; or quotes. For example, for the URL http://www.basicelegancefurnishings.co.uk/alaska-3-and-2-seater-sofa-setspan-classukmadespan-p-280.html, when I give the XPath to the price, it returns "&pound;" in place of £.

Is there any property we can set in HtmlCleaner to handle this, or is there any other solution?

Thanks

Jitendra

No, I don't believe HtmlCleaner can do this. However, you can use StringEscapeUtils from Apache Commons Lang to unescape the HTML, like this:

StringEscapeUtils.unescapeHtml("&pound;679.00");

will produce £679.00.
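To make clear what "unescaping" means here, the following is a minimal stdlib-only sketch of the idea. It is not the Commons implementation: EntityDecoder and its small entity table are hypothetical names introduced for illustration, and only a handful of named entities are covered, plus decimal and hexadecimal numeric references.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HtmlCleaner or Apache Commons.
final class EntityDecoder {
    // Deliberately tiny table (an assumption); real libraries cover the full entity list.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("pound", "\u00A3");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    // Matches &name;, &#163; and &#xA3; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);");

    // Converts the digits of a numeric reference to text, or returns the fallback if invalid.
    private static String fromCodePoint(String digits, int radix, String fallback) {
        try {
            return new String(Character.toChars(Integer.parseInt(digits, radix)));
        } catch (IllegalArgumentException e) { // also covers NumberFormatException
            return fallback;
        }
    }

    // Replaces each recognised entity once; unknown entities are left untouched.
    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                replacement = NAMED.containsKey(body) ? NAMED.get(body) : m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Should print the pound sign followed by 679.00 (console encoding permitting).
        System.out.println(decode("&pound;679.00"));
    }
}
```

In practice a maintained library is the better choice, since it knows the complete entity list; the sketch only shows the mechanism.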

Instead of HtmlCleaner, I would recommend you try JSoup.

I am using HtmlCleaner 2.2, and org.htmlcleaner.CleanerProperties - setTransSpecialEntitiesToNCR(true) works for me. However, I still have to call string.replace("&nbsp;", " ") to get the HTML content completely right.
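The extra replacement step can be sketched as below. This is only an illustration under assumptions: NbspNormalizer is a hypothetical helper, and it assumes the leftover non-breaking spaces may appear as the entity &nbsp;, the numeric reference &#160;, or the character U+00A0 itself, depending on how the cleaner was configured.

```java
// Hypothetical helper for illustration; not part of HtmlCleaner.
final class NbspNormalizer {
    // Turns the three assumed forms of a non-breaking space into a plain space.
    static String normalize(String text) {
        return text
                .replace("&nbsp;", " ")   // named entity left in the extracted text
                .replace("&#160;", " ")   // numeric character reference form
                .replace('\u00A0', ' ');  // the decoded character itself
    }

    public static void main(String[] args) {
        System.out.println(normalize("679.00&nbsp;GBP"));
    }
}
```

Handling U+00A0 explicitly matters because String.trim() and comparisons against " " do not treat it as an ordinary space.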

This can now be done through org.htmlcleaner.CleanerProperties - setTransSpecialEntitiesToNCR(true).
