
Java library for HTML analysis

(I've seen similar questions, but I think none of them cater to my specific needs, hence...)

I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like:

  • figuring out the most prominent color in an HTML chunk
  • changing that color to some other color (hence, has to support modification of the HTML as well)
  • pruning out unwanted tags
  • fixing up the HTML to result in a well formed HTML snippet

Libraries such as Jericho and JTidy already cover parts of the last two items. 'Plugins' on top of these would be great.

Thanks in advance!


You might want to check out TagSoup:

http://home.ccil.org/~cowan/XML/tagsoup/


I would first tidy it into valid XML, then use XSLT to do a conditional deep copy, applying whatever processing you need along the way (most-prominent-color handling, pruning, and so on).
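A minimal sketch of the "conditional deep copy" idea, using only the XSLT 1.0 support bundled with the JDK. It assumes the input has already been tidied into well-formed XML with no namespace (tidied XHTML usually carries the XHTML namespace, in which case the match patterns would need a prefix). The class name `XsltPruneSketch` and the choice of `script` and `font` as unwanted tags are purely illustrative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: identity transform plus overrides for unwanted tags.
final class XsltPruneSketch {
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        // Identity template: copy every node and attribute unchanged.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // Drop <script> elements together with their content.
        + "<xsl:template match='script'/>"
        // Unwrap <font> elements: discard the tag, keep the children.
        + "<xsl:template match='font'><xsl:apply-templates select='node()'/></xsl:template>"
        + "</xsl:stylesheet>";

    // Input must be well-formed XML, e.g. the output of a tidy step.
    static String prune(String wellFormedXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(wellFormedXml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT processing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(prune(
            "<div><script>x()</script><font color=\"red\">hi</font></div>"));
    }
}
```

The more specific templates win over the identity template because of XSLT's default priority rules, so adding another pruning rule is just one more `xsl:template`.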


Take a look at JTidy, a Java port of HTML Tidy. It will, depending on what options you choose, fix non-well-formed HTML and otherwise clean it up.

You'll need something else for the colour changing stuff.
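For the colour part, here is a rough sketch using only the JDK's regex classes. It rests on some loud assumptions: "most prominent" is taken to mean "most frequently occurring", only six-digit hex colours such as `#FF0000` are recognised (named colours, `rgb()` and three-digit forms are ignored), and the markup is treated as plain text rather than parsed. The class and method names are made up for illustration and come from no library.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: frequency-based colour detection and replacement.
final class ColorSketch {
    // Six-digit hex colours only; the lookbehind skips numeric entities like &#123456;
    private static final Pattern HEX = Pattern.compile("(?<!&)#[0-9a-fA-F]{6}\\b");

    // Returns the most frequent hex colour in lower case, or null if none is found.
    // Ties go to the colour that appears first.
    static String mostProminentColor(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = HEX.matcher(html);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Replaces every occurrence of one hex colour with another, ignoring case.
    static String replaceColor(String html, String from, String to) {
        Pattern p = Pattern.compile(
            "(?<!&)" + Pattern.quote(from) + "\\b", Pattern.CASE_INSENSITIVE);
        return p.matcher(html).replaceAll(Matcher.quoteReplacement(to));
    }

    public static void main(String[] args) {
        String html = "<p style=\"color:#FF0000\">a</p>"
            + "<font color=\"#ff0000\">b</font>"
            + "<span style=\"color:#00ff00\">c</span>";
        String top = mostProminentColor(html);
        System.out.println(top);
        System.out.println(replaceColor(html, top, "#0000ff"));
    }
}
```

Because this works on raw text, it will also pick up colours inside comments or scripts; running it on the cleaned-up output of a tidy step, or walking the parsed attributes instead, would be more robust.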


Maybe you will find something in this list (try TagSoup, NekoHTML, VietSpider HTMLParser).
