Retrieving well-formed HTML using the Jericho HTML parser in Java
I've looked at jTidy for converting a snippet of malformed/real-world HTML into well-formed HTML/XHTML. However, there's a bug in the latest version that prevents me from using it. I'm looking at Jericho since it has a lot of positive reviews around the net.
However, it's not immediately obvious to me how one would go about implementing a method like:
public String getValidHTML(String messedUpHTML)
For instance, if it was passed <div>bar, it would return <div>bar</div>.
Any pointers would be helpful.
Thanks in advance!
Jericho's HTMLSanitiser sample might be a good start.
Keep in mind, though, that Jericho's key strength is its ability to parse and manipulate malformed HTML while keeping the original "bad" formatting. Still, it'd be interesting to see how the library performs at such a task.
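To make the requested behaviour concrete, here is a minimal sketch of getValidHTML in plain Java. It does not use Jericho (or jTidy) at all: it is a naive regex-plus-stack tag balancer, and the class name TagBalancer, the void-element list, and the tag regex are all assumptions made for illustration. It ignores comments, script/style content, attribute values containing ">", and HTML's implicit end-tag rules, so treat it as a description of the goal rather than a replacement for a real parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jericho: closes unclosed tags with a stack.
final class TagBalancer {
    // Elements that never take an end tag in HTML (assumed list).
    private static final Set<String> VOID = Set.of(
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr");

    // Group 1: "/" for end tags; group 2: tag name; group 3: attributes.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>");

    static String getValidHTML(String messedUpHTML) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // currently open elements
        Matcher m = TAG.matcher(messedUpHTML);
        int pos = 0;
        while (m.find()) {
            // Copy the text between tags unchanged.
            out.append(messedUpHTML, pos, m.start());
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean isEndTag = !m.group(1).isEmpty();
            if (!isEndTag) {
                out.append(m.group());
                boolean selfClosing = m.group(3).endsWith("/");
                if (!selfClosing && !VOID.contains(name)) {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                // Close any unclosed inner elements, then the matching one.
                // End tags are emitted in lower case.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // A stray end tag with no matching start tag is dropped.
        }
        out.append(messedUpHTML, pos, messedUpHTML.length());
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

With this sketch, TagBalancer.getValidHTML("<div>bar") returns "<div>bar</div>", matching the example in the question. A real solution built on a parser would handle far more cases.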