Java equivalent to PHP Simple HTML DOM Parser
Since I need multithreading, which I cannot solve elegantly in PHP, I would like to write my program in Java. Unfortunately, I could not find a library that lets me parse an HTML DOM as robustly, quickly and easily as PHP Simple HTML DOM Parser does. Do you know of alternatives in Java that are as easy to use?
I went from Simple HTML DOM Parser to JSoup and I'm quite happy with it.
I can see that we have two challenges here:
1. Parsing HTML that might not be well-formed XHTML, which would otherwise be easy to parse. I'd recommend the TagSoup library, which can read ugly HTML and produce a well-formed SAX event stream that can then be used elsewhere.
2. Building a DOM representation of the HTML document and working with it. As you probably know, the JDK ships a full-blown implementation of the XML DOM (org.w3c.dom.*), but I guess this is not the type of API you've been looking for. What about DOM4J or the older JDOM, which can wrap a JDK Document and give you an easy-to-use API?
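For comparison, here is a minimal sketch of what the plain JDK route looks like, using only org.w3c.dom and javax.xml.xpath. It assumes the input is already well-formed XHTML (for example after a cleanup pass such as TagSoup), because the JDK parser rejects typical tag-soup HTML. The class name XhtmlLinks and the method extractHrefs are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; the class and method names are illustrative only.
class XhtmlLinks {

    // Parses a well-formed XHTML string with the JDK DOM parser and returns
    // the href attribute of every <a> element that has one.
    static List<String> extractHrefs(String xhtml) {
        try {
            // The default factory is not namespace-aware, so "//a" matches
            // elements by their plain tag name.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList anchors = (NodeList) xpath.evaluate(
                    "//a[@href]", doc, XPathConstants.NODESET);

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            // Malformed input (i.e. not well-formed XML) ends up here.
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>"
                + "<a href=\"/first\">one</a> <a href=\"/second\">two</a>"
                + "</p></body></html>";
        System.out.println(extractHrefs(page));
    }
}
```

This works, but it shows why the raw DOM API feels heavy compared with Simple HTML DOM Parser: a factory, a builder, a cast to NodeList and a manual loop just to collect a few attributes.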
I've successfully used TagSoup as a SAX parser to populate DOM4J Documents, which I then query with XPath. It took me a while to work out the incantations (this is Scala, but I'm sure you can convert it to Java):
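import java.io.StringReader
import org.dom4j.io.SAXReader
import org.xml.sax.InputSource
val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl // TagSoup's JAXP SAX parser factory
val reader = new SAXReader(parserFactory.newSAXParser.getXMLReader) // DOM4J reader driven by TagSoup
val doc = reader.read(new InputSource(new StringReader(page))) // page: the HTML source as a String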