
Parsing HTML webpages in Java

I need to parse/read a lot of HTML webpages (100+) for specific content (a few lines of text that are almost the same on every page).

I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.

Both methods are slow, and with jsoup I get the following error: java.net.SocketTimeoutException: Read timed out (on multiple computers with different connections).

Is there anything better?

EDIT:


Now that I've gotten jsoup to work, I think a better question is how do I speed it up?
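Since most of the time here is spent waiting on the network rather than parsing, a common way to speed up fetching 100+ pages is to download them in parallel with a thread pool. Below is a minimal sketch using only the standard library; the `fetch` method is a hypothetical stand-in for whatever per-page work you do (in practice it would call jsoup and extract the lines you need):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetch {

    // Hypothetical stand-in: in real use this would call
    // Jsoup.connect(url).get() and extract the text of interest.
    static String fetch(String url) {
        return "content-of-" + url;
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = Arrays.asList("page1", "page2", "page3");

        // A fixed pool keeps the number of simultaneous connections bounded.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> results = new ArrayList<>();
        for (String url : urls) {
            results.add(pool.submit(() -> fetch(url)));
        }

        // Future.get() returns results in submission order.
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

With I/O-bound work like HTTP fetches, a pool size larger than the CPU count (here 8) is usually reasonable, since the threads spend most of their time blocked on the network.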


Did you try lengthening the timeout on JSoup? It's only 3 seconds by default, I believe. See e.g. this.
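For reference, the timeout is set per connection on jsoup's `Connection` object; a sketch (URL is a placeholder, and this assumes the jsoup library is on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTimeoutExample {
    public static void main(String[] args) throws Exception {
        // timeout(int) takes milliseconds; raise it well above the default
        // if the target servers are slow to respond.
        Document doc = Jsoup.connect("http://example.com/")
                .timeout(10_000) // 10 seconds instead of the default
                .get();
        System.out.println(doc.title());
    }
}
```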


I suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature library, it uses Lucene under the hood, and I find it to be a very reliable crawler.


A great skill to learn would be XPath. It would be perfect for this job! I just started learning it myself for automation testing. If you have questions, shoot me a message. I'd be glad to help you out, even though I'm not an expert.

Here's a nice link since you are interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

XPath is also useful outside Java, which is why I would choose that route.
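The JDK ships an XPath engine in `javax.xml.xpath` that works on well-formed XML/XHTML. A minimal sketch (the document string and `div` id are made up for illustration; real-world HTML usually needs to be cleaned into XML first, e.g. with a tool like JTidy, before this works):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // A toy well-formed XHTML fragment; real pages would come from the network.
        String xhtml = "<html><body><div id='target'>hello</div></body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // evaluate(...) returns the text content of the first matching node.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String text = xpath.evaluate("//div[@id='target']", doc);
        System.out.println(text);
    }
}
```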
