Webcrawler, feedback?
Hey folks, every once in a while I have the need to automate data collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is plenty of software and online services for that).
Anyway, as a follow-up to my previous question, I've written a little web crawler that can visit websites.
Basic crawler class to easily and quickly interact with one website.
Override "doAction(String URL, String content)" to process the content further (e.g. store it, parse it).
Concept allows for multi-threading of crawlers. All class instances share processed and queued lists of links.
Instead of keeping track of processed links and queued links within the object, a JDBC connection could be established to store links in a database.
Currently limited to one website at a time; however, it could be expanded by adding an externalLinks stack and pushing to it as appropriate.
JCrawler is intended to be used to quickly generate XML sitemaps or parse websites for your desired information. It's lightweight.
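For illustration, a minimal sketch of what such an override could look like. This assumes JCrawler exposes doAction(String url, String content) as an overridable hook and is seeded with one entry URL in its constructor; the subclass name and constructor signature are my own guesses, not copied from the pastebin code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical subclass; JCrawler, its constructor and the exact doAction signature
// are assumed from the description above, not taken from the pastebin source.
public class SitemapCrawler extends JCrawler {

    private final List<String> visitedUrls = new ArrayList<>();

    public SitemapCrawler(String startUrl) {
        super(startUrl); // assuming the crawler is seeded with a single entry URL
    }

    @Override
    public void doAction(String url, String content) {
        // Collect every crawled URL; a real sitemap generator would also write
        // <url><loc>...</loc></url> entries to an XML file here.
        visitedUrls.add(url);
        System.out.println("Crawled " + url + " (" + content.length() + " chars)");
    }

    public List<String> getVisitedUrls() {
        return visitedUrls;
    }
}
```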
Is this a good/decent way to write the crawler, given the limitations above? Any input would help immensely :)
http://pastebin.com/VtgC4qVE - Main.java
http://pastebin.com/gF4sLHEW - JCrawler.java
http://pastebin.com/VJ1grArt - HTMLUtils.java

Your crawler does not seem to respect robots.txt in any way, and it uses a fake User-Agent string to pass itself off as a web browser. This may lead to legal trouble down the road. Keep this in mind.
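For what it's worth, here is a rough sketch of how a crawler could honour the basic Disallow rules and send a truthful User-Agent using only java.net. The class name and agent string are placeholders, and the parsing is deliberately simplified (no Allow lines, wildcards or per-agent sections), so treat it as a starting point rather than a complete robots.txt implementation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt check: fetch /robots.txt, collect the Disallow rules in the
// "User-agent: *" section, and refuse any path that matches one of those prefixes.
public class RobotsCheck {

    public static List<String> disallowedPrefixes(String host) throws Exception {
        List<String> rules = new ArrayList<>();
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://" + host + "/robots.txt").openConnection();
        // Identify the crawler honestly instead of impersonating a browser.
        conn.setRequestProperty("User-Agent", "JCrawler/0.1 (+contact email or project URL)");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            boolean appliesToUs = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    appliesToUs = line.substring(11).trim().equals("*");
                } else if (appliesToUs && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) rules.add(path);
                }
            }
        }
        return rules;
    }

    public static boolean allowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```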
I have written a custom web crawler at my company, following steps similar to the ones you mention, and they have worked well. The only addition I would suggest is a polling frequency, so that the crawler revisits pages after a certain period of time.
It should follow the "Observer" design pattern, so that if an update is found on a given URL after that period, the crawler notifies its listeners or writes the new content to a file.
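Something along these lines is what I mean. The PageFetcher and PageObserver interfaces are placeholders for whatever your crawler already uses to download pages, and the change detection via hashCode is deliberately simplistic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the Observer idea: re-crawl a URL on a fixed schedule and notify
// registered observers only when the fetched content appears to have changed.
interface PageObserver {
    void pageChanged(String url, String newContent);
}

interface PageFetcher {
    String fetch(String url) throws Exception;
}

class PollingMonitor {
    private final List<PageObserver> observers = new ArrayList<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final PageFetcher fetcher;
    private int lastHash;

    PollingMonitor(PageFetcher fetcher) {
        this.fetcher = fetcher;
    }

    void addObserver(PageObserver o) {
        observers.add(o);
    }

    void watch(String url, long periodMinutes) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                String content = fetcher.fetch(url);
                int hash = content.hashCode();
                if (hash != lastHash) {          // content changed since the last poll
                    lastHash = hash;
                    for (PageObserver o : observers) {
                        o.pageChanged(url, content);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();             // keep polling even if one fetch fails
            }
        }, 0, periodMinutes, TimeUnit.MINUTES);
    }
}
```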
I would recommend the open-source JSpider as the starting point for your crawler project; it covers all the major concerns of a web crawler, including robots.txt, and has a plug-in scheme that you can use to apply your own tasks to each page it visits.
This is a brief and slightly dated review of JSpider. The pages around this one review several other Java spidering applications.
http://www.mksearch.mkdoc.org/research/spiders/j-spider/