Web indexer using Java

Is a parallel system or a distributed system better for website crawlers and web indexers developed in Java? What frameworks are available?


One of the best crawler/indexer combinations you'll find for Java is Nutch, which is now an Apache project (see its wiki) and therefore open source.

Features:

  1. Fetching, parsing and indexing, in parallel and/or distributed
  2. Plugins: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags)
  3. Ontology
  4. Clustering
  5. MapReduce
  6. Distributed filesystem (via Hadoop)
  7. Link-graph database
  8. NTLM authentication (Windows/Exchange/etc)
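
To give a feel for what "fetching in parallel" means on a single machine, here is a minimal sketch using only the JDK. It is not Nutch's API. The class `ParallelCrawlSketch`, the method `crawl`, and the in-memory `web` map (which stands in for real HTTP fetching and link extraction) are all made up for illustration. A distributed crawler such as Nutch spreads the same frontier-by-frontier idea across machines with MapReduce instead of a thread pool.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCrawlSketch {

    // Breadth-first crawl: every page in the current frontier is "fetched"
    // concurrently, then the newly discovered links form the next frontier.
    // `web` is a hypothetical stand-in for the network: page -> outgoing links.
    static Set<String> crawl(Map<String, List<String>> web, String seed, int threads) {
        Set<String> visited = new HashSet<>();
        visited.add(seed);
        List<String> frontier = List.of(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            while (!frontier.isEmpty()) {
                // One task per page; a real crawler would download and parse here.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> web.getOrDefault(url, List.of()));
                }
                List<String> next = new ArrayList<>();
                // invokeAll blocks until every task in this frontier has finished.
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        // Only the calling thread touches `visited`, so a plain set is enough.
                        if (visited.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
            return visited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Page "d" has no entry, so it is treated as having no outgoing links.
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "c", List.of("a"));
        System.out.println(new TreeSet<>(crawl(web, "a", 4))); // prints [a, b, c, d]
    }
}
```

A thread pool like this is usually enough while one machine's bandwidth and disk can keep up. Once the crawl outgrows a single machine, the frontier and the index have to be partitioned across nodes, which is the job Nutch delegates to Hadoop.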


Nutch is hard to beat. A simpler library that I have used successfully in projects is https://crawler.dev.java.net/. You can find examples at https://crawler.dev.java.net/samples.html.
