web indexer using Java
Is a parallel system or a distributed system better for website crawlers and web indexers developed in Java? What frameworks are available?
One of the best crawler/indexer combinations you'll find for Java is Nutch, which is now an Apache project (see Wiki) and therefore open source.
Features:
- Fetching, parsing and indexing in parallel and/or distributed
- Plugins: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags)
- Ontology
- Clustering
- MapReduce
- Distributed filesystem (via Hadoop)
- Link-graph database
- NTLM authentication (Windows/Exchange/etc)
Nutch is unbeatable. A simpler library that I have used successfully in projects is https://crawler.dev.java.net/. You can find examples at https://crawler.dev.java.net/samples.html.
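To give a feel for the "parallel" side of the question, here is a minimal sketch of fetching and indexing pages concurrently on a single machine with a plain thread pool. It is not Nutch's API or the crawler library's API; the class `ParallelIndexerSketch`, the methods `fetch` and `buildIndex`, and the in-memory `fakeWeb` map (standing in for real HTTP requests) are all made up for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelIndexerSketch {

    // Hypothetical stand-in for an HTTP fetch: reads the page body from an in-memory map.
    static String fetch(Map<String, String> fakeWeb, String url) {
        return fakeWeb.getOrDefault(url, "");
    }

    // Builds an inverted index (term -> URLs containing it), one task per URL.
    static Map<String, Set<String>> buildIndex(Map<String, String> fakeWeb, int threads)
            throws InterruptedException {
        Map<String, Set<String>> index = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : fakeWeb.keySet()) {
            pool.submit(() -> {
                String text = fetch(fakeWeb, url);
                // Very naive tokenizer: lowercase, split on non-word characters.
                for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (!term.isEmpty()) {
                        // Thread-safe insert; the sorted set keeps URLs in a stable order.
                        index.computeIfAbsent(term, t -> new ConcurrentSkipListSet<>()).add(url);
                    }
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("indexing did not finish in time");
        }
        return index;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, String> fakeWeb = Map.of(
                "http://example.com/a", "Java web crawler",
                "http://example.com/b", "Java web indexer");
        Map<String, Set<String>> index = buildIndex(fakeWeb, 4);
        System.out.println(index.get("java"));
        // prints [http://example.com/a, http://example.com/b]
    }
}
```

This only scales as far as one machine's threads and memory. A distributed setup such as Nutch on Hadoop splits the same fetch/parse/index work across several machines instead.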