Java, Lucene: Searcher/Indexer problem, how to do it?

I have to build something with Lucene and Java, but I have no idea how to start. I need to write a servlet that receives a query from the browser, runs a search, and finally generates a page with the results found. The user should be able to choose between searching in names only, or in names and page contents. The search should cover the HTML files in the directory /var/www/manual/. As a starting point I already have two files: Indexer.java and Searcher.java.

Indexer

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * This code was originally written for
 * Erik's Lucene intro java.net article
 */
public class Indexer {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      throw new Exception("Usage: java " + Indexer.class.getName()
        + " <index dir> <data dir>");
    }
    File indexDir = new File(args[0]);
    File dataDir = new File(args[1]);

    long start = new Date().getTime();
    int numIndexed = index(indexDir, dataDir);
    long end = new Date().getTime();

    System.out.println("Indexing " + numIndexed + " files took "
      + (end - start) + " milliseconds");
  }

  public static int index(File indexDir, File dataDir)
    throws IOException {

    if (!dataDir.exists() || !dataDir.isDirectory()) {
      throw new IOException(dataDir
        + " does not exist or is not a directory");
    }

    IndexWriter writer = new IndexWriter(indexDir,
      new StandardAnalyzer(), true);
    writer.setUseCompoundFile(false);

    indexDirectory(writer, dataDir);

    int numIndexed = writer.docCount();
    writer.optimize();
    writer.close();
    return numIndexed;
  }

  private static void indexDirectory(IndexWriter writer, File dir)
    throws IOException {

    File[] files = dir.listFiles();
    if (files == null) {
      return;  // not a directory, or an I/O error occurred
    }

    for (int i = 0; i < files.length; i++) {
      File f = files[i];
      if (f.isDirectory()) {
        indexDirectory(writer, f);  // recurse
      } else if (f.getName().endsWith(".txt")) {
//      } else if (f.getName().endsWith(".html.en")) {
        indexFile(writer, f);
      }
    }
  }

  private static void indexFile(IndexWriter writer, File f)
    throws IOException {

    if (f.isHidden() || !f.exists() || !f.canRead()) {
      return;
    }

    System.out.println("Indexing " + f.getCanonicalPath());

    Document doc = new Document();
    doc.add(new Field("contents", new FileReader(f)));
    doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    //doc.add(new Field("filename", new StringReader(f.getCanonicalPath())));
    writer.addDocument(doc);
  }


}

Searcher

import java.io.File;
import java.io.FileReader;
import java.io.StringReader;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * This code was originally written for
 * Erik's Lucene intro java.net article
 */
public class Searcher {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      throw new Exception("Usage: java " + Searcher.class.getName()
        + " <index dir> <query>");
    }

    File indexDir = new File(args[0]);
    String q = args[1];

    if (!indexDir.exists() || !indexDir.isDirectory()) {
      throw new Exception(indexDir +
        " does not exist or is not a directory.");
    }

    search(indexDir, q);
  }

  public static void search(File indexDir, String q)
    throws Exception {
    Directory fsDir = FSDirectory.getDirectory(indexDir, false);
    IndexSearcher is = new IndexSearcher(fsDir);

//    Query query = QueryParser.parse(q, "contents", new StandardAnalyzer());   DEPRECATED
    QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
    Query query = qp.parse(q);
    long start = new Date().getTime();
    Hits hits = is.search(query);
    long end = new Date().getTime();

    System.err.println("Found " + hits.length() +
      " document(s) (in " + (end - start) +
      " milliseconds) that matched query '" +
        q + "':");

    for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      System.out.println(doc.get("filename"));
    }
  }
}
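
For the servlet, the file names that Searcher prints to standard output would instead be written into an HTML page. A minimal sketch of that step, using only the standard library: the class `ResultPage` and its methods are hypothetical names, and the list of file names is assumed to have been collected from `doc.get("filename")` in the loop above.

```java
import java.util.List;

/** Hypothetical helper: renders a list of matching file names as a simple HTML page. */
final class ResultPage {

    /** Escapes the characters that are unsafe inside HTML text and attribute values. */
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the page body: a summary line followed by one list item per hit. */
    static String render(String query, List<String> filenames) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><body>\n");
        sb.append("<p>Found ").append(filenames.size())
          .append(" document(s) that matched query '")
          .append(escapeHtml(query)).append("'</p>\n");
        sb.append("<ul>\n");
        for (String f : filenames) {
            sb.append("  <li>").append(escapeHtml(f)).append("</li>\n");
        }
        sb.append("</ul>\n</body></html>\n");
        return sb.toString();
    }
}
```

Escaping both the query and the file names matters because the query comes straight from the browser and would otherwise be echoed back as markup.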

One of the suggestions is to use HTMLDocument.java from lucene-demos to index the HTML documents.
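
To illustrate what indexing HTML involves (this is not the code of HTMLDocument.java, only a rough stand-in built on regular expressions), the sketch below pulls the title and the visible text out of an HTML string so they could be indexed as separate fields. The class `HtmlText` and its method names are hypothetical, and regex-based stripping is a simplification that a real HTML parser handles more robustly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls the title and visible text out of an HTML string. */
final class HtmlText {

    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
        "<(script|style)[^>]*>.*?</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the cleaned <title> text, or an empty string if there is none. */
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? clean(m.group(1)) : "";
    }

    /** Returns the text with script/style blocks and all tags removed. */
    static String text(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        return clean(TAG.matcher(noScripts).replaceAll(" "));
    }

    // Decodes a few common entities (&amp; last, so "&amp;lt;" becomes "&lt;")
    // and collapses runs of whitespace into single spaces.
    private static String clean(String s) {
        return s.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&")
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

The two results could then go into two fields of the Lucene document, for example a title field and the existing "contents" field, instead of feeding the raw file to `FileReader` as the Indexer above does.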

Could someone help me with this problem? Thank you for any advice.


I don't know whether Lucene is a requirement for your project, but if what you are interested in is Lucene's full-text search capabilities, you may find it easier to start with Solr (http://lucene.apache.org/solr/), a search engine built on Lucene. Solr is developed by the same people as Lucene, so it uses Lucene the way it is meant to be used, and it is likely to be faster than code you would write yourself.

Otherwise, there is a nice "Getting started" guide on Lucene's website that will help you understand how to use Lucene (what a Directory is, how to read and write the index) and the best practices (reuse IndexWriter instances, etc.):

  • http://lucene.apache.org/java/3_3_0/gettingstarted.html#Getting Started
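
For the choice between searching names only or names and contents, one possible approach is to build a field-scoped query string and hand it to `QueryParser.parse` as Searcher.java already does. The sketch below uses only the standard library. The class `ScopedQuery` is a hypothetical name, and the `name` field is an assumption: the Indexer in the question stores only the full path in an untokenized `filename` field, so an analyzed field holding just the file name would have to be added in `indexFile`.

```java
/** Hypothetical helper: turns user input and a scope choice into a Lucene query string. */
final class ScopedQuery {

    // Characters with special meaning in Lucene's classic query syntax
    // ('/' is included because newer Lucene versions treat it as a regex delimiter).
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    /** Backslash-escapes query syntax characters in user input. */
    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /**
     * Builds a query restricted to the assumed "name" field, or covering both
     * "name" and "contents" when searchContents is true.
     */
    static String build(String userInput, boolean searchContents) {
        String terms = escape(userInput.trim());
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("empty query");
        }
        String byName = "name:(" + terms + ")";
        return searchContents ? byName + " OR contents:(" + terms + ")" : byName;
    }
}
```

In the servlet, the boolean would come from a request parameter such as a checkbox or radio button on the search form, and the returned string would replace `q` in the call to `qp.parse(q)`.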