Problem with incremental update in Lucene
I am creating a program that can index the text files in many different folders: every folder that contains text files gets indexed, and the index is stored in a separate folder, so that this separate folder acts as a universal index of all files on my computer. I am using Lucene for this because Lucene fully supports incremental updates. This is the source code I use for indexing:
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SimpleFileIndexer {

    public static void main(String[] args) throws Exception {
        int i = 0;
        while (i < 2) {
            File indexDir = new File("C:/Users/Raden/Documents/myindex");
            File dataDir = new File("C:/Users/Raden/Documents/indexthis");
            String suffix = "txt";

            SimpleFileIndexer indexer = new SimpleFileIndexer();
            int numIndex = indexer.index(indexDir, dataDir, suffix);
            System.out.println("Total files indexed " + numIndex);

            i++;
            Thread.sleep(1000);
        }
    }

    private int index(File indexDir, File dataDir, String suffix) throws Exception {
        RAMDirectory ramDir = new RAMDirectory(); // build the index in memory first
        @SuppressWarnings("deprecation")
        IndexWriter indexWriter = new IndexWriter(
                ramDir, // write into the in-memory directory
                new StandardAnalyzer(Version.LUCENE_CURRENT),
                true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        indexWriter.setUseCompoundFile(false);
        indexDirectory(indexWriter, dataDir, suffix);

        int numIndexed = indexWriter.maxDoc();
        indexWriter.optimize();
        indexWriter.close();

        // copy the in-memory index to the on-disk index folder
        Directory.copy(ramDir, FSDirectory.open(indexDir), false);
        return numIndexed;
    }

    private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix) throws IOException {
        File[] files = dataDir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(indexWriter, f, suffix);
            } else {
                indexFileWithIndexWriter(indexWriter, f, suffix);
            }
        }
    }

    private void indexFileWithIndexWriter(IndexWriter indexWriter, File f, String suffix) throws IOException {
        if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
            return;
        }
        if (suffix != null && !f.getName().endsWith(suffix)) {
            return;
        }
        System.out.println("Indexing file " + f.getCanonicalPath());
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    }
}
And this is the source code I use for searching the Lucene-created index:
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleSearcher {

    public static void main(String[] args) throws Exception {
        File indexDir = new File("C:/Users/Raden/Documents/myindex");
        String query = "revolution";
        int hits = 100;

        SimpleSearcher searcher = new SimpleSearcher();
        searcher.searchIndex(indexDir, query, hits);
    }

    private void searchIndex(File indexDir, String queryStr, int maxHits) throws Exception {
        Directory directory = FSDirectory.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(directory);
        @SuppressWarnings("deprecation")
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query query = parser.parse(queryStr);

        TopDocs topDocs = searcher.search(query, maxHits);
        ScoreDoc[] hits = topDocs.scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println(d.get("filename"));
        }
        System.out.println("Found " + hits.length);
    }
}
The problem I am having is that the indexing program above doesn't seem to do incremental updates. I can search for text files, but only the files from the last folder I indexed show up; files from previously indexed folders are missing from the search results. Can you tell me what went wrong in my code? I just want incremental updates to work. In essence, my program seems to be overwriting the existing index with the new one instead of merging them.
Thanks.
Directory.copy() overwrites the destination directory; you need to use IndexWriter.addIndexes() to merge the new directory's index into the main one.

You can also just re-open the main index and add documents to it directly. A RAMDirectory isn't necessarily more efficient than properly tuned buffer and merge-factor settings (see the IndexWriter docs).
Update: instead of Directory.copy(), you need to open ramDir for reading and indexDir for writing, then call addIndexes on the indexDir writer and pass it the ramDir reader. Alternatively, you can use addIndexesNoOptimize and pass it ramDir directly (without opening a reader), then optimize the index before closing.

But really, it's probably easier to just skip the RAMDirectory and open a writer on indexDir in the first place. That will also make it easier to update changed files.
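For the first variant, the merge step could look like the following minimal sketch (assuming Lucene 3.x; ramReader and fsWriter are just illustrative names):

// open the in-memory index for reading
IndexReader ramReader = IndexReader.open(ramDir);
// open the on-disk index for writing; this constructor appends to an
// existing index and only creates a new one if none exists yet
IndexWriter fsWriter = new IndexWriter(FSDirectory.open(indexDir),
        new StandardAnalyzer(Version.LUCENE_CURRENT),
        IndexWriter.MaxFieldLength.UNLIMITED);
fsWriter.addIndexes(ramReader); // merge rather than overwrite
fsWriter.optimize();
fsWriter.close();
ramReader.close();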
Example
private int index(File indexDir, File dataDir, String suffix) throws Exception {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter indexWriter = new IndexWriter(ramDir,
            new StandardAnalyzer(Version.LUCENE_CURRENT), true,
            IndexWriter.MaxFieldLength.UNLIMITED);
    indexWriter.setUseCompoundFile(false);
    indexDirectory(indexWriter, dataDir, suffix);
    int numIndexed = indexWriter.maxDoc();
    indexWriter.optimize();
    indexWriter.close();

    // no create flag here: append to the existing on-disk index
    IndexWriter index = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT),
            IndexWriter.MaxFieldLength.UNLIMITED);
    index.addIndexesNoOptimize(ramDir);
    index.optimize();
    index.close();
    return numIndexed;
}
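The important detail is that the writer on indexDir is opened without the create flag: that constructor appends to an existing index (creating it only if it doesn't exist yet), whereas passing true as the create argument would wipe the index on every run, which is exactly the overwriting behavior you're seeing.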
But just doing this is fine too:
private int index(File indexDir, File dataDir, String suffix) throws Exception {
    // again, no create flag: appends to the existing index
    IndexWriter index = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT),
            IndexWriter.MaxFieldLength.UNLIMITED);
    // tweak the settings for your hardware
    index.setUseCompoundFile(false);
    index.setRAMBufferSizeMB(256);
    index.setMergeFactor(30);
    indexDirectory(index, dataDir, suffix);
    index.optimize();
    int numIndexed = index.maxDoc();
    index.close();
    // you'll need to update indexDirectory() to keep track of indexed files
    return numIndexed;
}
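For the "keep track of indexed files" part, one option is to key each document on its canonical path and use IndexWriter.updateDocument(), which first deletes any existing document matching the given term and then adds the new one, so re-indexing a file replaces its old entry instead of duplicating it. A minimal sketch of a reworked indexFileWithIndexWriter() under that assumption (the filename field switches to NOT_ANALYZED so the full path is stored as a single exact-match term; you'll also need to import org.apache.lucene.index.Term):

private void indexFileWithIndexWriter(IndexWriter indexWriter, File f, String suffix) throws IOException {
    if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
        return;
    }
    if (suffix != null && !f.getName().endsWith(suffix)) {
        return;
    }
    String path = f.getCanonicalPath();
    Document doc = new Document();
    doc.add(new Field("contents", new FileReader(f)));
    // NOT_ANALYZED: the path is indexed as one term, usable as a unique key
    doc.add(new Field("filename", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
    // replaces any existing document for this path instead of adding a duplicate
    indexWriter.updateDocument(new Term("filename", path), doc);
}

The trade-off is that the filename field is no longer tokenized, so word-level searches against it won't match anymore; searches on contents are unaffected.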