How to optimize Lucene.Net indexing
I need to index around 10GB of data. Each of my "documents" is pretty small, think basic info about a product, about 20 fields of data, most only a few words. Only 1 column is indexed, the rest are stored. I'm grabbing the data from text files, so that part is pretty fast.
Current indexing speed is only about 40 MB per hour. I've heard other people say they have achieved speeds 100x faster than this. For smaller files (around 20 MB) the indexing goes quite fast (5 minutes). However, when I have it loop through all of my data files (about 50 files totalling 10 GB), the growth of the index seems to slow down a lot as time goes on. Any ideas on how I can speed up the indexing, or what the optimal indexing speed is?
On a side note, I've noticed the API in the .Net port does not seem to contain all of the same methods as the original in Java...
Update: here are snippets of the C# indexing code. First I set things up:
directory = FSDirectory.GetDirectory(@txtIndexFolder.Text, true);
iwriter = new IndexWriter(directory, analyzer, true);
iwriter.SetMaxFieldLength(25000);
iwriter.SetMergeFactor(1000);
// Convert.ToInt32 avoids the OverflowException that Convert.ToInt16 throws for values above 32767
iwriter.SetMaxBufferedDocs(Convert.ToInt32(txtBuffer.Text));
Then read from a tab-delimited data file:
using (System.IO.TextReader tr = System.IO.File.OpenText(File))
{
    string line;
    while ((line = tr.ReadLine()) != null)
    {
        string[] items = line.Split('\t');
Then create the fields and add the document to the index:
fldName = new Field("Name", items[4], Field.Store.YES, Field.Index.NO);
doc.Add(fldName);
fldUPC = new Field("UPC", items[10], Field.Store.YES, Field.Index.NO);
doc.Add(fldUPC);
string Contents = items[4] + " " + items[5] + " " + items[9] + " " + items[10] + " " + items[11] + " " + items[23] + " " + items[24];
fldContents = new Field("Contents", Contents, Field.Store.NO, Field.Index.TOKENIZED);
doc.Add(fldContents);
...
iwriter.AddDocument(doc);
Once it's completely done indexing:
iwriter.Optimize();
iwriter.Close();
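To check whether throughput really drops as the index grows, it may help to time each data file separately. The following is only a rough sketch, not part of the original code: the `IndexingThroughput` class and the `indexFile` callback are hypothetical names, and the callback is assumed to wrap the read-and-`AddDocument` loop shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

static class IndexingThroughput
{
    // Megabytes indexed per hour for a given byte count and elapsed time.
    public static double MegabytesPerHour(long bytes, TimeSpan elapsed)
    {
        return (bytes / (1024.0 * 1024.0)) / elapsed.TotalHours;
    }

    // Hours needed to index totalBytes at a steady rate in MB per hour.
    public static double HoursToIndex(long totalBytes, double megabytesPerHour)
    {
        return (totalBytes / (1024.0 * 1024.0)) / megabytesPerHour;
    }

    // Runs indexFile (hypothetical callback around the AddDocument loop)
    // on each file and prints the rate achieved for that file.
    public static void Measure(IEnumerable<string> files, Action<string> indexFile)
    {
        foreach (string file in files)
        {
            Stopwatch timer = Stopwatch.StartNew();
            indexFile(file);
            timer.Stop();

            long size = new FileInfo(file).Length;
            Console.WriteLine("{0}: {1:F1} MB/hour",
                Path.GetFileName(file), MegabytesPerHour(size, timer.Elapsed));
        }
    }
}
```

By the numbers in the question, a 20 MB file in 5 minutes works out to roughly 240 MB per hour, while the overall 40 MB per hour would mean about 256 hours for the full 10 GB (`HoursToIndex(10L * 1024 * 1024 * 1024, 40.0)`). If the per-file rate printed by `Measure` falls steadily from the first file to the last, the slowdown is tied to index growth and not to any particular input file.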
Apparently, I had downloaded a 3-year-old version of Lucene that, for some reason, is prominently linked from the home page of the project. I downloaded the most recent Lucene source code, compiled it, and used the new DLL, which fixed just about everything. The documentation kinda sucks, but the price is right and it's really fast.
From a helpful blog:
First things first, you have to add the Lucene libraries to your project. On the Lucene.NET web site, you’ll see the most recent release builds of Lucene. These are two years old. Do not grab them, they have some bugs. There has not been an official release of Lucene for some time, probably due to resource constraints of the maintainers. Use Subversion (or TortoiseSVN) to browse around and grab the most recently updated Lucene.NET code from the Apache SVN Repository. The solution and projects are Visual Studio 2005 and .NET 2.0, but I upgraded the projects to Visual Studio 2008 without any issues. I was able to build the solution without any errors. Go to the bin directory, grab the Lucene.Net dll and add it to your project.
Since I can't comment on the marked answer above about the 3-year-old version, I'll add this here: I would highly recommend installing the Visual Studio extension for NuGet Package Manager when adding Lucene.NET to your projects. It should add the most recent DLL version for you, unless you need a specific version.