开发者

Search filters with Lucene.NET

I'm using Lucene.Net to create a website to search books, articles, etc, stored as PDFs. I need to be able to filter my search results based on author name, for example. Can this be done with just Lucene? Or do I need a DB to store the fi开发者_JS百科lter fields for each document?

Also, what's the best way to index my documents? I'll have about 50 documents to start with and periodically I'll have to add a bunch of documents to the index--may be through a web form. Should I use a DB to store the document paths?

Thanks.


Here is a list of what you need to do IMO:

  1. Extract raw text from PDF - please see this question which recommends iTextSharp for this purpose.
  2. For each PDF document, create a Lucene.net document that has several fields: author, title, document text and whatever you want to search. It is recommended to also have a unique id field per document. I suggest you also store a field with the path to the original PDF document.
  3. After indexing all the documents, you will have a Lucene index you can search by fields.
  4. You can add new documents by repeating step 2. It is easier to do this offline - incremental updates are tough.


Lucene has a couple of different Analyzers that can scrub out the noise and do "stemming" which is helpful when you want to do fulltext searching, but you're still going to need to store the PDF itself somewhere. Lucene.Net is happy to build an index on the file system, and you could add a field to the Document it builds called something like "PATH" with the path to the document.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜