
How to go about indexing 300,000 text files for search?

I have a static collection of over 300,000 text and html files. I want to be able to search them for words, exact phrases, and ideally regex patterns. I want the searches to be fast.

I think searching for words and phrases can be done by looking up a dictionary of unique words, mapping each word to the files that contain it, but is there a way to get reasonably fast regex matching?
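The dictionary-of-words idea described above is a classic inverted index. A minimal sketch in Python (assumptions: plain-text files small enough to hold in memory, simple `\w+` tokenization, AND semantics for multi-word queries):

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def search_words(index, query):
    """Return the doc ids containing every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for w in words[1:]:
        result &= index.get(w, set())
    return result

# Toy corpus standing in for the real file collection
docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog",
    "c.txt": "quick dogs and lazy foxes",
}
index = build_index(docs)
print(sorted(search_words(index, "the quick")))
```

For exact phrases, the word intersection only yields candidate files; each candidate still has to be scanned (or the index extended with word positions) to confirm the phrase.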

I don't mind using existing software if such exists.
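On the regex part of the question: a common trick (used by code-search engines) is a trigram index, which narrows the regex scan to files containing a literal substring of the pattern. A rough sketch, assuming the regex contains at least one required literal of 3+ characters that is extracted by hand (`literal` below):

```python
import re
from collections import defaultdict

def build_trigram_index(docs):
    """Map every 3-character substring to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for i in range(len(text) - 2):
            index[text[i:i + 3]].add(doc_id)
    return index

def regex_search(docs, index, pattern, literal):
    """Intersect trigram postings for `literal`, then regex-match only those files."""
    candidates = None
    for i in range(len(literal) - 2):
        posting = index.get(literal[i:i + 3], set())
        candidates = set(posting) if candidates is None else candidates & posting
    rx = re.compile(pattern)
    return sorted(d for d in (candidates or set()) if rx.search(docs[d]))

# Toy corpus standing in for the real file collection
docs = {
    "a.txt": "error code 404",
    "b.txt": "all good here",
    "c.txt": "error: file not found",
}
index = build_trigram_index(docs)
print(regex_search(docs, index, r"error.*found", "error"))
```

Extracting required literals from an arbitrary regex automatically is harder (alternations and optional groups complicate it), but even this manual version turns a 300,000-file scan into a scan over a handful of candidates when the pattern contains distinctive literals.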


Consider Lucene http://lucene.apache.org/java/docs/index.html


There are quite a few products on the market that can do this; some are open source and some are commercial:

Open source:

elasticsearch - based on Lucene

constellio - based on Lucene

Sphinx - written in C++

Solr - built on top of Lucene


You can have a look at Microsoft Search Server Express 2010: http://www.microsoft.com/enterprisesearch/searchserverexpress/en/us/technical-resources.aspx


http://blog.webdistortion.com/2011/05/29/open-source-search-engines/
