How to go about indexing 300,000 text files for search?
I have a static collection of over 300,000 text and html files. I want to be able to search them for words, exact phrases, and ideally regex patterns. I want the searches to be fast.
I think searching for words and phrases can be done by looking up a dictionary of unique words that maps each word to the files containing it, but is there a way to have reasonably fast regex matching?
I don't mind using existing software if such exists.
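For reference, the dictionary idea described above is usually called an inverted index. Here is a minimal in-memory sketch of it, assuming the files have already been read into a mapping of file name to text. The function names (tokenize, build_index, search_word, search_phrase) are made up for illustration and do not come from any library. Storing word positions as well as file names is what makes exact-phrase lookup possible. Regex matching is not covered by this sketch.

```python
import re
from collections import defaultdict

# Hypothetical helper names; this is an illustrative sketch, not a library API.
TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN.findall(text)]


def build_index(docs):
    """Build word -> {doc_id -> [positions]} from a mapping doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search_word(index, word):
    """Return the set of documents containing the word (case-insensitive)."""
    return set(index.get(word.lower(), {}))


def search_phrase(index, phrase):
    """Return the set of documents containing the words of phrase consecutively."""
    words = tokenize(phrase)
    if not words:
        return set()
    # Only documents that contain every word can contain the phrase.
    candidates = set(index.get(words[0], {}))
    for w in words[1:]:
        candidates &= set(index.get(w, {}))
    hits = set()
    for doc_id in candidates:
        # Check each occurrence of the first word for the rest of the phrase.
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


if __name__ == "__main__":
    docs = {
        "a.txt": "The quick brown fox",
        "b.txt": "A brown quick dog",
    }
    idx = build_index(docs)
    print(search_word(idx, "brown"))
    print(search_phrase(idx, "quick brown"))
```

For 300,000 files an index like this would normally be kept on disk rather than in a Python dict, which is the kind of work that existing search software does for you.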
Consider Lucene: http://lucene.apache.org/java/docs/index.html
Quite a few search engines are available that will help you achieve what you want. Some are open source and some are commercial:
Open source:
Elasticsearch - based on Lucene
Constellio - based on Lucene
Sphinx - written in C++
Solr - built on top of Lucene
You can have a look at Microsoft Search Server Express 2010: http://www.microsoft.com/enterprisesearch/searchserverexpress/en/us/technical-resources.aspx
http://blog.webdistortion.com/2011/05/29/open-source-search-engines/