
Best-suited text indexer for handling tens of thousands of (formatted) documents in Python

I want to add a feature to search documents stored in a directory. The back end is written in Python so that I can further manipulate the search results. The documents are stored on a dedicated web server.

The established technologies (Lucene, Xapian, Whoosh) have mature Python bindings. My colleagues have set up Apache, Lucene and PHP for their clients. I would choose Whoosh for being written in Python, but I am scared by reviews of its slow performance and lack of "feature X".

My specific requirements are:

Support (makes me bite my nails)

  • well supported in Python
  • tech support of major hosts can easily set it up
  • scales well for up to 100000 documents
  • updating the index for 4 new files shouldn't slow down our dedicated server

Features (I am a newb here)

  • returns data in a format which I can manipulate by myself
  • can return highlighted text snippets
  • higher priority for certain files, and for words in the title or in bold


Solr, even though it is written in Java, is an amazingly powerful search engine.

It has everything you need, such as highlighting, weighting, and relatively fast insertion of new items into the index, plus a whole slew of other features such as autocomplete-style suggestions.

It offers JSON, XML and other response formats, and it is fairly easy to talk to the search engine from Python.
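As a rough sketch of what that looks like from Python, the snippet below builds a Solr select URL that asks for JSON, highlighting, and a boosted title field, then flattens the response into plain dicts. The host, the core name `docs`, and the field names `title` and `body` are assumptions for illustration only; the query parameters (`defType`, `qf`, `hl`, `hl.fl`, `wt`) are standard Solr ones, but your schema will decide what actually works.

```python
import json
from urllib.parse import urlencode

# Assumed host and core name; adjust to your own Solr install.
SOLR_SELECT = "http://localhost:8983/solr/docs/select"


def build_query_url(text, base=SOLR_SELECT):
    """Build a select URL with title boosting and highlighting enabled."""
    params = {
        "q": text,
        "defType": "edismax",    # query parser that understands per-field boosts
        "qf": "title^3 body",    # assumed fields: title matches count 3x
        "hl": "true",            # ask for highlighted snippets
        "hl.fl": "body",         # assumed field to take snippets from
        "wt": "json",            # JSON response format
    }
    return base + "?" + urlencode(params)


def parse_results(raw_json):
    """Turn a Solr JSON response into a list of plain dicts."""
    data = json.loads(raw_json)
    snippets = data.get("highlighting", {})
    results = []
    for doc in data["response"]["docs"]:
        doc_id = doc["id"]
        results.append({
            "id": doc_id,
            "title": doc.get("title"),
            "snippets": snippets.get(doc_id, {}).get("body", []),
        })
    return results
```

To run a real search you would fetch the URL, for example with `urllib.request.urlopen(build_query_url("indexing")).read()`, and pass the bytes to `parse_results`. The result is ordinary Python data, so further manipulation on the back end is straightforward.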


Sphinx is pretty easy to interact with because it works via a MySQL storage engine, which is an interface most programmers have touched at one point or another. Doubly so if you already have data in MySQL because then you can munge the data together trivially. Django-sphinx is an example of a fairly mature and easy to use means of interacting with Sphinx.

I know it's performant because I've used it in some high-load, high-traffic situations and it's done very well. It supports all the semantics and features that I've ever found myself needing.

Lucene can be made more tolerable with Solr, a search server that wraps Lucene in an HTTP (REST-like) interface. The native bindings can be a bit arcane or alien to people not used to interacting with a search engine.
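Because Solr is reached over plain HTTP, adding a handful of new files to the index is just a POST of JSON documents to the update handler. The sketch below only builds the request object; the host, the core name `docs`, and the field names are assumptions, and `commit=true` is used for simplicity even though committing on every small update may not be what you want in production.

```python
import json
from urllib.request import Request

# Assumed host and core name; commit=true makes the new docs searchable at once.
SOLR_UPDATE = "http://localhost:8983/solr/docs/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE):
    """Build a POST request that adds (or replaces) documents in the index."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; the field names must match your Solr schema.
new_docs = [
    {"id": "report-17", "title": "Quarterly report", "body": "Full text here"},
    {"id": "memo-4", "title": "Memo", "body": "More text"},
]
request = build_update_request(new_docs)
```

Sending it is then a matter of `urllib.request.urlopen(request)`. Documents whose `id` already exists in the index are replaced, so the same call covers both new and changed files.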
