Search large amounts of constantly updated text in MySQL
I have a MySQL database where large amounts of text are constantly added (about 10 pages of text per hour). The text is stored as plaintext in TEXT fields, and every row contains a page or two of text.
I need to do full-text search (search for a keyword in the text and run complex queries) on this database on a regular basis. I only need to search the newly added text, but it is very important for added text to become searchable almost immediately (within a minute or two).
From what I've read, MySQL's FULLTEXT search is very inefficient. I know Lucene is an option, but I'm not sure yet how quickly it can index new text.
So what are my options? Is there a way to make MySQL more efficient? Is Lucene my best solution? Is there anything more appropriate?
Thanks.
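For reference, the native MySQL approach I'm asking about would look roughly like this sketch (the `pages` table and column names are hypothetical, just to illustrate the kind of query I mean):

    -- Hypothetical table holding a page or two of text per row.
    -- (Before MySQL 5.6, FULLTEXT indexes require the MyISAM engine.)
    CREATE TABLE pages (
        id      INT AUTO_INCREMENT PRIMARY KEY,
        added   DATETIME NOT NULL,
        content TEXT NOT NULL,
        FULLTEXT KEY ft_content (content)
    ) ENGINE=MyISAM;

    -- Keyword search restricted to recently added rows.
    SELECT id, added
      FROM pages
     WHERE MATCH(content) AGAINST ('+keyword' IN BOOLEAN MODE)
       AND added > NOW() - INTERVAL 1 HOUR;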
I have benchmarked indexing times for Sphinx and Solr. Sphinx is far ahead of Solr when it comes to indexing: very fast indexing times and a small index size.
At 10 pages of text per hour, it seems you don't even need Sphinx's real-time indexing. You can follow the main + delta indexing scheme in Sphinx (described in the Sphinx documentation). It is very fast and near real time. If you want more help with this, feel free to ask; I'd be glad to explain.
Solr is great, but when it comes to optimized indexing algorithms, Sphinx rocks! Try Sphinx.
Coming to your questions in the comments: Solr/Lucene supports incremental indexing (known as delta imports in their terminology), and it's quite easy to configure; however, it is pretty slow compared to the method used by Sphinx.
Main+delta is quick enough because you can create a temporary table, store your new text in it, and index just that. According to the documentation: Sphinx supports "live" (almost real-time) index updates, which can be implemented using the so-called "main+delta" scheme. The idea is to set up two sources and two indexes, with one "main" index for the existing data and one "delta" index for the new documents.
Say, for example, you have 10 million records: keep those in the main index, and have all new documents added to a separate table that acts as the delta. This new table can be re-indexed from time to time (say, every hour), and with only 10 pages of text the new data becomes searchable within a few seconds. Once the new records are searchable, you can merge the delta index into the main index, which can be done without interrupting your searches. After the documents are merged, empty the delta table and repeat the whole process an hour later. I hope that makes sense; feel free to ask if you have any questions.
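A minimal sketch of the main+delta idea, expressed as the SQL behind the two Sphinx data sources (the `pages` and `sph_counter` tables and column names are assumptions for illustration; in practice these statements go into the `sql_query_pre` / `sql_query` directives of the two sources in sphinx.conf):

    -- Hypothetical high-water-mark table: records the last document id
    -- covered by the main index.
    CREATE TABLE sph_counter (
        counter_id INT PRIMARY KEY,
        max_doc_id INT NOT NULL
    );

    -- "main" source: remember the current maximum id, then index everything up to it.
    REPLACE INTO sph_counter SELECT 1, MAX(id) FROM pages;
    SELECT id, content FROM pages
     WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1);

    -- "delta" source: only rows added since the last main build
    -- (re-indexed every few minutes).
    SELECT id, content FROM pages
     WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1);

You would then rebuild the delta frequently (e.g. `indexer --rotate delta`) and, less often, merge it back into main (e.g. `indexer --merge main delta --rotate`) before clearing the accumulated rows.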
You have a couple of options:
Sphinx Search: integrates directly with your MySQL DB. Has support for real-time indexes, with some limitations.
Solr/Lucene: feed it data from your DB via JSON or XML. Has rich querying capabilities. Current versions are not truly real-time without some edge builds; you have to re-index your data and commit it for changes to appear. Depending on your amount of data, you could do a commit every 10 minutes. This won't be an issue until you have 100K to 1M+ documents, as Lucene is very fast at indexing; 10 pages per hour is pretty trivial.
ElasticSearch: Java-based like Solr/Lucene, but appears to be truly "near real time". It's engineered out of the box to be distributed and to support linear scale-out. You feed it data via JSON and query it via JSON.
It really depends on your needs and capabilities. Sphinx might be the easiest to get started with, but its real-time index limitations might not work for you.