Setting Up an Easily Searchable MySQL Database for Word Searches
I have appx. 2TB of text that I want to turn into a searchable database, where I will usually be searching to see if 2-4 word expressions exist in the database (for instance I might do a search to see if the phrase "these are four words", or "three consecutive words" appears anywhere in the text).
These searches will happen very often so it is very important that I setup the database to use as little processing as possible. I'd also want to minimize the overhead as much as possible so I can lower the amount of database servers I'll need.
Does anybody have any suggestions as to how I should setup this database?
For instance I was thinking of doing a linked list that was organized |id|word1|word2| (with all three beings keys) so for the expression "these are four words", I'd first search "these are", then I'd search "are four", check to see if any matches for "these are" are 1 id lower than "are four", and then do the same thing for "four words". But I think there has to be a more efficient way of doing it.
EDIT: The ONLY thing I will be using this database for is doing these 2-4 word exact match searches, and it is meant for internal use. All I want this database to be able to do is let me know if a 2-4 word expression exists somewhere in all of m开发者_JAVA技巧y files of information, and nothing more.
Does anybody have any suggestions as to how I should setup this database?
Personally, I'd first rule out the possibility of using MySQL's full-text search, and every Open Source, full-text search engine. There's a list of Open Source search engines on Wikipedia. I'd also rule out using Google Custom Search. Heck, I'd even consider a commercial product before I'd try rolling my own.
At the very least, studying their code might give you some ideas about index structure.
If you're thinking of building a linked list in SQL, well, you might want to build a tiny test before you get too far into it. I don't think it will be practical, but I could be wrong.
It takes a lot of work to do full-text search really well. (Think about proximity searches—find "there are" within 3 words of "many ways to fail". ) Reinventing this wheel might not be the best use of your time.
精彩评论