What is an efficient way to check a word in a paragraph against the DB?
I have a table called keywords. It's simple, two columns - id and keyword. This table is always growing, we're always adding additional content to it.
Now, we have a user-submitted 2-3 sentence paragraph. We want to check the words, in this paragraph, against all the keywords i开发者_开发知识库n the keywords table.
What is the most efficient way to do this? My initial logic is below
// explode $paragraph into $words[]
// cycle through $words -- in each loops, SQL statement where $word = keyword
// output any matched keywords
Would this be the most efficient? Keep in mind, keywords table can be several thousand rows. Should I maybe use a Sphinx search solution?
- Explode user input into words.
Query:
SELECT id, keyword FROM tbl_name WHERE keyword IN ('word1', 'word2', 'word3', 'wordN');
Words that have not been returned have to be inserted.
- Merging retrieved IDs with newly inserted IDs gives you a full set of used IDs.
Have an array of excluded words. (Or pull it from the database, or whatever.)
$excludeThis = array(
'a',
'the',
'for',
);
Strip all the dots, slashes etc.
$text = preg_replace('/[^A-Za-z0-9]/', ' ', $text);
$text = str_replace(' ', ' ', $text);
Explode the text
$words = explode(' ', $text);
Build array of all the words you want to match against your keywords.
$array = array();
if ( ! empty($words)) {
foreach ($words as $word) {
if (in_array(strtolower(trim($word)), $excludeThis))
continue;
$array[] = strtolower(trim($word));
}
}
Pull your keywords from the db (you might actually want to cache this...) And just use array_intersect() to match your two arrays.
No idea if this is the fastest way, but maybe it helps.
I think that is the right way. But i would prefer a second table (much smaller) with excluded words you often have in your paragraphs (e.g. 'and', 'or', 'the' and so on...). Eliminate all of these words out of $words and you can save some milliseconds.
This is an excellent opportunity to use a stored procedure. Crozin's response will work, but I prefer to keep this sort of logic in the database.
Reasons:
If you decide to add/remove/rename columns at a later date - or if you're lucky enough to have a DBA on your project - it'll make things quite a bit easier to maintain.
If you're generating your word list from user input, it will help prevent SQL injection. I limit the rights of my site's db user account to the ability to execute stored procs.
In either case, the logic remains the same. Split the paragraph into words, and compare the keyword column against that list using "in".
Alternatively, if you're dealing with large amounts of text, it might be more efficient to split and pivot the resulting word list into rows (perhaps in a temporary table), and select from a join:
select *
from keywords
join #paragraph_words
on keywords.keyword=#paragraph_words.keyword;
(you'll have to check the syntax, as I don't have time to validate it just now).
The set operators (like IN
) likely won't perform as well as a join, though, as I said, this would be overkill if you're only dealing with a short word list.
Quering your database so many times will not be the most efficient way.
I guess you could use MySQL Full-text search capability to solve your problem. It will not scale on high load, but for most of the sites it will get the job done fast.
If you want the most optimized solution you should learn products like elasticsearch, solr, sphinx, lucene, etc. Because they were written in mind to solve this problem.
精彩评论