Efficient query to lookup stuff in a word index
I have two tables defined like this:
Page(id), Index(page_id, word)
page_id in Index is a foreign key to Page so that each Page is connected to a group of Index entries. The Index table is a index for the Page table so that you can do fast text searching. E.g:
SELECT page_id FROM Index where word = 'hello'
Would select all page_id's for all pages containing the word 'hello'. But now I want to select all page_id's for pages that contain all of the words 'word1', 'word2' and 'word3'. The best query I can come up with for this is:
SELECT page_id
FROM Index
WHERE word IN ('word1', 'word2', 'word3')
GROUP BY page_id
HAVING COUNT(1) = 3;
It works, but I wonder if someone can think of an alternative more efficient query?
[edit] The above example is开发者_开发问答 slightly simplified. In the real Index table word is replaced with a word_id column that references a Word table. But the basic approach is the same. The RDBMS is PostgreSQL and there is about 2m rows in the Index table and 20k rows in Page.
As a small thing, I wouldn't call a table Index
, that's bound to be confusing :)
Your query would find pages with 3 index entries for word1
. That might not be a problem, but you could change the having
to:
HAVING COUNT(DISTINCT word) = 3
to avoid this problem.
How this query performs compared to other depends on a lot of factors, like keyword density, the DBMS you use, and the total amount of rows. I would not worry about it until you encountered an actual performance problem.
精彩评论