mysql - filtering a list against keywords, both list and keywords > 20 million records (slow)

Posted 2022-12-28 21:46

I have two tables, both having more than 20 million records; table1 is a list of terms, and table2 is a list of keywords that may or may not appear in those terms. I need to identify the terms that contain a keyword.

The 'term' field is a VARCHAR(320) and the 'keyword' field is a VARCHAR(64).

My current strategy is:

SELECT table1.term, table2.keyword
FROM table1
INNER JOIN table2 ON table1.term LIKE CONCAT('%', table2.keyword, '%');

This is not working, it takes f o r e v e r.

It's not the server, afaict (see notes).

How might I rewrite this so that it runs in under a day?

I have considered in-memory tables, or switching to InnoDB and making the buffer pool big enough to hold both tables. Unfortunately, each MySQL thread is bound to one CPU, but I have 4 cores (well, "8" with hyperthreading); if I could distribute the workload, that would be fantastic.

Notes:

Regarding server optimization: both tables are MyISAM and have unique indexes on the matching fields; the MyISAM key buffer is larger than the sum of both index file sizes, and it is not even being fully used (key_blocks_unused is ... large); the server is a 2x dual-core Xeon 2U beast with fast SAS drives and 8 GB of RAM, tuned for the MySQL workload.
I just remembered that I only index the first 80 characters of the 'term' field (to save disk space); not sure if this is hurting or helping.
MySQL 5.0.32, Debian Lenny x86_64

You want to set up a full-text index, then do a search against that. Right now, your unique index probably isn't helping the search at all (because of the leading '%' in the search).

That means, it's almost certainly running a full scan of table1 for each item in table2. Calling that grossly inefficient is putting it nicely. Building a full-text index is somewhat slow (though probably faster than what you're doing right now) but once that's done, the searching should go a lot faster.

As to what to use to do the full-text indexing: while MySQL has a built-in full-text indexing capability, I doubt it'll help you a lot -- with 20 million rows, its performance is pretty poor (at least in my experience). Sphinx is a bit more work to set up, but is a lot more likely to give you adequate performance.
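To make the idea concrete, here is a minimal sketch of what a word-level (full-text style) index buys you: each term is tokenized once, and every keyword then becomes a single hash lookup instead of a scan over all terms. This is only an illustration in plain Python, not how MySQL or Sphinx implement it; the function names and sample data are made up. It also assumes keywords are single whole words, which is a narrower match than LIKE '%keyword%' (it will not find a keyword embedded inside a longer word).

```python
from collections import defaultdict


def build_word_index(terms):
    """Map each lowercase word to the set of positions of terms containing it."""
    index = defaultdict(set)
    for pos, term in enumerate(terms):
        for word in term.lower().split():
            index[word].add(pos)
    return index


def match_keywords(terms, keywords):
    """Return sorted (term, keyword) pairs where keyword is a whole word of term."""
    index = build_word_index(terms)  # one pass over the terms
    pairs = []
    for kw in keywords:
        # One dictionary lookup per keyword, no scan of the term list.
        for pos in index.get(kw.lower(), ()):
            pairs.append((terms[pos], kw))
    return sorted(pairs)


# Hypothetical sample data standing in for table1.term and table2.keyword.
terms = ["cheap red shoes", "blue suede shoes", "red wine"]
keywords = ["red", "shoes", "green"]
matches = match_keywords(terms, keywords)
```

The cost is roughly one pass over the terms plus one lookup per keyword, instead of (number of terms) x (number of keywords) string comparisons, which is why an index of this kind tends to pay for its build time at this scale.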

First, you should normalize your schema: create a third table that stores the relation between terms and keywords as term_id <-> keyword_id pairs, instead of what you are doing now, which is keeping them in a char field separated by spaces.
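A rough sketch of such a relation table follows. It uses Python's built-in sqlite3 module only so that it runs without a server; the question itself is about MySQL, where the DDL would differ slightly. The id columns, the term_keyword table name, and the sample rows are all assumptions for illustration, since the original schema is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed shape of the two existing tables (the id columns are hypothetical).
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, term TEXT NOT NULL);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, keyword TEXT NOT NULL);
    -- Hypothetical relation table: one row per (term, keyword) match.
    CREATE TABLE term_keyword (
        term_id    INTEGER NOT NULL REFERENCES table1(id),
        keyword_id INTEGER NOT NULL REFERENCES table2(id),
        PRIMARY KEY (term_id, keyword_id)
    );
    CREATE INDEX idx_term_keyword_keyword ON term_keyword (keyword_id);
""")

conn.executemany("INSERT INTO table1 (id, term) VALUES (?, ?)",
                 [(1, "cheap red shoes"), (2, "red wine")])
conn.executemany("INSERT INTO table2 (id, keyword) VALUES (?, ?)",
                 [(1, "red"), (2, "shoes")])
# The matches themselves still have to be computed once, by whatever means.
conn.executemany("INSERT INTO term_keyword (term_id, keyword_id) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])

# Lookups are now integer joins on indexed columns, not string scans.
rows = conn.execute("""
    SELECT t.term, k.keyword
    FROM term_keyword AS tk
    JOIN table1 AS t ON t.id = tk.term_id
    JOIN table2 AS k ON k.id = tk.keyword_id
    ORDER BY t.id, k.id
""").fetchall()
```

Note that this only helps after the matching has been done once; it stores the result of the term/keyword comparison rather than speeding up the comparison itself.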
