MySQL, need some performance suggestions on my match query

2023-01-18 16:33 问答作者：

I need some performance improvement guidance, my query takes several seconds to run and this is causin开发者_开发百科g problems on the server. This query runs on the most common page on my site. I think a radical rethink may be required.

~ EDIT ~ This query produces a list of records whose keywords match those of the program (record) being queried. My site is a software download directory. And this list is used on the program listing page to show other similar programs. PadID is the primary key of the program records in my database.

~ EDIT ~

Heres my query

 select match_keywords.PadID, count(match_keywords.Word) as matching_words 
 from keywords current_program_keywords 
 inner join keywords match_keywords on
       match_keywords.Word=current_program_keywords.Word 
 where match_keywords.Word IS NOT NULL 
 and current_program_keywords.PadID=44243 
 group by match_keywords.PadID 
 order by matching_words DESC 
 LIMIT 0,11;

Heres the query explained.

MySQL, need some performance suggestions on my match query

Heres some sample data, however I doubt you'd be able to see the effects of any performance tweaks without more data, which I can provide if you'd like.

 CREATE TABLE IF NOT EXISTS `keywords` (
   `Word` varchar(20) NOT NULL,
   `PadID` bigint(20) NOT NULL,
   `LetterIdx` varchar(1) NOT NULL,
   KEY `Word` (`Word`),
   KEY `LetterIdx` (`LetterIdx`),
   KEY `PadID_2` (`PadID`,`Word`)
 ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

 INSERT INTO `keywords` (`Word`, `PadID`, `LetterIdx`) VALUES
 ('tv', 44243, 'T'),
 ('satellite tv', 44243, 'S'),
 ('satellite tv to pc', 44243, 'S'),
 ('satellite', 44243, 'S'),
 ('your', 44243, 'X'),
 ('computer', 44243, 'C'),
 ('pc', 44243, 'P'),
 ('soccer on your pc', 44243, 'S'),
 ('sports on your pc', 44243, 'S'),
 ('television', 44243, 'T');

I've tried adding an index, but this doesn't make much difference.

 ALTER TABLE `keywords` ADD INDEX ( `PadID` )

You might find this helpful if I understood you correctly. The solution takes advantage of innodb's clustered primary key indexes (http://pastie.org/1195127)

EDIT: here's some links that may prove of interest:

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://dev.mysql.com/doc/refman/5.0/en/innodb-adaptive-hash.html

drop table if exists programmes;
create table programmes
(
prog_id mediumint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;

insert into programmes (name) values 
('prog1'),('prog2'),('prog3'),('prog4'),('prog5'),('prog6');


drop table if exists keywords;
create table keywords
(
keyword_id mediumint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;

insert into keywords (name) values 
('tv'),('satellite tv'),('satellite tv to pc'),('pc'),('computer');


drop table if exists programme_keywords;
create table programme_keywords
(
keyword_id mediumint unsigned not null,
prog_id mediumint unsigned not null,
primary key (keyword_id, prog_id), -- note clustered composite primary key
key (prog_id)
)
engine=innodb;

insert into programme_keywords values 

-- keyword 1
(1,1),(1,5),

-- keyword 2
(2,2),(2,4),

-- keyword 3
(3,1),(3,2),(3,5),(3,6),

-- keyword 4
(4,2),

-- keyword 5
(5,2),(5,3),(5,4);

/*
efficiently list all other programmes whose keywords match that of the 
programme currently being queried (for instance prog_id = 1) 
*/


drop procedure if exists list_matching_programmes;

delimiter #

create procedure list_matching_programmes
(
in p_prog_id mediumint unsigned
)
proc_main:begin

select
 p.*
from
 programmes p
inner join
(
 select distinct -- other programmes with same keywords as current
  pk.prog_id
 from
  programme_keywords pk
 inner join
 (
  select keyword_id from programme_keywords where prog_id = p_prog_id
 ) current_programme -- the current program keywords
 on pk.keyword_id = current_programme.keyword_id
 inner join programmes p on pk.prog_id = p.prog_id 

) matches 
on matches.prog_id = p.prog_id
order by
 p.prog_id;

end proc_main #


delimiter ;

call list_matching_programmes(1);
call list_matching_programmes(6); 


explain
select
 p.*
from
 programmes p
inner join
(
 select distinct
  pk.prog_id
 from
  programme_keywords pk
 inner join
 (
  select keyword_id from programme_keywords where prog_id = 1
 ) current_programme
 on pk.keyword_id = current_programme.keyword_id
 inner join programmes p on pk.prog_id = p.prog_id 

) matches 
on matches.prog_id = p.prog_id
order by
 p.prog_id;

EDIT: added char_idx functionality as requested

alter table keywords add column char_idx char(1) null after name;

update keywords set char_idx = upper(substring(name,1,1));

select * from keywords;

explain
select
 p.*
from
 programmes p
inner join
(
 select distinct
  pk.prog_id
 from
  programme_keywords pk
 inner join
 (
  select keyword_id from keywords where char_idx = 'P' -- just change the driver query
 ) keywords_starting_with
 on pk.keyword_id = keywords_starting_with.keyword_id
) matches 
on matches.prog_id = p.prog_id
order by
 p.prog_id;

Try this approach, not sure if it will help but at least is different:

select PadID, count(Word) as matching_words
from keywords k
where Word in (
  select Word 
  from keywords
  where PadID=44243 )
group by PadID 
order by matching_words DESC 
LIMIT 0,11

Anyway the job you want to get done is heavy, and full of string comparison, maybe exporting keywords and storing only numeric ids in the keyword table can reduce the times.

Ok after reviewing you database I think there is not a lot of room to improve in the query, in fact on my test server with index on Word it only takes about 0.15s to complete, without the index it is almost 4x times slower.

Anyway I think that implementing the change in database sctructure f00 and I have told you it will improve the response time.

Also drop the index PadID_2 as it is now it is futile and it will only slow your writes. What you should do but it requise to clean the database is to avoid duplicate keyword-prodId pair first removing al duplicate ones currently in DB (around 90k in my test with 3/4 of your DB) that will reduce query time and give meaningfull results. If you ask for a progId that has the keyword ABC that is duplicated for progdID2 then progID2 will be on top o other progIDs with the same ABC keyword but not duplicated, on my tests I have seen a progID that get several more matches that the same progID I am querying. After dropping duplicates from the DB you will need to change your application to avoid this problem again in the future and just for being safe you could add a primary key (or index with unique activated) to Word + ProgID.

继续阅读：database sql

MySQL, need some performance suggestions on my match query

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？