MySQL: is it possible to make this query any faster?
I have a table "test" containing millions of entries. Each row contains a floating point "feature" and a "count" of how often this feature is present in item "id". The primary key for this table is the combination of "id" and "feature", i.e. every item may have multiple features. There are usually a couple of hundred to a couple of thousand feature entries per item id.
create table test
(
    id int not null,
    feature double not null,
    count int not null,
    primary key (id, feature),
    key (feature)
);
The task is to find the 500 most similar items to a given reference item. Similarity is measured in number of identical feature values in both items. The query I have come up with is quoted below, but despite properly using indices its execution plan still contains "using temporary" and "using filesort", giving unacceptable performance for my use case.
select
t1.id,
t2.id,
sum( least( t1.count, t2.count )) as priority
from test as t1
inner join test as t2
on t2.feature = t1.feature
where t1.id = {some user supplied id value}
group by t1.id, t2.id
order by priority desc
limit 500;
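For reference, the plan can be reproduced by prefixing the query with EXPLAIN:

-- "Using temporary" and "Using filesort" show up in the Extra column
explain
select t1.id, t2.id, sum( least( t1.count, t2.count )) as priority
from test as t1 inner join test as t2 on t2.feature = t1.feature
where t1.id = {some user supplied id value}
group by t1.id, t2.id
order by priority desc
limit 500;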
Any ideas on how to improve on this? The schema can be modified and indices added as needed.
With the current schema, this query can hardly be improved. You already have an index on feature, and this is the best you can do with the current schema design.
The problem is that "more similar than" is not a relationship of order: if a is more similar to b than it is to c, it does not imply that c is less similar to a than it is to b. Hence, you cannot build a single index describing this relationship, and need to do it for each item separately, which would make your index N^2 entries long, where N is the number of items.
If you always need only the top 500 items, you can limit your index to that figure (in which case it will hold 500 * N entries).
MySQL does not support indexed or materialized views, so you will have to do it yourself:
Create a table like this:
CREATE TABLE similarity
(
    id1 INT NOT NULL,
    id2 INT NOT NULL,
    similarity DOUBLE NOT NULL,
    PRIMARY KEY (id1, id2),
    KEY (id1, similarity)
)
Whenever you insert a new feature into the table, reflect the changes in similarity, once for each direction of the pair:

INSERT INTO similarity (id1, id2, similarity)
SELECT @newid, id, LEAST(@newcount, count)
FROM test
WHERE feature = @newfeature
    AND id <> @newid
ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);

INSERT INTO similarity (id1, id2, similarity)
SELECT id, @newid, LEAST(@newcount, count)
FROM test
WHERE feature = @newfeature
    AND id <> @newid
ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);
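For example, after adding a new row (the values here are made up):

-- Hypothetical example values for the session variables used above
SET @newid = 42, @newfeature = 3.14, @newcount = 7;
INSERT INTO test (id, feature, count) VALUES (@newid, @newfeature, @newcount);
-- ...then run the two upserts above to bring similarity up to date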
On a timely basis, remove the excess similarities:
DELETE s
FROM (
    SELECT id1,
        (
        SELECT similarity
        FROM similarity si
        WHERE si.id1 = s.id1
        ORDER BY si.id1 DESC, si.similarity DESC
        LIMIT 499, 1
        ) AS cs
    FROM (
        SELECT DISTINCT id1
        FROM similarity
        ) s
    ) q
JOIN similarity s
    ON s.id1 = q.id1
    AND s.similarity < q.cs
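One way to run this "on a timely basis" is the event scheduler; the sketch below assumes event_scheduler is ON and that the DELETE above has been wrapped in a stored procedure named prune_similarity() (a made-up name):

-- Hypothetical scheduling sketch; prune_similarity() wraps the DELETE above
CREATE EVENT ev_prune_similarity
ON SCHEDULE EVERY 1 HOUR
DO CALL prune_similarity();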
Query your data:
SELECT id2
FROM similarity
WHERE id1 = @myid
ORDER BY similarity DESC
LIMIT 500
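Note that the upserts above only keep similarity current going forward; a one-time backfill from the existing data could look like this sketch (it is essentially the original join run once for every item, so expect it to be slow):

-- Assumes similarity starts out empty
INSERT INTO similarity (id1, id2, similarity)
SELECT t1.id, t2.id, SUM(LEAST(t1.count, t2.count))
FROM test t1
JOIN test t2 ON t2.feature = t1.feature AND t2.id <> t1.id
GROUP BY t1.id, t2.id;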
Having a floating point number as part of the primary key (PK) is a killer. For that matter, it should not be part of any constraint: unique key (UK), foreign key (FK), etc.
To improve the performance of your SQL query manyfold, try changing your schema as below:
CREATE TABLE test (
    item_id INTEGER NOT NULL,
    feature_id INTEGER NOT NULL,
    count INTEGER NOT NULL,
    PRIMARY KEY (item_id, feature_id) );

CREATE TABLE features (
    id INTEGER PRIMARY KEY,
    feature_value DOUBLE NOT NULL );

CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    item_description VARCHAR(100) NOT NULL );

ALTER TABLE test ADD CONSTRAINT fk_test_item_id FOREIGN KEY (item_id) REFERENCES items(id);
ALTER TABLE test ADD CONSTRAINT fk_test_feature_id FOREIGN KEY (feature_id) REFERENCES features(id);
With your test table normalized as above, items and features are separated into their own tables, and test becomes a mapping table that also carries the count of each mapping.
If you now run the SQL query from the question with the small modifications below, you should see a significant improvement in performance.
select t1.item_id, t2.item_id, sum( least( t1.count, t2.count )) as priority
from test as t1 inner join test as t2 on t2.feature_id = t1.feature_id
where t1.item_id = {some user supplied id value}
group by t1.item_id, t2.item_id
order by priority desc
limit 500;
Cheers!
One optimization would be to exclude the item itself from the self-join:
inner join test as t2
    on t2.feature = t1.feature and t2.id <> t1.id
                                   ^^^^^^^^^^^^^^
For further speedup, create a covering index on (feature, id, count).
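For instance (the index name is made up):

ALTER TABLE test ADD INDEX ix_feature_id_count (feature, id, count);

With this index in place, the self-join can be resolved from the index alone, without touching the table rows.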
I would start with this... I'd love to hear back on the performance you see. I don't think you need the LEAST() of the t1 vs t2 counts. If you first qualify the WHERE based on id = {some value}, you obviously get all of that item's "features". Then, via a self-join concerned only with the matching "features", you get a count. Since you are breaking it down by t1.id and t2.id, each respective "feature" is counted once. At the end of this query, since I am not explicitly excluding t2.id equal to the {some user value}, its count should be the EXACT SAME count of features as in t1, and anything under that would be your other closest matches.
I would ensure I had an index on ID and FEATURE.
select STRAIGHT_JOIN
t1.id,
t2.id,
count(*) as MatchedInBoth
from
test as t1,
test as t2
where
t1.id = {some user value}
and t1.feature = t2.feature
group by
t1.id,
t2.id
order by
MatchedInBoth desc
limit
500;
The result might look something like this:

t1            t2            MatchedInBoth
{user value}  {user value}  275
{user value}  Other ID 1    270
{user value}  Other ID 2    241
{user value}  Other ID 3    218
{user value}  Other ID 4    197
{user value}  Other ID 5    163, etc.
Can you knock it down to just one table? Using subqueries, you might be able to avoid the join, and it will be a win if the subqueries are fast, indexed, and executed exactly once. Something like this (untested):
select
t2.id,
SUM( t2.count ) as priority
from test as t2
where t2.id <> {some user supplied id value} AND
t2.count > (SELECT MIN(count) FROM test t1 WHERE id = {some user supplied value} ) AND
t2.feature IN (SELECT feature FROM test t1 WHERE id = {some user supplied value} )
group by t2.id
order by priority desc
limit 500;
If that doesn't work: MySQL is terrible at realizing the inner selects are constant tables and will re-execute them for each row. Wrapping them in a select again forces a constant table lookup. Here's a hack:
select
t2.id,
SUM( t2.count ) as priority
from test as t2
where t2.id <> {some user supplied id value} AND
t2.count > (
    SELECT * FROM (
        SELECT MIN(count) FROM test t1 WHERE id = {some user supplied value}
    ) as const ) AND
t2.feature IN (
    SELECT * FROM (
        SELECT feature FROM test t1 WHERE id = {some user supplied value}
    ) as const )
group by t2.id
order by priority desc
limit 500;
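For the constant subqueries and the outer filter to pay off, test still needs suitable indexes: the composite primary key from the question covers the id = {...} lookups, and the feature filter needs something along these lines (the index name is made up):

CREATE INDEX ix_test_feature ON test (feature);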