MySQL: is it possible to make this query any faster?

I have a table "test" containing millions of entries. Each row contains a floating point "feature" and a "count" of how often this feature is present in item "id". The primary key for this table is the combination of "id" and "feature", i.e. every item may have multiple features. There are usually a couple of hundred to a couple of thousand feature entries per item id.

create table test 
(
    id      int not null,
    feature double not null,
    count   int not null,
    primary key (id, feature)
);

The task is to find the 500 most similar items to a given reference item. Similarity is measured as the number of identical feature values in both items. The query I have come up with is quoted below, but despite properly using indices, its execution plan still contains "using temporary" and "using filesort", giving unacceptable performance for my use case.

select 
    t1.id,
    t2.id,
    sum( least( t1.count, t2.count )) as priority 
from test as t1
inner join test as t2 
     on t2.feature = t1.feature
where t1.id = {some user supplied id value} 
group by t1.id, t2.id 
order by priority desc
limit 500;

Any ideas on how to improve on this? The schema can be modified and indices added as needed.


With the current schema, this query can hardly be improved.

You already have an index on feature and this is the best you can do with the current schema design.
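
That is, something along these lines is assumed to be in place (the index name is illustrative):

    ALTER TABLE test ADD KEY ix_test_feature (feature);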

The problem is that "more similar than" is not an order relation: the ranking of all items by similarity to a says nothing about their ranking by similarity to b. Hence, you cannot build a single index describing this relationship; you would need a separate one for each item, which would make your index N^2 entries long, where N is the number of items.

If you always need only the top 500 items, you can limit your index to that figure (in which case it will hold 500 * N entries).

MySQL does not support indexed or materialized views, so you will have to do it yourself:

  1. Create a table like this:

    CREATE TABLE similarity
            (
            id1 INT NOT NULL,
            id2 INT NOT NULL,
            similarity DOUBLE NOT NULL,
            PRIMARY KEY (id1, id2),
            KEY (id1, similarity)
            )
    
  2. Whenever you insert a new feature row into test, reflect the change in similarity, in both directions (a usage sketch follows this list):

    INSERT
    INTO    similarity (id1, id2, similarity)
    SELECT  @newid, id,
            LEAST(@newcount, count)
    FROM    test
    WHERE   feature = @newfeature
            AND id <> @newid
    ON DUPLICATE KEY UPDATE
            similarity = similarity + VALUES(similarity);
    
    INSERT
    INTO    similarity (id1, id2, similarity)
    SELECT  id, @newid,
            LEAST(@newcount, count)
    FROM    test
    WHERE   feature = @newfeature
            AND id <> @newid
    ON DUPLICATE KEY UPDATE
            similarity = similarity + VALUES(similarity);
    
  3. On a timely basis, remove the excess similarities (the LIMIT 499, 1 subquery picks each id1's 500th-best similarity value as the cutoff):

    DELETE  s
    FROM    (
            SELECT  d.id1,
                    (
                    SELECT  si.similarity
                    FROM    similarity si
                    WHERE   si.id1 = d.id1
                    ORDER BY
                            si.similarity DESC
                    LIMIT 499, 1
                    ) AS cs
            FROM    (
                    SELECT  DISTINCT id1
                    FROM    similarity
                    ) d
            ) q
    JOIN    similarity s
    ON      s.id1 = q.id1
            AND s.similarity < q.cs;
    
  4. Query your data:

    SELECT  id2
    FROM    similarity
    WHERE   id1 = @myid
    ORDER BY
            similarity DESC
    LIMIT 500
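
For completeness, one maintenance round for a single new (id, feature, count) row might look like this; the variable values are purely hypothetical:

    SET @newid = 42, @newfeature = 3.14159, @newcount = 7;

    INSERT INTO test (id, feature, count)
    VALUES (@newid, @newfeature, @newcount);

    -- then run the two similarity INSERTs from step 2 with these variables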
    


Having a floating point number as part of the primary key (PK) is a killer: equality comparisons on doubles are fragile. For that matter, a double should not be part of any constraint - unique key (UK), foreign key (FK), etc.
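
To see why, compare the same arithmetic done with exact decimal literals and with doubles:

    SELECT 0.1 + 0.2 = 0.3;          -- returns 1: decimal literals compare exactly
    SELECT 0.1E0 + 0.2E0 = 0.3E0;    -- returns 0: the same values as doubles do not match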

To improve the performance of your SQL query manyfold, try changing your schema as below:

CREATE TABLE test (
    item_id    INTEGER,
    feature_id INTEGER,
    count      INTEGER,
    PRIMARY KEY (item_id, feature_id)
);

CREATE TABLE features (
    id            INTEGER PRIMARY KEY,
    feature_value DOUBLE NOT NULL
);

CREATE TABLE items (
    id               INTEGER PRIMARY KEY,
    item_description VARCHAR(100) NOT NULL
);

ALTER TABLE test ADD CONSTRAINT fk_test_item_id FOREIGN KEY (item_id) REFERENCES items(id);

ALTER TABLE test ADD CONSTRAINT fk_test_feature_id FOREIGN KEY (feature_id) REFERENCES features(id);

With the test table normalized as above, items and features live in their own tables, and test becomes a pure mapping table bearing the count of each mapping.
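
If you are migrating existing data, something along these lines could populate the new tables (purely a sketch: it assumes the original single table has been renamed to test_old, and the placeholder item descriptions and index name are illustrative):

    INSERT INTO items (id, item_description)
    SELECT DISTINCT id, CONCAT('item ', id)      -- placeholder descriptions
    FROM   test_old;

    INSERT INTO features (id, feature_value)     -- assign surrogate ids
    SELECT (@f := @f + 1), feature
    FROM   (SELECT DISTINCT feature FROM test_old) x
    CROSS JOIN (SELECT @f := 0) init;

    INSERT INTO test (item_id, feature_id, count)
    SELECT o.id, f.id, o.count
    FROM   test_old o
    JOIN   features f ON f.feature_value = o.feature;

    -- covering index for the self-join query below
    ALTER TABLE test ADD KEY ix_feature_item (feature_id, item_id, count);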

If you now run the query from the question with the small modifications below, you should see a significant improvement in performance.

select t1.item_id, t2.item_id, sum( least( t1.count, t2.count )) as priority 
from test as t1 inner join test as t2 on t2.feature_id = t1.feature_id 
where t1.item_id = {some user supplied id value}
group by t1.item_id, t2.item_id 
order by priority desc
limit 500;

Cheers!


One optimization would be to exclude the item itself from the self-join:

inner join test as t2 
     on t2.feature = t1.feature and t2.id <> t1.id
                                    ^^^^^^^^^^^^^^

For further speedup, create a covering index on (feature, id, count).
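
A sketch of that index (the name is illustrative):

    ALTER TABLE test ADD KEY ix_feature_id_count (feature, id, count);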


I would start with this... I'd love to hear back on the performance you see. I don't think you need the LEAST() of the t1 vs. t2 counts. If you first qualify the WHERE on id = {some value}, you obviously get all of that item's features. Then, via a self-join restricted to the matching features, you get a count. Since you are breaking it down by id1 and id2, each matching feature is counted once. At the end of this query, since I'm not explicitly excluding t2.id equal to {some user value}, its count should be exactly the count of features in t1, and anything below that would be your other closest matches.

I would ensure I had an index on ID and FEATURE.

select STRAIGHT_JOIN
      t1.id,
      t2.id, 
      count(*) as MatchedInBoth
   from 
      test as t1,
      test as t2
   where 
          t1.id = {some user value}
      and t1.feature = t2.feature
   group by
      t1.id,
      t2.id
   order by 
      MatchedInBoth desc 
   limit 
      500; 

The result might look something like this:

t1            t2           MatchedInBoth
{user value}  {user value} 275
{user value}  Other ID 1   270
{user value}  Other ID 2   241
{user value}  Other ID 3   218
{user value}  Other ID 4   197
{user value}  Other ID 5   163, etc


Can you knock it down to just one table? Using subqueries you might be able to avoid the join, which will be a win if the subqueries are fast, indexed, and executed exactly once. Something like this (untested):

select
t2.id,
SUM( t2.count ) as priority
from test as t2
where t2.id <> {some user supplied id value} AND
t2.count > (SELECT MIN(count) FROM test t1 WHERE id = {some user supplied value} ) AND
t2.feature IN (SELECT feature FROM test t1 WHERE id = {some user supplied value} )
group by t2.id
order by priority desc
limit 500;

If that doesn't work: MySQL is terrible at realizing that the inner selects are constant tables, and will re-execute them for each row. Wrapping each one in another SELECT forces a constant-table lookup. Here's a hack:


select
t2.id,
SUM( t2.count ) as priority
from test as t2
where t2.id <> {some user supplied id value} AND
t2.count > (
    SELECT * FROM (
        SELECT MIN(count) FROM test t1 WHERE id = {some user supplied value}
    ) as const1 ) AND
t2.feature IN ( SELECT * FROM (
        SELECT feature FROM test t1 WHERE id = {some user supplied value}
    ) as const2 )
group by t2.id
order by priority desc
limit 500;

