开发者

MySQL: simple schema, joining in a view and sorting on unrelated attribute causes unbearable performance hit

I'm creating a database model for use by a diverse amount of applications and different kinds of database servers (though I'm mostly testing on MySQL and SQLite now). It's a really simple model that basically consists of one central matches table and many attribute tables that have the match_id as their primary key and one other field (the attribute value itself). Said in other words, every match has exactly one of every type of attribute and every attribute is stored in a seperate table. After experiencing some rather bad performance whilst sorting and filtering on these attributes (FROM matches LEFT JOIN attributes_i_want on primary index) I decided to try to improve it. To this end I added an index on every attribute value column. Sorting and filtering performance increased a lot for easy queries.

This simple schema is basically a requirement for the application, so it is able to auto-discover and use attributes. Thus, to create more complex attributes that are actually based on other results, I decided to use VIEWs that turn one or more other tables that don't necessarily match up to the attribute-like schema into an attribute-schema. I call these meta-attributes (they aren't directly editable either). However, to the application this is all transparant, and so it happily joins in the VIEW as well when it wants to. The problem: it kills performance. When the VIEW is joined in without sorting on any attribute, performance is still acceptable, but combining a retrieval of the VIEW with sorting is unacceptably slow (on the order of 1s). Even after reading quite a bit of tutorials on indexing and some questions here on stack overflow, I can't seem to help it.

_Prerequisites for a solution: in one way or another, num_duplicates must exist as a table or view with the columns match_id and num_duplicates to look like an attribute. I can't change the way attributes are discovered and used. So if I want to see num_duplicates appear in the application it'll have to be as some kind of view or materialized table that makes a num_duplicates table._

Relevant parts of the schema

Main table:

CREATE TABLE `matches` (
  `match_id` int(11) NOT NULL,
  `source_name` text,
  `target_name` text,
  `transformation` text,
  PRIMARY KEY (`match_id`)
) ENGINE=InnoDB;

Example of a normal attribute (indexed):

CREATE TABLE `error` (
  `match_id` int(11) NOT NULL,
  `error` double DEFAULT NULL,
  PRIMARY KEY (`match_id`),
  KEY `error_index` (`error`)
) ENGINE=InnoDB;

(all normal attributes, like error, are basically the same)

Meta-attribute / VIEW:

CREATE VIEW num_duplicates 
AS SELECT duplicate AS match_id, COUNT(duplicate) AS num_duplicates 
    FROM duplicate 
    GROUP BY duplicate

(this is the only meta-attribute I'm using right now)

Simple query with indexing on the attribute value columns (the part improved by indexes)

SELECT matches.match_id, source_name, target_name, transformation FROM matches
    INNER JOIN error ON matches.match_id = error.match_id
ORDER BY error.error

(the performance on this query increased a lot because of the index on error) (the runtime of this query is on the order of 0.0001 sec)

Slightly more complex queries and their runtimes including the meta-attribute (the still bad part)

SELECT 
    matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT , num_duplicates
FROM matches
    INNER JOIN STATUS ON matches.match_id = status.match_id
    INNER JOIN error ON matches.match_id = error.match_id
    LEFT JOIN num_duplicates ON matches.match_id = num_duplicates.match_id
    INNER JOIN volume ON matches.match_id = volume.match_id
    INNER JOIN COMMENT ON matches.match_id = comment.match_id

(runtime: 0.0263sec) <--- still acceptable

SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT , num_duplicates
FROM matches
    INNER JOIN STATUS ON matches.match_id = status.match_id
    INNER JOIN error ON matches.match_id = error.match_id
    LEFT JOIN num_duplicates ON matches.match_id = num_duplicates.match_id
    INNER JOIN volume ON matches.match_id = volume.match_id
    INNER JOIN COMMENT ON matches.match_id = comment.match_id
ORDER BY error.error
LIMIT 20, 20

(runtime: 0.8866 sec) <--- not acceptable (the query speed is exactly the same with the LIMIT as without the LIMIT, note: if I could get the version with the LIMIT to be fast that would already be a big win. I presume it has to scan the entire table and so the limit doesn't matter too much)

EXPLAIN of the last query

Of course I tried to solve it myself before coming here, but I must admit I'm not that good at these things and haven't found a way to remove the offending performance killer yet. I know it's most likely the using filesort but I don't know how to get rid of it.

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   PRIMARY error   index   PRIMARY,match_id    error_index 9   NULL    53909   Using index; Using temporary; Using filesort
1   PRIMARY COMMENT eq_ref  PRIMARY PRIMARY 4   tangbig4.error.match_id 1    
1   PRIMARY STATUS  eq_ref  PRIMARY PRIMARY 4   tangbig4.COMMENT.match_id   1   Using where
1开发者_运维知识库   PRIMARY matches eq_ref  PRIMARY PRIMARY 4   tangbig4.COMMENT.match_id   1   Using where
1   PRIMARY <derived2>  ALL NULL    NULL    NULL    NULL    2    
1   PRIMARY volume  eq_ref  PRIMARY PRIMARY 4   tangbig4.matches.match_id   1   Using where
2   DERIVED duplicate   index   NULL    duplicate_index 5   NULL    49222   Using index

By the way, the query without the sort, which still runs acceptably, is EXPLAIN'ed like this:

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   PRIMARY COMMENT ALL PRIMARY NULL    NULL    NULL    49610    
1   PRIMARY error   eq_ref  PRIMARY,match_id    PRIMARY 4       tangbig4.COMMENT.match_id   1    
1   PRIMARY matches eq_ref  PRIMARY PRIMARY 4   tangbig4.COMMENT.match_id   1    
1   PRIMARY <derived2>  ALL NULL    NULL    NULL    NULL    2    
1   PRIMARY STATUS  eq_ref  PRIMARY PRIMARY 4   tangbig4.COMMENT.match_id   1    
1   PRIMARY volume  eq_ref  PRIMARY PRIMARY 4   tangbig4.matches.match_id   1       Using where
2   DERIVED duplicate   index   NULL    duplicate_index 5   NULL    49222   Using index

Question

So, my question is if someone who know more about databases/MySQL is able to find me a way that I can use/research to increase the performance of my last query.

I've been thinking quite a lot about materialized views but they are not natively supported in MySQL and since I'm going for as wide a range of SQL servers as possible this might not be idea. I'm hoping maybe a change to the queries or views might help or possible an extra index.

EDIT: Some random thoughts I've been having about the query:

  • VERY FAST: joining all tables, excluding the VIEW, sorting
  • ACCEPTABLE: joining all tables, including the VIEW, no sorting
  • DOG SLOW: joining all tables, including the VIEW, sorting

But: the VIEW has no influence at all on the sorting, none of it's attributes or even the attributes in its constituent tables are used to sort. Why does includingg the sort impact performance that much then? Is there any way I can convince the database to sort first and then just join up the VIEW? Or can I convince it that the VIEW is not important for sorting?

EDIT2: Following the suggestion by @ace for creating a VIEW and then joining at first didn't seem to help:

DROP VIEW IF EXISTS `matches_joined`;
CREATE VIEW `matches_joined` AS (
  SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT
  FROM matches
  INNER JOIN STATUS ON matches.match_id = status.match_id
  INNER JOIN error ON matches.match_id = error.match_id
  INNER JOIN volume ON matches.match_id = volume.match_id
  INNER JOIN COMMENT ON matches.match_id = comment.match_id
  ORDER BY error.error
);

followed by:

SELECT matches_joined.*, num_duplicates
FROM matches_joined
LEFT JOIN num_duplicates ON matches_joined.match_id = num_duplicates.match_id

However, using LIMIT on the view did make a difference:

DROP VIEW IF EXISTS `matches_joined`;
CREATE VIEW `matches_joined` AS (
  SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT
  FROM matches
  INNER JOIN STATUS ON matches.match_id = status.match_id
  INNER JOIN error ON matches.match_id = error.match_id
  INNER JOIN volume ON matches.match_id = volume.match_id
  INNER JOIN COMMENT ON matches.match_id = comment.match_id
  ORDER BY error.error
  LIMIT 0, 20
);

Afterwards, the query ran at an acceptable speed. This is already a nice result. However, I feel that I'm jumping through hoops to force the database to do what I want and the reduction in time is probably only caused by the fact that it now only has to sort 20 rows. What if I have more rows? Is there any other way to force the database to see that joining in the num_duplicates VIEW doesn't influence the sorting in the least? Could I perhaps change the query that makes the VIEW a bit?


Some things that can be tested if you haven't tried them yet. Create a view for all joins with sorting.

DROP VIEW IF EXISTS `matches_joined`;
CREATE VIEW `matches_joined` AS (
  SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT
  FROM matches
  INNER JOIN STATUS ON matches.match_id = status.match_id
  INNER JOIN error ON matches.match_id = error.match_id
  INNER JOIN volume ON matches.match_id = volume.match_id
  INNER JOIN COMMENT ON matches.match_id = comment.match_id
  ORDER BY error.error
);

Then join them with num_duplicates

SELECT matches_joined.*, num_duplicates
FROM matches_joined
LEFT JOIN num_duplicates ON matches_joined.match_id = num_duplicates.match_id

I'm assuming that as pointed out in here, this query will utilize the order by clause in the view matches_joined.

Some information that may help on optimization.
MySQL :: MySQL 5.0 Reference Manual :: 7.3.1.11 ORDER BY Optimization


The problem was more or less solved by the "VIEW" suggestion that @ace made, but several other types of queries still had performance issues (notably large OFFSET's). In the end a large improvement on all queries of this form was had by simply forcing late-row lookup. Note that it is commonly claimed that this is only necessary for MySQL because MySQL always performs early-row lookup and that other databases like PostgreSQL don't suffer from this problem. However, extensive benchmarks of my application have pointed out that PostgreSQL benefits greatly from this approach as well.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜