How to properly index tables used in a query with multiple joins
I'm trying to determine the best way to index my query below.
So far I have created composite (grouped) indexes on the fields used in the joins, followed by the fields used in the where clause filters, in the order I use them.
Or should I create separate indexes on the fields used in the joins, and a separate composite index on the fields used in the where/group by/order by clauses?
SELECT [fields..]
FROM articles
INNER JOIN articles_to_geo
ON articles_to_geo.article_id = articles.article_id
INNER JOIN cities_whitelist
ON cities_whitelist.city_id = articles_to_geo.whitelist_city_id
INNER JOIN cities
ON cities.city_id = cities_whitelist.city_id
INNER JOIN articles_to_badges
ON articles_to_badges.article_id = articles.article_id
INNER JOIN badges
ON badges.id = articles_to_badges.badge_id
INNER JOIN sites
ON sites.id = articles.site_id
WHERE articles.expirydate > '2010-07-12'
AND articles.dateadded > '2010-08-11'
AND articles.status >= 6
AND cities.city_id = 5794
AND cities.timezone = -7
AND cities_whitelist.published = 1
AND articles_to_badges.badge_id IN (1,3,8,7)
ORDER BY sites.sort_order
For instance, my articles table has a composite index on:
index 1
article_id
site_id
expirydate
status
dateadded
Or should I have 2 indexes?
index 1 //used for the join clauses
article_id
index 2 //used for the where/order by /group by clauses
site_id
expirydate
status
dateadded
Note: my other tables have indexes on them as well.
Any help would be greatly appreciated
Note: SQL Server is what I use. If you're using something else - this may not apply. Also note: I'm going to discuss indexes to aid in accessing data from a table. Covering indexes are a separate topic that I am not addressing here.
There are 3 ways to access a table:
- Use Filtering Criteria.
- Use Relational Criteria from rows already read.
- Read the Whole Table!
I started by making a list of all tables, with filtering criteria and relational criteria.
articles
articles.expirydate > 'somedate'
articles.dateadded > 'somedate'
articles.status >= someint
articles.article_id <-> articles_to_geo.article_id
articles.article_id <-> articles_to_badges.article_id
articles.site_id <-> sites.id
articles_to_geo
articles_to_geo.article_id <-> articles.article_id
articles_to_geo.whitelist_city_id <-> cities_whitelist.city_id
cities_whitelist
cities_whitelist.published = someint
cities_whitelist.city_id <-> articles_to_geo.whitelist_city_id
cities_whitelist.city_id <-> cities.city_id
cities
cities.city_id = someint
cities.timezone = someint
cities.city_id <-> cities_whitelist.city_id
articles_to_badges
articles_to_badges.badge_id in (some ids)
articles_to_badges.article_id <-> articles.article_id
articles_to_badges.badge_id <-> badges.id
badges
badges.id <-> articles_to_badges.badge_id
sites
sites.id <-> articles.site_id
The clumsiest way to approach this is to simply make an index on each table for every relational and filtering criterion, then let the optimizer choose which indexes it wants to use. This approach is great for IO performance and simple to do, but it costs a lot of space in unused indexes.
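As a rough sketch of this index-everything approach, the statements can be generated mechanically from the columns that appear in the query's WHERE and ON clauses. The snippet is written in Python purely for illustration; the `CRITERIA` mapping and the `ix_<table>_<column>` naming scheme are assumptions of mine, and the generated `CREATE INDEX` statements use plain syntax that would still need adapting to your engine.

```python
# Columns per table that appear in a filter or join predicate of the query.
# The index naming scheme (ix_<table>_<column>) is hypothetical.
CRITERIA = {
    "articles": ["expirydate", "dateadded", "status", "article_id", "site_id"],
    "articles_to_geo": ["article_id", "whitelist_city_id"],
    "cities_whitelist": ["published", "city_id"],
    "cities": ["city_id", "timezone"],
    "articles_to_badges": ["badge_id", "article_id"],
    "badges": ["id"],
    "sites": ["id"],
}


def index_statements(criteria):
    """Return one single-column CREATE INDEX statement per (table, column)."""
    statements = []
    for table, columns in criteria.items():
        for column in columns:
            name = f"ix_{table}_{column}"
            statements.append(f"CREATE INDEX {name} ON {table} ({column});")
    return statements


for statement in index_statements(CRITERIA):
    print(statement)
```

Columns that are already primary keys (probably articles.article_id, badges.id and sites.id, though the question does not say) are indexed already and would normally be dropped from the list.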
The next best way is to run the query with these options turned on:
SET STATISTICS IO ON
SET STATISTICS TIME ON
If a particular set of tables is using more IO, indexing efforts can be focused on them. This relies on the optimizer's plan already accessing the tables in a reasonably good order.
If the optimizer can't make a good plan at all because of the lack of indexes, what I do is figure out which order I'd like the tables to be accessed, then add indexes that support those accesses.
Note: the first table accessed does not have the option of using relational criteria, as no records have been read yet. The first table must be accessed by Filtering Criteria or by Read the Whole Table.
One possible order is the order in the query. This approach might be pretty bad because our Articles Filtering Criteria is based on 3 different ranges. There could be thousands of articles that meet that criteria and it's hard to formulate an index to support those ranges.
Articles (Filter)
Articles_to_Geo (Relational by Article_Id)
Cities_WhiteList (Relational by City_Id) (Filter)
Cities (Relational by City_Id) (Filter)
Articles_to_Badges (Relational by Article_Id) (Filter)
Badges (Relational by Badge_Id)
Sites (Relational by Site_Id)
Another possible order is Cities first. The Criteria for Cities is easily indexable and there might only be 1 row! Finding the articles for a City and then filtering by date should read fewer rows than finding the articles for dates and then filtering down to the City.
Cities (Filter)
Cities_WhiteList (Relational by City_Id) (Filter)
Articles_to_Geo (Relational by City_Id)
Articles (Relational by Article_Id) (Filter)
Articles_to_Badges (Relational by Article_Id) (Filter)
Badges (Relational by Badge_Id)
Sites (Relational by Site_Id)
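A minimal sketch of indexes that would support the first four steps of this Cities-first order, tried against an in-memory SQLite database from Python because that is easy to run anywhere. SQLite only stands in for the real engine here; the column types, the `title` column and the index names are made up. Each lookup imitates one step of the order above by searching on the key the previous step would hand over.

```python
import sqlite3

SCHEMA = """
CREATE TABLE cities (city_id INTEGER PRIMARY KEY, timezone INTEGER);
CREATE TABLE cities_whitelist (city_id INTEGER, published INTEGER);
CREATE TABLE articles_to_geo (article_id INTEGER, whitelist_city_id INTEGER);
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, site_id INTEGER,
                       expirydate TEXT, dateadded TEXT, status INTEGER,
                       title TEXT);
-- Hypothetical indexes, one per relational step of the Cities-first order.
CREATE INDEX ix_whitelist_city ON cities_whitelist (city_id, published);
CREATE INDEX ix_geo_city ON articles_to_geo (whitelist_city_id, article_id);
"""

# One lookup per step; the literals play the role of keys from the prior step.
STEPS = {
    "cities":
        "SELECT city_id FROM cities WHERE city_id = 5794 AND timezone = -7",
    "cities_whitelist":
        "SELECT city_id FROM cities_whitelist "
        "WHERE city_id = 5794 AND published = 1",
    "articles_to_geo":
        "SELECT article_id FROM articles_to_geo WHERE whitelist_city_id = 5794",
    "articles":
        "SELECT title FROM articles WHERE article_id = 42 AND status >= 6",
}


def step_plans():
    """Return SQLite's query plan text for each step, keyed by table name."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    plans = {}
    for table, sql in STEPS.items():
        rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        plans[table] = " | ".join(row[-1] for row in rows)  # last column = detail
    conn.close()
    return plans


for table, detail in step_plans().items():
    print(table, "->", detail)
```

The date and status filters on articles are applied after the lookup by article_id, which is the point of this order: by then only the articles for one city are left to check.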
A third approach could be Badges first. This would be best if articles rarely accumulate Badges and there aren't many Badges.
Badges (Read the Whole Table)
Articles_to_Badges (Relational by Badge_Id) (Filter)
Articles (Relational by Article_Id) (Filter)
Articles_to_Geo (Relational by Article_Id)
Cities_WhiteList (Relational by City_Id) (Filter)
Cities (Relational by City_Id) (Filter)
Sites (Relational by Site_Id)
I would suggest reading this: http://hackmysql.com/case4
It's a good explanation of when and what to index.
For a start I would create indexes for these:
- articles_to_geo.article_id
- cities_whitelist.city_id
- cities.city_id
- articles_to_badges.article_id
- articles_to_badges.badge_id
- badges.id
- sites.id
Without the above, your joins and the IN() are going to take forever.
Edit: I removed the article_id field from the articles index.
Back in the old days, RDBMSs were not able to combine B-Tree indexes on one table; see this article: http://use-the-index-luke.com/sql/where-clause/searching-for-ranges/index-merge-performance. This means, for example, that if you had separate indexes for all the articles columns used in this query, only one of those indexes would be used.
Based solely on this query, you should have the following indexes:
articles
site_id
expirydate
status
dateadded
articles_to_geo
article_id
cities_whitelist
city_id
cities
sites is joined on sites.id = articles.site_id. Here I suppose id is the primary key on sites, hence no additional index is needed. There is also no need for additional indexes on cities.city_id and cities.timezone, as they will be part of a filter predicate anyway.
articles_to_badges
article_id
badge_id (or this could be a second index of type Bitmap, refer to the article above)
badges is also joined on its primary key, so no additional index is needed if you have a unique index on the id field.
Notes on the articles index: the order of the fields in the index has nothing to do with the order in which the fields appear in the where clause. If you keep this order, the index can be used in all those queries where you specify
- site_id
- site_id and expirydate
- etc.
But this cannot be used in those queries where you only specify
- expirydate
- expirydate and status
- etc.
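The leading-column rule above can be checked with a small experiment. This is only a sketch, run against an in-memory SQLite database from Python as a stand-in for your real engine; the `title` column and the index name `ix_articles_filter` are assumptions added so that the index does not cover the whole query.

```python
import sqlite3


def plan(conn, sql):
    """Return the detail text of SQLite's query plan for the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE articles (article_id INTEGER PRIMARY KEY, site_id INTEGER,
                               expirydate TEXT, status INTEGER, dateadded TEXT,
                               title TEXT);
        -- Same column order as the articles index suggested above.
        CREATE INDEX ix_articles_filter
            ON articles (site_id, expirydate, status, dateadded);
    """)
    # Starts with the leading index column, so the index can be searched.
    leading = plan(conn, "SELECT title FROM articles "
                         "WHERE site_id = 3 AND expirydate > '2010-07-12'")
    # Skips site_id, so the index cannot narrow the search.
    skipping = plan(conn, "SELECT title FROM articles "
                          "WHERE expirydate > '2010-07-12' AND status >= 6")
    conn.close()
    return leading, skipping


leading, skipping = demo()
print("site_id + expirydate:", leading)
print("expirydate + status: ", skipping)
```

Other engines print their plans differently, and some can occasionally work around a missing leading column, so treat this as an illustration of the rule rather than a prediction for your database.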