How to properly index tables used in a query with multiple joins

I'm trying to determine the best way to index my query below.

So far I have created composite (grouped) indexes on the fields used in joins, followed by the fields used in the WHERE clause filters, in the order I use them.

Or should I create separate indexes on the fields used in joins, and a separate grouped index on the fields used in the WHERE/GROUP BY/ORDER BY clauses?

  SELECT        [fields..]
  FROM          articles
  INNER JOIN    articles_to_geo 
    ON          articles_to_geo.article_id = articles.article_id 
  INNER JOIN    cities_whitelist 
    ON          cities_whitelist.city_id = articles_to_geo.whitelist_city_id
  INNER JOIN    cities 
    ON          cities.city_id = cities_whitelist.city_id
  INNER JOIN    articles_to_badges 
    ON          articles_to_badges.article_id = articles.article_id 
  INNER JOIN    badges 
    ON          badges.id = articles_to_badges.badge_id
  INNER JOIN    sites 
    ON          sites.id = articles.site_id
  WHERE         articles.expirydate > '2010-07-12'
  AND           articles.dateadded > '2010-08-11'
  AND           articles.status >= 6 

  AND           cities.city_id = 5794
  AND           cities.timezone = -7
  AND           cities_whitelist.published = 1      

  AND           articles_to_badges.badge_id IN (1,3,8,7)  

  ORDER BY      sites.sort_order

For instance, my articles table has a grouped index on:

index 1

article_id
site_id
expirydate
status
dateadded

Or should I have 2 indexes?

index 1 //used for the join clauses

article_id

index 2 //used for the where/order by/group by clauses

site_id
expirydate
status
dateadded

Note: my other tables have indexes on them as well.

Any help would be greatly appreciated.


Note: SQL Server is what I use; if you're using something else, this may not apply. Also note: I'm going to discuss indexes that aid in accessing data from a table. Covering indexes are a separate topic that I'm not addressing here.

When accessing a table, there are three ways to do it.

  • Use Filtering Criteria.
  • Use Relational Criteria from rows already read.
  • Read the Whole Table!

I started by making a list of all tables, with filtering criteria and relational criteria.

articles

  articles.expirydate > 'somedate'
  articles.dateadded > 'somedate'
  articles.status >= someint

  articles.article_id <-> articles_to_geo.article_id
  articles.article_id <-> articles_to_badges.article_id
  articles.site_id <-> sites.id

articles_to_geo

  articles_to_geo.article_id <-> articles.article_id
  articles_to_geo.whitelist_city_id <-> cities_whitelist.city_id

cities_whitelist

  cities_whitelist.published = someint

  cities_whitelist.city_id <-> articles_to_geo.whitelist_city_id
  cities_whitelist.city_id <-> cities.city_id

cities

  cities.city_id <-> cities_whitelist.city_id

articles_to_badges

  articles_to_badges.badge_id in (some ids)

  articles_to_badges.article_id <-> articles.article_id
  articles_to_badges.badge_id <-> badges.id

badges

  badges.id <-> articles_to_badges.badge_id

sites

  sites.id <-> articles.site_id

The clumsiest way to approach this is to simply make an index on each table that supports each relational and filtering criterion... then let the optimizer choose which indexes it wants to use. This approach is great for IO performance and simple to do, but it costs a lot of space in unused indexes.
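For the articles table alone, that approach would mean something like the following sketch (index names are made up, and article_id may well already be covered by the primary key):

  -- One index per criterion; hypothetical names, some may duplicate existing PK indexes.
  CREATE INDEX ix_articles_article_id ON articles (article_id);
  CREATE INDEX ix_articles_site_id    ON articles (site_id);
  CREATE INDEX ix_articles_expirydate ON articles (expirydate);
  CREATE INDEX ix_articles_dateadded  ON articles (dateadded);
  CREATE INDEX ix_articles_status     ON articles (status);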

The next best way is to run the query with these options turned on:

SET STATISTICS IO ON
SET STATISTICS TIME ON

If a particular set of tables is using more IO, indexing efforts can be focused on them. This relies on the optimizer's plan for the order in which the tables are accessed already being pretty good.
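For example, I'd run something like this (the query body is the one from the question) and look at the logical reads reported per table:

  SET STATISTICS IO ON;
  SET STATISTICS TIME ON;

  SELECT [fields..]
  FROM   articles
  INNER JOIN articles_to_geo ON articles_to_geo.article_id = articles.article_id
  -- ... rest of the joins and WHERE clause from the question ...
  ORDER BY sites.sort_order;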


If the optimizer can't make a good plan at all because of the lack of indexes, what I do is figure out the order in which I'd like the tables to be accessed, then add indexes that support those accesses.

Note: the first table accessed does not have the option of using Relational Criteria, as no records have been read yet. The first table must be accessed by Filtering Criteria or by reading the whole table.

One possible order is the order in the query. This approach might be pretty bad because our Articles filtering criteria are based on 3 different ranges. There could be thousands of articles that meet those criteria, and it's hard to formulate an index to support those ranges.

Articles (Filter)
  Articles_to_Geo (Relational by Article_Id)
    Cities_WhiteList (Relational by City_Id) (Filter)
    Cities (Relational by City_Id) (Filter)
  Articles_to_Badges (Relational by Article_Id) (Filter)
    Badges (Relational by Badge_Id)
  Sites (Relational by Site_Id)

Another possible order is Cities first. The criteria for Cities are easily indexable, and there might be only 1 matching row! Finding the articles for a City and then filtering by date should read fewer rows than finding the articles for dates and then filtering down to the City.

Cities (Filter)
  Cities_WhiteList (Relational by City_Id) (Filter)
  Articles_to_Geo (Relational by City_Id)
    Articles (Relational by Article_Id) (Filter)
      Articles_to_Badges (Relational by Article_Id) (Filter)
        Badges (Relational by Badge_Id)
      Sites (Relational by Site_Id)
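If I wanted to support this access order, the indexes might look roughly like the following. This is only a sketch: the index names are made up, and I'm assuming cities.city_id, articles.article_id, badges.id and sites.id are already primary keys, so those lookups need nothing extra.

  CREATE INDEX ix_whitelist_city_pub ON cities_whitelist   (city_id, published);
  CREATE INDEX ix_geo_city_article   ON articles_to_geo    (whitelist_city_id, article_id);
  CREATE INDEX ix_badges_by_article  ON articles_to_badges (article_id, badge_id);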

A third approach could be Badges first. This would be best if articles rarely accumulate Badges and there aren't many Badges.

Badges (Read the Whole Table)
  Articles_to_Badges (Relational by Badge_Id) (Filter)
    Articles (Relational by Article_Id) (Filter)
      Articles_to_Geo (Relational by Article_Id)
        Cities_WhiteList (Relational by City_Id) (Filter)
        Cities (Relational by City_Id) (Filter)
    Sites (Relational by Site_Id)
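The index support for this order is almost the same as above; the main difference is that articles_to_badges is now entered by badge_id, so the column order flips (again, a sketch with a made-up name):

  CREATE INDEX ix_badges_by_badge ON articles_to_badges (badge_id, article_id);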


I would suggest reading this: http://hackmysql.com/case4

It's a good explanation of when and what to index.

For a start I would create indexes on these:

  1. articles_to_geo.article_id
  2. cities_whitelist.city_id
  3. cities.city_id
  4. articles_to_badges.article_id
  5. articles_to_badges.badge_id
  6. badges.id
  7. sites.id

Without the above, your joins and the IN() are going to take forever.
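In DDL terms that's roughly the following (index names are arbitrary; any column that is already a primary key, e.g. badges.id or sites.id, doesn't need the extra index):

  CREATE INDEX idx_atg_article   ON articles_to_geo    (article_id);
  CREATE INDEX idx_cw_city       ON cities_whitelist   (city_id);
  CREATE INDEX idx_cities_city   ON cities             (city_id);
  CREATE INDEX idx_atb_article   ON articles_to_badges (article_id);
  CREATE INDEX idx_atb_badge     ON articles_to_badges (badge_id);
  CREATE INDEX idx_badges_id     ON badges             (id);
  CREATE INDEX idx_sites_id      ON sites              (id);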


Edit: I removed the article_id field from the articles index.

Back in the old days, RDBMSs were not able to combine B-Tree indexes on one table; see this article: http://use-the-index-luke.com/sql/where-clause/searching-for-ranges/index-merge-performance. This means, for example, that if you had separate indexes on all of the articles columns used in this query, only one of those indexes would be used.

Based solely on this query, you should have the following indexes:

articles

site_id
expirydate
status
dateadded

articles_to_geo

article_id

cities_whitelist

city_id

cities is joined on cities.city_id = cities_whitelist.city_id; no need for additional indexes on cities.city_id and cities.timezone, as they will be part of a filter predicate anyway.

sites is joined on sites.id = articles.site_id. Here I suppose id is the primary key on sites, hence no additional index is needed.

articles_to_badges

article_id
badge_id (or this could be a second index of type Bitmap, refer to the article above)

badges is also joined on its primary key; no need for an additional index if you have a unique index on the id field.
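Put together, and with made-up index names, the suggestion amounts to roughly the following (I've folded articles_to_badges into one composite index here; a separate badge_id index is the alternative mentioned above):

  CREATE INDEX idx_articles_filter ON articles           (site_id, expirydate, status, dateadded);
  CREATE INDEX idx_atg_article     ON articles_to_geo    (article_id);
  CREATE INDEX idx_cw_city         ON cities_whitelist   (city_id);
  CREATE INDEX idx_atb_art_badge   ON articles_to_badges (article_id, badge_id);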

Notes on the articles index: the order of the fields in the index has nothing to do with the order of the fields appearing in the WHERE clause. If you keep this order, then the index can be used in all those queries where you specify

  • site_id
  • site_id and expirydate
  • etc.

But this cannot be used in those queries where you only specify

  • expirydate
  • expirydate and status
  • etc.
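To illustrate with the articles index above (column order site_id, expirydate, status, dateadded; the literal values are just examples):

  -- These can seek on the composite index, because site_id (the leading column) is specified:
  SELECT * FROM articles WHERE site_id = 3;
  SELECT * FROM articles WHERE site_id = 3 AND expirydate > '2010-07-12';

  -- These cannot seek on it, because the leading column site_id is missing:
  SELECT * FROM articles WHERE expirydate > '2010-07-12';
  SELECT * FROM articles WHERE expirydate > '2010-07-12' AND status >= 6;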