How to properly index tables used in a query with multiple joins

Posted 2023-04-01 08:31

I'm trying to determine the best way to index the tables used in my query below.

So far I have created composite indexes that start with the fields used in the joins, followed by the fields used in the WHERE clause filters, in the order in which I use them.

Or should I create separate indexes on the fields used in joins, and a separate composite index on the fields used in the WHERE/GROUP BY/ORDER BY clauses?

  SELECT        [fields..]
  FROM          articles
  INNER JOIN    articles_to_geo 
    ON          articles_to_geo.article_id = articles.article_id 
  INNER JOIN    cities_whitelist 
    ON          cities_whitelist.city_id = articles_to_geo.whitelist_city_id
  INNER JOIN    cities 
    ON          cities.city_id = cities_whitelist.city_id
  INNER JOIN    articles_to_badges 
    ON          articles_to_badges.article_id = articles.article_id 
  INNER JOIN    badges 
    ON          badges.id = articles_to_badges.badge_id
  INNER JOIN    sites 
    ON          sites.id = articles.site_id
  WHERE         articles.expirydate > '2010-07-12'
  AND           articles.dateadded > '2010-08-11'
  AND           articles.status >= 6 

  AND           cities.city_id = 5794
  AND           cities.timezone = -7
  AND           cities_whitelist.published = 1      

  AND           articles_to_badges.badge_id IN (1,3,8,7)  

  ORDER BY      sites.sort_order

For instance, my articles table has a composite index on:

index 1

article_id
site_id
expirydate
status
dateadded

Or should I have 2 indexes?

index 1 //used for the join clauses

article_id

index 2 //used for the where/order by /group by clauses

site_id
expirydate
status
dateadded

Note: my other tables have indexes on them as well.

Any help would be greatly appreciated

Note: SQL Server is what I use. If you're using something else - this may not apply. Also note: I'm going to discuss indexes to aid in accessing data from a table. Covering indexes are a separate topic that I am not addressing here.

When accessing a table, there are 3 ways to do it.

Use Filtering Criteria.
Use Relational Criteria from rows already read.
Read the Whole Table!

I started by making a list of all tables, with filtering criteria and relational criteria.

articles

  articles.expirydate > 'somedate'
  articles.dateadded > 'somedate'
  articles.status >= someint

  articles.article_id <-> articles_to_geo.article_id
  articles.article_id <-> articles_to_badges.article_id
  articles.site_id <-> sites.id

articles_to_geo

  articles_to_geo.article_id <-> articles.article_id
  articles_to_geo.whitelist_city_id <-> cities_whitelist.city_id

cities_whitelist

  cities_whitelist.published = someint

  cities_whitelist.city_id <-> articles_to_geo.whitelist_city_id
  cities_whitelist.city_id <-> cities.city_id

cities

  cities.city_id = someint
  cities.timezone = someint

  cities.city_id <-> cities_whitelist.city_id

articles_to_badges

  articles_to_badges.badge_id in (some ids)

  articles_to_badges.article_id <-> articles.article_id
  articles_to_badges.badge_id <-> badges.id

badges

  badges.id <-> articles_to_badges.badge_id

sites

  sites.id <-> articles.site_id

The clumsiest way to approach this is to simply make an index on each table for each relational and filtering criterion, then let the optimizer choose which indexes it wants to use. This approach is great for read IO performance, and simple to do, but it costs a lot of space in unused indexes.
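As a rough sketch of this first approach for the articles table only, the statements below create one index per criterion from the list above. The index names are made up for illustration, and articles.article_id is assumed to be the primary key (and so already indexed); neither is confirmed by the question.

```sql
-- Hypothetical index names: one single-column index per criterion on articles.
CREATE INDEX ix_articles_expirydate ON articles (expirydate);  -- filtering criterion
CREATE INDEX ix_articles_dateadded  ON articles (dateadded);   -- filtering criterion
CREATE INDEX ix_articles_status     ON articles (status);      -- filtering criterion
CREATE INDEX ix_articles_site_id    ON articles (site_id);     -- relational criterion to sites
-- Assumption: articles.article_id is the primary key, so the relational
-- criteria on article_id need no extra index here.
```

The optimizer will typically pick only some of these for any one query, which is where the wasted space comes from.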

The next best way is to run the query with these options turned on:

SET STATISTICS IO ON
SET STATISTICS TIME ON

If a particular set of tables is using more IO, indexing efforts can be focused on them. This relies on the optimizer's plan already accessing the tables in a reasonably good order.
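A minimal sketch of such a measurement session in SQL Server follows. The query is a cut-down version of the one in the question, used only to keep the example short; in practice you would run the full query.

```sql
-- Turn on per-table IO and timing output for this session (SQL Server).
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Simplified stand-in for the real query.
SELECT     articles.article_id
FROM       articles
INNER JOIN articles_to_geo
  ON       articles_to_geo.article_id = articles.article_id
WHERE      articles.status >= 6;

-- Turn the options off again so later statements are not affected.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The messages output then lists scan counts and logical reads for each table touched; the tables with the most logical reads are the first candidates for indexing.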

If the optimizer can't make a good plan at all because of the lack of indexes, what I do is figure out the order in which I'd like the tables to be accessed, then add indexes that support those accesses.

Note: the first table accessed does not have the option of using relational criteria, as no rows have been read yet. The first table must be accessed by Filtering Criteria or by Read the Whole Table.

One possible order is the order in the query. This approach might be pretty bad because our Articles Filtering Criteria is based on 3 different ranges. There could be thousands of articles that meet that criteria and it's hard to formulate an index to support those ranges.

Articles (Filter)
  Articles_to_Geo (Relational by Article_Id)
    Cities_WhiteList (Relational by City_Id) (Filter)
    Cities (Relational by City_Id) (Filter)
  Articles_to_Badges (Relational by Article_Id) (Filter)
    Badges (Relational by Badge_Id)
  Sites (Relational by Site_Id)

Another possible order is Cities first. The Criteria for Cities is easily indexable and there might only be 1 row! Finding the articles for a City and then filtering by date should read fewer rows than finding the articles for dates and then filtering down to the City.

Cities (Filter)
  Cities_WhiteList (Relational by City_Id) (Filter)
  Articles_to_Geo (Relational by City_Id)
    Articles (Relational by Article_Id) (Filter)
      Articles_to_Badges (Relational by Article_Id) (Filter)
        Badges (Relational by Badge_Id)
      Sites (Relational by Site_Id)
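To make the Cities-first order concrete, here is one possible set of supporting indexes, as a sketch only. It assumes that cities.city_id, articles.article_id, badges.id and sites.id are primary keys (already indexed), and the index names are hypothetical.

```sql
-- Cities_WhiteList: reached by city_id, then filtered on published.
CREATE INDEX ix_cities_whitelist_city_published
    ON cities_whitelist (city_id, published);

-- Articles_to_Geo: reached by whitelist_city_id; article_id is included
-- so the next step (Articles) can be found from the index alone.
CREATE INDEX ix_articles_to_geo_city_article
    ON articles_to_geo (whitelist_city_id, article_id);

-- Articles_to_Badges: reached by article_id, then filtered on badge_id.
CREATE INDEX ix_articles_to_badges_article_badge
    ON articles_to_badges (article_id, badge_id);
```

Each index leads with the column used to reach the table from the rows already read, followed by the column used to filter or to reach the next table.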

A third approach could be Badges first. This would be best if articles rarely accumulate Badges and there aren't many Badges.

Badges (Read the Whole Table)
  Articles_to_Badges (Relational by Badge_Id) (Filter)
    Articles (Relational by Article_Id) (Filter)
      Articles_to_Geo (Relational by Article_Id)
        Cities_WhiteList (Relational by City_Id) (Filter)
        Cities (Relational by City_Id) (Filter)
      Sites (Relational by Site_Id)

I would suggest reading this: http://hackmysql.com/case4

It's a good explanation of when and what to index.

For a start I would create indexes for these:

articles_to_geo.article_id
cities_whitelist.city_id
cities.city_id
articles_to_badges.article_id
articles_to_badges.badge_id
badges.id
sites.id

Without the above, your joins and the IN() are going to take forever.
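Written out as statements, that list might look like the sketch below. The index names are invented for illustration, and it is an assumption (not stated in the question) that cities.city_id, badges.id and sites.id are primary keys and therefore already indexed.

```sql
-- Hypothetical index names; adjust to your own naming convention.
CREATE INDEX ix_a2g_article_id ON articles_to_geo (article_id);
CREATE INDEX ix_cw_city_id     ON cities_whitelist (city_id);
CREATE INDEX ix_a2b_article_id ON articles_to_badges (article_id);
CREATE INDEX ix_a2b_badge_id   ON articles_to_badges (badge_id);
-- Assumption: cities.city_id, badges.id and sites.id are primary keys,
-- so no extra index is created for them here.
```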

Edit: I removed the article_id field from the articles index.

In the old days, RDBMSs were not able to combine B-tree indexes on one table (see this article: http://use-the-index-luke.com/sql/where-clause/searching-for-ranges/index-merge-performance). This means, for example, that if you had separate indexes for each of the articles columns used in this query, only one of those indexes would be used.

Based solely on this query, you should have the following indexes:

articles

site_id
expirydate
status
dateadded

articles_to_geo

article_id

cities_whitelist

city_id

cities, sites

sites is joined on sites.id = articles.site_id. Here I suppose id is the primary key on sites, hence no need for an additional index. Likewise, no additional indexes are needed on cities.city_id and cities.timezone, as they will be part of a filter predicate anyway.

articles_to_badges

article_id
badge_id (or this could be a second index of type Bitmap, refer to the article above)

badges is also joined on its primary key, so there is no need for an additional index if you have a unique index on the id field.

Notes on the articles index: the order of the fields in the index has nothing to do with the order in which the fields appear in the WHERE clause. If you keep this order, the index can be used by all queries that specify

site_id
site_id and expirydate
etc.

But it cannot be used by queries that specify only

expirydate
expirydate and status
etc.
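A short sketch of that leading-column rule, using the articles index suggested above. The index name and the literal site_id value are made up for illustration.

```sql
-- Composite index in the suggested column order (hypothetical name).
CREATE INDEX ix_articles_site_expiry_status_added
    ON articles (site_id, expirydate, status, dateadded);

-- These predicates start with the leading column site_id,
-- so the index can be used to locate the matching rows directly.
SELECT article_id FROM articles WHERE site_id = 3;
SELECT article_id FROM articles WHERE site_id = 3 AND expirydate > '2010-07-12';

-- This predicate skips the leading column, so the index cannot be
-- used to jump straight to the matching rows; at best it is scanned in full.
SELECT article_id FROM articles WHERE expirydate > '2010-07-12';
```

Comparing the execution plans of the last two statements is a quick way to see the difference on your own data.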
