Indexing & alternatives for low-selectivity columns

2023-01-24 23:30 问答作者：

What are the range of tactics available for selecting records on low selectivity columns?

An example mig开发者_如何学JAVAht be an orders table where, over many years, you build up a large number of completed orders but often need to select active orders. An order might go through a lifecycle such as placed, stock-allocated, picked from warehouse, despatched to customer, invoiced and paid. An order might additionally be cancelled, held, etc. The majority of records will eventually be in the final state (e.g. paid) but you might often need to select, say, allocated orders. In this case a sequential read would be slow.

Similar questions on indexing

MySQL: low cardinality/selectivity columns = how to index?

Do indexes suck in SQL?

What are indexes and how can I use them to optimize queries in my database?

Defining indexes: Which Columns, and Performance Impact?

and numerous others decreasingly related.

The approaches I have read about (in stackoverflow and elsewhere) include

Use a bitmap index
Use a partial index (create index x on t(c2) where c1='a')
Use a clustered index?
Don't index low selectivity columns, use sequential read
Partition the data (e.g. into several tables with identical schema)
Use a supplementary table (e.g. active_customers(customer_id)

My current DBMS doesn't support the first three options listed above and the remainder seem problematic - are there any other commonly used approaches?

Update: I've seen - index your low-selectivity column, but only ever select for high-selectivity values.

I agree with Unreason's However branch. But there are some things to know about this case.

This is called skew and skew kills. This is a perfect use for a partial index where you'd exclude the 95% of paid invoices and only index the more interesting and selective stats. But you don't have that. You can horizontally partition all the rows into separate table/partitions but then you need to account for row migration (moving from one status to another) and that's expensive. The DBMS has to perform an Update, a Delete and an insert to change the status. If you're a high volume system that will hurt.

Forget what you said about whether or not to index based on selectivity because putting an index on a rapidly changing column is also usually a bad idea. Your index will have hot blocks where all the step 1's are being removed and another where all the step 2's are being inserted and oh btw, some step 2's are being removed at the same time into step 3's. This won't scale well.

I would recommend vertically partitioning your status into a separate table(s).

Your invoice table will have a PK and all the columns except status.

Your status you can handle two ways. That table will have the PK value as an FK back to the invoice table, the Status and a timestamp for when you entered that status. The best is a horizontally partitioned table on status. You'll have a partition for each status possible. So finding all or one "Placed" status will partition prune and read only the partition it needs - which is a very small number of blocks. Because the row is so narrow, you might get 400 invoice statuses on a single block. Looking up that status of any one invoice is easy since there's a global index on the PK.

If your RDBMS doesn't support partitioning with row migration, you'll need to manage these partitions as tables and delete from one and insert into another. You'll encapsulate these movements in a transaction in a procedure, so you keep the data clean. Every invoice is in one and only one status table. The harder part is querying by invoice ID, you'll have to check every table to see where it is.

You have another choice You can either write paid statuses or not. If it's a partitioned table, you can just delete the invoice from the invoice status table when it moves to paid. (Of course you'll write a paid record to the history table mentioned in the bonus material). Then you'll do an outer join to the status table and nulls mean paid. If you almost never query for paid status, there's really no reason to make that a fast query.

Bonus Material

in either case you'll want to keep track of these movements in a reporting table. Everytime you update a status, you'll want to write that to a history table. Eventually you'll want to analyze what I call transit times. What's the average time from filled to paid, by month? Is that increasing as a result of the bad economy? what's the transit time from placed to filled, by month. Do the summer months take longer because of missing bodies on vacation? you get the point. By updating that column you're losing those answers, so you'll need to embed that history log into your procedures.

Out of all the approaches you have listed only one (use sequential read) is approach that has anything to do with low selectivity (well, clustered can qualify, too).

If you have low selectivity on a column this means that scans will perform better than lookup.

Index can be used to do

index lookups - check the index pointer, retrieve record, repeat
index scans - scan the index and get values directly from index

otherwise it is not very useful.

If the selectivity is low that means that a large part of the index would be read and, if using lookups, large part of the data would be then read, in some random order. This is inefficient if you cover a significant percentage of the underlying table, so the better method would be to do sequential read (which is also slow).

So if selectivity is low, there's nothing much you can do (clustering can help).

However, I am not convinced that you understand that in your example you do not have low selectivity. As you say most entries will be paid, and very little entries will be allocated. These (allocated) entries will have high selectivity. Especially if there are additional conditions and if there is a composite index containing those additional conditions.

So, you might be banging your head against a non problem.

Now, it is true that you might improve performance further by partitioning data or using supplementary table (if you need to).

Partitioning is an approach that stores the same table in separate areas based on data - SQL developers do not have to access separate tables.

I think it is ideal for the problem described - you can find more about it on Informix here: http://www.dbmag.intelligententerprise.com/blog/main/archives/2008/09/data_partitioni.html

If you can relax database normalization, and the number of possible states is low (eg: <5), you can add one nullable column per state and place indexes on those columns. Many engines (like MongoDB) will skip the rows with null values and index only the rows with actual data (sparse indexes). For instance:

Invoice#     Date     State       IsPlaced   IsPaid   IsFulfilled
    1       Apr-20    Fulfilled     (null)    (null)      yes
    2       Apr-20    Fulfilled     (null)    (null)      yes
    3       Apr-20    Fulfilled     (null)    (null)      yes
    4       Apr-21    Fulfilled     (null)    (null)      yes
    5       Apr-21    Fulfilled     (null)    (null)      yes
    6       Apr-21    Paid          (null)     yes       (null)
    7       Apr-21    Placed         yes      (null)     (null)
    8       Apr-22    Placed         yes      (null)     (null)
    9       Apr-22    Paid          (null)     yes       (null)
   10       Apr-22    Placed         yes      (null)     (null)

You could have this information in a separate table, and possibly driven by triggers, or at least checked with constraints.

This is not a universal solution, and in fact has poor scalability, but will allow you to use partitioning on columns that make better sense, like invoice date.

This kind of trick is often used in data warehouse designs, where efficiency in processing large datasets is more important than data normalization.

继续阅读：database-design indexing relational-database sql

Indexing & alternatives for low-selectivity columns

Bonus Material

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Bonus Material

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？