Indexing & alternatives for low-selectivity columns
What is the range of tactics available for selecting records on low-selectivity columns?
An example might be an orders table where, over many years, you build up a large number of completed orders but often need to select active orders. An order might go through a lifecycle such as placed, stock-allocated, picked from warehouse, despatched to customer, invoiced and paid. An order might additionally be cancelled, held, etc. The majority of records will eventually be in the final state (e.g. paid) but you might often need to select, say, allocated orders. In this case a sequential read would be slow.
Similar questions on indexing
MySQL: low cardinality/selectivity columns = how to index?; Do indexes suck in SQL?; What are indexes and how can I use them to optimize queries in my database?; Defining indexes: Which Columns, and Performance Impact?; and numerous others, decreasingly related. The approaches I have read about (on Stack Overflow and elsewhere) include
- Use a bitmap index
- Use a partial index (create index x on t(c2) where c1='a')
- Use a clustered index?
- Don't index low selectivity columns, use sequential read
- Partition the data (e.g. into several tables with identical schema)
- Use a supplementary table (e.g. active_customers(customer_id))
My current DBMS doesn't support the first three options listed above and the remainder seem problematic - are there any other commonly used approaches?
Update: one more approach I've seen - index your low-selectivity column, but only ever select on the high-selectivity values.
I agree with Unreason's "However" branch, but there are some things to know about this case.
This is called skew, and skew kills. This is a perfect use for a partial index, where you'd exclude the 95% of paid invoices and only index the more interesting and selective statuses. But you don't have that. You can horizontally partition all the rows into separate tables/partitions, but then you need to account for row migration (moving from one status to another), and that's expensive: to change the status, the DBMS has to turn your update into a delete and an insert. If you're a high-volume system, that will hurt.
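For reference, a partial index in a DBMS that supports one might look like the following sketch; the syntax is PostgreSQL-style and the invoice/status names are assumptions, not from the question.

```sql
-- Hedged sketch: a partial index that skips the ~95% of paid rows and
-- indexes only the interesting, selective statuses.
CREATE INDEX idx_invoice_open_status
    ON invoice (status)
    WHERE status <> 'paid';
```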
Forget what you said about whether or not to index based on selectivity, because putting an index on a rapidly changing column is also usually a bad idea. Your index will have hot blocks where all the step 1's are being removed and others where all the step 2's are being inserted, while at the same time some step 2's are being removed as they move to step 3. This won't scale well.
I would recommend vertically partitioning your status into a separate table (or tables).
Your invoice table will have a PK and all the columns except status.
The status you can handle in two ways. That table will have the PK value as an FK back to the invoice table, the status, and a timestamp for when the invoice entered that status. The best option is a table horizontally partitioned on status: you'll have a partition for each possible status, so finding all (or one) "Placed" rows will partition-prune and read only the partition it needs, which is a very small number of blocks. Because the row is so narrow, you might get 400 invoice statuses on a single block. Looking up the status of any one invoice is easy, since there's a global index on the PK.
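A rough sketch of what that could look like, assuming PostgreSQL-style list partitioning; the names, types and the set of statuses are illustrative, and this flavour requires the partition key in the primary key:

```sql
-- Vertically split status table, list-partitioned by status (a sketch).
CREATE TABLE invoice_status (
    invoice_id  BIGINT      NOT NULL,   -- FK back to the invoice table
    status      VARCHAR(20) NOT NULL,
    entered_at  TIMESTAMP   NOT NULL,   -- when the invoice entered this status
    PRIMARY KEY (invoice_id, status)    -- partition key must be part of the key here
) PARTITION BY LIST (status);

CREATE TABLE invoice_status_placed    PARTITION OF invoice_status FOR VALUES IN ('placed');
CREATE TABLE invoice_status_allocated PARTITION OF invoice_status FOR VALUES IN ('allocated');
CREATE TABLE invoice_status_paid      PARTITION OF invoice_status FOR VALUES IN ('paid');

-- Prunes to the small 'allocated' partition only:
SELECT invoice_id FROM invoice_status WHERE status = 'allocated';
```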
If your RDBMS doesn't support partitioning with row migration, you'll need to manage these partitions as separate tables and delete from one and insert into another. You'll encapsulate these movements in a transaction inside a procedure so the data stays clean: every invoice is in one and only one status table. The harder part is querying by invoice ID; you'll have to check every table to see where it is.
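A minimal sketch of such a procedure, assuming one table per status and PostgreSQL-style PL/pgSQL; the table and function names are made up for illustration:

```sql
-- Move an invoice from 'allocated' to 'picked'. The function body runs as
-- a single transaction, so the row never exists in zero or two tables.
CREATE OR REPLACE FUNCTION move_to_picked(p_invoice_id BIGINT) RETURNS void AS $$
BEGIN
    DELETE FROM invoice_status_allocated WHERE invoice_id = p_invoice_id;
    INSERT INTO invoice_status_picked (invoice_id, status, entered_at)
    VALUES (p_invoice_id, 'picked', now());
END;
$$ LANGUAGE plpgsql;
```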
You have another choice: you can either write paid statuses or not. If it's a partitioned table, you can just delete the invoice from the invoice status table when it moves to paid. (Of course, you'll write a paid record to the history table mentioned in the bonus material.) Then you'll do an outer join to the status table, and nulls mean paid. If you almost never query for paid status, there's really no reason to make that a fast query.
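That outer join might look like this sketch (again, names are illustrative):

```sql
-- Invoices with no status row are the paid ones.
SELECT i.invoice_id,
       COALESCE(s.status, 'paid') AS status
FROM   invoice i
LEFT   JOIN invoice_status s ON s.invoice_id = i.invoice_id;
```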
Bonus Material
In either case you'll want to keep track of these movements in a reporting table. Every time you update a status, you'll want to write that to a history table. Eventually you'll want to analyze what I call transit times. What's the average time from filled to paid, by month? Is that increasing as a result of the bad economy? What's the transit time from placed to filled, by month? Do the summer months take longer because of missing bodies on vacation? You get the point. By updating that column in place you're losing those answers, so you'll need to embed that history log into your procedures.
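A hedged sketch of such a history table and one transit-time query (placed to filled, averaged by month); the interval arithmetic and date_trunc assume a PostgreSQL-style engine, and the names are illustrative:

```sql
CREATE TABLE invoice_status_history (
    invoice_id  BIGINT      NOT NULL,
    status      VARCHAR(20) NOT NULL,
    entered_at  TIMESTAMP   NOT NULL
);

-- Average placed -> filled transit time per month.
SELECT date_trunc('month', f.entered_at) AS month,
       AVG(f.entered_at - p.entered_at) AS avg_placed_to_filled
FROM   invoice_status_history p
JOIN   invoice_status_history f
  ON   f.invoice_id = p.invoice_id
WHERE  p.status = 'placed'
  AND  f.status = 'filled'
GROUP  BY date_trunc('month', f.entered_at)
ORDER  BY month;
```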
Out of all the approaches you have listed, only one (use sequential read) is an approach that has anything to do with low selectivity (well, clustered can qualify, too).
If you have low selectivity on a column, it means that scans will perform better than lookups.
An index can be used for
- index lookups - check the index pointer, retrieve the record, repeat
- index scans - scan the index and get the values directly from the index
Otherwise it is not very useful.
If the selectivity is low, that means a large part of the index would be read and, if using lookups, a large part of the data would then be read, in some random order. This is inefficient when you cover a significant percentage of the underlying table, so the better method would be a sequential read (which is also slow).
So if selectivity is low, there's nothing much you can do (clustering can help).
However, I am not convinced that you realise that in your example you do not have low selectivity. As you say, most entries will be paid and very few entries will be allocated, so these (allocated) entries will have high selectivity - especially if there are additional conditions and there is a composite index containing those additional conditions.
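For instance, a sketch of that kind of composite index against the orders table from the question; the extra column is just an assumption for illustration:

```sql
-- Ordinary composite index: for the rare value 'allocated' it is highly
-- selective, even though the status column as a whole has few distinct values.
CREATE INDEX idx_orders_status_customer ON orders (status, customer_id);

SELECT *
FROM   orders
WHERE  status = 'allocated'
AND    customer_id = 42;
```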
So, you might be banging your head against a non problem.
Now, it is true that you might improve performance further by partitioning the data or by using a supplementary table (if you need to).
Partitioning is an approach that stores a single table in separate physical areas based on the data - SQL developers do not have to access separate tables.
I think it is ideal for the problem described - you can find more about it on Informix here: http://www.dbmag.intelligententerprise.com/blog/main/archives/2008/09/data_partitioni.html
If you can relax database normalization, and the number of possible states is low (e.g. < 5), you can add one nullable column per state and place indexes on those columns. Many engines (MongoDB, for example) will skip the rows with null values and index only the rows with actual data (sparse indexes). For instance (a small DDL sketch follows the example table below):
Invoice#  Date    State      IsPlaced  IsPaid  IsFulfilled
1         Apr-20  Fulfilled  (null)    (null)  yes
2         Apr-20  Fulfilled  (null)    (null)  yes
3         Apr-20  Fulfilled  (null)    (null)  yes
4         Apr-21  Fulfilled  (null)    (null)  yes
5         Apr-21  Fulfilled  (null)    (null)  yes
6         Apr-21  Paid       (null)    yes     (null)
7         Apr-21  Placed     yes       (null)  (null)
8         Apr-22  Placed     yes       (null)  (null)
9         Apr-22  Paid       (null)    yes     (null)
10        Apr-22  Placed     yes       (null)  (null)
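A minimal DDL sketch of this layout; the names and types are illustrative, and in engines whose B-tree indexes skip entirely-NULL keys (Oracle, for example) a plain index on each flag effectively indexes only the rows in that state:

```sql
CREATE TABLE invoice (
    invoice_no   INT PRIMARY KEY,
    invoice_date DATE,
    state        VARCHAR(20),
    is_placed    VARCHAR(3),   -- 'yes' or NULL
    is_paid      VARCHAR(3),   -- 'yes' or NULL
    is_fulfilled VARCHAR(3)    -- 'yes' or NULL
);

CREATE INDEX ix_invoice_is_placed ON invoice (is_placed);

-- Finds the (few) placed invoices via a small index:
SELECT invoice_no FROM invoice WHERE is_placed = 'yes';
```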
You could have this information in a separate table, and possibly driven by triggers, or at least checked with constraints.
This is not a universal solution, and in fact has poor scalability, but will allow you to use partitioning on columns that make better sense, like invoice date.
This kind of trick is often used in data warehouse designs, where efficiency in processing large datasets is more important than data normalization.