How Database stores data internally in B-Tree/B+Tree
My question is that How database stores data and how it performs query internally.
Suppose we have following fields in o开发者_运维知识库ur table:
- ID
- Name
- Age
- Weight
- Manager
and we query select * from Table1 where age>50 and weight<100
I am just curious that how it perform query internally.
What will the Node of B-Tre/B+Tree contains in this example?
The example you have chosen is one of the few cases where a single Tree can't do the job (two independent ranges).
However, the first chapter of my work-in-progress e-Book explains the inner workings of B-Tree indexes: http://use-the-index-luke.com/anatomy/
EDIT for more details why two indexes might be useful for the above example.
The above query can be supported by three possible index configurations:
concatenated index on
AGE
and thenWEIGHT
(in this order).
In case, the query would read all recordsWHERE AGE > 50
and then filter byWEIGHT
.concatenated index on
WEIGHT
and thenAGE
(the other order).
That goes the different way: read all recordsWHERE WEIGHT < 100
and then filter byAGE
.
Whatever is more efficient depends on the data you have. If there are less records AGE > 50
than WEIGHT < 100
the first will be more efficient, otherwise the second. However, if you query with different values, the picture might change.
The reason that a concatenated index can't support the query well is that each index order is on one axis only. each index entry is before or after another one, but never next to it. All index entries build one chain.
A query that has two independent range queries would require two axes, not like a chain, but more like a chess board. one axis for AGE
the other for WEIGHT
. If that would be possible, your query would need to scan only one corner of the chess board.
However, a b-tree has only one axis, hence you must chose which criteria to use first. If you chose AGE
it means that starting with AGE 50
, the entire chain will be scanned until the end. Only some of the records stored at the side of the chain will also qualify for WEIGHT < 100
, the other records must be read but will be discarded.
So, a long story to explain why one tree can not support a query with two independent range clauses. On the other hand, one concatenated index can do the following quite well:
WHERE age = 50 AND weight < 100
WHERE weight = 100 AND age > 50
WHERE age > 50 AND age < 70;
However, the problem arises if there are two inequality operators are used on two different columns.
So, what to do?
The third possible approach is to have two independent indexes on the two columns. That allows to have as many axes as you like (just create more indexes). However, there are a few huge problems with that. First of all, not all DB products support that. Whenever it is supported, it is a rather expansive operation. It works typically that way that each index is scanned, a bitmap index is built for each result. Those bitmap indexes are then joined to apply the AND
operator. That's a lot of data munging--it is only worth the effort if each condition is not very selective for it's own, but both together are very selective.
Wan't my advice?
If your query runs in an OLTP environment: use one concatenated index. Two independent indexes are an option of last resort only. However, if you are working in an OLAP environment, you might anyways need bitmap indexes.
ps.:
Indexing AGE
was an exercise in my book (with solution)--especially because storing AGE
is a bad practice, you should store the date of birth instead.
精彩评论