Storing and retrieving historical data using SQL / relational database
Given this table:
CREATE TABLE DeptPeopleHistory (
DEPT_ID INTEGER,
PERSON_ID INTEGER,
START_DATE INTEGER,
END_DATE INTEGER,
UNIQUE(DEPT_ID, START_DATE, PERSON_ID), -- works as sorted index.
UNIQUE(PERSON_ID, START_DATE),
UNIQUE(PERSON_ID, END_DATE),
CONSTRAINT (START_DATE < END_DATE)
);
I have two needs. The first is to get all people that works on a given department at a given date. Currently I use this (semantically correct) query:
SELECT PERSON_ID FROM DeptPeopleHistory
WHERE
DEPT_IT = :given_dept AND
START_开发者_Python百科DATE <= :given_date AND :given_date < END_DATE
This is fast for small history table or querying recent data, but is slow for big history tables and old data, because the optimizer uses only the first index and there's no good way to deal with END_DATE. I've tried to add END_DATE to the first index, but query performance is the same. I guess it's because the sub-filter (DEPT_IT=:given_dept AND START_DATE <= :given_date) when applied to a sorted index (DEPT_ID, START_DATE, END_DATE, PERSON_ID) results in data with unsorted END_DATE, so (:given_date < END_DATE) still requires a sequential scan on the result.
My other need is to enforce the following constraint: a person cannot work at two departments at same time, nor twice at the same department. This means the following:
-- This must work for previously empty data:
INSERT INTO DeptPeopleHistory(DEPT_ID, PERSON_ID, START_DATE, END_DATE)
VALUES (1, 1, 20100501, 20100520);
-- This should cause constraint violation because the person already
-- works at dept 1 on days from 20100517 to 20100519:
INSERT INTO DeptPeopleHistory(DEPT_ID, PERSON_ID, START_DATE, END_DATE)
VALUES (:any_dept, 1, 20100517, 20100523);
Another way to specify this constraint, is that for a given PERSON_ID, START_DATE must be the minimum or equals to END_DATE from another record.
Looking at those two needs, we actually need an efficient way for dealing with non-intersected ranges. Do you know some feature or construct in generic SQL or some specific database than can deal with these needs? Perhaps some "spatial database" feature?
The examples are in MySQL, but I need solutions that work on Oracle, SQL Server and FireBird. The solutions don't need to be portable across all such databases.
As a starting point, I recommend the book Developing Time-Oriented Database Applications in SQL by Rick Snodgrass, available as a free PDF download. Looks like you can jump right in a chapter 5 and read through chapters 6 and 7 (but don't dismiss the alternative approaches in later chapters).
As regards implementation, postgreSQL currently has good temporal support generally and support for deferrable constraints (which is vital -- in SQL! -- for concepts such as sequenced keys).
Note there are other models for temporal databases e.g. Date Darwen Lorentzos.
Have you tried adding another index on DEPT_ID and END_DATE? If you are using MySQL 5+, it may be able to do an index merge and use both that index and the DEPT_ID, START_DATE, PERSON_ID index.
As for your second question, I think the only way to enforce that type of constraint would be either via application logic or an insert/update trigger.
Would it be possible to change the structure of table DeptPeopleHistory
to?:
CREATE TABLE DeptPeopleHistoryDetail (
DEPT_ID INTEGER,
PERSON_ID INTEGER,
WORK_DATE INTEGER, --- why is that INT and not DATE by the way?
UNIQUE(WORK_DATE, PERSON_ID)
);
Pros:
- You don't need to enforce any of the previous
UNIQUE
constraints, nor theSTART_DATE < END_DATE
one. - The second complex constraint(s) are magically solved too.
Cons:
- The
(1, 1, 20100501, 20100520)
from the previous example is now split into 20 rows. Not a real problem, I'd say. Relational databases are designed to handle many rows. - To find
START_DATE
orEND_DATE
for a person in a department, a query has to be run. (if that is too slow, which I doubt, an additional table can be used)
Oh, and your slow query would be written as:
SELECT PERSON_ID FROM DeptPeopleHistoryDetail
WHERE
DEPT_IT = :given_dept AND
WORK_DATE = :given_date
With your current DeptPeopleHistory
design, can you try the performance of the following query?
SELECT H.PERSON_ID
FROM DeptPeopleHistory H
JOIN
( SELECT PERSON_ID
, MAX(START_DATE) AS LATEST_START_DATE
FROM DeptPeopleHistory
WHERE
DEPT_IT = :given_dept AND
START_DATE <= :given_date
GROUP BY
PERSON_ID
) AS grp
ON H.DEPT_IT = :given_dept
AND grp.PERSON_ID = H.PERSON_ID
AND grp.LATEST_START_DATE = H.START_DATE
WHERE
:given_date < H.END_DATE
精彩评论