开发者

Storing and retrieving historical data using SQL / relational database

Given this table:

CREATE TABLE DeptPeopleHistory (
  DEPT_ID INTEGER,
  PERSON_ID INTEGER,
  START_DATE INTEGER,
  END_DATE INTEGER,
  UNIQUE(DEPT_ID, START_DATE, PERSON_ID), -- works as sorted index.
  UNIQUE(PERSON_ID, START_DATE),
  UNIQUE(PERSON_ID, END_DATE),
  CONSTRAINT (START_DATE < END_DATE)
);

I have two needs. The first is to get all people that works on a given department at a given date. Currently I use this (semantically correct) query:

SELECT PERSON_ID FROM DeptPeopleHistory
WHERE
  DEPT_IT = :given_dept AND
  START_开发者_Python百科DATE <= :given_date AND :given_date < END_DATE

This is fast for small history table or querying recent data, but is slow for big history tables and old data, because the optimizer uses only the first index and there's no good way to deal with END_DATE. I've tried to add END_DATE to the first index, but query performance is the same. I guess it's because the sub-filter (DEPT_IT=:given_dept AND START_DATE <= :given_date) when applied to a sorted index (DEPT_ID, START_DATE, END_DATE, PERSON_ID) results in data with unsorted END_DATE, so (:given_date < END_DATE) still requires a sequential scan on the result.

My other need is to enforce the following constraint: a person cannot work at two departments at same time, nor twice at the same department. This means the following:

-- This must work for previously empty data:
INSERT INTO DeptPeopleHistory(DEPT_ID, PERSON_ID, START_DATE, END_DATE)
                      VALUES (1,       1,         20100501,   20100520);

-- This should cause constraint violation because the person already
-- works at dept 1 on days from 20100517 to 20100519:
INSERT INTO DeptPeopleHistory(DEPT_ID,   PERSON_ID, START_DATE, END_DATE)
                      VALUES (:any_dept, 1,         20100517,   20100523);

Another way to specify this constraint, is that for a given PERSON_ID, START_DATE must be the minimum or equals to END_DATE from another record.

Looking at those two needs, we actually need an efficient way for dealing with non-intersected ranges. Do you know some feature or construct in generic SQL or some specific database than can deal with these needs? Perhaps some "spatial database" feature?

The examples are in MySQL, but I need solutions that work on Oracle, SQL Server and FireBird. The solutions don't need to be portable across all such databases.


As a starting point, I recommend the book Developing Time-Oriented Database Applications in SQL by Rick Snodgrass, available as a free PDF download. Looks like you can jump right in a chapter 5 and read through chapters 6 and 7 (but don't dismiss the alternative approaches in later chapters).

As regards implementation, postgreSQL currently has good temporal support generally and support for deferrable constraints (which is vital -- in SQL! -- for concepts such as sequenced keys).

Note there are other models for temporal databases e.g. Date Darwen Lorentzos.


Have you tried adding another index on DEPT_ID and END_DATE? If you are using MySQL 5+, it may be able to do an index merge and use both that index and the DEPT_ID, START_DATE, PERSON_ID index.

As for your second question, I think the only way to enforce that type of constraint would be either via application logic or an insert/update trigger.


Would it be possible to change the structure of table DeptPeopleHistory to?:

CREATE TABLE DeptPeopleHistoryDetail (
  DEPT_ID INTEGER,
  PERSON_ID INTEGER,
  WORK_DATE INTEGER,               --- why is that INT and not DATE by the way?
  UNIQUE(WORK_DATE, PERSON_ID)
);

Pros:

  • You don't need to enforce any of the previous UNIQUE constraints, nor the START_DATE < END_DATE one.
  • The second complex constraint(s) are magically solved too.

Cons:

  • The (1, 1, 20100501, 20100520) from the previous example is now split into 20 rows. Not a real problem, I'd say. Relational databases are designed to handle many rows.
  • To find START_DATE or END_DATE for a person in a department, a query has to be run. (if that is too slow, which I doubt, an additional table can be used)

Oh, and your slow query would be written as:

SELECT PERSON_ID FROM DeptPeopleHistoryDetail
WHERE
  DEPT_IT = :given_dept AND
  WORK_DATE = :given_date 

With your current DeptPeopleHistory design, can you try the performance of the following query?

SELECT H.PERSON_ID
FROM DeptPeopleHistory H
  JOIN
    ( SELECT PERSON_ID
           , MAX(START_DATE) AS LATEST_START_DATE
      FROM DeptPeopleHistory
      WHERE
        DEPT_IT = :given_dept AND
        START_DATE <= :given_date
      GROUP BY
        PERSON_ID
    ) AS grp
    ON  H.DEPT_IT = :given_dept
    AND grp.PERSON_ID = H.PERSON_ID
    AND grp.LATEST_START_DATE = H.START_DATE
WHERE 
   :given_date < H.END_DATE
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜