Storing and retrieving historical data using SQL / relational database

2023-03-14 22:14 Q&A author:

Given this table:

CREATE TABLE DeptPeopleHistory (
  DEPT_ID INTEGER,
  PERSON_ID INTEGER,
  START_DATE INTEGER,
  END_DATE INTEGER,
  UNIQUE(DEPT_ID, START_DATE, PERSON_ID), -- works as sorted index.
  UNIQUE(PERSON_ID, START_DATE),
  UNIQUE(PERSON_ID, END_DATE),
  CHECK (START_DATE < END_DATE)
);

I have two needs. The first is to get all people who work in a given department on a given date. Currently I use this (semantically correct) query:

SELECT PERSON_ID FROM DeptPeopleHistory
WHERE
  DEPT_ID = :given_dept AND
  START_DATE <= :given_date AND :given_date < END_DATE

This is fast for a small history table or when querying recent data, but slow for big history tables and old data, because the optimizer uses only the first index and there is no good way to deal with END_DATE. I tried adding END_DATE to the first index, but query performance stayed the same. I guess that is because the sub-filter (DEPT_ID = :given_dept AND START_DATE <= :given_date), when applied to a sorted index (DEPT_ID, START_DATE, END_DATE, PERSON_ID), yields rows whose END_DATE values are unsorted, so (:given_date < END_DATE) still requires a sequential scan of that result.

My other need is to enforce the following constraint: a person cannot work at two departments at the same time, nor twice at the same department. This means the following:

-- This must work for previously empty data:
INSERT INTO DeptPeopleHistory(DEPT_ID, PERSON_ID, START_DATE, END_DATE)
                      VALUES (1,       1,         20100501,   20100520);

-- This should cause constraint violation because the person already
-- works at dept 1 on days from 20100517 to 20100519:
INSERT INTO DeptPeopleHistory(DEPT_ID,   PERSON_ID, START_DATE, END_DATE)
                      VALUES (:any_dept, 1,         20100517,   20100523);

Another way to specify this constraint is that, for a given PERSON_ID, each START_DATE must either be the minimum START_DATE for that person or equal the END_DATE of another record.

Looking at those two needs, what we actually need is an efficient way of dealing with non-overlapping ranges. Do you know of a feature or construct in generic SQL, or in some specific database, that can deal with these needs? Perhaps some "spatial database" feature?

The examples are in MySQL, but I need solutions that work on Oracle, SQL Server and Firebird. The solutions don't need to be portable across all of these databases.

As a starting point, I recommend the book Developing Time-Oriented Database Applications in SQL by Rick Snodgrass, available as a free PDF download. It looks like you can jump right in at chapter 5 and read through chapters 6 and 7 (but don't dismiss the alternative approaches in later chapters).

As regards implementation, PostgreSQL currently has good temporal support in general, and it supports deferrable constraints (which are vital, in SQL, for concepts such as sequenced keys).
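
As a rough illustration of what that PostgreSQL support can look like (this is not something the answer itself spells out), the "no overlapping periods per person" rule can be sketched as an exclusion constraint over an integer range. The sketch assumes PostgreSQL 9.2 or later, the btree_gist extension, and the integer-encoded dates from the question; the constraint name is made up for the example.

```sql
-- Assumption: PostgreSQL 9.2+. btree_gist lets a GiST index compare
-- plain integers with "=", which the PERSON_ID part below needs.
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Hypothetical constraint name. int4range(a, b) is the half-open range
-- [a, b), matching the question's START_DATE <= d AND d < END_DATE.
ALTER TABLE DeptPeopleHistory
  ADD CONSTRAINT DeptPeopleHistory_no_overlap
  EXCLUDE USING gist (
    PERSON_ID WITH =,
    (int4range(START_DATE, END_DATE)) WITH &&
  );
```

With such a constraint in place, the second INSERT from the question should be rejected, because [20100517, 20100523) overlaps [20100501, 20100520) for person 1.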

Note that there are other models for temporal databases, e.g. that of Date, Darwen, and Lorentzos.

Have you tried adding another index on DEPT_ID and END_DATE? If you are using MySQL 5+, it may be able to do an index merge and use both that index and the DEPT_ID, START_DATE, PERSON_ID index.
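
A minimal sketch of that suggestion in MySQL syntax follows. The index name and the literal values are placeholders chosen for the example. Whether the optimizer really merges the two indexes depends on the MySQL version and on the shape of the conditions, so the EXPLAIN output is the thing to check rather than something to assume.

```sql
-- Hypothetical index name; the columns are the ones suggested above.
CREATE INDEX DeptPeopleHistory_dept_end
  ON DeptPeopleHistory (DEPT_ID, END_DATE);

-- Literal values stand in for :given_dept and :given_date.
EXPLAIN
SELECT PERSON_ID FROM DeptPeopleHistory
WHERE
  DEPT_ID = 1 AND
  START_DATE <= 20100510 AND 20100510 < END_DATE;
```

In the EXPLAIN output, a type of index_merge would indicate that both indexes are being combined; otherwise the key column shows which single index was chosen.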

As for your second question, I think the only way to enforce that type of constraint would be either via application logic or an insert/update trigger.
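
The trigger route could look roughly like the following in MySQL. This is a sketch under assumptions: it needs MySQL 5.5 or later for SIGNAL, the trigger name and error message are invented for the example, and it covers INSERT only. Two half-open periods [s1, e1) and [s2, e2) overlap exactly when s1 < e2 and s2 < e1, which is the test used below.

```sql
DELIMITER //

-- Hypothetical trigger name; assumes MySQL 5.5+ (SIGNAL is not
-- available in earlier versions).
CREATE TRIGGER DeptPeopleHistory_no_overlap
BEFORE INSERT ON DeptPeopleHistory
FOR EACH ROW
BEGIN
  -- Reject the new row if any existing period for the same person
  -- overlaps the new [START_DATE, END_DATE) period.
  IF EXISTS (
    SELECT 1
    FROM DeptPeopleHistory h
    WHERE h.PERSON_ID = NEW.PERSON_ID
      AND h.START_DATE < NEW.END_DATE
      AND NEW.START_DATE < h.END_DATE
  ) THEN
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = 'Overlapping period for this person';
  END IF;
END//

DELIMITER ;
```

An UPDATE trigger would need the same check while excluding the row being changed, and a trigger alone does not stop two concurrent transactions from each inserting an overlapping row before either commits, so some form of locking may still be needed.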

Would it be possible to change the structure of table DeptPeopleHistory to the following?

CREATE TABLE DeptPeopleHistoryDetail (
  DEPT_ID INTEGER,
  PERSON_ID INTEGER,
  WORK_DATE INTEGER,               -- why is that INT and not DATE, by the way?
  UNIQUE(WORK_DATE, PERSON_ID)
);

Pros:

You don't need to enforce any of the previous UNIQUE constraints, nor the START_DATE < END_DATE one.
The second complex constraint(s) are magically solved too.

Cons:

The (1, 1, 20100501, 20100520) row from the previous example is now split into 19 rows, one per day from 20100501 through 20100519, since END_DATE is exclusive. Not a real problem, I'd say. Relational databases are designed to handle many rows.
To find START_DATE or END_DATE for a person in a department, a query has to be run. (If that is too slow, which I doubt, an additional table can be used.)
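
The query mentioned in the last point could be sketched as below. It assumes the person had a single uninterrupted stint in the department; with several separate stints, the rows would first have to be grouped into runs of consecutive days. The :given_person placeholder is introduced here for illustration.

```sql
-- Assumption: one contiguous stint per (person, department).
-- LAST_DAY is the last day actually worked, i.e. the day before the
-- exclusive END_DATE of the original DeptPeopleHistory design.
SELECT DEPT_ID,
       PERSON_ID,
       MIN(WORK_DATE) AS FIRST_DAY,
       MAX(WORK_DATE) AS LAST_DAY
FROM DeptPeopleHistoryDetail
WHERE
  PERSON_ID = :given_person AND
  DEPT_ID = :given_dept
GROUP BY DEPT_ID, PERSON_ID
```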

Oh, and your slow query would be written as:

SELECT PERSON_ID FROM DeptPeopleHistoryDetail
WHERE
  DEPT_ID = :given_dept AND
  WORK_DATE = :given_date

With your current DeptPeopleHistory design, can you try the performance of the following query?

SELECT H.PERSON_ID
FROM DeptPeopleHistory H
  JOIN
    ( SELECT PERSON_ID
           , MAX(START_DATE) AS LATEST_START_DATE
      FROM DeptPeopleHistory
      WHERE
        DEPT_ID = :given_dept AND
        START_DATE <= :given_date
      GROUP BY
        PERSON_ID
    ) AS grp
    ON  H.DEPT_ID = :given_dept
    AND grp.PERSON_ID = H.PERSON_ID
    AND grp.LATEST_START_DATE = H.START_DATE
WHERE
   :given_date < H.END_DATE
