What is the best way to do historic comparisons of database data?
I have a database that has a list of employees of an organization. There is one main employee table and a number of joined tables.
I want to start tracking week-over-week and month-over-month metrics on this data so I can do things like:
March 1: headcount 100 (+1, -2 in Feb)
April 1: headcount 101 (+3, -2 in Mar)
May 1: headcount 105 (+10, -6 in Apr)
I am trying to think of the best way to do this. Do I:
Take a full database snapshot at the first of each month and have my application query multiple databases to generate these reports.
Try to keep track of all changes in a trigger-maintained history table and aggregate that information to build up the state at the start of each month.
Any other suggestions?
If you simply want to track when new employees are hired or terminated, then you should start with adding the relevant fields to the employee table itself: HireDate date NOT NULL and TerminationDate date NULL.
Then it's really quite easy to determine a headcount (and details) on any particular day:
SELECT EmployeeID, EmployeeName, ...
FROM Employees
WHERE HireDate <= @EndDate
AND (TerminationDate IS NULL OR TerminationDate > @BeginDate)
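Those same two columns also give you the +N / -M numbers from the question directly. A minimal sketch, with hypothetical dates for a February report:
DECLARE @MonthStart date = '20240201'   -- hypothetical: first day of the month being reported
DECLARE @MonthEnd   date = '20240301'   -- first day of the following month

SELECT
    (SELECT COUNT(*) FROM Employees
     WHERE HireDate >= @MonthStart AND HireDate < @MonthEnd)        AS Hired,       -- the "+N"
    (SELECT COUNT(*) FROM Employees
     WHERE TerminationDate >= @MonthStart
       AND TerminationDate < @MonthEnd)                             AS Terminated   -- the "-M"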
If you need to track modifications (e.g. a change in title), then the approach that will give you maximum flexibility is to maintain a real-time history table with triggers (or your database's built-in change tracking, if available). I wouldn't recommend full snapshots, as that will consume enormous amounts of space over the lifetime of the app.
Your history table should contain all of the fields in your base table, plus two more: a date modified and a transaction type, and possibly a third autonumber/sequence/identity field as well. A T-SQL version follows:
CREATE TABLE EmployeeHistory
(
    TransactionID int NOT NULL IDENTITY(1, 1)
        CONSTRAINT PK_EmployeeHistory PRIMARY KEY CLUSTERED,
    TransactionDate datetime NOT NULL,
    TransactionType tinyint NOT NULL,  -- 1 = Add, 2 = Change, 3 = Delete
    EmployeeID int NOT NULL,
    EmployeeName varchar(100) NOT NULL,
    ...
)
Then maintain it with a trigger:
CREATE TRIGGER tr_Employees_History
ON Employees
FOR INSERT, UPDATE
AS BEGIN
    INSERT EmployeeHistory (TransactionDate, TransactionType, EmployeeID, ...)
    SELECT
        GETDATE(),
        CASE
            WHEN d.EmployeeID IS NULL THEN 1
            WHEN (i.TerminationDate IS NOT NULL) AND
                 (d.TerminationDate IS NULL) THEN 3
            ELSE 2
        END,
        i.EmployeeID, i.EmployeeName, ...
    FROM inserted i
    LEFT JOIN deleted d
        ON d.EmployeeID = i.EmployeeID
END
I'm going to assume you don't delete employee records and just set a TerminationDate; if you delete instead (please don't do this), then you'll need to write a similar DELETE trigger instead of the second CASE WHEN i.TerminationDate ... line.
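For completeness, a minimal sketch of what such a DELETE trigger could look like, assuming the same EmployeeHistory columns as above (the elided column list is illustrative):
CREATE TRIGGER tr_Employees_History_Delete
ON Employees
FOR DELETE
AS BEGIN
    -- Log a "Delete" (type 3) row for each employee removed from the base table
    INSERT EmployeeHistory (TransactionDate, TransactionType, EmployeeID, EmployeeName, ...)
    SELECT GETDATE(), 3, d.EmployeeID, d.EmployeeName, ...
    FROM deleted d
END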
Now seed your history table:
INSERT EmployeeHistory (TransactionDate, TransactionType, EmployeeID, ...)
SELECT HireDate, 1, EmployeeID, ...
FROM Employees
Note: if you don't have the HireDate, then just replace that with GETDATE(); your history will only be valid from the moment you seed it.
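If some employees on file are already terminated, you may also want to seed their termination rows so the snapshot logic below doesn't count them as active. A sketch, assuming the TerminationDate column suggested earlier:
INSERT EmployeeHistory (TransactionDate, TransactionType, EmployeeID, ...)
SELECT TerminationDate, 3, EmployeeID, ...
FROM Employees
WHERE TerminationDate IS NOT NULL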
Now if you want to get a historical "snapshot", you can do this:
CREATE FUNCTION dbo.GetEmployeeSnapshot(@ReportDate datetime)
RETURNS TABLE
AS RETURN
    WITH History_CTE AS
    (
        SELECT
            TransactionType, EmployeeID, EmployeeName, ...,
            ROW_NUMBER() OVER (PARTITION BY EmployeeID
                               ORDER BY TransactionDate DESC, TransactionID DESC) AS RowNum
        FROM EmployeeHistory
        WHERE TransactionDate <= @ReportDate
    )
    SELECT *
    FROM History_CTE
    WHERE RowNum = 1                    -- latest transaction per employee as of @ReportDate
      AND TransactionType IN (1, 2)     -- filter out terminated employees
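With that in place, the month-over-month report from the question can be assembled by combining a snapshot count with the add/terminate transactions recorded for the month. A rough sketch, with hypothetical report dates:
DECLARE @ReportDate datetime = '20240401'   -- e.g. April 1
DECLARE @PrevMonth  datetime = '20240301'   -- start of the previous month

SELECT
    (SELECT COUNT(*) FROM dbo.GetEmployeeSnapshot(@ReportDate)) AS Headcount,
    (SELECT COUNT(*) FROM EmployeeHistory
     WHERE TransactionType = 1
       AND TransactionDate >= @PrevMonth
       AND TransactionDate <  @ReportDate)                      AS AddedLastMonth,
    (SELECT COUNT(*) FROM EmployeeHistory
     WHERE TransactionType = 3
       AND TransactionDate >= @PrevMonth
       AND TransactionDate <  @ReportDate)                      AS TerminatedLastMonth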
If these queries run slowly, or you need to speed up certain aggregates like a headcount, then (and only then) should you start thinking about snapshot tables:
CREATE TABLE HeadcountHistory
(
    ReportDate datetime NOT NULL
        CONSTRAINT PK_HeadcountHistory PRIMARY KEY CLUSTERED,
    HeadCount int NOT NULL
)
And the update proc:
CREATE PROCEDURE dbo.UpdateHeadcountHistory
AS
    DECLARE @ReportDate datetime
    SET @ReportDate = GETDATE()

    INSERT HeadcountHistory (ReportDate, HeadCount)
    SELECT @ReportDate, COUNT(*)
    FROM dbo.GetEmployeeSnapshot(@ReportDate)
Run that last sproc as part of a scheduled job and then you'll have a denormalized reporting table for the specific aggregates you want.
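Once a few rows have accumulated, the month-over-month change can be read straight off that table; one possible query (LAG requires SQL Server 2012 or later):
SELECT
    ReportDate,
    HeadCount,
    HeadCount - LAG(HeadCount) OVER (ORDER BY ReportDate) AS ChangeSincePreviousReport
FROM HeadcountHistory
ORDER BY ReportDate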
Anything more complicated than this and I think you'd want to start looking into a data warehouse instead.
If you are just running this on a schedule, then I would create a data summary table: once a month, run a process to do your counts and add a row to the summary table representing the data. That way, you can look back into history and generate whatever statistics you need. You might want to consider generating this data on a more frequent basis than you plan to report on (say, weekly); as long as you have higher resolution than the reporting period, you should have all the data you need.
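A minimal sketch of that idea, with hypothetical table and column names, assuming active employees are those without a TerminationDate as in the answer above:
-- One row per summary run; add more metric columns as needed
CREATE TABLE EmployeeSummary
(
    SummaryDate date NOT NULL PRIMARY KEY,
    HeadCount   int  NOT NULL
)

-- Scheduled (e.g. weekly) process: record the current counts
INSERT EmployeeSummary (SummaryDate, HeadCount)
SELECT CAST(GETDATE() AS date), COUNT(*)
FROM Employees
WHERE TerminationDate IS NULL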