Versioning in SQL Tables - how to handle it?
Here's a fictional scenario with some populated data. For tax purposes, my fictional company must retain records of historical data. For this reason, I've included a version column to the table.
TABLE EMPLOYEE: (with personal commentary)
|ID | VERSION | NAME | Position | PAY |
+---+---------+------------+----------+-----+
| 1 | 1 | John Doe | Owner | 100 | Started company
| 1 | 2 | John Doe | Owner | 80 | Pay cut to hire a coder
| 2 | 1 | Mark May | Coder | 20 |开发者_StackOverflow社区 Hire said coder
| 2 | 2 | Mark May | Coder | 30 | Productive coder gets raise
| 3 | 1 | Jane Field | Admn Asst| 15 | Need office staff
| 2 | 3 | Mark May | Coder | 35 | Productive coder gets raise
| 1 | 3 | John Doe | Owner | 120 | Sales = profit for owner!
| 3 | 2 | Jane Field | Admn Asst| 20 | Raise for office staff
| 4 | 1 | Cody Munn | Coder | 20 | Hire another coder
| 4 | 2 | Cody Munn | Coder | 25 | Give that coder raise
| 3 | 3 | Jane Munn | Admn Asst| 20 | Jane marries Cody <3
| 2 | 4 | Mark May | Dev Lead | 40 | Promote mark to Dev Lead
| 4 | 3 | Cody Munn | Coder | 30 | Give Cody a raise
| 2 | 5 | Mark May | Retired | 0 | Mark retires
| 5 | 1 | Joey Trib | Dev Lead | 40 | Bring outside help for Dev Lead
| 6 | 1 | Hire Meplz | Coder | 10 | Hire a cheap coder
| 3 | 4 | Jane Munn | Retired | 0 | Jane quits
| 7 | 1 | Work Fofre | Admn Asst| 10 | Hire Janes replacement
| 8 | 1 | Fran Hesky | Coder | 10 | Hire another coder
| 9 | 1 | Deby Olav | Coder | 25 | Hire another coder
| 4 | 4 | Cody Munn | VP Ops | 80 | Promote Cody
| 9 | 2 | Deby Olav | VP Ops | 80 | Cody fails at VP Ops, promote Deby
| 4 | 5 | Cody Munn | Retired | 0 | Cody retires in shame
| 5 | 2 | Joey Trib | Dev Lead | 50 | Give Joey a raise
+---+---------+------------+----------+-----+
Now, if I wanted to do something like "Get a list of the current coders" I couldn't just do SELECT * FROM EMPLOYEE WHERE Position = 'Coder'
because that would return lots of historical data... which is bad.
I'm looking for good ideas to handle this scenario. I see a few options that jump out at me, but I'm sure someone's going to say "Wow, that's a rookie mistake, glow... try this on for size:" which is what this place is all about, right? :-)
Idea number 1: Keep a version table with the current version like this
TABLE EMPLOYEE_VERSION:
|ID |VERSION|
+---+-------+
| 1 | 3 |
| 2 | 5 |
| 3 | 4 |
| 4 | 6 |
| 5 | 2 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 2 |
+---+-------+
Although I'm not sure how I'd do that with a single query, I'm sure it could be done, and I bet I could figure it out with a rather small amount of effort.
Of course, I would have to update this table every time I insert into the EMPLOYEE table to increment the version for the given ID (or insert into the version table when a new id is made).
The overhead of that seems undesireable.
Idea number 2: Keep an archive table and a main table. Before updating the main table, insert the row I'm about to overwrite into archive table, and use the main table as I normally would as if I wasn't concerned about versioning.
Idea number 3: Find a query that adds something along the lines of SELECT * FROM EMPLOYEE WHERE Position = 'Coder' and version=MaxVersionForId(EMPLOYEE.ID)
... Not entirely sure how I'd do this. This seems the best idea to me, but I'm really not sure at this point.
Idea number 4: Make a column for "current" and add "WHERE current = true AND ..."
It occurs to me that surely people have done this before, run into these same problems, and have insight on it to share, and so I come to collect that! :) I've tried to find examples of the problem on here already, but they seems specialized to a particular scenario.
Thanks!
EDIT 1:
Firstly, I appreciate all answers, and you've all said the same thing - DATE
is better than VERSION NUMBER
. One reason I was going with VERSION NUMBER
was to simplify the process of updating in the server to prevent the following scenario
Person A loads employee record 3 in his session, and it has version 4. Person B loads employee record 3 in his session, and it has version 4. Person A makes changes and commits. This works because the most recent version in the database is 4. It is now 5. Person B makes changes and commits. This fails because the most recent version is 5, while his is 4.
How would the EFFECTIVE DATE
pattern address this issue?
EDIT 2:
I think I could do it by doing something like this: Person A loads employee record 3 in his session, and it's effective date is 1-1-2010, 1:00 pm, with no experation. Person B loads employee record 3 in his session, and it's effective date is 1-1-2010, 1:00 pm, with no experation. Person A makes changes and commits. The old copy goes to the archive table (basically idea 2) with an experation date of 9/22/2010 1:00 pm. The updated version of the main table has an effective date of 9/22/2010 1:00 pm. Person B makes changes and commits. The commit fails because the effective dates (in the database and session) don't match.
What you have here is called a Slowly Changing Dimension (SCD). There are some proven methods for dealing with it:
http://en.wikipedia.org/wiki/Slowly_changing_dimension
Thought I'd add that since no one seems to call it by name.
I think you've started down the wrong path.
Typically, for versioning or storing historical data you do one of two (or both) things.
You have a separate table that mimics the original table + a date/time column for the date it was changed. Whenever a record is updated, you insert the existing contents into the history table just prior to the update.
You have a separate warehouse database. In this case you can either version it just like in #1 above OR you simply snapshot it once every so often (hourly, daily, weekly..)
Keeping your version number in the same table as your normal one has several problems. First, the table size is going to grow like crazy. This will put constant pressure on normal production queries.
Second, it's going to radically increase your query complexity for joins etc in order to make sure the latest version of each record is being used.
An approach that I've designed for a recent database is to use revisions as follows:
Keep your entity in two tables:
"employee" stores a primary key ID and any data that you do not want to be versioned (if there is any).
"employee_revision" stores all the salient data about the employee, with a foreign key to the employee table and a foreign key, "RevisionID" to a table called "revision".
Make a new table called "revision". This can be used by all the entities in your database, not just employee. It contains an identity column for the primary key (or AutoNumber, or whatever your database calls such a thing). It also contains EffectiveFrom and EffectiveTo columns. I also have a text column on the table - entity_type - for human readability reasons which contain the name of the primary revision table (in this case "employee"). The revision table contains no foreign keys. The default value for EffectiveFrom is 1-Jan-1900 and the default value for EffectiveTo is 31-Dec-9999. This allows me to not simplify the date querying.
I make sure that the revision table is well indexed on (EffectiveFrom, EffectiveTo, RevisionID) and also on (RevisionID, EffectiveFrom, EffectiveTo).
I can then use joins and simple <> comparisons to select an appropriate record for any date. This also means that relations between entities are also fully versioned. In fact, I find it useful to use SQL Server table-valued functions to allow very simply querying of any date.
Here's an example (assuming that you don't want to version employee names so that if they change their name, the change is effective historically).
--------
employee
--------
employee_id | employee_name
----------- | -------------
12351 | John Smith
-----------------
employee_revision
-----------------
employee_id | revision_id | department_id | position_id | pay
----------- | ----------- | ------------- | ----------- | ----------
12351 | 657442 | 72 | 23 | 22000.00
12351 | 657512 | 72 | 27 | 22000.00
12351 | 657983 | 72 | 27 | 28000.00
--------
revision
--------
revision_id | effective_from | effective_to | entity_type
----------- | -------------- | ------------ | -----------
657442 | 01-Jan-1900 | 03-Mar-2007 | EMPLOYEE
657512 | 04-Mar-2007 | 22-Jun-2009 | EMPLOYEE
657983 | 23-Jun-2009 | 31-Dec-9999 | EMPLOYEE
One advantage of storing your revision metadata in a separate table is that it's easy to apply it consistently to all your entities. Another is that it's easier to expand it to include other things, such as branches or scenarios, without having to modify every table. My principal reason is that it keeps your main entity tables clear and uncluttered.
(The data and example above are fictional - my database does not model employees).
Here is my suggested approach, which has worked very well for me in the past:
- Forget the version number. Instead, use
StartDate
andEndDate
columns - Write a trigger to ensure that there are no overlapping date ranges for the same
ID
, and that there is only ever one record with aNULL
EndDate
for the sameID
(this is your currently effective record) - Put indexes on
StartDate
andEndDate
; this should give you reasonable performance
This will easily let you report by date:
select *
from MyTable
where MyReportDate between StartDate and EndDate
or get the current info:
select *
from MyTable
where EndDate is null
Although the question has been asked 8 years ago, it worths to mention there is feature exactly for this in SQL Server 2016. System-versioned Temporal Table
Every table in SQL Server 2016 and above can have a history table, which the historical data will be populated automatically by SQL Server itself.
All you need is to add two datetime2 columns and one clause to the table:
CREATE TABLE Employee
(
Id int NOT NULL PRIMARY KEY CLUSTERED,
[Name] varchar(50) NOT NULL,
Position varchar(50) NULL,
Pay money NULL,
ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
ValidTo datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
PERIOD FOR SYSTEM_TIME (ValidFrom,ValidTo)
)
WITH (SYSTEM_VERSIONING = ON);
The system versioned table creates a temporal table which maintains the history of the data. You can use a custom name WITH (SYSTEM_VERSIONING = ON ( HISTORY_TABLE = dbo.EmployeeHistory ) );
In this link you can find more details about System-version temporal tables.
As @NotMe mentioned, historical tables can be grow very fast, so there are a few ways to get around this. Take a look here
Idea 3 will work:
SELECT * FROM EMPLOYEE AS e1
WHERE Position = 'Coder'
AND Version = (
SELECT MAX(Version) FROM Employee AS e2
WHERE e1.ID=e2.ID)
You really want to use something like a date though, which is much easier to program and track, and will use the same logic (something like an EffectiveDate column)
EDIT:
Chris is totally correct about moving this info out of your production table for performance, especially if you expect frequent updates. Another option would be to make a VIEW that only shows you the most recent version of each person's info, that you build off of this table.
You are definitely doing this wrong. Keeping a database running sweetly requires that you only have the minimum amount of data in your production tables that you need. Inevitably holding historical data in with the live adds redundancy that will complicate queries and slow performance, plus your successors are going to look really askew at this before submitting it to the DailyWTF!
Instead create a copy of the table - EmployeeHistorical for instance - but with the ID column not set as identity (you might choose to add an additional new ID column and a dateCreated timestamp column too). Then add a trigger to your Employee table that fires on update & delete and writes out a copy of the complete row to the Historical table. And while you're at it capturing the ID of the user doing the edit often comes in handy for audit purposes.
Generally when I'm doing this on an active table I try and create the historical table in a different database as among other things this reduces fragmentation (and hence maintenance) on your prime database and it's easier to handle backups - as archives can grow very large.
Your issues about edit contention should be handled with the normal database transaction and locking mechanisms. Coding adhoc hacks up to emulate such yourself is always time-consuming and error prone (some edge condition you've not thought of always pops up, and to write locks correctly you've really got to grok sempahores, which is decidedly non-trivial)
精彩评论