Implementing a Type-2 Slowly Changing Dimension in PostgreSQL
I am planning to implement a Type-2 SCD in PostgreSQL. The downside of SCDs is that you cannot have foreign keys referencing thos tables if used as is often seen. In other words, I often see referential integrity being dealt with within the application code. This strikes me as bad practice as it could be done directly within the database. Adding a handful of triggers could even hide this implementation detail from the app-coders.
I came up with the following schemas. Are these okay?
One-to-Many
--
-- One-to-Many
--
BEGIN;
CREATE TABLE document(
id serial not null,
revision integer not null default 1,
title varchar(30),
primary key (id, revision)
);
CREATE TABLE page(
id serial not null,
title varchar(30),
document_id integer not null,
document_revision integer not null,
foreign key (document_id, document_revision) references document(id, revisi开发者_如何学运维on)
);
-- Insert the first revision
INSERT INTO document (title) VALUES ('my first document');
INSERT INTO page (title, document_id, document_revision) VALUES ('my first page', 1, 1);
-- DEBUG: display
SELECT * FROM document d inner join page p ON ( d.id = p.document_id and d.revision = p.document_revision );
-- "update" the document, by inserting a new revision
INSERT INTO document (id, revision, title) VALUES (1, 2, 'my first document, edited');
-- update the references
UPDATE page SET document_revision = 2 WHERE document_id = 1;
-- DEBUG: display
SELECT * FROM document d inner join page p ON ( d.id = p.document_id and d.revision = p.document_revision );
ROLLBACK;
Many-to-One
--
-- Many-to-One
--
BEGIN;
CREATE TABLE page(
id serial not null primary key,
title varchar(30)
);
CREATE TABLE document(
id serial not null,
revision integer not null default 1,
title varchar(30),
page_id integer references page(id),
primary key (id, revision)
);
-- Insert initial revision
INSERT INTO page (title) VALUES ('my first page');
INSERT INTO document (title, page_id) VALUES ('my first document', 1);
INSERT INTO document (title, page_id) VALUES ('my second document', 1);
-- DEBUG: display
SELECT * FROM page p inner join document d on (p.id = d.page_id);
-- destroy the link "from" the old revision
UPDATE document SET page_id = NULL WHERE id=1;
-- Add a new revision, referencing the page
INSERT INTO document ( id, revision, title, page_id ) VALUES ( 1, 2, 'My First Document, edited', 1 );
-- DEBUG: display
SELECT * FROM page p inner join document d on (p.id = d.page_id);
SELECT * FROM document;
ROLLBACK;
Many-to-Many
--
-- Many-to-Many
--
BEGIN;
CREATE TABLE page(
id serial not null primary key,
title varchar(30)
);
CREATE TABLE document(
id serial not null,
revision integer not null default 1,
title varchar(30),
primary key (id, revision)
);
CREATE TABLE page_contains_document(
page_id integer not null references page(id),
document_id integer not null,
document_revision integer not null,
foreign key (document_id, document_revision) references document( id, revision )
);
-- Insert initial revision
INSERT INTO page (title) VALUES ('My First page');
INSERT INTO document (title) VALUES ('My Fist Document');
INSERT INTO page_contains_document (page_id, document_id, document_revision) VALUES (1, 1, 1);
-- DEBUG: display
SELECT p.title, d.title, d.revision FROM page p INNER JOIN page_contains_document pcd ON (p.id = pcd.page_id) INNER JOIN document d ON (d.id = pcd.document_id and d.revision = pcd.document_revision);
-- Add a new document revision
INSERT INTO document (id, revision, title) VALUES (1, 2, 'My Fist Document, edited');
-- update the reference
UPDATE page_contains_document SET document_revision=2 WHERE document_id=1;
-- DEBUG: display
SELECT p.title, d.title, d.revision FROM page p INNER JOIN page_contains_document pcd ON (p.id = pcd.page_id) INNER JOIN document d ON (d.id = pcd.document_id and d.revision = pcd.document_revision);
ROLLBACK;
OK. I think we need to clear up some important misunderstandings about why we do SCD type 2.
It should hold all the data in one table bracketed by dates (not revision numbers!).
So, you could have:
id , name , valid_from, valid_to
1111 , MyBook , '2009-03-01', '9999-12-31'
After an update:
1111 , Mybook , '2009-03-01', '2009-06-20'
1111 , Mybook , '2009-06-21', '9999-12-31'
A similar structure with valid from and valid to dates should exist in the "pages" database.
The whole point is that now you can either get the latest version with:
select * from books where valid_to = '9999-12-31'
Or get the version that was valid on the first of April
select * from books where valid_to >= '2009-04-01' and valid_from <= '2009-04-01'
Also within you page structure you only need to store the updated pages. You don't need a new copy of all the pages for every revision.
I know it's a bit late for a reply to this question, but I think this could be useful to people looking for the same thing as you.
We've written a module implementing SCD-Type 2 which is used with Django. It was tested using PostgreSQL, so it should match your requirements. It also covers OneToMany- and ManyToMany-relations.
For more details look for CleanerVersion on GitHub or go to https://github.com/swisscom/cleanerversion directly.
精彩评论