Using DVCS for an RDBMS audit trail
I'm looking to implement an audit trail for a reasonably complicated relational database, whose schema is prone to change. One avenue I'm thinking of is using a DVCS to track changes.
(The benefits I can imagine are: schemaless history, snapshots of the entire system's state, standard tools for analysis, playback and migration, efficient storage, a separate system, and keeping the DB clean. The database is not write-heavy and history is not a core feature; it's more for the sake of having an audit trail. Oh, and I like trying crazy new approaches to problems.)
I'm not an expert with these systems (I only have basic git familiarity), so I'm not sure how difficult it would be to implement. I'm thinking of taking Mercurial's approach, but possibly storing the file contents/manifests/changesets in a key-value data store rather than in actual files.
Data rows would be serialised to JSON, and each "file" could be a row. Alternatively, an entire table could be stored in a "file", with each row residing on the line number equal to its primary key (assuming the tables aren't too big; I'm expecting all of them to have fewer than 4000 or so rows). This might mean that the changesets could be generated automatically, without consulting the rest of the table "file".
(But I doubt it, because I think we need a SHA-1 hash of the whole file. The files could perhaps be split up by a predictable number of lines, e.g. 0 < primary key < 1000 in file 1, 1000 < primary key < 2000 in file 2, etc., keeping them smallish.)
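A rough sketch of the chunked table-"file" idea, in case it helps make it concrete. The chunk size of 1000 comes from the example above; the names locate and chunk_hash are made up for illustration, and putting boundary keys such as 1000 into the upper file is an assumption, since the ranges above leave that open.

```python
import hashlib

CHUNK = 1000  # rows per table "file", taken from the example ranges above


def locate(pk: int, chunk: int = CHUNK):
    """Map a primary key to (file number, line offset inside that file).

    Assumption: files are numbered from 1 and a boundary key such as 1000
    lands at offset 0 of the next file.
    """
    return pk // chunk + 1, pk % chunk


def chunk_hash(lines):
    """SHA-1 over one chunk only, so an edit never rehashes the whole table."""
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


# One serialised row per line; empty lines stand for missing primary keys.
file_2 = [""] * CHUNK
file_no, offset = locate(1500)
file_2[offset] = '{"id": 1500, "name": "Bob"}'
before = chunk_hash(file_2)
file_2[offset] = '{"id": 1500, "name": "Robert"}'
after = chunk_hash(file_2)
```

With this layout a row change only requires reading and rehashing the one chunk that holds it, not the whole table.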
Is there anyone familiar with the internals of DVCSs, or with data structures in general, who might be able to comment on an approach like this? How could it be made to work, and should it even be done at all?
I guess there are two aspects to a system like this: 1) mapping SQL data to a DVCS system and 2) storing the DVCS data in a key/value data store (not files) for efficiency.
(NB: the JSON serialisation bit is covered by my ORM.)
I've looked into this a little on my own, and here are some comments to share.
Although I had thought using Mercurial from Python would make things easier, DVCSs have a lot of functionality that isn't necessary here (especially branching and merging). I think it would be easier to simply steal some design decisions and implement a basic system for my needs. So, here's what I came up with.
Blobs
The system makes a JSON representation of the record to be archived and generates a SHA-1 hash of it (a "node ID", if you will). This hash represents the state of that record at a given point in time and plays the same role as git's "blob".
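A minimal sketch of the hashing step, assuming records arrive as plain dicts. The name blob_id is hypothetical, and sorting the keys is an added assumption so that the same record state always produces the same hash regardless of key order.

```python
import hashlib
import json


def blob_id(record: dict) -> str:
    """Return the SHA-1 hex digest of a canonical JSON form of the record."""
    # sort_keys and fixed separators make the serialisation deterministic,
    # so identical record states always map to the same ID.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


a = blob_id({"id": 7, "name": "Ada"})
b = blob_id({"name": "Ada", "id": 7})      # same state, different key order
c = blob_id({"id": 7, "name": "Ada L."})   # changed state
```

Here a and b are equal while c differs, which is what lets the hash stand in for "the state of this record".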
Changesets
Changes are grouped into changesets. A changeset takes note of some metadata (timestamp, committer, etc) and links to any parent changesets and the current "tree".
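As a sketch, a changeset could be a small dict whose ID is derived from its own content. The field names and the make_changeset helper are assumptions for illustration, not anything Mercurial or git prescribes.

```python
import hashlib
import json


def make_changeset(tree_id, parent_ids, committer, timestamp):
    """Build a changeset and derive its ID from its content."""
    body = {
        "tree": tree_id,              # ID of the root tree for this state
        "parents": list(parent_ids),  # empty for the very first changeset
        "committer": committer,
        "timestamp": timestamp,
    }
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    changeset_id = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return changeset_id, body


# "tree-1" and "tree-2" are placeholder tree IDs for the example.
root_id, root = make_changeset("tree-1", [], "alice", 1000)
child_id, child = make_changeset("tree-2", [root_id], "alice", 1001)
```

Because the parent ID is part of the hashed content, each changeset ID pins down the whole history behind it.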
Trees
Instead of using Mercurial's "Manifest" approach, I've gone for git's "tree" structure. A tree is simply a list of blobs (model instances) or other trees. At the top level, each database table gets its own tree. The next level can then be all the records. If there are lots of records (there often are), they can be split up into subtrees.
Doing this means that if you only change one record, you can leave the untouched trees alone. It also allows each record to have its own blob, which makes things much easier to manage.
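Here is a rough sketch of a bucketed table tree, with a plain dict standing in for the object store. BUCKET_SIZE, sha1_json and build_table_tree are made-up names, and splitting by primary key range is just one possible way to form subtrees.

```python
import hashlib
import json

BUCKET_SIZE = 1000  # assumed number of primary keys per subtree


def sha1_json(obj) -> str:
    data = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


def build_table_tree(rows: dict, store: dict) -> str:
    """Store blobs and bucketed subtrees for one table; return the table tree ID.

    rows maps integer primary key -> record dict.
    """
    buckets = {}
    for pk, record in rows.items():
        blob = sha1_json(record)
        store[blob] = record
        buckets.setdefault(pk // BUCKET_SIZE, {})[str(pk)] = blob
    table_tree = {}
    for bucket, entries in buckets.items():
        subtree_id = sha1_json(entries)  # a subtree's ID depends only on its rows
        store[subtree_id] = entries
        table_tree[str(bucket)] = subtree_id
    table_id = sha1_json(table_tree)
    store[table_id] = table_tree
    return table_id


store = {}
rows = {1: {"id": 1, "name": "Ada"}, 1500: {"id": 1500, "name": "Bob"}}
before = build_table_tree(rows, store)
rows[1500] = {"id": 1500, "name": "Robert"}
after = build_table_tree(rows, store)
```

After the edit to row 1500, the table tree ID changes but the subtree for bucket 0 keeps its old ID, so it is stored only once.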
Storage
I like Mercurial's revlog idea, because it allows you to minimise the data storage (storing only the changes between revisions) and at the same time keep retrieval quick (all revisions are in the same data structure). This is done on a per-record basis.
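A simplified per-record revlog might look like the sketch below. It is not Mercurial's actual format: field-level diffs stand in for binary deltas, and RecordRevlog and SNAPSHOT_EVERY are hypothetical names. The idea it keeps is that full snapshots are stored at intervals so that reading any revision replays only a short chain of deltas.

```python
SNAPSHOT_EVERY = 4  # assumed interval between full snapshots


class RecordRevlog:
    """Append-only history of one record: full snapshots plus field deltas."""

    def __init__(self):
        self.entries = []  # ("full", state) or ("delta", {"set": ..., "unset": ...})

    def append(self, state: dict) -> int:
        if len(self.entries) % SNAPSHOT_EVERY == 0:
            self.entries.append(("full", dict(state)))
        else:
            prev = self.read(len(self.entries) - 1)
            delta = {
                "set": {k: v for k, v in state.items()
                        if k not in prev or prev[k] != v},
                "unset": [k for k in prev if k not in state],
            }
            self.entries.append(("delta", delta))
        return len(self.entries) - 1

    def read(self, rev: int) -> dict:
        # Walk back to the nearest full snapshot, then replay deltas forward.
        base = rev
        while self.entries[base][0] != "full":
            base -= 1
        state = dict(self.entries[base][1])
        for _kind, delta in self.entries[base + 1 : rev + 1]:
            state.update(delta["set"])
            for key in delta["unset"]:
                state.pop(key, None)
        return state


states = [
    {"id": 1, "name": "Ada", "email": "a@example.com"},
    {"id": 1, "name": "Ada L.", "email": "a@example.com"},
    {"id": 1, "name": "Ada L."},
    {"id": 1, "name": "Ada L.", "phone": "123"},
    {"id": 1, "name": "Ada Lovelace", "phone": "123"},
    {"id": 1, "name": "Ada Lovelace", "phone": "456"},
]
log = RecordRevlog()
revs = [log.append(s) for s in states]
```

The entries list is plain data, so it could be serialised to JSON and kept under one key per record.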
I think a system like MongoDB would be best for storing the data (it has to be key-value, and I think Redis is too focused on keeping everything in memory, which is not important for an archive). It would store changesets, trees and revlogs. Add a few extra keys for the current HEAD etc. and the system is complete.
Because we're using trees, we probably don't need to explicitly link each foreign key to the exact "blob" it refers to. Just using the primary key should be enough. I hope!
Use case 1. Archiving a change
As soon as a change is made, the current state of the record is serialised to json and a hash is generated for its state. This is done for all other related changes and packaged into a changeset. When complete, the relevant revlogs are updated, new trees and subtrees are generated with the new object ("blob") hashes and the changeset is "committed" with meta information.
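The archiving flow could be sketched roughly as follows, with a plain dict standing in for the key-value store. The helper names (sha1_json, put, commit) and the "HEAD" key layout are assumptions, and the sketch stores full blobs and flat table trees, leaving out the revlog and subtree steps to stay short.

```python
import hashlib
import json


def sha1_json(obj) -> str:
    data = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


def put(store: dict, obj) -> str:
    """Store an object under its content hash and return that hash."""
    key = sha1_json(obj)
    store[key] = obj
    return key


def commit(store: dict, changes: dict, committer: str, timestamp: int) -> str:
    """Archive changed records and advance HEAD.

    changes maps table name -> {primary key -> new record state}.
    """
    head = store.get("HEAD")
    # Start from the previous root tree so untouched tables are reused as-is.
    root = dict(store[store[head]["tree"]]) if head else {}
    for table, rows in changes.items():
        table_tree = dict(store[root[table]]) if table in root else {}
        for pk, record in rows.items():
            table_tree[str(pk)] = put(store, record)
        root[table] = put(store, table_tree)
    changeset = {
        "tree": put(store, root),
        "parents": [head] if head else [],
        "committer": committer,
        "timestamp": timestamp,
    }
    store["HEAD"] = put(store, changeset)
    return store["HEAD"]


store = {}
first = commit(store, {"users": {1: {"id": 1, "name": "Ada"}},
                       "orders": {9: {"id": 9, "user_id": 1}}}, "alice", 1000)
second = commit(store, {"users": {1: {"id": 1, "name": "Ada L."}}}, "bob", 1001)
```

The second commit touches only the users table, so the orders tree from the first commit is referenced again rather than rewritten.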
Use case 2. Retrieving an old state
After finding the relevant changeset (MongoDB search?), the tree is then traversed until we find the blob ID we're looking for. We go to the revlog and retrieve the record's state or generate it using the available snapshots and changesets. The user will then have to decide if the foreign keys need to be retrieved too, but doing that will be easy (using the same changeset we started with).
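Retrieval could then be a short walk from changeset to tree to blob, as in this sketch. It assumes full record states are stored as blobs (no revlog replay), and snapshot and record_at are hypothetical helpers; snapshot exists only to build some example history.

```python
import hashlib
import json


def put(store, obj):
    data = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    key = hashlib.sha1(data.encode("utf-8")).hexdigest()
    store[key] = obj
    return key


def snapshot(store, tables, parent=None):
    """Example-only helper: write a full snapshot, return its changeset ID."""
    root = {}
    for table, rows in tables.items():
        entries = {str(pk): put(store, rec) for pk, rec in rows.items()}
        root[table] = put(store, entries)
    return put(store, {"tree": put(store, root),
                       "parents": [parent] if parent else []})


def record_at(store, changeset_id, table, pk):
    """Walk changeset -> root tree -> table tree -> blob."""
    root = store[store[changeset_id]["tree"]]
    table_tree = store[root[table]]
    return store[table_tree[str(pk)]]


store = {}
old = snapshot(store, {"users": {1: {"id": 1, "name": "Ada"}},
                       "orders": {9: {"id": 9, "user_id": 1}}})
new = snapshot(store, {"users": {1: {"id": 1, "name": "Ada L."}},
                       "orders": {9: {"id": 9, "user_id": 1}}}, parent=old)

# Follow the foreign key using the same changeset the order came from.
order = record_at(store, old, "orders", 9)
buyer = record_at(store, old, "users", order["user_id"])
```

Resolving the foreign key against the old changeset returns the user as it was at that time, which is why storing only the primary key is enough.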
Summary
None of these operations should be too expensive, and we have a space-efficient description of all changes to a database. The archive is kept separately from the production database, allowing the database to do its thing and allowing changes to its schema to take place over time.
If the database is not write-heavy (as you say), why not just implement the actual database tables in a way that achieves your goal? For example, add a "version" column. Then never update or delete rows, except for this special column, which you can set to NULL to mean "current," 1 to mean "the oldest known", and go up from there. When you want to update a row, set its version to the next higher one, and insert a new one with no version. Then when you query, just select rows with the empty version.
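A small sketch of that scheme using Python's built-in sqlite3 module. The users table, its columns and the save helper are invented for the example; the version numbering follows the description above (NULL for current, 1 for the oldest known row).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, version INTEGER)")


def save(conn, user_id, name):
    """Retire the current row (if any) and insert the new state as current."""
    with conn:  # commit both statements together, or roll back on error
        (last,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM users WHERE id = ?",
            (user_id,),
        ).fetchone()
        # The old current row gets the next version number.
        conn.execute(
            "UPDATE users SET version = ? WHERE id = ? AND version IS NULL",
            (last + 1, user_id),
        )
        conn.execute(
            "INSERT INTO users (id, name, version) VALUES (?, ?, NULL)",
            (user_id, name),
        )


save(conn, 1, "Ada")
save(conn, 1, "Ada L.")
save(conn, 1, "Ada Lovelace")

current = conn.execute(
    "SELECT name FROM users WHERE id = 1 AND version IS NULL"
).fetchall()
history = conn.execute(
    "SELECT name, version FROM users "
    "WHERE id = 1 AND version IS NOT NULL ORDER BY version"
).fetchall()
```

Normal queries just add "version IS NULL"; the audit trail is every other row, ordered by version.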
Take a look at CQRS and Greg Young's event sourcing. I also have a blog post about working in meta events that pinpoint schema changes within the river of business events.
http://adventuresinagile.blogspot.com/2009/09/rewind-button-for-your-application.html
If you look through my blog, you'll also find version script schemes, which you can keep under source control.