
Strategies for populating a Reporting/Data Warehouse database

For our reporting application, we have a process that aggregates several databases into a single 'reporting' database on a nightly basis. The schema of the reporting database is quite different from that of the separate 'production' databases we are aggregating, so a good amount of business logic goes into how the data is aggregated.

Right now this process is implemented by several stored procedures that run nightly. As we add more detail to the reporting database, the logic in the stored procedures keeps growing more fragile and unmanageable.

What are some other strategies that could be used to populate this reporting database?

  • SSIS? This has been considered but doesn't appear to offer a much cleaner, more maintainable approach than just the stored procedures.
  • A separate C# (or whatever language) process that aggregates the data in memory and then pushes it into the reporting database? This would allow us to write Unit Tests for the logic and organize the code in a much more maintainable manner.

I'm looking for any new ideas or additional thoughts on the above. Thanks!


Our general process is:

  1. Copy data from source table(s) into tables with exactly the same structure in a loading database
  2. Transform data into staging tables, which have the same structure as the final fact/dimension tables
  3. Copy data from the staging tables to the fact/dimension tables

SSIS is good for step 1, which is more or less a 1:1 copy process, with some basic data type mappings and string transformations.
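For illustration only, here is roughly what that 1:1 copy would look like if expressed in T-SQL instead of an SSIS data flow; the Loading and SourceDb databases and the Orders columns are invented names, not anything from the question:

    -- Step 1 sketch: truncate-and-reload a loading table that mirrors the source,
    -- applying only basic type casts and string cleanup (hypothetical names).
    TRUNCATE TABLE Loading.dbo.Orders;

    INSERT INTO Loading.dbo.Orders (OrderID, CustomerCode, ProductCode, OrderDate, Amount)
    SELECT
        o.OrderID,
        LTRIM(RTRIM(o.CustomerCode)),      -- basic string transformation
        LTRIM(RTRIM(o.ProductCode)),
        CAST(o.OrderDate AS date),         -- basic data type mapping
        CAST(o.Amount AS decimal(18, 2))
    FROM SourceDb.dbo.Orders AS o;         -- in practice this is the SSIS source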

For step 2, we use a mix of stored procs, .NET and Python. Most of the logic is in procedures, with things like heavy parsing in external code. The major benefit of pure TSQL is that very often transformations depend on other data in the loading database; for example, using mapping tables in a SQL JOIN is much faster than doing a row-by-row lookup in an external script, even with caching. Admittedly, that's just my experience, and procedural processing might be better for your data set.
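As a sketch of that set-based lookup point (all table and column names below are hypothetical, continuing the made-up example above), a step 2 transformation that resolves source codes to surrogate keys via mapping tables looks something like this:

    -- Step 2 sketch: transform loading data into a staging table shaped like the
    -- final fact table, using JOINs to mapping tables instead of per-row lookups.
    INSERT INTO Staging.dbo.FactSales (CustomerKey, ProductKey, OrderDate, Amount)
    SELECT
        cm.CustomerKey,                    -- surrogate key from the mapping table
        pm.ProductKey,
        l.OrderDate,
        l.Amount
    FROM Loading.dbo.Orders AS l
    JOIN Loading.dbo.CustomerMap AS cm
        ON cm.SourceCustomerCode = l.CustomerCode
    JOIN Loading.dbo.ProductMap AS pm
        ON pm.SourceProductCode = l.ProductCode;

The JOINs let the database engine perform the lookups as one set-based operation, which is where the speed advantage over row-by-row external processing comes from.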

In a few cases we do have to do some complex parsing (of DNA sequences) and TSQL is just not a viable solution. So that's where we use external .NET or Python code to do the work. I suppose we could do it all in .NET procedures/functions and keep it in the database, but there are other external connections required, so a separate program makes sense.

Step 3 is a series of INSERT... SELECT... statements: it's fast.
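Continuing the same hypothetical names, step 3 is nothing more than:

    -- Step 3 sketch: copy staged rows into the final fact table, skipping rows
    -- that were already loaded on a previous run (hypothetical key columns).
    INSERT INTO Reporting.dbo.FactSales (CustomerKey, ProductKey, OrderDate, Amount)
    SELECT s.CustomerKey, s.ProductKey, s.OrderDate, s.Amount
    FROM Staging.dbo.FactSales AS s
    WHERE NOT EXISTS (
        SELECT 1
        FROM Reporting.dbo.FactSales AS f
        WHERE f.CustomerKey = s.CustomerKey
          AND f.ProductKey  = s.ProductKey
          AND f.OrderDate   = s.OrderDate
    );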

So all in all, use the best tool for the job, and don't worry about mixing things up. An SSIS package - or packages - is a good way to link together stored procedures, executables and whatever else you need to do, so you can design, execute and log the whole load process in one place. If it's a huge process, you can use subpackages.

I know what you mean about TSQL feeling awkward (actually, I find it more repetitive than anything else), but it is very, very fast for data-driven operations. So my feeling is, do data processing in TSQL and string processing or other complex operations in external code.


I would take another look at SSIS. While there is a learning curve, it can be quite flexible. It supports many different ways of working with data, including stored procedures, ActiveX scripts and various file operations. It has the ability to handle errors and provide notifications via email or logging. Basically, it should be able to handle just about everything. The other option, a custom application, is probably going to be a lot more work (SSIS already has a lot of the basics covered) and is still going to be fragile - any change to the data structures will require a recompile and redeployment. I think a change to your SSIS package would probably be easier to make. For some of the more complicated logic you might even need multiple stages - a custom C# console program to manipulate the data a bit and then an SSIS package to load it into the database.

SSIS is a bit painful to learn and there are definitely some tricks to getting the most out of it, but I think it's a worthwhile investment. A good reference book or two would help (Wrox's Expert SQL Server 2005 Integration Services isn't bad).


I'd look at ETL (extract/transform/load) best practices. You're asking about buying vs. building, a specific product, and a specific technique. It's probably worthwhile to back up a few steps first.

A few considerations:

  • There are a lot of subtle tricks to delivering good ETL: making it run very fast, making it easy to manage, handling rule-level audit results, supporting high availability or even reliable recovery, and even serving as the recovery process for the reporting solution (rather than database backups).
  • You can build your own ETL. The downside is that commercial ETL solutions have pre-built adapters (which you may not need anyway), and that custom ETL solutions tend to fail because few developers are familiar with the batch processing patterns involved (see your existing architecture). Since ETL patterns are not well documented, you are unlikely to succeed in writing your own ETL solution unless you bring in a developer very experienced in this space.
  • When looking at commercial solutions, note that the metadata and auditing results are the most valuable part of the solution: the GUI-based transform builders aren't really any more productive than just writing code - but the metadata can be more productive than reading code when it comes to maintenance.
  • Complex environments are difficult to cover with a single ETL product - because of network access, performance, latency, data format, security or other requirements incompatible with your ETL tool. So a combination of custom and commercial components often results anyway.
  • Open source solutions like Pentaho are really commercial solutions if you want support or critical features.

So, I'd probably go with a commercial product if pulling data from commercial apps, if the requirements (performance, etc) are tough, or if you've got a junior or unreliable programming team. Otherwise you can write your own. In that case I'd get an ETL book or consultant to help understand the typical functionality and approaches.


I've run data warehouses that were built on stored procedures, and I have used SSIS. Neither is that much better than the other IMHO. The best tool I have heard of to manage the complexity of modern ETL is called Data Build Tool (DBT) (https://www.getdbt.com/). It has a ton of features that make things more manageable. Need to refresh a particular table in the reporting server? One command will rebuild it, including refreshing all the tables it depends on back to the source. Need dynamic SQL? This offers Jinja for scripting your dynamic SQL in ways you never thought possible. Need version control for what's in your database? DBT has you covered. After all that, it's free.
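To make that concrete, a dbt model is just a SELECT statement in a .sql file, with Jinja filling in the table references; the model and column names below are made up for illustration:

    -- models/fact_sales.sql (hypothetical dbt model)
    -- dbt materializes this SELECT as a table in the reporting database; the
    -- ref() calls resolve to the right schema and record the dependency graph.
    {{ config(materialized='table') }}

    select
        c.customer_key,
        p.product_key,
        o.order_date,
        o.amount
    from {{ ref('stg_orders') }}    as o
    join {{ ref('dim_customers') }} as c on c.customer_code = o.customer_code
    join {{ ref('dim_products') }}  as p on p.product_code  = o.product_code

Running something like dbt run --select +fact_sales rebuilds this model along with everything upstream of it, which is the 'one command' refresh mentioned above.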
