Avoid writing SQL queries altogether in SSIS
Working on a Data Warehouse project, the guy that gave us the tutorial advised that we stick to using SQL queries over defining a lot of data flow transformations, citing points like it'll consume a lot of memory on the ETL box so we'd rather leave the开发者_如何学编程 processing to the DB box. Is this really advisable? Where's the balance between relying on GUI tools over executing a bunch of SQL scripts on your Integration package?
And honestly, I'd like to avoid writing SQL queries as much as I can. (but that's beside the point. I'd really like to look at this objectively.)
The answer is: it depends, but you want to pick one or the other for any given job and avoid mixing the two where possible.
Generally, it's best to either do everything possible within the tool or do everything possible within stored procedure code. When you have significant amounts of logic split between layers the system becomes harder to trace and debug.
Where the tool can do the transformations without the data flows becoming awkward and convoluted you could use the tool and try to have little or no logic in queries. This means that one single layer has the business logic and it should be fairly obvious where to find it. However, ETL tools tend to handle highly complex transformations relatively poorly. The sweet spot for this type of approach is on systems where you have a large number of data sources but relatively simple transformations.
If you have relatively complex transformations you may be better off putting all the business logic and transformation into a layer of stored procedures. SQL code is better at implementing complex transformations in a maintainable way - I have it on fairly good authority that around half of all data warehouse projects in the banking and insurance sectors use this type of architecture for precisely that reason.
In this case the ETL tool can be used to implement relatively dumb data copies. Source data can be copied into staging areas essentially verbatim and then picked up by a body of stored procedure code that does the ETL. The ETL tool can be used for data copies, bulk load operations, logging, scheduling and other framework tasks.
In either case you're best off picking one approach. Otherwise, you can end up with business logic spread across extraction layers, database views, data flows, and stored procedure code. Logic spread across multiple layers is much harder to test.
When all of the logic is (for example) contained within stored procedures or focussed ETL transformation jobs you can unit test a given transformation in isolation. The clarity in design also helps with maintenance and auditing.
I find that using SQl code is not only faster to run, but it is faster to develop and much much easier to maintain.
Generally when you want to process each row individually, use a data flow, otherwise it may be better to use a Sql Command.
Personally I'd go with writing the SQL where I can. It's easier to optimise later and (usually) faster as well. Google will give much more detailed answers.
Another factor to think about is the provider you use for your connections.
You need to make the decision based on your needs. We use postgres DB, so we have to create a load of staging tables for some processes, which speeds the whole thing up.
You should also take into consideration the box it is running on, if you have an all powerful DB box, and a little ETL box, there'd be no point in running anything.
If you do all your processing on the ETL box you'll be dragging a lot of data across the network as well.
Check out these links to get you started:
ssistalk.com/category/ssis/ssis-advanced-techniques/
msdn.microsoft.com/en-us/library/ms141031.aspx
weblogs.sqlteam.com/jamesn/Default.aspx
I think this is a difficult question; and an interesting one as well.
One reason to use SSIS is to improve maintainability, IMHO. If you pack all the logic in SQL statements (and you sure can!) you tend to spoil this reason of using SSIS in the first place. You cannot really "see the data flow" anymore.
On the other hand I feel there are times when a well placed SQL statement has its value. For example when you read data from a table and for whatever reason already know you will only ever need the rows satisfying condition X I do not see the reason for reading the whole table and in the next step "conditional-splitting most of it away".
What I do not know is what this means in terms of performance, by the way. Is SSIS smart enough to see what is happening and change the "read-whole-table-and-conditional-split-it" into a "select Y from where X" on the fly (or when building/deploying)?
The big question is where to draw the line. And this depends to a certain extent on the people working on your ETL process. If everyone ever supporting the process knows SQL since its beginning you can better support a higher amount of SQL in your ETL than if you have co-workers (or customers, or successors you care about) that hardly understand what is happening in all your SQL, let alone change/improve/add to it.
So I think the bottom line is that neither not using nor doing everything in SQL is better. Try to make up some simple rules that fit your requirements and that everyone can live with, then follow them. This buys you the most value from using SSIS.
SQL Server does some things well and other things not so well. I use SSIS to import to or export data from SQL Server. During the course of the move I use SSIS where it makes sense. I can easily do work on a per row basis, which is not very efficient in SQL Server (cursors). To say that you shouldn't use transformations and data flows on an ETL box, because it is too expensive on the ETL box is like say 'don't drive your car too fast, because it causes the engine to work'. The purpose of an ETL and SSIS is to take some of the processing that SQL Sever does not do well and move it to an engine that does.
Got to use the right tool for the job. Generally, you do most things in SSIS, with certain things done in "pure" SQL.
For instance, in cases where you do a lot of UPDATE (table difference on dimension table in a dimensional model, say), you really don't want to execute an UPDATE for each row. In this scenario, you do a regular insert into a temporary table and then do the UPDATE in SQL, joining on appropriate keys.
精彩评论