SQL Server - ETL approach
We get daily files that need to be loaded into our database. The files are delivered to a server separate from the database server. Which of the two approaches below is better for the ETL from a performance perspective?
- Transfer the files from the delivery server to the database server, then do a bulk load.
- Open a DB connection from the delivery server and load from there.
Edited to add: The servers are all on the same network.
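For concreteness, here is a minimal sketch of both approaches in Python with pyodbc. The connection string, share path, table name, and three-column file layout are assumptions for illustration, not details from the question.

```python
# Sketch of the two approaches; server, database, table, and file
# names are hypothetical placeholders.
import shutil
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=dbserver;DATABASE=etl;Trusted_Connection=yes;"
)

def approach_1_copy_then_bulk_load(local_file: str, db_server_share: str) -> None:
    """Copy the file to the database server, then BULK INSERT it locally."""
    shutil.copy(local_file, db_server_share)  # file transfer over the network
    with pyodbc.connect(CONN_STR, autocommit=True) as conn:
        # BULK INSERT reads the file from a path visible to the DB server itself.
        conn.execute(
            "BULK INSERT dbo.DailyLoad FROM 'D:\\incoming\\daily.csv' "
            "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', TABLOCK)"
        )

def approach_2_load_over_connection(local_file: str) -> None:
    """Stream rows from the delivery server through a DB connection."""
    with pyodbc.connect(CONN_STR, autocommit=False) as conn:
        cur = conn.cursor()
        cur.fast_executemany = True  # batches the parameterized INSERTs
        # For very large files you would read and insert in chunks instead.
        with open(local_file) as f:
            rows = [line.rstrip("\n").split(",") for line in f]
        cur.executemany("INSERT INTO dbo.DailyLoad VALUES (?, ?, ?)", rows)
        conn.commit()
```

The trade-off the answers below discuss: approach 1 lets SQL Server read the file locally (minimally logged under TABLOCK), while approach 2 pays per-row network and parsing overhead.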
It depends on whether the source servers are SQL Server or another technology; on the driver used (if the source is Oracle, the Microsoft driver will badly hurt your performance; Oracle's own driver is better); on how much database overhead you want to impose (while one server is feeding the other, they are probably both I/O bound); and on your disk layout (i.e., reading from one RAID array and writing to another). Compressing before transferring over a 1 Gb or 100 Mb link might be more efficient. The dumps usually compress nicely, but as Beth noticed, test it.
With dump files you can exploit parallelism (multiple disk shares, multiple processors used for compression; use 7-Zip, period). Over Ethernet you probably won't be able to exploit as much parallelism. The same considerations apply to the target server.
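As a sketch of the parallel-compression idea, assuming the 7-Zip command-line tool (7z) is on the PATH; the archive and file names are placeholders:

```python
# Sketch: multi-threaded compression of dump files with 7-Zip before transfer.
import subprocess

def compress_dumps(archive: str, *files: str) -> None:
    # '7z a' adds files to an archive; -mmt=on enables multi-threaded
    # compression, which is where the extra processors pay off.
    subprocess.run(["7z", "a", "-mmt=on", archive, *files], check=True)

compress_dumps("daily_dumps.7z", "dump1.csv", "dump2.csv")
```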
All in all, as usual with performance: test, quantify, test, quantify, repeat. :)
The universal response: "it depends". In particular, it depends on what ETL technology you are using. If your ETL is tied to the database server for its processing power (SSIS, and to a lesser degree BODI), then you need to get your files onto the database server as soon as possible. If you have a more file-based ETL package (Ab Initio, Informatica), then you are free to do your transformations on the delivery server and then move the "ready-to-load" data onto the database server for bulk loading.
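For that last step, a minimal sketch using SQL Server's bcp utility; the server, table, and file names are hypothetical:

```python
# Sketch: bulk load a 'ready-to-load' flat file with bcp once it is
# reachable from the machine running the command.
import subprocess

def bulk_load(table: str, data_file: str, server: str) -> None:
    subprocess.run(
        [
            "bcp", table, "in", data_file,
            "-S", server,   # target SQL Server instance
            "-T",           # trusted (Windows) authentication
            "-c",           # character-format data file
            "-b", "50000",  # commit in batches of 50,000 rows
        ],
        check=True,
    )

bulk_load("etl.dbo.DailyLoad", r"\\dbserver\incoming\ready_to_load.dat", "dbserver")
```

bcp can be run from either machine; the data file just has to be readable from wherever it runs.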
In all cases, and especially if the files are very large, you can compress the data files before transporting them over the network.
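A minimal standard-library sketch of that compress-before-transfer step (file names are assumptions):

```python
# Sketch: gzip a large data file before sending it across the network.
import gzip
import shutil

def gzip_file(src: str, dst: str) -> None:
    # Stream in chunks so even very large files need not fit in memory.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

gzip_file("daily.csv", "daily.csv.gz")
# transfer daily.csv.gz, then decompress it on the database server before loading
```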