Bulk Insert into a HEAP vs CLUSTERED index where minimal logging is not an option (SQL Server 2008)
The tool currently used is Informatica, and we have bookend stored procedures that drop the clustered indexes before the load and add them back afterwards. In the stored procedure that re-creates the clustered indexes, the DDL for the indexes is hard-coded (we don't generate it from the sys tables for fear that Microsoft will change them and the regenerated index will be wrong or the script will fail). This causes issues when people create clustered indexes but don't think to update the stored procedure, so the next time the bulk load occurs those indexes are gone. We previously did this for all indexes but switched the nonclustered indexes to a disable/rebuild approach. That is not an option for the clustered index, though, because disabling the clustered index makes the table itself unusable for inserts: the clustered index essentially is the table.
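For context, the disable/rebuild pattern used for the nonclustered indexes looks roughly like this (table and index names are placeholders, not from the actual system):

```sql
-- Disable a nonclustered index before the bulk load; the index metadata
-- stays in the catalog, so no DDL needs to be hard-coded anywhere.
ALTER INDEX IX_MyTable_SomeColumn ON dbo.MyTable DISABLE;

-- ... bulk load runs here ...

-- Rebuild re-creates the index from its catalog definition.
ALTER INDEX IX_MyTable_SomeColumn ON dbo.MyTable REBUILD;
```

Doing the same to the clustered index is what's off the table: a disabled clustered index makes the whole table inaccessible until it is rebuilt.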
Performance is important but not everything. Good performance with easy maintainability trumps great performance with complex maintainability.
After reading many sites, it is almost universally agreed that when performing a bulk insert on data that is not ordered the same as your primary key, inserting into a heap and applying the PK afterwards is faster ( http://msdn.microsoft.com/en-us/library/ms177445.aspx , http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx ). Most of these sites make assumptions that I cannot rely on at my organization and with my toolset.
Due to our standards policies we have to use the FULL recovery model, so minimal logging will not occur no matter which option I choose with regard to heap vs. clustered index.
According to our Informatica admins, specifying TABLOCK or ORDER hints for bcp is not possible through the UI, and our organization is averse to customization beyond the UI for maintainability reasons.
So the question, after all of this: with all the factors above, would you recommend that we continue with our somewhat unreliable stored procedures, insert directly into a clustered index, or is there some third, far superior solution? I also realize there are other Stack questions similar to this one, but they do not address bulk loading specifically and/or make assumptions in their answers that don't hold here.
My suggestion would be to bulk load into a staging table (a heap, or CI matching the file order), (re-)build the clustered index there matching the destination table, and then insert straight from the staging table. To reduce blocking, escalation, log use etc. you could do this in batches of 10000 rows at a time, committing and/or checkpointing every so often.
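A hypothetical sketch of the batched move from staging to destination (table and column names are placeholders; the staging table is assumed to already have a clustered index matching the destination):

```sql
-- Move rows from staging into the destination 10,000 at a time to limit
-- lock escalation and transaction size. DELETE ... OUTPUT INTO moves and
-- removes each batch in a single statement.
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (10000) s
        OUTPUT deleted.Col1, deleted.Col2
        INTO dbo.Destination (Col1, Col2)
    FROM dbo.Staging AS s;

    SET @rows = @@ROWCOUNT;

    COMMIT TRANSACTION;
    -- Note: under FULL recovery, checkpoints alone do not free log space;
    -- frequent log backups during the load are what keep the log in check.
END;
```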
You might also consider using a pre-processor (C# perhaps) that takes the log file and builds a new one with the proper sort order.
Also, I think you are safer using sys.indexes etc. than hard-coding the index structures in the code. Microsoft is far less likely to change a column name in sys.indexes than someone at your shop (no offense intended) is to change an index but forget to update the hard-coded definition in the procedure.
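As a starting point, the key columns of every clustered index can be read from the catalog views like this (a sketch assuming simple, non-partitioned indexes; included columns and options would need additional handling):

```sql
-- List each user table's clustered index and its key columns in order,
-- from which CREATE INDEX / ALTER TABLE ... ADD CONSTRAINT DDL can be built.
SELECT  OBJECT_NAME(i.object_id)  AS table_name,
        i.name                    AS index_name,
        c.name                    AS column_name,
        ic.key_ordinal,
        ic.is_descending_key
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
      ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns AS c
      ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.type = 1                               -- 1 = clustered
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY table_name, ic.key_ordinal;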