
SSIS Recommended Practices for handling "bad" data

Way back in 2000, I did some ETL development with SQL Server DTS packages. I started out using the designer to create the packages but eventually settled on using the SQL-DMO (Distributed Management Objects) library to dynamically generate packages. At a high level, the generated packages would read an input file and transfer the data to a staging table. Any records that did not conform to the expected column data type were kicked out as error records, along with an error-message record explaining why each record failed validation. This code was written generically, once.

To accomplish this, the data was first DTSed from the flat CSV source file into an "untyped" destination table, where the columns matched those in the input file but were all defined as varchar(255). The intent was first to get the raw data into a table so that dynamically generated SQL statements could be run against the untyped table to identify data that we knew would not convert to the expected data type. Records of the wrong type or length were automatically kicked out to an error table with the same layout as the input file. The common processing library would then dynamically create a DTS package to transfer the remaining records into a "typed" table, where all of the columns now had the expected types rather than just varchar. All of these tables were dynamically created at run time, too, based on the definition of the expected file layout.
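The kick-out step described above can be sketched as code that generates validation SQL from column metadata. This is a hypothetical sketch in Python (the original library was .NET): the table names, the column list, and the ISNUMERIC/ISDATE-based checks are illustrative assumptions, not the original implementation.

```python
# Sketch: generate T-SQL that flags rows in an all-varchar staging table
# whose values will not convert to the target column types.
# Column metadata (name, target type, max length) is hypothetical.

COLUMNS = [
    ("CustomerId", "int", None),
    ("OrderDate", "datetime", None),
    ("Region", "varchar", 10),
]

def validation_predicates(columns):
    """One SQL predicate per column that is true when the value is 'bad'."""
    preds = []
    for name, sql_type, max_len in columns:
        if sql_type == "int":
            preds.append(f"ISNUMERIC([{name}]) = 0")
        elif sql_type == "datetime":
            preds.append(f"ISDATE([{name}]) = 0")
        elif sql_type == "varchar" and max_len is not None:
            preds.append(f"LEN([{name}]) > {max_len}")
    return preds

def kickout_sql(staging_table, error_table, columns):
    """INSERT bad rows into the error table; the caller then deletes them
    from staging before the typed transfer runs."""
    where = "\n   OR ".join(validation_predicates(columns))
    return (f"INSERT INTO [{error_table}]\n"
            f"SELECT * FROM [{staging_table}]\n"
            f"WHERE {where};")

print(kickout_sql("stg_Orders_untyped", "stg_Orders_errors", COLUMNS))
```

A real version would also record the failing column and reason per row, as the original system did.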

There was more to it after this initial phase, but getting the data into the database, identifying "bad" data, and recording the reason each record is bad is once again the focus of a new system, this time using SSIS packages. I was thinking of recreating what I did back in 2000 by generating SSIS packages and building a common .NET library to handle this process, but I wanted to get advice on what other people are doing that may be better.

My personal preference for tools leans away from developing SSIS packages visually in the designer. I feel I have the most control and the best development environment when I use .NET to create a reusable library that abstracts the whole process and leverages the file system, SSIS, and SQL Server, while still having a rich development language.

I found that putting file definitions in a table had all sorts of side benefits. For example, I decided that every file definition would have a standard header that uniquely identified which file format should be used to parse it. I then wrote an editor that would display the contents of a file in a grid and allow editing, or dynamically generate a DTS package on the fly and transfer the data to a table to allow queries against the data as a standalone tool. Furthermore, I generated Word documents showing the file layouts as documentation, and created libraries that could be used by standard tools to extract data in the specified format, so source systems using them were guaranteed to provide the data in the correct format.
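The standard-header idea can be sketched roughly as follows. This is a hypothetical sketch: the header layout (a format ID as the first field) and the registry of formats are illustrative assumptions, not the original design.

```python
# Sketch: identify which stored file definition to use for parsing,
# based on a standard header line. The format IDs and column lists
# stand in for rows of a file-definition table.

FORMATS = {
    "ORD01": ["CustomerId", "OrderDate", "Region"],
    "CUS02": ["CustomerId", "Name", "Email"],
}

def identify_format(header_line):
    """The first field of the header line names the file format."""
    format_id = header_line.split(",")[0].strip()
    if format_id not in FORMATS:
        raise ValueError(f"Unknown file format: {format_id}")
    return format_id, FORMATS[format_id]
```

With the definition resolved up front, the same lookup can drive the parser, the table generator, and the documentation generator.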

So, finally, my questions:

  • How does Microsoft recommend that bad data be handled and processed, and how can this be done without reinventing the wheel in every package? I'm an SSIS rookie.


There is no "silver bullet" for handling erroneous data on extraction, and I doubt there is a single practice for it. Usually the following options are available when importing data from an external source (which is basically what "extraction" is):

  1. Stop at the very first bad entry and report it.
  2. Process the whole import and report all bad entries; do not import anything if some data are bad.
  3. Proceed with the import despite the bad data, putting the error records into a separate table.

As long as your implementation supports these three scenarios and lets the user choose between them, you are more or less covering all possible applications.
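The three options above can be sketched as one import routine parameterized by policy. The `validate` callback and the row representation here are hypothetical stand-ins; in SSIS this maps to redirecting rows via a component's error output.

```python
# Sketch of the three error-handling policies described above.
# Returns (rows_to_import, bad_entries); bad entries carry an index,
# the row, and the reason it failed validation.

def import_rows(rows, validate, mode):
    """mode: 'stop_first' | 'all_or_nothing' | 'route_errors'."""
    good, bad = [], []
    for i, row in enumerate(rows):
        error = validate(row)
        if error is None:
            good.append(row)
        else:
            bad.append((i, row, error))
            if mode == "stop_first":
                return [], bad          # report the first bad entry, import nothing
    if mode == "all_or_nothing" and bad:
        return [], bad                  # report all bad entries, import nothing
    return good, bad                    # import good rows, route bad rows aside
```

For example, with `validate = lambda r: None if isinstance(r, int) else "not an int"`, the `route_errors` mode imports the good rows and collects the rest, while the other two modes import nothing when any row fails.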

