Field-specific errors for ETL
I am creating an ETL process in MS SQL Server and I would like to have errors specific to a particular column of a particular row. For example, the data is initially loaded from Excel files into a table (we'll call it the Initial table) where all columns are varchar(2000), and then I stage the data to another table (the DataTypedTable) that contains more specific data types (datetime, int, etc.) or more tightly constrained varchar lengths. I need to be able to create error messages for a specific field, such as:
"Jan. 13th" is not a valid date format for the submission date. Please use a format of MM/DD/YYYY
These error messages need to be stored so that, later in the process, an automated process can create reports in which each message references a specific row and field (someone will need to go back, correct the data in the source system, and resubmit the Excel file). So ideally each error would be inserted into a Failures table of some sort and contain the primary key of the failed row, the column name, and the error message.
Question: Can this be accomplished with SSIS, or with an open source tool like Talend, and if so, what would be your general approach? Or what hand-coded approach would you take?
A couple of approaches I've thought of using SQL (up until now I have done ETL by hand in SQL procs, but I want to consider other approaches, possibly even C#):
Use a cursor to read through the Initial table, and for each row insert a blank record with only the primary key into the DataTyped table. Then use a single update statement for each column, so that if an update fails I can insert an error message specific to that column into the error messages table.
Insert all the data as is into the DataTyped table, but have duplicate columns like SubmissionDate and SubmissionDateOld. After the initial insert the *Old columns have data and the rest are blank, and I have a single update for each column that sets SubmissionDate based on SubmissionDateOld.
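Both approaches come down to converting one column at a time so that a failure can name the row and the field. Here is a minimal Python sketch of that idea, assuming the Initial rows are available as dictionaries of strings. The column names (Id, SubmissionDate, Quantity), the CONVERTERS table, and the stage_rows helper are all hypothetical, chosen only for illustration. It is not a finished ETL step.

```python
from datetime import datetime


def parse_date(text):
    """Parse MM/DD/YYYY; raises ValueError for anything else."""
    return datetime.strptime(text.strip(), "%m/%d/%Y")


# Hypothetical mapping: column name -> (converter, hint shown to the submitter).
CONVERTERS = {
    "SubmissionDate": (parse_date, "Please use a format of MM/DD/YYYY"),
    "Quantity": (int, "Please use a whole number"),
}


def stage_rows(initial_rows, key="Id"):
    """Convert each column on its own so a failure can name the row and field."""
    typed_rows, failures = [], []
    for row in initial_rows:
        typed = {key: row[key]}
        for column, (convert, hint) in CONVERTERS.items():
            raw = row[column]
            try:
                typed[column] = convert(raw)
            except ValueError:
                # Leave the typed value empty and record a field-level failure.
                typed[column] = None
                failures.append({
                    "row_key": row[key],
                    "column": column,
                    "message": f'"{raw}" is not a valid value for {column}. {hint}',
                })
        typed_rows.append(typed)
    return typed_rows, failures


initial = [
    {"Id": 1, "SubmissionDate": "01/13/2011", "Quantity": "5"},
    {"Id": 2, "SubmissionDate": "Jan. 13th", "Quantity": "five"},
]
typed, failures = stage_rows(initial)
for failure in failures:
    print(failure["row_key"], failure["column"], failure["message"])
```

Each entry in failures carries the three things the Failures table would need: the primary key of the row, the column name, and the message.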
In addition to suggesting an approach, I'd like to know if you are using that approach or something similar already in the work you do.
I use the approach where I put a conditional split into the data flow. The records which fail the conditions (invalid date, no data in a required field, etc.) are sent to an exception table which includes the record identifier, the bad data, and the reason it failed. Later on you can create a spreadsheet or text file of errors from this information to send back to the group providing the file. Good records of course go down the other path and are inserted into the table.
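To show the routing logic in plain code rather than in the SSIS designer, here is a rough Python sketch of the same split. It is not SSIS itself, and the field names (RecordId, CustomerName, Amount), the two checks, and the helper functions are assumptions made up for the example.

```python
import csv
import io


# Hypothetical conditions: each returns a failure reason, or None if the record passes.
def check_required(record):
    if not record.get("CustomerName", "").strip():
        return "CustomerName is required"
    return None


def check_amount(record):
    try:
        float(record["Amount"])
    except ValueError:
        return "Amount is not a number"
    return None


CONDITIONS = [("CustomerName", check_required), ("Amount", check_amount)]


def conditional_split(records):
    """Send passing records down one path and failures to an exception list."""
    good, exceptions = [], []
    for record in records:
        failed = []
        for field, check in CONDITIONS:
            reason = check(record)
            if reason:
                failed.append((field, reason))
        if failed:
            for field, reason in failed:
                exceptions.append({
                    "RecordId": record["RecordId"],
                    "BadData": record[field],
                    "Reason": reason,
                })
        else:
            good.append(record)
    return good, exceptions


def exceptions_report(exceptions):
    """Render the exception rows as CSV text to send back to the data provider."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["RecordId", "BadData", "Reason"])
    writer.writeheader()
    writer.writerows(exceptions)
    return buffer.getvalue()


records = [
    {"RecordId": "A1", "CustomerName": "Acme", "Amount": "10.50"},
    {"RecordId": "A2", "CustomerName": "", "Amount": "ten"},
]
good, exceptions = conditional_split(records)
report = exceptions_report(exceptions)
print(report)
```

A record that fails more than one condition produces one exception row per failed field, so the report can point at each bad value separately.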
How about some cleaning/transformation before loading into the staging tables (what you call the Initial tables)? Dump the data from Excel to a tab- or comma-separated file and then use a programming language of your choice to do the data cleansing you described. Also, how big is each data load? You can use a multi-threaded or multi-process application to handle major loads (like loading a few million rows at a time). Any error you encounter during this process can be loaded into the exception table with identifier, error, and comment details. This technique gives you better control during the data cleaning phase.
If the load is not that high and you want to do most of your work in the database (SQL), then you may want to do as much data profiling as possible and get a good understanding of the data variations you can expect. With that you can use the appropriate component (Talend or SSIS) to do the transformation or control the data flow. Also, by using regular expressions you can catch any value that deviates from the set rule.
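As a small illustration of the regular expression idea, here is a Python sketch that flags values deviating from a per-column pattern. The columns, the patterns, and the find_deviations helper are assumptions for the example. In practice the rules would come out of the data profiling step.

```python
import re

# Hypothetical rules: column name -> pattern the whole value must match.
RULES = {
    "SubmissionDate": re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/[0-9]{4}"),
    "ZipCode": re.compile(r"[0-9]{5}(-[0-9]{4})?"),
}


def find_deviations(row_key, row):
    """Return (row_key, column, value) for every value that breaks its rule."""
    deviations = []
    for column, pattern in RULES.items():
        value = row.get(column, "")
        if not pattern.fullmatch(value):
            deviations.append((row_key, column, value))
    return deviations


print(find_deviations(7, {"SubmissionDate": "Jan. 13th", "ZipCode": "12345"}))
print(find_deviations(8, {"SubmissionDate": "01/13/2011", "ZipCode": "12345-6789"}))
```

A pattern like this only checks the shape of the value. A string such as 02/31/2011 would still pass, so a real date parse is needed afterwards to catch impossible calendar dates.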