Loading Multiple Tables using SSIS keeping foreign key relationships
I am trying to load data from a single file (with million+ records) into multiple tables on SQL Server using SSIS while maintaining the relationships defined in the file.
To elaborate with an example, let's assume I am trying to load a tab-separated file containing employee names, the offices they have occupied in the past, and their job title history.
File:
EmployeeName<tab>OfficeHistory<tab>JobLevelHistory
John Smith<tab>501<tab>Engineer
John Smith<tab>601<tab>Senior Engineer
John Smith<tab>701<tab>Manager
Alex Button<tab>601<tab>Senior Assistant
Alex Button<tab>454<tab>Manager
If my Office database schema has the following tables:
Employee (nId, name)
Office (nId, number)
JobTitle (nId, titleName)
Employee2Office (nEmpID, nOfficeId)
Employee2JobTitle (nEmpId, nJobTitleID)
How can I use SSIS to load the file into the schema above, auto-generating IDs for Employee, Office, and JobTitle while maintaining the relationships between employees and offices, and between employees and job titles?
So in this case, the tables should look like:
Employee
1 John Smith
2 Alex Button
Office
1 501
2 601
3 701
4 454
JobTitle
1 Engineer
2 Senior Engineer
3 Manager
4 Senior Assistant
Employee2Office
1 1
1 2
1 3
2 2
2 4
Employee2JobTitle
1 1
1 2
1 3
2 4
2 3
I am a newbie to SSIS and have not played around with auto-generating IDs and establishing foreign key relationships while executing a Data Flow Task. Any pointers would be appreciated.
Thanks!
An interesting question. Here is how I would do it (Sql Server 2005). (I'm assuming this is a monthly job and not just a one time, so I added code for repeatability.)
- Create three variables for the Employee, JobTitle, and Office tables (type = Object)
- Use three SQL tasks to select the rows from each of those three tables into the corresponding variables.
- Add a dataflow task.
- Select from your flat file using a flat file source.
The output goes into a script component configured with: the three flat file columns as input; the three table variables imported into the script; five outputs, each with the same exclusion group number and the input marked as synchronous for that output; and seven new columns added to the outputs (three for the employee ID, one per output it appears in; two for the job ID; two for the office ID). Use the following code (a reference to System.Xml.dll had to be added to make this all work):
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.Collections
Imports System.Data.OleDb

Public Class ScriptMain
    Inherits UserComponent

    Private da As New OleDbDataAdapter
    Private emp As New DataTable
    Private emph As New Hashtable()
    Private job As New DataTable
    Private jobh As New Hashtable()
    Private off As New DataTable
    Private offh As New Hashtable()
    Private maxempid As Integer
    Private maxjobid As Integer
    Private maxoffid As Integer

    Public Overrides Sub PreExecute()
        maxempid = 0
        maxjobid = 0
        maxoffid = 0

        ' Load the existing rows of each table into a hashtable (name -> id)
        ' and remember the highest id seen so far.
        da.Fill(emp, Me.Variables.EmpTab)
        For Each dr As DataRow In emp.Rows
            emph.Add(dr.Item("Name"), dr.Item("nID"))
            If CInt(dr.Item("nID").ToString) > maxempid Then
                maxempid = CInt(dr.Item("nID").ToString)
            End If
        Next

        da.Fill(job, Me.Variables.JobTab)
        For Each dr As DataRow In job.Rows
            jobh.Add(dr.Item("titleName"), dr.Item("nID"))
            If CInt(dr.Item("nID").ToString) > maxjobid Then
                maxjobid = CInt(dr.Item("nID").ToString)
            End If
        Next

        da.Fill(off, Me.Variables.OffTab)
        For Each dr As DataRow In off.Rows
            offh.Add(dr.Item("number"), dr.Item("nID"))
            If CInt(dr.Item("nID").ToString) > maxoffid Then
                maxoffid = CInt(dr.Item("nID").ToString)
            End If
        Next

        emp.Dispose()
        job.Dispose()
        off.Dispose()
        da.Dispose()

        MyBase.PreExecute()
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Employee: assign a new id to unseen names, otherwise reuse the cached one.
        If Not emph.ContainsKey(Row.EmployeeName) Then
            maxempid += 1
            emph.Add(Row.EmployeeName, maxempid)
            Row.EmpId = maxempid
            Row.Emp2Id = maxempid
            Row.Emp3Id = maxempid
            Row.DirectRowToEmployee()
        Else
            Row.EmpId = CInt(emph.Item(Row.EmployeeName).ToString)
            Row.Emp2Id = CInt(emph.Item(Row.EmployeeName).ToString)
            Row.Emp3Id = CInt(emph.Item(Row.EmployeeName).ToString)
        End If

        ' Job title: same pattern.
        If Not jobh.ContainsKey(Row.JobLevelHistory) Then
            maxjobid += 1
            jobh.Add(Row.JobLevelHistory, maxjobid)
            Row.JobId = maxjobid
            Row.Job2Id = maxjobid
            Row.DirectRowToJobTitle()
        Else
            Row.JobId = CInt(jobh.Item(Row.JobLevelHistory).ToString)
            Row.Job2Id = CInt(jobh.Item(Row.JobLevelHistory).ToString)
        End If

        ' Office: same pattern.
        If Not offh.ContainsKey(Row.OfficeHistory) Then
            maxoffid += 1
            offh.Add(Row.OfficeHistory, maxoffid)
            Row.OffId = maxoffid
            Row.Off2Id = maxoffid
            Row.DirectRowToOfficeNumber()
        Else
            Row.OffId = CInt(offh.Item(Row.OfficeHistory).ToString)
            Row.Off2Id = CInt(offh.Item(Row.OfficeHistory).ToString)
        End If

        ' Every input row produces a row on both many-to-many outputs.
        Row.DirectRowToEmp2Job()
        Row.DirectRowToEmp2Off()
    End Sub
End Class
The script generates IDs for new values in the input data. It loads the existing tables into hashtables in the PreExecute portion, then for each input row checks whether the name already exists: if not, it increments the max ID, adds the new value to the hash, and directs the row to the appropriate output (Employee, JobTitle, or Office); if it does exist, it retrieves the ID from the hash. All rows, regardless of the above, are written to the two remaining outputs (Emp2Job and Emp2Off).
- Next, send each output through a Lookup (to check for rows that already exist in the destination table), then into an OLE DB destination. For the Employee, JobTitle, and Office destinations, check the box for identity inserts; for Emp2Job and Emp2Off, uncheck check constraints.
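To make the key-assignment logic concrete, here is a minimal Python sketch of what the script component above does for one table: cache existing name-to-ID mappings, hand out max(ID)+1 for unseen values, and record one ID per input row for the many-to-many outputs. The function name and the seeding of existing IDs are illustrative only; in the package this runs per-row inside the data flow.

```python
def assign_ids(values, existing=None):
    """Assign surrogate keys the way the script's hashtables do."""
    ids = dict(existing or {})                 # name -> id cache (the Hashtable)
    next_id = max(ids.values(), default=0)     # the max*id counter
    new_rows = []                              # rows to insert into the base table
    mapping = []                               # id per input row (many-to-many side)
    for value in values:
        if value not in ids:
            next_id += 1
            ids[value] = next_id
            new_rows.append((next_id, value))
        mapping.append(ids[value])
    return new_rows, mapping

# The sample file from the question:
lines = [
    ("John Smith", "501", "Engineer"),
    ("John Smith", "601", "Senior Engineer"),
    ("John Smith", "701", "Manager"),
    ("Alex Button", "601", "Senior Assistant"),
    ("Alex Button", "454", "Manager"),
]

emp_new, emp_ids = assign_ids([l[0] for l in lines])
off_new, off_ids = assign_ids([l[1] for l in lines])
job_new, job_ids = assign_ids([l[2] for l in lines])

# One mapping row per input line, as in the Emp2Off / Emp2Job outputs:
emp2off = list(zip(emp_ids, off_ids))   # -> [(1, 1), (1, 2), (1, 3), (2, 2), (2, 4)]
emp2job = list(zip(emp_ids, job_ids))   # -> [(1, 1), (1, 2), (1, 3), (2, 4), (2, 3)]
```

Running this over the sample file reproduces exactly the tables shown in the question, which is a handy way to sanity-check the script's behavior before wiring it into the package.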
If you are sure that your referential integrity is OK with the data you want to load, you can disable the foreign key constraints in a Script Task, execute the data flow with parallel data loads, and re-enable the constraints after the load finishes. If there is something wrong with the data, the operation will fail, so you will have to design a rollback or cleanup strategy.
The other option is to load the data serially, starting with the parent tables and finishing with the child tables. I think this is the 'safer' option, as it doesn't expose your data integrity to other users who may be using these tables at the time of the ETL load. I would prefer this option.
Here's how - it's slightly difficult to explain in text only, but I'll give it a shot:
Define the employee, office and job title tables in your database with an identity column, so that IDs will be generated automatically.
Define the many-to-many tables without identity columns (an identity is neither needed nor useful there)
In your SSIS data flow, you have to do this in a couple of passes, in order to establish the IDs in the database first, then come back and insert the many-to-many rows.
Make one data flow:
- Put a data source to read in the file
- Split it into three copies with a Multi-Cast: one for employees, one for offices, one for titles.
- For each, put a Sort (this is generally a no-no, but since the source is text and not a database, we have to go there). Set the sort to pass through only one of the three fields, and check the option in the sort to remove duplicates. This makes a unique list (like select distinct) for each base table.
- Put a destination for each of the three, connected to each table.
After the first data flow, add a second data flow. This one will populate the many-to-many relation rows.
- Read the file with a data source
- Add a Lookup that matches the Employee Name in the database and returns the Employee ID. That gets you the Employee ID that was generated above. (This is usually called looking up a surrogate key by its business or natural key.)
- Add a Lookup that matches the Title in the database and returns the Title ID
- Add a Lookup that matches the Office in the database and returns the Office ID
- Again, multi-cast the results into two copies, one for employee-office and one for employee-title
- Depending on the logic you need, perhaps use a Sort again to deduplicate these (it depends on the details of how you are normalizing the input)
- Put the results into the many-to-many tables with two destinations.
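The two passes above can be simulated in a few lines of Python to see what lands in each table. This is only a sketch of the data-flow logic, not SSIS code: `enumerate` stands in for the identity column from pass 1, the dictionary lookups stand in for the Lookup components in pass 2, and the dedup helper plays the role of the Sort with "remove duplicates" checked.

```python
# The sample file from the question, already split on tabs:
lines = [
    ("John Smith", "501", "Engineer"),
    ("John Smith", "601", "Senior Engineer"),
    ("John Smith", "701", "Manager"),
    ("Alex Button", "601", "Senior Assistant"),
    ("Alex Button", "454", "Manager"),
]

def distinct_in_order(values):
    """Multi-Cast + Sort-with-dedup: unique values, first occurrence wins."""
    return list(dict.fromkeys(values))

# Pass 1: load distinct values into each base table; the id is what the
# IDENTITY column would generate on insert.
employee = {name: i for i, name in enumerate(distinct_in_order(l[0] for l in lines), 1)}
office   = {num:  i for i, num  in enumerate(distinct_in_order(l[1] for l in lines), 1)}
jobtitle = {t:    i for i, t    in enumerate(distinct_in_order(l[2] for l in lines), 1)}

# Pass 2: re-read the file, Lookup each natural key to get its surrogate key,
# then dedupe and insert the junction rows.
emp2office = distinct_in_order((employee[n], office[o]) for n, o, _ in lines)
emp2title  = distinct_in_order((employee[n], jobtitle[t]) for n, _, t in lines)
```

The key design point is the same as in the SSIS package: the base tables must be committed before the second pass starts, because the Lookups can only return IDs that already exist in the database.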
You can also write a stored procedure that first inserts data into the base tables (Employee, Office, and JobTitle) and then reads your file row by row. After that, you search the base tables to get the identities and insert the data into the relation tables.