开发者

Masking or hiding inaccurately entered data in SQL Server 2008

OK, so my subject line isn't very descriptive, but here's the scenario:

An end-user has a legal obligation to submit transaction data to a government agency. The transactions contain the name and address of various individuals and organizations. HOWEVER, end users frequently misspell the names of the reported individuals and organizations, or they badly mangle the address, etc.

The information submitted by the end user is a legal 'document', so it cannot be altered by the agency that received it. Also, the transactions can be viewed and searched by the public. When the government agency notices an obvious misspelling or bad address, they开发者_开发问答 would like to 'hide' or 'mask' that bad value with a known good value. For example, if an end user entered 'Arnie Schwarzeger', the agency could replace that name with 'Arnold Schwarzenegger'. The public that viewed the data would see (and search for) the correct spelling, but could view the original data as entered by the end user after they found the data record in question.

Hopefully that explains the business case well enough...on to the SQL part! So to address this problem, we have tables that look like this:

CREATE TABLE [dbo].[SomeUserEnteredData](
    [Id] [uniqueidentifier] NOT NULL,
    [LastOrOrganizationName] [nvarchar](350) NOT NULL,  // data as entered by end-user
    [FirstName] [nvarchar](50) NULL, // data as entered by end-user
    [FullName]  AS ([dbo].[FullNameValue]([FirstName],[LastName])) PERSISTED,  // data as entered by end-user
    [MappedName]  AS ([dbo].[MappedNameValue]([FirstName],[LastName])))  // this is the 'override' data from the agency

CREATE TABLE [dbo].[CorrectionsByAgency](
    [Id] [uniqueidentifier] NOT NULL,
    [ReplaceName] [nvarchar](400) NOT NULL,
    [KeepName] [nvarchar](400) NOT NULL)

CREATE FUNCTION [dbo].[FullNameValue]
(
    @FirstName as NVARCHAR(40),
    @LastOrOrganizationName as NVARCHAR(350)
)
RETURNS NVARCHAR(400)
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @result NVARCHAR(400)
    IF @FirstName = '' OR @FirstName is NULL
        SET @result = @LastOrOrganizationName
    ELSE 
        SET @result = @LastOrOrganizationName + ', ' + @FirstName
    RETURN @result
END

CREATE FUNCTION [dbo].[MappedNameValue]
(
    @FirstName as NVARCHAR(50),
    @LastOrOrganizationName as NVARCHAR(350)
)
RETURNS NVARCHAR(400)
AS
BEGIN
    DECLARE @result NVARCHAR(400)
    DECLARE @FullName NVARCHAR(400)
    SET @FullName = dbo.FullNameValue(@FirstName, @LastOrOrganizationName)
    SELECT top 1 @result = KeepName from CorrectionsByAgency where ReplaceName = @FullName
    if @result is null
        SET @result = @FullName
    RETURN @result
END

Hopefully, if my sample isn't TOO convoluted, you can see that if the agency enters a name correction, it will replace all occurrences of the misspelled name. From a business logic perspective, this works exactly right: the agency staff only enters a few corrections and the corrections can override everywhere there are misspelled names.

From a server performance standpoint, this solution STINKS. The calculated SomeUserEnteredData.MappedName column can't be indexed, and no view that reads from that column can be indexed either! There's no way this can work for our needs if we can't index the MappedName values.

The only alternative I've been able to see as a possibility is to create an additional linking table between the end-user created data and the agency created data -- when the agency enters a correction record, a record is created in the linking table for every occurrence of the bad column value. The down side to this seems to be the very real likelihood of creating/destroying many (hundreds of thousands) of those linking records for every correction entered by an agency user...

Do any of you SQL geniuses out there have great ideas about how to address this problem?


I'm not sure if this is answering your question directly, but I would try to simplify the whole thing: stop using functions, persist "calculated" values and use application logic (possibly in a stored procedure) to manage the data.

Assuming that one agency correction can be applied to many user-entered names, then you could have something like this:

create table dbo.UserEnteredData (
  DocumentId uniqueidentifier not null primary key,
  UserEnteredName nvarchar(1000) not null,
  CorrectedNameId uniqueidentifier null,
  constraint FK_CorrectedNames foreign key (CorrectedNameId)
    references dbo.CorrectedNames (CorrectedNameId)
)

create table dbo.CorrectedNames (
  CorrectedNameId uniqueidentifier not null primary key,
  CorrectedName nvarchar(1000) not null
)

Now, you need to make sure your application logic can do something like this:

  1. External user enters dirty data
  2. Agency user reviews the dirty data and identifies both the incorrect name and the corrected name
  3. Application checks if the corrected name already exists
  4. If no, create a new row in dbo.CorrectedNames
  5. Create a new row in dbo.UserEnteredData, with the CorrectedNameId

I'm assuming that things are rather more complicated in reality and corrections are made based on addresses and other data as well as just names, but the basic relationship you describe seems simple enough. As you said, the functions add a lot of overhead and it's not clear (to me) what benefit they provide over just storing the data you need directly.

Finally, I don't understand your comment about creating/destroying linking records; it's up to your application logic to handle data changes correctly.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜