开发者

Removing privacy data from a database?

Say that I needed to share a database with a partner. Obviously I have customer information in that database. Short of going through and identifying ev开发者_如何学Goery column that contains privacy information and a custom script to 'scrub' the data, is there any tool or script which can scrub the data, but keep the format in tact (for example, if a string is 5 characters, it would stay 5 characters, only scrubbed)?

If not, how would you accomplish something like this, preferably in TSQL?


You may consider only share VIEW, create VIEWs to hide data that you don't want share.

Example:

CREATE VIEW v_customer
AS
SELECT 
   NAME,
   LEFT(CreditCard,5) + '****' As CreditCard  -- OR, don't show this column at all
   ....
FROM customer


Firstly I need to state professional interest I work for IBM which has tools that do exactly this.

Step 1. Ensure you identify all the PII (Personally Identifiable Information). When sharing database information it is typical that the obvious column names like "name" are found but you also need to find the "hidden" data where either the data is embedded in a standard format eg string-name-string and column name is something like "reference code" or is in free format text fields . as you have seen this is not going to be an easy job unless you automate it. The Tool for this is InfoSphere Discovery

Step 2. What context does the "scrubbed" data need to be in. Changing named fields to random characters has problems when testing as users focus on text errors rather than functional failures, therefore change names to real but ficticious. Credit card information often needs to be "valid". by that I mean it needs to have a valid prefix say 49XX but the rest an invalid sequence. Finally you need to ensure that every instance of the change is propogated through the database to maintain consistency. Tool for this is Optim Test Data Management with Data Privacy option.

The two tools integrate to give a full data privacy solution.


Based on the original question, it seems you need the fields to be the same length, but not in a "valid" format? How about:

UPDATE customers
   SET email = REPLICATE('z', LEN(email))
       -- additional fields as needed

Copy/paste and rename tables/fields as appropriate. I think you're going to have a hard time finding a tool that's less work, unless your schema is very complicated, or my formatting assumptions are incorrect.

I don't have an MSSQL database in front of me right now, but you can also find all of the string-like columns by something like:

SELECT *
  FROM INFORMATION_SCHEMA.COLUMNS
 WHERE DATA_TYPE IN ('...', '...')

I don't remember the exact values you need to compare for, but if you run the query and see what's there, they should be pretty self-explanatory.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜