How to overcome Netezza's lack of unique constraint/referential integrity enforcement?
It appears that the lack of support for enforcing 2 basic constraints (unique and foreign key), is a reason for loss of many man-hours debugging and troubleshooting difficult problems. What starts as a trivial, easy-to-fix problem (dup rows /inconsistent primary-detail tables) goes un-noticed, grows and raises complex edge-scenarios in our application, or even in our hardware (e.g. self-join with dups can cause inflation and storage exhaustion).
- Netezza serves multiple purposes in our environment: production, research, qa and staging. Naturally our ETL processes can’t be mature and can’t verify all constraints in all those scenarios.
- Even in our most mature applications used in production, where data is verified while ETL loads it, we create a series of tables each the result of a calculation on its开发者_如何学Go predecessors. Sometimes the data-integrity is broken along the way, not right on the start (as a result of a buggy statement)
Can anyone recommend a methodology/tool for avoiding these headaches?
I understand there is an accepted answer, though I wanted to contribute an alternate approach. Being new to my current position, I am not privy to all the business decisions behind primary key declarations in our warehouse. I developed logging type approach to track duplicate row removal efforts over time. Here are the major features of this design:
- Always up to date, addressing the fluid nature of DDL / DML
- New / dropped tables
- New / updated primary keys
- New / updated / deleted rows
- Self-populating history
- Tracks improvement over time
- Provides basis for trending analysis at all levels
- Easily query target tables for research purposes
- No self joins with HAVING clause or lookup of key columns required
- Addresses only Primary Keys at this time
- Can easily be expanded to address Unique constraints (CONTYPE = 'u' in _V_RELATION_KEYDATA)
What follows is everything required from a Netezza perspective. Where noted, you will need to fill in gaps to create dynamic SQL.
First, I created a table that tracks the database, table and internal rowid of all duplicate records.
CREATE TABLE
NZ_DUPLICATE_PKS
(
DATABASE_NAME CHARACTER VARYING(128) NOT NULL
,TABLE_OWNER CHARACTER VARYING(128) NOT NULL
,TABLE_NAME CHARACTER VARYING(128) NOT NULL
,ROW_ID BIGINT NOT NULL
,CURRENT_RECORD_INDICATOR CHARACTER(1) NOT NULL
,CREATE_TIMESTAMP TIMESTAMP NOT NULL
,LAST_UPDATE_TIMESTAMP TIMESTAMP NOT NULL
)
DISTRIBUTE ON
(
ROW_ID
);
NOTE: YMMV on the distribution key and volume of rows going into the table. Row IDs in our Netezza applicance had an even enough natural distribution that it served me well on a Mustang based NPS 10050.
Next, a staging version of this table was created:
CREATE TABLE
STG_NZ_DUPLICATE_PKS
(
DATABASE_NAME CHARACTER VARYING(128)
,TABLE_OWNER CHARACTER VARYING(128)
,TABLE_NAME CHARACTER VARYING(128)
,ROW_ID BIGINT
,CURRENT_RECORD_INDICATOR CHARACTER(1)
,CREATE_TIMESTAMP TIMESTAMP
,LAST_UPDATE_TIMESTAMP TIMESTAMP
)
DISTRIBUTE ON
(
ROW_ID
);
I then created dynamic queries from system views to seed the staging table. Here is the base query I started with:
SELECT
DATABASE
,OWNER
,RELATION
,CONSTRAINTNAME
,ATTNAME
FROM
{YOUR_DATABASE_NAME}._V_RELATION_KEYDATA
WHERE
CONTYPE = 'p'
-- Exclude the duplicate tracking table
AND RELATION != 'NZ_DUPLICATE_PKS'
ORDER BY
DATABASE
,OWNER
,RELATION
,CONSTRAINTNAME
,CONSEQ
;
Now I loop through the base query to dynamically create insert queries. My shop uses DataStage, whose approach is esoteric and not worth expounding upon here.
NOTE: Here is where a little work is needed to loop and construct dynamic SQL. One could use myriad flavors of shell, Perl, Python, etc. Using a sample table with two column key, here is what to construct for insertion into the staging table:
INSERT
INTO
STG_NZ_DUPLICATE_PKS
(
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
,CURRENT_RECORD_INDICATOR
,CREATE_TIMESTAMP
,LAST_UPDATE_TIMESTAMP
)
SELECT
'{YOUR_DATABASE_NAME}' DATABASE_NAME
,'{YOUR_TABLE_OWNER}' TABLE_OWNER
,'{YOUR_TABLE_NAME}' TABLE_NAME
,DUPS.ROWID ROW_ID
,'Y' CURRENT_RECORD_INDICATOR
,CURRENT_TIMESTAMP CREATE_TIMESTAMP
,CURRENT_TIMESTAMP LAST_UPDATE_TIMESTAMP
FROM
{YOUR_TABLE_NAME} DUPS
INNER JOIN
(
SELECT
{KEY_COLUMN_1}
,{KEY_COLUMN_2}
FROM
{YOUR_TABLE_NAME}
GROUP BY
{KEY_COLUMN_1}
,{KEY_COLUMN_2}
HAVING
COUNT(*) > 1
)
KEYS
ON
DUPS.{KEY_COLUMN_1} = KEYS.{KEY_COLUMN_1}
AND DUPS.{KEY_COLUMN_2} = KEYS.{KEY_COLUMN_2};
After looping through all tables to seed the staging table, I then run a series of queries, treating the database, owner, table name and row ID like a slowly changing dimension. This query end dates records in the target table that do not exist in staging table:
UPDATE
NZ_DUPLICATE_PKS
SET
CURRENT_RECORD_INDICATOR = 'N'
,LAST_UPDATE_TIMESTAMP = CURRENT_TIMESTAMP
WHERE
CURRENT_RECORD_INDICATOR = 'Y'
AND
(
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
)
NOT IN
(
SELECT
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
FROM
STG_NZ_DUPLICATE_PKS
)
;
Finally, insert the latest records into the target table:
INSERT
INTO
NZ_DUPLICATE_PKS
(
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
,CURRENT_RECORD_INDICATOR
,CREATE_TIMESTAMP
,LAST_UPDATE_TIMESTAMP
)
SELECT
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
,CURRENT_RECORD_INDICATOR
,CREATE_TIMESTAMP
,LAST_UPDATE_TIMESTAMP
FROM
STG_NZ_DUPLICATE_PKS
WHERE
(
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
)
NOT IN
(
SELECT
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
FROM
NZ_DUPLICATE_PKS
WHERE
CURRENT_RECORD_INDICATOR = 'Y'
)
;
NOTE: Our environment is not such that an insert-only model is necessary. Netezza veterans will be familiar with this line of thought. If your environment is insert-only, adjust strategy accordingly.
Once everything is in place, it's a snap to locate duplicate rows for investigation:
SELECT
*
FROM
{YOUR_TABLE_NAME}
WHERE
ROWID IN
(
SELECT
ROW_ID
FROM
NZ_DUPLICATE_PKS
WHERE
CURRENT_RECORD_INDICATOR = 'Y'
AND DATABASE_NAME = '{YOUR_DATABASE_NAME}'
AND TABLE_OWNER = '{YOUR_OWNER_NAME}'
AND TABLE_NAME = '{YOUR_TABLE_NAME}'
);
I like this because it's simple and the same for all tables, regardless of differences in primary key declaration.
I also use this query a lot to view current primary key violations by table:
SELECT
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,COUNT(*) QUANTITY
FROM
NZ_DUPLICATE_PKS
WHERE
CURRENT_RECORD_INDICATOR = 'Y'
GROUP BY
1
,2
,3
ORDER BY
1
,2
,3;
That sums everything up. I hope some folks find it useful. We have already made a lot of progress with this approach. At this point, you may be wondering why I went to all this trouble. I detest that PK violations are allowed into our warehouse, and I desired a comprehensive approach to their eradication. The above process has been running daily in our production environment for a couple months. We have ~350 tables with primary keys declared, ranging in size from 5 row dimensions to ~200 million row facts @ 10Gb. For Netezza, this is a pretty modest outlay. The entire process takes less than 10 minutes on our Mustang NPS 10050.
We originally wrote a stored procedure to handle this very thing in our data warehouse.
While it works in theory, it is kind of slow for huge tables commonly found on NZ (~10 mins for a 500M record table).
I'll explain the stored proc below, but for now I wanted to say that we're not even leveraging the proc at the moment because our incremental/upsert loads are explicitly only inserting records that do not already exist in the target database. (For upsert, we're basically just deleting any records that exist in our set of records to insert before inserting).
This has it's own problems, particularly because NZ doesn't like deletes and needs to have the tables constantly groomed to actually reclaim the space, not to mention the fact that our upserts might lose historical column data by deleting old records that had since been changed (we have other processes for loading slowly changing dimensions that we want to track.)
Anyway, the constraint stored proc looks like:
check_constraints('table','constraint_type') returns boolean
and basically loops through:
select constraintname, contype, attname
from _v_relation_keydata
where relation = upper(p_table) order by constraintname, conseq;
To get the columns that should be compared. For each constraint, it then runs something like:
select count(*)
from (
select distinct col1,col2,...
from p_table
where col1 is not null
and col2 is not null... );
and compare that number to
select count(*)
from p_table
where col1 is not null
and col2 is not null...;
If they're different, we raise an exception.
Why not add a column that is a hash value of the columns to which you wish to apply the constraint? The NZ SQL extensions package has hashing functions. If you use hash8('key') where 'key' is a concatination of your constrained column values (cast to varchar) then you get a probably unique BIGINT value. I say 'probably' because the probability of hash collision is finite all-be-it low. If you need a stronger gaurantee of uniqueness you can use the cryptographic-level 'hash()' function instead.
Unless you have really huge quantities of data however, hash8 should be the better choice. Since it gives integer values, comparison of rows is a matter of simple arithmatic and as such is fast. So for example you can just update or insert rows on the basis of the hash value, which is just an integer comparison.
Netezza is a data warehouse. It doesn't have PKs, FKs, constraints, UKs, SKs, etc. It's a data warehouse - it's not supposed to. All tables should be flat tables because it's an MPP - distributed. Looks up distribution keys. Partitioning only. It's parallel processing - a bit like Map Reduce. Put your joins into a relational database (has keys, indexes), then move your flattened tables into Netezza. https://learn.microsoft.com/en-us/sql/analytics-platform-system/staging-database?view=aps-pdw-2016-au7#:~:text=A%20staging%20database%20is%20a,is%20loaded%20into%20the%20appliance.&text=For%20example%2C%20an%20ELT%20process,into%20the%20target%20fact%20table. Don't do any data modeling in Netezza. Use Oracle. And Netezza does have primary and foreign key constraints, and possibly even unique keys - but they are inactive - and that's also on purpose - it's tuned for running everything in parallel across an entire table. Snowflake is more forgiving in that it is partitioned both horizontally and vertically, plus is has clustered-like parallel processing, so it runs very fast in parallel (might not be as good as Netezza - or as cheap), but Snowflake is likely much better at filtering.
I see a comment above "business decisions behind primary key declarations in our warehouse" - so the people making those decisions the comment above. Unless of course you build your warehouse in Exadata, which does have indexes, but I have known some Exadata databases with all indexing removed speeding up quite substantially.
精彩评论