How to verify if two tables have exactly the same data?
Basically, we have one table (original table
) and it is backed up into another table (backup table
); thus the two tables have exactly the same schema.
In the beginning, both tables (original table
and backup table
) c开发者_如何学Pythonontains exactly the same set of data. After some time for some reason, I need to verify whether dataset in the original table
has changed or not.
In order to do this, I have to compare the dataset in the original table
against the backup table
.
Let's say the original table
has the following schema:
create table LemmasMapping (
lemma1 int,
lemma2 int,
index ix_lemma1 using btree (lemma1),
index ix_lemma2 using btree (lemma2)
)
How could I achieve the dataset comparison?
Update: the table does not have a primary key. It simply stores mappings between two ids.
You can just use CHECKSUM TABLE and compare the results. You can even alter the table to enable live checksums so that they are continuously available.
CHECKSUM TABLE original_table, backup_table;
It doesn't require the tables to have a primary key.
If you do not care about one table being less than the other, but you only care about differences in the attributes:
SELECT * FROM Table1
UNION
SELECT * FROM Table2
If you get records greater than the max number of the two tables, they don't have same data.
I would write three queries.
An inner join to pick up the rows where the primary key exists in both tables, but there is a difference in the value of one or more of the other columns. This would pick up changed rows in original.
A left outer join to pick up the rows that are in the original tables, but not in the backup table (i.e. a row in original has a primary key that does not exist in backup). This would return rows inserted into the original.
A right outer join to pick up the rows in backup which no longer exist in the original. This would return rows that have been deleted from the original.
You could union the three queries together to return a single result set. If you did this you would need to add a column to indicate what type of row it is (updated, inserted or deleted).
With a bit of effort, you might be able to do this in one query using a full outer join. Be careful with outer joins, as they behave differently in different SQL engines. Predicates put in the where clause, instead of the join clause can sometimes turn your outer join into an inner join.
Most DBMSs support relational algebra's DIFFERENCE operation. In Snowflake, it is represented as MINUS. So the following code would work. The first MINUS makes sure every element in Table1 is also in Table2. The second MINUS makes sure every element in Table2 is also in Table1. If the query returns a zero, mathematically, the two tables are exactly the same.
select count(*) from
(
(
SELECT * FROM Table1
MINUS
SELECT * FROM Table2
)
UNION ALL
(
SELECT * FROM Table2
MINUS
SELECT * FROM Table1
)
)
However, You can also use the 'Except' keyword in SQL Server.
SELECT Col1, Col2, Col3, Col4, Col5 FROM Table_1
EXCEPT
SELECT Col1, Col2, Col3, Col4, Col5 FROM Table_2
If the query returns any rows: They are not Identical.
If the query returns no rows: They are Identical.
select count(*)
from lemmas as original_table
full join backup_table using (lemma_id)
where backup_table.lemma_id is null
or original_table.lemma_id is null
or original_table.lemma != backup_table.lemma
The full join / check for null should cover additions or deletions as well as changes.
- backup.id is null = addition
- original.id is null = deletion
- neither null = change
Try the following to compare two tables:
SELECT 'different' FROM DUAL WHERE EXISTS(
SELECT * FROM (
SELECT /*DISTINCT*/ +1 AS chk,a.c1,a.c2,a.c3 FROM a
UNION ALL
SELECT /*DISTINCT*/ +1 AS chk,b.c1,b.c2,b.c3 FROM b
) c
GROUP BY c1,c2,c3
HAVING SUM(chk)<>2
)
UNION SELECT 'equal' FROM DUAL
LIMIT 1;
1: First get count for both the tables C1 and C2. C1 and C2 should be equal. C1 and C2 can be obtained from the following query
select count(*) from table1
if C1 and C2 are not equal, then the tables are not identical.
2: Find distinct count for both the tables DC1 and DC2. DC1 and DC2 should be equal. Number of distinct records can be found using the following query:
select count(*) from (select distinct * from table1)
if DC1 and DC2 are not equal, the tables are not identical.
3: Now get the number of records obtained by performing a union on the 2 tables. Let it be U. Use the following query to get the number of records in a union of 2 tables:
SELECT count (*)
FROM
(SELECT *
FROM table1
UNION
SELECT *
FROM table2)
You can say that the data in the 2 tables is identical if distinct count for the 2 tables is equal to the number of records obtained by performing union of the 2 tables. ie DC1 = U and DC2 = U
Please try the following method for determining if two tables are exactly the same, when there is no primary key of any kind and there are no duplicate rows within a table, using the below logic:
Step 1 - Test for Duplicate Rows on TABLEA
If SELECT DISTINCT * FROM TABLEA
has the same row count as
SELECT * FROM TABLEA
then go to the next step, otherwise you can't use this method...
Step 2 - Test for Duplicate Rows on TABLEB
If SELECT DISTINCT * FROM TABLEB
has the same row count as
SELECT * FROM TABLEB
then go to the next step, else you can't use this method...
Step 3 - INNER JOIN TABLEA to TABLEB on every column
If the row count of the below query has the same row count as the row counts from Steps 1 and 2, then the tables are the same:
SELECT
*
FROM
TABLEA
INNER JOIN TABLEA ON
TABLEA.column1 = TABLEB.column1
AND TABLEA.column2 = TABLEB.column2
AND TABLEA.column3 = TABLEB.column3
--etc...for every column
Note that this method doesn't necessarily test for different data types, and probably won't work on non-joinable data types (like VARBINARY)
Feedback welcome!
For the lazier or more SQL-averse developer working with MS SQL Server, I would recommend SQL Delta (www.sqldelta.com) for this and any other database-diff type work. It has a great GUI, is quick and accurate and can diff all database objects, generate and run the necessary change scripts, synchronise entire databases. Its the next best thing to a DBA ;-)
I think there is a similar tool available from RedGate called SQL Compare. I believe some editions of the latest version of Visual Studio (2010) also include a very similar tool.
Easy workaround for the quick gain:
SELECT sum(col1), sum(col2) FROM table1
UNION
SELECT sum(col1), sum(col2) FROM table2
Output is like:
sum(col1) | sum(col2) |
---|---|
11111345678 | 123456789101234 |
11111123456 | 123456789101234 |
and then check for differences by testing around:
SELECT count(col1), sum(col2) FROM table1 WHERE col1 = 0
UNION
SELECT count(col1), sum(col2) FROM table1 WHERE col1 = 0
count(col1) | sum(col2) |
---|---|
1234 | 123456789 |
345 | 6543210 |
or by group+sum() and then subtract one table's aggregation from the other table's aggregation and order by the difference.
More details, on the PostgreSQL case which is the same for this example, see Checking whether two tables have identical content in PostgreSQL.
精彩评论