Duplicate checking in SQL Server - one column or many?
In my line of work, we have to do a lot of duplicate checking on name and address data. In the past, I've always just created an index on the relevant columns and queried against those columns (firstname, lastname, address, city, state, zip, etc...) directly. Lately, I've been wondering if it makes more sense to hash those values before inserting and index only the hash for comparison. Assuming I use something like SHA-256 for the hash, I shouldn't have to worry about collisions (my data set isn't that large), and my index size would be only 32 bytes per record, or 64 bytes if I store it as a string, whereas the index on NACSZ data might be 200-300 bytes.
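For concreteness, here's a rough sketch of the layout I'm considering (table, column, and index names are just placeholders; the hash would be computed before insert, either in application code or with HASHBYTES('SHA2_256', ...) on SQL Server 2012+):
-- sketch of the idea: store a SHA-256 of the NACSZ values and index only that column
create table CustomerAddresses
(
    id            int identity(1,1) primary key,
    firstname     varchar(50),
    lastname      varchar(50),
    address       varchar(100),
    city          varchar(50),
    state         char(2),
    zip           varchar(10),
    record_sha256 varbinary(32) not null  -- computed before insert from the columns above
);

create nonclustered index IX_CustomerAddresses_record_sha256
    on CustomerAddresses (record_sha256);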
Is this a good idea? Am I stupid for not considering it before? Is there a valid reason not to do so? And as a matter of style, what would be a good name for this column?
Good idea? Sure. Are you stupid for not considering it before? Definitely not. I also can't think of any valid reason not to do so, particularly if your dataset isn't that large. Of course, that may make someone wonder if there is a valid reason to do so in the first place (i.e. if your dataset isn't that large, why do you care that the NACSZ index would be 200-300 bytes in size?).
As always, there are lots of things to consider to determine what is best for your scenario (e.g. usage patterns, selectivity, read vs. write ratios, etc.), so it's very hard for anyone to provide a definitive answer on something like this.
One thing to consider, given that you are using SQL Server (not sure which version, but I'll assume 2005 or later): though there is a built-in function for generating SHA/MDx hashes, it isn't as easy to use as the checksum hashing function. The hashbytes function requires you to build the input string to be hashed explicitly, whereas the checksum function can simply take a list of columns - a simple example shows the differences:
-- CHECKSUM - easily hash an entire record over a set, regardless of column datatypes
select checksum(c.*)
from sys.columns c
-- CHECKSUM - easily hash a subset of columns for a record over a set, regardless of column datatypes
select checksum(c.name, c.object_id, c.column_id)
from sys.columns c
-- HASHBYTES - this DOES NOT work
select hashbytes('MD5', c.*)
from sys.columns c
-- HASHBYTES - this DOES NOT work either
select hashbytes('MD5', c.name, c.object_id, c.column_id)
from sys.columns c
-- HASHBYTES - you have to explicitly build the string, casting to valid character-based datatypes
select hashbytes('MD5', c.name + cast(c.object_id as varchar(50)) + cast(c.column_id as varchar(50)))
from sys.columns c
The ease of use also comes in very handy when performing joins/unions/etc. across different tables/sets of data.
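For example, here is a rough sketch of finding incoming rows that already exist in a main table (the table names are just assumptions; the residual equality checks guard against checksum collisions):
select i.firstname, i.lastname, i.address, i.city, i.state, i.zip
from IncomingCustomers i
join Customers c
    -- match on the checksum of the NACSZ columns first...
    on checksum(i.firstname, i.lastname, i.address, i.city, i.state, i.zip)
     = checksum(c.firstname, c.lastname, c.address, c.city, c.state, c.zip)
    -- ...then verify the actual values to rule out collisions
   and c.firstname = i.firstname
   and c.lastname = i.lastname
   and c.address = i.address
   and c.city = i.city
   and c.state = i.state
   and c.zip = i.zip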
Obviously, the checksum hash is just a 32-bit algorithm and will likely lead to collisions; however, this may not matter if your primary goal is to create a seekable index to improve performance and then perform additional secondary checks. For example, if you create a checksum() hash column on the table and index that column alone, you'll end up with a small index that can still be used for seeks, with secondary/residual comparisons on the NACSZ values done only for the small subset of rows that matched the checksum() value. A query in this scenario might look like this:
declare @hash int
select @hash = checksum(@first_name,@last_name,@address,@city,@state,@zip)
select t.firstname, t.lastname, t.address, t.city, t.state, t.zip
from TableName t
where t.record_hash = @hash
and t.firstname = @first_name
and t.lastname = @last_name
and t.address = @address
and t.city = @city
and t.state = @state
and t.zip = @zip
The optimizer will seek primarily on the hash index and perform a residual check on the NACSZ values for the subset of records that match the hash value.
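One way to materialize that hash column, assuming the TableName columns used in the query above, is a computed column over checksum() with its own index (a plain column populated at insert time works just as well):
-- the computed column stays in sync with the underlying NACSZ values automatically
alter table TableName
    add record_hash as checksum(firstname, lastname, address, city, state, zip);

-- index only the hash column; seeks on it are cheap and narrow
create nonclustered index IX_TableName_record_hash
    on TableName (record_hash);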
Naturally, if you are going to generate the hash in application code, this ease-of-use difference is less of an issue altogether.
As for naming conventions/style, I can't say I've ever heard of anything specific for this type of use; however, for those I've seen/used, the column names typically include a designation of 'hash' and the type of hash it is - for example 'record_checksum_hash', 'record_sha1_hash', or 'md5_hash'.
It's not a bad idea, but if your data set is small I'm not sure you'll gain anything tangible other than some redundant data (hashes of fields you already store). Are you considering doing this because the dupe finding is running slow, or just out of a desire to optimize?
If it's the latter, I'd probably not do it. Don't forget that you now have to update two fields instead of just one whenever those fields change, and that's easy to forget, especially if you ever do updates directly in SQL rather than through your front end.
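To illustrate the kind of drift being warned about here, assuming the hash is a plain column maintained by hand (table and column names are hypothetical):
-- correct: the changed column and the hash are updated together
update Customers
set address = '123 New St',
    record_hash = checksum(firstname, lastname, '123 New St', city, state, zip)
where id = 42;

-- easy mistake in an ad-hoc update: the address changes but the hash does not,
-- so this row no longer hashes consistently and can slip through duplicate checks
update Customers
set address = '123 New St'
where id = 42;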
If your comparisons look for an exact match, and if the comparisons happen often (or if the speed of comparison is more important than the speed of insert), then I think computing a hash would be a very good idea. Not only would the index be smaller, but the per-row comparison would be faster. In fact, you should be able to get away with SHA-1, which is only 160 bits, instead of SHA-256. The extra bits help from a cryptographic perspective; they don't really improve uniqueness.
Of course, be sure to canonicalize your data before computing the hash. Also note that "LIKE" comparisons don't work against a hash the way they do against strings.
Most hashes won't detect duplicates where case or accented letters differ. Compare:
select checksum('Cafe'), checksum('cafe'), checksum('Café'), checksum(' Cafe');
select hashbytes('SHA1', 'Cafe'), hashbytes('SHA1', 'cafe'), hashbytes('SHA1', 'Café'), hashbytes('SHA1', ' Cafe');
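A rough sketch of the canonicalization idea, trimming and upper-casing before hashing so that trivially different spellings collapse to the same value (folding accents would take an accent-insensitive collation or similar, which is omitted here):
-- 'Cafe' and ' cafe' now hash identically; 'Café' still differs because the accent remains
select hashbytes('SHA1', upper(ltrim(rtrim('Cafe')))),
       hashbytes('SHA1', upper(ltrim(rtrim(' cafe')))),
       hashbytes('SHA1', upper(ltrim(rtrim('Café'))));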