Duplicate checking in SQL Server - one column or many?
In my line of work, we have to do a lot of duplicate checking on name and address data. In the past, I've always just created an index on the relevant columns and queried against those columns (firstname, lastname, address, city, state, zip, etc...) directly. Lately, I've been wondering if it makes more sense to hash those values before inserting and index only the hash for comparison. Assuming I use something like SHA-256 for the hash, I shouldn't have to worry about collisions (my data set isn't that large), and my index size would be only 32 bytes per record, or 64 bytes if I store it as a string, whereas the index on NACSZ data might be 200-300 bytes.
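For concreteness, here's a rough sketch of the layout I'm considering (table, column, and index names are just placeholders; the hash would be computed before insert, either in application code or with HASHBYTES('SHA2_256', ...) on SQL Server 2012+):
-- sketch of the idea: store a SHA-256 of the NACSZ values and index only that column
create table CustomerAddresses
(
    id            int identity(1,1) primary key,
    firstname     varchar(50),
    lastname      varchar(50),
    address       varchar(100),
    city          varchar(50),
    state         char(2),
    zip           varchar(10),
    record_sha256 varbinary(32) not null  -- computed before insert from the columns above
);

create nonclustered index IX_CustomerAddresses_record_sha256
    on CustomerAddresses (record_sha256);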
Is this a good idea? Am I stupid for not considering it before? Is there a valid reason not to do so? And as a matter of style, what would be a good name for this column?
Good idea? Sure. Are you stupid for not considering it before? Definitely not. I also can't think of any valid reason not to do so, particularly if your dataset isn't that large. Of course, that may make someone wonder if there is a valid reason to do so in the first place (i.e. if your dataset isn't that large, why do you care that the NACSZ index would be 200-300 bytes in size?).
As always, there are lots of things to consider to determine what is best for your scenario (e.g. usage patterns, selectivity, read vs. write ratios, etc.), so it's very hard for anyone to provide a definitive answer on something like this.
One thing to consider, given that you are using SQL Server (not sure which version, but I'll assume 2005 or later): though there is a built-in function for generating SHA/MDx hashes, it isn't as easy to use as the checksum hashing function. The hashbytes function requires you to build the input string to be hashed explicitly, whereas the checksum function can simply take a list of columns - a simple example shows the differences:
-- CHECKSUM - easily hash an entire record over a set, regardless of column datatypes
select checksum(c.*)
from sys.columns c
-- CHECKSUM - easily hash a subset of columns for a record over a set, regardless of column datatypes
select checksum(c.name, c.object_id, c.column_id)
from sys.columns c
-- HASHBYTES - this DOES NOT work
select hashbytes('MD5', c.*)
from sys.columns c
-- HASHBYTES - this DOES NOT work either
select hashbytes('MD5', c.name, c.object_id, c.column_id)
from sys.columns c
-- HASHBYTES - you have to explicitly build the string, casting to valid character-based datatypes
select hashbytes('MD5', c.name + cast(c.object_id as varchar(50)) + cast(c.column_id as varchar(50)))
from sys.columns c
The ease of use also comes in very handy when performing joins/unions/etc. across different tables/sets of data.
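For example, here is a rough sketch of finding incoming rows that already exist in a main table (the table names are just assumptions; the residual equality checks guard against checksum collisions):
select i.firstname, i.lastname, i.address, i.city, i.state, i.zip
from IncomingCustomers i
join Customers c
    -- match on the checksum of the NACSZ columns first...
    on checksum(i.firstname, i.lastname, i.address, i.city, i.state, i.zip)
     = checksum(c.firstname, c.lastname, c.address, c.city, c.state, c.zip)
    -- ...then verify the actual values to rule out collisions
   and c.firstname = i.firstname
   and c.lastname = i.lastname
   and c.address = i.address
   and c.city = i.city
   and c.state = i.state
   and c.zip = i.zip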
Obviously, the checksum hash is just a 32-bit algorithm and will likely lead to collisions; however, this may not matter if your primary goal is to create a seekable index to improve performance and then perform additional secondary checks. For example, if you create a checksum() hash column on the table and index that column alone, you'll end up with a small index that can still be used for seeks, with secondary/residual comparisons on the NACSZ values done only for the small subset of rows that matched the checksum() value. A query in this scenario might look like this:
declare @hash int
select @hash = checksum(@first_name,@last_name,@address,@city,@state,@zip)
select t.firstname, t.lastname, t.address, t.city, t.state, t.zip
from TableName t
where t.record_hash = @hash
and t.firstname = @first_name
and t.lastname = @last_name
and t.address = @address
and t.city = @city
and t.state = @state
and t.zip = @zip
The optimizer will seek primarily on the hash index and perform a residual check on the NACSZ values for the subset of records that match the hash value.
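One way to materialize that hash column, assuming the TableName columns used in the query above, is a computed column over checksum() with its own index (a plain column populated at insert time works just as well):
-- the computed column stays in sync with the underlying NACSZ values automatically
alter table TableName
    add record_hash as checksum(firstname, lastname, address, city, state, zip);

-- index only the hash column; seeks on it are cheap and narrow
create nonclustered index IX_TableName_record_hash
    on TableName (record_hash);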
Naturally, if you are going to generate the hash in application code, this ease-of-use difference is less of an issue altogether.
As for naming conventions/style, I can't say I've ever heard of anything specific for this type of use; however, for those I've seen/used, the column names typically include a designation of 'hash' and the type of hash it is - for example 'record_checksum_hash', 'record_sha1_hash', or 'md5_hash'.
It's not a bad idea, but if your data set is small I'm not sure you'll gain anything tangible other than some redundant data (hashes of fields you already store). Are you considering doing this because the dupe finding is running slow, or just out of a desire to optimize?
If it's the latter, I'd probably not do it. Don't forget that you now have to update two fields instead of just one whenever those fields change, and that's easy to forget, especially if you ever do updates directly in SQL rather than through your front end.
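To illustrate the kind of drift being warned about here, assuming the hash is a plain column maintained by hand (table and column names are hypothetical):
-- correct: the changed column and the hash are updated together
update Customers
set address = '123 New St',
    record_hash = checksum(firstname, lastname, '123 New St', city, state, zip)
where id = 42;

-- easy mistake in an ad-hoc update: the address changes but the hash does not,
-- so this row no longer hashes consistently and can slip through duplicate checks
update Customers
set address = '123 New St'
where id = 42;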
If your comparisons look for an exact match, and if the comparisons happen often (or if the speed of comparison is more important than the speed of insert), then I think computing a hash would be a very good idea. Not only would the index be smaller, but the per-row comparison would be faster. In fact, you should be able to get away with SHA-1, which is only 160 bits, instead of SHA-256. The extra bits help from a cryptographic perspective; they don't really improve uniqueness.
Of course, be sure to canonicalize your data before computing the hash. Also note that "LIKE" comparisons don't work against a hash the way they do against strings.
Most hashes won't detect duplicates where case or accented letters differ. Compare:
select checksum('Cafe'), checksum('cafe'), checksum('Café'), checksum(' Cafe');
select hashbytes('SHA1', 'Cafe'), hashbytes('SHA1', 'cafe'), hashbytes('SHA1', 'Café'), hashbytes('SHA1', ' Cafe');
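A rough sketch of the canonicalization idea, trimming and upper-casing before hashing so that trivially different spellings collapse to the same value (folding accents would take an accent-insensitive collation or similar, which is omitted here):
-- 'Cafe' and ' cafe' now hash identically; 'Café' still differs because the accent remains
select hashbytes('SHA1', upper(ltrim(rtrim('Cafe')))),
       hashbytes('SHA1', upper(ltrim(rtrim(' cafe')))),
       hashbytes('SHA1', upper(ltrim(rtrim('Café'))));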