开发者

Unique identifier for an email

I am writing a C# application which allows users to store emails in a MS SQL Server database. Many times, multiple users will be copied on an e开发者_开发百科mail from a customer. If they all try to add the same email to the database, I want to make sure that the email is only added once.

MD5 springs to mind as a way to do this. I don't need to worry about malicious tampering, only to make sure that the same email will map to the same hash and that no two emails with different content will map to the same hash.

My question really boils down to how one would combine multiple fields into one MD5 (or other) hash value. Some of these fields will have a single value per email (e.g. subject, body, sender email address) while others will have multiple values (varying numbers of attachments, recipients). I want to develop a way of uniquely identifying an email that will be platform and language independent (not based on serialization). Any advice?


What volume of emails do you plan on archiving? If you don't expect the archive require many terabytes, I think this is a premature optimization.

Since each field can be represented as a string or array of bytes, it doesn't matter how many values it contains, it all looks the same to a hash function. Just hash them all together and you will get a unique identifier.

EDIT Psuedocode example

# intialized the hash object
hash = md5()

# compute the hashes for each field
hash.update(from_str)
hash.update(to_str)
hash.update(cc_str)
hash.update(body_str)
hash.update(...) # the rest of the email fields

# compute the identifier string
id = hash.hexdigest()

You will get the same output if you replace all the update calls with

# concatenate all fields and hash
hash.update(from_str + to_str + cc_str + body_str + ...)

How you extract the strings and interface will vary based on your application, language, and api.

It doesn't matter that different email clients might produce different formatting for some of the fields when given the same input, this will give you a hash unique to the original email.


Have you looked at some other headers like (in my mail, OS X Mail):

X-Universally-Unique-Identifier: 82d00eb8-2a63-42fd-9817-a3f7f57de6fa
Message-Id: <EE7CA968-13EB-47FB-9EC8-5D6EBA9A4EB8@example.com>

At least the Message-Id is required. That field could well be the same for the same mailing (send to multiple recipients). That would be more effective than hashing.

Not the answer to the question, but maybe the answer to the problem :)


Why not just hash the raw message? It already encodes all the relevant fields except the envelope sender and recipient, and you can add those as headers yourself, before hashing. It also contains all the attachments, the entire body of the message, etc, and it's a natural and easy representation. It also doesn't suffer from the easily generated hash collisions of mikerobi's proposal.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜