Unique identifier for an email

2022-12-28 04:48 问答作者：

I am writing a C# application which allows users to store emails in a MS SQL Server database. Many times, multiple users will be copied on an e开发者_开发百科mail from a customer. If they all try to add the same email to the database, I want to make sure that the email is only added once.

MD5 springs to mind as a way to do this. I don't need to worry about malicious tampering, only to make sure that the same email will map to the same hash and that no two emails with different content will map to the same hash.

My question really boils down to how one would combine multiple fields into one MD5 (or other) hash value. Some of these fields will have a single value per email (e.g. subject, body, sender email address) while others will have multiple values (varying numbers of attachments, recipients). I want to develop a way of uniquely identifying an email that will be platform and language independent (not based on serialization). Any advice?

What volume of emails do you plan on archiving? If you don't expect the archive require many terabytes, I think this is a premature optimization.

Since each field can be represented as a string or array of bytes, it doesn't matter how many values it contains, it all looks the same to a hash function. Just hash them all together and you will get a unique identifier.

EDIT Psuedocode example

# intialized the hash object
hash = md5()

# compute the hashes for each field
hash.update(from_str)
hash.update(to_str)
hash.update(cc_str)
hash.update(body_str)
hash.update(...) # the rest of the email fields

# compute the identifier string
id = hash.hexdigest()

You will get the same output if you replace all the update calls with

# concatenate all fields and hash
hash.update(from_str + to_str + cc_str + body_str + ...)

How you extract the strings and interface will vary based on your application, language, and api.

It doesn't matter that different email clients might produce different formatting for some of the fields when given the same input, this will give you a hash unique to the original email.

Have you looked at some other headers like (in my mail, OS X Mail):

X-Universally-Unique-Identifier: 82d00eb8-2a63-42fd-9817-a3f7f57de6fa
Message-Id: <EE7CA968-13EB-47FB-9EC8-5D6EBA9A4EB8@example.com>

At least the Message-Id is required. That field could well be the same for the same mailing (send to multiple recipients). That would be more effective than hashing.

Not the answer to the question, but maybe the answer to the problem :)

Why not just hash the raw message? It already encodes all the relevant fields except the envelope sender and recipient, and you can add those as headers yourself, before hashing. It also contains all the attachments, the entire body of the message, etc, and it's a natural and easy representation. It also doesn't suffer from the easily generated hash collisions of mikerobi's proposal.

继续阅读：cryptography email uniqueidentifier

Unique identifier for an email

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？