Hashing SMTP and NNTP messages?

I want to store and index all of my historical e-mail and news as individual message files, using some computed hash code based on the message body+headers. Then I'll index on other things as well -- for searching.

For the primary index key, my thought is to use SHA-1 for the hash algorithm and assume that there will never be any collisions (although I know that there theoretically could be).

Besides the body, what headers should I index? Or more generally, what transformations should I apply to an in-memory copy of the message prior to hashing?

Should I ignore "ReSent-*:" headers? Should I join line-broken headers into single-line headers and remove extraneous whitespace?

(The reason I want to index the messages based on some hash instead of on the Message-ID header is that Message-ID headers aren't uniformly formatted.)


You should hash precisely what constitutes the uniqueness of the message. If two messages may differ by the presence of "ReSent-*:" headers but must still be considered the "same" message, then those headers must not be part of what is hashed. Similarly, if equal messages may differ in header syntax, then you should normalize the header syntax. Hash functions such as SHA-1 return the same output only if the input is exactly the same, every single bit of it.
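As a rough illustration of that idea, here is a minimal Python sketch. The set `IDENTITY_HEADERS` and the function name `canonical_hash` are my own assumptions, not anything standard; which headers count toward identity is a decision you have to make for your archive. The sketch normalizes line endings, unfolds continuation lines, collapses whitespace, ignores every header not in the chosen set (so "ReSent-*:" headers drop out), and only then hashes.

```python
import hashlib
import re

# Assumption: these lowercase header names define message identity; adjust as needed.
IDENTITY_HEADERS = {b"from", b"to", b"cc", b"date", b"subject", b"message-id"}


def canonical_hash(raw: bytes) -> str:
    """Return the SHA-1 hex digest of a normalized view of the message."""
    text = raw.replace(b"\r\n", b"\n")        # normalize line endings
    head, _, body = text.partition(b"\n\n")   # headers end at the first blank line
    head = re.sub(rb"\n[ \t]+", b" ", head)   # unfold continuation lines
    fields = []
    for line in head.split(b"\n"):
        name, sep, value = line.partition(b":")
        name = name.strip().lower()
        if sep and name in IDENTITY_HEADERS:
            value = b" ".join(value.split())  # collapse runs of whitespace
            fields.append(name + b":" + value)
    fields.sort()                             # header order should not matter
    h = hashlib.sha1()
    h.update(b"\n".join(fields))
    h.update(b"\n\n")
    h.update(body.strip())                    # ignore leading/trailing blank lines
    return h.hexdigest()
```

With this, two copies of a message that differ only in header folding, header order, line endings, or added "ReSent-*:" headers get the same key, while any change to the body or to a selected header gives a different one.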

Now, if Message-IDs are enough for you, save for the formatting issue, then there is a simple way: just hash the Message-IDs. A hashed Message-ID has a regular, fixed-size, randomized format on which you can index.
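A small Python sketch of that simpler route, under the assumption that trimming surrounding whitespace and angle brackets is all the cleanup your Message-IDs need (the helper name `message_id_key` is hypothetical):

```python
import hashlib


def message_id_key(message_id: str) -> str:
    """Map a Message-ID of any shape to a fixed-size 40-character hex key."""
    # Drop surrounding whitespace and the optional angle brackets.
    core = message_id.strip().strip("<>").strip()
    return hashlib.sha1(core.encode("utf-8")).hexdigest()
```

Whatever the original Message-ID looked like, the key is always 40 hex characters, so it works directly as a file name or index key.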
