
Parsing bulk text with Hadoop: best practices for generating keys

I have a 'large' set of line-delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to them. I'm mapping several different techniques over the original set of sentences, and my goal during the reduce phase is to collect the results into groups such that all members of a group share the same original sentence.

I feel that using the entire sentence as a key is a bad idea. I also suspect that generating a hash value of the sentence may not work because of the limited number of possible keys, though I can't justify that belief.

Can anyone recommend the best idea/practice for generating unique keys for each sentence? Ideally, I would like to preserve order. However, this isn't a main requirement.

Goodbye,


Standard hashing should work fine. Most hash algorithms have a value space far greater than the number of sentences you're likely to be working with, and thus the likelihood of a collision will still be extremely low.
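
To put rough numbers on that, here is a minimal sketch using the standard birthday-bound approximation. It assumes the hash output behaves like uniformly random values; the function name `collision_probability` and the sentence counts are only illustrative.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Birthday-bound estimate of the chance that at least two of
    num_items uniformly random hash_bits-bit values coincide."""
    # Number of distinct pairs that could collide.
    pairs = num_items * (num_items - 1) / 2
    # Size of the hash value space.
    space = 2.0 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / space)


if __name__ == "__main__":
    # Hypothetical corpus sizes, chosen only for illustration.
    for n, bits in [(10**6, 32), (10**9, 64), (10**9, 128)]:
        print(f"{n} sentences, {bits}-bit hash: {collision_probability(n, bits):.3g}")
```

Under that assumption, a billion sentences hashed to 128 bits (the size of an MD5 digest) give a collision probability on the order of 1e-21, while a 32-bit hash is almost certain to collide even at a million sentences. So the worry about a limited number of keys only applies to short hashes.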


Although I describe suitable hash functions below, I would really suggest you just use the sentences themselves as the keys unless you have a specific reason why this is problematic.


You might want to avoid simple hash functions (for example, any half-baked idea that you could think up quickly), because they might not mix up the sentence data enough to avoid collisions in the first place. One of the standard cryptographic hash functions, such as MD5, SHA-1, or SHA-256, would probably be quite suitable.

You can use MD5 for this, even though collisions have been found and the algorithm is considered unsafe for security-sensitive purposes. This isn't a security-critical application, and the collisions that have been found were produced from carefully constructed data, so they probably won't arise by chance in your own NLP sentence data. (See, for example, Johannes Schindelin's explanation of why it's probably unnecessary to change git to use SHA-256 hashes, so that you can appreciate the reasoning behind this.)
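
As a rough illustration of that approach, here is a minimal sketch that derives a fixed-length key from each sentence with Python's standard `hashlib` module, as you might in a Hadoop Streaming mapper. The helper names `sentence_key` and `format_record`, the tab-separated record layout, and the technique label are my own assumptions, not something from your job.

```python
import hashlib


def sentence_key(sentence: str, algorithm: str = "md5") -> str:
    """Return a hex digest to use as the grouping key for a sentence."""
    # Assumption: strip surrounding whitespace so a trailing newline
    # does not change the key, then encode explicitly as UTF-8 so the
    # same sentence always produces the same bytes.
    data = sentence.strip().encode("utf-8")
    return hashlib.new(algorithm, data).hexdigest()


def format_record(sentence: str, technique: str, result: str) -> str:
    """Build one tab-separated mapper output line: key, technique, result."""
    # The technique label and layout are hypothetical placeholders.
    return "\t".join([sentence_key(sentence), technique, result])


if __name__ == "__main__":
    s = "The quick brown fox jumps over the lazy dog.\n"
    print(format_record(s, "tokenize", "The|quick|brown|fox"))
    print(sentence_key(s, "sha256"))
```

Every record produced from the same sentence carries the same key, so the reducer receives them as one group. Switching to `"sha1"` or `"sha256"` only changes the key length.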
