Gaussian Distribution + Hash Tables
I had a weird idea about a hashing function. The problem statement is:
You are storing id-numbers of 162 students in a class obtaining n marks out of 300 in a course (for each n = 0, 1, 2, ... 300) in a hash table. Devise the simplest and least collision-prone hash function for this such that the number of wasted memory cells is also minimal. Here, a collision is when two students scoring n1 and n2 get the same slot in the hash table.
One solution can be to use h(n) = (n*5 + 7) % 163 along with chaining. There can be at most 162 distinct marks.
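A minimal sketch of that baseline, assuming the records arrive as (student_id, marks) pairs. The names build_table and count_mark_collisions are made up for illustration. Because 163 is prime and 5 is not a multiple of it, two marks share a slot exactly when they differ by 163.

```python
from collections import defaultdict

TABLE_SIZE = 163  # prime modulus taken from the formula above


def h(n: int) -> int:
    """Hash a mark n (0..300) to a slot in 0..162."""
    return (n * 5 + 7) % TABLE_SIZE


def build_table(records):
    """Build a chained hash table from (student_id, marks) pairs.

    The record layout is an assumption made for this sketch.
    """
    table = defaultdict(list)
    for student_id, marks in records:
        table[h(marks)].append((student_id, marks))
    return table


def count_mark_collisions(table) -> int:
    """Count slots whose chain holds more than one distinct mark."""
    return sum(
        1 for chain in table.values()
        if len({marks for _, marks in chain}) > 1
    )


if __name__ == "__main__":
    sample = [(1, 0), (2, 163), (3, 150), (4, 150)]
    print(count_mark_collisions(build_table(sample)))
```

Students with the same mark always share a chain, which the problem statement does not count as a collision, so only chains with different marks are counted here.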
EDIT: There can be several standard ways to do this, but I'd like to try my idea and check it (maybe mathematically). It just might have fewer collisions with less memory.
Now, here's the idea I had. I can assume the distribution of marks to be Gaussian. So, there are more people near the average score and fewer at the extremes.
So, I can have a hash function something like this:
h(n) = 0 (if n<100 || n>=200)
h(n) = 1 (if 100<=n<125 || 175<=n<200)
h(n) = 2 (if 125<=n<140 || 160<=n<175)
h(n) = 3 (if 140<=n<160)
For some number of such conditions (say, k), chosen so that the bands get narrower near the average, the hash table should have the least number of collisions and the least amount of space occupied.
Now, this is just a guess. Does something like this make sense? Is there a way to prove this? Or am I wrong somewhere?
What you are doing manually here is called histogram equalization in image processing. I think it makes perfect sense, because you make sure that statistically all slots are used with the same probability, and so you're minimizing collisions.
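One way to check the "same probability" claim is to compute how much probability mass each slot of the banded hash function receives. The sketch below assumes marks follow a normal distribution with mean 150 and standard deviation 30. Neither value is given in the question, so swap in your own. The function names are illustrative only.

```python
import math

# Assumed parameters of the mark distribution (not given in the question).
MEAN = 150.0
STD = 30.0


def normal_cdf(x: float, mean: float = MEAN, std: float = STD) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def h(n: int) -> int:
    """Banded hash function from the question (4 slots)."""
    if 140 <= n < 160:
        return 3
    if 125 <= n < 140 or 160 <= n < 175:
        return 2
    if 100 <= n < 125 or 175 <= n < 200:
        return 1
    return 0  # all remaining marks (the two tails)


def slot_probabilities(max_marks: int = 300, slots: int = 4):
    """Approximate the probability that a random mark lands in each slot."""
    probs = [0.0] * slots
    for n in range(max_marks + 1):
        # Mass of the integer mark n, taken as the CDF over [n - 0.5, n + 0.5).
        probs[h(n)] += normal_cdf(n + 0.5) - normal_cdf(n - 0.5)
    return probs


if __name__ == "__main__":
    for slot, p in enumerate(slot_probabilities()):
        print(f"slot {slot}: {p:.3f}")
```

If the bands are well chosen for the assumed distribution, the printed probabilities come out roughly equal. If they differ a lot, the band edges need adjusting.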
Edit: Mis-read the question, voting 'delete' doesn't seem to do anything on SO.
If your variable is normally distributed, why not transform it using the normal CDF? The result would be uniformly distributed between 0 and 1 and would naturally be a good key into your hash table.
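A rough sketch of that CDF transform, again assuming a mean of 150 and a standard deviation of 30 (both are placeholders, not values from the question). The name cdf_hash and the table_size parameter are hypothetical.

```python
import math


def cdf_hash(n: float, table_size: int,
             mean: float = 150.0, std: float = 30.0) -> int:
    """Map a mark to a slot by pushing it through the normal CDF.

    mean and std are assumed values; estimate them from real data.
    """
    # u is approximately uniform on (0, 1) if marks really are normal.
    u = 0.5 * (1.0 + math.erf((n - mean) / (std * math.sqrt(2.0))))
    # Scale to the table and clamp in case u rounds to exactly 1.0.
    return min(int(u * table_size), table_size - 1)


if __name__ == "__main__":
    for mark in (0, 100, 150, 200, 300):
        print(mark, cdf_hash(mark, 163))
```

The mapping is monotonic, so marks near the mean are spread over many slots while marks far out in the tails share a few. How even the result is depends on how well the assumed mean and standard deviation match the real marks.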
Doing histogram equalization and the like can get pretty expensive. You might consider other standard ways of reducing hash collisions or their effects, like cuckoo hashing or hopscotch hashing.