Programming Pearls: find one integer appears at least twice

2023-02-21 19:17 问答作者：

It's in the section 2.6 and problem 2, the original problem is like this:

"Given a s开发者_如何学编程equential file containing 4,300,000,000 32-bit integers, how can you find one that appears at least twice?"

My question toward this exercise is that: what is the tricks of the above problem and what kind of general algorithm category this problem is in?

Create a bit array of length 2^32 bits (initialize to zero), that would be about 512MB and will fit into RAM on any modern machine.

Start reading the file, int by int, check bit with the same index as the value of the int, if the bit is set you have found a duplicate, if it is zero, set to one and proceed with the next int from the file.

The trick is to find a suitable data structure and algorithm. In this case everything fits into RAM with a suitable data structure and a simple and efficient algorithm can be used.
If the numbers are int64 you need to find a suitable sorting strategy or make multiple passes, depending on how much additional storage you have available.

The Pigeonhole Principle -- If you have N pigeons in M pigeonholes, and N>M, there are at least 2 pigeons in a hole. The set of 32-bit integers are our 2^32 pigeonholes, the 4.3 billion numbers in our file are the pigeons. Since 4.3x10^9 > 2^32, we know there are duplicates.

You can apply this principle to test if a duplicate we're looking for is in a subset of the numbers at the cost of reading the whole file, without loading more than a little at a time into RAM-- just count the number of times you see a number in your test range, and compare to the total number of integers in that range. For example, to check for a duplicate between 1,000,000 and 2,000,000 inclusive:

int pigeons = 0;
int pigeonholes = 2000000 - 1000000 + 1; // include both fenceposts
for (each number N in file) {
  if ( N >= 1000000 && N <= 2000000 ) {
    pigeons++
  }
}
if (pigeons > pigeonholes) {
  // one of the duplicates is between 1,000,000 and 2,000,000
  // try again with a narrower range
}

Picking how big of range(s) to check vs. how many times you want to read 16GB of data is up to you :)

As far as a general algorithm category goes, this is a combinatorics (math about counting) problem.

If what do you mean is 32 bit positive integers, I think this problem doesn't require some special algorithm or trick to solve. Just a simple observation will lead to the intended solution.

My observation goes like this, the sequential file will contain only 32 bit integers (which is from 0 to 2 ^ 31 - 1). Assume you put all of them in that file uniquely, you will end up with 2 ^ 31 lines. You can see that if you put those positive integers once again, you will end up with 2 ^ 31 * 2 lines and it is smaller than 4,300,000,000.

Thus, the answer is the whole positive integers ranging from 0 to 2 ^ 31 - 1.

Sort the integers and loop through them to see if consecutive integers are duplicates. If you want to do this in memory, it requires 16GB memory that is possible with todays machines. If this is not possible, you could sort the numbers using mergesort and by store intermediate arrays to disk.

My first implementation attempt would be to use sort and uniq commands from unix.

继续阅读：algorithm programming-pearls

Programming Pearls: find one integer appears at least twice

更多精彩内容

精彩评论

最新问答

北医三院三代试管养囊一次费用是多少？贵不贵？？

下周全市中小学校恢复线下教学，如何让孩子收心准备开学？？

飞利浦液晶电视,不小心按了TV,现实无信号,要怎么样才能返回电视...？

如何治疗输卵管阻塞？

理光短焦家用投影仪pjk366蓝光3d超清家用会议教学家用投影机如何样？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？