Big data sort and search

2023-01-20 11:01 问答作者：

I have two files of data, 100 char lines each. File A: 10⁸ lines, file B: 10⁶ lines. And I need to find all the开发者_如何学编程 strings from file B that are not in file A.

At first I was thinking feeding both files to mysql, but it looks like it won't ever finish creating an unique key on 10⁸ records.

I'm waiting for your suggestions on this.

You can perform this operation without a database. The key is to reduce the size of A, since A is much larger than B. Here is how to do this:

Calculate 64-bit hashes using a decent hash function for the strings in the B file. Store these in memory (in a hash table), which you can do because B is small. Then hash all of the strings in your A file, line by line, and see if each one matches a hash for your B file. Any lines with matching hashes (to one from B), should be stored in a file C.

When this process is complete file C will have the small subset of A of potentially matching strings (to B). Now you have a much smaller file C that you need to compare lines of B with. This reduces the problem to a problem where you can actually load all of the lines from C into memory (as a hash table) and compare each line of B to see if it is in C.

You can slightly improve on @michael-goldshteyn's answer (https://stackoverflow.com/a/3926745/179529). Since you need to find all the strings in B that are not in A, you can remove any item from the Hash Table of the elements of B, when you compare and find a match for it with the elements in A. The elements that will remain in the Hash Table are the elements that were not found in file A.

For the sizes you mention you should be able to keep all of B in memory at once, so you could do a simplified version of Goldshteyn's answer; something like this in python:

#!/usr/bin/python3

import sys

if __name__=='__main__':
  b = open(sys.argv[2],'r')
  bs = set()
  for l in b:
    bs.add(l.strip())
  b.close()
  a = open(sys.argv[1],'r')
  for l in a:
    l = l.strip()
    if l in bs:
      bs.remove(l)
  for x in bs:
    print(x)

I've tested this on two files of 10^5 and 10^7 in size with ~8 chars per line on an atom processor. Output from /usr/bin/time:

25.15user 0.27system 0:25.80elapsed 98%CPU (0avgtext+0avgdata 56032maxresident)k
0inputs+0outputs (0major+3862minor)pagefaults 0swaps
  60298   60298  509244

继续阅读：bigdata search sorting

Big data sort and search

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？