Big data sort and search
I have two files of data, 100 char lines each. File A: 108 lines, file B: 106 lines. And I need to find all the开发者_如何学编程 strings from file B that are not in file A.
At first I was thinking feeding both files to mysql, but it looks like it won't ever finish creating an unique key on 108 records.I'm waiting for your suggestions on this.
You can perform this operation without a database. The key is to reduce the size of A, since A is much larger than B. Here is how to do this:
Calculate 64-bit hashes using a decent hash function for the strings in the B file. Store these in memory (in a hash table), which you can do because B is small. Then hash all of the strings in your A file, line by line, and see if each one matches a hash for your B file. Any lines with matching hashes (to one from B), should be stored in a file C.
When this process is complete file C will have the small subset of A of potentially matching strings (to B). Now you have a much smaller file C that you need to compare lines of B with. This reduces the problem to a problem where you can actually load all of the lines from C into memory (as a hash table) and compare each line of B to see if it is in C.
You can slightly improve on @michael-goldshteyn's answer (https://stackoverflow.com/a/3926745/179529). Since you need to find all the strings in B that are not in A, you can remove any item from the Hash Table of the elements of B, when you compare and find a match for it with the elements in A. The elements that will remain in the Hash Table are the elements that were not found in file A.
For the sizes you mention you should be able to keep all of B in memory at once, so you could do a simplified version of Goldshteyn's answer; something like this in python:
#!/usr/bin/python3
import sys
if __name__=='__main__':
b = open(sys.argv[2],'r')
bs = set()
for l in b:
bs.add(l.strip())
b.close()
a = open(sys.argv[1],'r')
for l in a:
l = l.strip()
if l in bs:
bs.remove(l)
for x in bs:
print(x)
I've tested this on two files of 10^5 and 10^7 in size with ~8 chars per line on an atom processor. Output from /usr/bin/time:
25.15user 0.27system 0:25.80elapsed 98%CPU (0avgtext+0avgdata 56032maxresident)k
0inputs+0outputs (0major+3862minor)pagefaults 0swaps
60298 60298 509244
精彩评论