find the difference between two very large list
I have two large list(could be a hundred million items), the source of each list can be either from a database table or a flat file. both lists are of comparable sizes, both unsorted. I need to find the difference between them. so I have 3 scenarios:
1. List1 is a database table(assume each row simply have one item(key) that is a string), List2 is a large file. 2. Both lists are from 2 db tables. 3. both lists are from two files. in case 2, I plan to us开发者_运维问答e:select a.item from MyTable a where a.item not in (select b.item form MyTable b)
this clearly is inefficient, is there a better way?
Another approach is:
I plan to sort each list, and then walk down both of them to find the diff. If the list is from a file, I have to read it into a db table first, then use db sorting to output the list. Is the run time complexity still O(nlogn) in db sorting?either approach is a pain and seems would be very slow when the list involved has hundreds of millions of items. any suggestions?
This is not really a database question.
Step 1. Get both lists sorted. Maybe the db list is already sorted, but if not, then either export it in sorted order, or create an index if this same list will be needed sorted multiple times.
Step 2. Use a sort utility to make a sorted copy of a list in a text file. If these lists are beyond the capability of the UNIX sort utility, break them up, sort each one, and include the merging of these in your application.
Step 3. Write your application to apply a merge algorithm against the two lists and identify the differences this way. Note that if the text file is in many chunks, you would need a secondary merge algorithm to feed the main algorithm in sorted order.
Note that if you cannot use UNIX or Linux to sort the text files, then get the source code of the UNIX sort command and port it to your O/S. This article explains why.
- Get both sets into the database under all scenarios...this kind of sorting and determination is what db's are for. Anything else will be reinventing the wheel.
The following will probably be faster than a NOT IN (but test it to be sure):
select a.item from MyTable a LEFT JOIN MyTable B ON A.JoinColumn = B.JoinColumn where B.JoinColumn IS NULL
Make sure that your JoinColumns are indexed. The indexing will make the whole question of sorting go poof.
精彩评论