find the difference between two very large list

2023-02-09 07:56 问答作者：

I have two large list(could be a hundred million items), the source of each list can be either from a database table or a flat file. both lists are of comparable sizes, both unsorted. I need to find the difference between them. so I have 3 scenarios:

1. List1 is a database table(assume each row simply have one item(key) that is a string), List2 is a large file.

2. Both lists are from 2 db tables.

3. both lists are from two files.

in case 2, I plan to us开发者_运维问答e:

select a.item from MyTable a where a.item not in (select b.item form MyTable b)

this clearly is inefficient, is there a better way?

Another approach is:

I plan to sort each list, and then walk down both of them to find the diff. If the list is from a file, I have to read it into a db table first, then use db sorting to output the list. Is the run time complexity still O(nlogn) in db sorting?

either approach is a pain and seems would be very slow when the list involved has hundreds of millions of items. any suggestions?

This is not really a database question.

Step 1. Get both lists sorted. Maybe the db list is already sorted, but if not, then either export it in sorted order, or create an index if this same list will be needed sorted multiple times.

Step 2. Use a sort utility to make a sorted copy of a list in a text file. If these lists are beyond the capability of the UNIX sort utility, break them up, sort each one, and include the merging of these in your application.

Step 3. Write your application to apply a merge algorithm against the two lists and identify the differences this way. Note that if the text file is in many chunks, you would need a secondary merge algorithm to feed the main algorithm in sorted order.

Note that if you cannot use UNIX or Linux to sort the text files, then get the source code of the UNIX sort command and port it to your O/S. This article explains why.

Get both sets into the database under all scenarios...this kind of sorting and determination is what db's are for. Anything else will be reinventing the wheel.
The following will probably be faster than a NOT IN (but test it to be sure):

select a.item from MyTable a LEFT JOIN MyTable B ON A.JoinColumn = B.JoinColumn where B.JoinColumn IS NULL

Make sure that your JoinColumns are indexed. The indexing will make the whole question of sorting go poof.

继续阅读：database sorting

find the difference between two very large list

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？