Fastest way to compare values from one list against a gigantic two-dimensional list in Python?

I want to check whether a value from one list exists in the values of another list. The lists are huge (50k+ items, from a database).

EDIT:

I also want to mark each duplicated record with duplicate=True and keep it in the table for later reference.

Here is how the lists are built:

n_emails = [[db_id, checksum] for db_id, checksum in search_results]
# I want to check whether a checksum already exists in the same list (or in another list)
# and retrieve the matching id (db_id) if it does.
# example: n_emails = [[1,'CAFEBABE010'], [2,'bfeafe3df1ds'], [3,'deadbeef101'], [5,'CAFEBABE010']]
# in this case I want to retrieve ids 1 and 5 because they have the same checksum

for m in n_emails:
    dups = _getdups(n_emails, m[1], m[0])
    n_dups = [casesdb.duplicates.insert(**dup) for dup in dups]
    if n_dups:
        print "Dupe Found"
        casesdb(casesdb.email_data.id == m[0]).update(duplicated=True)

def _getdups(old_lst, em_md5, em_id):
    # linear scan: compare em_md5 against every other entry's checksum
    dups = []
    for old in old_lst:
        if em_md5 == old[1] and old[0] != em_id:
            dups.append(dict(org_id=old[0], md5hash=old[1], dupID=em_id))
    return dups

But this takes far too long. With larger lists (50k vs. 50k+ records) it ran for over 5000 seconds and never finished; it looked like a never-ending loop. The server I'm running it on has 4 GB of RAM and 4 cores. Obviously I am doing something wrong.

Please help .. thanks a lot!

SOLVED:

Dict index mapping is way faster! (Note that the MySQL table was not indexed; I have not tested against an indexed table.)

It's 20 secs vs 30 milliseconds: 20*1000 / 30 ≈ 666 times faster! LOL


The fastest way is to use a dict, like this:

n_emails= [[1,'CAFEBABE010'],[2,'bfeafe3df1ds'],[3,'deadbeef101'],[5,'CAFEBABE010']]

d = {}
for id, hash in n_emails:
    if hash not in d:
        d[hash] = [id]
    else:
        d[hash].append(id)

# any hash that collected more than one id is a duplicate
for hash, ids in d.items():
    if len(ids) > 1:
        print hash, ids
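For the example list in the question this prints the duplicated checksum together with both of its ids, which is exactly the 1 and 5 the asker wants to retrieve:

CAFEBABE010 [1, 5]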

This is essentially the algorithm behind a hash join.


for hash, count in (select hash, count(id) as num from emails group by hash having num > 1):
    first = None
    for index, id in enumerate(select id from emails where hash = hash order by id desc):
        if index == 0:
            first = id
            continue
        update emails set duplicate = first where id = id

This would be the SQL/Python solution (pseudocode). Here I take the duplicate column and use it to store which message this one is thought to be a duplicate of.

the emails table would be at least:

create table emails (id, hash, duplicate default null)
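
For completeness, here is a minimal runnable sketch of that approach using Python's standard sqlite3 module and an in-memory table. The schema follows the create table line above and the rows are the sample data from the question, so this is only an illustration of the idea, not the asker's actual web2py setup:

import sqlite3

# throwaway in-memory database with the sample rows from the question
conn = sqlite3.connect(":memory:")
conn.execute("create table emails (id integer primary key, hash text, duplicate integer default null)")
conn.executemany("insert into emails (id, hash) values (?, ?)",
                 [(1, 'CAFEBABE010'), (2, 'bfeafe3df1ds'), (3, 'deadbeef101'), (5, 'CAFEBABE010')])

# every hash that occurs more than once is a duplicate group
dupe_hashes = conn.execute(
    "select hash from emails group by hash having count(id) > 1").fetchall()

for (h,) in dupe_hashes:
    ids = [row[0] for row in
           conn.execute("select id from emails where hash = ? order by id", (h,))]
    first = ids[0]
    for dup_id in ids[1:]:
        # point every later copy at the first occurrence
        conn.execute("update emails set duplicate = ? where id = ?", (first, dup_id))
conn.commit()

print(conn.execute("select id, hash, duplicate from emails").fetchall())
# the row for id 5 should now show duplicate = 1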


What you're doing wrong is:

  • You can probably get the result directly from the database. It's much faster than Python.
  • You are doing a linear search for the checksums, which means that each of the 50k entries gets compared to each of the other 50k entries ... that is roughly 2.5 billion comparisons.

What you should do is build an index over the checksums. Make a dict that maps checksum -> entry. When you insert the entries, check whether the checksum already exists; if so, the entry is a duplicate.

Or you simply use your database, they love indexing.
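
If you go the database route, the key step is an index on the checksum column, so that each lookup stops being a full table scan. With the sqlite3 sketch above that could be, for example (the index name is just an illustration):

conn.execute("create index if not exists idx_emails_hash on emails (hash)")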


You'd be better off looking for duplicates with SQL. For example, see this page describing how to find duplicates.

Pulling all of those results into Python and processing them is never going to be very fast, but if you must, your best bet is to have a dictionary of checksums to IDs:

got_checksums = {}
for id, checksum in emails:
    if checksum in got_checksums:
        print id, got_checksums[checksum]
    else:
        got_checksums[checksum] = id


Finally, thanks to all the answers, I found that dict mapping is insanely fast! Much faster than SQL queries.

Here is my SQL query test (it may look awkward, but it is the syntax of web2py DAL queries).

I tested both approaches on 3500 records, and the dict mapping alone against over 250,000 records.

print "de_duping started at %s" % str( datetime.datetime.now() )

dupe_n = 0
l_dupe_n = 0
for em_hash in n_emails:
    dup_ids=casesdb(casesdb.email_data.MD5Hash==em_hash[1]).select(casesdb.email_data.id)
    if dup_ids>1:
        dupe_n+=1

print "Email Dupes %s" % (dupe_n)
print "Local de_duping ended at %s" % str( datetime.datetime.now() )

Results in:

de_duping started at 2010-12-02 03:39:24.610888
Email Dupes 3067
Local de_duping ended at 2010-12-02 03:39:52.669849

About 28 seconds. Here is the dict-based index mapping, based on Dan D's answer:

    print "de_duping started at %s" % str( datetime.datetime.now() )
    for id, hash in em_hash:

            if hash not in dedupe_emails:

                dedupe_emails[hash] = [id]
            else:

                dedupe_emails[hash].append( id )
                dupe_n += 1
                casesdb( casesdb.email_data.id == id ).update( duplicated = True )

    print "Email Dupes %s" % (dupe_n)
    print "Local de_duping ended at %s" % str( datetime.datetime.now() )

Results in:

de_duping started at 2010-12-02 03:41:21.505235
Email Dupes 2591  # this count is accurate; the database query above also counts the first match as a duplicate, which is why it reported more
Local de_duping ended at 2010-12-02 03:41:21.531899

only what? 30 ms!

And let's see what it did de-duping 250k records!

de_duping at 2010-12-02 03:44:20.120880
Email Dupes 93567 
Local de_duping ended at 2010-12-02 03:45:12.612449

Less than a minute!!

Thanks for all the answers. I would like to accept everyone who pointed me in the right direction, but Dan D gave me the most detailed answer! Thanks Dan!
