
Comparing 40 million lines in a file with 6 million list items in Python

I have a file with 40 million entries in the form of:

#No Username

And I have a list with 6 million items, where each item is a username.

I want to find the common usernames in the fastest way possible. Here’s what I’ve got so far:

import os
usernames=[]
common=open('/path/to/filf','w')
f=open('/path/to/6 million','r')
for l in os.listdir('/path/to/directory/with/usernames/'):
    usernames.append(l)
#noOfUsers=len(usernames)
for l in f:
    l=l.split(' ')
    if(l[1] in usernames):
        common.write(l[1]+'\n')
common.close()
f.close()

How can I improve the performance of this code?


I see two obvious improvements: first, make usernames a set, so each membership test is a hash lookup instead of a linear scan. Second, collect the matches in a result list and write '\n'.join(resultlist) to the file once.

import os

usernames = []

for l in os.listdir('/path/to/directory/with/usernames/'):
    usernames.append(l)

usernames = set(usernames)

f = open('/path/to/6 million','r')
resultlist = [] 
for l in f:
    l = l.split(' ')
    if (l[1] in usernames):
        resultlist.append(l[1])
f.close()

common=open('/path/to/filf','w')
common.write('\n'.join(resultlist) + '\n')
common.close()

Edit: assuming all you want is to find the most common names:

from collections import Counter

usernames = set(os.listdir('/path/to/directory/with/usernames/'))

f = open('/path/to/6 million')
name_counts = Counter(name for name in (line.split()[1] for line in f)
                      if name in usernames)
print name_counts.most_common()

Edit2: Given the clarification, here's how to create a file containing the names that appear both in the usernames directory and in the 6-million-line file:

import os
usernames = set(os.listdir('/path/to/directory/with/usernames/'))

f = open('/path/to/6 million')
resultlist = [name for name in (line.split()[1] for line in f)
              if name in usernames]

common = open('/path/to/filf','w')
common.write('\n'.join(resultlist) + '\n')
common.close()


If you create a dict with usernames as keys, then testing whether a key exists in the dict is much faster than testing for the presence of an element in a list: dict (and set) lookups are hash-based and take roughly constant time on average, while a list test scans the elements one by one. A rough comparison is sketched below.
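
For a rough sense of the difference, here is a minimal timing sketch (the sizes and lookup value are made up for demonstration): the list scan is O(n), while the set lookup hashes the key and takes roughly constant time.

import timeit

setup = '''
items = [str(i) for i in range(100000)]
as_list = items
as_set = set(items)  # a dict keyed by username behaves the same way
'''

# '99999' sits at the end of the list, so the list scan is worst-case;
# the set lookup is a single hash probe either way
print timeit.timeit("'99999' in as_list", setup=setup, number=100)
print timeit.timeit("'99999' in as_set", setup=setup, number=100)

On a typical machine the set version comes out several orders of magnitude faster, and the gap grows with the size of the list.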


If this is an operation you will perform more than once, may I suggest threading? The following is a sketch.

First, split the files into 100,000-line chunks on Linux:

> split -l 100000 usernames.txt usernames_

Then spawn some threads to do the reading in parallel.

import glob
import threading

usernames_one = set()
usernames_two = set()
filereaders = []

# This thread reads every line of its file and adds the stripped name
# to the given set
class Filereader(threading.Thread):
    def __init__(self, filename, username_set):
        threading.Thread.__init__(self)
        self.filename = filename
        self.username_set = username_set

    def run(self):
        for line in open(self.filename):
            self.username_set.add(line.strip())

# loop through the usernames_ chunk files and spawn a thread for each
for chunk in glob.glob('usernames_*'):
    f = Filereader(chunk, usernames_one)
    filereaders.append(f)
    f.start()
# do the same loop for the chunks of the second input, filling usernames_two

# at the end, wait for all threads to complete
for f in filereaders:
    f.join()

# then do a simple set intersection:
common_usernames = usernames_one & usernames_two

# then write the common set to a file:
common_file = open("common_usernames.txt", 'w')
common_file.write('\n'.join(common_usernames) + '\n')
common_file.close()

You'll have to check whether set addition is thread-safe. If it isn't, you can of course create a list of sets (one for each file processed by a thread) and union them all at the end before intersecting; a sketch of that variant follows.
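
A minimal sketch of that variant (the chunk-name patterns and the read_all helper are hypothetical, not from the snippet above): each thread fills its own private set, so nothing is shared while the threads run, and the union happens only after join().

import glob
import threading

class ChunkReader(threading.Thread):
    def __init__(self, filename):
        threading.Thread.__init__(self)
        self.filename = filename
        self.names = set()  # private to this thread; never shared while running

    def run(self):
        for line in open(self.filename):
            self.names.add(line.strip())

def read_all(pattern):
    # one thread per chunk file produced by split
    threads = [ChunkReader(chunk) for chunk in glob.glob(pattern)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # union the per-thread sets only after every thread has finished
    result = set()
    for t in threads:
        result |= t.names
    return result

common_usernames = read_all('usernames_one_*') & read_all('usernames_two_*')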
