
Python halts while iteratively processing my 1GB csv file

I have two files:

  1. metadata.csv: contains an ID, followed by a vendor name, a filename, etc.
  2. hashes.csv: contains an ID, followed by a hash

The ID is essentially a foreign key of sorts, relating file metadata to its hash.

I wrote this script to quickly extract all hashes associated with a particular vendor. It craps out before it finishes processing hashes.csv.

stored_ids = []

# this file is about 1 MB
entries = csv.reader(open(options.entries, "rb"))

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID
    stored_ids.append(row[0])

# this file is 1 GB
hashes = open(options.hashes, "rb")

# I iteratively read the file here,
# just in case the csv module doesn't do this.
for line in hashes:

  # not sure if stored_ids contains strings or ints here...
  # this probably isn't the problem though
  if line.split(",")[0] in stored_ids:

    # if it's one of the IDs we're looking for, print the file and hash to STDOUT
    print "%s,%s" % (line.split(",")[2], line.split(",")[4])

hashes.close()

This script gets about 2000 entries through hashes.csv before it halts. What am I doing wrong? I thought I was processing it line by line.

PS: the csv files are in the popular HashKeeper format, and the files I am parsing are the NSRL hash sets. http://www.nsrl.nist.gov/Downloads.htm#converter

UPDATE: working solution below. Thanks everyone who commented!

entries = csv.reader(open(options.entries, "rb"))   
stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

hashes = csv.reader(open(options.hashes, "rb"))
matches = dict((row[2], row[4]) for row in hashes if row[0] in stored_ids)

for k, v in matches.iteritems():
    print "%s,%s" % (k, v)


"Craps out" is not a particularly good description. What does it do? Does it swap? Fill all memory? Or just eats CPU without appearing to do anything?

However, just for a start, use a dictionary rather than a list for stored_ids. Searching a dictionary is usually O(1), while searching a list is O(n).

Edit: here is a trivial micro-benchmark:

$ python -m timeit -s "l=range(1000000)" "1000001 in l"
10 loops, best of 3: 71.1 msec per loop
$ python -m timeit -s "s=set(range(1000000))" "1000001 in s"
10000000 loops, best of 3: 0.174 usec per loop

As you can see, a set (which has the same lookup characteristics as a dict) finds an item among one million integers several hundred thousand times faster than the equivalent list: well under a microsecond versus roughly 70 milliseconds per lookup. Consider that such a lookup happens for every line of your 1GB file and you can see how big the issue is.
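
To make that concrete, here is a minimal sketch of the change applied to your script, assuming options.entries and options.vendor come from your (unshown) option parsing; only the lookup structure changes:

import csv

# Build a set of IDs instead of a list; membership tests on a set
# are O(1) on average instead of a linear scan.
stored_ids = set()
for row in csv.reader(open(options.entries, "rb")):
    if row[2] == options.vendor:
        stored_ids.add(row[0])

# The big loop over hashes.csv can stay exactly as it is:
# "line.split(',')[0] in stored_ids" is now a constant-time lookup.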


This code would die on any line that does not have at least 4 commas; for example, it would die on an empty line. If you are sure you don't want to use the csv reader, then at least catch IndexError on line.split(',')[4].
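
For example, something along these lines (a sketch, keeping your variable names and field indices) would skip malformed or empty lines instead of dying on them:

for line in hashes:
    fields = line.split(",")
    try:
        if fields[0] in stored_ids:
            print "%s,%s" % (fields[2], fields[4])
    except IndexError:
        # the line had fewer than 5 comma-separated fields (e.g. a blank line)
        continue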


Please explain what you mean by "halt": does it hang or quit? Is there an error traceback?

a) It will fail on any line that does not have enough comma-separated fields:

>>> 'hmmm'.split(",")[2]
Traceback (most recent call last):
  File "<string>", line 1, in <string>
IndexError: list index out of range

b) Why are you splitting the line multiple times? Instead, do this:

tokens = line.split(",")

if len(tokens) >= 5 and tokens[0] in stored_ids:
    print "%s,%s" % (tokens[2], tokens[4])

c) Create a dict of stored_ids, so tokens[0] in stored_ids will be fast.

d) Wrap your inner code in try/except and see if there are any errors (see the sketch after this list).

e) Where are you running it: on the command line or in an IDE?
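
Putting (b), (c) and (d) together, a rough sketch (assuming entries, hashes and options are set up as in your question):

import sys

# (c) a dict (a set would also work) gives fast membership tests
stored_ids = dict((row[0], 1) for row in entries if row[2] == options.vendor)

for line in hashes:
    try:
        tokens = line.split(",")                       # (b) split only once
        if len(tokens) >= 5 and tokens[0] in stored_ids:
            print "%s,%s" % (tokens[2], tokens[4])
    except Exception as e:                             # (d) make any failure visible
        print >> sys.stderr, "failed on %r: %s" % (line, e)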


Searching a list takes O(n) time, so use a dict instead:

stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

Or use a set:

a=set(row[0] for row in entries if row[2] == options.vendor)
b=set(line.split(",")[0] for line in hashes)
c=a.intersection(b)

c will contain only the IDs found in both files, i.e. the chosen vendor's IDs that also appear in hashes.csv.
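
To get from those IDs back to the file/hash output, one option (a sketch; it re-reads hashes.csv, since the first pass only kept the IDs, and assumes csv and options as in the question) is:

for row in csv.reader(open(options.hashes, "rb")):
    # print the filename and hash columns for every ID in the intersection
    if len(row) >= 5 and row[0] in c:
        print "%s,%s" % (row[2], row[4])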
