
Python halts while iteratively processing my 1GB csv file

I have two files:

  1. metadata.csv: contains an ID, followed by a vendor name, a filename, etc.
  2. hashes.csv: contains an ID, followed by a hash

The ID is essentially a foreign key of sorts, relating file metadata to its hash.

I wrote this script to quickly extract all hashes associated with a particular vendor. It craps out before it finishes processing hashes.csv.

stored_ids = []

# this file is about 1 MB
entries = csv.reader(open(options.entries, "rb"))

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID
    stored_ids.append(row[0])

# this file is 1 GB
hashes = open(options.hashes, "rb")

# I iteratively read the file here,
# just in case the csv module doesn't do this.
for line in hashes:

  # not sure if stored_ids contains strings or ints here...
  # this probably isn't the problem though
  if line.split(",")[0] in stored_ids:

    # if it's one of the IDs we're looking for, print the file and hash to STDOUT
    print "%s,%s" % (line.split(",")[2], line.split(",")[4])

hashes.close()

This script gets about 2000 entries through hashes.csv before it halts. What am I doing wrong? I thought I was processing it line by line.

PS: the csv files are in the popular HashKeeper format, and the files I am parsing are the NSRL hash sets. http://www.nsrl.nist.gov/Downloads.htm#converter

UPDATE: working solution below. Thanks everyone who commented!

entries = csv.reader(open(options.entries, "rb"))   
stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

hashes = csv.reader(open(options.hashes, "rb"))
matches = dict((row[2], row[4]) for row in hashes if row[0] in stored_ids)

for k, v in matches.iteritems():
    print "%s,%s" % (k, v)


"Craps out" is not a particularly good description. What does it do? Does it swap? Fill all memory? Or just eats CPU without appearing to do anything?

However, just for a start, use a dictionary rather than a list for stored_ids. Searching a dictionary is usually O(1), while searching a list is O(n).

Edit: here is a trivial micro-benchmark:

$ python -m timeit -s "l=range(1000000)" "1000001 in l"
10 loops, best of 3: 71.1 msec per loop
$ python -m timeit -s "s=set(range(1000000))" "1000001 in s"
10000000 loops, best of 3: 0.174 usec per loop

As you can see, a set (which has the same lookup characteristics as a dict) finds an item among one million integers several hundred thousand times faster than the equivalent list: well under a microsecond versus roughly 70 milliseconds per lookup. Consider that such a lookup happens for every line of your 1GB file and you can see how big the issue is.
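
To make that concrete, here is a minimal sketch of the change applied to your script, assuming options.entries and options.vendor come from your (unshown) option parsing; only the lookup structure changes:

import csv

# Build a set of IDs instead of a list; membership tests on a set
# are O(1) on average instead of a linear scan.
stored_ids = set()
for row in csv.reader(open(options.entries, "rb")):
    if row[2] == options.vendor:
        stored_ids.add(row[0])

# The big loop over hashes.csv can stay exactly as it is:
# "line.split(',')[0] in stored_ids" is now a constant-time lookup.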


This code would die on any line that does not have at least 4 commas; for example, it would die on an empty line. If you are sure you don't want to use the csv reader, then at least catch IndexError on line.split(',')[4].
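
For example, something along these lines (a sketch, keeping your variable names and field indices) would skip malformed or empty lines instead of dying on them:

for line in hashes:
    fields = line.split(",")
    try:
        if fields[0] in stored_ids:
            print "%s,%s" % (fields[2], fields[4])
    except IndexError:
        # the line had fewer than 5 comma-separated fields (e.g. a blank line)
        continue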


Please explain what you mean by "halt": does it hang or quit? Is there an error traceback?

a) It will fail on any line that does not have enough comma-separated fields:

>>> 'hmmm'.split(",")[2]
Traceback (most recent call last):
  File "<string>", line 1, in <string>
IndexError: list index out of range

b) Why are you splitting the line multiple times? Instead, do this:

tokens = line.split(",")

if len(tokens) >= 5 and tokens[0] in stored_ids:
    print "%s,%s" % (tokens[2], tokens[4])

c) Create a dict of stored_ids, so tokens[0] in stored_ids will be fast.

d) Wrap your inner code in try/except and see if there are any errors (see the sketch after this list).

e) Where are you running it: on the command line or in an IDE?
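
Putting (b), (c) and (d) together, a rough sketch (assuming entries, hashes and options are set up as in your question):

import sys

# (c) a dict (a set would also work) gives fast membership tests
stored_ids = dict((row[0], 1) for row in entries if row[2] == options.vendor)

for line in hashes:
    try:
        tokens = line.split(",")                       # (b) split only once
        if len(tokens) >= 5 and tokens[0] in stored_ids:
            print "%s,%s" % (tokens[2], tokens[4])
    except Exception as e:                             # (d) make any failure visible
        print >> sys.stderr, "failed on %r: %s" % (line, e)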


Searching a list takes O(n) time, so use a dict instead:

stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

Or use a set:

a=set(row[0] for row in entries if row[2] == options.vendor)
b=set(line.split(",")[0] for line in hashes)
c=a.intersection(b)

c will contain only the IDs found in both files, i.e. the chosen vendor's IDs that also appear in hashes.csv.
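
To get from those IDs back to the file/hash output, one option (a sketch; it re-reads hashes.csv, since the first pass only kept the IDs, and assumes csv and options as in the question) is:

for row in csv.reader(open(options.hashes, "rb")):
    # print the filename and hash columns for every ID in the intersection
    if len(row) >= 5 and row[0] in c:
        print "%s,%s" % (row[2], row[4])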
