
Speeding up parsing of HUGE lists of dictionaries - Python

I'm using MongoDB, a NoSQL database. Basically, as a result of a query I have a list of dicts which themselves contain lists of dictionaries... which I need to work with.

Unfortunately, dealing with all this data within Python can slow to a crawl when there is too much of it.


I have never had to deal with this problem, and it would be great if someone with experience could give a few suggestions. =)


Do you really want all of that data back in your Python program? If so, fetch it back a little at a time; but if all you want to do is summarise the data, then use map-reduce in MongoDB to distribute the processing and return just the summarised data.

After all, the point of using a NoSQL database that cleanly shards the data across multiple machines is precisely to avoid having to pull it all back onto a single machine for processing.
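
As a rough illustration of the map-reduce route, here is a minimal sketch using PyMongo's Collection.map_reduce (available in older PyMongo releases; newer drivers and servers favour the aggregation pipeline instead). The events collection and its category field are hypothetical stand-ins for your own data:

    from pymongo import MongoClient
    from bson.code import Code

    client = MongoClient()          # default localhost:27017
    coll = client.mydb.events       # hypothetical database/collection

    # The JavaScript map/reduce functions run server-side, so only the
    # summarised counts come back to Python.
    mapper = Code("function () { emit(this.category, 1); }")
    reducer = Code("function (key, values) { return Array.sum(values); }")

    out = coll.map_reduce(mapper, reducer, "category_counts")
    for doc in out.find():
        print(doc["_id"], doc["value"])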


Are you loading all the data into memory at once? If so, you could be causing the OS to swap memory to disk, which can bring any system to a crawl. Dictionaries are hash tables, so even an empty dict carries significant memory overhead, and from what you say you are creating a lot of them at once. I don't know the MongoDB API, but I presume there is a way of iterating through the results one at a time instead of reading in the entire result set at once - try using that. Or rewrite your query to return a subset of the data.
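
To illustrate the iterate-one-at-a-time idea, here is a sketch with PyMongo, whose find() returns a lazy cursor that fetches documents from the server in batches; the collection name, query, and process() function are hypothetical:

    from pymongo import MongoClient

    client = MongoClient()
    coll = client.mydb.events                    # hypothetical collection

    # The cursor pulls documents in batches as you iterate, so only one
    # batch is held in memory at a time instead of the whole result set.
    cursor = coll.find({"status": "active"}).batch_size(1000)
    for doc in cursor:
        process(doc)                             # hypothetical per-document work

    # Or ask the server for a subset up front with a projection and limit:
    subset = coll.find({}, {"category": 1, "value": 1}).limit(10000)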

If disk swapping is not the problem then profile the code to see what the bottleneck is, or put some sample code in your question. Without more specific information it is hard to give a more specific answer.
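
For the profiling step, a minimal sketch using the standard library's cProfile and pstats, assuming a hypothetical load_and_summarise() entry point in your own code:

    import cProfile
    import pstats

    # Profile the (hypothetical) entry point and save the raw timings.
    cProfile.run("load_and_summarise()", "profile.out")

    # Print the 20 most expensive calls by cumulative time.
    stats = pstats.Stats("profile.out")
    stats.sort_stats("cumulative").print_stats(20)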


If CPU is your bottleneck (and your problem can be parallelized), you can also consider using Python's multiprocessing module, the Disco project, or Parallel Python to make use of multiple cores and/or multiple machines.
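
If you go the multiprocessing route, here is a minimal sketch with multiprocessing.Pool, assuming the CPU-bound work can be split into independent chunks; summarise(), chunked(), and fetch_documents() are hypothetical placeholders for your own logic:

    from multiprocessing import Pool

    def summarise(chunk):
        # Hypothetical CPU-bound work on one chunk of documents.
        return sum(doc.get("value", 0) for doc in chunk)

    def chunked(seq, size):
        # Split the result list into chunks that workers can process in parallel.
        return [seq[i:i + size] for i in range(0, len(seq), size)]

    if __name__ == "__main__":
        docs = fetch_documents()        # hypothetical: list of dicts from MongoDB
        with Pool() as pool:            # one worker per CPU core by default
            partials = pool.map(summarise, chunked(docs, 10000))
        total = sum(partials)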
