
Ruby or Python for heavy import script?

I have an application I wrote in PHP (on Symfony) that imports large CSV files (up to 100,000 lines). It has a real memory usage problem: once it gets through about 15,000 rows, it grinds to a halt.

I know there are measures I could take within PHP but I'm kind of done with PHP, anyway.

If I wanted to write an app that imports CSV files, do you think there would be any significant difference between Ruby and Python? Is either one geared more toward import-related tasks? I realize I'm asking a question based on very little information. Feel free to ask me to clarify things, or just speak really generally.

If it makes any difference, I really like Lisp and I would prefer the Lispier of the two languages, if possible.


What are you importing the CSV file into? Couldn't you parse the CSV file in a way that doesn't load the whole thing into memory at once (i.e. work with one line at a time)?

If so, then you can use Ruby's standard CSV library to do something like the following:

require 'csv'

# CSV.foreach reads and yields one parsed row at a time, so the
# whole file never has to fit in memory.
CSV.foreach('csvfile.csv') do |row|
  p row  # executes once for each row
end

Now don't take this answer as an immediate reason to switch to Ruby. I'd be very surprised if PHP didn't have similar functionality in its CSV library, so you should investigate PHP more thoroughly before deciding that you need to switch languages.


What are you importing the CSV file into? Couldn't you parse the CSV file in a way that doesn't load the whole thing into memory at once (i.e. work with one line at a time)?

If so, then you can use Python's standard csv library to do something like the following:

import csv

# The csv reader is an iterator: it parses one row at a time, so the
# whole file never has to fit in memory.
with open('csvfile.csv', newline='') as source:
    reader = csv.reader(source)
    for row in reader:
        print(row)  # do whatever with each row here

Now don't take this answer as an immediate reason to switch to Python. I'd be very surprised if PHP didn't have similar functionality in its CSV library, etc.


The equivalent in Python (wait for it):

import csv

# csv.reader is lazy: it yields one parsed row at a time.
with open("some.csv", newline='') as source:
    for row in csv.reader(source):
        print(row)

This code does not load the entire CSV file into memory first; instead, it parses the file line by line with iterators. I bet your problem is happening after each line is read, where you are somehow buffering the data (by storing it in a dictionary or array of some sort).

When dealing with big data, you need to discard the data as fast as you can and buffer as little as possible. In the example above, print is doing just that: it performs some operation on the line of data but doesn't store or buffer any of it, so Python's garbage collector can free each row as soon as the loop moves on to the next one.
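To make that concrete, here is a minimal sketch of a full streaming import. The file name, column names, and SQLite target are all assumptions for illustration; the point is that each row goes straight to the database and nothing accumulates on the Python side, so memory use stays flat however long the file is.

import csv
import sqlite3

# Hypothetical target: a two-column table in a local SQLite file.
conn = sqlite3.connect('import.db')
conn.execute('CREATE TABLE IF NOT EXISTS people (name TEXT, email TEXT)')

# Assumes a hypothetical people.csv with exactly two columns per row.
with open('people.csv', newline='') as source:
    for row in csv.reader(source):
        # Each row is written out immediately and never kept around.
        conn.execute('INSERT INTO people (name, email) VALUES (?, ?)', row)

conn.commit()  # a single commit at the end keeps the import fast
conn.close()

If one big transaction feels risky for a 100,000-line file, committing every few thousand rows is a reasonable middle ground.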

I hope this helps.


I think the problem is that you are loading the whole CSV into memory at once. If that is the case, then I am sure Python or Ruby is going to blow up on you in exactly the same way. I am a big fan of Python, but that is just a personal opinion.
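To sketch the distinction, using a hypothetical big.csv: the first form below holds every parsed row in memory at once, which is the pattern that blows up in any language; the second keeps only one row alive at a time.

import csv

# The pattern that blows up: the whole file is parsed into one big
# list before any work starts, so memory grows with the file.
with open('big.csv', newline='') as f:
    all_rows = list(csv.reader(f))

# The streaming pattern: only one row is alive at a time, so memory
# use stays flat regardless of file size.
with open('big.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)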
