Best way to work with large amounts of CSV data quickly
I have large CSV datasets (10M+ lines) that need to be processed. I have two other files that need to be referenced for the output; they contain data that supplements what we know about the millions of lines in the CSV file. The goal is to output a new CSV file that has each record merged with the additional information from the other files.
Imagine that the large CSV file has transactions but the customer information and billing information is recorded in two other files and we want to output a new CSV that has each transaction linked to the customer ID and account ID, etc.
A colleague has a functional program written in Java to do this, but it is very slow. The reason, apparently, is that the CSV file with the millions of lines has to be walked through many, many, many times.
My question is (yes, I am getting to it): how should I approach this in Ruby? The goal is for it to be faster; right now it takes 18+ hours with very little CPU activity.
Can I load this many records into memory? If so, how should I do it?
I know this is a little vague. Just looking for ideas as this is a little new to me.
Here is some Ruby code I wrote to process large CSV files (~180 MB in my case).
https://gist.github.com/1323865
A standard FasterCSV.parse pulling it all into memory was taking over an hour. This got it down to about 10 minutes.
The relevant part is this:
require 'fastercsv'  # `store` below is a helper defined in the gist

lines = []
IO.foreach('/tmp/zendesk_tickets.csv') do |line|
  lines << line
  if lines.size >= 1000
    # If the 1000-line chunk ends mid-field, parsing fails and we keep
    # accumulating lines until the chunk is valid CSV again
    lines = FasterCSV.parse(lines.join) rescue next
    store lines
    lines = []
  end
end
store lines  # flush the final partial chunk
IO.foreach doesn't load the entire file into memory; it just steps through it with a buffer. When it gets to 1000 lines, it tries parsing that chunk as CSV and inserting just those rows. The tricky part is the "rescue next": if your CSV has fields that span multiple lines, you may need to grab a few more lines to get a valid, parseable CSV string. Otherwise the line you're on could be in the middle of a field.
In the gist you can see one other nice optimization, which uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE. This allows you to insert in bulk, and if a duplicate key is detected it simply overwrites the values in that row instead of inserting a new row. You can think of it like a create/update in one query. You'll need to set a unique index on at least one column for this to work.
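For reference, a minimal sketch of what such a bulk upsert can look like from Ruby. This assumes the mysql2 gem, an invented table layout, and a hypothetical store helper; the gist itself may do it differently.

require 'mysql2'

client = Mysql2::Client.new(:host => 'localhost', :username => 'user', :database => 'tickets_db')

# Batch "insert or overwrite": needs a UNIQUE index on ticket_id so that
# re-imported rows update in place instead of piling up as duplicates
def store(client, rows)
  return if rows.empty?
  values = rows.map do |row|
    '(' + row.map { |field| "'#{client.escape(field.to_s)}'" }.join(',') + ')'
  end.join(',')

  client.query(<<-SQL)
    INSERT INTO tickets (ticket_id, customer_id, amount)
    VALUES #{values}
    ON DUPLICATE KEY UPDATE customer_id = VALUES(customer_id), amount = VALUES(amount)
  SQL
end

The VALUES() references in the ON DUPLICATE KEY UPDATE clause tell MySQL to reuse the values from the row that would have been inserted.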
How about using a database?
Jam the records into tables, then query them out using joins.
The import might take a while, but the DB engine will be optimized for the join and retrieval part...
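A rough sketch of that approach using the sqlite3 gem and the standard CSV library (FasterCSV on Ruby 1.8); the table layout and file names here are invented for illustration.

require 'csv'      # FasterCSV on Ruby 1.8
require 'sqlite3'

db = SQLite3::Database.new('merge.db')
db.execute('CREATE TABLE IF NOT EXISTS transactions (txn_id TEXT, customer_id TEXT, amount TEXT)')
db.execute('CREATE TABLE IF NOT EXISTS customers (customer_id TEXT PRIMARY KEY, name TEXT)')
db.execute('CREATE INDEX IF NOT EXISTS idx_txn_customer ON transactions (customer_id)')

# Bulk-load each CSV inside a transaction so the import stays reasonably fast
db.transaction do
  CSV.foreach('transactions.csv') { |row| db.execute('INSERT INTO transactions VALUES (?, ?, ?)', row) }
  CSV.foreach('customers.csv')    { |row| db.execute('INSERT INTO customers VALUES (?, ?)', row) }
end

# Let the database do the join, then stream the merged rows back out to CSV
join_sql = <<-SQL
  SELECT t.txn_id, t.amount, c.customer_id, c.name
  FROM transactions t
  JOIN customers c ON c.customer_id = t.customer_id
SQL

CSV.open('merged.csv', 'w') do |out|
  db.execute(join_sql) { |row| out << row }
end

Wrapping the inserts in a transaction matters a lot for import speed, and the index on the join column keeps the final query fast.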
10M+ rows doesn't really sound like that much. If you can preload the contents of the files and match up the data in memory with decent data structures (you'll want maps at some point), you won't have to keep running through the CSV files over and over. File access is SLOW.
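A minimal sketch of that preloading step, assuming the two reference files are small and keyed by an ID in their first column (file names and layout are made up for illustration):

require 'csv'  # use FasterCSV on Ruby 1.8

# Load the two small reference files into hashes keyed by ID so each
# transaction later needs only a hash lookup, not another pass over a file
customers = {}
CSV.foreach('customers.csv') do |row|
  customers[row[0]] = row
end

billing = {}
CSV.foreach('billing.csv') do |row|
  billing[row[0]] = row
end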
Two reasonably fast options:
1. Put your data into an SQLite DB. Then it's a simple query with a pair of joins that would perform way faster than anything you could write yourself; SQL is very good for this kind of task.
2. Assuming your additional CSV files are small enough to fit into RAM, you can read everything into a hash, using customer ID as a key, then look up that hash when processing the main file with 10M+ records. Note that it's only necessary to put the lookup data into RAM; the main list can be processed in small batches.
My experience is that with Ruby, you should prepare for about 10x the memory usage of the actual payload. Of course, with current amounts of RAM, if the process loads only one file at a time, 10 MB is almost negligible even when multiplied by ten :)
If you can read one line at a time (which is easy with File instances), you could use FasterCSV and write one line at a time as well. That would make memory consumption O(1) instead of O(n). But with 10-megabyte files you can probably slurp the whole file into memory and write it out to CSV in one pass, given that only a few processes are running at any given time.
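A bare-bones sketch of that line-at-a-time pattern with the standard CSV library (FasterCSV on Ruby 1.8); file names are placeholders:

require 'csv'  # use FasterCSV on Ruby 1.8

# One row read, one row written per iteration, so memory stays roughly
# constant no matter how large the input file grows
CSV.open('merged.csv', 'w') do |out|
  CSV.foreach('transactions.csv') do |row|
    out << row  # replace with the per-row lookup/merge you actually need
  end
end

Combine this with lookup hashes like the ones sketched above and the whole merge becomes a single streaming pass over the big file.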
If you already have a Java program written, make sure you use the NIO libraries. They are way faster than the default. I have processed text files with 500,000 lines using the NIO libraries before.