
Best way to work with large amounts of CSV data quickly

I have large CSV datasets (10M+ lines) that need to be processed. I have two other files that need to be referenced for the output; they contain data that enriches what we know about the millions of lines in the main CSV file. The goal is to output a new CSV file with each record merged with the additional information from the other files.

Imagine that the large CSV file contains transactions, but the customer information and billing information are recorded in two other files, and we want to output a new CSV that links each transaction to the customer ID, account ID, etc.

A colleague has a functional program written in Java to do this, but it is very slow. The reason, apparently, is that the CSV file with the millions of lines has to be walked through many, many times.

My question is (yes, I am getting to it): how should I approach this in Ruby? The goal is for it to be faster; the current run takes 18+ hours with very little CPU activity.

Can I load this many records into memory? If so, how should I do it?

I know this is a little vague. Just looking for ideas as this is a little new to me.


Here is some Ruby code I wrote to process large CSV files (~180 MB in my case).

https://gist.github.com/1323865

A standard FasterCSV.parse call that pulled everything into memory was taking over an hour. This got it down to about 10 minutes.

The relevant part is this:

require 'fastercsv'   # gem install fastercsv (on Ruby 1.9+ the built-in CSV library is the same code)

lines = []
IO.foreach('/tmp/zendesk_tickets.csv') do |line|
  lines << line
  if lines.size >= 1000
    # Try to parse the buffered chunk; if it fails (e.g. a quoted field spans
    # the chunk boundary), keep accumulating lines and try again next pass.
    lines = FasterCSV.parse(lines.join) rescue next
    store lines        # store (defined in the full gist) bulk-inserts the parsed rows
    lines = []
  end
end
# Don't forget the final partial batch.
store FasterCSV.parse(lines.join) unless lines.empty?

IO.foreach doesn't load the entire file into memory; it just steps through it with a buffer. Once 1000 lines have accumulated, it tries to parse them as CSV and insert just those rows. One tricky part is the "rescue next": if your CSV has fields that span multiple lines, you may need to grab a few more lines to get a valid, parseable CSV string; otherwise the line you're on could be in the middle of a field.

In the gist you can see one other nice optimization, which uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE. This lets you insert in bulk, and if a duplicate key is detected it simply overwrites the values in that row instead of inserting a new one. You can think of it as a create/update in one query. You'll need a unique index on at least one column for this to work.
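The bulk upsert looks roughly like the following sketch. The table name, column names, and sample rows are invented, and the mysql2 gem is an assumption (the gist may use a different client); the point is only the shape of the statement:

require 'mysql2'   # assumption: the mysql2 gem is used for the connection

client = Mysql2::Client.new(:host => 'localhost', :username => 'user',
                            :password => 'secret', :database => 'tickets_db')

# Example batch of parsed CSV rows ([id, subject, status]); real code would
# pass in the 1000-row chunks parsed above.
rows = [[1, 'printer is broken', 'open'], [2, 'cannot log in', 'closed']]

values = rows.map do |r|
  '(' + r.map { |v| "'#{client.escape(v.to_s)}'" }.join(', ') + ')'
end.join(', ')

# One statement inserts the whole batch; any row whose unique key already
# exists is updated in place instead of raising a duplicate-key error.
# Requires a unique index (e.g. a primary key) on the id column.
client.query(<<-SQL)
  INSERT INTO tickets (id, subject, status)
  VALUES #{values}
  ON DUPLICATE KEY UPDATE subject = VALUES(subject), status = VALUES(status)
SQL

Batching around 1000 rows per statement keeps the number of round-trips low; the exact batch size is a tuning knob.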


How about using a database?

Jam the records into tables, and then query them out using joins.

The import might take a while, but the DB engine is optimized for the join and retrieval part...
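For example, a rough SQLite version in Ruby might look like this; the sqlite3 gem, the table layouts, and the column names are all assumptions made up for illustration:

require 'sqlite3'     # gem install sqlite3 -- assumed here
require 'fastercsv'

db = SQLite3::Database.new('merge.db')
db.execute('CREATE TABLE IF NOT EXISTS transactions (id TEXT, customer_id TEXT, amount TEXT)')
db.execute('CREATE TABLE IF NOT EXISTS customers (customer_id TEXT, name TEXT, account_id TEXT)')
db.execute('CREATE INDEX IF NOT EXISTS idx_cust ON customers (customer_id)')

# Bulk-import each CSV inside a single transaction; without the transaction
# SQLite commits every insert individually and the import crawls.
db.transaction do
  FasterCSV.foreach('transactions.csv') do |row|
    db.execute('INSERT INTO transactions VALUES (?, ?, ?)', row[0, 3])
  end
end
# ...import customers.csv the same way...

# Let the database do the join, then stream the result straight into the output CSV.
FasterCSV.open('merged.csv', 'w') do |out|
  db.execute('SELECT t.id, t.amount, c.customer_id, c.account_id
              FROM transactions t JOIN customers c ON c.customer_id = t.customer_id') do |row|
    out << row
  end
end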


10M+ rows doesn't really sound like that much. If you can preload the contents of the files and match up the data in memory with decent data structures (you'll want maps at some point), you won't have to keep running through the CSV files over and over. File access is SLOW.
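A minimal sketch of the in-memory lookup idea, assuming a small reference file whose first column is a customer ID (the file name, column layout, and the naive comma split are simplifications for illustration):

# Build the lookup map once; after that every lookup is O(1),
# instead of rescanning a file for every transaction.
customers = {}
File.foreach('customers.csv') do |line|
  fields = line.chomp.split(',')   # naive split; only safe if fields contain no quoted commas
  customers[fields[0]] = fields[1..-1]
end

customers['CUST-42']   # => the extra columns for that customer, or nil if unknown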


Two reasonably fast options:

  1. Put your data into an SQLite DB. Then it's a simple query with a pair of joins that will perform far faster than anything you could write yourself; SQL is very good at this kind of task.

  2. Assuming your additional CSV files are small enough to fit into RAM, you can read everything into a hash keyed by customer ID, then look up that hash while processing the main file with its 10M+ records. Note that only the lookup data needs to be in RAM; the main list can be processed in small batches, as in the sketch after this list.
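A rough sketch of option 2; the file names, the column positions, and the use of FasterCSV are assumptions for illustration:

require 'fastercsv'

# Load the small reference files into hashes keyed by customer ID.
customers = {}
FasterCSV.foreach('customers.csv') { |row| customers[row[0]] = row }
accounts  = {}
FasterCSV.foreach('accounts.csv')  { |row| accounts[row[0]]  = row }

# Stream the big transactions file and write each merged row immediately,
# so only the two lookup hashes ever live in RAM.
FasterCSV.open('merged.csv', 'w') do |out|
  FasterCSV.foreach('transactions.csv') do |txn|
    cust = customers[txn[1]]   # assume the customer ID is in column 1
    acct = accounts[txn[1]]
    extra = (cust ? cust[1..-1] : []) + (acct ? acct[1..-1] : [])
    out << (txn + extra)
  end
end

This reads the 10M-row file exactly once, which is the main win over repeatedly walking the big CSV.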


My experience is that with Ruby, you should expect about 10x the memory usage of the actual payload. Of course, with current amounts of RAM, if the process loads only one file at a time, 10 MB is almost negligible even when multiplied by ten :)

If you can read one line at a time (which is easy with File instances), you could use FasterCSV and write one line at a time as well. That would make memory consumption O(1) instead of O(n). But with 10-megabyte files you can probably slurp the whole file into memory and write it out as CSV in one pass, given that only a few such processes run at any given time.
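Something like this keeps memory flat no matter how long the input is (file names and the per-row transformation are placeholders):

require 'fastercsv'

# One row in, one row out: memory use stays constant regardless of file size.
FasterCSV.open('output.csv', 'w') do |out|
  FasterCSV.foreach('input.csv') do |row|
    out << row.map { |field| field.to_s.strip }   # placeholder per-row transformation
  end
end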


If you already have a Java program written, make sure it uses the NIO libraries; they are far faster than the default I/O. I have processed text files with 500,000 lines using NIO before.
