Fast/low-memory method to parse first two columns in a large csv file using c#
I'm parsing a large csv files - about 500 meg (many rows, many columns). I only need the first two columns (so up to the second comma on each line). Also, multiple threads need access to this file at the same time, so I can't take an exclusive开发者_如何学Python lock.
What's the fastest/least memory consuming approach to this problem? What classes/methods should I be looking at? I assume that I should stay as low-level as possible - reading character by character, line by line?
Perhaps this is a way to allow simultaneous access?
using ( var filestream = new FileStream( filePath , FileMode.Open , FileAccess.Read , FileShare.Read ) )
{
using ( var reader = new StreamReader( filestream ) )
{
...
}
}
Edit
Decided to check out http://www.codeproject.com/KB/database/CsvReader.aspx which seems to give me the ability to read just two columns and then skip to the next line. They also have some benchmarks showing fast performance and low memory profile.If you want low memory, you'll probably use a StreamReader and ReadLine by line.
In a similar case the other day, I was able to skip the first 20,000,000 lines in a 500 MB file and build a string (using StringBuilder) for the next 1,000,000 lines in about 7 seconds.
Assuming that the file contains ASCII encoded text (would be typical for csv), your best bet may be to use Stream directly and the Stream.Read method, which allows you to read into a pre-allocated buffer. This has a few advantages:
You only allocate a buffer once, whereas ReadLine() will create a new String for every line.
You don't have to perform the Unicode conversion for the entire line; you can either do this only for the portion up to the second comma or (if you're severely time-constrained), you can write your own numeric parser that operates on the ASCII string data in the buffer (I'm sure there are well-documented algorithms for doing this.) This is assuming you need numeric data, of course.
Additional methods you'll likely need include the ASCII Encoding methods, particularly Encoding.ASCII.GetString
.
精彩评论