Run Time Critical, reading operation of CSV files in C

2023-02-09 05:49

Is there a swift, efficient way to read CSV files in code? (The point to note here: I am talking about a CSV file with a million+ lines.)

The Run Time is the critical metric here.

One resource on the internet concentrated on using binary file operations to read in bulk, but I am not sure whether that would help with reading CSV files.

There are other methods as well, such as the code Robert Gamble published on SourceForge. Is there a way to write it using native functions?

Edit: Let's split the question up more clearly:

Is there an efficient (run-time-critical) way to read files in C? (In this case, a .csv file a million rows long.)
Is there a swift, efficient way to parse a CSV file?

There is no single way of reading and parsing any type of file that is fastest all the time. However, you might want to build a Ragel grammar for CSVs; those tend to be pretty fast. You can adapt it to your specific type of CSV (comma-separated, ;-separated, numbers only, etc.) and perhaps skip over any data that you're not going to use. I've had good experience with dataset-specific SQL parsers that could skip over much of their input (database dumps).

Reading in bulk might be a good idea, but you should measure on actual data whether it's really faster than stdio buffering. Using binary I/O might speed things up a bit on Windows, but then you need to handle newlines somewhere else.

In my experience, parsing CSV files, even in a higher-level interpreted language, isn't usually the bottleneck. Huge amounts of data take a lot of space; CSV files are big, and most of the loading time is I/O, that is, the hard drive reading all those digits into memory.

So my strong advice is to consider compressing the CSVs. gzip does its job very efficiently: it compresses and restores CSV streams on the fly, which speeds up saving and loading by greatly decreasing file size and thus I/O time.

If you are developing under Unix, you can try this with no additional code at all by piping CSV input and output through gzip -c and gunzip -c. Just try it; for me it sped things up tens of times.
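If the program has to open the compressed file itself rather than read from a shell pipeline, one possible sketch is to read the decompressor's output through POSIX popen. The helper name count_lines_from_command and the file name data.csv.gz are made up for illustration, and this assumes a POSIX system with gunzip on the PATH:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* popen/pclose are POSIX, not ISO C */
#endif
#include <stdio.h>

/* Hypothetical helper: run a shell command and count the newline
   characters it writes to stdout. Returns -1 if the command cannot
   be started. A real reader would parse the stream instead. */
long count_lines_from_command(const char *command)
{
    FILE *pipe = popen(command, "r");
    if (pipe == NULL)
        return -1;

    long lines = 0;
    int c;
    while ((c = getc(pipe)) != EOF) {
        if (c == '\n')
            lines++;
    }
    pclose(pipe);
    return lines;
}
```

A call such as count_lines_from_command("gunzip -c data.csv.gz") would then read the decompressed stream. Whether this beats reading the uncompressed file depends on the disk and the CPU, so it is worth timing both.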

Set the input buffer to a much larger size than the default using setvbuf. This is the only thing that you can do in C to increase the read speed. Also do some timing tests, because there will be a point of diminishing returns beyond which there is no benefit in increasing the buffer size.
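A minimal sketch of the setvbuf call, assuming a fully buffered stream and letting the C library allocate the buffer. The helper names open_big_buffer and count_lines are made up for illustration, and the right buffer size has to come from timing on real data:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: open a file for reading and request a larger,
   fully buffered stdio buffer. Passing NULL as the buffer lets the
   library allocate it. Returns NULL if the file cannot be opened. */
FILE *open_big_buffer(const char *path, size_t buf_size)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return NULL;
    /* setvbuf must be called before any other operation on the stream.
       If it fails, the stream simply keeps its default buffer. */
    (void)setvbuf(fp, NULL, _IOFBF, buf_size);
    return fp;
}

/* Count newline characters, reading through the stdio buffer. */
long count_lines(FILE *fp)
{
    long lines = 0;
    int c;
    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }
    return lines;
}
```

For example, open_big_buffer("data.csv", 1 << 20) would request a 1 MiB buffer for a hypothetical data.csv.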

Outside of C, you can start by putting that .CSV file on an SSD, or by storing it on a compressed filesystem.

The best you can hope for is to haul large blocks of text into memory (or "memory map" a file), and process the text in memory.
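As a rough illustration of hauling the text into memory first, the sketch below reads a whole file into one heap buffer with a single fread and then scans it with memchr. The helper names are made up; it assumes the file fits in memory and that fseek to the end of a binary stream works on the platform, which is common but not guaranteed by the C standard. Memory mapping (for example POSIX mmap) is the alternative mentioned above and is not shown here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read an entire file into one heap buffer.
   On success returns the NUL-terminated buffer and stores its length
   in *len; the caller frees it. Returns NULL on any error. */
char *slurp_file(const char *path, size_t *len)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) { fclose(fp); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(buf); return NULL; }

    buf[got] = '\0';
    *len = got;
    return buf;
}

/* Count lines in the buffer by searching for '\n' with memchr.
   A final line without a trailing newline is counted too. */
size_t count_lines_in_buffer(const char *buf, size_t len)
{
    size_t lines = 0;
    const char *p = buf;
    const char *end = buf + len;
    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        lines++;
        if (nl == NULL)
            break;          /* last line had no newline */
        p = nl + 1;
    }
    return lines;
}
```

The same loop that finds each newline is where a parser would process the line that starts at p.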

The thorn in the side of efficiency is that text lines are variable-length records. Text is read until an end-of-line terminator is found, which in general means reading a character and checking whether it is the terminator. Many platforms and libraries try to make this more efficient by reading blocks of data and searching each block for the terminator.

Your CSV format further complicates the issue. In a CSV file, the fields are also variable length, so within each line you again have to search for a terminal character such as a comma, tab, or vertical bar.
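To make the per-field search concrete, here is a minimal sketch that splits one line in place at each delimiter. The function name split_fields is made up, and it deliberately ignores quoted fields and embedded delimiters, so it only suits simple, well-formed input:

```c
#include <stddef.h>

/* Hypothetical helper: split one line in place at each delimiter.
   Each delimiter is overwritten with '\0' and fields[i] is set to
   point at the start of field i. Scanning stops at end of line.
   Returns the number of fields stored (at most max_fields).
   Quoted fields are NOT handled. */
size_t split_fields(char *line, char delim, char **fields, size_t max_fields)
{
    size_t n = 0;
    char *p = line;

    if (max_fields == 0)
        return 0;
    fields[n++] = p;

    for (; *p != '\0'; p++) {
        if (*p == '\n' || *p == '\r') {
            *p = '\0';              /* strip the line terminator */
            break;
        }
        if (*p == delim) {
            *p = '\0';              /* terminate the current field */
            if (n == max_fields)
                break;              /* no room for further fields */
            fields[n++] = p + 1;
        }
    }
    return n;
}
```

Every byte of the line is inspected once, which is the scanning cost described above.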

If you want better performance, you will have to change the data layout to fixed field lengths and fixed record lengths. Pad fields if necessary; the application can remove the extra padding. Fixed-length records are very efficient to read: just read N bytes. No scanning, just copy them into a buffer.

Fixed length fields allow for random access into the record (or text line). The index into a field is constant and can be calculated easily. No searching required.
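A small sketch of what fixed-length records allow, under an invented layout: each record is 20 bytes with no separators, an 8-byte id followed by a 12-byte name, both padded with spaces. The layout, constants, and function names are assumptions for illustration only:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical layout: 8-byte id, then 12-byte name, space padded. */
enum { ID_OFF = 0, ID_LEN = 8, NAME_OFF = 8, NAME_LEN = 12, REC_LEN = 20 };

/* Read record number `index` (0-based) into rec by seeking straight
   to its offset. The stream should be opened in binary mode.
   Returns 1 on success, 0 on failure or if the record does not exist. */
int read_record(FILE *fp, long index, char rec[REC_LEN])
{
    if (fseek(fp, index * (long)REC_LEN, SEEK_SET) != 0)
        return 0;
    return fread(rec, 1, (size_t)REC_LEN, fp) == (size_t)REC_LEN;
}

/* Copy one field out of a record and strip trailing space padding.
   out must have room for len + 1 bytes. */
void copy_field(const char *rec, size_t off, size_t len, char *out)
{
    memcpy(out, rec + off, len);
    while (len > 0 && out[len - 1] == ' ')
        len--;
    out[len] = '\0';
}
```

Record i starts at byte i * REC_LEN and every field sits at a constant offset inside it, so neither lookup requires scanning for a terminator.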

In summary, variable-length records and fields are, by their nature, not the most efficient data structure, because time is wasted searching for terminal characters. Fixed-length records and fixed-length fields are more efficient since they don't require searching.

If your application is data intensive, perhaps restructuring the data will make the program more efficient.
