CSV Random Access; C#
I have a 10 GB CSV file which is essentially a huge square matrix. I am trying to write a function that can access a single cell of the matrix as efficiently as possible, i.e. matrix[12345, 20000].
Given its size, it is obviously not possible to load the entire matrix into a 2D array; I need to somehow read the values directly from the file.
I have Googled around looking at random file access using FileStream.Seek, but unfortunately, because of variable rounding, each cell isn't a fixed width, so it is not possible to seek to a specific byte and know which cell I'm looking at by arithmetic alone.
I considered scanning the file and creating a lookup table for the index of the first byte of each row. That way, if I wanted to access matrix[12345, 20000], I would seek to the start of row 12345 and then scan across the line, counting commas until I reach the correct cell.
I am about to try this, but has anyone else got any better ideas? I'm sure I wouldn't be the first person to try and deal with a file like this.
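For reference, here is a minimal, untested sketch of the lookup-table approach I have in mind (class and method names are my own; the index is only one long per row, so it stays small):

```csharp
using System.Collections.Generic;
using System.IO;

class CsvCellIndex
{
    private readonly string _path;
    private readonly List<long> _rowOffsets = new List<long>();

    public CsvCellIndex(string path)
    {
        _path = path;
        // Single pass over the file, recording the byte offset of each row start.
        using (var fs = new BufferedStream(File.OpenRead(path)))
        {
            _rowOffsets.Add(0);
            long pos = 0;
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n')
                    _rowOffsets.Add(pos);
            }
        }
    }

    // Seek straight to the start of the row, then walk across it to the column.
    public string GetCell(int row, int col)
    {
        using (var fs = File.OpenRead(_path))
        {
            fs.Seek(_rowOffsets[row], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs))
                return reader.ReadLine().Split(',')[col];
        }
    }
}
```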
Cheers
Edit: I should note that the file contains a very sparse matrix. If parsing the CSV file ends up being too slow, I would consider converting it to a format that is more appropriate and easier to process. What is the best way to store a sparse matrix?
I have used the Lumenworks CSV reader for quite large CSV files; it may be worth a quick look to see how quickly it can parse your file.
Lumenworks CSV
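A minimal usage sketch, assuming the usual LumenWorks CsvReader API (ReadNextRecord plus an integer indexer); the file name is a placeholder:

```csharp
using System.IO;
using LumenWorks.Framework.IO.Csv;

class Program
{
    static void Main()
    {
        // hasHeaders = false: the file is pure matrix data.
        using (var csv = new CsvReader(new StreamReader("matrix.csv"), false))
        {
            while (csv.ReadNextRecord())
            {
                // Fields of the current record are accessed by index;
                // only one row is held in memory at a time.
                string firstCell = csv[0];
            }
        }
    }
}
```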
First of all, how do you want to refer to a particular row? By its index, meaning you have another table or something that tells you which row you are interested in? Or by some kind of id?
These ideas come to mind:
- Your approach
- Binary search. Assuming the rows have a roughly uniform length (file size / row count), you can use a binary search to find a row, provided each row starts with an ordered identifier that tells you whether you have hit or missed (see the sketch after this list).
- Loading it into a database! By the way, what prevents you from doing that? You can even use SQL Server Express, which is free, and to get around its size limit you can shard your data across multiple databases.
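For the binary-search idea, a rough, untested sketch, assuming each row starts with an ordered numeric id and '\n' line endings (all names are hypothetical):

```csharp
using System.IO;
using System.Text;

static class CsvBinarySearch
{
    // Returns the byte offset of the row whose leading id matches, or -1.
    public static long FindRow(string path, long targetId)
    {
        using (var fs = File.OpenRead(path))
        {
            long lo = 0, hi = fs.Length;
            while (lo < hi)
            {
                long mid = (lo + hi) / 2;
                long rowStart = NextLineStart(fs, mid);
                if (rowStart >= fs.Length) { hi = mid; continue; }

                long id = ReadLeadingId(fs, rowStart);
                if (id == targetId) return rowStart;
                if (id < targetId) lo = rowStart + 1;  // target is further down
                else hi = mid;                         // target is further up
            }
            return -1;
        }
    }

    // Advance from an arbitrary offset to the start of the following line.
    private static long NextLineStart(FileStream fs, long offset)
    {
        if (offset == 0) return 0;
        fs.Seek(offset, SeekOrigin.Begin);
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n') { }
        return fs.Position;
    }

    // Parse the first comma-separated field of the row as a long.
    private static long ReadLeadingId(FileStream fs, long rowStart)
    {
        fs.Seek(rowStart, SeekOrigin.Begin);
        var sb = new StringBuilder();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != ',' && b != '\r' && b != '\n')
            sb.Append((char)b);
        return long.Parse(sb.ToString());
    }
}
```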
An index file would be the best you could do, I bet. With rows of unknown size, there is no way to skip directly to a line other than either scanning the file or keeping an index.
The only question is how large your index gets. If it is too large, you could shrink it by indexing only every 5th line (for example) and scanning within a range of 5 lines.
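A sketch of that sparse-index idea, assuming '\n' line endings (the stride and the names are arbitrary):

```csharp
using System.Collections.Generic;
using System.IO;

class SparseRowIndex
{
    private const int Stride = 1000;  // index every 1000th row; tune to taste
    private readonly List<long> _offsets = new List<long>();
    private readonly string _path;

    public SparseRowIndex(string path)
    {
        _path = path;
        using (var fs = new BufferedStream(File.OpenRead(path)))
        {
            _offsets.Add(0);  // row 0 starts at byte 0
            long pos = 0;
            int row = 0, b;
            while ((b = fs.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n' && ++row % Stride == 0)
                    _offsets.Add(pos);
            }
        }
    }

    public string ReadRow(int row)
    {
        using (var fs = File.OpenRead(_path))
        {
            // Jump to the nearest indexed row at or before the target...
            fs.Seek(_offsets[row / Stride], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs))
            {
                // ...then skip at most Stride - 1 lines to reach it.
                for (int i = 0; i < row % Stride; i++)
                    reader.ReadLine();
                return reader.ReadLine();
            }
        }
    }
}
```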
Pre-process the file so that the fields are fixed width. Then you can do your random read easily.
From doing similar sorts of things in the past, you should be able to write some simple code that reads the 10 GB variable-width file from a local disk and writes a 10 GB fixed-width file back to a local disk in roughly 20 minutes. Whether that up-front time investment pays off depends on how many random reads you need to do and how often the file to be read changes.
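A rough sketch of both halves, assuming ASCII data and an arbitrary padded cell width of 24 bytes:

```csharp
using System.IO;
using System.Text;

static class FixedWidthMatrix
{
    private const int CellWidth = 24;  // bytes per cell after padding (assumption)

    // One-off conversion: pad every cell so that offsets become pure arithmetic.
    public static void Convert(string csvPath, string fixedPath)
    {
        using (var reader = new StreamReader(csvPath))
        using (var writer = new StreamWriter(fixedPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                foreach (var cell in line.Split(','))
                    writer.Write(cell.PadRight(CellWidth));
            // No newlines needed: each row is exactly cols * CellWidth bytes.
        }
    }

    // A random read is now a single seek plus a small fixed-size read.
    public static string ReadCell(string fixedPath, int row, int col, int cols)
    {
        using (var fs = File.OpenRead(fixedPath))
        {
            fs.Seek(((long)row * cols + col) * CellWidth, SeekOrigin.Begin);
            var buf = new byte[CellWidth];
            int n = fs.Read(buf, 0, CellWidth);
            return Encoding.ASCII.GetString(buf, 0, n).TrimEnd();
        }
    }
}
```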
What if you created 12,345 separate files that are read with lazy instantiation? Each file would only be read if its data was needed. If the data is completely sparse, you could create a data structure with an IsEmpty bool property.
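For instance, with one file per row and Lazy&lt;T&gt; (the rows/row_i.csv layout and the names are hypothetical):

```csharp
using System;
using System.IO;

class LazyRowStore
{
    private readonly Lazy<string[]>[] _rows;

    public LazyRowStore(int rowCount, string dir)
    {
        _rows = new Lazy<string[]>[rowCount];
        for (int i = 0; i < rowCount; i++)
        {
            int row = i;  // capture the loop variable for the closure
            _rows[i] = new Lazy<string[]>(
                () => File.ReadAllText(Path.Combine(dir, "row_" + row + ".csv")).Split(','));
        }
    }

    // The row's file is read from disk only the first time a cell in it is requested.
    public string GetCell(int row, int col)
    {
        return _rows[row].Value[col];
    }
}
```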
Do you need to access the same element over and over or do you need to just read each element once?
I disagree that you shouldn't load the file into RAM, especially if you use a 64-bit OS.
It shouldn't be a problem to allocate a matrix of size 12345 × 20000: that's only about 1.9 GB in double precision (12345 × 20000 × 8 bytes). And in fact, even if it were bigger, I would still recommend this approach on a 64-bit platform (see "virtual memory").
Secondly, you stated that your matrix is sparse, so you could load it into RAM but use a sparse representation to save some memory.
In conclusion, if your application makes many accesses to the matrix and performance is somewhat important, putting it in RAM would definitely be my favourite approach.
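For example, a dictionary-of-keys representation built while streaming the file (the names and the zero test are my own; adjust the parsing to your data):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class SparseMatrix
{
    // Only non-zero cells cost memory; missing keys are the implicit zeros.
    private readonly Dictionary<long, double> _cells = new Dictionary<long, double>();
    private readonly int _cols;

    public SparseMatrix(string csvPath, int cols)
    {
        _cols = cols;
        int row = 0;
        foreach (var line in File.ReadLines(csvPath))  // streams; never holds the whole file
        {
            var fields = line.Split(',');
            for (int col = 0; col < fields.Length; col++)
            {
                double v = double.Parse(fields[col]);
                if (v != 0.0)
                    _cells[(long)row * _cols + col] = v;
            }
            row++;
        }
    }

    public double this[int row, int col]
    {
        get
        {
            double v;
            return _cells.TryGetValue((long)row * _cols + col, out v) ? v : 0.0;
        }
    }
}
```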