
How do you process a large data file, such as one that is 10 GB in size?

I found this open question online: how do you process a large data file, such as one that is 10 GB in size? It seems to be an interview question. Is there a systematic way to answer this type of question?


If you're interested, you should check out Hadoop and MapReduce, which were created with big (BIG) datasets in mind.
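Hadoop is a whole framework, but the pattern it automates can be sketched in a few lines of plain Python. Below is a toy, single-process illustration of the map/shuffle/reduce idea using word count as the example; the file path is a placeholder, and this is not Hadoop code itself:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (key, value) pairs for a single input record.
    for word in line.split():
        yield word, 1

def reduce_phase(key, values):
    # Combine every value that shares a key.
    return key, sum(values)

def word_count(path):
    groups = defaultdict(list)           # the "shuffle": group values by key
    with open(path, "r", encoding="utf-8") as f:
        for line in f:                   # stream the file, never load it whole
            for key, value in map_phase(line):
                groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())
```

On a real cluster the map and reduce phases run on many machines at once, which is what makes the approach practical for 10 GB and far beyond.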

Otherwise, chunking or streaming the data is a good way to keep the amount held in memory small at any one time.
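As a minimal sketch of the chunking idea, a generator like the following holds only one fixed-size block in memory at a time (the chunk size and file name are arbitrary):

```python
def read_in_chunks(path, chunk_size=64 * 1024 * 1024):
    # Yield the file one 64 MB block at a time instead of reading it whole.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Example: count the bytes of a 10 GB file without ever holding it in memory.
# total = sum(len(chunk) for chunk in read_in_chunks("big_file.bin"))
```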


I have used stream-based processing in such cases. One example was when I had to download a fairly large (in my case ~600 MB) CSV file from an FTP server, extract the records it contained, and put them into a database. I combined three streams, each reading from the next:

  • A database inserter which read a stream of records from
  • a record factory which read a stream of text from
  • an FTP reader class which downloaded the FTP stream from the server.

That way I never had to store the entire file locally, so it should work with arbitrarily large files.
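The original code isn't shown, but the same chained-streams idea can be sketched in Python with only the standard library; the host, credentials, paths, and table layout below are placeholders:

```python
import csv
import ftplib
import io
import sqlite3

def stream_csv_from_ftp(host, user, password, remote_path, db_path):
    # Chain three streams: FTP download -> CSV record parsing -> DB inserts.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (col_a TEXT, col_b TEXT)")

    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        # transfercmd returns the raw data socket for the transfer; wrap it as
        # a text stream and feed it straight into the CSV parser.
        with ftp.transfercmd(f"RETR {remote_path}") as sock:
            text = io.TextIOWrapper(sock.makefile("rb"), encoding="utf-8")
            for row in csv.reader(text):
                if len(row) >= 2:
                    conn.execute("INSERT INTO records VALUES (?, ?)", row[:2])
        ftp.voidresp()  # consume the end-of-transfer reply

    conn.commit()
    conn.close()
```

Nothing is ever written to the local disk and only a small buffer of data is in flight at a time, which is why the approach scales to arbitrarily large files.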


It would depend on the file and how the data in it is related. If you have a bunch of independent records that need to be processed and written out to a database or another file, it can be beneficial to multi-thread the job: have one thread read records and hand them off to a pool of worker threads that do the time-consuming processing and produce the appropriate output.
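A minimal sketch of that reader/worker split, using Python's queue and threading modules; the per-record work is a stub, and note that in CPython a process pool would be needed if the processing is CPU-bound rather than I/O-bound:

```python
import queue
import threading

NUM_WORKERS = 4
SENTINEL = None  # tells a worker there is nothing left to do

def reader(path, work_queue):
    # Single reader thread: stream records (one per line here) into the queue.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            work_queue.put(line)
    for _ in range(NUM_WORKERS):
        work_queue.put(SENTINEL)

def worker(work_queue, results_queue):
    # Worker thread: do the time-consuming processing of each record.
    while True:
        record = work_queue.get()
        if record is SENTINEL:
            break
        results_queue.put(record.upper())   # placeholder for the real work

def process_file(path):
    work_q = queue.Queue(maxsize=10_000)    # bounded, so the reader can't race ahead
    results_q = queue.Queue()
    threads = [threading.Thread(target=reader, args=(path, work_q))]
    threads += [threading.Thread(target=worker, args=(work_q, results_q))
                for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results_q
```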


In addition to what Bill Carey said, the type of file determines not only what the "meaningful chunks" are, but also what "processing" means.

In other words, what you do to process the data, and how you decide what to process, will vary tremendously.


What separates a "large" data file from a small one is--broadly speaking--whether you can fit the whole file into memory or whether you have to load portions of the file from the disk one at a time.

If the file is so large that you can't load the whole thing into memory, you can process it by identifying meaningful chunks of the file, then reading and processing them serially. How you define "meaningful chunks" will depend very much on the type of file (e.g. binary image files will require different processing than massive XML documents).
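For the XML case, one concrete way to get meaningful chunks is to treat each element subtree as a chunk and discard it once processed; a minimal sketch with the standard library, where the tag name is a placeholder:

```python
import xml.etree.ElementTree as ET

def process_large_xml(path, tag="record"):
    # Stream the document element by element; memory stays flat because each
    # subtree is cleared as soon as it has been handled.
    count = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            count += 1          # placeholder for real per-record processing
            elem.clear()        # free the subtree we just processed
    return count
```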


Look for opportunities to split the file up so that it can be tackled by multiple processes. You don't say whether the records in the file are related; if they are, the problem gets harder, but the solution is in principle the same: identify mutually exclusive partitions of the data that can be processed in parallel.

A while back I needed to process hundreds of millions of test data records for some performance testing I was doing on a massively parallel machine. I used some Perl to split the input file into 32 parts (to match the number of CPUs) and then spawned 32 processes, each transforming the records in one file.

Because this job ran across the 32 processors in parallel, it took minutes rather than the hours it would have taken serially. I was lucky, though, in that there were no dependencies between any of the records in the file.
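The Perl itself isn't shown, but the same split-and-parallelise idea can be sketched in Python: compute byte ranges aligned to line boundaries, then let a process pool handle each range independently. This assumes one record per line and no dependencies between records, as in the job described above:

```python
import os
from multiprocessing import Pool

def find_ranges(path, n_parts):
    # Split the file into n_parts byte ranges, each starting on a line boundary.
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_parts):
            f.seek(i * size // n_parts)
            f.readline()                 # advance to the next line boundary
            offsets.append(f.tell())
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))

def process_range(args):
    # Transform every record in one byte range; here we just count lines.
    path, start, end = args
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            count += 1                   # placeholder for the real transform
    return count

def process_in_parallel(path, n_parts=32):
    ranges = find_ranges(path, n_parts)
    with Pool(n_parts) as pool:
        return sum(pool.map(process_range, [(path, s, e) for s, e in ranges]))
```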
