
How do you process a large data file, such as one that is 10 GB in size?

I found this open question online: how do you process a large data file, such as one that is 10 GB in size? It seems to be an interview question. Is there a systematic way to answer this type of question?


If you're interested, you should check out Hadoop and MapReduce, which were created with big (BIG) datasets in mind.
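Hadoop is a whole framework, but the pattern it automates can be sketched in a few lines of plain Python. Below is a toy, single-process illustration of the map/shuffle/reduce idea using word count as the example; the file path is a placeholder, and this is not Hadoop code itself:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (key, value) pairs for a single input record.
    for word in line.split():
        yield word, 1

def reduce_phase(key, values):
    # Combine every value that shares a key.
    return key, sum(values)

def word_count(path):
    groups = defaultdict(list)           # the "shuffle": group values by key
    with open(path, "r", encoding="utf-8") as f:
        for line in f:                   # stream the file, never load it whole
            for key, value in map_phase(line):
                groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())
```

On a real cluster the map and reduce phases run on many machines at once, which is what makes the approach practical for 10 GB and far beyond.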

Otherwise, chunking or streaming the data is a good way to keep the amount held in memory small at any one time.
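As a minimal sketch of the chunking idea, a generator like the following holds only one fixed-size block in memory at a time (the chunk size and file name are arbitrary):

```python
def read_in_chunks(path, chunk_size=64 * 1024 * 1024):
    # Yield the file one 64 MB block at a time instead of reading it whole.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Example: count the bytes of a 10 GB file without ever holding it in memory.
# total = sum(len(chunk) for chunk in read_in_chunks("big_file.bin"))
```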


I have used stream-based processing in such cases. One example was when I had to download a fairly large (in my case ~600 MB) CSV file from an FTP server, extract the records it contained, and put them into a database. I combined three streams, each reading from the next:

  • A database inserter which read a stream of records from
  • a record factory which read a stream of text from
  • an FTP reader class which downloaded the FTP stream from the server.

That way I never had to store the entire file locally, so it should work with arbitrarily large files.
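The original code isn't shown, but the same chained-streams idea can be sketched in Python with only the standard library; the host, credentials, paths, and table layout below are placeholders:

```python
import csv
import ftplib
import io
import sqlite3

def stream_csv_from_ftp(host, user, password, remote_path, db_path):
    # Chain three streams: FTP download -> CSV record parsing -> DB inserts.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (col_a TEXT, col_b TEXT)")

    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        # transfercmd returns the raw data socket for the transfer; wrap it as
        # a text stream and feed it straight into the CSV parser.
        with ftp.transfercmd(f"RETR {remote_path}") as sock:
            text = io.TextIOWrapper(sock.makefile("rb"), encoding="utf-8")
            for row in csv.reader(text):
                if len(row) >= 2:
                    conn.execute("INSERT INTO records VALUES (?, ?)", row[:2])
        ftp.voidresp()  # consume the end-of-transfer reply

    conn.commit()
    conn.close()
```

Nothing is ever written to the local disk and only a small buffer of data is in flight at a time, which is why the approach scales to arbitrarily large files.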


It would depend on the file and how the data in it is related. If you have a bunch of independent records that need to be processed and written out to a database or another file, it can be beneficial to multi-thread the job: have one thread read records and hand them off to a pool of worker threads that do the time-consuming processing and produce the appropriate output.
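A minimal sketch of that reader/worker split, using Python's queue and threading modules; the per-record work is a stub, and note that in CPython a process pool would be needed if the processing is CPU-bound rather than I/O-bound:

```python
import queue
import threading

NUM_WORKERS = 4
SENTINEL = None  # tells a worker there is nothing left to do

def reader(path, work_queue):
    # Single reader thread: stream records (one per line here) into the queue.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            work_queue.put(line)
    for _ in range(NUM_WORKERS):
        work_queue.put(SENTINEL)

def worker(work_queue, results_queue):
    # Worker thread: do the time-consuming processing of each record.
    while True:
        record = work_queue.get()
        if record is SENTINEL:
            break
        results_queue.put(record.upper())   # placeholder for the real work

def process_file(path):
    work_q = queue.Queue(maxsize=10_000)    # bounded, so the reader can't race ahead
    results_q = queue.Queue()
    threads = [threading.Thread(target=reader, args=(path, work_q))]
    threads += [threading.Thread(target=worker, args=(work_q, results_q))
                for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results_q
```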


In addition to what Bill Carey said, the type of file determines not only what the "meaningful chunks" are, but also what "processing" means.

In other words, what you do to process the data, and how you decide what to process, will vary tremendously.


What separates a "large" data file from a small one is--broadly speaking--whether you can fit the whole file into memory or whether you have to load portions of the file from the disk one at a time.

If the file is so large that you can't load the whole thing into memory, you can process it by identifying meaningful chunks of the file, then reading and processing them serially. How you define "meaningful chunks" will depend very much on the type of file (e.g. binary image files will require different processing than massive XML documents).
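For the XML case, one concrete way to get meaningful chunks is to treat each element subtree as a chunk and discard it once processed; a minimal sketch with the standard library, where the tag name is a placeholder:

```python
import xml.etree.ElementTree as ET

def process_large_xml(path, tag="record"):
    # Stream the document element by element; memory stays flat because each
    # subtree is cleared as soon as it has been handled.
    count = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            count += 1          # placeholder for real per-record processing
            elem.clear()        # free the subtree we just processed
    return count
```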


Look for opportunities to split the file up so that it can be tackled by multiple processes. You don't say whether the records in the file are related; if they are, the problem gets harder, but the solution is in principle the same: identify mutually exclusive partitions of the data that can be processed in parallel.

A while back I needed to process hundreds of millions of test data records for some performance testing I was doing on a massively parallel machine. I used some Perl to split the input file into 32 parts (to match the number of CPUs) and then spawned 32 processes, each transforming the records in one file.

Because this job ran across the 32 processors in parallel, it took minutes rather than the hours it would have taken serially. I was lucky, though, in that there were no dependencies between any of the records in the file.
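The Perl itself isn't shown, but the same split-and-parallelise idea can be sketched in Python: compute byte ranges aligned to line boundaries, then let a process pool handle each range independently. This assumes one record per line and no dependencies between records, as in the job described above:

```python
import os
from multiprocessing import Pool

def find_ranges(path, n_parts):
    # Split the file into n_parts byte ranges, each starting on a line boundary.
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_parts):
            f.seek(i * size // n_parts)
            f.readline()                 # advance to the next line boundary
            offsets.append(f.tell())
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))

def process_range(args):
    # Transform every record in one byte range; here we just count lines.
    path, start, end = args
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            count += 1                   # placeholder for the real transform
    return count

def process_in_parallel(path, n_parts=32):
    ranges = find_ranges(path, n_parts)
    with Pool(n_parts) as pool:
        return sum(pool.map(process_range, [(path, s, e) for s, e in ranges]))
```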
