Concurrent text file reads by numerous VMs

I have a Java app that I am porting, as a proof of concept, to a cloud architecture. I want to process a very large text file by running the same processing program on chunks of the file on separate VMs.

Worker nodes = n

Head node running Master and one Worker, with n-1 Worker nodes

I have two ideas in mind:

  1. Master reads file line-by-line, sends first line to first worker node, second to second worker node and so on, repeating every n lines.

  2. Master reads number of lines in file. Worker nodes then instructed to read no_of_lines/n concurrently from the file.

I am considering using an RMI or sockets-based approach for the transfer of data. Could anyone tell me which of the above methods would be more efficient? If this question cannot be answered without specifying which Java constructs I would be using, I would appreciate suggestions on those.

Also, would locking be an issue with concurrent file access if each node knows which lines it is supposed to read?

Thanks for any suggestions

Ian


To take the second question first, there is never any problem in many programs reading one file IFF no program is writing to the file: each program has its own file-position pointer. Even if some program is writing to the file, there might not be any problem if that program is always writing at the end of the file which, in any sane system, is always the case.
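As a rough illustration of the "own file-position pointer" point, here is a minimal sketch in which several readers open the same file at once, each through its own handle and without any locking. Threads stand in for the separate VMs, and the class name `ConcurrentReaders`, its methods, and the `bounds` array are all made up for the example. Each reader skips forward to its first line, so this version also works when lines vary in length.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: names and structure are assumptions for illustration.
final class ConcurrentReaders {

    /** Reads lines [start, end) through a private handle, skipping earlier lines. */
    static List<String> readRange(Path file, long start, long end) throws IOException {
        List<String> out = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            for (long n = 0; n < end && (line = in.readLine()) != null; n++) {
                if (n >= start) {
                    out.add(line);
                }
            }
        }
        return out;
    }

    /**
     * Starts one reader per range; reader i gets lines [bounds[i], bounds[i + 1]).
     * No locks are taken: every reader has its own file position.
     */
    static List<List<String>> readAll(Path file, long[] bounds) throws Exception {
        if (bounds.length < 2) {
            throw new IllegalArgumentException("need at least one range");
        }
        ExecutorService pool = Executors.newFixedThreadPool(bounds.length - 1);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i + 1 < bounds.length; i++) {
                final long start = bounds[i];
                final long end = bounds[i + 1];
                futures.add(pool.submit(() -> readRange(file, start, end)));
            }
            List<List<String>> result = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                result.add(f.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `ConcurrentReaders.readAll(path, new long[]{0, 2, 5})` would hand lines 0 and 1 to one reader and lines 2 to 4 to another. On real VMs the file would have to sit on storage every node can reach, which this sketch does not address.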

As for the first question, IFF all of the lines in the file are of constant length, then the issue is as always one of efficiency: it's more efficient to read several lines than it is to read one line.
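If the lines really are of constant length, a worker need not scan from the top of the file at all: it can seek straight to `firstLine * recordBytes` and read its block in one pass. The following is only a sketch under that assumption. `FixedWidthReader` and its parameters are invented names, and it assumes single-byte (ASCII) text where every record ends in a single `\n`.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper; assumes every line occupies exactly recordBytes bytes,
// including its trailing '\n', and that the text is plain ASCII.
final class FixedWidthReader {

    /** Reads 'count' lines starting at zero-based line 'firstLine'. */
    static String[] readLines(Path file, int recordBytes, long firstLine, int count)
            throws IOException {
        String[] lines = new String[count];
        byte[] buf = new byte[recordBytes];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            // Jump directly to the first record of this worker's chunk.
            raf.seek(firstLine * recordBytes);
            for (int i = 0; i < count; i++) {
                // readFully throws EOFException if the file ends mid-record.
                raf.readFully(buf);
                // Drop the trailing newline byte.
                lines[i] = new String(buf, 0, recordBytes - 1, StandardCharsets.US_ASCII);
            }
        }
        return lines;
    }
}
```

With variable-length lines this shortcut is not available, and each worker has to find its starting offset some other way, for example by scanning or from an index built beforehand.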

If I were doing the project, the master would ask the workers to read (n_lines_in_file/n_workers) lines. There seems to me little point in the master's reading lines and passing them out to workers. That's assuming, though, that each line takes the same amount of worker processing as any other.
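The (n_lines_in_file/n_workers) split could be computed along these lines. This is a sketch rather than a finished design: `ChunkPlanner` and `boundaries` are made-up names, and the choice to give the leftover lines to the first few workers when the count does not divide evenly is my own.

```java
import java.util.Arrays;

// Hypothetical helper for the master: decides which lines each worker reads.
final class ChunkPlanner {

    /**
     * Returns workers + 1 boundaries; worker i reads lines
     * [result[i], result[i + 1]), zero-based and half-open.
     */
    static long[] boundaries(long totalLines, int workers) {
        if (workers <= 0 || totalLines < 0) {
            throw new IllegalArgumentException("need workers > 0 and totalLines >= 0");
        }
        long base = totalLines / workers;   // lines every worker gets
        long extra = totalLines % workers;  // leftover lines, one each to the first workers
        long[] bounds = new long[workers + 1];
        for (int i = 0; i < workers; i++) {
            bounds[i + 1] = bounds[i] + base + (i < extra ? 1 : 0);
        }
        return bounds;
    }

    public static void main(String[] args) {
        // 10 lines over 3 workers: prints [0, 4, 7, 10]
        System.out.println(Arrays.toString(boundaries(10, 3)));
    }
}
```

The master then only has to send each worker two numbers instead of the lines themselves, which keeps the RMI or socket traffic small.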

If that's not true, or if there are other variables you haven't told us about, my strategy would no doubt change.


When you break up a program, you should ensure that you are not creating more overhead than you are looking to save. For example, reading a few lines of text is relatively cheap compared with doing an RMI call. Copying the data to many hosts may be more expensive than the processing you intend to do.

How long does the processing take? This will guide you as to how large each piece of work needs to be to be efficient. You may find that the optimal number of threads is one. ;)
