Concurrent text file reads by numerous VMs

2023-02-28 16:01

I have a Java app that I am porting, as a proof of concept, to a cloud architecture. I want to process a very large text file by running the same processing program on chunks of the file on separate VMs.

Worker nodes = n

Head node running Master and one Worker, with n-1 Worker nodes

I have two ideas in mind:

1. Master reads the file line by line, sends the first line to the first worker node, the second to the second worker node and so on, repeating every n lines.
2. Master reads the number of lines in the file. Worker nodes are then instructed to each read no_of_lines/n lines from the file concurrently.

I am considering using an RMI or socket-based approach for transfer of data. Could anyone tell me which of the above methods would be most efficient? If this question cannot be answered without specifying which Java constructs I would be using, I would appreciate suggestions on those.

Also, would locking be an issue with concurrent file access if each node knows which lines it is supposed to read?

Thanks for any suggestions

Ian

To take the second question first, there is never any problem in many programs reading one file IFF no program is writing the file: each program has its own file-position pointer. Even if some program is writing to the file, there might not be any problem if that program is always writing at the end of the file which, in any sane system, is always the case.
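As a minimal sketch of the point above about independent file positions: each reader below opens its own handle on the same file, so no locking is involved. The class and method names (`ConcurrentReaders`, `readLines`) are made up for illustration, the two threads merely stand in for separate worker VMs, and it assumes the file sits on storage that every worker can open read-only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConcurrentReaders {

    // Hypothetical helper: every call opens its own reader, and therefore
    // has its own file position, independent of any other reader.
    static List<String> readLines(Path file, long firstLine, long count) throws IOException {
        List<String> out = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            // Skip the lines that belong to earlier workers.
            for (long i = 0; i < firstLine; i++) {
                if (in.readLine() == null) {
                    return out;
                }
            }
            String line;
            while (out.size() < count && (line = in.readLine()) != null) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("lines", ".txt");
        Files.write(file, List.of("a", "b", "c", "d"), StandardCharsets.UTF_8);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Two readers work on the same file at the same time.
            Future<List<String>> first = pool.submit(() -> readLines(file, 0, 2));
            Future<List<String>> second = pool.submit(() -> readLines(file, 2, 2));
            System.out.println(first.get() + " " + second.get()); // prints [a, b] [c, d]
        } finally {
            pool.shutdown();
            Files.delete(file);
        }
    }
}
```

Note that skipping lines this way still reads everything before `firstLine`, so a worker near the end of the file pays for almost a full scan.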

As for the first question, IFF all of the lines in the file are of constant length, a worker can compute the byte offset of its first line and seek straight to it, so the issue is, as always, one of efficiency: it's more efficient to read several lines at a time than it is to read one line at a time.
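If the lines really are of constant length, a worker could jump straight to its part of the file and read it in one go. The following is only a sketch under stated assumptions: every line is exactly `recordLength` bytes including a single trailing `\n`, the text is ASCII, and one chunk fits in memory. `FixedLengthChunkReader` and `readChunk` are hypothetical names, not something from the question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class FixedLengthChunkReader {

    // Assumption: each line occupies exactly recordLength bytes,
    // the last of which is '\n', and the content is ASCII.
    static List<String> readChunk(String path, int recordLength, long firstLine, int lineCount)
            throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long start = firstLine * recordLength;
            long end = Math.min(raf.length(), start + (long) lineCount * recordLength);
            if (start >= end) {
                return lines; // nothing to read for this worker
            }
            // Assumption: the chunk is small enough for a single byte array.
            byte[] buf = new byte[(int) (end - start)];
            raf.seek(start);      // jump directly to this worker's first line
            raf.readFully(buf);   // one bulk read instead of one read per line
            for (int off = 0; off + recordLength <= buf.length; off += recordLength) {
                // Drop the trailing newline of each record.
                lines.add(new String(buf, off, recordLength - 1, StandardCharsets.US_ASCII));
            }
        }
        return lines;
    }
}
```

With variable-length lines this does not work as written, because the offset of line k is not known without scanning the file first.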

If I were doing the project, the master would ask the workers to read (n_lines_in_file/n_workers) lines. There seems to me little point in the master's reading lines and passing them out to workers. That's assuming, though, that each line takes the same amount of worker processing as any other.
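The split described above could be computed along these lines. This is a sketch, not a prescribed design: `ChunkPlanner` and `plan` are illustrative names, and handing the leftover lines to the first few workers is just one way of dealing with a line count that does not divide evenly.

```java
final class ChunkPlanner {

    // Returns one {firstLine, lineCount} pair per worker.
    // Lines are numbered from 0; the first (totalLines % workers)
    // workers each get one extra line.
    static long[][] plan(long totalLines, int workers) {
        if (workers <= 0 || totalLines < 0) {
            throw new IllegalArgumentException("need workers > 0 and totalLines >= 0");
        }
        long base = totalLines / workers;
        long extra = totalLines % workers;
        long[][] ranges = new long[workers][2];
        long next = 0;
        for (int i = 0; i < workers; i++) {
            long count = base + (i < extra ? 1 : 0);
            ranges[i][0] = next;
            ranges[i][1] = count;
            next += count;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 lines over 3 workers: prints 0+4, 4+3, 7+3
        for (long[] r : plan(10, 3)) {
            System.out.println(r[0] + "+" + r[1]);
        }
    }
}
```

The master then only has to send each worker two numbers, which is far less traffic than shipping the lines themselves.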

If that's not true, or if there are other variables you haven't told us about, my strategy would no doubt change.

When you break up a program, you should ensure that you are not creating more overhead than you are looking to save. For example, reading a few lines of text is relatively cheap compared with doing an RMI call. Copying the data to many hosts may be more expensive than the processing you intend to do.

How long does the processing take? This will guide you as to how large each piece of work needs to be to be efficient. You may find that the optimal number of threads is one. ;)
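One rough way to answer that question is to time the processing on a sample of lines and compare it with the fixed cost of handing out a piece of work. The sketch below assumes a simple model that is not from the original posts: each chunk costs a fixed overhead (for example one remote call) plus a constant time per line. `ProcessingTimer`, `averageNanosPerLine` and `minLinesPerChunk` are hypothetical names, and the overhead figure has to be measured separately.

```java
import java.util.List;
import java.util.function.Consumer;

final class ProcessingTimer {

    // Average wall-clock time per line for the given processing step.
    // A single pass like this is only a rough estimate (no JIT warm-up).
    static double averageNanosPerLine(List<String> sample, Consumer<String> processor) {
        long start = System.nanoTime();
        for (String line : sample) {
            processor.accept(line);
        }
        long elapsed = System.nanoTime() - start;
        return elapsed / (double) Math.max(1, sample.size());
    }

    // Smallest chunk size k such that the fixed per-chunk overhead is at most
    // maxOverheadFraction of the processing time for k lines:
    //   overhead <= fraction * k * nanosPerLine
    static long minLinesPerChunk(double nanosPerLine, double overheadNanos,
                                 double maxOverheadFraction) {
        if (nanosPerLine <= 0 || maxOverheadFraction <= 0) {
            throw new IllegalArgumentException("nanosPerLine and fraction must be positive");
        }
        return (long) Math.ceil(overheadNanos / (maxOverheadFraction * nanosPerLine));
    }

    public static void main(String[] args) {
        List<String> sample = List.of("alpha", "beta", "gamma");
        // Stand-in for the real per-line processing.
        double perLine = averageNanosPerLine(sample, String::toUpperCase);
        // Assumed figure: 1 ms of overhead per chunk, allowed to be 25% of the work.
        System.out.println(minLinesPerChunk(Math.max(perLine, 1.0), 1_000_000, 0.25));
    }
}
```

If the resulting chunk size is close to the size of the whole file, distributing the work is unlikely to pay off.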
