Can parallel operations speed up reading a file from a hard disk in R?
I have a huge data file (~4 GB) that I am passing through R (to do some string cleanup) on its way into a MySQL database. Each row/line is independent of the others. Is there any speed advantage to be had by using parallel operations to finish this process? That is, could one thread start by skipping no lines and scan every second line, while another starts with a skip of one line and reads every second line? If so, would it actually speed up the process, or would the two threads fighting over the 10K RPM Western Digital hard drive (not an SSD) negate any possible advantage?
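For concreteness, this is roughly the scheme I have in mind (just a sketch, not benchmarked; the file name, chunk size, and the fork-based `mclapply` call are placeholders for whatever the real setup would be):

```r
# Rough sketch of the interleaved-reader idea (not benchmarked).
# "big_file.txt", the chunk size, and the worker count are placeholders.
library(parallel)

read_alternate <- function(offset, path, stride = 2L, chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  seen <- 0L
  repeat {
    block <- readLines(con, n = chunk)
    if (length(block) == 0L) break
    idx <- seen + seq_along(block)                 # global line numbers
    kept[[length(kept) + 1L]] <- block[idx %% stride == offset]
    seen <- seen + length(block)
  }
  unlist(kept)
}

# Two forked workers, each scanning the whole file but keeping alternate lines
parts <- mclapply(0:1, read_alternate, path = "big_file.txt", mc.cores = 2)
```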
The answer is maybe. At some point, disk access will become limiting. Whether this happens with 2 cores running or 8 depends on the characteristics of your hardware setup. It'd be pretty easy to just try it out while watching your system with top. If your %wa is consistently above zero, it means the CPUs are waiting for the disk to catch up and you're likely slowing the whole process down.
Why not just use some of the standard Unix tools to split the file into chunks and run several R command-line processes in parallel, each working on its own chunk? No need to be fancy if simple will do.
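For example, something along these lines (a minimal sketch in the same spirit, using forked R processes via `mclapply` instead of separate Rscript invocations; it assumes the chunks already exist on disk, e.g. from `split -l 1000000 bigfile.txt chunk_`, and `clean_chunk` is a placeholder for the real cleanup):

```r
# Minimal sketch: clean pre-split chunk files in parallel.
# Assumes the chunks were created beforehand, e.g. with
#   split -l 1000000 bigfile.txt chunk_
# clean_chunk() is a placeholder for whatever cleanup is actually required.
library(parallel)

clean_chunk <- function(path) {
  lines   <- readLines(path)
  cleaned <- gsub("\\s+", " ", lines)        # placeholder cleanup rule
  writeLines(cleaned, paste0(path, ".clean"))
  length(cleaned)
}

chunks <- list.files(pattern = "^chunk_[a-z]+$")
counts <- mclapply(chunks, clean_chunk, mc.cores = 4)
```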
The bottleneck will likely be the HDD. It doesn't matter how many processes are trying to access it; it can only read/write one thing at a time.
This assumes the "string clean up" uses minimal CPU. awk or sed are generally better suited to that kind of work than R.
You probably want to read from the disk in one linear forward pass, since the OS and the disk optimize heavily for that case, but you could parcel out blocks of lines to worker threads/processes from the single process doing the reading. (If you can do process parallelism rather than thread parallelism, you probably should - way less hassle all 'round.)
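A rough sketch of that shape (one sequential reader handing blocks to forked workers; the file names, block size, worker count, and `clean_line` rule are all assumptions standing in for the real job):

```r
# Sketch: one process reads the file sequentially; each block of lines is
# handed to forked workers for a (placeholder) cleanup step, and the cleaned
# lines are written back out in order.
library(parallel)

clean_line <- function(x) gsub("\\s+", " ", x)   # placeholder cleanup

con <- file("big_file.txt", open = "r")
out <- file("big_file.clean", open = "w")
repeat {
  block <- readLines(con, n = 100000L)           # one linear forward pass
  if (length(block) == 0L) break
  pieces  <- split(block, cut(seq_along(block), 4, labels = FALSE))
  cleaned <- mclapply(pieces, clean_line, mc.cores = 4)
  writeLines(unlist(cleaned, use.names = FALSE), out)
}
close(con)
close(out)
```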
Can you describe the string cleanup that's required? R is not the first tool I would reach for when it comes to string bashing.
Ruby is another easy scripting language for file manipulation and cleanup. But it is still a question of the ratio of processing time to reading time. If the point is to do things like select columns or rearrange fields, you are far better off with Ruby, awk, or sed; even for simple computations those would be better. But if for each line you are, say, fitting a regression model or running a simulation, you would do better to run the tasks in parallel. The question can't have a definitive answer because we don't know the exact parameters, but it sounds like for most simple cleanup jobs you'd be better off using a language well suited to it, such as Ruby, and running it in a single thread.