Recommended language for multithreaded data work
Right now, I use a combination of Python and R for all of my data processing needs. However, some of my datasets are incredibly large and would benefit greatly from multithreaded processing.
For example, if there are two steps that each have to be performed on a set of several million data points, I would like to be able to start the second step while the first step is still running, using the part of the data that has already been processed through the first step.
From my understanding, neither Python nor R is the ideal language for this type of work (at least, I don't know how to implement it in either language). What would be the best language/implementation for this type of data processing?
It is possible to do this in Python using the multiprocessing module -- this spawns multiple processes instead of threads, which bypasses the GIL and hence allows true concurrency.
That is not to say that Python is the 'best' language for this job; that's a subjective point which can be argued over. But it is certainly capable of it.
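To make that concrete, here is a minimal sketch of farming work out to a pool of worker processes (crunch is a hypothetical stand-in for your real per-point computation):

from multiprocessing import Pool

def crunch(x):
    # Stand-in for one CPU-bound processing step; each call runs in a
    # separate worker process, so the parent's GIL is not a bottleneck.
    return sum(i * i for i in range(x))

if __name__ == '__main__':
    with Pool(processes=4) as pool:       # roughly one worker per core
        results = pool.map(crunch, [10000] * 8)
    print(results[:2])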
EDIT: Yes, there are several ways to share data between processes. Pipes are the simplest; they are sort-of file-like handles which one process can write to and then another can read from. Straight from the docs:
from multiprocessing import Process, Pipe

def f(conn):
    # Runs in the child process: send any picklable object
    # through its end of the pipe, then close it.
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print(parent_conn.recv())    # prints "[42, None, 'hello']"
    p.join()
You could for instance have one process performing the first step and sending the results down a pipe to another process for the second step.
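A minimal sketch of that pattern, with step_one and step_two as placeholders for your real processing: step one streams each result out as soon as it is ready, so step two starts consuming before step one has finished the whole dataset.

from multiprocessing import Process, Pipe

def step_one(data, conn):
    # Stream each result out immediately so the next stage can begin.
    for point in data:
        conn.send(point * 2)      # placeholder for real work
    conn.send(None)               # sentinel: no more data
    conn.close()

def step_two(conn):
    while True:
        item = conn.recv()
        if item is None:          # sentinel received, we're done
            break
        print(item + 1)           # placeholder for real work

if __name__ == '__main__':
    recv_end, send_end = Pipe()
    p1 = Process(target=step_one, args=(range(10), send_end))
    p2 = Process(target=step_two, args=(recv_end,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()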
It is pretty easy to do multiprocessing in R (certainly no harder than in Python); check out the multicore package and the others listed here.
I find that using R with the foreach package is a really easy way to use multithreading in your code. Use the doMC or the doMPI package as the parallel back-end if you have a UNIX-alike or Windows, respectively. The vignette should get you going fairly quickly. This method is best suited to parallelizing for loops, and I find that using 7 of the 8 cores on my machine usually gives nearly a sixfold speed increase. I'm not sure that you can start a second process based on the results of the first, but it is worth a quick look.
Good luck. Sorry that I am a new user and can only post one link, or I would have linked all of the other pages as well.
On the Python side, your best bet is probably to separate the two steps into two different processes. There are a couple of modules that help you achieve that. You would couple the two processes through pipes. In order to pass arbitrary data through a pipe, you need to serialize and deserialize it. The pickle module would be a good candidate for this.
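A minimal sketch of that idea, using a raw POSIX pipe plus pickle to couple the two stages (note that multiprocessing.Pipe does this serialization for you automatically; this just exposes the mechanism). The record generator is a hypothetical stand-in for the first step's output, and os.fork makes this Unix-only.

import os
import pickle

read_fd, write_fd = os.pipe()

pid = os.fork()                            # POSIX only
if pid == 0:
    # Child: step one pickles each result into the pipe.
    os.close(read_fd)
    with os.fdopen(write_fd, 'wb') as w:
        for record in ({'step1': i} for i in range(3)):
            pickle.dump(record, w)
    os._exit(0)
else:
    # Parent: step two unpickles records as they arrive.
    os.close(write_fd)
    with os.fdopen(read_fd, 'rb') as r:
        while True:
            try:
                print(pickle.load(r))
            except EOFError:               # writer closed the pipe
                break
    os.wait()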
If you want to jump ship, languages like Erlang, Haskell, Scala or Clojure probably have the concurrency features you are looking for, but I don't know how well they would integrate with R or another statistical package that suits you.
If I remember correctly (but I might be wrong here), one of the main purposes of Ada 95 was parallel processing. Funny language, that was.
Jokes aside, I'm not quite sure how good it would be performance-wise (but seeing as you are using Python now, it shouldn't be that bad), but I'd suggest Java, since the basics of multithreading are quite simple there (though writing a well-designed, complex multithreaded application is rather hard). I've heard the java.util.concurrent library is also quite nice, though I haven't tried it out myself yet.
My experiences with Java multithreading have been very positive, although it does take a lot of getting used to. At the end of the day, the biggest problem wasn't the syntax or the Java features but the different mindset you need to develop multithreaded code.
If you're using Eclipse, there's also a profiling suite that's very helpful in debugging and optimisation. It causes a rather big performance hit, though :)