Dealing with the surprising lack of ParList in scala.collections.parallel
So scala 2.9 recen开发者_运维百科tly turned up in Debian testing, bringing the newfangled parallel collections with it.
Suppose I have some code equivalent to
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):List[Int} = s.map(expensiveFunction)
now from the teeny bit I'd gleaned about parallel collections before the docs actually turned up on my machine, I was expecting to parallelize this just by switching the List to a ParList
... but to my surprise, there isn't one! (Just ParVector
, ParMap
, ParSet
...).
As a workround, this (or a one-line equivalent) seems to work well enough:
def process(s:List[Int]):List[Int} = {
val ps=scala.collection.parallel.immutable.ParVector()++s
val pr=ps.map(expensiveFunction)
List()++pr
}
yielding an approximately x3 performance improvement in my test code and achieving massively higher CPU usage (quad core plus hyperthreading i7). But it seems kind of clunky.
My question is a sort of an aggregated:
- Why isn't there a
ParList
? - Given there isn't a
ParList
, is there a better pattern/idiom I should adopt so that I don't feel like they're missing ? - Am I just "behind the times" using Lists a
lot in my scala programs (like all the Scala books I
bought back in the 2.7 days taught me) and
I should actually be making more use of
Vectors
? (I mean in C++ land I'd generally need a pretty good reason to usestd::list
overstd::vector
).
List
s are great when you want pattern matching (i.e. case x :: xs
) and for efficient prepending/iteration. However, they are not so great when you want fast access-by-index, or splitting into chunks, or joining (i.e. xs ::: ys
).
Hence it does not make much sense (to have a parallel List
) when you think that this kind of thing (splitting and joining) is exactly what is needed for efficient parallelism. Use:
xs.toIndexedSeq.par
First, let me show you how to make a parallel version of that code:
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):Seq[Int] = s.par.map(expensiveFunction).seq
That will have Scala figure things out for you -- and, by the way, it uses ParVector. If you really want List
, call .toList
instead of .seq
.
As for the questions:
There isn't a
ParList
because aList
is an intrinsically non-parallel data structure, because any operation on it requires traversal.You should code to traits instead of classes --
Seq
,ParSeq
andGenSeq
, for example. Even performance characteristics of aList
are guaranteed byLinearSeq
.All the books before Scala 2.8 did not have the new collections library in mind. In particular, the collections really didn't share a consistent and complete API. Now they do, and you'll gain much by taking advantage of it.
Furthermore, there wasn't a collection like
Vector
in Scala 2.7 -- an immutable collection with (near) constant indexed access.
A List
cannot be easily split into various sub-lists which makes it hard to parallelise. For one, it has O(n) access; also a List
cannot strip its tail, so one need to include a length parameter.
I guess, taking a Vector
will be the better solution.
Note that Scala’s Vector
is different from std::vector
. The latter is basically a wrapper around standard array, a contiguous block in memory which needs to be copied every now and then when adding or removing data. Scala’s Vector
is a specialised data structure which allows for efficient copying and splitting while keeping the data itself immutable.
精彩评论