How can I optimize this statistics-counting sequence, and why is it so slow?
Intro: I spent the whole day trying to find out why my processing operation is so slow. It was slow even on small data. I checked the SQL views, the procedures, and the LINQ logic, and all of them worked perfectly. But then I saw the little thing that takes ages to process.
member X.CountStatistics() =
    linq.TrueIncidents
    |> PSeq.groupBy (fun v -> v.Name)
    |> PSeq.map (fun (k, vs) -> k, PSeq.length vs)
    |> Array.ofSeq
It simply counts grouped values, but look how much time it takes! About 10 seconds on a small table.
There must be something nastily recursive in there, but I can't see it...
How can I make this operation "a bit faster", or recode it as LINQ-to-SQL?
If I understand correctly, TrueIncidents is a table in a database, and you are pulling its entire contents into a client app to do some grouping and counting. If TrueIncidents is a large table, then this operation is always going to be slow, since you are moving a large amount of data around. The "correct" way to do this is on the database, either using LINQ-to-SQL as you suggest, or using a stored procedure as Tomas suggests.
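For example, here is a minimal sketch of pushing the work to the server, assuming an F# 3.0 query expression over a LINQ-to-SQL data context named db (the names here are hypothetical). The groupBy/count is translated into a SQL GROUP BY, so only the (name, count) pairs cross the wire instead of the whole table:

open System.Linq

// Sketch only: `db` stands for your generated LINQ-to-SQL data context.
let countStatistics () =
    query {
        for incident in db.TrueIncidents do
        groupBy incident.Name into g
        select (g.Key, g.Count())
    }
    |> Array.ofSeq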
Regarding PSeq, I do not think inlining will make much of a difference. Parallelization has an overhead, and for this overhead to be amortized the list needs to be relatively large and the operation performed on each item needs to be significant. Parallelizing may be worth it for a small list if the operation you perform on each item is very expensive, but the reverse does not hold: even if a list is very large, parallelizing a cheap operation will not be worth the overhead. So the problem in this case is that the operation you perform on each item is too small, and the cost of the parallelization will always make the operation slower.

To see this, consider the following C# program, where we perform a simple addition on a list of 10 million items. You'll see that the parallel version always runs slower (well, on the machine I'm working on at the moment, which has two cores; on a machine with more cores the result might be different).
static void Main(string[] args)
{
    var list = new List<int>();
    for (int i = 0; i < 10000000; i++)
    {
        list.Add(i);
    }

    var stopwatch = new Stopwatch();
    stopwatch.Start();
    var res1 = list.Select(x => x + 1);
    foreach (var i in res1) // force evaluation of the lazy query
    {
    }
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
    // 00:00:00.1950918 sec on my machine

    stopwatch.Restart(); // reset, otherwise the second reading includes the first
    // AsParallel must come before Select, or the Select still runs sequentially
    var res2 = list.AsParallel().Select(x => x + 1);
    foreach (var i in res2)
    {
    }
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
    // 00:00:00.3748103 sec on my machine
}
The current version of the F# LINQ support is a bit limited. I think the best way to write this is to sacrifice some of the elegance of using F# and write it as a stored procedure in SQL. Then you can add the stored procedure to your LINQ data context and call it nicely through a generated method. When F# LINQ improves in the future, you can change it back :-).
Regarding the PSeq example - as far as I know, there was an efficiency issue because the methods were not inlined (thanks to inlining, the compiler is able to do some additional optimization and remove some overhead). You can try downloading the source and adding inline to map and groupBy.
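To illustrate where the inline keyword would go, here is a hypothetical, heavily simplified stand-in for such a wrapper; the real implementation lives in the F# PowerPack source and looks different:

open System.Linq

// Hypothetical, simplified stand-in for PSeq.map, shown only to illustrate
// the `inline` modifier: it lets the compiler specialize the function at
// each call site and shave off some delegate/closure overhead.
let inline pmap (f : 'T -> 'U) (source : seq<'T>) : seq<'U> =
    source.AsParallel().Select(fun x -> f x) :> seq<'U>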
As already mentioned in the other answers, if you bring a large amount of data from the database and then do calculations on that large data set, it will be expensive (the IO part will likely cost more than the computation part). In your specific case it seems you want the count for each incident name. One approach is to use LINQ-to-SQL from F# to bring only the "names" of the incidents from the database (no other columns, since you don't need them) and then do the grouping and counting in F#. That may improve performance, though I am not sure by how much.
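A sketch of that idea, again assuming F# 3.0 query expressions and a hypothetical LINQ-to-SQL data context named db: only one string column travels over the wire, and the counting happens on the client.

// Sketch only: `db` stands for your generated LINQ-to-SQL data context.
let countByNameClientSide () =
    query {
        for incident in db.TrueIncidents do
        select incident.Name
    }
    |> Seq.countBy id    // group and count the names in memory
    |> Array.ofSeq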