
Best way to use Parallel / PLINQ to find keywords in all Excel worksheet cells

As the title says, I have a List<string> keywords and a Workbook object model similar to Excel's.

I would like to get every WorkbookCell that matches a keyword in the list.

I was thinking that parallelizing the search might be a good idea:

            //Loop through all the Worksheets in parallel
            Parallel.ForEach(Workbook.Worksheets, (ws, st) =>
            {
                if (!st.ShouldExitCurrentIteration)
                {
                    //Loop through all the rows in parallel
                    Parallel.ForEach(ws.Rows, (wr, tk) =>
                    {
                        if (!tk.ShouldExitCurrentIteration)
                        {
                            //Loop through all the columns in parallel
                            Parallel.ForEach(wr.Cells, (cell, ctk) =>
                            {
                                if (cell.Value != null)
                                {
                                    var cellValue = cell.Value.ToString();

                                    //Block keyword found, add the occurrence
                                    var matchedKeyword = IsKeywordMatched(cellValue);

                                    if (matchedKeyword != null)
                                    {
                                        matchedKeyword.AddMatchedCell(cell);
                                    }
                                }
                            });
                        }
                    });
                }
            });

Would this actually be too much parallelism? Please let me know if you have better ideas.

** Normally I have fewer than 20 worksheets, but each worksheet contains more than 10,000 rows and hundreds of columns.


By default, the number of parallel threads is roughly equal to the number of cores. Each parallel loop carries the overhead of splitting (partitioning) the data into n portions and merging the results again. I would say it makes sense to keep only the first loop parallel if the number of worksheets is usually greater than the number of cores; otherwise, parallelize at the second level (the rows) instead. Nested parallel loops mostly add overhead and tend to decrease performance. So yes, you are right: it's too much parallelism.
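A minimal sketch of the "parallelize one level only" idea, assuming few worksheets and many rows as described in the question. The `Cell`, `Row`, and `Worksheet` classes and the `MatchKeyword` helper are hypothetical stand-ins for the asker's object model, not its real API, and the sketch assumes that model can be read safely from several threads. Matches are collected in a `ConcurrentBag` because the loop body runs concurrently.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-ins for the workbook object model in the question.
public sealed class Cell { public object Value; }
public sealed class Row { public List<Cell> Cells = new List<Cell>(); }
public sealed class Worksheet { public List<Row> Rows = new List<Row>(); }

public static class SingleLevelSearch
{
    // Hypothetical matcher: returns the first keyword contained in the text, or null.
    public static string MatchKeyword(string text, IReadOnlyList<string> keywords)
    {
        foreach (var keyword in keywords)
        {
            if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                return keyword;
        }
        return null;
    }

    public static List<KeyValuePair<string, Cell>> FindMatches(
        IEnumerable<Worksheet> worksheets, IReadOnlyList<string> keywords)
    {
        // Thread-safe collection: several rows add matches at the same time.
        var matches = new ConcurrentBag<KeyValuePair<string, Cell>>();

        // Few worksheets: walk them sequentially.
        foreach (var worksheet in worksheets)
        {
            // Many rows: this is the only parallel level.
            Parallel.ForEach(worksheet.Rows, row =>
            {
                // Cells of one row are scanned sequentially, with no nested parallel loop.
                foreach (var cell in row.Cells)
                {
                    if (cell.Value == null) continue;

                    var keyword = MatchKeyword(cell.Value.ToString(), keywords);
                    if (keyword != null)
                        matches.Add(new KeyValuePair<string, Cell>(keyword, cell));
                }
            });
        }

        return matches.ToList();
    }
}
```

The order of the returned matches is not deterministic. Whether this actually beats a plain sequential scan depends on how expensive the keyword check is, so it is worth measuring both.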


This looks like a good candidate for parallelization to me...

worksheet.Cells.AsParallel().Select(x => new { Cell = x, Keyword = KeywordMatched(x.Value.ToString()) }).Where(...)...

This should give you an almost linear performance improvement with the number of cores available.

HINT: Change your IsKeywordMatched function to KeywordMatched, which returns the matched string, or null if nothing matches. Then filter the resulting query (.Where(...)) down to the records where the string is not null.
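Spelled out, the query and the null filter from the hint might look like the sketch below. `SheetCell` is a hypothetical stand-in for the asker's cell type, and this `KeywordMatched` (a case-insensitive substring check that takes the keyword list as a parameter) is only one assumed way to implement the matcher. The extra null check on `Value` mirrors the one in the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the cell type in the question.
public sealed class SheetCell { public object Value; }

public static class PlinqSearch
{
    // Assumed matcher: returns the first keyword found in the text, or null if none matches.
    public static string KeywordMatched(string text, IReadOnlyList<string> keywords)
    {
        return keywords.FirstOrDefault(
            k => text.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0);
    }

    public static List<KeyValuePair<string, SheetCell>> FindMatches(
        IEnumerable<SheetCell> cells, IReadOnlyList<string> keywords)
    {
        return cells
            .AsParallel()
            // Skip empty cells before calling ToString().
            .Where(c => c.Value != null)
            // Pair each cell with the keyword it matched (null when nothing matched).
            .Select(c => new { Cell = c, Keyword = KeywordMatched(c.Value.ToString(), keywords) })
            // Keep only the cells that matched a keyword.
            .Where(m => m.Keyword != null)
            .Select(m => new KeyValuePair<string, SheetCell>(m.Keyword, m.Cell))
            .ToList();
    }
}
```

PLINQ does not preserve the source order unless `AsOrdered()` is added, which costs some throughput.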
