How to parallelize sequential tasks using C# and Parallel Extensions?
I have the following methods that are called sequentially:
- private StringBuilder ReadPDF();
- private StringBuilder CleanText(StringBuilder sb);
- private void ParseText();
ParseText calls ReadPDF, which calls CleanText.
The PDF I'm parsing has 15MB of text, and it takes 10 minutes to extract all the data from the file on a regular Core 2 Duo computer.
How can I parallelize these tasks?
edit: Just to clarify, reading the PDF takes very little time; the problem lies with parsing the extracted text, more specifically in the CleanText phase. The reason I need to parallelize is that cleaning up a single page is instant, but cleaning 2k+ pages takes a long time.
First of all, you probably need to review the way you're reading the PDF. A file of only 15MB should not take 10 minutes to read unless you're parsing it in some very inefficient way. Second, once you have found a better way of parsing it, make sure that you can read a single page at a time, starting from any page you need. After that, you will be able to run multiple single-page reading tasks in parallel.
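A minimal sketch of the per-page idea above, assuming the text is already available as one string per page. `PageCleaner`, `CleanPage`, and `CleanAllPages` are hypothetical names, and the whitespace-collapsing body is only a stand-in for whatever the real CleanText does. PLINQ's `AsOrdered` keeps the cleaned pages in their original order:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class PageCleaner
{
    // Hypothetical stand-in for the per-page part of CleanText:
    // collapses runs of whitespace into one space and trims both ends.
    public static string CleanPage(string page)
    {
        var sb = new StringBuilder(page.Length);
        bool lastWasSpace = true; // start as "space" so leading whitespace is dropped
        foreach (char c in page)
        {
            if (char.IsWhiteSpace(c))
            {
                if (!lastWasSpace) sb.Append(' ');
                lastWasSpace = true;
            }
            else
            {
                sb.Append(c);
                lastWasSpace = false;
            }
        }
        // Drop the single trailing space, if any.
        if (sb.Length > 0 && sb[sb.Length - 1] == ' ') sb.Length--;
        return sb.ToString();
    }

    // Cleans every page in parallel and joins the results in page order.
    public static StringBuilder CleanAllPages(IList<string> pages)
    {
        string[] cleaned = pages
            .AsParallel()
            .AsOrdered()                 // preserve the original page order
            .Select(p => CleanPage(p))   // each page is cleaned independently
            .ToArray();

        var result = new StringBuilder();
        foreach (string page in cleaned) result.Append(page).Append('\n');
        return result;
    }
}
```

This only helps if cleaning one page does not depend on the state of another page; if CleanText needs context across page boundaries, the split points would have to be chosen differently.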
Read the PDF page by page and use Pipelining to process each page.
http://blogs.msdn.com/b/pfxteam/archive/2010/04/14/9995613.aspx
And as was mentioned in an earlier post, you're probably doing something wrong. It's ONLY a 15MB PDF; it shouldn't take 10 minutes to read it.
As Denis said, you can read a portion of the text, typically a page (though you might be able to break it into smaller blocks), and then process that text while you are reading the next portion.
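A rough sketch of the pipelining suggestion, assuming three stages (read, clean, parse) connected by `BlockingCollection<T>` queues. `PagePipeline.Run` and its `clean` and `parse` delegates are hypothetical placeholders, not the code from the linked post, and cancellation is left out:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PagePipeline
{
    // Runs read -> clean -> parse as three concurrent stages.
    // Each stage is a single task, so page order is preserved.
    public static List<string> Run(IEnumerable<string> pageSource,
                                   Func<string, string> clean,
                                   Func<string, string> parse)
    {
        // Unbounded queues keep the sketch simple; a bounded capacity
        // could be passed to the constructor to limit memory use.
        var rawPages = new BlockingCollection<string>();
        var cleanPages = new BlockingCollection<string>();
        var results = new List<string>();

        // Stage 1: hand pages to the cleaner as soon as they are read.
        Task reader = Task.Factory.StartNew(() =>
        {
            try { foreach (string p in pageSource) rawPages.Add(p); }
            finally { rawPages.CompleteAdding(); } // always unblock stage 2
        });

        // Stage 2: clean pages while stage 1 is still reading.
        Task cleaner = Task.Factory.StartNew(() =>
        {
            try
            {
                foreach (string p in rawPages.GetConsumingEnumerable())
                    cleanPages.Add(clean(p));
            }
            finally { cleanPages.CompleteAdding(); } // always unblock stage 3
        });

        // Stage 3: parse cleaned pages; only this task touches 'results'.
        Task parser = Task.Factory.StartNew(() =>
        {
            foreach (string p in cleanPages.GetConsumingEnumerable())
                results.Add(parse(p));
        });

        // Rethrows any stage failure as an AggregateException.
        Task.WaitAll(reader, cleaner, parser);
        return results;
    }
}
```

A pipeline like this overlaps the stages but runs each one on a single task, so if cleaning dominates the run time, as the question's edit suggests, the cleaning stage itself would probably still need to be spread across several workers.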
If you want to learn more about parallel programming you can find good info and labs at the MSDN Parallel Computing center.
MSDN also has a Parallel Programming with .NET blog.
There is also a good book: Professional Parallel Programming with C#: Master Parallel Extensions with .NET 4, by Gastón Hillar.