Intensive file I/O and data processing in C#
I'm writing an app which needs to process a large text file (comma-separated with several different types of records - I do not have the power or inclination to change the data storage format). It reads in records (often all the records in the file sequentially, but not always), then the data for each record is passed off for some processing.
Right now this part of the application is single threaded (read a record, process it, read the next record, etc.). I'm thinking it might be more efficient to read records into a queue in one thread, and process them in another thread in small blocks or as they become available.
I have no idea how to start programming something like that, including the data structure that would be necessary or how to implement the multithreading properly. Can anyone give any pointers, or offer other suggestions about how I might improve performance here?
You might get a benefit if you can balance the time spent processing records against the time spent reading them; in that case you could use a producer/consumer setup, for example a synchronized queue and a worker (or a few) dequeueing and processing. I might also be tempted to investigate Parallel Extensions; it is pretty easy to write an IEnumerable<T> version of your reading code, after which Parallel.ForEach (or one of the other Parallel methods) should actually do everything you want; for example:
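// needs: using System.Collections.Generic; using System.IO;
// Person is your own record type (not shown here).
static IEnumerable<Person> ReadPeople(string path) {
    using(var reader = File.OpenText(path)) {
        string line;
        // Lazily yield one record per line, so the whole file is never held in memory.
        while((line = reader.ReadLine()) != null) {
            string[] parts = line.Split(',');
            yield return new Person(parts[0], int.Parse(parts[1]));
        }
    }
}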
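To show how the reader plugs into Parallel.ForEach, here is a minimal sketch. The Person class, the ParallelDemo wrapper and the TotalAge method are hypothetical and only there for illustration; summing ages stands in for whatever per-record processing you really need, and the reader is repeated so the sample compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical record type; the real Person is not shown in the answer.
public sealed class Person
{
    public string Name { get; }
    public int Age { get; }
    public Person(string name, int age) { Name = name; Age = age; }
}

public static class ParallelDemo
{
    // Lazily yields one Person per line of a "name,age" file.
    public static IEnumerable<Person> ReadPeople(string path)
    {
        using (var reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] parts = line.Split(',');
                yield return new Person(parts[0], int.Parse(parts[1]));
            }
        }
    }

    // Placeholder "processing": adds up the ages on multiple threads.
    public static long TotalAge(string path)
    {
        long total = 0;
        Parallel.ForEach(ReadPeople(path), person =>
        {
            // The lambda runs concurrently, so shared state needs a thread-safe update.
            Interlocked.Add(ref total, person.Age);
        });
        return total;
    }
}
```

The sequence is still read by one thread at a time; only the body of the loop runs in parallel, so this helps mainly when processing a record costs more than reading it. Note that records are not processed in file order.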
Take a look at these tutorials; they contain all you need. They are the Microsoft tutorials, including code samples, for a case similar to the one you describe: your producer fills the queue, while the consumer pops records off.
Creating, starting, and interacting between threads
Synchronizing two threads: a producer and a consumer
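If you are on .NET 4.5 or later, a rough sketch of the same producer/consumer idea can be built on BlockingCollection<T> instead of hand-written locking. This is not the code from the tutorials above; the Pipeline class, the Run method, the capacity of 1000 and the processLine callback are all names and values I made up for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class Pipeline
{
    // One task reads lines into a queue; workerCount tasks take them off and process them.
    // processLine is a hypothetical callback standing in for the real record processing.
    public static void Run(string path, int workerCount, Action<string> processLine)
    {
        // Bounded (1000 is an arbitrary choice) so the reader cannot run far ahead of the workers.
        using (var queue = new BlockingCollection<string>(1000))
        {
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (string line in File.ReadLines(path))
                        queue.Add(line);        // blocks while the queue is full
                }
                finally
                {
                    queue.CompleteAdding();     // lets the consumers exit once the queue is drained
                }
            });

            var consumers = new Task[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                consumers[i] = Task.Run(() =>
                {
                    // Blocks while the queue is empty; ends after CompleteAdding and drain.
                    foreach (string line in queue.GetConsumingEnumerable())
                        processLine(line);
                });
            }

            Task.WaitAll(consumers);
            producer.Wait();                    // surfaces any exception from the reader
        }
    }
}
```

With more than one worker the callback is invoked concurrently and out of file order, so it has to be thread-safe. Error handling is deliberately thin here: if every worker throws while the queue is full, the reader would stay blocked, so real code would want a CancellationToken as well.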
You may also look at asynchronous I/O. In this style, you start a file operation from the main thread; it then continues running in the background, and when it completes it invokes a callback that you specified. In the meantime, you can continue doing other things (such as processing the data). For example, you could start an asynchronous operation to read the next 1000 bytes, then process the 1000 bytes you already have, and then wait for the next kilobyte.
Unfortunately, programming asynchronous operations in C# is a bit painful. There is an MSDN sample, but it's not nice at all. This can be nicely solved in F# using asynchronous workflows. I wrote an article that explains the problem and shows how to do a similar thing using C# iterators.
A more promising solution for C# is the Wintellect PowerThreading library, which supports a similar trick using C# iterators. There is a good introductory article in MSDN Concurrency Affairs by Jeffrey Richter.
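For readers on C# 5.0 or later, async/await removes most of that pain without an extra library. Below is a minimal sketch of the "read the next chunk while processing the current one" idea, working line by line rather than in 1000-byte blocks. The OverlappedReader class, the ProcessAsync method and the processLine callback are hypothetical names, and this is not the approach from the articles mentioned above.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class OverlappedReader
{
    // Starts reading the next line before processing the current one,
    // so the I/O and the processing can overlap.
    public static async Task ProcessAsync(string path, Action<string> processLine)
    {
        // The final 'true' argument (useAsync) asks the OS for asynchronous file I/O.
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 4096, true))
        using (var reader = new StreamReader(stream))
        {
            var pending = reader.ReadLineAsync();
            while (true)
            {
                string line = await pending;
                if (line == null) break;            // end of file

                pending = reader.ReadLineAsync();   // kick off the next read...
                processLine(line);                  // ...and work on this line meanwhile
            }
        }
    }
}
```

Only one read is ever in flight, which matters because StreamReader does not support concurrent operations. Whether this beats a plain loop depends on how slow the disk is relative to the processing, so it is worth measuring before committing to it.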