Batch data processing in real time
I am tasked with optimizing the performance of a linear data processing routine. Here's an overview of what's already in place:
Data comes in on UDP ports. We have multiple listeners, each listening on a different port and writing raw data to a SQL Server database (let's call the table RawData). Then we have multiple instances of a single-threaded linear application that grab raw data from the RawData table and process individual data rows. Processing means that the raw data is compared to previously received data for the given entity, calculations are done to calculate a number of different readings, then a couple of web services are called for each individual data row, and finally a new record is added to the ProcessedData table for each data row. The corresponding entity record is also updated in another table.
The way I see the problem, it can be broken down into smaller parts, and I could use the Producer/Consumer pattern for data processing: a single producer thread populates a shared (blocking) queue, and multiple consumers grab data rows from the queue and process them in parallel. When the consumers are done, they put the processed data into another shared queue, which is then read by yet another (single) consumer thread that does a SqlBulkCopy to insert the new records. Alongside this there will be another shared queue that stores entity info for updates, and yet another consumer that grabs the updated entity information and performs the updates.
The question is: even though it seems straightforward, it looks to me like a cumbersome approach. I do feel there's a better way of doing what I'm looking for. Any suggestions on implementing the above Producer/Consumer pattern? Or should I look for a different design pattern for my problem?
Thanks in advance
Your proposed solution sounds reasonable, and I don't view it as cumbersome at all. It's simple to understand, simple to implement, effective, and efficient. It also allows you to tune the number of producers and consumers to achieve the best performance. Decomposition into smaller parts with limited communication among the parts is a very good thing.
So what you have is multiple threads (producers) reading data from UDP and storing those items in a shared queue. Call it the RawData queue. Multiple consumers read from that queue, process items, and place the results into another shared queue. Call it the ProcessedData queue. Finally, you have a single thread that reads the ProcessedData queue and stores items in the database.
The .NET BlockingCollection is perfect for this.
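To make the shape of that pipeline concrete, here is a minimal sketch using `BlockingCollection<T>`. It is an illustration under stated assumptions, not your actual code: rows are plain `int` values, `Process` is a hypothetical stand-in for the comparison, calculations, and web service calls, and a `List<int>` stands in for the ProcessedData table where a real writer would call SqlBulkCopy. The class name, queue capacities, and batch size are likewise made up for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class PipelineSketch
{
    // Hypothetical stand-in for the real per-row work.
    private static int Process(int raw) => raw * 2;

    public static List<int> Run(IEnumerable<int> rawRows, int consumerCount, int batchSize)
    {
        // Stands in for the ProcessedData table; only the writer thread touches it.
        var stored = new List<int>();

        // Bounded queues make a fast stage block instead of growing without limit.
        using (var rawQueue = new BlockingCollection<int>(1000))
        using (var processedQueue = new BlockingCollection<int>(1000))
        {
            // Producer: fills the raw queue, then signals that no more items will arrive.
            var producer = Task.Factory.StartNew(() =>
            {
                try
                {
                    foreach (var row in rawRows) rawQueue.Add(row);
                }
                finally
                {
                    rawQueue.CompleteAdding();
                }
            }, TaskCreationOptions.LongRunning);

            // Consumers: each one drains the raw queue until it is completed and empty.
            var consumers = Enumerable.Range(0, consumerCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    foreach (var row in rawQueue.GetConsumingEnumerable())
                        processedQueue.Add(Process(row));
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            // Writer: collects results into batches; a real one would bulk insert each batch.
            var writer = Task.Factory.StartNew(() =>
            {
                var batch = new List<int>(batchSize);
                foreach (var item in processedQueue.GetConsumingEnumerable())
                {
                    batch.Add(item);
                    if (batch.Count == batchSize)
                    {
                        stored.AddRange(batch);
                        batch.Clear();
                    }
                }
                stored.AddRange(batch); // flush the final partial batch
            }, TaskCreationOptions.LongRunning);

            producer.Wait();
            Task.WaitAll(consumers);
            processedQueue.CompleteAdding(); // safe only after every consumer has finished
            writer.Wait();
        }

        return stored;
    }
}
```

Calling `PipelineSketch.Run(Enumerable.Range(1, 10), 4, 3)` returns the ten doubled values, in no guaranteed order because the consumers run in parallel. The entity-update queue from your description would be a third `BlockingCollection` with its own consumer, wired up the same way as the writer.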
This might be of some help: Question on C# threading with RFID