Multithreaded Boost C++ program design for low-latency large-data exchange
I am trying to solve a network flow problem with C++ multithreading.
Given a network (nodes connected by arcs; each arc connects exactly two nodes, one input node and one output node, and each node can have multiple input and output arcs), each node performs some computation and then exchanges its result data with the nodes connected by its input and output arcs.
Multiple nodes can be grouped into one task, which is run by one thread. In this way, the whole network computing workload can be partitioned into multiple tasks. All these tasks are pushed into a Boost thread pool so that the threads can run the tasks concurrently.
But if a node (in one thread's task) needs to exchange data with a node in another thread's task, there is a synchronization problem: the receiver must wait until the data is available in the sender's data buffer.
My program needs to partition the network so that each thread's workload is as even as possible. If all threads share one large data buffer structure, parallelism suffers because the critical section is too large: some threads must wait for the whole structure to be unlocked even though the part of it they actually need is already available for reading or writing.
For example, suppose the one large data buffer structure has the following buffer cells: cell1, cell2, cell3, cell4.
When thread 1 wants to write cell1, it must lock the whole structure, so thread 2 cannot read or write cell2, and so on.
So I want to break the one large data buffer structure into multiple distinct data cells, one per thread, so that each cell holds only the data needed by one thread's task.
For example, if we have 2 threads, we create 2 data cells that hold the data needed by the 2 threads separately. If we have 4 threads, we create 4 data cells that hold the data needed by the 4 threads separately, and so on.
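The per-thread cell idea above could be sketched roughly like this: each cell carries its own mutex, so locking one cell never blocks threads working on other cells (the names `DataCell` and `make_cells` are hypothetical, just for illustration):

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical per-thread data cell: each thread task owns one cell.
// Locking this cell does not block access to any other cell.
struct DataCell {
    std::mutex mtx;              // guards only this cell's buffer
    std::vector<double> buffer;  // data exchanged with this thread's nodes
};

// One cell per thread, sized at startup from the thread count.
// unique_ptr is used because std::mutex is neither copyable nor movable.
std::vector<std::unique_ptr<DataCell>> make_cells(std::size_t num_threads) {
    std::vector<std::unique_ptr<DataCell>> cells;
    cells.reserve(num_threads);
    for (std::size_t i = 0; i < num_threads; ++i)
        cells.push_back(std::make_unique<DataCell>());
    return cells;
}
```

With this layout, thread 1 writing to `cells[0]` only takes `cells[0]->mtx`, so thread 2 can read or write `cells[1]` at the same time.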
My questions are:
(1) How should the data cell be designed? As you can see, the number of cells depends on the number of threads.
(2) How can I reduce synchronization overhead? The critical section is small, but the cost of acquiring and releasing the mutex may be very high if the inter-node data exchange frequency is high.
(3) When a node's computation is done and its data is written to its cell, how do I notify the receiver node so that the notification message is received only by the waiting thread that runs the receiver node's task? All other unrelated nodes and threads should not be impacted.
The program is very time-sensitive, and the latency of message exchange must be controlled tightly and reduced as much as possible.
Any help is really appreciated.
Thanks
The usual way of dealing with this, I think, is to set up a message-passing infrastructure between threads.
Each thread has a message queue. In your example, say node N1 is assigned to thread 1, node N2 is assigned to thread 2, and there is an edge between N1 and N2. Then, when thread 1 finishes the N1 calculation, it sends a message to thread 2:
"Send input to node N2"
To send a message to a thread, you just lock that thread's message queue and append your message. You use one mutex and two condition variables (queue_not_empty_condition and queue_not_full_condition) to implement a bounded queue. When a thread wants to wait for new work, it just goes to sleep on its message queue.
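A minimal sketch of such a bounded queue, using one `std::mutex` and the two condition variables described above (`Message` is a placeholder type; in a real program it would carry node IDs and result data):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

struct Message {
    int target_node;      // which node this input is destined for
    std::string payload;  // placeholder for the actual result data
};

class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(Message m) {
        std::unique_lock<std::mutex> lock(mutex_);
        // Block the sender while the queue is full (queue_not_full_condition).
        not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push_back(std::move(m));
        not_empty_.notify_one();  // wake one receiver waiting for work
    }

    Message pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        // Sleep until a sender has deposited a message (queue_not_empty_condition).
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        Message m = std::move(queue_.front());
        queue_.pop_front();
        not_full_.notify_one();  // a sender blocked on a full queue may proceed
        return m;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::condition_variable not_full_;
    std::deque<Message> queue_;
    std::size_t capacity_;
};
```

Because each thread owns its own queue, `notify_one` wakes exactly the thread running the receiver node's task; no unrelated thread is disturbed, which addresses question (3).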
To reduce the synchronization overhead, you might want a way to put multiple messages into a queue ("batch send") while locking the mutex just once. Then the loop within one thread would look something like this:
if (I can do work without communicating with other threads)
    do that work
else
    send all pending messages (in batches, one batch per destination thread)
    wait on my input queue and pop the messages off in a batch
"Batching" of messages might interact in complicated ways with bounded queues, though. No free lunch.
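One way to sketch batching, assuming an unbounded queue to sidestep the bounded-queue interaction noted above (names `BatchQueue`, `push_batch`, `pop_all` are hypothetical): both sides pay for the lock only once per batch rather than once per message.

```cpp
#include <condition_variable>
#include <iterator>
#include <mutex>
#include <deque>
#include <vector>

struct Msg {
    int node;      // destination node ID
    double value;  // placeholder result data
};

class BatchQueue {
public:
    // "Batch send": append a whole vector of messages under one lock acquisition.
    void push_batch(std::vector<Msg> batch) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            for (Msg& m : batch)
                queue_.push_back(std::move(m));
        }  // release the lock before notifying
        not_empty_.notify_one();
    }

    // Drain everything at once, so the consumer also locks only once,
    // then processes the returned batch without holding the mutex.
    std::vector<Msg> pop_all() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        std::vector<Msg> out(std::make_move_iterator(queue_.begin()),
                             std::make_move_iterator(queue_.end()));
        queue_.clear();
        return out;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::deque<Msg> queue_;
};
```

If you need the queue bounded as well, `push_batch` would additionally have to wait for enough free capacity (or split the batch), which is exactly the complication mentioned above.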