How to store and push simulation state while minimally affecting updates per second?
My app consists of two threads:
- GUI Thread (using Qt)
- Simulation Thread
My reason for using two threads is to keep the GUI responsive, while letting the Sim thread spin as fast as possible.
In my GUI thread I'm rendering the entities in the sim at an FPS of 30-60; however, I want my sim to "crunch ahead" - so to speak - and queue up game state to be drawn eventually (think streaming video, you've got a buffer).
Now for each frame of the sim I render I need the corresponding simulation "State". So my sim thread looks something like:
while (1) {
    simulation.update();
    SimState* s = new SimState;
    simulation.getAgents(s->agents); // store agents
    // store other things to SimState here..
    stateStore.enqueue(s); // stateStore is a QQueue<SimState*>
    if ( /* some threshold reached */ ) {
        // push stateStore
    }
}
SimState looks like:
struct SimState {
    std::vector<Agent> agents;
    // other stuff here
};
And Simulation::getAgents looks like:
void Simulation::getAgents(std::vector<Agent>& a) const
{
    // mAgents is a std::vector<Agent>
    std::vector<Agent> a_tmp(mAgents);
    a.swap(a_tmp);
}
The Agents themselves are somewhat complex classes. The members are a bunch of ints and floats and two std::vector<float>s.
With this current setup the sim can't crunch much faster than the GUI thread is drawing. I've verified that the current bottleneck is simulation.getAgents(s->agents), because even if I leave out the push, the updates per second are slow. If I comment out that line I see several orders of magnitude improvement in updates per second.
So, what sorts of containers should I be using to store the simulation's state? I know there is a bunch of copying going on at the moment, but some of it is unavoidable. Should I store Agent* in the vector instead of Agent?
Note: In reality the simulation isn't in a loop, but uses Qt's QMetaObject::invokeMethod(this, "doSimUpdate", Qt::QueuedConnection); so I can use signals/slots to communicate between the threads; however, I've verified a simpler version using while(1){} and the issue persists.
Try re-using your SimState objects (using some kind of pool mechanism) instead of allocating them every time. After a few simulation loops, the re-used SimState objects will have vectors that have grown to the needed size, thus avoiding reallocation and saving time.
An easy way to implement a pool is to initially push a bunch of pre-allocated SimState objects onto a std::stack<SimState*>. Note that a stack is preferable to a queue, because you want to take the SimState object that is most likely to be "hot" in the cache memory (the most recently used SimState object will be at the top of the stack). Your simulation thread pops SimState objects off the stack and populates them with the computed simulation state. These computed SimState objects are then pushed into a producer/consumer queue to feed the GUI thread. After being rendered by the GUI thread, they are pushed back onto the SimState stack (i.e. the "pool"). Try to avoid needless copying of SimState objects while doing all this. Work directly with the SimState object in each stage of your "pipeline".
Of course, you'll have to use the proper synchronization mechanisms in your SimState stack and queue to avoid race conditions. Qt might already have thread-safe stacks/queues. A lock-free stack/queue might speed things up if there is a lot of contention (Intel Threading Building Blocks provides such lock-free queues). Seeing that it takes on the order of 1/50 of a second to compute a SimState, I doubt that contention will be a problem.
If your SimState pool becomes depleted, then it means that your simulation thread is too "far ahead" and can afford to wait for some SimState objects to be returned to the pool. The simulation thread should block (using a condition variable) until a SimState object becomes available again in the pool. The size of your SimState pool corresponds to how much simulation state can be buffered (e.g. a pool of ~50 objects gives you a crunch-ahead time of up to ~1 second).
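Here's one sketch of that blocking behavior using a std::condition_variable (the BlockingPool class and its method names are hypothetical, and SimState is stubbed out for brevity):

```cpp
#include <condition_variable>
#include <mutex>
#include <stack>
#include <vector>

struct SimState { /* agents vector etc., as in the question */ };

// Illustrative blocking pool: the sim thread waits in acquire() whenever
// the GUI thread hasn't yet returned any SimState objects.
class BlockingPool {
public:
    void release(SimState* s) {
        {
            std::lock_guard<std::mutex> lk(m_);
            free_.push(s);
        }
        cv_.notify_one(); // wake a sim thread waiting in acquire()
    }
    SimState* acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !free_.empty(); }); // block while empty
        SimState* s = free_.top();
        free_.pop();
        return s;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::stack<SimState*> free_;
};
```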
You can also try running parallel simulation threads to take advantage of multi-core processors. The Thread Pool pattern can be useful here. However, care must be taken that the computed SimStates are enqueued in the proper order. A thread-safe priority queue ordered by time-stamp might work here.
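A sketch of that time-stamp ordering with std::priority_queue (the TimedState and LaterTick names are made up for illustration; the real payload would carry a SimState pointer):

```cpp
#include <queue>
#include <vector>

// Illustrative ordering for out-of-order SimStates from parallel workers:
// a min-heap keyed on the simulation tick, so the GUI thread always
// consumes the earliest computed state first.
struct TimedState {
    long tick; // simulation time stamp
    // SimState* state; // payload, omitted in this sketch
};

struct LaterTick {
    bool operator()(const TimedState& a, const TimedState& b) const {
        return a.tick > b.tick; // smallest tick comes out first
    }
};

using OrderedStates =
    std::priority_queue<TimedState, std::vector<TimedState>, LaterTick>;
```

A thread-safe wrapper (mutex plus condition variable, as in the pool above) would go around this container.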
Here's a simple diagram of the pipeline architecture I'm suggesting:
[diagram: SimState pool → simulation thread → producer/consumer queue → GUI thread → back to the pool]
(NOTE: The pool and queue hold SimState by pointer, not by value!)
Hope this helps.
If you plan to re-use your SimState objects, then your Simulation::getAgents method will be inefficient. This is because the vector<Agent>& a parameter is likely to already have enough capacity to hold the agent list. The way you're doing it now throws away that already-allocated vector and creates a new one from scratch. IMO, your getAgents should be:
void Simulation::getAgents(std::vector<Agent>& a) const
{
    a = mAgents;
}
Yes, you lose exception safety, but you might gain performance (especially with the reusable SimState approach).
Another idea: You could try making your Agent objects fixed-size by using a C-style array (or boost::array) and a "count" variable instead of std::vector for the Agent's float list members. Simply make the fixed-size array big enough for any situation in your simulation. Yes, you'll waste space, but you may gain a lot of speed.
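For example, a sketch of a fixed-capacity Agent using std::array (the member names and the kMaxFloats bound are assumptions; pick a bound that covers your simulation's worst case):

```cpp
#include <array>
#include <cstddef>

// Assumed upper bound on the per-agent float lists for this simulation.
constexpr std::size_t kMaxFloats = 32;

// Illustrative fixed-capacity Agent: replaces a std::vector<float> member
// with inline storage plus a count, so copying an Agent never touches
// the heap (the whole object can be copied as a flat block of memory).
struct Agent {
    int id = 0;
    std::array<float, kMaxFloats> values{}; // fixed storage, possibly wasted
    std::size_t count = 0;                  // how many entries are in use

    void push(float v) {
        if (count < kMaxFloats)
            values[count++] = v;
    }
};
```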
You can then pool your Agents using a fixed-size object allocator (such as boost::pool) and pass them around by pointer (or shared_ptr). That'll eliminate a lot of heap allocation and copying.
You can use this idea alone or in combination with the above ideas. This idea seems easier to implement than the pipeline thing above, so you might want to try it first.
Yet another idea: Instead of a thread pool for running simulation loops, you can break down the simulation into several stages and execute each stage in its own thread. Producer/consumer queues are used to exchange SimState objects between stages. For this to be effective, the different stages need to have roughly similar CPU workloads (otherwise, one stage will become the bottleneck). This is a different way to exploit parallelism.
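A minimal sketch of such a stage-to-stage producer/consumer queue (the StageQueue name is illustrative; a production version would likely be bounded so a slow stage applies back-pressure):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Illustrative producer/consumer queue for handing SimState pointers
// from one pipeline stage to the next. Unbounded for brevity.
template <typename T>
class StageQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one();
    }
    T pop() { // blocks until the upstream stage produces an item
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};
```

Each stage would loop on `pop()`, do its share of the update, and `push()` the SimState to the next stage's queue.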