Is SIMD Worth It? Is there a better option?
I have some code that runs fairly well, but I would like to make it run better. The major problem I have with it is that it needs to have a nested for loop. The outer one is for iterations (which must happen serially), and the inner one is for each point particle under consideration. I know there's not much I can do about the outer one, but I'm wondering if there is a way of optimizing something like:
void collide(particle particles[], box boxes[],
double boxShiftX, double boxShiftY) {/*{{{*/
int i;
double nX;
double nY;
int boxnum;
for(i=0;i<PART_COUNT;i++) {
boxnum = ((((int)(particles[i].sX+boxShiftX))/BOX_SIZE)%BWIDTH+
BWIDTH*((((int)(particles[i].sY+boxShiftY))/BOX_SIZE)%BHEIGHT));
//copied and pasted the macro which is why it's kinda odd looking
particles[i].vX -= boxes[boxnum].mX;
particles[i].vY -= boxes[boxnum].mY;
if(boxes[boxnum].rotDir == 1) {
nX = particles[i].vX*Wxx+particles[i].vY*Wxy;
nY = particles[i].vX*Wyx+particles[i].vY*Wyy;
} else { //to make it randomly pick a rot. direction
nX = particles[i].vX*Wxx-particles[i].vY*Wxy;
nY = -particles[i].vX*Wyx+part开发者_如何学Pythonicles[i].vY*Wyy;
}
particles[i].vX = nX + boxes[boxnum].mX;
particles[i].vY = nY + boxes[boxnum].mY;
}
}/*}}}*/
I've looked at SIMD, though I can't find much about it, and I'm not entirely sure that the processing required to properly extract and pack the data would be worth the gain of doing half as many instructions, since apparently only two doubles can be used at a time.
I tried breaking it up into multiple threads with shm and pthread_barrier (to synchronize the different stages, of which the above code is one), but it just made it slower.
My current code does go pretty quickly; it's on the order of one second per 10M particle*iterations, and from what I can tell from gprof, 30% of my time is spent in that function alone (5000 calls; PART_COUNT=8192 particles took 1.8 seconds). I'm not worried about small, constant time things, it's just that 512K particles * 50K iterations * 1000 experiments took more than a week last time.
I guess my question is if there is any way of dealing with these long vectors that is more efficient than just looping through them. I feel like there should be, but I can't find it.
I'm not sure how much SIMD would benefit; the inner loop is pretty small and simple, so I'd guess (just by looking) that you're probably more memory-bound than anything else. With that in mind, I'd try rewriting the main part of the loop to not touch the particles array more than needed:
const double temp_vX = particles[i].vX - boxes[boxnum].mX;
const double temp_vY = particles[i].vY - boxes[boxnum].mY;
if(boxes[boxnum].rotDir == 1)
{
nX = temp_vX*Wxx+temp_vY*Wxy;
nY = temp_vX*Wyx+temp_vY*Wyy;
}
else
{
//to make it randomly pick a rot. direction
nX = temp_vX*Wxx-temp_vY*Wxy;
nY = -temp_vX*Wyx+temp_vY*Wyy;
}
particles[i].vX = nX;
particles[i].vY = nY;
This has the small potential side effect of not doing the extra addition at the end.
Another potential speedup would be to use __restrict
on the particle array, so that the compiler can better optimize the writes to the velocities. Also, if Wxx etc. are global variables, they may have to get reloaded each time through the loop instead of possibly stored in registers; using __restrict
would help with that too.
Since you're accessing the particles in order, you can try prefetching (e.g. __builtin_prefetch
on GCC) a few particles ahead to reduce cache misses. Prefetching on the boxes is a bit tougher since you're accessing them in an unpredictable order; you could try something like
int nextBoxnum = ((((int)(particles[i+1].sX+boxShiftX) /// etc...
// prefetch boxes[nextBoxnum]
One last one that I just noticed - if box::rotDir is always +/- 1.0, then you can eliminate the comparison and branch in the inner loop like this:
const double rot = boxes[boxnum].rotDir; // always +/- 1.0
nX = particles[i].vX*Wxx + rot*particles[i].vY*Wxy;
nY = rot*particles[i].vX*Wyx + particles[i].vY*Wyy;
Naturally, the usual caveats of profiling before and after apply. But I think all of these might help, and can be done regardless of whether or not you switch to SIMD.
Just for the record, there's also libSIMDx86!
http://simdx86.sourceforge.net/Modules.html
(On compiling you may also try: gcc -O3 -msse2 or similar).
((int)(particles[i].sX+boxShiftX))/BOX_SIZE
That's expensive if sX is an int (can't tell). Truncate boxShiftX/Y to an int before entering the loop.
Do you have sufficient profiling to tell you where the time is spent within that function?
For instance, are you sure it's not your divs and mods in the boxnum calculation where the time is being spent? Sometimes compilers fail to spot possible shift/AND alternatives, even where a human (or at least, one who knew BOX_SIZE and BWIDTH/BHEIGHT, which I don't) might be able to.
It would be a pity to spend lots of time on SIMDifying the wrong bit of the code...
The other thing which might be worth looking into is if the work can be coerced into something which could work with a library like IPP, which will make well-informed decisions about how best to use the processor.
Your algorithm has too many memory, integer and branch instructions to have enough independent flops to profit from SIMD. The pipeline will be constantly stalled.
Finding a more effective way to randomize would be top of the list. Then, try to work either in float or int, but not both. Recast conditionals as arithmetic, or at least as a select operation. Only then does SIMD become a realistic proposition
精彩评论