Does the following ARM instruction sequence generate stalls?
I'm programming the ARM11MP VFP unit. I've looked over the docs and am concerned that the following will stall badly when doing a 4-component dot product (as part of a 4x4 matrix multiply):
fmuls s0, s0, s4
fmacs s0, s1, s5
fmacs s0, s2, s6
fmacs s0, s3, s7
Does the accumulate step generate stalls here? If so, I will have to really change things around, as I only get 32 single-precision registers to work with and this takes 9 as it is. Also, I could set up the vector register to do this in 1 instruction, but I am wondering whether the 3 instruction cycles will be worth it, as I'd have to unset it nearly immediately for a store back to memory unless I overflowed to the ARM registers. Posting from home without my real SO account here...
I'm not in any way familiar with ARM, so you should take this with a grain of salt. This answer is just based on about 20 mins of searching around for documentation on my phone. There could be some things I'm missing, so this may not be correct.
In any case, I believe yes, this should cause pipeline stalls. The VFP coprocessor has an 8 stage pipeline, but because of "forwarding" (each instruction depends on the result of the previous instruction) the number of stalled cycles should be reduced to 7 for each instruction. Still, given the 4 instructions you have you would be stalled for about 28 cycles, which isn't very good. This also doesn't account for time required to load the registers, which could exacerbate the problem.
You can probably improve performance by interleaving the "fld instructions" with the fmacs instructions.
Check out the following for more info:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0360f/CACBBDCE.html
The results of an "fld" instruction should be available within 4 cycles, which means you could do something like:
fld s0
fld s4
fld s1
fld s5
fmuls s0, s0, s4
fld s2
fld s6
fld s3
fld s7
fmacs s0, s1, s5
fmacs s0, s2, s6
fmacs s0, s3, s7
Then you could reduce the total number of stalled cycles down to 17.
Assuming you are doing this in a loop, you could probably reduce stalling further by starting work on the "next" loop iteration while the current iteration is still executing (i.e. loop unrolling). Also, depending on how your data is stored, once you have unrolled the loop you can probably improve things even more by using fldm instead of fld instructions.
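To make the unrolling idea concrete, here is a rough C sketch (not the asker's code) that computes two independent dot products with their multiply-accumulate steps interleaved. The function name dot4_pair and its parameters are made up for illustration. The point is that acc0 and acc1 form separate dependency chains, so an accumulate on one chain does not have to wait for the other. Whether a given compiler keeps this ordering, and how many stall cycles it actually saves on the VFP, is not something this sketch can promise.

```c
/* Hypothetical helper: two 4-component dot products, interleaved by hand.
 * acc0 and acc1 are independent dependency chains; each line only depends
 * on the previous line of the SAME chain, not on its neighbour. */
void dot4_pair(const float *a0, const float *b0,
               const float *a1, const float *b1,
               float *out0, float *out1)
{
    float acc0 = a0[0] * b0[0];   /* chain 0: initial multiply */
    float acc1 = a1[0] * b1[0];   /* chain 1: initial multiply */

    acc0 += a0[1] * b0[1];        /* chain 0: multiply-accumulate */
    acc1 += a1[1] * b1[1];        /* chain 1: multiply-accumulate */

    acc0 += a0[2] * b0[2];
    acc1 += a1[2] * b1[2];

    acc0 += a0[3] * b0[3];
    acc1 += a1[3] * b1[3];

    *out0 = acc0;
    *out1 = acc1;
}
```

In a 4x4 matrix multiply the two chains could be two output elements of the same row, which also lets the shared row values stay in registers.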
In any case, optimizing the pipeline behavior by hand is difficult. Is there a reason you can't let the compiler do instruction scheduling for you?
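If you do go the compiler route, something like the following plain C is roughly what you would hand it. This is only a sketch: the name mat4_mul and the row-major, flat 16-float layout are my assumptions, not anything from the question. Each output element is the same multiply followed by three accumulates as in the question's sequence, and with optimization enabled and VFP code generation selected for your toolchain, the compiler is free to interleave loads and independent accumulates as it sees fit. I haven't checked what any particular compiler emits for this.

```c
/* Hypothetical 4x4 matrix multiply, out = a * b.
 * Assumes row-major storage in flat arrays of 16 floats, and that
 * out does not alias a or b. */
void mat4_mul(float *out, const float *a, const float *b)
{
    for (int i = 0; i < 4; ++i) {          /* row of a */
        for (int j = 0; j < 4; ++j) {      /* column of b */
            /* 4-component dot product: one multiply, three accumulates */
            float acc = a[4 * i + 0] * b[0 + j];
            acc += a[4 * i + 1] * b[4 + j];
            acc += a[4 * i + 2] * b[8 + j];
            acc += a[4 * i + 3] * b[12 + j];
            out[4 * i + j] = acc;
        }
    }
}
```

Comparing the generated assembly against your hand-written version would show whether the compiler's scheduling is good enough for your case.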