Does the following ARM instruction set generate stalls?

Posted 2023-01-28 19:50

I'm programming the ARM11MP VFP unit. I've looked over the docs and am concerned that the following will stall badly when doing a 4-component dot product (as part of a 4x4 matrix multiply):


  fmuls   s0, s0, s4
  fmacs   s0, s1, s5
  fmacs   s0, s2, s6
  fmacs   s0, s3, s7

Does the accumulate step generate stalls here? If so, I will have to really change things around, as I only get 32 single-precision registers to work with and this takes 9 as it is. Also, I could set up the vector register to do this in 1 instruction, but I am wondering whether the 3 instruction cycles will be worth it, as I'd have to unset it almost immediately for a store back to memory unless I overflowed to the ARM registers. Posting from home without my real SO account here...

I'm not in any way familiar with ARM, so you should take this with a grain of salt. This answer is just based on about 20 mins of searching around for documentation on my phone. There could be some things I'm missing, so this may not be correct.

In any case, I believe that yes, this should cause pipeline stalls. The VFP coprocessor has an 8-stage pipeline, and each instruction here depends on the result of the previous one. Because of result "forwarding", the stall should be reduced to 7 cycles per instruction. Still, with the 4 instructions you have, you would be stalled for about 28 cycles, which isn't very good. This also doesn't account for the time required to load the registers, which could exacerbate the problem.

You can probably improve performance by interleaving the load instructions (flds) with the fmacs instructions.

Check out the following for more info:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0360f/CACBBDCE.html

The result of a load (flds) should be available within 4 cycles, which means that if you could do something like:

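  @ Assumption for illustration: r0 points to the first vector, r1 to the second
  flds    s0, [r0]
  flds    s4, [r1]
  flds    s1, [r0, #4]
  flds    s5, [r1, #4]
  fmuls   s0, s0, s4        @ first product, issued while later loads are pending
  flds    s2, [r0, #8]
  flds    s6, [r1, #8]
  flds    s3, [r0, #12]
  flds    s7, [r1, #12]
  fmacs   s0, s1, s5
  fmacs   s0, s2, s6
  fmacs   s0, s3, s7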

then you could reduce the total number of stalled cycles to 17.

Assuming you are doing this in a loop, you could probably further reduce stalling by starting work on the "next" loop iteration while the current iteration is still executing (i.e. loop unrolling). Also, depending on how your data is stored, once you are unrolling the loop you can probably improve things even more by using fldm instead of individual fld instructions.
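
To make the idea concrete, here is a rough sketch in C (not VFP assembly) of two independent dot products with their steps interleaved. The function name dot4_x2 and the argument layout are made up for illustration. The point is only the dependency structure: acc0 and acc1 never depend on each other, so the multiply-accumulate for one can be in flight while the other waits for its previous result. A compiler is free to reorder these statements, so treat the source order as a picture of the schedule rather than a guarantee.

```c
/* Two independent 4-component dot products against the same vector b,
   written so that consecutive statements never depend on each other.
   dot4_x2 is a hypothetical name used only for this sketch. */
void dot4_x2(const float a0[4], const float a1[4],
             const float b[4], float out[2])
{
    float acc0 = a0[0] * b[0];   /* chain 0 starts */
    float acc1 = a1[0] * b[0];   /* chain 1 starts, independent of chain 0 */

    acc0 += a0[1] * b[1];        /* chain 0 accumulates ... */
    acc1 += a1[1] * b[1];        /* ... while chain 1 fills the gap */

    acc0 += a0[2] * b[2];
    acc1 += a1[2] * b[2];

    acc0 += a0[3] * b[3];
    acc1 += a1[3] * b[3];

    out[0] = acc0;
    out[1] = acc1;
}
```

In a 4x4 matrix multiply, a0 and a1 could be two rows of the left matrix and b one column of the right matrix, so every pair of output elements gives you two chains to interleave.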

In any case, optimizing the pipeline behavior by hand is difficult. Is there a reason you can't let the compiler do instruction scheduling for you?
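
For comparison, this is roughly what the "let the compiler do it" version could look like. It is a minimal sketch, assuming row-major 4x4 matrices stored as 16 floats; the names dot4_strided and mat4_mul are invented here. Built with optimization enabled (for example -O2) and a VFP target, the compiler chooses the order of loads, multiplies and multiply-accumulates itself. Whether its schedule beats a hand-tuned one on your core is something you would have to measure.

```c
/* Dot product of row `row` of a with column `col` of b.
   Both matrices are assumed to be row-major 4x4 (an assumption of this
   sketch, not something stated in the question). */
static float dot4_strided(const float a[16], const float b[16],
                          int row, int col)
{
    return a[row * 4 + 0] * b[0 * 4 + col]
         + a[row * 4 + 1] * b[1 * 4 + col]
         + a[row * 4 + 2] * b[2 * 4 + col]
         + a[row * 4 + 3] * b[3 * 4 + col];
}

/* out = a * b. out must not alias a or b. */
void mat4_mul(const float a[16], const float b[16], float out[16])
{
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            out[row * 4 + col] = dot4_strided(a, b, row, col);
        }
    }
}
```

Looking at the generated assembly (for example with the compiler's -S option) is a cheap way to see how it ordered the loads and fmacs instructions before deciding whether hand scheduling is worth the effort.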
