iPhone ARMv6 VFP asm latency, throughput and hazards
In this document: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301g/DDI0301G_arm1176jzfs_r0p7_trm.pdf
on page 21-25 (PDF page 875), the throughput and latency timings are given for the assembly instructions of the VFP unit.
Are those numbers independent of vector size?
1: Let's take FMULS, which has a throughput of 1 and a latency of 8. Does that mean I can start a new FMULS operation every cycle, as long as it doesn't use a register that is still being calculated by a previous instruction? For example:
FMULS s8, s16, s20
FMULS s12, s21, s25
Will those execute right after each other?
2: What happens if I have two FMULS instructions after each other, where one argument depends on the previous computation?
FMULS s8, s16, s20
FMULS s12, s21, s8
Will the VFP wait for 8 cycles before starting to process the second instruction?
3: What if we are in vector mode with 4 elements, and on the second FMULS instruction all input registers but one are available? What will happen?
4: sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?
Thanks!
Your questions are all answered in the document that you linked. You should read it carefully.
Are those numbers independent of vector size?
No. See, for example, Table 21-15 in the document you linked. Note the latency of the short vector FADDS.
Does it mean that I can start a new FMULS operation every cycle if it doesn't depend on an earlier result that isn't available yet?
Yes, that's the definition of throughput.
What happens if I have two FMULS instructions after each other, where one argument depends on the previous computation?
Execution will stall until the result of the first FMULS is available. See 21.6 "Operation of the scoreboards" for more detail.
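To make the difference between the two FMULS pairs from the question concrete, here is a rough sketch of a toy issue model. It is not a description of the actual scoreboard hardware: it assumes strictly in-order issue, at most one instruction per cycle, and that an instruction simply waits until all of its source registers have been written. The function name `issue_cycles` is made up for this illustration; the latency of 8 and throughput of 1 are the FMULS figures quoted in the question.

```python
# Toy in-order issue model (an illustration, not the real VFP scoreboard):
# one instruction may issue per cycle, and an instruction waits until all
# of its source registers have been written by earlier instructions.
LATENCY = 8      # FMULS result latency quoted in the question
THROUGHPUT = 1   # FMULS issue interval quoted in the question


def issue_cycles(program):
    """program: list of (dest, src_a, src_b) register names.

    Returns the cycle at which each instruction issues in this toy model.
    """
    ready = {}       # register name -> cycle at which its value is available
    next_slot = 0    # earliest cycle at which the next instruction may issue
    cycles = []
    for dest, src_a, src_b in program:
        # Wait for the issue slot and for both source operands.
        start = max(next_slot, ready.get(src_a, 0), ready.get(src_b, 0))
        cycles.append(start)
        ready[dest] = start + LATENCY
        next_slot = start + THROUGHPUT
    return cycles


# FMULS s8, s16, s20 ; FMULS s12, s21, s25  (no dependency)
independent = [("s8", "s16", "s20"), ("s12", "s21", "s25")]
# FMULS s8, s16, s20 ; FMULS s12, s21, s8   (second reads s8)
dependent = [("s8", "s16", "s20"), ("s12", "s21", "s8")]

print(issue_cycles(independent))  # [0, 1]
print(issue_cycles(dependent))    # [0, 8]
```

In this simplified model the independent pair issues back to back, while the dependent FMULS is held until cycle 8. The practical consequence is the usual one: place unrelated instructions between a producer and its consumer so the stall cycles do useful work.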
What if we are in vector mode with 4 elements, and on the second FMULS instruction all input registers but one are available? What will happen?
It will stall. Same reference.
Sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?
No. See section 21.10 "Parallel Execution". An example is given in Table 21-15, in which a non-dependent FADDS executes immediately following FDIVS.
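The same idea can be sketched as a toy model with separate pipelines. This is only an illustration under stated assumptions, not the TRM's timing model: the pipeline names, the `TIMINGS` table, and the function `issue_cycles` are made up here, the 19-cycle divide latency is the figure from the question, and the FADDS latency of 8 and the divide pipeline-busy time of 15 are placeholder values that should be checked against the timing tables in the TRM.

```python
# Toy model with separate pipelines (illustrative assumptions only):
# an instruction issues when the single issue slot is free, its own
# pipeline is free, and all of its source registers are ready.
TIMINGS = {
    # op: (pipeline, cycles the pipeline stays busy, result latency)
    "FDIVS": ("DS", 15, 19),    # 19 is from the question; 15 is assumed
    "FADDS": ("FMAC", 1, 8),    # assumed values for illustration
}


def issue_cycles(program, timings=TIMINGS):
    """program: list of (op, dest, [sources]). Returns issue cycle per op."""
    ready = {}        # register name -> cycle at which its value is available
    pipe_free = {}    # pipeline name -> cycle at which it accepts a new op
    next_slot = 0     # at most one instruction issues per cycle
    cycles = []
    for op, dest, sources in program:
        pipe, busy, latency = timings[op]
        start = max([next_slot, pipe_free.get(pipe, 0)]
                    + [ready.get(s, 0) for s in sources])
        cycles.append(start)
        ready[dest] = start + latency
        pipe_free[pipe] = start + busy
        next_slot = start + 1
    return cycles


program = [
    ("FDIVS", "s0", ["s1", "s2"]),
    ("FADDS", "s3", ["s4", "s5"]),   # independent of the divide
    ("FADDS", "s6", ["s0", "s3"]),   # needs the divide result
]
print(issue_cycles(program))  # [0, 1, 19]
```

Under these assumptions the independent FADDS issues on the cycle right after the FDIVS, and only the FADDS that reads the quotient waits for the full divide latency.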
Note that it can be a bit of a challenge (though not impossible) to write short-vector VFP code that performs substantially faster than scalar code for many types of computation. Even if you learn how to do it, it will be of questionable value since the NEON unit seems to be the new model for vector computation on ARM. You may be better served in the long run by ignoring the short-vector operation for now and focusing on learning NEON for the future.