Most efficient way to store 4 dot products into a contiguous array in C using SSE intrinsics
I am optimizing some code for an Intel x86 Nehalem micro-architecture using SSE intrinsics.
A portion of my program computes 4 dot products and adds each result to the previous values in a contiguous chunk of an array. More specifically,
tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);
tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);
tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);
tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);
tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);
_mm_storeu_ps(C_2, tmp0);
Notice that I am going about this by using 4 temporary xmm registers, one to hold each dot product's result. In each register the result is placed into a different 32-bit lane (selected by the mask passed to _mm_dp_ps), so the temporaries look like this:
tmp0= R0-zero-zero-zero
tmp1= zero-R1-zero-zero
tmp2= zero-zero-R2-zero
tmp3= zero-zero-zero-R3
I combine the values contained in each tmp variable into one xmm variable by summing them up with the following instructions:
tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
Finally, I add the register containing all 4 dot products to a contiguous part of the array, so that each array element is incremented by its corresponding dot product, like so (C_0n holds the 4 values currently in the array that are to be updated; C_2 is the address pointing to these 4 values):
tmp0 = _mm_add_ps(tmp0, C_0n);
_mm_storeu_ps(C_2, tmp0);
I want to know if there is a less roundabout, more efficient way to take the results of the dot products and add them to the contiguous chunk of the array. As it stands, I am doing 3 additions between registers that each have only 1 non-zero value. It seems there should be a more effective way to go about this.
I appreciate all help. Thank you.
For code like this, I like to store the "transpose" of the A's and B's, so that {A_0m.x, A_1m.x, A_2m.x, A_3m.x} are stored in one vector, etc. Then you can do the dot product using just multiplies and adds, and when you're done, you have all 4 dot products in one vector without any shuffling.
This is used frequently in raytracing, to test 4 rays at once against a plane (e.g. when traversing a kd-tree). If you don't have control over the input data, though, the overhead of doing the transpose might not be worth it. The code will also run on pre-SSE4 machines, although that might not be an issue.
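As an illustration only (not code from the question), a minimal sketch of that layout might look like the following, where Ax/Ay/Az/Aw are assumed to hold the transposed components of A_0m..A_3m and B is the shared B_0m:

#include <xmmintrin.h>

/* Sketch only: Ax = {A_0m.x, A_1m.x, A_2m.x, A_3m.x}, Ay/Az/Aw likewise; B = B_0m. */
static inline __m128 dot4_transposed(__m128 Ax, __m128 Ay, __m128 Az, __m128 Aw, __m128 B)
{
    __m128 bx = _mm_shuffle_ps(B, B, _MM_SHUFFLE(0, 0, 0, 0)); /* broadcast B.x */
    __m128 by = _mm_shuffle_ps(B, B, _MM_SHUFFLE(1, 1, 1, 1)); /* broadcast B.y */
    __m128 bz = _mm_shuffle_ps(B, B, _MM_SHUFFLE(2, 2, 2, 2)); /* broadcast B.z */
    __m128 bw = _mm_shuffle_ps(B, B, _MM_SHUFFLE(3, 3, 3, 3)); /* broadcast B.w */

    __m128 r = _mm_mul_ps(Ax, bx);              /* x terms of all 4 dot products */
    r = _mm_add_ps(r, _mm_mul_ps(Ay, by));      /* + y terms */
    r = _mm_add_ps(r, _mm_mul_ps(Az, bz));      /* + z terms */
    r = _mm_add_ps(r, _mm_mul_ps(Aw, bw));      /* + w terms */
    return r;                                   /* lane i holds A_im . B_0m */
}

The returned vector can then be accumulated with _mm_add_ps against C_0n and stored with _mm_storeu_ps to C_2, with no lane masking or shuffling at the end.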
A small efficiency note on the existing code: instead of this
tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);
It may be slightly better to do this:
tmp0 = _mm_add_ps(tmp0, tmp1); // 0 + 1 -> 0
tmp2 = _mm_add_ps(tmp2, tmp3); // 2 + 3 -> 2
tmp0 = _mm_add_ps(tmp0, tmp2); // 0 + 2 -> 0
tmp0 = _mm_add_ps(tmp0, C_0n);
since the first two _mm_add_ps calls are now completely independent. Also, I don't know the relative timings of adding vs. shuffling, but that might be slightly faster.
Hope that helps.
It is also possible to use the SSE3 hadd instruction (_mm_hadd_ps). In some trivial tests it turned out faster than using _mm_dp_ps. The following returns the 4 dot products in one vector, which can then be added to the array.
#include <pmmintrin.h> /* SSE3: _mm_hadd_ps */

/* Dots x against y[0..3]; lane i of the result holds x . y[i]. */
static inline __m128 dot_p(const __m128 x, const __m128 y[4])
{
    __m128 z[4];
    z[0] = _mm_mul_ps(x, y[0]);      /* element-wise products (the bare '*' in the original relies on GCC/Clang vector extensions) */
    z[1] = _mm_mul_ps(x, y[1]);
    z[2] = _mm_mul_ps(x, y[2]);
    z[3] = _mm_mul_ps(x, y[3]);
    z[0] = _mm_hadd_ps(z[0], z[1]);  /* partial horizontal sums of products 0 and 1 */
    z[2] = _mm_hadd_ps(z[2], z[3]);  /* partial horizontal sums of products 2 and 3 */
    z[0] = _mm_hadd_ps(z[0], z[2]);  /* final reduction: one dot product per lane */
    return z[0];
}
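One way to plug that into the question's code might look like this (a sketch only, using the question's variable names and gathering the A vectors into an array):

__m128 A[4] = { A_0m, A_1m, A_2m, A_3m };
__m128 dots = dot_p(B_0m, A);                /* lane i = A_im . B_0m */
_mm_storeu_ps(C_2, _mm_add_ps(dots, C_0n));  /* accumulate into the array as before */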
You could try leaving the dot product result in the low word and use the scalar store op _mm_store_ss to save that one float from each __m128 register into the appropriate location of the array. Nehalem's store buffer should accumulate consecutive writes on the same line and flush them to L1 in batches.
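A rough sketch of that idea, assuming all four _mm_dp_ps calls use mask 0xF1 so every result lands in the low lane, with an extra scalar load/add per slot to keep the question's accumulate-into-C_2 behaviour (C_2 treated as a float*):

#include <smmintrin.h> /* SSE4.1: _mm_dp_ps */

__m128 d0 = _mm_dp_ps(A_0m, B_0m, 0xF1);
__m128 d1 = _mm_dp_ps(A_1m, B_0m, 0xF1);
__m128 d2 = _mm_dp_ps(A_2m, B_0m, 0xF1);
__m128 d3 = _mm_dp_ps(A_3m, B_0m, 0xF1);

/* Accumulate each scalar into its slot; the four consecutive stores are
   candidates for write combining in the store buffer. */
_mm_store_ss(&C_2[0], _mm_add_ss(d0, _mm_load_ss(&C_2[0])));
_mm_store_ss(&C_2[1], _mm_add_ss(d1, _mm_load_ss(&C_2[1])));
_mm_store_ss(&C_2[2], _mm_add_ss(d2, _mm_load_ss(&C_2[2])));
_mm_store_ss(&C_2[3], _mm_add_ss(d3, _mm_load_ss(&C_2[3])));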
The pro way to do it is celion's transpose approach. The _MM_TRANSPOSE4_PS macro (from xmmintrin.h, available in MSVC and GCC) will do the transpose for you.
I realize this question is old, but why use _mm_add_ps at all to combine the partial results? Since the other lanes are zero, ORing the lanes together gives the same bits as adding them. Replace those adds with:
tmp0 = _mm_or_ps(tmp0, tmp1);
tmp2 = _mm_or_ps(tmp2, tmp3);
tmp0 = _mm_or_ps(tmp0, tmp2);
You can probably hide some of the _mm_dp_ps latency this way. The first _mm_or_ps doesn't have to wait for the final 2 dot products either, and OR is a (fast) bit-wise operation. Finally:
_mm_storeu_ps(C_2, _mm_add_ps(tmp0, C_0n));