Unexpected result from AVX _mm256_unpack*_ps unpack intrinsic

I'm attempting to use the AVX unpack intrinsics _mm256_unpacklo_ps and _mm256_unpackhi_ps to interleave 16 float values. The results I'm getting are strange, either because I don't understand how unpacking is supposed to work in AVX or because something isn't working as it should.

What I'm seeing is that when I attempt to, for example, unpack the low order floats from two vectors, v1 and v2, into a third, v3, I see the following:

if v1 is [a b c d e f g h] and v2 is [i j k l m n o p]

then v3 = _mm256_unpacklo_ps(v1, v2) results in [a i b j e m f n]

when I expected that v3 would be [a i b j c k d l]

Am I incorrect in my expectations or am I using this incorrectly? Or is something else malfunctioning?

Some test code is:

#include <immintrin.h>
#include <iostream>

int main()
{

  float output[16], input1[8], input2[8];
  __m256 vec1, vec2, vec3, vec4;

  // Note: _mm256_set_ps takes its arguments from the highest element down,
  // so vec1 is stored in memory as 8 7 6 5 4 3 2 1.
  vec1 = _mm256_set_ps(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
  vec2 = _mm256_set_ps(9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f, 16.0f);

  // The arrays are not guaranteed to be 32-byte aligned, so use unaligned stores.
  _mm256_storeu_ps(input1, vec1);
  _mm256_storeu_ps(input2, vec2);

  vec3 = _mm256_unpacklo_ps(vec1, vec2);
  vec4 = _mm256_unpackhi_ps(vec1, vec2);

  _mm256_storeu_ps(output, vec3);
  _mm256_storeu_ps(output + 8, vec4);

  std::cout << "interleaving:" << std::endl;
  for (unsigned i = 0; i < 8; ++i)
    std::cout << input1[i] << " ";
  std::cout << std::endl;

  std::cout << "with:" << std::endl;
  for (unsigned i = 0; i < 8; ++i)
    std::cout << input2[i] << " ";
  std::cout << std::endl;

  std::cout << "= " << std::endl;
  for (unsigned i = 0; i < 16; ++i)
    std::cout << output[i] << " ";
  std::cout << std::endl;
}

I'm using gcc 4.5.2 to compile.

Thanks in advance for any help! - Justin


You are getting the correct result. See the Intel® Advanced Vector Extensions Programming Reference, pages 320-333.

Almost no AVX instructions cross the 128-bit lane boundary; most of them behave like the corresponding SSE instruction applied separately to the low and the high 128 bits. Very unfortunate.
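
To make the per-lane behaviour concrete, here is a small scalar model of the two unpack operations. It is only a sketch for reasoning about element positions, not how the hardware does it, and the names Vec8, unpacklo_model and unpackhi_model are made up for this illustration.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper type: 8 floats in memory order (index 0 is the lowest element).
using Vec8 = std::array<float, 8>;

// Scalar model of _mm256_unpacklo_ps: each 128-bit lane (4 floats) is handled
// on its own, interleaving the two LOW floats of that lane from a and b.
Vec8 unpacklo_model(const Vec8& a, const Vec8& b)
{
  Vec8 r{};
  for (std::size_t lane = 0; lane < 2; ++lane) {
    const std::size_t base = lane * 4;  // first element of this lane
    r[base + 0] = a[base + 0];
    r[base + 1] = b[base + 0];
    r[base + 2] = a[base + 1];
    r[base + 3] = b[base + 1];
  }
  return r;
}

// Scalar model of _mm256_unpackhi_ps: same idea, but using the two HIGH
// floats of each lane.
Vec8 unpackhi_model(const Vec8& a, const Vec8& b)
{
  Vec8 r{};
  for (std::size_t lane = 0; lane < 2; ++lane) {
    const std::size_t base = lane * 4;
    r[base + 0] = a[base + 2];
    r[base + 1] = b[base + 2];
    r[base + 2] = a[base + 3];
    r[base + 3] = b[base + 3];
  }
  return r;
}
```

With v1 = [a b c d e f g h] and v2 = [i j k l m n o p], the low model gives [a i b j e m f n] and the high model gives [c k d l g o h p], which matches what the question reports.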


It is behaving as expected.

To get [a i b j c k d l], you'll need to use:

A = _mm256_unpacklo_ps(v1, v2)

B = _mm256_unpackhi_ps(v1, v2)

and then use

C = _mm256_permute2f128_ps(A, B, 0x20)

which combines the low 128 bits of A with the low 128 bits of B.
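
Put together, a full interleave of 8 + 8 floats could look roughly like the sketch below. The function name interleave8 and its pointer-based interface are assumptions for illustration only. The second permute with 0x31 picks the high 128 bits of both intermediates and yields the remaining half, [e m f n g o h p]. The target attribute is a GCC/Clang feature that needs a reasonably recent compiler; with older ones, drop it and compile with -mavx instead.

```cpp
#include <immintrin.h>

// GCC/Clang only: lets this one function use AVX without passing -mavx.
#if defined(__GNUC__)
__attribute__((target("avx")))
#endif
void interleave8(const float* a, const float* b, float* out)
{
  // Unaligned loads, so the caller's arrays need no special alignment.
  __m256 v1 = _mm256_loadu_ps(a);  // [a0 a1 a2 a3 | a4 a5 a6 a7]
  __m256 v2 = _mm256_loadu_ps(b);  // [b0 b1 b2 b3 | b4 b5 b6 b7]

  // Both unpacks work within each 128-bit lane.
  __m256 lo = _mm256_unpacklo_ps(v1, v2);  // [a0 b0 a1 b1 | a4 b4 a5 b5]
  __m256 hi = _mm256_unpackhi_ps(v1, v2);  // [a2 b2 a3 b3 | a6 b6 a7 b7]

  // 0x20: low lane of lo, then low lane of hi  -> [a0 b0 a1 b1 | a2 b2 a3 b3]
  __m256 first = _mm256_permute2f128_ps(lo, hi, 0x20);
  // 0x31: high lane of lo, then high lane of hi -> [a4 b4 a5 b5 | a6 b6 a7 b7]
  __m256 second = _mm256_permute2f128_ps(lo, hi, 0x31);

  // out must have room for 16 floats.
  _mm256_storeu_ps(out, first);
  _mm256_storeu_ps(out + 8, second);
}
```

After the call, out[2*i] holds a[i] and out[2*i + 1] holds b[i] for i from 0 to 7.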
