implement SIMD in C++

2022-12-29 03:27 问答作者：

I'm working on a bit of code and I'm trying to optimize it as much as possible, basically get it running under a certain time limit.

The following makes the call...

static affinity_partitioner ap;
parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap);

... and the following is what is executed.

void operator()(const blocked_range<size_t> &r) const {

    int temp;
    int i;
    int j;
    size_t k;
    size_t begin = r.begin();
    size_t end = r.end();

    for(k = begin; k != end; ++k) { // for each trainee
        temp = 0;
        for(i = 0; i < N; ++i) { // for each sample
            int trr = trRating[k][i];
            int ei = E[i];              
            for(j = 0; j < ei; ++j) { // for each expert
                temp += delta(i, trr, exRating[j][i]);
            }
        }           
        myscore[k] = temp;
    }
}

I'm using Intel's TBB to optimize this. But I've also been reading about SIMD and SSE2 and things along that nature. So my question is, how do I store the variables (i,j,k) in registers so that they can be accessed faster by the CPU? I think the answer has to do with implementing SSE2 or some variation of it, but I have no idea how to do that. Any ideas?

Edit: This will be run on a Linux box, but using Intel's compiler I believe. If it helps, I have to run the following commands bef开发者_开发技巧ore I do anything to make sure the compiler works... source /opt/intel/Compiler/11.1/064/bin/intel64/iccvars_intel64.csh; source /opt/intel/tbb/2.2/bin/intel64/tbbvars.csh ... and then to compile I do: icc -ltbb test.cxx -o test

If there's no easy way to implement SSE2, any advice on how to further optimize the code?

Thanks, Hristo

Your question represents some confusion on what is going on. The i,j,k variables are almost certainly held in registers already, assuming you are compiling with optimizations on (which you should do - add "-O2" to your icc invocation).

You can use an asm block, but an easier method considering you're already using ICC is to use the SSE intrinsics. Intel's documentation for them is here - http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1019.htm

It looks like you can SIMD-ize the top-level loop, though it's going to depend greatly on what your delta function is.

When you want to use assembly language within a C++ module, you can just put it inside an asm block, and continue to use your variable names from outside the block. The assembly instructions you use within the asm block will specify which register etc. is being operated on, but they will vary by platform.

If you're using GCC, see http://gcc.gnu.org/projects/tree-ssa/vectorization.html for how to help the compiler auto-vectorize your code, and examples.

Otherwise, you need to let us know what platform you are using.

The compiler should be doing this for you. For example, in VC++ you can simply turn on SSE2.

继续阅读：simd sse2

implement SIMD in C++

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？