
Can the STREAM and GUPS (single-CPU) benchmarks use non-local memory on a NUMA machine?

I want to run two tests from HPCC: STREAM and GUPS.

They measure memory bandwidth, latency, and throughput (in terms of random accesses).

Can I run Single CPU STREAM or Single CPU GUPS on a NUMA node with memory interleaving enabled? (Is this allowed by the rules of HPCC, the High Performance Computing Challenge?)

Using non-local memory could improve GUPS results, because it increases the number of memory banks available for random accesses two- or four-fold. (GUPS is typically limited by a non-ideal memory subsystem and by slow memory bank opening/closing. With more banks, it can update one bank while the other banks are opening or closing.)

Thanks.

UPDATE:

(you may not reorder the memory accesses that the program makes).

But can the compiler reorder the loop nesting? For example, in hpcc/RandomAccess.c:

  /* Perform updates to main table.  The scalar equivalent is:
   *
   *     u64Int ran;
   *     ran = 1;
   *     for (i=0; i<NUPDATE; i++) {
   *       ran = (ran << 1) ^ (((s64Int) ran < 0) ? POLY : 0);
   *       table[ran & (TableSize-1)] ^= stable[ran >> (64-LSTSIZE)];
   *     }
   */
  for (j=0; j<128; j++)
    ran[j] = starts ((NUPDATE/128) * j);
  for (i=0; i<NUPDATE/128; i++) {
/* #pragma ivdep */
    for (j=0; j<128; j++) {
      ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
      Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
    }
  }

The outer loop here is for (i=0; i<NUPDATE/128; i++) { and the inner loop is for (j=0; j<128; j++) {. Using the 'loop interchange' optimization, a compiler could convert this code to:

  for (j=0; j<128; j++) {
    for (i=0; i<NUPDATE/128; i++) {
      ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
      Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
    }
  }

This is possible because the loop nest is a perfect loop nest. Is such an optimization prohibited by the rules of HPCC?
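To make the question concrete, here is a minimal sketch (not the HPCC source) that runs both loop orders on a small table and records the first few table indices touched. The value of POLY, LSTSIZE = 9, the fixed seeds that stand in for starts(), the stable[] contents, and the function names run_original and run_interchanged are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY     0x0000000000000007ULL /* assumed value; the snippet only names POLY */
#define LSTSIZE  9                     /* assumed log2 of the stable[] size */
#define NSTREAMS 128                   /* the 128 independent streams in the snippet */

/* One step of the shift-and-xor generator used in the snippet. */
static uint64_t next_ran(uint64_t r) {
    return (r << 1) ^ (((int64_t)r < 0) ? POLY : 0);
}

/* Hypothetical setup: fixed nonzero seeds stand in for HPCC's starts(). */
static void init(uint64_t *table, size_t table_size,
                 uint64_t *stable, uint64_t *ran) {
    /* The index mask below only works for power-of-two table sizes. */
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    for (size_t i = 0; i < table_size; i++) table[i] = i;
    for (size_t k = 0; k < ((size_t)1 << LSTSIZE); k++)
        stable[k] = 0xD6E8FEB86659FD93ULL * (k + 1);
    for (size_t j = 0; j < NSTREAMS; j++)
        ran[j] = 0x9E3779B97F4A7C15ULL * (j + 1);
}

/* Apply one update and log the table index while trace[] has room. */
static void update(uint64_t *table, size_t table_size, const uint64_t *stable,
                   uint64_t *r, size_t *trace, size_t trace_len, size_t *n) {
    *r = next_ran(*r);
    size_t idx = (size_t)(*r & (table_size - 1));
    if (*n < trace_len) trace[(*n)++] = idx;
    table[idx] ^= stable[*r >> (64 - LSTSIZE)];
}

/* Original nest: i outer, j inner (round-robin over the 128 streams). */
void run_original(uint64_t *table, size_t table_size, size_t nupdate,
                  size_t *trace, size_t trace_len) {
    uint64_t stable[1 << LSTSIZE], ran[NSTREAMS];
    size_t n = 0;
    init(table, table_size, stable, ran);
    for (size_t i = 0; i < nupdate / NSTREAMS; i++)
        for (size_t j = 0; j < NSTREAMS; j++)
            update(table, table_size, stable, &ran[j], trace, trace_len, &n);
}

/* Interchanged nest: j outer, i inner (one stream runs to completion). */
void run_interchanged(uint64_t *table, size_t table_size, size_t nupdate,
                      size_t *trace, size_t trace_len) {
    uint64_t stable[1 << LSTSIZE], ran[NSTREAMS];
    size_t n = 0;
    init(table, table_size, stable, ran);
    for (size_t j = 0; j < NSTREAMS; j++)
        for (size_t i = 0; i < nupdate / NSTREAMS; i++)
            update(table, table_size, stable, &ran[j], trace, trace_len, &n);
}
```

Calling both functions with the same table size and update count should leave identical tables, since each stream produces the same values in either order and XOR updates commute. The recorded traces differ, though: the interchanged version issues the same updates in a different order in memory, which is exactly the property the rule about reordering memory accesses is concerned with.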


As far as I can tell, it is allowed, given that memory interleaving is a system setting rather than a code modification (you may not reorder the memory accesses that the program makes).

Whether GUPS actually gets better performance with non-local memory on a NUMA machine seems doubtful to me. Will the latency induced by bank conflicts really be greater than the latency of off-node memory accesses?
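To put a rough number on the off-node share, here is a small sketch of a simplified model in which interleaving places page p of the table on node p % nnodes. The page-granular round-robin mapping, the 8-byte table word, and the function names node_of_offset and remote_fraction are assumptions of this model, not something taken from HPCC or from a particular operating system.

```c
#include <stddef.h>
#include <stdint.h>

/* Model assumption: page p of the table lives on node p % nnodes. */
unsigned node_of_offset(uint64_t byte_offset, uint64_t page_size,
                        unsigned nnodes) {
    return (unsigned)((byte_offset / page_size) % nnodes);
}

/* Fraction of the table's 8-byte words that live on a node other than
 * local_node. GUPS indices are roughly uniform over the table, so this
 * approximates the fraction of updates that go off-node. */
double remote_fraction(uint64_t table_words, uint64_t page_size,
                       unsigned nnodes, unsigned local_node) {
    uint64_t remote = 0;
    for (uint64_t w = 0; w < table_words; w++)
        if (node_of_offset(w * sizeof(uint64_t), page_size, nnodes) != local_node)
            remote++;
    return (double)remote / (double)table_words;
}
```

Under this model, a table interleaved over four nodes has three quarters of its words off-node, so most random updates would pay the interconnect latency; any gain from the extra banks has to outweigh that.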

STREAM should not be limited by bank conflicts, but it will probably benefit from off-node accesses if the CPU has an on-chip memory controller (like the Opterons), since the memory traffic is then shared between the local memory controller and the NUMA interconnect.
