
Member versus global array access performance

Consider the following situation:

class MyFoo {
public:
  MyFoo();
  ~MyFoo();
  void doSomething(void);
private:
  unsigned short things[10]; 
};

class MyBar {
public:
  MyBar(unsigned short* globalThings);
  ~MyBar();
   void doSomething(void);
private:
  unsigned short* things;
};

MyFoo::MyFoo() {
  int i;
  for (i=0;i<10;i++) this->things[i] = i;
}

MyBar::MyBar(unsigned short* globalThings) {
  this->things = globalThings;
}

void MyFoo::doSomething() {
  int i, j;
  j = 0;
  for (i = 0; i<10; i++) j += this->things[i];
}

void MyBar::doSomething() {
  int i, j;
  j = 0;
  for (i = 0; i<10; i++) j += this->things[i];
}


int main(int argc, char* argv[]) {
  unsigned short gt[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

  MyFoo* mf = new MyFoo();
  MyBar* mb = new MyBar(gt);

  mf->doSomething();
  mb->doSomething();
  delete mb;
  delete mf;
  return 0;
}

Is there an a priori reason to believe that mf->doSomething() will run faster than mb->doSomething()? Does that change if the executable is 100 MB?


Because anything can modify your gt array, some optimizations that are possible for MyFoo may be unavailable to MyBar (though, in this particular example, I don't see any).

Since gt lives on the stack of main, while mf's things array lives on the heap (along with the rest of mf and mb), there may be some memory access and caching differences when dealing with things. But if you'd created mf locally (MyFoo mf;), that difference would disappear (i.e. things and gt would be on an equal footing in that regard).

The size of the executable shouldn't make any difference. The size of the data might, but for the most part, after the first access both arrays will be in the CPU cache and there should be no difference.


There's little reason to believe one will be noticeably faster than the other. If gt (for example) was large enough for it to matter, you might get slightly better performance from:

int j = std::accumulate(gt, gt+10, 0);

With only 10 elements, however, a measurable difference seems quite unlikely.


MyFoo::doSomething can be expected to be marginally faster than MyBar::doSomething. This is because when things is stored inline in the object as an array, we only need to dereference this to reach the array and can access it immediately. When things is stored externally, we first need to dereference this and then dereference things before we can access the array, so we have two load instructions instead of one.

I have compiled your source into assembler (using -O0) and the loop for MyFoo::doSomething looks like:

    jmp .L14
.L15:
    movl    -4(%ebp), %edx 
    movl    8(%ebp), %eax //Load this into %eax
    movzwl  (%eax,%edx,2), %eax //Load this->things[i] into %eax
    movzwl  %ax, %eax
    addl    %eax, -8(%ebp)
    addl    $1, -4(%ebp)
.L14:
    cmpl    $9, -4(%ebp)
    setle   %al
    testb   %al, %al
    jne .L15

Now for MyBar::doSomething we have:

    jmp .L18
.L19:
    movl    8(%ebp), %eax //Load this
    movl    (%eax), %eax //Load this->things
    movl    -4(%ebp), %edx
    addl    %edx, %edx
    addl    %edx, %eax
    movzwl  (%eax), %eax //Load this->things[i]
    movzwl  %ax, %eax
    addl    %eax, -8(%ebp)
    addl    $1, -4(%ebp)
.L18:
    cmpl    $9, -4(%ebp)
    setle   %al
    testb   %al, %al
    jne .L19

As can be seen from the above, there is the double load. The problem may be compounded if this and this->things are far apart in memory: they will then live in different cache lines, and the CPU may have to do two fetches from main memory before this->things can be accessed. When the array is part of the same object, fetching this brings this->things into the cache along with it.

Caveat: the optimizer may be able to provide some shortcuts that I have not thought of, though.


Most likely the extra dereference (of MyBar, which has to fetch the value of the member pointer) is meaningless performance-wise, especially if the data array is very large.


It could be somewhat slower. The question is simply how often you access it. Keep in mind that your machine has a fixed-size cache. When doSomething is called on a MyFoo, the processor can just load the whole array into cache and read it. In MyBar, however, the processor first must load the pointer, then load the memory it points to. Of course, in your example main they're all probably in the same cache line or close enough anyway, and for a larger array the number of loads won't increase substantially because of that one extra dereference.

However, in general, this effect is far from ignorable. When you consider dereferencing a pointer, that cost is pretty much zero compared to actually loading the memory it points to. If the pointer points to some already-loaded memory, then the difference is negligible. If it doesn't, you have a cache miss, which is very bad and expensive. In addition, the pointer introduces issues of aliasing, which basically means that your compiler can perform much less optimistic optimizations on it.

Allocate within-object whenever possible.
