
Can I optimize this piece of code?

Is it possible to use any loop optimization technique here to reduce the execution time? I need the nested loops over i and j, since I need those (i, j) combinations.

EDIT: even if I leave out the "actual" code and keep only this trivial assignment, it takes ~5s on my dual-core box, whereas with the actual code it takes ~6s. I experimented with replacing fn_value+=0 with j+=0, and that takes ~1.73s. What could this be due to?

# include <stdio.h>
# include <time.h>

int main(int argc, char **argv)
{
        float fn_value=0.0;
        int n=10,i,j;
        unsigned int k;
        clock_t start, end;

        start = clock();
        for(k=0;k<9765625;k++)
        {

                for(i=0;i<n;i++)
                {
                        for(j=i;j<n;j++)
// substitute for an "actual" piece of code
                                fn_value+=0; 
                }
        }
        end= clock();

        printf("Time taken %lf", (double) (end-start) / CLOCKS_PER_SEC);
        return 0;
}


You may want to check out OpenMP, as you can run "doStuff" on different threads for different indices of i, j, and k, provided "doStuff" is thread-safe.


You could do loop unrolling. Actually, you can just pass an argument to your compiler to unroll all those loops (the actual argument depends on your compiler).

I don't know what your "actual code" is, so I can't give you more specific advice. One thing you will want to do is optimize your cache access if you are doing something non-trivial.

Also, are you compiling with optimization? (e.g. -O3 in gcc)

Per your edit:

The reason "j+=0" is faster than "fn_value += 0" is that integer arithmetic is MUCH faster than floating point operations.

This is why we need the actual code to give you informed optimizations.


Well, it could probably be parallelized pretty well.


Loop unrolling by hand does not always do better than the compiler can. As has been said elsewhere: profile, and find where the time is going.

I would first focus on the "actual" piece of code. Are there any smarts you can use to "block up" the calculations there? Can you reuse the previous answer to cheaply calculate the next, etc.?


Since your two inner loops have only up to 10 iterations each, it could improve your speed a little if you combined them into a single loop over the 55 (i, j) pairs (j starts at i, so there are n(n+1)/2 = 55 combinations, not n²).


The loops themselves are probably irrelevant, it all depends on how much work you do in the innermost loop.

You should do some profiling, it will tell you how much time is spent where, and suggest where optimizations could be done.


It really depends on what "substitute for an actual piece of code" does and how the values i, j and k are used by that code. If i, j and k are all used, then there may not be too much you can actually do (apart from multi-threading, though if used in a mathematical equation, you might be able to use some clever algebra to reduce complexity/repetition of calculations). On the other hand, if none of the values are used, then you could just make it one loop that will execute the specified number of times (though results may vary between compilers/optimization levels).

Basically, you can't optimize the loops themselves if they are the minimum of what you need. Also, this kind of micro-optimization usually leads to bugs and unmaintainable code (even in the games industry, where speed is critical, we always optimize last, and then only the biggest bottlenecks), and you will usually find it's the algorithms, not the code itself, that can be optimized (or replaced with faster algorithms that produce a similar result). The example you have given contains no actual algorithm apart from:

fn_value = 0;
k = 9765625;
n = 10;
i = 10;
j = 10;

And as such, the code above is what you could replace your entire loop with, and it would be as optimal as possible (assuming those values are used somewhere else; otherwise you can just eliminate them entirely).


I heard once, long ago, that looping down to zero can be quicker on some CPUs...

So:-

for(i=0;i<n;i++)
{
    for(j=i;j<n;j++)
// substitute for an "actual" piece of code
        fn_value+=0; 
}

Becomes (I think, I always get the arithmetic wrong ;) ):-

for(i=n;i--;)
{
    for(j=n;j-->i;)
// substitute for an "actual" piece of code
        fn_value+=0; 
}

Of course, then your loops are backwards...

I'd love to hear if that makes a difference! My gut instinct is that you're optimizing the wrong thing.

Aha, a link:- http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR


You are using floating point code! Compilers are rubbish with floating point code.

Here are some measurements I've made. I'm using DevStudio 2005 with default optimisations, and I changed the code slightly:

// added to the inner part of the loop
fn_value += j; 

// added a dependency on fn_value so that the compiler doesn't optimise the 
// whole code down to nothing
printf("Time taken %lf - %f", (double) (end-start) / CLOCKS_PER_SEC, fn_value);

So, I get this running in about 5s.

Now, I changed the code a bit:

# include <stdio.h>
# include <time.h>

int main(int argc, char **argv)
{
  int fn_value=0;
  int n=10,i,j;
  unsigned int k;
  clock_t start, end;

  start = clock();
  for(k=0;k<9765625;k++)
  {
    for(i=0;i<n;i++)
    {
      for(j=i;j<n;j++)
        fn_value+=j; 
    }
  }
  end= clock();

  printf("Time taken %lf - %d", (double) (end-start) / CLOCKS_PER_SEC, fn_value);
  return 0;
}

I changed fn_value to an int. It now takes about a second! So there's a four-second difference between adding ints and adding floats. I then wrote a version with IA32 FPU opcodes instead of the C code and got about 1.4 seconds, which isn't much slower than using ints.

Then I used the C floating point version but made fn_value a double, and the time became 1.25s. Now, that surprised me. It beat the FPU opcode version, but, looking at the disassembly, the only difference is that the pure C version unrolled the inner loop.

Also, when using floats, the result is incorrect.

Here's my final test code:

# include <stdio.h>
# include <time.h>

void p1 ()
{
  double fn_value=0;//if this is a float, the answer is slightly wrong
  int n=10,i,j;
  unsigned int k;
  clock_t start, end;

  start = clock();
  __asm fldz;
  for(k=0;k<9765625;k++)
  {
    for(i=0;i<n;i++)
    {
      for(j=i;j<n;j++)
        __asm {
          fiadd j
      }
    }
  }
  __asm fstp fn_value;
  end= clock();

  printf("p1: Time taken %lf - %lf\n", (double) (end-start) / CLOCKS_PER_SEC, (double) fn_value);
}

void p2 ()
{
  double fn_value=0;
  int n=10,i,j;
  unsigned int k;
  clock_t start, end;

  start = clock();
  for(k=0;k<9765625;k++)
  {
    for(i=0;i<n;i++)
    {
      for(j=i;j<n;j++)
        fn_value+=j; 
    }
  }
  end= clock();

  printf("p2: Time taken %lf - %lf\n", (double) (end-start) / CLOCKS_PER_SEC, (double) fn_value);
}

void p3 ()
{
  float fn_value=0;
  int n=10,i,j;
  unsigned int k;
  clock_t start, end;

  start = clock();
  for(k=0;k<9765625;k++)
  {
    for(i=0;i<n;i++)
    {
      for(j=i;j<n;j++)
        fn_value+=j; 
    }
  }
  end= clock();

  printf("p3: Time taken %lf - %lf\n", (double) (end-start) / CLOCKS_PER_SEC, (double) fn_value);
}

int main(int argc, char **argv)
{
  p1 ();
  p2 ();
  p3 ();
  return 0;
}

In summary, double appears to be faster than float. However, we need to see the contents of that inner loop to see if converting the floating point type will provide any speed up in your specific case.

UPDATE

The reason the float version is slower than the others is that the float version is constantly writing and reading the value to/from memory; the double and hand-written versions never write the value to RAM. Why does it do this? The main reason I can think of is to decrease the precision of fn_value between operations. Internally, the FPU is 80-bit whereas a float is 32-bit (in this implementation of C). To keep the values within the range of a float, the compiler converts from 80-bit to 32-bit by writing and reading the value to/from RAM, because, as far as I know, there is no FPU instruction that does this to a single FPU register. So, in order to keep the maths '32-bit' (of type float), it introduces a huge overhead. The compiler ignores the difference between the 80-bit FPU and the 64-bit double type and assumes the programmer wants the bigger type as much as possible.

