Is memset() more efficient than a for loop in C?
Is memset() more efficient than a for loop? Consider this code:
char x[500];
memset(x,0,sizeof(x));
And this one:
char x[500];
for(int i = 0 ; i < 500 ; i ++) x[i] = 0;
Which one is more efficient, and why? Is there any special hardware instruction to do block-level initialization?
Most certainly, memset will be much faster than that loop. Note that your loop handles one character at a time, while library memset implementations are so heavily optimized that they set several bytes per store, even using MMX and SSE instructions when available.
I think the paradigmatic example of these optimizations, which usually go unnoticed, is the GNU C library's strlen function. One would expect it to scan one byte at a time, but it actually processes 4 or 8 bytes per iteration depending on the architecture (yes, the big-O complexity is the same, but in practice you can get roughly a quarter or an eighth of the time). How? Tricky, but nicely: strlen.
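The core idea is word-at-a-time scanning: read an aligned machine word and use a bit trick to test whether any byte in it is zero. A simplified sketch of that idea (not the actual glibc code, which is more involved) might look like this:

#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time strlen sketch (illustration only, not glibc's implementation).
 * Reads 8 aligned bytes per iteration and uses a bit trick to detect a zero
 * byte anywhere in the word. Reading aligned words never crosses a page
 * boundary past the terminator, so the small over-read is harmless in practice. */
size_t wide_strlen(const char *s)
{
    const char *p = s;

    /* Advance byte-by-byte until p is 8-byte aligned. */
    while ((uintptr_t)p % sizeof(uint64_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    const uint64_t *w = (const uint64_t *)p;
    for (;;) {
        uint64_t v = *w;
        /* Nonzero iff any byte in v is zero. */
        if ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL)
            break;
        w++;
    }

    /* A zero byte is somewhere in *w; locate it byte-by-byte. */
    p = (const char *)w;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}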
Well, why don't we take a look at the generated assembly code, with full optimization, under VS 2010:
char x[500];
char y[500];
int i;
memset(x, 0, sizeof(x) );
003A1014 push 1F4h
003A1019 lea eax,[ebp-1F8h]
003A101F push 0
003A1021 push eax
003A1022 call memset (3A1844h)
And your loop...
char x[500];
char y[500];
int i;
for( i = 0; i < 500; ++i )
{
x[i] = 0;
00E81014 push 1F4h
00E81019 lea eax,[ebp-1F8h]
00E8101F push 0
00E81021 push eax
00E81022 call memset (0E81844h)
/* note that this is *replacing* the loop,
not being called once for each iteration. */
}
So, under this compiler, the generated code is exactly the same. memset is fast, and the compiler is smart enough to know that you are doing the same thing as calling memset once anyway, so it does it for you.
If the compiler actually left the loop as-is, it would likely be slower, since you can set more than a one-byte-sized block at a time (i.e., you could at least unroll the loop a bit). You can assume that memset will be at least as fast as a naive implementation such as the loop. Try it under a debug build and you will notice that the loop is not replaced.
That said, it depends on what the compiler does for you. Looking at the disassembly is always a good way to know exactly what is going on.
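If you want to reproduce this check yourself, a small test file like the following works. The file name and the compile commands in the comments (gcc -O2 -S, or cl /O2 /FA for MSVC) are just the usual ways to get an assembly listing and may differ in your setup:

/* Minimal test case for inspecting the generated code yourself.
 * Typical ways to get an assembly listing (assumed toolchains):
 *   GCC/Clang:  gcc -O2 -S zero.c      (writes zero.s)
 *   MSVC:       cl /O2 /FA zero.c      (writes zero.asm)
 * In an optimized build, the memset call and the loop below
 * usually compile to the same code. */
#include <string.h>

void zero_with_memset(char *x)
{
    memset(x, 0, 500);
}

void zero_with_loop(char *x)
{
    for (int i = 0; i < 500; ++i)
        x[i] = 0;
}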
It really depends on the compiler and library. For older compilers or simple compilers, memset may be implemented in a library and would not perform better than a custom loop.
For nearly all compilers that are worth using, memset is an intrinsic function and the compiler will generate optimized, inline code for it.
Others have suggested profiling and comparing, but I wouldn't bother. Just use memset: the code is simple and easy to understand. Don't worry about it until your benchmarks tell you this part of the code is a performance hotspot.
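If you do decide to measure, a rough and deliberately naive timing sketch could look like the following; note that clock() is coarse and an optimizing compiler may still transform either version, so treat the numbers with suspicion:

/* Rough timing sketch (not a rigorous benchmark: clock() is coarse and an
 * optimizing compiler may transform or elide either version).
 * The volatile sink forces the buffer to actually be used. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define REPS 1000000

static char buf[500];
static volatile char sink;

int main(void)
{
    clock_t t0 = clock();
    for (int r = 0; r < REPS; ++r) {
        memset(buf, 0, sizeof(buf));
        sink = buf[r % sizeof(buf)];
    }
    clock_t t1 = clock();
    for (int r = 0; r < REPS; ++r) {
        for (size_t i = 0; i < sizeof(buf); ++i)
            buf[i] = 0;
        sink = buf[r % sizeof(buf)];
    }
    clock_t t2 = clock();

    printf("memset: %.3fs  loop: %.3fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}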
The answer is 'it depends'. memset MAY be more efficient, or it may internally use a for loop. I can't think of a case where memset will be less efficient. In this case, it may turn into a more efficient loop: your loop iterates 500 times, setting one byte's worth of the array to 0 each time. On a 64-bit machine, you could loop through setting 8 bytes (a long long) at a time, which would be almost 8 times quicker, and then just deal with the remaining 4 bytes (500 % 8) at the end, as sketched below.
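A sketch of that idea, using a hypothetical wide_fill helper (not a real library function): fill 8 bytes per iteration with a 64-bit store, then finish the leftover bytes one at a time. A real memset also handles alignment and unaligned heads and tails; this is only an illustration.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustration of the 8-bytes-at-a-time idea described above. */
void wide_fill(char *dst, unsigned char value, size_t count)
{
    uint64_t pattern = 0x0101010101010101ULL * value; /* value repeated in every byte */
    size_t i = 0;

    for (; i + 8 <= count; i += 8)
        memcpy(dst + i, &pattern, 8);   /* one 8-byte store; memcpy sidesteps alignment issues */

    for (; i < count; ++i)              /* leftover tail, e.g. 500 % 8 = 4 bytes */
        dst[i] = (char)value;
}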
EDIT:
in fact, this is what memset
does in glibc:
http://repo.or.cz/w/glibc.git/blob/HEAD:/string/memset.c
As Michael pointed out, in certain cases (where the array length is known at compile time), the C compiler can inline memset
, getting rid of the overhead of the function call. Glibc also has assembly optimized versions of memset
for most major platforms, like amd64:
http://repo.or.cz/w/glibc.git/blob/HEAD:/sysdeps/x86_64/memset.S
Good compilers will recognize the for loop and replace it with either an optimal inline sequence or a call to memset. They will also replace memset with an optimal inline sequence when the buffer size is small.
In practice, with an optimizing compiler the generated code (and therefore performance) will be identical.
Agreed with the above: it depends. But memset is certainly at least as fast as the for loop. If you are uncertain about your environment or don't want to test it, take the safe route and go with memset.
Other techniques, such as loop unrolling, which reduce the number of loop iterations, can also be used. A memset() implementation can mimic the famous Duff's device:
void *duff_memset(char *to, int c, size_t count)
{
    char *p = to;
    size_t n = (count + 7) / 8;        /* number of passes through the unrolled body */

    if (count == 0)                    /* the switch/do-while below assumes count > 0 */
        return to;

    switch (count % 8) {               /* jump into the middle of the unrolled loop */
    case 0: do { *p++ = (char)c;
    case 7:      *p++ = (char)c;
    case 6:      *p++ = (char)c;
    case 5:      *p++ = (char)c;
    case 4:      *p++ = (char)c;
    case 3:      *p++ = (char)c;
    case 2:      *p++ = (char)c;
    case 1:      *p++ = (char)c;
            } while (--n > 0);
    }
    return to;
}
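For example, the sketch above can be called just like a plain memset (assuming duff_memset from above is in scope):

/* Hypothetical usage of the duff_memset sketch above. */
int main(void)
{
    char buf[500];
    duff_memset(buf, 0, sizeof(buf));
    return 0;
}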
Tricks like this used to enhance execution speed in the past, but on modern architectures they tend to increase code size and cache misses.
So it is practically impossible to say which implementation is faster, as it depends on the quality of the compiler's optimizations, the ability of the C library to take advantage of special hardware instructions, the amount of data you are operating on, and the behavior of the underlying operating system (page-fault handling, TLB misses, copy-on-write).
For example, in glibc, the implementations of memset() and various other "copy/set" functions like bzero() or strcpy() are architecture-dependent in order to take advantage of optimized hardware instructions such as SSE or AVX.
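As a rough illustration of what such an architecture-specific path looks like, here is a hypothetical SSE2 fill loop using compiler intrinsics. This is not glibc's actual code, which additionally handles alignment, larger unrolling, AVX paths, and non-temporal stores for big buffers.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Illustration only (assumes an x86 target with SSE2): store 16 bytes per
 * iteration with an unaligned SIMD store, then finish the tail byte-by-byte. */
void sse2_fill(char *dst, unsigned char value, size_t count)
{
    __m128i v = _mm_set1_epi8((char)value);  /* 16 copies of value */
    size_t i = 0;

    for (; i + 16 <= count; i += 16)
        _mm_storeu_si128((__m128i *)(dst + i), v);

    for (; i < count; ++i)
        dst[i] = (char)value;
}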