Is a logical right shift by a power of 2 faster in AVR?

Asked 2023-01-16 22:52

I would like to know if performing a logical right shift is faster when shifting by a power of 2.

For example, is

myUnsigned >> 4

any faster than

myUnsigned >> 3

I appreciate that everyone's first response will be to tell me that one shouldn't worry about tiny little things like this, and that what matters is using the correct algorithms and collections to cut orders of magnitude. I fully agree with you, but I am really trying to squeeze all I can out of an embedded chip (an ATMega328). I just got a performance shift worthy of a 'woohoo!' by replacing a divide with a bit-shift, so I promise you that this does matter.

Let's look at the datasheet:

http://atmel.com/dyn/resources/prod_documents/8271S.pdf

As far as I can see, ASR (arithmetic shift right) always shifts by one bit and cannot take a shift count; it takes one cycle to execute. The same holds for LSR (logical shift right). Therefore, shifting right by n bits this way will take n cycles. Powers of two behave just the same as any other number.

In the AVR instruction set, shifts right and left happen one bit at a time. So, for this particular microcontroller, shifting >> n means the compiler actually emits n individual single-bit shift instructions (asr or lsr), and I guess >> 3 is one cycle faster than >> 4.
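
To make the one-bit-at-a-time behaviour concrete, here is a minimal C sketch that models it on a host machine. The helper names `lsr1` and `shift_right_stepwise` and the step counter are made up for illustration; they only mimic what a sequence of single-bit shift instructions does and are not taken from any AVR library.

```c
#include <stdint.h>

/* Models one single-bit logical shift right on an 8-bit register. */
uint8_t lsr1(uint8_t x)
{
    return (uint8_t)(x >> 1);
}

/* Shifts x right by n using only single-bit steps.
   *steps receives the number of single-bit shifts performed
   (a hypothetical stand-in for the instruction count). */
uint8_t shift_right_stepwise(uint8_t x, uint8_t n, unsigned *steps)
{
    *steps = 0;
    for (uint8_t i = 0; i < n; ++i) {
        x = lsr1(x);
        ++*steps;
    }
    return x;
}
```

Under this model a shift by 4 always costs one more step than a shift by 3, whether or not the distance is a power of 2.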

This makes the AVR fairly unusual, by the way.

You have to consult the documentation of your processor for this information. Even for a given instruction set, there may be different costs depending on the model. On a really small processor, shifting by one could conceivably be faster than by other values, for instance (it is the case for rotation instructions on some IA32 processors, but that's only because this instruction is so rarely produced by compilers).

According to http://atmel.com/dyn/resources/prod_documents/8271S.pdf all logical shifts are done in one cycle for the ATMega328. But of course, as pointed out in the comments, all logical shifts are by one bit. So the cost of a shift by n is n cycles in n instructions.

Indeed, like most (if not all) other 8-bit MCUs, the ATMega doesn't have a barrel shifter. It can therefore only shift by 1 bit at a time, rather than by an arbitrary amount like more powerful CPUs. As a result, shifting by 4 is theoretically slower than shifting by 3.

However, the ATMega does have a swap-nibble instruction, so in fact x >> 4 is faster than x >> 3.

Assuming x is a uint8_t, x >>= 3 is implemented by 3 right shifts

x >>= 1;
x >>= 1;
x >>= 1;

whereas x >>= 4 only needs a swap and a bit clear

swap(x);    // swap the top and bottom nibbles AB <-> BA
x &= 0x0f;  // clear the upper nibble

or

x &= 0xf0;  // clear the lower nibble first
swap(x);    // then swap the nibbles
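
The swap-and-mask trick can be checked on any host compiler. The following is a small sketch in plain C; `swap_nibbles` is a hypothetical helper that imitates the AVR swap instruction, and the two `shr4_*` functions correspond to the two orderings of swap and mask shown above.

```c
#include <stdint.h>

/* Exchanges the high and low nibbles of a byte, imitating the AVR swap instruction. */
uint8_t swap_nibbles(uint8_t x)
{
    return (uint8_t)((x << 4) | (x >> 4));
}

/* x >> 4 as a swap followed by clearing the upper nibble. */
uint8_t shr4_swap_then_mask(uint8_t x)
{
    return swap_nibbles(x) & 0x0F;
}

/* Same result with the mask applied first: clear the lower nibble, then swap. */
uint8_t shr4_mask_then_swap(uint8_t x)
{
    return swap_nibbles(x & 0xF0);
}
```

Both variants agree with `x >> 4` for every 8-bit input, and each corresponds to two instructions instead of four single-bit shifts.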

For wider shifts that span more than one register, there are also various ways to optimize

With a uint16_t variable y consisting of the low byte y0 and the high byte y1, y >> 8 is simply

y0 = y1;
y1 = 0;

Similarly y >> 9 can be optimized to

y0 = y1 >> 1;
y1 = 0;
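
As a rough host-side check of the byte-move idea, the sketch below models the 16-bit value as a pair of bytes. The `u16_pair` type and all function names are invented for this illustration; on the real chip, y0 and y1 would simply be two 8-bit registers.

```c
#include <stdint.h>

/* Hypothetical model of a 16-bit value held in two 8-bit registers. */
typedef struct {
    uint8_t lo;  /* y0 */
    uint8_t hi;  /* y1 */
} u16_pair;

u16_pair u16_split(uint16_t y)
{
    u16_pair p = { (uint8_t)(y & 0xFF), (uint8_t)(y >> 8) };
    return p;
}

uint16_t u16_join(u16_pair p)
{
    return (uint16_t)(((uint16_t)p.hi << 8) | p.lo);
}

/* y >> 8: copy the high byte down, clear the high byte. */
void shr8_pair(u16_pair *y)
{
    y->lo = y->hi;
    y->hi = 0;
}

/* y >> 9: copy the high byte down with one extra single-bit shift. */
void shr9_pair(u16_pair *y)
{
    y->lo = (uint8_t)(y->hi >> 1);
    y->hi = 0;
}
```

Neither function needs more than one single-bit shift, which is why these long shifts can end up cheaper than a short shift by 3.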

and hence is even faster than a shift by 3 on a char.

In conclusion, the shift time varies with the shift distance, but it is not necessarily slower for longer or non-power-of-2 distances. Generally it will take at most 3 instructions to shift within an 8-bit char.

Here are some demos from Compiler Explorer.

A right shift by 4 is achieved by a swap and an andi (AND with immediate), as above
```
  swap r24
  andi r24,lo8(15)
```
A right shift by 3 has to be done with 3 instructions
```
  lsr r24
  lsr r24
  lsr r24
```

Left shifts are also optimized in the same manner
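
For the left-shift case, the mirror image of the trick would presumably be a swap followed by clearing the lower nibble. A minimal sketch, with the hypothetical name `shl4_via_swap` and the swap modelled in portable C:

```c
#include <stdint.h>

/* x << 4 on an 8-bit value: swap the nibbles, then clear the lower nibble. */
uint8_t shl4_via_swap(uint8_t x)
{
    uint8_t swapped = (uint8_t)((x << 4) | (x >> 4));  /* models the swap instruction */
    return swapped & 0xF0;                             /* keep only the new upper nibble */
}
```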

See also "Which is faster: x<<1 or x<<10?"

It depends on how the processor is built. If the processor has a barrel-rotate it can shift any number of bits in one operation, but that takes chip space and power budget. The most economical hardware would just be able to rotate right by one, with options regarding the wrap-around bit. Next would be one that could rotate by one either left or right. I can imagine a structure that would have a 1-shifter, 2-shifter, 4-shifter, etc. in which case 4 might be faster than 3.
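
The staged structure imagined above can be modelled in a few lines of C. This is only a software sketch of a hypothetical shifter with fixed 1-, 2- and 4-bit stages applied one after another; the function name and the stage counter are assumptions for illustration, not a description of any real chip.

```c
#include <stdint.h>

/* Models a staged shifter for an 8-bit value with fixed 1-, 2- and 4-bit stages.
   Each stage is applied or bypassed according to one bit of the shift amount n.
   *stages_used receives how many stages were active (hypothetical cost measure). */
uint8_t staged_shift_right(uint8_t x, uint8_t n, unsigned *stages_used)
{
    *stages_used = 0;
    if (n >= 8) {
        return 0;  /* every bit is shifted out */
    }
    for (uint8_t stage = 1; stage <= 4; stage <<= 1) {
        if (n & stage) {
            x = (uint8_t)(x >> stage);
            ++*stages_used;
        }
    }
    return x;
}
```

In this model a shift by 4 activates one stage while a shift by 3 activates two (the 1-bit and 2-bit stages), which is the sense in which 4 might be faster than 3 on such hardware.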

Disassemble first, then time the code. Don't be discouraged by people telling you that you are wasting your time. The knowledge you gain will put you in a position to be the go-to person for putting out the big company fires. The number of people with real behind-the-curtain knowledge is dropping at an alarming rate in this industry.

It sounds like others have explained the real answer here, which disassembly would have shown: a single-bit shift instruction. So 4 shifts will take 133% of the time that 3 shifts took, or 3 shifts take 75% of the time of 4 shifts, depending on how you compare the numbers. Your measurements should reflect that difference; if they don't, I would continue with this experiment until you completely understand the execution times.

If your target processor has a bit-shift instruction (which is very likely), then whether there is any difference between shifting by a power of 2 and shifting by some other number depends on the hardware implementation of that instruction. However, it is unlikely to make a difference.

With all respect, you should not even start talking about performance until you start measuring. Compile your program with division. Run it. Measure the time. Repeat with the shift.

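"replacing a divide with a bit-shift"

Be aware that the two are not the same for negative numbers. Division truncates toward zero, while an arithmetic right shift rounds toward negative infinity:

char div2 (void)
{
    return (-1) / 2;
    // compiles to: ldi r24,0   (result is 0)
}

char asr1 (void)
{
    return (-1) >> 1;
    // compiles to: ldi r24,-1  (result is -1)
}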
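
If a shift has to reproduce the result of a signed division by a power of two, one common fix is to add a bias to negative inputs before shifting. The sketch below is an illustration under two assumptions: the function names are invented, and `>>` on a negative value is taken to be an arithmetic shift (implementation-defined in C, but what GCC does).

```c
#include <stdint.h>

/* Reference: division by 2^k as the C '/' operator does it (rounds toward zero).
   k is expected to be in the range 0..6 for an int8_t. */
int8_t div_pow2_truncating(int8_t x, uint8_t k)
{
    return (int8_t)(x / (1 << k));
}

/* Shift-based equivalent: negative inputs get a bias of (2^k - 1) first,
   so that the arithmetic shift also rounds toward zero.
   Assumes >> on a negative int is an arithmetic shift (true for GCC). */
int8_t div_pow2_by_shift(int8_t x, uint8_t k)
{
    int bias = (x < 0) ? ((1 << k) - 1) : 0;
    return (int8_t)((x + bias) >> k);
}
```

With the bias in place, -1 divided by 2 gives 0 in both versions, at the cost of a compare and an add on top of the shift.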
