开发者

AND faster than integer modulo operation?

It is possible to re-express:

  • i % m

as:

  • i & (m-1)

where,

  • i is an unsigned integer
  • m is a power of 2

My question is: is the AND operation any faster? Don't modern CPUs support integer modulo 开发者_如何学编程in hardware in a single instruction? I'm interested in ARM, but don't see the modulo operation in its instruction set.


It's more complicated than "single instruction" these days. Modern CPUs are complex beasts and need their instructions broken down into issue/execute/latency. It also usually depends on the width of the divide/modulo - how many bits are involved.

In any case, I'm not aware of 32 bit division being single cycle latency on any core, ARM or not. On "modern" ARM there are integer divide instructions, but only on some implementations, and most notably not on the most common ones - Cortex A8 and A9.

In some cases, the compiler can save you the trouble of converting a divide/modulo into bit shift/mask operations. However, this is only possible if the value is known at compile time. In your case, if the compiler can see for sure that 'm' is always a power a two, then it'll optimize it to bit ops, but if it's a variable passed into a function (or otherwise computed), then it can't, and will resort to a full divide/modulo. This kind of code construction often works (but not always - depends how smart your optimizer is):

unsigned page_size_bits = 12;     // optimization works even without const here

unsigned foo(unsigned address) {
  unsigned page_size = 1U << page_size_bits;
  return address / page_size;
}

The trick is to let the compiler know that the "page_size" is a power of two. I know that gcc and variants will special-case this, but I'm not sure about other compilers.

As a rule of thumb for any core - ARM or not (even x86), prefer bit shift/mask to divide/modulo, especially for anything that isn't a compile-time constant. Even if your core has hardware divide, it'll be faster to do it manually.

(Also, signed division has to truncate towards 0, and div / remainder have be able to produce negative numbers, so even x % 4 is more expensive than x & 3 for signed int x.)


You may be interested in Embedded Live: Embedded Programmers' Guide to ARM’s Cortex-M Architecture.

The ARM Cortex-M family has unsigned and singed division instructions, UDIV and SDIV, which take 2 to 12 cycles. There is no MOD instruction, but equivalent result is obtained by a {S,U}DIV followed by the multiply-and-subtract instruction MLS, which takes 2 cycles, for a total of 4-14 cycles.

The AND instruction is single cycle, therefore 4-14x faster.


ARM is very generic. There are lot of different ARMs and there are ARMs which do NOT have a division instruction (as Ray Toal already mentioned, modulo is usually implemented as additional result of the division implementation). So if you dont want to call a very slow division subroutine, the logical operation is much faster (and as cyco130 mentioned, any good compiler would recognize it on its own and generate the logical operation on its own - so for clearness of the program code I would stay with the division (except you program assembler, then you have of course to program it yourself, and then you should take the logical operation).


If m is known at compile time (or even it it isn't) integer division and modulo can be re-expressed using multiplication by a magic "multiplicative inverse." The result of the division ends up in the high 32 bits and the remainder (modulus) in the lower 32 bits:

http://www.hackersdelight.org/magic.htm

The following link claims that it is a standard compiler strength reduction:

http://www.flounder.com/multiplicative_inverse.htm


If you are using a decent C compiler with optimizations enabled, it will already optimize this to whatever is faster, a technique called "strength reduction". If you're doing hand-written assembly, only sure way to test is to benchmark it. But beware, even different models of the same processor could give different results.


According to http://www.coranac.com/tonc/text/asm.htm, the ARM has no division instruction. If that's true, then I wouldn't expect it to have a MOD instruction, either.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜