Static branch prediction on ARM9 with RVCT 4.0
I'm writing some logging C code for an ARM9 processor. This code will record some data if a dynamic module is present. The module will usually not be present in a production build, but the logging code will always be compiled in. The idea is that if a customer encounters a bug, we can load this module, and the logging code will dump debugging information.
The logging code must have minimal impact when the module is not present, so every cycle counts. In general, the logging code looks something like this:
__inline void log_some_stuff(Provider *pProvider, other args go here...)
{
if (NULL == pProvider)
return;
... logging code goes here ...
}
With optimization on, RVCT 4.0 generates code that looks like this:
ldr r4,[r0,#0x2C] ; pProvider,[r0,#44]
cmp r4,#0x0 ; pProvider,#0
beq 0x23BB4BE (usually taken)
... logging code goes here...
... regular code starts at 0x23BB4BE
This processor has no branch predictor, and my understanding is that there is a 2-cycle penalty whenever a branch is taken (no penalty if the branch is not taken).
I would like the common case, where NULL == pProvider, to be the fast case, where the branch is not taken. How can I make RVCT 4.0 generate code like this?
I've tried using __builtin_expect as follows:
if (__builtin_expect(NULL == pProvider, 1))
return;
Unfortunately, this has no impact on the generated code. Am I using __builtin_expect incorrectly? Is there another method (hopefully without inline assembly)?
So if there's no branch predictor and a taken branch costs two cycles, why not just restructure the code so that the common case falls through? (Actually, you'd think your example above would already result in the "correct" code, but we can try.)
__inline void log_some_stuff(Provider *pProvider, other args go here...)
{
if (pProvider) {
... logging code goes here ...
}
}
that "could" compile to:
ldr r4,[r0,#0x2C] ; pProvider,[r0,#44]
cmp r4,#0x0 ; pProvider,#0
bne logging_code (usually NOT taken)
... regular code here
logging_code: .. well anywhere
That's if you're lucky. Even if it does so now, any change to the compiler may change the output, and I have no idea whether the compiler you're using would even produce this assembly. So you'd probably want to write it in inline assembly anyway. It's not much code, and gcc (as well as VC; I assume others do too) makes that quite easy. The easiest approach is to define a separate function containing your logging code and call that (I don't know the ARM ABI, so you'll have to write that yourself).
If you use the following construct:
void log_some_stuff_implementation(Provider *pProvider, int x, int y, char const* str);
__inline void log_some_stuff(Provider *pProvider, int x, int y, char const* str)
{
if (__builtin_expect( pProvider != NULL, 0)) {
log_some_stuff_implementation(pProvider, x, y, str);
}
return;
}
GCC 4.5.2 with -O2 generates the following code (at least for my simple test) for a call to log_some_stuff():
// r0 already has the Provider* in it - r2 has a value that indicates whether
// r0 was loaded with a valid pointer or not
cmp r2, #0
ldrne r3, [r1, #0]
addne r1, r2, #1
ldrneb r2, [r3, #0] @ zero_extendqisi2
blne log_some_stuff_implementation
So in the common case (where the Provider* is NULL), the four conditional instructions after the cmp pass through the pipeline without being executed because their condition fails, but the ARM's pipeline doesn't get flushed. I think this is probably about as good as you'll get for the common case where you don't actually want the logging code to run.
I think the key is that the code that actually does the logging work is done non-inline in a separate function so the compiler can reasonably have the setup and call for that function be an inlined sequence of a few conditionally executed instructions. Since the actual logging code doesn't need to be optimized, there's no reason for it to be inline. It's not supposed to be the common case, and presumably it's code that'll be doing some real work. Therefore the overhead of a function call should be acceptable (at least that's my assumption).
By the way, for my simple test the same code sequence (or essentially the same sequence) is generated even if the __builtin_expect() is left out; however, I imagine that in more complex sequences than my simple test, the builtin might help the compiler out. So I'd probably leave it in, but I'd also probably use more readable versions like the Linux kernel's macros:
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
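To show how the pieces above fit together, here is a minimal sketch of the inline NULL check plus out-of-line implementation using those macros. The Provider fields, the single int argument, and the fallback branch for compilers without __builtin_expect are my own assumptions for illustration, not something taken from RVCT's documentation, so check what your toolchain actually accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: compilers that define __GNUC__ provide __builtin_expect.
   Anywhere else the macros degrade to the plain expression. */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hypothetical stand-in for the real Provider type. */
typedef struct Provider {
    int calls;   /* how many times the logging code ran */
    int last_x;  /* last value that was logged */
} Provider;

/* Out-of-line slow path: only reached when a provider is loaded. */
void log_some_stuff_implementation(Provider *pProvider, int x)
{
    assert(pProvider != NULL);
    pProvider->calls++;
    pProvider->last_x = x;
}

/* Inline fast path: nothing but the NULL test ends up in the caller. */
static inline void log_some_stuff(Provider *pProvider, int x)
{
    if (unlikely(pProvider != NULL))
        log_some_stuff_implementation(pProvider, x);
}
```

The !!(x) normalizes any non-zero condition to 1 so it compares cleanly against the expected value. Whether the hint changes the generated code at all depends on the compiler and the surrounding code.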
Your branch optimization is going to gain you very little. You could gain a lot more if you did the following:
#define log_some_stuff(pProvider, other_arg) \
do {\
if ((pProvider) != NULL) \
real_log_some_stuff((pProvider), (other_arg)); \
} \
while(0)
What this does is inline the NULL check into all of the calling code. That may seem like a loss, but what really happens is that the compiler can avoid the overhead of a function call (pushing the registers, the branch itself, and having r0-r3 and lr invalidated) at the cost of a simple NULL check that you would have had to do anyway. Overall, I'd bet this would gain far more than the single cycle you would have saved by exiting one instruction early.
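As a rough sketch of why the do { ... } while(0) wrapper matters, here is the macro used inside an unbraced if/else. The Provider fields, the int argument, and the process() caller are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real Provider type. */
typedef struct Provider {
    int calls;     /* how many times the logging code ran */
    int last_arg;  /* last argument that was logged */
} Provider;

/* Out-of-line function that does the actual logging work. */
void real_log_some_stuff(Provider *pProvider, int other_arg)
{
    assert(pProvider != NULL);
    pProvider->calls++;
    pProvider->last_arg = other_arg;
}

/* The NULL check is expanded at every call site; arguments are
   parenthesized so that expressions can be passed safely. */
#define log_some_stuff(pProvider, other_arg) \
    do { \
        if ((pProvider) != NULL) \
            real_log_some_stuff((pProvider), (other_arg)); \
    } while (0)

/* Hypothetical caller: logs negative values, doubles the rest. */
int process(Provider *pProvider, int value)
{
    if (value < 0)
        log_some_stuff(pProvider, value);  /* one statement, so the else binds to the outer if */
    else
        value *= 2;
    return value;
}
```

Without the do/while wrapper, the else in process() would attach to the macro's inner if rather than to the value < 0 test, which silently changes the behavior.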
You can use goto:
__inline void log_some_stuff(Provider *pProvider, other args go here...)
{
if (pProvider != NULL)
goto LOGGING;
return;
LOGGING:
... logging code goes here ...
}
Using __builtin_expect is easier, but I am not sure RVCT has it.