Is there a way to use expressions evaluated at compile-time with inline asm in gcc?
I have some code that basically needs to use a small expression in an assembly statement, where the expression is fairly trivial like i*4, but GCC doesn't seem to realize that at compile time (tried no -O flag, and -O3). Neither "i" nor "n" constraints work in the following snippet for the third usage.
#include <stdint.h>
#include <stdlib.h>
#define SHIFT(h, l, c) __asm__ volatile ( \
"shld %2, %1, %0\n\t" \
"sal %2, %1\n\t" \
: "+r"(h), "+r"(l) : "i"(c))
void main(void) {
uint64_t a, b;
SHIFT(a, b, 1); /* 1 */
SHIFT(a, b, 2*4); /* 2 */
size_t i;
for(i=0; i<24; i++) {
SHIFT(a, b, (i*4)); /* 3 */
}
}
Giving this error:
temp.c:15: warning: asm operand 2 probably doesn’t match constraints
temp.c:15: error: impossible constraint in ‘asm’
I also tried
"shld $" #c ", %1...
but that has its own issue, because the parens remain when stringified. It's my intention that the entire loop becomes unrolled, but -funroll-all-loops doesn't seem to be happening early enough in the process to cause i*4 to become a constant. Any ideas? The alternative is quite ugly, but if there was a way to automate this in a macro that'd be better than nothing:
SHIFT(a, b, 1);
SHIFT(a, b, 2);
...
SHIFT(a, b, 24);
Is there any specific reason to mark the asm block as volatile? It's nearly impossible that any optimization is going to be carried out while volatile is present.
Not sure why you're shifting left by 23*4=92, but...
There might be. You can use __builtin_constant_p() and __builtin_choose_expr() to pick the expression to compile; something like
__builtin_choose_expr(__builtin_constant_p(c), SHIFT(h, l, c), slower_code_here);
If it picks slower_code_here, then it "couldn't" determine that c was constant. If it complains about an "impossible constraint", then it knows it's constant but doesn't manage to turn it into an immediate for some reason.
It's sometimes surprising what it thinks is and isn't constant; I was playing around the other day and it complained about something like __builtin_choose_expr(sizeof(char[__builtin_choose_expr(..., 1, -1)]),...).
(I'm assuming the %2,%1,%0 order is intentional; I would've expected %0,%1,%2 but the documentation is vague and I can never remember which asm syntax is being used.)
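To see which arm gets picked for a given argument, here is a minimal sketch of the probing idea. It uses a plain conditional instead of __builtin_choose_expr, and both arms call the same portable C model of the shld/sal pair, so it only reports GCC's decision; in real use the constant arm would hold the asm statement with the "i" constraint. The names shift_model and SHIFT_DISPATCH are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable model of the shld/sal pair: shifts the 128-bit
   pair h:l left by c, valid for c in 0..63. The two-step right shift
   keeps c == 0 well defined. */
static inline void shift_model(uint64_t *h, uint64_t *l, unsigned c)
{
    assert(c < 64);
    *h = (*h << c) | ((*l >> 1) >> (63 - c));
    *l <<= c;
}

/* Hypothetical dispatcher: evaluates to 1 if GCC treated c as a
   compile-time constant, 0 if it fell through to the slow arm.
   __builtin_constant_p does not evaluate its argument, so c is
   evaluated once, in whichever arm runs. */
#define SHIFT_DISPATCH(h, l, c)                     \
    (__builtin_constant_p(c)                        \
         ? (shift_model(&(h), &(l), (c)), 1)        \
         : (shift_model(&(h), &(l), (c)), 0))
```

A literal expression such as 2 * 4 takes the constant arm at any optimization level, while a volatile variable never does. A plain loop counter lands on one side or the other depending on the optimization flags.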
You are assuming that the compiler will unroll your loop and substitute the value of i * 4 each time... that is a bit much to assume. The * 4 looks like you want an addressing modification of some sort; why not pass in i and write the instruction to do your * 4? Take a careful look at the constraints GCC handles, and make sure that your instructions really take all the combinations your constraints might throw at it.
Your "ugly" way can be achieved using the Boost Preprocessor library (actually a set of cpp macros, and the only part of Boost that can be used with plain C):
#include <stdint.h>
#include <boost/preprocessor/repetition/repeat_from_to.hpp>
#define SHIFT_a(z, CNT, b) __asm__ volatile ( \
"shld %2, %1, %0\n\t" \
"sal %2, %1\n\t" \
: "+r"(a), "+r"(b) \
: "i"(CNT * 4) \
: "cc");
void main(void) {
uint64_t a, b;
// whatever ...
BOOST_PP_REPEAT_FROM_TO(1, 25, SHIFT_a, b)
}
The "ugly" bit that remains is that the macros BOOST_PP_REPEAT* can "iterate" over are limited to one user-provided argument, so you've got to "embed" either a or b in this example into the actual macro name. Maybe that can be worked around by another indirection level (to transform SHIFT(a) into SHIFT_a?). Not tried.
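If pulling in Boost is not an option, the same unrolling can be sketched with a few hand-written repetition macros, which also lets both variables be passed as ordinary arguments. Everything named here (SHIFT_STEP, SHIFT_BY_4I, REPEAT_4, shift_unrolled) is hypothetical, and SHIFT_STEP is a portable C stand-in for the asm macro so the example builds anywhere; the counts it receives are integer constant expressions, which is what an "i" constraint needs.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the asm SHIFT macro: shifts the 128-bit pair h:l
   left by c. The _Static_assert rejects any c that is not an integer
   constant expression in 0..63. */
#define SHIFT_STEP(h, l, c) do {                                          \
    _Static_assert((c) >= 0 && (c) < 64, "count must be constant 0..63"); \
    (h) = ((h) << (c)) | (((l) >> 1) >> (63 - (c)));                      \
    (l) <<= (c);                                                          \
} while (0)

/* Step i shifts by i * 4 bits. */
#define SHIFT_BY_4I(h, l, i) SHIFT_STEP(h, l, (i) * 4)

/* Expands M(h, l, i) for four consecutive constant indices. */
#define REPEAT_4(M, h, l, i) \
    M(h, l, (i)); M(h, l, (i) + 1); M(h, l, (i) + 2); M(h, l, (i) + 3)

/* Shifts by 0, 4, 8 and 12 bits in turn (24 bits in total). */
static inline void shift_unrolled(uint64_t *h, uint64_t *l)
{
    REPEAT_4(SHIFT_BY_4I, *h, *l, 0);
}
```

Longer runs can be built by nesting, for example a REPEAT_16 that invokes REPEAT_4 four times with offsets 0, 4, 8 and 12.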
I doubt you are still interested in feedback for this question, but since you never accepted any of the other answers...
There are a few issues with the OP code, but with a bit of cleanup, you get:
#include <stdint.h>
#include <stdlib.h>
#define SHIFT(h, l, c) __asm__ volatile ( \
"shld %b2, %1, %0\n\t" \
"sal %b2, %1\n\t" \
: "+r"(h), "+r"(l) : "Jc"(c))
int main(void) {
uint64_t a, b;
a = b = 0;
SHIFT(a, b, 1); /* 1 */
SHIFT(a, b, 2*4); /* 2 */
size_t i;
for(i=0; i<16; i++) {
SHIFT(a, b, (i*4)); /* 3 */
}
}
The most significant changes are:
- Using "Jc" for the constraint for (c). This allows gcc to use an immediate if possible, but falls back to rcx if necessary (ie the value doesn't fit in a "J" or the value is not known at compile time).
- Using %b2 instead of just %2. This gives us cl instead of rcx which is what these instructions require.
- Changing the loop size to 16. sal and shld only allow shifts of 0-63 on x64 (0-31 on x86).
Compiled with -O2 -m64 -funroll-all-loops -S, we see:
/APP
# 12 "shl.cpp" 1
shld $1, %rdx, %rax
sal $1, %rdx
# 0 "" 2
# 13 "shl.cpp" 1
shld $8, %rdx, %rax
sal $8, %rdx
# 0 "" 2
# 16 "shl.cpp" 1
shld $0, %rdx, %rax
sal $0, %rdx
# 0 "" 2
# 16 "shl.cpp" 1
shld $4, %rdx, %rax
sal $4, %rdx
...
# 0 "" 2
# 16 "shl.cpp" 1
shld $60, %rdx, %rax
sal $60, %rdx
# 0 "" 2
/NO_APP
What's interesting is if you use i*6 instead of i*4, you see that gcc uses immediates until 60, then starts using cl.
Tada!
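For anyone who wants to check the "Jc" version against plain C, here is a rough sketch that wraps the same asm in a function and compares it with a portable reference. The function names are invented for this example, the explicit "cc" clobber is my addition, and the asm path is only enabled for GCC on x86-64 (other compilers and targets fall back to the reference), so treat it as an illustration and not a drop-in replacement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference: 128-bit left shift of h:l by c, for c in 0..63.
   The two-step right shift keeps c == 0 well defined. */
static inline void shift_ref(uint64_t *h, uint64_t *l, unsigned c)
{
    assert(c < 64);
    *h = (*h << c) | ((*l >> 1) >> (63 - c));
    *l <<= c;
}

/* Same operation via the asm from the answer. "J" accepts an immediate
   in 0..63; "c" falls back to the count register, which %b2 prints as cl. */
static inline void shift_asm(uint64_t *h, uint64_t *l, unsigned c)
{
    assert(c < 64);
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
    __asm__ volatile(
        "shld %b2, %1, %0\n\t"
        "sal %b2, %1"
        : "+r"(*h), "+r"(*l)
        : "Jc"(c)
        : "cc");
#else
    shift_ref(h, l, c);   /* assumption: no x86-64 GCC asm available here */
#endif
}
```

Called with a literal count and optimization enabled, GCC can inline the function and emit an immediate; with a run-time count it should load cl instead, and the result is the same either way.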