How good is NVCC at code optimizations?
How well does NVCC optimize device code? Does it perform optimizations such as constant folding and common subexpression elimination?
For example, will it reduce the following:
float a = 1 / sqrtf(2 * M_PI);
float b = c / sqrtf(2 * M_PI);
to this:
float sqrt_2pi = sqrtf(2 * M_PI); // Compile time constant
float a = 1 / sqrt_2pi;
float b = c / sqrt_2pi;
What about more clever optimizations that rely on knowing the semantics of math functions, e.g. rewriting this:
float a = 1 / sqrtf(c * M_PI);
float b = c / sqrtf(M_PI);
to this:
float sqrt_pi = sqrtf(M_PI); // Compile time constant
float a = 1 / (sqrt_pi * sqrtf(c));
float b = c / sqrt_pi;
The compiler is way ahead of you. In your example:
float a = 1 / sqrtf(2 * M_PI);
float b = c / sqrtf(2 * M_PI);
nvopencc (Open64) will emit this:
mov.f32 %f2, 0f40206c99; // 2.50663
div.full.f32 %f3, %f1, %f2;
mov.f32 %f4, 0f3ecc422a; // 0.398942
which is equivalent to
float b = c / 2.50663f;
float a = 0.398942f;
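In case the 0f... operands look opaque: PTX writes a single-precision immediate as 0f followed by the eight hex digits of its IEEE-754 bit pattern. A minimal sketch for decoding them on the host, assuming float is a 32-bit IEEE-754 type (the helper name ptx_hex_to_float is made up for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret the 8 hex digits that follow PTX's
   "0f" prefix as an IEEE-754 single-precision value.
   Assumes float is 32 bits wide, which holds on common platforms. */
float ptx_hex_to_float(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value); /* well-defined way to copy the raw bits */
    return value;
}
```

Calling ptx_hex_to_float(0x40206c99u) gives roughly 2.50663, i.e. sqrtf(2 * M_PI), and ptx_hex_to_float(0x3ecc422au) gives roughly 0.398942, i.e. 1 / sqrtf(2 * M_PI), matching the comments in the PTX above.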
The second case gets compiled to this:
float a = 1 / sqrtf(c * 3.14159f); // 0f40490fdb
float b = c / 1.77245f; // 0f3fe2dfc5
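So the compiler folds sqrtf(M_PI) into a constant but leaves sqrtf(c * M_PI) as a single square root instead of splitting it. I am guessing the expression for a generated by the compiler is more accurate than your "optimized" version, at about the same speed.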