Approximating the square root of sum of two squares on a microcontroller
I'm working on implementing an FFT algorithm in assembly on an 8-bit microcontroller (HCS08) for fun. Once the algorithm is completed, I'll have an array of 8-bit real/imaginary pairs, and I'll want to find the magnitude of each of these values. That is, if x is complex, I want to find
|x| = sqrt(Re{x}^2 + Im{x}^2)
Now I have available to me a 16-bit register and an 8-bit register. I thought about just squaring them, adding them, and taking the square root of the result, but that poses a problem: the maximum possible value of the sum of the squares of two 8-bit numbers is ~130k, which is greater than the maximum value a 16-bit register can hold (65.5k).
I came up with a subroutine which computes the integer square root of a 16-bit number, which seems to work well, but obviously I'm not guaranteed to be working with values that will fit into 16 bits. My thinking right now is that there is an algorithm which will approximate what I need directly, but I can't seem to find anything. Any ideas would be greatly appreciated.
To summarize: Say I have a vector with two 8-bit components, and I want to find the length of the vector. How can I approximate this without actually computing squares and square roots?
Thanks!
There's a web page describing a Fast Magnitude Estimator. The basic idea is to get a least square (or other high quality) fit to the equation:
Mag ~= Alpha * max(|I|, |Q|) + Beta * min(|I|, |Q|)
for the coefficients Alpha and Beta. Several coefficient pairs are listed with mean square errors, max errors, etc., including coefficients suitable for integer ALUs.
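As a sketch of how such an estimator looks in code, here is the alpha-max-plus-beta-min form using Alpha = 15/16 and Beta = 15/32 (one of the shift-friendly integer pairs such tables typically list; treat the exact pair as an assumption and pick coefficients from the table that match your error budget):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Alpha*max + Beta*min with Alpha = 15/16, Beta = 15/32: needs only
// shifts, adds and subtracts, so it maps well onto an 8-bit ALU.
uint8_t mag_estimate(int8_t re, int8_t im) {
    uint8_t a = std::abs(re), b = std::abs(im);
    uint8_t mx = std::max(a, b), mn = std::min(a, b);
    // 15/16 * max == max - max/16,  15/32 * min == min/2 - min/32
    return (mx - (mx >> 4)) + ((mn >> 1) - (mn >> 5));
}
```

The result is at most (128 - 8) + (64 - 4) = 180, so it always fits in a byte.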
If the sum is greater than 65535, divide it by 4 (shift right 2 bits), take the square root, and multiply that by 2. You'll lose one bit of accuracy, and naturally the result is not guaranteed to fit into 8 bits.
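Concretely, with a stand-in `isqrt16()` for the poster's 16-bit integer square-root subroutine (stubbed here with floating point purely for illustration), the scaling trick looks like:

```cpp
#include <cmath>
#include <cstdint>

// Stand-in for the 16-bit integer square-root subroutine described
// in the question; stubbed with floating point for illustration.
static uint8_t isqrt16(uint16_t v) {
    return (uint8_t)std::sqrt((double)v);
}

// When the sum of squares exceeds 16 bits, shift it right by 2
// and double the root afterwards, losing one bit of accuracy.
uint16_t magnitude(uint8_t re, uint8_t im) {
    uint32_t sum = (uint32_t)re * re + (uint32_t)im * im; // up to 130050
    if (sum > 65535)
        return (uint16_t)(isqrt16((uint16_t)(sum >> 2))) << 1;
    return isqrt16((uint16_t)sum);
}
```

Note the return type: the full-scale magnitude can reach 360, so it no longer fits in 8 bits.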
A possible alternative is to compute sqrt((x*x + y*y)/2) instead, which scales all the possible vector magnitudes into the range 0..255.
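The halved sum always fits the root into a byte, since (255*255 + 255*255)/2 = 65025 = 255*255. A floating-point reference for this scaled magnitude (useful for checking any fixed-point version against) might look like:

```cpp
#include <cmath>
#include <cstdint>

// Reference for floor(sqrt((x*x + y*y)/2)): the result is at most
// sqrt(65025) == 255, so it always fits in a uint8_t.
uint8_t scaled_mag_ref(uint8_t x, uint8_t y) {
    uint32_t half_sum = ((uint32_t)x * x + (uint32_t)y * y) / 2;
    return (uint8_t)std::sqrt((double)half_sum);
}
```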
Two (fast) algorithms seem to give near perfect results, one with Cordic, another with maximum of dot products.
void cordic_it(uint16_t &x, uint16_t &y, int n) {
    uint16_t X = x + (y >> n);          // vsraq_n_u16(x, y, n) in Arm Neon
    y = std::abs(int(y) - int(x >> n)); // vabdq_u16(y, x >> n) in Arm Neon
    x = X;
}
uint16_t scaled_magnitude_cordic(uint8_t x, uint8_t y) {
const int kRound = 1;
if (x < y) std::swap(x,y);
// multiply by factor of 256/sqrt(2) == 181.02
// then reduce by the gain of the cordic iterations of 1.16
// - with prescaling we also ensure that the cordic iterations
// do not lose too many significant bits when shifting right
uint16_t X = x * 156, Y = y * 156;
// exactly 4 iterations. 3 is too little, 5 causes too much noise
for (int j = 1; j <= 4; j++) cordic_it(X,Y,j);
return (X+kRound) >> 8;
}
By altering kRound, one can tune up the results:
Histogram of (real - approx):     -1      0      1
kRound ==   0 -> smaller code      1  46617  18918
kRound ==   1 -> approx >= real    0  46378  19158
kRound == -73 -> balanced error 3695  58301   3540
When selecting kRound == 1, one can fix all the results by
uint8_t fix_if_larger_by_one(uint8_t sqrt, uint8_t x, uint8_t y) {
auto P = (x*x + y*y) / 2;
auto Q = sqrt*sqrt;
return sqrt - (P < Q);
}
One can also calculate the square root by approximating the dot product x*a + y*b for several angles, where the traditional approach is to use a single angle a = 1, b = 1/2.
With 5 unique angles, at approximately the angles [0 10 20 30 40] or [5 15 25 35 45] degrees, one comes up with either set of coefficients, both of which produce near-perfect results that are at most off by one unit.
1) [181 0], [178 31], [170 62], [157 91], [139 116]
2) [180 18], [175 46], [164 76], [148 104], [128 128]
Option 1 has 9 non-trivial coefficients (although 62 == 31*2). Option 2 has 8 non-trivial coefficients, which lends itself to the following implementation:
int approx(uint8_t x, uint8_t y) {
    if (x < y) std::swap(x, y);        // sort so that x >= y
    auto a4 = (x + y) / 2;             // vhaddq_u8(x,y) on Arm Neon
    auto a0 = (x * 180 + y * 18) >> 8;
    auto a1 = (x * 175 + y * 46) >> 8;
    auto a2 = (x * 164 + y * 76) >> 8;
    auto a3 = (x * 148 + y * 104) >> 8;
    return std::max({a0, a1, a2, a3, a4}); // needs <algorithm>
}
This set of mostly even coefficients converts quite nicely to the SSSE3 instruction set with the _mm_maddubs_epi16 and _mm_max_epu16 intrinsics: every dot product but a1 can be calculated with a single instruction from interleaved x,y and interleaved coefficients. Naturally it makes more sense to calculate 16 adjacent approximations at a time, both to hide the latencies and to not waste any of the computation spent on _mm_packus_epi16, the sorting, or the averaging of the uint8_t inputs.
auto a0 = _mm_maddubs_epi16(xy, coeffs0); // coeffs0 = 90 9 90 9 ...
auto a1 = _mm_maddubs_epi16(xy, coeffs1); // coeffs1 = 87 23 87 23 ...
auto a2 = _mm_maddubs_epi16(xy, coeffs2); // coeffs2 = 82 38 82 38 ...
auto a3 = _mm_maddubs_epi16(xy, coeffs3); // coeffs3 = 74 52 74 52 ...
auto a4 = _mm_maddubs_epi16(xy, coeffs4); // coeffs4 = 64 64 64 64 ...
a1 = _mm_add_epi16(a1, x_per_2); // LSB of the coefficient 87.5
// take the maximum, shift right by 7 and pack to uint8_t
a0 = _mm_max_epu16(a0, a1);
a0 = _mm_max_epu16(a0, a2);
a0 = _mm_max_epu16(a0, a3);
a0 = _mm_max_epu16(a0, a4);
a0 = _mm_srli_epi16(a0, 7);
a0 = _mm_packus_epi16(a0, a0);
Using just 8 coefficients is also suitable for an ARM Neon implementation, which can use 16-bit by 16-bit scalar multiplication, storing all the coefficients in just a single full-width register.
For perfect results the dot-product algorithm must be compensated in the other direction, as it may give values that are one below the reference implementation of floor(sqrt((x*x + y*y)/2)):
uint8_t fix_if_smaller_by_one(uint8_t sqrt, uint8_t x, uint8_t y) {
auto P = (x*x + y*y) / 2;
auto Q = (sqrt+1)*(sqrt+1);
return sqrt + (Q <= P);
}
Other approximation algorithms typically use either division or scaling, which is difficult to vectorise on Intel before AVX2 due to the lack of a variable per-lane shift.
Well, you can write x in polar form:
x = r[cos(w) + i sin(w)]
where w = arctan(Im(x)/Re(x)), so
|x| = r = Re(x)/cos(w)
No big numbers here but maybe you'll lose precision on trigonometric functions (that is, if you have access to trigonometric functions :-/ )
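A floating-point sketch of that identity (assuming trig functions are available, which they usually are not on an HCS08):

```cpp
#include <cmath>

// |x| = Re(x)/cos(w) with w = atan2(Im(x), Re(x)); fall back to |Im(x)|
// when Re(x) is near zero and cos(w) vanishes.
double mag_polar(double re, double im) {
    double c = std::cos(std::atan2(im, re));
    return std::fabs(c) > 1e-9 ? re / c : std::fabs(im);
}
```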
A cheap and dirty method that may or may not be suitable is to use
|x| ~ max(|Re{x}|, |Im{x}|) + min(|Re{x}|, |Im{x}|)/2;
This will tend to overestimate |x| by somewhere between 0 and 12%.
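For reference, a minimal sketch of this estimate, which needs only a compare, a shift and an add:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// max + min/2: overestimates the true magnitude by up to ~12%
// (worst near min/max == 1/2), never underestimates it.
uint8_t cheap_mag(int8_t re, int8_t im) {
    uint8_t a = std::abs(re), b = std::abs(im);
    return std::max(a, b) + (std::min(a, b) >> 1);
}
```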
If you're subsequently going to convert the magnitude to dB, you can dispense with the sqrt operation altogether. That is, if your calculation is:
magnitude = sqrt(re*re+im*im); // calculate magnitude of complex FFT output value
magnitude_dB = 20*log10(magnitude); // convert magnitude to dB
you can rewrite this as:
magnitude_sq = re*re+im*im; // calculate squared magnitude of complex FFT output value
magnitude_dB = 10*log10(magnitude_sq); // convert squared magnitude to dB
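The rewrite relies on the log identity 20*log10(m) == 10*log10(m*m); as a quick numeric check:

```cpp
#include <cmath>

// 10*log10(re*re + im*im) equals 20*log10(sqrt(re*re + im*im)),
// so the square root is never needed on the dB path.
double magnitude_db(double re, double im) {
    return 10.0 * std::log10(re * re + im * im);
}
```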
You might be limited to just two registers, but you could look at the code at http://www.realitypixels.com/turk/opensource/index.html, which includes Fixed-Point Square Root and Fixed-Point Trigonometric Functions using CORDIC.