Approximating the square root of sum of two squares on a microcontroller
I'm working on implementing an FFT algorithm in assembly on an 8-bit microcontroller (HCS08) for fun. Once the algorithm is completed, I'll have an array of 8-bit real/imaginary pairs, and I'll want to find the magnitude of each of these values. That is, if x is complex, I want to find
|x| = sqrt(Re{x}^2 + Im{x}^2)
Now I have available to me a 16-bit register and an 8-bit register. I thought about just squaring them, adding them, and taking the square root of the result, but that poses a problem: the maximum possible value of the sum of the squares of two 8-bit numbers is ~130k, which is greater than the maximum value a 16-bit register can hold (65.5k).
I came up with a subroutine which computes the integer square root of a 16-bit number, which seems to work well, but obviously I'm not guaranteed to be working with values that will fit into 16 bits. My thinking right now is that there is an algorithm which will approximate what I need directly, but I can't seem to find anything. Any ideas would be greatly appreciated.
To summarize: Say I have a vector with two 8-bit components, and I want to find the length of the vector. How can I approximate this without actually computing squares and square roots?
Thanks!
There's a web page describing a Fast Magnitude Estimator. The basic idea is to get a least square (or other high quality) fit to the equation:
Mag ~= Alpha * max(|I|, |Q|) + Beta * min(|I|, |Q|)
for the coefficients Alpha and Beta. Several coefficient pairs are listed with mean square errors, max errors, etc., including coefficients suitable for integer ALUs.
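As a sketch of how such an estimator looks in code, here is the alpha-max-plus-beta-min form using Alpha = 15/16 and Beta = 15/32 (one of the shift-friendly integer pairs such tables typically list; treat the exact pair as an assumption and pick coefficients from the table that match your error budget):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Alpha*max + Beta*min with Alpha = 15/16, Beta = 15/32: needs only
// shifts, adds and subtracts, so it maps well onto an 8-bit ALU.
uint8_t mag_estimate(int8_t re, int8_t im) {
    uint8_t a = std::abs(re), b = std::abs(im);
    uint8_t mx = std::max(a, b), mn = std::min(a, b);
    // 15/16 * max == max - max/16,  15/32 * min == min/2 - min/32
    return (mx - (mx >> 4)) + ((mn >> 1) - (mn >> 5));
}
```

The result is at most (128 - 8) + (64 - 4) = 180, so it always fits in a byte.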
If the sum is greater than 65535, divide it by 4 (shift right 2 bits), take the square root, and multiply that by 2. You'll lose one bit of accuracy, and naturally the result is not guaranteed to fit into 8 bits.
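Concretely, with a stand-in `isqrt16()` for the poster's 16-bit integer square-root subroutine (stubbed here with floating point purely for illustration), the scaling trick looks like:

```cpp
#include <cmath>
#include <cstdint>

// Stand-in for the 16-bit integer square-root subroutine described
// in the question; stubbed with floating point for illustration.
static uint8_t isqrt16(uint16_t v) {
    return (uint8_t)std::sqrt((double)v);
}

// When the sum of squares exceeds 16 bits, shift it right by 2
// and double the root afterwards, losing one bit of accuracy.
uint16_t magnitude(uint8_t re, uint8_t im) {
    uint32_t sum = (uint32_t)re * re + (uint32_t)im * im; // up to 130050
    if (sum > 65535)
        return (uint16_t)(isqrt16((uint16_t)(sum >> 2))) << 1;
    return isqrt16((uint16_t)sum);
}
```

Note the return type: the full-scale magnitude can reach 360, so it no longer fits in 8 bits.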
A possible alternative is to compute sqrt((x*x + y*y)/2) instead, which scales all the possible vector magnitudes into the range 0..255.
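The halved sum always fits the root into a byte, since (255*255 + 255*255)/2 = 65025 = 255*255. A floating-point reference for this scaled magnitude (useful for checking any fixed-point version against) might look like:

```cpp
#include <cmath>
#include <cstdint>

// Reference for floor(sqrt((x*x + y*y)/2)): the result is at most
// sqrt(65025) == 255, so it always fits in a uint8_t.
uint8_t scaled_mag_ref(uint8_t x, uint8_t y) {
    uint32_t half_sum = ((uint32_t)x * x + (uint32_t)y * y) / 2;
    return (uint8_t)std::sqrt((double)half_sum);
}
```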
Two (fast) algorithms seem to give near perfect results, one with Cordic, another with maximum of dot products.
void cordic_it(uint16_t &x, uint16_t &y, int n) {
    uint16_t X = x + (y >> n);          // vsraq_n_u16(x, y, n) in Arm Neon
    y = std::abs(int(y) - int(x >> n)); // vabdq_u16(y, x >> n) in Arm Neon
    x = X;
}
uint16_t scaled_magnitude_cordic(uint8_t x, uint8_t y) {
const int kRound = 1;
if (x < y) std::swap(x,y);
// multiply by factor of 256/sqrt(2) == 181.02
// then reduce by the gain of the cordic iterations of 1.16
// - with prescaling we also ensure that the cordic iterations
// do not lose too many significant bits when shifting right
uint16_t X = x * 156, Y = y * 156;
// exactly 4 iterations. 3 is too little, 5 causes too much noise
for (int j = 1; j <= 4; j++) cordic_it(X,Y,j);
return (X+kRound) >> 8;
}
By altering kRound, one can tune up the results:
Histogram of (real - approx):     -1      0      1
kRound ==   0 -> smaller code      1  46617  18918
kRound ==   1 -> approx >= real    0  46378  19158
kRound == -73 -> balanced error 3695  58301   3540
When selecting kRound == 1, one can fix all the results by
uint8_t fix_if_larger_by_one(uint8_t sqrt, uint8_t x, uint8_t y) {
auto P = (x*x + y*y) / 2;
auto Q = sqrt*sqrt;
return sqrt - (P < Q);
}
One can also calculate the square root by approximating the dot product x*a + y*b for several angles, where the traditional approach is to use a single angle a = 1, b = 1/2.
With 5 unique angles, at approximately the angles [0 10 20 30 40] or [5 15 25 35 45] degrees, one comes up with either set of coefficients, both of which produce near-perfect results that are at most off by one unit.
1) [181 0], [178 31], [170 62], [157 91], [139 116]
2) [180 18], [175 46], [164 76], [148 104], [128 128]
Option 1 has 9 non-trivial coefficients (although 62 == 31*2). Option 2 has 8 non-trivial coefficients, which lends itself to the following implementation:
int approx(uint8_t x, uint8_t y) {
    if (x < y) std::swap(x, y);        // sort so that x >= y
    auto a4 = (x + y) / 2;             // vhaddq_u8(x,y) on Arm Neon
    auto a0 = (x * 180 + y * 18) >> 8;
    auto a1 = (x * 175 + y * 46) >> 8;
    auto a2 = (x * 164 + y * 76) >> 8;
    auto a3 = (x * 148 + y * 104) >> 8;
    return std::max({a0, a1, a2, a3, a4}); // needs <algorithm>
}
This set of mostly even coefficients converts quite nicely to the SSSE3 instruction set with the _mm_maddubs_epi16 and _mm_max_epu16 intrinsics: every dot product but a1 can be calculated with a single instruction from interleaved x,y and interleaved coefficients. Naturally it makes more sense to calculate 16 adjacent approximations at a time, both to hide the latencies and to not waste any of the computation spent on _mm_packus_epi16, the sorting, or the averaging of the uint8_t inputs.
auto a0 = _mm_maddubs_epi16(xy, coeffs0); // coeffs0 = 90 9 90 9 ...
auto a1 = _mm_maddubs_epi16(xy, coeffs1); // coeffs1 = 87 23 87 23 ...
auto a2 = _mm_maddubs_epi16(xy, coeffs2); // coeffs2 = 82 38 82 38 ...
auto a3 = _mm_maddubs_epi16(xy, coeffs3); // coeffs3 = 74 52 74 52 ...
auto a4 = _mm_maddubs_epi16(xy, coeffs4); // coeffs4 = 64 64 64 64 ...
a1 = _mm_add_epi16(a1, x_per_2); // LSB of the coefficient 87.5
// take the maximum, shift right by 7 and pack to uint8_t
a0 = _mm_max_epu16(a0, a1);
a0 = _mm_max_epu16(a0, a2);
a0 = _mm_max_epu16(a0, a3);
a0 = _mm_max_epu16(a0, a4);
a0 = _mm_srli_epi16(a0, 7);
a0 = _mm_packus_epi16(a0, a0);
Using just 8 coefficients is also suitable for an ARM Neon implementation, which can use 16-bit by 16-bit scalar multiplication, storing all the coefficients in just a single full-width register.
For perfect results the dot-product algorithm must be compensated in the other direction, as it may give values that are one below the reference implementation of floor(sqrt((x*x + y*y)/2)):
uint8_t fix_if_smaller_by_one(uint8_t sqrt, uint8_t x, uint8_t y) {
auto P = (x*x + y*y) / 2;
auto Q = (sqrt+1)*(sqrt+1);
return sqrt + (Q <= P);
}
Other approximation algorithms typically use either division or scaling, which is difficult to vectorise on Intel before AVX2 due to the lack of a variable per-lane shift.
Well, you can write x in polar form:
x = r[cos(w) + i sin(w)]
where w = arctan(Im(x)/Re(x)), so
|x| = r = Re(x)/cos(w)
No big numbers here but maybe you'll lose precision on trigonometric functions (that is, if you have access to trigonometric functions :-/ )
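A floating-point sketch of that identity (assuming trig functions are available, which they usually are not on an HCS08):

```cpp
#include <cmath>

// |x| = Re(x)/cos(w) with w = atan2(Im(x), Re(x)); fall back to |Im(x)|
// when Re(x) is near zero and cos(w) vanishes.
double mag_polar(double re, double im) {
    double c = std::cos(std::atan2(im, re));
    return std::fabs(c) > 1e-9 ? re / c : std::fabs(im);
}
```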
A cheap and dirty method that may or may not be suitable is to use
|x| ~ max(|Re{x}|, |Im{x}|) + min(|Re{x}|, |Im{x}|)/2;
This will tend to overestimate |x| by somewhere between 0 and 12%.
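For reference, a minimal sketch of this estimate, which needs only a compare, a shift and an add:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// max + min/2: overestimates the true magnitude by up to ~12%
// (worst near min/max == 1/2), never underestimates it.
uint8_t cheap_mag(int8_t re, int8_t im) {
    uint8_t a = std::abs(re), b = std::abs(im);
    return std::max(a, b) + (std::min(a, b) >> 1);
}
```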
If you're subsequently going to convert the magnitude to dB, you can dispense with the sqrt operation altogether. That is, if your calculation is:
magnitude = sqrt(re*re+im*im); // calculate magnitude of complex FFT output value
magnitude_dB = 20*log10(magnitude); // convert magnitude to dB
you can rewrite this as:
magnitude_sq = re*re+im*im; // calculate squared magnitude of complex FFT output value
magnitude_dB = 10*log10(magnitude_sq); // convert squared magnitude to dB
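The rewrite relies on the log identity 20*log10(m) == 10*log10(m*m); as a quick numeric check:

```cpp
#include <cmath>

// 10*log10(re*re + im*im) equals 20*log10(sqrt(re*re + im*im)),
// so the square root is never needed on the dB path.
double magnitude_db(double re, double im) {
    return 10.0 * std::log10(re * re + im * im);
}
```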
You might be limited to just two registers, but you could look at the code at http://www.realitypixels.com/turk/opensource/index.html, which includes Fixed-Point Square Root and Fixed-Point Trigonometric Functions using CORDIC.