Overflows in fixed point math

Part learning exercise, part hobby project: I am implementing my own interpretation of the Cooley-Tukey FFT algorithm on an AVR using fixed point math. I haven't dealt much with fixed point math before, and I am wondering how best to go about part of the implementation. I guess the gist of this question is a request to confirm that I'm thinking about the issues involved correctly.

The heart of the C-T algorithm involves a set of multiplications and additions on complex-valued data in the following fashion (in pseudocode):

temp1 = cosine(increment)*dataRealPart[increment1]
        - sine(increment)*dataImaginaryPart[increment1]

temp2 = cosine(increment)*dataImaginaryPart[increment1]
        + sine(increment)*dataRealPart[increment1]

dataRealPart[increment1] = dataRealPart[increment2] - temp1

etc.

The cosine and sine data will be 8 bit signed binary fractions of the form S.XXX'XXXX, the input data will also be 8 bit signed binary fractions of the form SXXX.XXXX, and the multiplications will generate a 16 bit signed fraction product. As I see it, for particularly "bad" values of sine and cosine and the real and imaginary part of the data, temp1 or temp2 will come pretty close to the limits of a 16 bit signed integer.

If both the real part and imaginary part of the data are, say, b0111.1111, a little work in Wolfram Alpha shows that, with "bad" values of sine and cosine, the output can be up to 1.4 times larger than what simply multiplying the maximum value of a sine times the maximum value of the input would be.

For example, if the sine value is b0.111'1111 and the input value is b0111.1111, the product would be b0111111.00000001, or 16129 in decimal. 1.4 times that would be about 22580. This won't overflow the positive range of a signed 16 bit int, but in the next lines these products are added to and subtracted from the input data, and assuming the input data is converted to sixteen bits, it's likely that an overflow will occur.
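To make those numbers concrete, here is a minimal C sketch of that worst case (the variable names and the ~0.707 twiddle value are illustrative, not taken from the question):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int8_t sin_q7 = 91;    /* ~0.711 in S.XXX'XXXX (Q7) */
    int8_t cos_q7 = 91;    /* ~0.711 in Q7 */
    int8_t re_q4  = 127;   /* ~7.94 in SXXX.XXXX (Q4), worst-case input */
    int8_t im_q4  = 127;

    /* each Q7 x Q4 product is a Q11 value needing 16 bits */
    int16_t p1 = (int16_t)cos_q7 * im_q4;     /* 11557 */
    int16_t p2 = (int16_t)sin_q7 * re_q4;     /* 11557 */
    int16_t temp2 = p1 + p2;                  /* 23114, about 1.43x one max product */

    /* rescale the input sample to Q11 and add it, as the later butterfly lines do */
    int16_t data_q11 = (int16_t)re_q4 << 7;   /* 16256 */
    int32_t out = (int32_t)data_q11 + temp2;  /* 39370 - would overflow an int16_t */

    printf("temp2 = %d, out = %ld (int16_t max is 32767)\n", temp2, (long)out);
    return 0;
}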

It looks to me like the trade off is: either increase the internal processing resolution of the data, which increases the processing time, or make sure the input data stays lower than the amplitude which causes overflow, decreasing signal-to-noise ratio. Is that about the size of things?


One option is to reduce your sine and cosine values to Q6 (6 bits to the right of the binary point). This would make them +/-64. Notice that at one bit more precision (Q7), -1 is representable but +1 is not (it would be +128). Also, after you multiply there will be 2 sign bits in the result, which means a sum of 2 products can probably be added with no problem. With this extra bit from the reduced resolution you should really be able to avoid overflow.

Another point - if your complex value is limited to a magnitude of 1 (real*real + img*img <= 1), then the resulting sum will not exceed 1, as opposed to the 1.4 figure you found. That's because when sine is 1, cosine is zero: you're essentially taking a dot product of a unit vector (cos, sin) with the complex vector.

Beyond that, you can shift your 16 bit products right a few bits prior to adding them, preferably with rounding - you don't really have 16 bits of precision anyway, since both the data and the trig functions were rounded to 7 bits.
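As a rough sketch of what that scaling might look like in C (Q6 twiddles, products shifted right with rounding; all names are made up for illustration, and arithmetic right shift of negative values is assumed, which holds for avr-gcc):

#include <stdint.h>

/* one butterfly step with Q6 twiddles (S.XXXXXX) and Q4 data (SXXX.XXXX) */
static int16_t mul_trig(int8_t trig_q6, int8_t data_q4)
{
    return (int16_t)trig_q6 * data_q4;   /* Q10 product, at most |64 * 127| = 8128 */
}

static int16_t round_shift(int16_t v, int bits)
{
    return (int16_t)((v + (1 << (bits - 1))) >> bits);   /* right shift with rounding */
}

void butterfly(int8_t cos_q6, int8_t sin_q6,
               int8_t re_q4, int8_t im_q4,
               int16_t *temp1, int16_t *temp2)
{
    /* each sum of two Q10 products stays well inside int16_t (|t| <= 16256) */
    int16_t t1 = mul_trig(cos_q6, re_q4) - mul_trig(sin_q6, im_q4);
    int16_t t2 = mul_trig(cos_q6, im_q4) + mul_trig(sin_q6, re_q4);

    /* drop a couple of low bits (with rounding) before the later add/subtract steps */
    *temp1 = round_shift(t1, 2);
    *temp2 = round_shift(t2, 2);
}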

One last point. If you're taking a sum of many numbers and you know the result will always be within your representable range, you do not need to worry about overflows of the intermediate results. It will all just work out in the end (the bits that you throw away all sum to zero anyway).
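A tiny illustration of that wrap-around property, using unsigned arithmetic so the intermediate wrap is well defined in C and reinterpreting the final sum as signed:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* true sum: 30000 + 10000 - 25000 = 15000, which fits in int16_t,
       even though the intermediate 40000 does not */
    uint16_t acc = 0;
    acc += 30000u;
    acc += 10000u;   /* read as a signed 16 bit value this intermediate is already -25536 */
    acc -= 25000u;

    printf("%d\n", (int16_t)acc);   /* prints 15000 */
    return 0;
}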


I know this is a long time ago, but here are some notes on fixed point math from a library I wrote.

Overview of Using Fixed Point Math

Key ideas that come up in implementing quantized or DSP-style math include how to represent fractional units and what to do when math operations overflow the integer registers.

Floating point types (float or double in C/C++) are most often used to represent fractional units, but they are not always available (no hardware support, or, on low end microcontrollers, no library software support). If hardware support is not available, floating point can be provided via a software library; however, these libraries are generally at least an order of magnitude slower than hardware implementations. If the programmer is clever, integer math can be used instead of software floating point, resulting in much faster code. In these cases fractional units can be represented with integer registers (short, int, long) if a scaling factor is used.

Scaling and Wrap Around

Two common issues that come up in creating scaled-integer representations of numbers are:

Scaling - The programmer must keep track of the fractional scaling factor by hand, whereas with floating point numbers the compiler and floating point library do this for the programmer.

Overflow and Wrap Around - If two large numbers are added, the result may be larger than can fit in the integer representation, resulting in wrap-around or overflow errors. Worse yet, these errors may go unnoticed depending on compiler warning settings or the types of operations. For example, take two 8 bit numbers (usually a char in C/C++):

char a = 34, b = 3, c;

// and compute
c = a*b;    // the result is 102, which still fits in an 8 bit result

// But what happens if b = 5?
b = 5;
c = a*b;    // the real answer is 170, but the value stored in c will be -86

This type of error is called an overflow or wrap-around error and can be dealt with in several ways. We could use larger integer types such as short or int, or we could test the numbers beforehand to see whether the result would overflow. Which is better? It depends on the circumstances. If we are already using the largest natural type (such as a 32 bit integer on a 32 bit CPU), then we may have to test the values before the operation is performed, even if this incurs some runtime performance penalty.
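As a sketch of the "test before the operation" idea, here is a hypothetical saturating multiply (the 64 bit widening is just for clarity; on a machine with no wider type you would compare against INT32_MAX / |b| before multiplying):

#include <stdint.h>

/* multiply two 32 bit values, clamping instead of wrapping on overflow */
static int32_t sat_mul32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;   /* exact product */
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
}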

Loss of Precision

True floating point representations can maintain relatively arbitrary precision across a range of operations. With fixed point (scaled) math, however, the scaling process means we dedicate a fixed portion of the available bits to the precision we desire, so mathematical information smaller than the scaling factor is lost, causing quantization error.

A Simple Naive Scaling Example

Let's try to represent the number 10.25 with an integer. We could do something simple like multiply the value by 100 and store the result in an integer variable, so we have:

10.25 * 100 ==> 1025

If we wanted to add another number, say 0.45, to it, we would take 0.45 and scale it up by 100 too, so

0.45 * 100 ==> 45

Now to add them together we add the integerized numbers

1025 + 45 ==> 1070

But let's look at this with some code:

#include <stdio.h>

int main (void)
{
    int myIntegerizedNumber1 = 1025;   /* 10.25 scaled up by 100 */
    int myIntegerizedNumber2 =   45;   /*  0.45 scaled up by 100 */
    int myIntegerizedNumber3;

    myIntegerizedNumber3 = myIntegerizedNumber1 + myIntegerizedNumber2;

    printf("%d + %d = %d\n", myIntegerizedNumber1, myIntegerizedNumber2, myIntegerizedNumber3);
    return 0;
}

But now comes the first of a few challenges. How do we deal with the integer and fractional parts? How do we display the results? There is no compiler support; we as the programmers must separate the integer and fractional parts ourselves. The program above prints the result as 1070, not 10.70, because the compiler only knows about the integer variables we have used, not our intended scaled-up interpretation.
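One simple way to display such a base-100-scaled value (a minimal sketch continuing the example; negative values would need a little extra care):

#include <stdio.h>

int main(void)
{
    int scaled = 1070;   /* represents 10.70, i.e. the real value scaled up by 100 */

    /* split back into integer and fractional parts for display */
    printf("%d.%02d\n", scaled / 100, scaled % 100);   /* prints 10.70 */
    return 0;
}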

Thinking in powers of 2 (radixes)

In the previous example we used base-10 math, which, while convenient for humans, is not an optimal use of bits, since all the numerics in the machine use binary math. If we use powers of two we can specify the precision in bits, rather than base 10, for the fractional and integer parts, and we get several other advantages:

Ease of notation - For example, with a 16 bit signed integer (typically a short in C/C++), we can say the number is "s11.4", meaning it is signed with 11 bits of integer and 4 bits of fractional precision. Strictly speaking no single bit is set aside for the sign, since the number is stored in 2's complement format, but effectively 1 bit is spent on sign from the point of view of the precision represented. If a number is unsigned, we can say it is u12.4 - the same 16 bits now give 12 integer bits of precision and 4 bits of fractional representation. If we were to use base-10 math no such simple mapping would be possible (I won't go into all the base-10 issues that would come up); worse yet, many divide-by-10 operations would need to be performed, which are slow and can result in precision loss.

Ease of changing radix precision - Using base-2 radixes allows us to use simple shifts (<< and >>) to change from integer to fixed point or between different fixed point representations. Many programmers assume that radixes should be a byte multiple like 16 bits or 8 bits, but using just enough precision and no more (like 4 or 5 bits for small graphics apps) can allow much larger headroom, because the integer portion gets the remaining bits.

Suppose we need a range of +/-800 and a precision of 0.05 (or 1/20 in decimal). We can fit this into a 16 bit integer as follows. First, one bit is allocated to sign, which leaves 15 bits of resolution. We need 800 counts of range, and log2(800) = 9.64..., so we need 10 bits for the integer dynamic range. For the fractional precision we need log2(1/0.05) = 4.32 bits, which rounded up is 5 bits. So in this application we could use a fixed radix of s10.5, i.e. a signed 10 bit integer part with 5 bit fractional resolution - and it still fits in a 16 bit integer. There are some issues: the fractional step is 1/32 (about 0.03125), which is finer than the requirement of 0.05 but not identical to it, so accumulated operations will still pick up quantization error. This can be greatly reduced by moving to a larger integer and radix (e.g. a 32 bit integer with more fractional bits), but often this is not necessary, and manipulating 16 bit integers is much more efficient in both compute and memory on smaller processors. The choice of these integer sizes should be made with care (a small conversion sketch follows below).

A few notes on fixed point precision

When adding fixed point numbers together it is important to align their radix points (e.g. you must add 12.4 and 12.4 numbers; even 18.4 + 12.4 + 24.4 will work, where the integer portion describes the number of bits in use, not the physical size of the integer register declared). Now, and here begins some of the trickiness, the result of adding two 12.4 numbers is, strictly speaking, a 13.4 number - 17 bits - so you must be careful to watch for overflow. One way is to put the result in a larger 32 bit wide register of s27.4 precision. Another way is to test whether the high order bits of either operand are actually set - if not, the numbers can be safely added despite the register width limitation. Another solution is to use saturating math: if the result is larger than the register can hold, set it to the maximum possible value. Also be sure to watch out for sign; adding two negative numbers can cause wrap-around just as easily as two positive numbers.
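Here is the small conversion sketch mentioned above for the s10.5 sizing example (the helper names are made up for illustration, and the floating point math is only there to demonstrate the quantization on a host machine):

#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 5   /* s10.5: 1 sign + 10 integer + 5 fractional bits */

/* convert a real value to s10.5 and back (no range checking here) */
static int16_t to_s10_5(double x)
{
    return (int16_t)(x * (1 << FRAC_BITS) + (x >= 0 ? 0.5 : -0.5));
}

static double from_s10_5(int16_t v)
{
    return (double)v / (1 << FRAC_BITS);
}

int main(void)
{
    int16_t a = to_s10_5(799.95);   /* near the +800 end of the range */
    int16_t b = to_s10_5(-0.05);    /* one requested precision step */

    printf("a=%d (%.5f)  b=%d (%.5f)\n", a, from_s10_5(a), b, from_s10_5(b));
    /* note b quantizes to -2/32 = -0.0625: the 1/32 step vs the 0.05 requirement */
    return 0;
}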

An interesting problem that comes up in designing fixed radix pipelines is keeping track of how much of the available precision is actually in use: the registers may all be, say, 16 bits wide, but only some of those bits may carry live data at any given stage. This is especially true for operations like Fourier transforms and the like, which can leave some array cells with relatively large values while others hold near-zero mathematical energy.

Some rules (assuming no testing for overflow):

Adding 2 M bit numbers results in an (M+1) bit result.

Adding N M bit numbers results in an (M + ceil(log2(N))) bit result.

Multiplying an M bit number by an N bit number results in an (N+M) bit result.

Saturation can be useful in some circumstances but may result in performance hits or loss of data.

Adding...

When adding or subtracting fixed radix numbers, the radix points must be aligned beforehand. For example, suppose we want to add A, an s11.4 number, and B, an s9.6 number. We need to make some choices. We could move them to larger registers first, say 32 bit registers, resulting in A2 being an s27.4 number and B2 being an s25.6 number. Now we can safely shift A2 up by two bits, so that A2 = A2 << 2. A2 is now an s25.6 number, but we haven't lost any data, because the upper bits of the former A are held in the larger register without precision loss.

Now we can add them and get the result C = A2 + B2. The result is an s25.6 number, but the precision actually in use is s12.6 (we take the larger integer portion, which comes from A and is 11 bits, and the larger fractional portion, which comes from B and is 6 bits, plus 1 bit from the addition). So this s12.6 number has 18+1 bits of precision without any accuracy loss. But if we need to convert it back to a 16 bit number, we have to decide how much fractional accuracy to keep. The simplest way is to preserve all the integer bits, so we shift C down by enough bits that the sign and integer bits fit in a 16 bit register: C = C >> 3 gives an s12.3 number. As long as we keep track of the radix point we have maintained accuracy. If we had tested A and B beforehand, we might have learned that we could keep more of the fractional bits.
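A small sketch of that addition, assuming arithmetic right shift and using illustrative names:

#include <stdint.h>

/* add an s11.4 value and an s9.6 value, returning an s12.3 result in 16 bits */
int16_t add_s11_4_s9_6(int16_t a_s11_4, int16_t b_s9_6)
{
    int32_t a2 = (int32_t)a_s11_4 << 2;   /* align radix points: now .6 in a 32 bit register */
    int32_t b2 = (int32_t)b_s9_6;         /* already .6 */
    int32_t c  = a2 + b2;                 /* s12.6 worth of live bits, no overflow possible here */
    return (int16_t)(c >> 3);             /* keep all integer bits: s12.3 fits in 16 bits */
}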

Multiplying...

Multiplying does not require that the radix points be aligned before the operation is carried out. Let's assume we have the same two numbers as in the adding example. A is an s11.4 number, which we move to A2, a larger s27.4 register (but still only an s11.4 number of bits is in use). B is an s9.6 number, which we move to B2, a larger s25.6 register (but still only an s9.6 number of bits is in use). C = A2*B2 results in C being an s20.10 number; note that C uses the entire 32 bit register for the result. Now, if we want the result squeezed back into a 16 bit register, we must make some hard choices. Firstly, we already have 20 bits of integer precision, so any attempt to fit the result into a 16 bit register (without looking at the number of bits actually used) must result in some type of truncation. If we take the top 15 bits of the result (plus 1 sign bit), we the programmers must remember that this is scaled by 20 - 15 = 5 bits of precision. So even though the top 15 bits fit in a 16 bit register, we lose the low-order bits of precision and have to remember that the result is scaled by 5 bits (a factor of 32). Interestingly, if we test both A and B beforehand, we may find that while they have the stated incoming precision by convention, they may not actually contain that number of live, set bits (e.g. if A is an s11.4 number by the programmer's convention but its actual value is integer 33, i.e. fixed-radix 2 and 1/16, then we may not have as many bits to truncate in C).
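And a matching sketch of that multiply (again with made-up names, assuming a 32 bit intermediate and arithmetic right shift):

#include <stdint.h>

/* multiply an s11.4 value by an s9.6 value; the full product is s20.10 in 32 bits */
int32_t mul_s11_4_s9_6(int16_t a_s11_4, int16_t b_s9_6)
{
    return (int32_t)a_s11_4 * (int32_t)b_s9_6;   /* fractional bits add: 4 + 6 = 10 */
}

/* squeeze the s20.10 product back into 16 bits by keeping the top 15 magnitude bits;
   the caller must remember the stored value is the true result scaled down by 5 bits (/32) */
int16_t narrow_s20_10(int32_t c_s20_10)
{
    return (int16_t)(c_s20_10 >> 15);   /* drops all 10 fractional bits and 5 integer bits */
}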

(also library here: https://github.com/deftio/fr_math)
