You are on page 1of 50

Floating Point to Fixed Point Conversion

Audio DSP Ksharing: Rajesh Sharma


www.rajeshsharma.co.in sharma.rajesh@gmail.com

Numbers Real Numbers


Rational, irrational, positive, negative, zero An infinite decimal representation such
2.487154145934798874502350989093. as

Any point on a infinitely long number line

Integer Numberswhole numbers, no fractional part Discrete points on number line e.g. -10, -4, 0, 7, 9

www.rajeshsharma.co.in 2

7/17/2010

Number Representation Computer


Finite Word Length is a major issue Processor Register Length: 8Bit, 16Bit, 24Bit, 32Bit, 64Bit.. A limitation on the possible representations in a register

Floating Point Numbers:


IEEE754 Standard Pseudo Float

Fixed Point Numbers: No Defined Standard


Fractional Numbers: Q Format Integer Numbers
www.rajeshsharma.co.in 3
7/17/2010

Format- Representing numbers in binary system. e.g. for signed number we have
Float: Mantissa Bits & Exponent Bits Fixed: Integer Bits & Fractional Bits Total Bits = Mantissa / Integer + Exponent /fractional + sign bit

Few Terms

Precision - Effective Word Length = N Accuracy Maximum error between actual number and its
representation.
e.g. representation is accurate till 8th decimal place e.g. representation has maximum of 0.001% error

Resolution- smallest non zero magnitude representable with given


format www.rajeshsharma.co.in 4
7/17/2010

Dynamic Range
While Processing/Storing variables: we need to find
Var Var
max

absolute maximum value a variable can have absolute minimum value a variable can have

min

Range of a Variable: : -Var to +Var Resolution of a Variable: : Var


max res

max

Resolution can be determined from..


Absolute Minimum value a variable can have: Var How precisely two consecutive values of a variable may differ what is the accuracy required (e.g. till 8th decimal place)
min

www.rajeshsharma.co.in 5

7/17/2010

Dynamic Range
The ratio of the largest and smallest possible values of the
variable quantity
Dynamic Range is more often expressed as Ratio 5000:1 or as dB 75 dB Or in Bits 12 Bits

For Fixed point, resolution

of the representation becomes the smallest value of the variable.

Varmax Varmax Dynamic Range ( Ratio) = = Varres Dynamic Range ( Bits ) = log 2 ( Varmax ) = N Bits Varres

Bits for signed representation = B = N + 1 Bits for unsigned representation = B = N


www.rajeshsharma.co.in 6
7/17/2010

Dynamic Range: Audio system


Audio system specification uses unsigned representation
The ratio of amplitude of loudest possible undistorted sine wave to rms
noise amplitude

e.g. of microphone, loudspeaker, ADC, DAC etc.. For display devices it is often called contrast ratio
Varmax Dynamic Range (dB) = 20 log10 Dynamic Range (dB) = 6.02 * N + 1.76
For Fixed point conversion normally Signed representation method is used to calculate dynamic range
www.rajeshsharma.co.in 7
7/17/2010

Computers- P, DSP, Controller


N

Finite Word Length

Store results in Registers having finite word length N Bit storage can have only 2 possibilities

How to represent a variable ?


Floating point format Fixed point format

Choice is governed by
Dynamic Range of Variable (Most Imp) Power consumption by specific type of ALU Cost of implementation of ALU.. Other possibilities. www.rajeshsharma.co.in 8
7/17/2010

Floating Point
The decimal point or binary point is floating Economical for very large/small number storage FPU silicon implementation is costly Usually represented as

Two ways of finding values.

r = (1) 2

s (eB)

0. f

or

r = (1) 2

s (eB)

1. f

S : sign Bit value;

B- Bias ; e exponent; f - Mantissa

IEEE754, Single Precision: s- 1; Bit; B-127; e-8Bits; f-23Bits

www.rajeshsharma.co.in 9

7/17/2010

Floating Point Variable

The Range of variable is very high Still there are only 2 distinct values for N Bits The accuracy of variable varies with values Spacing between numbers is not constant IEEE 754 supports four precisions
N

Half precision, 16 Bit; Single precision, 32 Bit Double precision, 64 Bit; Quadruple precision, 128 Bit

www.rajeshsharma.co.in 10

Single Precision example


7/17/2010

Floating Point Arithmetic

Floating point arithmetic is very easy to use Floating point arithmetic: problems
It may not be distributive (processor implementation)*
a * (b + c) != a*b + a*c

Same mathematical operations on different processors


may not produce exactly same result*

Operations on large numbers has more error. Mantissa limits the resolution Exponent limits the largest possible number
*PS: Different floating point implementations of one DSP algorithm may not be Bit Exact.

www.rajeshsharma.co.in 11

7/17/2010

Number line
Double Precision: 64 bit ;
Mantissa = 52 Bits; exponent 11 Bits

Variable spacing example: 4 digit representation

Number 1.7687*107

Next Number 1.7688*107

Difference 0.0001*107 =1000

1.7687*10-2 1.7688*10-2

0.0001*10-2
7/17/2010

=0.000001 www.rajeshsharma.co.in 12

Number line

2 24 values/decade

Microsoft Equation 3.0

This is for decimal Representation For binary representation it is 224 values/octave

www.rajeshsharma.co.in 13

7/17/2010

Float to Fix: Motivation

To achieve higher speed of operation at lower cost To reduce silicon area of hardware -> cost Power saving during code execution Fixed point implementations may be directly used in
many DSPs

Floating point implementation on different processor


may not be bit-exact

Depending on given input range


it may be beneficial to use fixed point.
www.rajeshsharma.co.in 14
7/17/2010

Fixed Point Numbers

The binary point is fixed for a given format


Integer numbers Fractional numbers, Q format
fractional format

Range is directly limited by the number of Bits The spacing between numbers is constant Q format :written as MQN format or M.N
K bits for Integer part N bits for fractional part M = K + Sign bit = K + 1
www.rajeshsharma.co.in 15
7/17/2010

Fractional Format

signed Q format 13.3


13Q3

1Q3 number

www.rajeshsharma.co.in 16

7/17/2010

Fractional Formats: Range (example)

www.rajeshsharma.co.in 17

7/17/2010

How to Convert
A fractional number can be converted to fixed point number for a
given signed Q format MQN

Set all the N LSB bits to 1 and M Bits to 0; call it maxN e.g. for 1Q31 format, max31 = 0X7FFFFFFF For 1Q15->0X7FFF = 32767; for 3Q13->0X1FFF= 8191 Now multiply the given fractional number with this number
Number -----------Format 1Q15 HEX/Dec 5Q11 0.375 -0.625 1 1.25 2.005 0.0003

2FFF 12287 0BFF

B001 -20479 EC01

7FFF 32767 1FFF

Over Flow 27FE

Over Flow 4026

0009 9 2 2

HEX/Dec 3071 -5119 8191 www.rajeshsharma.co.in 18

10238 16422

7/17/2010

Setting Q Format

The variable to be converted is explored for Range of the variable value Absolute Maximum / Minimum values of variable Resolution of variable: smallest non zero number Accuracy: Maximum error for a representation
Accuracy required highly depends on the arithmetic involved. It also depends on the reference for final comparison Calculate accuracy from tolerable %error in representation.

www.rajeshsharma.co.in 19

7/17/2010

Exploration Process

If the variable is a part of code


Run the code for various test cases Find the previously described values (range, resolution) Range & Accuracy can be also found from arithmetic involved Find the Resolution

If the variable value comes from RO table then


These values values can be found directly

www.rajeshsharma.co.in 20

7/17/2010

Setting Q Format: MQN


Now calculate the Dynamic range required
Varmax Dynamic Range ( Ratio) = Varres Varmax Dynamic Range ( Bits ) = log 2 ( ) = B bits Varres
from MaxInt of the Range

Determine the K Integer bits

MaxInt = ceil (Varmax ) K = ceil (log 2 ( MaxInt ))

DR bits = B = K + N Total Bits = B + 1(S) => 1(S) + K + N; => M + N S is for Sign Bit
www.rajeshsharma.co.in 21
7/17/2010

RO Table Example
Const float Table[5]={ (5.0979557966e-001), (6.0134489352e-001), We will set the Q format (8.9997625993e-001), For this RO table and compare two choices of (2.5629159405e+000) resolution (-4.4628159401e+000) } Range: 4.4628159401e+000 (signed representation) Resolution: 5.0979557966e-001 = Variable minimum value Resolution = Table[1]-Table[0] = 0.09154931386 MaxInt = 5
www.rajeshsharma.co.in 22
7/17/2010

Example cont

DR = (4.4628159401) / 0.09154931386 = 48.74 DR Bits B = 6 ; K = 3 Bits (as MaxInt = 5) M = 4 Bits, N = 3 Bits; Total Bits = 6(DR) + 1(S) => 1(S) + 3(K) + 3(N) The conversion format MQN is 4Q3 K = 4 Bits ; N = 3 Bits; max3 = 7; 5.0979557966e-001 * max3 = 3 ; 3 / max3 is equivalent to = 0.42857142857. Representation Error = 15.93% (very huge error)
This error is not acceptable

www.rajeshsharma.co.in 23

7/17/2010

The best would be to correctly represent the last digit of


0.50979557966 (accuracy till 11 i.e. Resolution = 0.00000000006
th

Example cont
digit) =

DR = (4.4628159401) / 0.00000000006 = 74380265668 DR Bits = 36 bits Total Bits = 36(DR) + 1(S) => 1(S) + 3(K) + 33(N) max33 = 0x1FFFFFFFF = 8589934591 5.0979557966e-001 * max33 = 4379110684 ; 4379110684 /max33 = 0.509795579652976. Representation Error = 1.37e-9% Q Format is set as 4Q33
www.rajeshsharma.co.in 24
7/17/2010

Final Table in 4Q33 Format

PolyPhase4ptDCT4toDCT3[]={
4379110684, Error = 1.37e-9% The 37 bits representation error 5165513301, Error = 1.87e-8% is very small and may not be needed for our application. 7730737206, Error = 3.25e-9% We may try with 32 Bits and check -38335297017, Error = 1.53e-9% } if it fits into application or not??

This is very good representation, But still many things matters e.g.
Precision of arithmetic involved Bits available for representation, e.g.
www.rajeshsharma.co.in 25 32 Bits

7/17/2010

Shifting Binary Point

Right Shift to a fractional number in MQN format


Binary point shift to right MQN >> n (M + n)Q( N - n)

Left Shift to a fractional number in MQN format


Binary point shift to left MQN << n (M - n)Q( N + n)

For example:

2Q30 << 1 is 1Q31

Binary point shifting only indicate the change in the maximum/minimum value representable with resulting format

www.rajeshsharma.co.in 26

7/17/2010

Exercise

If array num[] contains 2Q13 format data


for(i=0;i<200;i++){ sum = sum + num[i] }

What is the output format of sum ??

www.rajeshsharma.co.in 27

7/17/2010

Fixed Point Arithmetic

Overflow Vs Carry Flag

Carry Flag is used for unsigned numbers It is set when


Addition/subtraction of two unsigned operands doesnt fit into
the given Bits

Overflow flag is for signed arithmetic It is set when


When Add/sub/mult of two numbers does not give the expected
sign bit at the result

www.rajeshsharma.co.in 29

7/17/2010

Fractional Addition

Two fixed point numbers a & b can be added directly if


Binary point at same place for both numbers, a & b

Resulting format of fractional addition will have


Binary point at same place for both numbers, input & output

www.rajeshsharma.co.in 30

7/17/2010

Fractional Addition
Result format of output is same as that of input
i.e. Binary point at same place for both numbers, input & output

There is a possibility of growth of Integer Part


Example accumulation of an array with data in MQN format for( i=0; I < L; i++ )
sum = sum + ar[i];

m = ceil (log 2 ( L))

Sum will have output format of (M+m)QN

www.rajeshsharma.co.in 31

7/17/2010

Headroom/Guard Bits

Overflow during a addition/subtraction can be handled


with headroom

example addition

for( i=0; i<256; i++ ) sum = sum + ar[i]; There is possibility of 8 bit overflow

Variable sum should have 8 bit headroom in MSBs to


accommodate the overflow

www.rajeshsharma.co.in 32

7/17/2010

2s Complement Multiplication

www.rajeshsharma.co.in 33

7/17/2010

Fractional Multiplication

The product of two N bit numbers is 2N bits Example product of MQN and KQL formats Output format is: (M+K)Q(N+L) format

www.rajeshsharma.co.in 34

7/17/2010

Fractional Multiplication

1Q3 Numbers

2Q6 Numbers

www.rajeshsharma.co.in 35

7/17/2010

Fractional Multiplication Cont..

www.rajeshsharma.co.in 36

7/17/2010

Normalizing Fractions

A product of 2 twos complement numbers has


bits.

two sign

result can be left shifted by 1 bit. If one of the format is 1QN i.e. N+1 Bits, then
Result of left shift will have the format of other operand If other operand is 5Q11 then result of mult is 6Q(11+N) After 1 bit left shift, result is 5Q(11+N) i.e. 5Q11 + (N+1)LSBs

To make most out of available bits,


it is good to left shift the product
www.rajeshsharma.co.in 37
7/17/2010

Rounding

Why Rounding
To replace a Numerical value with approx. equal & shorter i.e. low precision representation

e.g. replacing 45.6782 with 45.68 Rounding is necessary evil... (..remember quantization..)
Many times it is unavoidable It introduces round-off errors Required in float to fixed conversion and.. floating & fixed point arithmetic and.. for function approximation on fixed point processors
www.rajeshsharma.co.in 38
7/17/2010

Rounding Methods

When rounding a number x to q an integer


Round to nearest : q is integer close to x Round towards zero (truncate): q is the integer part of x Round down (floor): q is largest integer that does not exceed x Round up (ceiling): q is smallest integer that is not less than x

Tie Break during rounding


When fractional part to be rounded is exactly half boundary e.g. rounding of 23.5 requires tie breaking

www.rajeshsharma.co.in 39

7/17/2010

Tie Breaking

Round half up: i.e. q = x + 0.5 (x is positive or negative )


It is asymmetric rounding i.e. biased (with round to nearest)

Round half away from zero:


q = x + 0.5 ; x is positive ; q = x 0.5 ; x is negative This method is free of bias

Round half to even


q is the even integer nearest to x This method is more unbiased +33.5 +34 ; +32.5 +32 ; -32.5 -32 ; -33.5 -34 Known as: unbiased rounding , convergent rounding,
statistician's rounding, Dutch rounding, Gaussian rounding, or bankers' rounding www.rajeshsharma.co.in 7/17/2010 40

Dithering

Stochastic Rounding:
Choose q randomly between x+0.5 and x-0.5 It is also bias free because of random component

Dithering: when to use this method


when the signal being rounded / quantized is slowly varying All rounding methods produce monotonous round-off errors Moreover monotonous round-off error because of particular
rounding methods being used

All of them introduces non-linear response in a system Harmonics in the filter response because of rounding method
www.rajeshsharma.co.in 41
7/17/2010

Block Floating Point Processing

When Dynamic Range of a signal is very high When the input is fluctuating between very high and very
small values.

When very small values are as important as larger


values
A block of data values share a common radix point for processing

Instruction like CLZ & NORM are helpful.


www.rajeshsharma.co.in 42
7/17/2010

Assuming you are working on a 32 bit DSP


Convert following table into fixed point.
const float gain_table[8] ={ (8.2387952833e-002), (9.2387952833e-001), (7.0710676573e-001), (3.8268340208e-001) (1.9615705587e+000), (1.6629392064e+000), (1.1111404206e+000), (3.9018056901e+001) };
www.rajeshsharma.co.in 43
7/17/2010

Exercise1

Exercise2

Assume that inp[ ] array contains 16 Bit PCM data


Find the o/p format of out[ ] ? Convert following loop into fixed point. for (i=0;i<8;i++){ out[i] = inp[i] * gain_table[i]; }

www.rajeshsharma.co.in 44

7/17/2010

Exercise3
Now given below is the DCTIII code, with gain applied at end Convert this into a fixed point code

www.rajeshsharma.co.in 45

7/17/2010

Exercise4

Following IIR filter is used for filtering 16 bit PCM ip Convert it into fixed point and ensure that the filter
response is linear.
x[n] + Z-1 x[n-1] Z-1 x[n-2] y[n-2] b2 a2 b1 a1 Z-1 b0 Z-1 y[n-1] y[n]

y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]

www.rajeshsharma.co.in 46

7/17/2010

Exercise4: C code

www.rajeshsharma.co.in 47

7/17/2010

Exercise4: C code Filter Loop

www.rajeshsharma.co.in 48

7/17/2010

References

Fixed Point Arithmetic: an Introduction


By Randy Yates

Application Note 33
Fixed point arithmetic on the ARM
Document number ARM DAI 0033A

Wiki

www.rajeshsharma.co.in 49

7/17/2010

THANK YOU

You might also like