
DEPARTMENT OF ELECTRONICS & CONTROL ENGINEERING

SECX1028 DIGITAL SIGNAL PROCESSING


UNIT III EFFECT OF FINITE REGISTER LENGTH

IIR filter realization


The IIR filter transfer function can be expressed as:

H(z) = B(z) / A(z) = (b0 + b1·z^-1 + ... + bN·z^-N) / (1 + a1·z^-1 + ... + aN·z^-N)

where N is the filter order, bk are the coefficients of the non-recursive part of the IIR filter, and ak are the
coefficients of the recursive part (feedback) of the IIR filter.
The coefficients bk and ak are of interest for IIR filter realization (both hardware and
software). Figure 1 illustrates the block diagram of an IIR filter.

Figure 1 Block diagram of IIR filter


There are several types of IIR filter realization. This unit covers the direct (form I), direct
canonic (form II), cascade and parallel realizations. All of them are very convenient and
commonly used for both hardware and software IIR filter realization. Each of them is
described below along with its advantages and disadvantages.
Direct realization
Direct realization of IIR filters starts with the transfer function written as a product of its
non-recursive and recursive parts:

H(z) = B(z) · 1/A(z) = (b0 + b1·z^-1 + ... + bN·z^-N) · 1 / (1 + a1·z^-1 + ... + aN·z^-N)

The first factor of the expression refers to the non-recursive part and the second to the
recursive part of the IIR filter. In IIR filter direct realization, these two parts are considered
and realized separately.
Direct Form - I
The realization of non-recursive part of IIR filter is identical to the direct realization of
FIR filter. Figure 2 illustrates the block diagram of direct realization of non-recursive part
of IIR filter.

Figure 2 Direct realization of non-recursive part of IIR filter


As seen from Figure 2 above, the multiplication coefficients are identical to those of the
transfer function. The realization of the recursive part of the IIR filter is similar to that of
the non-recursive part. Figure 3 illustrates the direct realization of the filter's recursive part.

Figure 3 Direct realization of recursive part of IIR filter


As the non-recursive and recursive parts of the IIR filter are realized separately, it doesn't matter
which of them is used first in the filtering process. Figures 4a and 4b illustrate block
diagrams of IIR filter realization when the non-recursive part is used before and after the
recursive part of the IIR filter, respectively.

Figure 4a. IIR filter direct realization, non-recursive part is used first

Figure 4b. IIR filter direct realization, recursive part is used first
This structure is also known as the direct form I structure. As seen from Figures 4a and
4b, direct realization requires a total of 2N delay lines, (2N+1) multiplications and 2N
additions.
Direct realization is very convenient for software implementation and this is where it is
most commonly used. Some of the disadvantages of this realization are the greatest
sensitivity to the accuracy of the realized coefficients (i.e. the largest finite word-length effect)
and the greatest implementation complexity (i.e. it needs the most resources).
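As a rough illustration, the following Python sketch realizes the direct form I structure with separate delay lines for the non-recursive and recursive parts. It is a minimal sketch assuming the transfer-function convention above (denominator written as 1 + a1·z^-1 + ... + aN·z^-N); the function name and the example coefficients are made up for illustration.

```python
def direct_form_1(x, b, a):
    """Direct form I: y[n] = sum(b[k]*x[n-k]) - sum(a[k]*y[n-k]).
    b holds the non-recursive coefficients b0..bN; a holds the feedback
    coefficients a1..aN (a0 = 1 is implied)."""
    x_delay = [0.0] * len(b)   # delay line of the non-recursive part
    y_delay = [0.0] * len(a)   # delay line of the recursive part
    y = []
    for sample in x:
        x_delay = [sample] + x_delay[:-1]                   # shift input delay line
        acc = sum(bk * xk for bk, xk in zip(b, x_delay))    # non-recursive part
        acc -= sum(ak * yk for ak, yk in zip(a, y_delay))   # recursive (feedback) part
        y_delay = [acc] + y_delay[:-1]                      # shift output delay line
        y.append(acc)
    return y

# Impulse response of a toy first-order filter
print(direct_form_1([1.0, 0.0, 0.0, 0.0], b=[0.5, 0.5], a=[-0.9]))
```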
Direct Canonical Realization or Direct Form - II
The direct canonic realization structure reduces the number of delay lines to the
minimum, that is, N delay lines. This way, one of the main disadvantages of the direct and
direct transpose realization structures is eliminated. The recursive and non-recursive parts
of the IIR filter are not considered separately, which makes the implementation somewhat more
complex than for the direct realization structure. A good thing is that the coefficients are the
same as for direct realization.
Figure 5 illustrates the block diagram describing the direct canonic realization structure of an
IIR filter.

Figure 5 Direct canonic realization structure block diagram

The similarities between the direct canonic structure block diagram and the direct realization
structure shown in Figure 4b are obvious. The difference between the realization structures
shown in Figures 4b and 5 is that the non-recursive and recursive parts of the direct canonic
realization structure cannot be treated separately, although it is easy to differentiate
between them.
The direct canonic structure uses N delay elements, (2N+1) multiplications and 2N
additions. Sensitivity to the accuracy of coefficients is the same as for all previously
described structures, which is the main disadvantage of this realization structure.
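A comparable sketch of the direct canonic (direct form II) structure follows; it uses a single shared delay line of length N and should produce the same output as the direct form I sketch above for the same hypothetical coefficients.

```python
def direct_form_2(x, b, a):
    """Direct form II (canonic): w[n] = x[n] - sum(a[k]*w[n-k]),
    y[n] = sum(b[k]*w[n-k]); a single delay line w is shared by both parts."""
    w = [0.0] * (len(a) + 1)   # w[0] holds the current intermediate value
    y = []
    for sample in x:
        w[0] = sample - sum(ak * wk for ak, wk in zip(a, w[1:]))  # recursive part
        y.append(sum(bk * wk for bk, wk in zip(b, w)))            # non-recursive part
        w = [0.0] + w[:-1]                                        # shift the shared delay line
    return y

print(direct_form_2([1.0, 0.0, 0.0, 0.0], b=[0.5, 0.5], a=[-0.9]))  # same output as above
```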
Cascade Realization
The cascade realization structure is the most difficult to obtain from the transfer function. It
is very convenient because of its modular structure and its lower sensitivity to the accuracy of the
non-recursive and recursive coefficients. In cascade IIR filter realization, a filter
is divided into several mutually independent sections of the first or second order.
Individual sections are mostly realized in direct canonic or direct transpose canonic
structure.
Since the sections are mutually independent after the design process, the finite word-length
effect on the accuracy of coefficients, the deviation of the frequency response and the IIR filter
stability are examined separately for each section. The analysis is simplified this way.
The IIR filter transfer function is expressed as:

H(z) = B(z) / A(z) = (b0 + b1·z^-1 + ... + bN·z^-N) / (1 + a1·z^-1 + ... + aN·z^-N)
     = H0 · (1 - q1·z^-1)·(1 - q2·z^-1)· ... ·(1 - qN·z^-1) / [(1 - p1·z^-1)·(1 - p2·z^-1)· ... ·(1 - pN·z^-1)]
where:

bi are the coefficients of transfer function numerator (non-recursive part);

aj are the coefficients of transfer function denominator (recursive part);

H0 is a constant;

qi are the zeros of the transfer function;

pj are the poles of the transfer function;

B(z) is the transfer function of non-recursive part;

A(z) is the transfer function of recursive part (feedback); and

M is the number of sections in cascade realization structure.

Cascade realization requires the given expression to be factorized so that the transfer
function is expressed as a product of first- and second-order sections:

H(z) = H1(z) · H2(z) · ... · HM(z),
Hi(z) = (b[i,0] + b[i,1]·z^-1 + b[i,2]·z^-2) / (1 + a[i,1]·z^-1 + a[i,2]·z^-2)

where:
a[i, k] are the coefficients of the recursive part of the ith IIR filter section;
b[i, k] are the coefficients of the non-recursive part of the ith IIR filter section.
Individual sections are of the first or second order. Direct transpose canonical structure
is most frequently used in realization. Figure 6 illustrates a first-order section. Figure 7
illustrates a second-order section.

Figure 6 First-order section

Figure 7 Second-order section


The use of the direct transpose realization structure reduces the necessary number of delay
lines and adders as well. Dividing the filter into independent sections reduces the sensitivity
to the accuracy of the quantized coefficients and simplifies analysing the stability of the
resulting filter. Besides, the possibility that the IIR filter becomes unstable after quantization
is drastically reduced, because the coefficient quantization is performed after dividing the filter into
sections, so the resulting changes of the pole locations are smaller.
Software realization requires M buffers of length 2 or 1, as each section must have its own
buffer for saving the samples of intermediate signals. This complexity and the required
factorization are the two main disadvantages of this realization structure. Figure 8 illustrates
the block diagram describing the cascade IIR filter structure.

Figure 8 Cascade IIR filter structure
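A minimal sketch of the idea in Python, assuming the filter has already been factorized into second-order sections (the section coefficients below are placeholders, not a real design); each section is run in direct canonic form with its own two-element buffer.

```python
def biquad_df2(x, b, a):
    """One second-order section, direct canonic (direct form II) structure.
    b = (b0, b1, b2) non-recursive, a = (a1, a2) recursive coefficients."""
    w1 = w2 = 0.0                                    # the section's own delay buffer
    y = []
    for sample in x:
        w0 = sample - a[0] * w1 - a[1] * w2          # recursive part of the section
        y.append(b[0] * w0 + b[1] * w1 + b[2] * w2)  # non-recursive part
        w2, w1 = w1, w0                              # shift the buffer
    return y

def cascade(x, sections):
    """Pass the signal through mutually independent sections, one after another."""
    for b, a in sections:
        x = biquad_df2(x, b, a)
    return x

# Two hypothetical sections
sections = [((0.2, 0.4, 0.2), (-0.5, 0.1)),
            ((0.3, 0.6, 0.3), (-0.4, 0.2))]
print(cascade([1.0, 0.0, 0.0, 0.0], sections))
```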

Parallel Form
A direct partial-fraction expansion of the transfer function in z leads to the parallel form
structure. Assuming simple poles pk, the transfer function H(z) can be expressed as

H(z) = C + A1/(1 - p1·z^-1) + A2/(1 - p2·z^-1) + ... + AN/(1 - pN·z^-1)

where C is a constant and Ak are the residues at the poles pk. The two basic parallel
realizations of a 3rd-order IIR transfer function are shown in Figure 9.

Figure 9 Parallel form realization Structure
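A sketch of the parallel structure for real, simple poles: each branch A_k / (1 - p_k·z^-1) is realized as an independent first-order section and the branch outputs are summed. The residues and poles below are invented values, and complex-conjugate pole pairs would normally be combined into real second-order branches.

```python
def parallel_form(x, C, branches):
    """Parallel realization: H(z) = C + sum of A_k / (1 - p_k * z^-1).
    branches is a list of (A_k, p_k) pairs; each branch keeps its own state."""
    states = [0.0] * len(branches)          # previous internal value of each branch
    y = []
    for sample in x:
        out = C * sample                    # direct (constant) path
        for i, (A, p) in enumerate(branches):
            states[i] = sample + p * states[i]   # first-order recursion of the branch
            out += A * states[i]                 # weighted branch output
        y.append(out)
    return y

# Hypothetical residues and real poles
print(parallel_form([1.0, 0.0, 0.0, 0.0], C=0.1, branches=[(0.5, 0.9), (0.4, -0.3)]))
```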

Fixed-point Representation
One possibility for handling numbers with fractional parts is to add bits after the decimal
point: The first bit after the decimal point is the halves place, the next bit the quarters
place, the next bit the eighths place, and so on.

Suppose that we want to represent 1.625(10). We would want 1 in the ones place,
leaving us with 0.625. Then we want 1 in the halves place, leaving us with
0.625 - 0.5 = 0.125. No quarters will fit, so put a 0 there. We want a 1 in the eighths
place, and we subtract 0.125 from 0.125 to get 0.

So the binary representation of 1.625 would be 1.101(2).


The idea of fixed-point representation is to split the bits of the representation between
the places to the left of the decimal point and places to the right of the decimal point.
For example, a 32-bit fixed-point representation might allocate 24 bits for the integer
part and 8 bits for the fractional part.

To represent 1.625, we would use the first 24 bits to indicate 1, and we'd use the
remaining 8 bits to represent 0.625. Thus, our 32-bit representation would be:
00000000 00000000 00000001 10100000.
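A small Python sketch of this 24.8 split (24 integer bits, 8 fractional bits): the value is simply stored as an integer number of 1/256 steps. The helper names are made up for illustration.

```python
FRACTION_BITS = 8                          # 8 bits after the binary point -> step of 1/256

def to_fixed(value):
    """Encode a real number as the nearest integer multiple of 2**-8."""
    return round(value * (1 << FRACTION_BITS))

def from_fixed(raw):
    """Decode the stored integer back into a real value."""
    return raw / (1 << FRACTION_BITS)

raw = to_fixed(1.625)
print(raw, format(raw, '032b'))   # 416 -> 00000000000000000000000110100000
print(from_fixed(raw))            # 1.625 (exactly representable in this format)
```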
Fixed-point representation works reasonably well as long as you work with numbers
within the supported range. The 32-bit fixed-point representation described above can
represent any multiple of 1/256 from 0 up to 2^24, which is about 16.7 million. But programs frequently
need to work with numbers from a much broader range. For this reason, fixed-point
representation isn't used very often in today's computing world.
A notable exception is financial software. Here, all computations must be represented
exactly to the penny, and in fact further precision is rarely desired, since results are
always rounded to the penny. Moreover, most applications have no requirement for
large amounts (like trillions of dollars), so the limited range of fixed-point numbers isn't
an issue. Thus, programs typically use a variant of fixed-point representation that
represents each amount as an integer multiple of 1/100, just as the fixed-point
representation described above represents each number as a multiple of 1/256.
Normalized floating-point
Floating-point representation is an alternative technique based on scientific notation.
Floating-point basics
Though we'd like to use scientific notation, we'll base our scientific notation on powers
of 2, not powers of 10, because we're working with computers that prefer binary. For
example, 5.5(10) is 101.1(2) in binary, and it converts to a binary scientific notation of
1.011(2) x 2^2. (In converting to binary scientific notation here, we moved the decimal point
to the left two places, just as we would in converting 101.1(10) to scientific notation: it
would be 1.011(10) x 10^2.)
Once we have a number in binary scientific notation, we still must have a technique for
mapping that into a set of bits. First, let us define the two parts of scientific
representation: in 1.011(2) x 2^2, we call 1.011(2) the mantissa (or the significand), and
we call 2 the exponent. In this section we'll use 8 bits to store such a number.

We use the first bit to represent the sign (1 for negative, 0 for positive), the next four bits
for the sum of 7 and the actual exponent (we add 7 to allow for negative exponents),

and the last three bits for the mantissa's fractional part. Note that we omit the integer
part of the mantissa: Since the mantissa must have exactly one nonzero bit to the left of
its decimal point, and the only nonzero bit is 1, we know that the bit to the left of the
decimal point must be a 1. There's no point in wasting space in inserting this 1 into our
bit pattern, so we include only the bits of the mantissa to the right of the decimal point.
For our example of 5.5(10) = 1.011(2) x 2^2, we add 7 to 2 to arrive at 9(10) = 1001(2) for the
exponent bits. Into the mantissa bits we place the bits following the decimal point of the
scientific notation, 011. This gives us 0 1001 011 as the 8-bit representation of 5.5(10).
We call this floating-point representation because the values of the mantissa bits float
along with the decimal point, based on the exponent's given value. This is in contrast to
fixed-point representation, where the decimal point is always in the same place among
the bits given.
Suppose we want to represent -96(10).
a. First we convert the magnitude of our desired number to binary: 96(10) = 1100000(2).
b. Then we convert this to binary scientific notation: 1.100000(2) x 2^6.
c. Then we fit this into the bits.
1. We choose 1 for the sign bit since the number is negative.
2. We add 7 to the exponent and place the result into the four exponent bits.
For this example, we arrive at 6 + 7 = 13(10) = 1101(2).
3. The three mantissa bits are the first three bits following the leading 1: 100.
(If it happened that there were 1 bits beyond the 1/8's place, we would
need to round the mantissa to the nearest eighth.)
Thus we end up with 1 1101 100.
Conversely, suppose we want to decode the number 0 0101 100.

1. We observe that the number is positive, and the exponent bits represent
0101(2) = 5(10). This is 7 more than the actual exponent, and so the actual
exponent must be -2. Thus, in binary scientific notation, we have 1.100(2) x 2^-2.
2. We convert this to binary: 1.100(2) x 2^-2 = 0.011(2).
3. We convert the binary into decimal: 0.011(2) = 1/4 + 1/8 = 3/8 = 0.375(10).
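The hand calculations above can be reproduced with a short Python sketch of this 8-bit format (1 sign bit, 4 exponent bits with a bias of 7, 3 mantissa bits with a hidden leading 1). It is only meant to mirror the description here; it ignores special cases such as zero and out-of-range values.

```python
import math

BIAS = 7          # added to the actual exponent
MANT_BITS = 3     # stored fractional mantissa bits (the leading 1 is implicit)

def encode(value):
    """Pack value as sign | 4 exponent bits | 3 mantissa bits (round to nearest eighth)."""
    sign = 1 if value < 0 else 0
    mag = abs(value)
    exp = math.floor(math.log2(mag))                   # so that 1 <= mag / 2**exp < 2
    frac = round((mag / 2**exp - 1.0) * 2**MANT_BITS)  # fractional mantissa bits
    if frac == 2**MANT_BITS:                           # rounding spilled into the next power of 2
        frac, exp = 0, exp + 1
    return (sign << 7) | ((exp + BIAS) << MANT_BITS) | frac

def decode(bits):
    """Unpack an 8-bit pattern back into its value."""
    sign = -1 if (bits >> 7) & 1 else 1
    exp = ((bits >> MANT_BITS) & 0b1111) - BIAS
    mantissa = 1.0 + (bits & 0b111) / 2**MANT_BITS     # restore the hidden leading 1
    return sign * mantissa * 2**exp

print(format(encode(5.5), '08b'))   # 01001011  ->  0 1001 011
print(format(encode(-96), '08b'))   # 11101100  ->  1 1101 100
print(decode(0b00101100))           # 0.375
print(decode(encode(51)))           # 52.0, the rounding effect discussed below
```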
Representable numbers
This 8-bit floating-point format can represent a wide range of both small numbers and
large numbers. To find the smallest possible positive number we can represent, we
would want the sign bit to be 0, we would place 0 in all the exponent bits to get the
smallest exponent possible, and we would put 0 in all the mantissa bits. This gives us
0 0000 000, which represents
1.000(2) x 2^(0 - 7) = 2^-7, which is about 0.0078(10).
To determine the largest positive number, we would want the sign bit still to be 0, we
would place 1 in all the exponent bits to get the largest exponent possible, and we
would put 1 in all the mantissa bits. This gives us 0 1111 111, which represents
1.111(2) x 2^(15 - 7) = 1.111(2) x 2^8 = 111100000(2) = 480(10).
Thus, our 8-bit floating-point format can represent positive numbers from about
0.0078(10) to 480(10). In contrast, the 8-bit two's-complement representation can only
represent positive numbers between 1 and 127.
But notice that the floating-point representation can't represent all of the numbers in its
range; this would be impossible, since eight bits can represent only 2^8 = 256 distinct
values, and there are infinitely many real numbers in the range to represent. What's
going on? Let's consider how to represent 51(10) in this scheme. In binary, this is
110011(2) = 1.10011(2) x 2^5. When we try to fit the mantissa into the 3-bit portion of our
scheme, we find that the last two bits won't fit: we would be forced to round to
1.101(2) x 2^5, and the resulting bit pattern would be 0 1100 101. That rounding means
that we're not representing the number precisely. In fact, 0 1100 101 translates to
1.101(2) x 2^(12 - 7) = 1.101(2) x 2^5 = 110100(2) = 52(10).
Thus, in our 8-bit floating-point representation, 51 equals 52! That's pretty irritating, but
it's a price we have to pay if we want to be able to handle a large range of numbers with
such a small number of bits.
(By the way, in rounding numbers that are exactly between two possibilities, the typical
policy is to round so that the final mantissa bit is 0. For example, taking the number 19,
we end up with 1.0011(2) x 2^4, and we would round this up to 1.010(2) x 2^4 = 20(10). On
the other hand, rounding up the number 21 = 1.0101(2) x 2^4 would lead to 1.011(2) x 2^4,
leaving a 1 in the final bit of the mantissa, which we want to avoid; so instead we round
down to 1.010(2) x 2^4 = 20(10). Doubtless you were taught in grade school to round up
all the time; computers don't do this because if we consistently round up, all those
roundings will bias the total of the numbers upward. Rounding so the final bit is 0 ensures
that exactly half of the possible numbers round up and exactly half round down.)
While a floating-point representation can't represent all numbers precisely, it does give
us a guaranteed number of significant digits. For this 8-bit representation, we get a
single digit of precision, which is pretty limited. To get more precision, we need more
mantissa bits. Suppose we defined a similar 16-bit representation with 1 bit for the sign
bit, 6 bits for the exponent plus 31, and 9 bits for the mantissa.

This representation, with its 9 mantissa bits, happens to provide three significant digits.
Given a limited length for a floating-point representation, we have to compromise
between more mantissa bits (to get more precision) and more exponent bits (to get a
wider range of numbers to represent). For 16-bit floating-point numbers, the 6-and-9
split is a reasonable tradeoff of range versus precision.

Finite word-length effects


There are hardware and software FIR filter realizations. Regardless of which of them is
used, a problem known as the finite word-length effect exists in either case. One of the
objectives when designing filters is to lessen the finite word-length effects as much as
possible, so that the initial requirements (the filter specifications) are still satisfied.
In software filter implementation, it is possible to use either fixed-point or floating-point
arithmetic. Both representations of numbers have advantages and disadvantages.
The fixed-point representation is used for saving coefficients and samples in memory.
The most commonly used fixed-point format is one in which one bit denotes the sign of the
number, i.e. 0 denotes a positive and 1 a negative number, and the remaining bits denote
the value of the number. This format is mostly used to represent numbers in the range -1 to +1.
Numbers represented in the fixed-point format are equidistantly quantized with the
quantization step 2^-(N-1), where N is the number of bits used for saving the value. As
one bit is a sign bit, there are N-1 bits available for value quantization. The maximum
error that may occur during quantization is one half of the quantization step, that is, 2^-N. It can be
noted that accuracy increases as the number of bits increases. Table 1 shows the
values of the quantization steps and the maximum errors made due to the quantization process in
the fixed-point representation.
Table 1 Quantization of numbers represented in the fixed-point format

Bit number | Range of numbers | Quantization step        | Max. quantization error  | Number of exact decimal digits
4          | (-1, +1)         | 0.125                    | 0.0625                   | ...
8          | (-1, +1)         | 0.0078125                | 0.00390625               | ...
16         | (-1, +1)         | 3.0517578125*10^-5       | 1.52587890625*10^-5      | ...
32         | (-1, +1)         | 4.6566128730774*10^-10   | 2.3283064365387*10^-10   | ...
64         | (-1, +1)         | 1.0842021724855*10^-19   | 5.4210108624275*10^-20   | 19

The advantage of this representation is that the quantization errors tend to average to zero,
which means that errors do not accumulate in operations performed upon fixed-point
numbers. One of its disadvantages is lower accuracy in the coefficient representation.
The difference between the actual sampled value and the quantized value, i.e. the quantization
error, becomes smaller as the quantization step decreases; in that case the effects of the
quantization error are negligible.
Floating-point arithmetic saves values with better accuracy owing to the dynamic range it
provides. Floating-point representations cover a much wider range of numbers and
enable an appropriate number of digits to be faithfully saved. The value normally
consists of three parts. The first part is, similarly to the fixed-point format, represented by
one bit known as the sign bit. The second part is a mantissa M, which is the fractional part
of the number, and the third part is an exponent E, which can be either positive or
negative. A number in the floating-point format looks as follows:

x = (-1)^S x M x 2^E

where M is the mantissa, E is the exponent and S is the sign bit.


As seen, the sign bit along with the mantissa forms a fixed-point number. The third part,
i.e. the exponent, provides the floating-point representation with dynamic range, which further
enables both extremely large and extremely small numbers to be saved with
appropriate accuracy. Such numbers could not be represented in the fixed-point format.

Table 2 below provides the basic information on floating-point representation for several
different lengths.
Table 2 Quantization of numbers represented in the floating-point format

Bit number | Mantissa size | Exponent size | Range of numbers                  | Number of exact decimal digits
16         | ...           | ...           | ...                               | ...
32         | 23            | 8             | about 1.4*10^-45 to 3.4*10^38     | 6-7
It is not possible to determine a single quantization step in the floating-point representation,
as it depends on the exponent. The exponent varies so that the quantization step is as
small as possible. In this number representation, special attention should be paid to the
number of digits that are saved with no error.
Floating-point arithmetic is suitable for coefficient representation. The errors made in
this case are considerably smaller than those made in fixed-point arithmetic. Some of the
disadvantages of this representation are its more complex implementation and errors that do not
tend to average to zero. The problem is especially obvious when an operation is
performed upon two values of which one is much smaller than the other.
Example
FIR filter coefficients:
{0.151365, 0.400000, 0.151365}
Coefficients need to be represented as 16-bit numbers in the fixed-point and floating-point
formats. If we suppose that numbers range between -1 and +1, then the quantization
level amounts to 1 / 2^16 = 0.0000152587890625. After quantization, the filter
coefficients have the following values:
{0.1513671875, 0.399993896484375, 0.1513671875}
Quantization errors are:
{-0.0000021875, 0.000006103515625, -0.0000021875}
If the filter coefficients are represented in the floating-point format, it is not possible to
determine the quantization level. In this case, the coefficients have the following values:
{0.151364997029305, 0.400000005960464, 0.151364997029305}
Quantization errors produced while representing coefficients as 16-bit numbers in the
floating-point format are:
{0.000000002970695, -0.000000005960464, 0.000000002970695}
As seen, a coefficient error is less in the floating-point representation.
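The fixed-point values in this example can be checked with a short sketch that rounds each coefficient to the nearest multiple of 2^-16, the quantization level used above:

```python
STEP = 2 ** -16     # quantization level used in the example

coefficients = [0.151365, 0.400000, 0.151365]

quantized = [round(c / STEP) * STEP for c in coefficients]   # nearest multiple of 2^-16
errors = [c - q for c, q in zip(coefficients, quantized)]

print(quantized)   # [0.1513671875, 0.399993896484375, 0.1513671875]
print(errors)      # approximately [-2.1875e-06, 6.1035e-06, -2.1875e-06]
```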
Floating-point arithmetic can also be emulated in terms of fixed-point arithmetic. For
this reason, fixed-point arithmetic is more often implemented in digital signal
processors.
The finite word-length effect is the resulting deviation of the FIR filter characteristic. If such a
characteristic still meets the filter specifications, the finite word-length effects are
negligible.
As a result of the greater error in coefficient representation, the finite word-length effects
are more prominent in fixed-point arithmetic.
These effects are more prominent for IIR filters, because of their feedback, than for FIR
filters. In addition, coefficient quantization can cause IIR filters to become unstable,
whereas it cannot affect FIR filters that way.
FIR filters keep their linear phase characteristic after quantization. The reason for this is
the fact that the coefficients of a FIR filter with linear phase characteristic are symmetric,

which means that the corresponding pairs of coefficients will be quantized to the same
value. It results in the impulse response symmetry remaining unchanged.
From all of the above, it is easy to notice that the finite word length used for representing
coefficients and the samples being processed causes some problems such as:
Coefficient quantization errors;
Sample quantization errors (quantization noise); and
Overflow errors.
Coefficient Quantization
Coefficient quantization results in the FIR filter changing its transfer function. The
positions of the FIR filter zeros are also changed, whereas the positions of its poles remain
unchanged, as they are all located at z = 0 and quantization has no effect on them. The
conclusion is that quantization of FIR filter coefficients cannot cause a filter to become
unstable, as is the case with IIR filters.
Even though there is no danger of FIR filter destabilization, it may happen that the transfer
function deviates to such an extent that it no longer meets the specifications, which
further means that the resulting filter is not suitable for the intended implementation.
The FIR filter quantization errors cause the stop-band attenuation to become lower. If it
drops below the limit defined by the specifications, the resulting filter is useless.
Transfer function changes occurring due to FIR filter coefficient quantization are more
pronounced for high-order filters. The reason for this is the fact that the spacing between the zeros
of the transfer function gets smaller as the filter order increases, so even slight changes
of the zero positions affect the FIR filter frequency response.
Samples Quantization
Another problem caused by the finite word length is sample quantization performed at the
multiplier output (after filtering). The process of filtering can be represented as a sum
of multiplications performed upon the filter coefficients and the signal samples appearing at the
filter input. Figure 2-5-1 illustrates the block diagram of input signal filtering and quantization
of the result.

Figure 2-5-1. Signal samples filtering


Multiplication of two numbers, each N bits in length, gives a product which is 2N bits
in length. These extra N bits are not necessary, so the product has to be truncated or
rounded off to N bits, producing truncation or round-off errors. The latter is preferred
in practice because in this case the mean value of the quantization error
(quantization noise) is equal to 0.
In most cases, the hardware used for FIR filter realization is designed so that after each
individual multiplication, a partial sum is accumulated in a register which is 2N bits in length.
Only when the filtering process ends is the result quantized to N bits, so quantization
noise is introduced only once and is thus drastically reduced.
Quantization noise depends on the number of bits N. The quantization noise is reduced
as the number of bits used for sample and coefficient representation increases.
Both filter realization and position of poles affect the quantization noise power. As all
FIR filter poles are located in z=0, the effect of filter realization on the quantization noise
is almost negligible.
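The following sketch illustrates this accumulation scheme numerically in Python, using a Q15-style fixed-point format (1 sign bit, 15 fractional bits). The coefficient and sample values are made up; the point is that the double-width products are summed exactly and rounded back to N bits only once.

```python
N = 16
SCALE = 1 << (N - 1)    # Q15 scaling: values are integer multiples of 2**-15

def q(value):
    """Quantize a value from (-1, 1) to an N-bit fixed-point integer."""
    return round(value * SCALE)

coeffs  = [q(0.31), q(0.45), q(0.21)]
samples = [q(0.33), q(-0.70), q(0.59)]

# Each product is 2N bits wide; accumulate without intermediate rounding ...
acc = sum(c * s for c, s in zip(coeffs, samples))
result_once = round(acc / SCALE)        # ... and quantize to N bits only at the end

# For comparison: rounding after every multiplication adds noise at each step
result_each = sum(round(c * s / SCALE) for c, s in zip(coeffs, samples))

print(result_once / SCALE, result_each / SCALE)
```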
Overflow
Overflow happens when some intermediate results exceed the range of numbers that
can be represented by the given word-length. For the fixed-point arithmetic, coefficients
and samples values are represented in the range -1 to +1. In spite of the fact that both
FIR filter input and output samples are in the given range, there is a possibility that an
overflow occurs at some point when the results of multiplications are added together. In
other words, an intermediate result is greater than 1 or less than -1.

Example:
Assume that we need to filter input samples using a second-order filter.
Such a filter has three coefficients. These are: {0.7, 0.8, 0.7}.
The input samples are: { ..., 0.9, 0.7, 0.1, ...}
By analyzing the steps of the input sample filtering process, shown in Table 3 below,
it is easy to understand how an overflow occurs in the second step. The final sum is also
greater than 1.
Table 3 Overflow

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1.19
0.7                 | 0.1          | 1.19 + 0.07 = 1.26

As the range of values defined by the fixed-point representation is between -1 and +1,
the results of the filtering process will be as shown in Table 4.

Table 4 Overflow effects

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 - 2 = -0.81
0.7                 | 0.1          | -0.81 + 0.07 = -0.74

As mentioned, an overflow occurs in the second step. Instead of the desired value +1.19,
the result is the undesirable negative value -0.81. The difference of 2 between these
two values is explained in Figure 10.

Figure 10 Signal samples filtering
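A small sketch of how the values in Table 4 arise: in two's-complement fixed-point arithmetic the accumulator is effectively kept modulo 2, so any sum that leaves the range [-1, +1) reappears shifted by 2. The coefficients and samples are the ones from the example above.

```python
def wrap(value):
    """Model two's-complement wraparound onto the range [-1, +1)."""
    return ((value + 1.0) % 2.0) - 1.0

coeffs  = [0.7, 0.8, 0.7]
samples = [0.9, 0.7, 0.1]

acc = 0.0
for c, s in zip(coeffs, samples):
    acc = wrap(acc + c * s)    # every intermediate sum wraps if it overflows
    print(acc)                 # approximately 0.63, -0.81, -0.74
```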


However, if some intermediate result exceeds the range of representation, it does not
necessarily cause an overflow in the final result, provided the absolute value of the final
result is less than 1. In other words, as long as the final result fits within the word length,
an overflow of partial results is not essential. This situation is illustrated in the
following example.
Example:
The second-order filter has three coefficients. These are: {0.7, 0.8, 0.7}
Input samples are: { ..., 0.9, 0.7, -0.5, ...}
The desired intermediate results are given in Table 5.

Table 5 Desired intermediate results

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1.19
0.7                 | -0.5         | 1.19 - 0.35 = 0.84

As seen, some intermediate results exceed the given range and two overflows occur.
Refer to Table 6.

Table 6 Obtained intermediate results

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 - 2 = -0.81
0.7                 | -0.5         | -0.81 - 0.35 + 2 = 0.84

So, in spite of the fact that two overflows have occurred, the final result remains
unchanged. The reason for this is the nature of these two overflows: the first one
decremented the result by 2, whereas the second one incremented it by 2, so the
overflow effect is cancelled. The first overflow is called a
positive overflow, whereas the latter is called a negative overflow.
Note:
If the number of positive overflows is equal to the number of negative overflows, the
final result will not be changed, i.e. the overflow effect is canceled.

Overflow causes rapid oscillations in the output signal, which further causes high-frequency
components to appear in the output spectrum. There are several ways to
lessen the overflow effects; the two most commonly used are scaling and saturation.
It is possible to scale the FIR filter coefficients to avoid overflow. A necessary and sufficient
condition on the FIR filter coefficients in this case is given by the following
expression:

|b0| + |b1| + ... + |b(N-1)| <= 1

where:
bk are the FIR filter coefficients; and
N is the number of filter coefficients.
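A quick sketch of this check in Python: if the sum of the absolute coefficient values exceeds 1, dividing every coefficient by that sum guarantees that no sum of products can leave the [-1, +1] range for input samples in that range (the coefficients are the ones from the overflow example):

```python
coeffs = [0.7, 0.8, 0.7]

gain = sum(abs(b) for b in coeffs)          # worst-case output magnitude for |x| <= 1
if gain > 1.0:
    coeffs = [b / gain for b in coeffs]     # scaled set satisfies sum(|b_k|) <= 1

print(coeffs, sum(abs(b) for b in coeffs))  # the scaled coefficients now sum to 1.0
```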


If, for any reason, it is not possible to apply scaling, then the overflow effects can be
lessened to some extent via saturation. Figure 11 illustrates the saturation
characteristic.

Figure 11 Saturation characteristic

When the saturation characteristic is used to prevent an overflow, the intermediate
result doesn't change its sign. For this reason, the oscillations in the output signal are
not so rapid and undesirable high-frequency components are attenuated.
Let's see what happens if we apply the saturation characteristic to the previous
example:
Example
Again, we need to filter the input samples using a second-order filter.
Such a filter has three coefficients. These are: {0.7, 0.8, 0.7}.
The input samples are: { ..., 0.9, 0.7, 0.1, ...}
The desired intermediate results are shown in Table 7.
Table 7 Desired intermediate results

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1.19
0.7                 | 0.1          | 1.19 + 0.07 = 1.26

As the range of values defined by the fixed-point representation is between -1 and +1,
and the saturation characteristic is used as well, the intermediate results are as shown
in Table 8.

Table 8 Intermediate results with the saturation characteristic

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1.19 -> saturated to 1
0.7                 | 0.1          | 1 + 0.07 = 1.07 -> saturated to 1

The resulting sum is not correct, but the difference is far smaller than when there is no
saturation:
Without saturation: 1.26 - (-0.74) = 2.00
With saturation: 1.26 - 1 = 0.26
As seen from the example above, the saturation characteristic lessens an overflow
effect and attenuates undesirable components in the output spectrum.
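The saturated results in Table 8 can be reproduced with a sketch that clips every intermediate sum to the representable range instead of letting it wrap:

```python
def saturate(value, limit=1.0):
    """Clip a value to the representable range [-limit, +limit]."""
    return max(-limit, min(limit, value))

coeffs  = [0.7, 0.8, 0.7]
samples = [0.9, 0.7, 0.1]

acc = 0.0
for c, s in zip(coeffs, samples):
    acc = saturate(acc + c * s)   # intermediate result is clipped, not wrapped
    print(acc)                    # approximately 0.63, 1.0, 1.0
```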

A round-off error, also called rounding error, is the difference between the calculated
approximation of a number and its exact mathematical value due to rounding. This is a
form of quantization error. One of the goals of numerical analysis is to estimate errors in
calculations, including round-off error, when using approximation equations and/or
algorithms, especially when using finitely many digits to represent real numbers (which
in theory have infinitely many digits).
When a sequence of calculations subject to rounding error is made, errors may
accumulate, sometimes dominating the calculation. In ill-conditioned problems,
significant error may accumulate.
Representation error
The error introduced by attempting to represent a number using a finite string of digits is
a form of round-off error called representation error. Here are some examples of
representation error in decimal representations:
Notation | Representation                  | Approximation         | Error
1/7      | 0.142 857 142 857...            | 0.142 857             | 0.000 000 142 857...
ln 2     | 0.693 147 180 559 945 309 41... | 0.693 147             | 0.000 000 180 559 945 309 41...
log10 2  | 0.301 029 995 663 981 195 21... | 0.3010                | 0.000 029 995 663 981 195 21...
sqrt(2)  | 1.414 213 562 373 095 048 80... | 1.41421               | 0.000 003 562 373 095 048 80...
e        | 2.718 281 828 459 045 235 36... | 2.718 281 828 459 045 | 0.000 000 000 000 000 235 36...
pi       | 3.141 592 653 589 793 238 46... | 3.141 592 653 589 793 | 0.000 000 000 000 000 238 46...

Increasing the number of digits allowed in a representation reduces the magnitude of


possible round-off errors, but any representation limited to finitely many digits will still
cause some degree of round-off error for uncountably many real numbers. Additional
digits used for intermediary steps of a calculation are known as guard digits.
Rounding multiple times can cause error to accumulate. For example, if 9.945309 is
rounded to two decimal places (9.95), then rounded again to one decimal place (10.0),
the total error is 0.054691. Rounding 9.945309 to one decimal place (9.9) in a single

step introduces less error (0.045309). This commonly occurs when performing
arithmetic operations.
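The double-rounding arithmetic above can be verified with Python's decimal module (decimal arithmetic is used here because 9.95 is not exactly representable as a binary float):

```python
from decimal import Decimal, ROUND_HALF_UP

x = Decimal("9.945309")

step1 = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)      # 9.95
twice = step1.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)   # 10.0 after the second rounding
once  = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)       # 9.9 when rounded in a single step

print(twice, abs(twice - x))   # 10.0, total error 0.054691
print(once,  abs(once - x))    # 9.9,  error 0.045309
```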
