The transfer function of an IIR filter is:

H(z) = (b0 + b1·z^-1 + ... + bN·z^-N) / (1 + a1·z^-1 + ... + aN·z^-N)

where N is the filter order, bk are the coefficients of the non-recursive part of the IIR filter, and ak are the coefficients of the recursive part (feedback) of the IIR filter.
The coefficients bk and ak are of primary interest for IIR filter realization (both hardware and software). Figure 1 illustrates the block diagram of an IIR filter. There are several realization structures; some of them are very convenient and most commonly used for both hardware and software IIR filter realization. Each of them is described in detail below, along with its advantages and disadvantages.
Direct realization
Direct realization of IIR filters starts from the difference equation:

y(n) = b0·x(n) + b1·x(n-1) + ... + bN·x(n-N) - a1·y(n-1) - ... - aN·y(n-N)

The first group of terms is the non-recursive part and the second group is the recursive part of the IIR filter. In IIR filter direct realization, these two parts are considered and realized separately.
Direct Form - I
The realization of the non-recursive part of an IIR filter is identical to the direct realization of an FIR filter. Figure 2 illustrates the block diagram of the direct realization of the non-recursive part of an IIR filter.
Figure 4a. IIR filter direct realization, non-recursive part is used first
Figure 4b. IIR filter direct realization, recursive part is used first
This structure is also known as the direct form I structure. As seen from Figures 4a and 4b, direct realization requires a total of 2N delay elements, (2N+1) multiplications and 2N additions.
Direct realization is very convenient for software implementation, which is where it is most commonly used. Its disadvantages are the greatest sensitivity to the accuracy of the realized coefficients (i.e. the largest finite word-length effect) and the greatest implementation complexity (i.e. it requires the most resources).
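As an illustration of how direct form I maps to software, here is a minimal C sketch (the coefficients, function and variable names are illustrative, not taken from the text) that computes one output sample from the difference equation above, using separate delay lines for past inputs and past outputs:

```c
#include <stdio.h>

#define N 2  /* filter order (illustrative) */

/* Direct form I: separate delay lines for past inputs and past outputs,
   computing y(n) = b0*x(n) + ... + bN*x(n-N) - a1*y(n-1) - ... - aN*y(n-N). */
static double df1_sample(const double b[N + 1], const double a[N + 1],
                         double x_hist[N + 1], double y_hist[N], double x_n)
{
    /* shift the input delay line and insert the new sample */
    for (int k = N; k > 0; --k)
        x_hist[k] = x_hist[k - 1];
    x_hist[0] = x_n;

    /* non-recursive (feed-forward) part */
    double y = 0.0;
    for (int k = 0; k <= N; ++k)
        y += b[k] * x_hist[k];

    /* recursive (feedback) part */
    for (int k = 1; k <= N; ++k)
        y -= a[k] * y_hist[k - 1];

    /* shift the output delay line and store the new output */
    for (int k = N - 1; k > 0; --k)
        y_hist[k] = y_hist[k - 1];
    y_hist[0] = y;
    return y;
}

int main(void)
{
    /* illustrative coefficients; a[0] = 1 is implicit and unused */
    double b[N + 1] = { 0.2, 0.4, 0.2 };
    double a[N + 1] = { 1.0, -0.5, 0.25 };
    double x_hist[N + 1] = { 0 }, y_hist[N] = { 0 };
    double x[] = { 1.0, 0.0, 0.0, 0.0, 0.0 };

    for (int n = 0; n < 5; ++n)
        printf("y[%d] = %f\n", n, df1_sample(b, a, x_hist, y_hist, x[n]));
    return 0;
}
```

The two shift loops make the 2N delay elements explicit; in practice circular buffers are usually used instead of shifting.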
Direct Canonical Realization or Direct Form - II
The direct canonic realization structure reduces the number of delay elements to the minimum, that is, to N delay elements. In this way, one of the main disadvantages of the direct and direct transpose realization structures, the large number of delay elements, is eliminated.
The similarities between the direct canonic structure block diagram and the direct realization structure shown in Figure 4b are obvious. The difference between the structures shown in Figures 4b and 5 is that, in the direct canonic structure, the non-recursive and recursive parts cannot be treated separately, although it is still easy to distinguish between them.
The direct canonic structure uses N delay elements, (2N+1) multiplications and 2N additions. Its sensitivity to the accuracy of the coefficients is the same as for the previously described structures, which is the main disadvantage of this realization structure.
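A corresponding direct form II sketch (again with illustrative coefficients of our own choosing) keeps only the single shared delay line w(n-1) ... w(n-N), which is what reduces the storage to N elements:

```c
#include <stdio.h>

#define N 2  /* filter order (illustrative) */

/* Direct form II (canonic): a single shared delay line w(n-1) ... w(n-N). */
static double df2_sample(const double b[N + 1], const double a[N + 1],
                         double w[N], double x_n)
{
    /* recursive part first: w(n) = x(n) - a1*w(n-1) - ... - aN*w(n-N) */
    double w_n = x_n;
    for (int k = 1; k <= N; ++k)
        w_n -= a[k] * w[k - 1];

    /* non-recursive part: y(n) = b0*w(n) + b1*w(n-1) + ... + bN*w(n-N) */
    double y = b[0] * w_n;
    for (int k = 1; k <= N; ++k)
        y += b[k] * w[k - 1];

    /* shift the single delay line */
    for (int k = N - 1; k > 0; --k)
        w[k] = w[k - 1];
    w[0] = w_n;
    return y;
}

int main(void)
{
    /* illustrative coefficients; a[0] = 1 is implicit and unused */
    double b[N + 1] = { 0.2, 0.4, 0.2 };
    double a[N + 1] = { 1.0, -0.5, 0.25 };
    double w[N] = { 0.0, 0.0 };
    double x[] = { 1.0, 0.0, 0.0, 0.0, 0.0 };

    for (int n = 0; n < 5; ++n)
        printf("y[%d] = %f\n", n, df2_sample(b, a, w, x[n]));
    return 0;
}
```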
Cascade Realization
The cascade realization structure is the most difficult to obtain from the transfer function, but it is very convenient because of its modular structure and its lower sensitivity to the accuracy of the non-recursive and recursive coefficients. In cascade realization, an IIR filter is divided into several mutually independent sections of the first or second order. Individual sections are mostly realized in the direct canonic or direct transpose canonic structure.
Since the sections are mutually independent after the design process, the finite word-length effect on the accuracy of the coefficients, the changes of the frequency response and the IIR filter stability can be examined separately for each section. This simplifies the analysis.
The IIR filter transfer function is expressed in terms of its zeros and poles, with a constant gain factor H0. Cascade realization requires this expression to be factorized so that the transfer function is expressed as a product of first- and second-order sections:

H(z) = H0 · Π(i=1..M) (b[i,0] + b[i,1]·z^-1 + b[i,2]·z^-2) / (1 + a[i,1]·z^-1 + a[i,2]·z^-2)

where:
H0 is a constant;
a[i, k] are the coefficients of the recursive part of the ith IIR filter section;
b[i, k] are the coefficients of the non-recursive part of the ith IIR filter section.
Individual sections are of the first or second order. The direct transpose canonic structure is most frequently used for their realization. Figure 6 illustrates a first-order section, and Figure 7 illustrates a second-order section.
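The modularity of the cascade structure is easy to see in code. The following sketch (the section coefficients and the gain H0 are made up for illustration) runs the signal through second-order sections realized in the direct canonic structure, one after another:

```c
#include <stdio.h>

#define NSEC 2  /* number of second-order sections (illustrative) */

/* One second-order section in direct canonic (form II) structure. */
typedef struct {
    double b[3];   /* non-recursive coefficients b0, b1, b2 */
    double a[2];   /* recursive coefficients a1, a2 (a0 = 1 implied) */
    double w[2];   /* internal delay line w(n-1), w(n-2) */
} sos_t;

static double sos_sample(sos_t *s, double x)
{
    double w0 = x - s->a[0] * s->w[0] - s->a[1] * s->w[1];
    double y  = s->b[0] * w0 + s->b[1] * s->w[0] + s->b[2] * s->w[1];
    s->w[1] = s->w[0];
    s->w[0] = w0;
    return y;
}

int main(void)
{
    /* illustrative sections and overall gain H0, not taken from the text */
    double h0 = 0.5;
    sos_t sec[NSEC] = {
        { { 1.0,  0.5, 0.0 }, { -0.3, 0.0  }, { 0, 0 } },
        { { 1.0, -1.2, 1.0 }, { -0.8, 0.64 }, { 0, 0 } },
    };
    double x[] = { 1.0, 0.0, 0.0, 0.0 };

    for (int n = 0; n < 4; ++n) {
        double y = h0 * x[n];
        for (int i = 0; i < NSEC; ++i)   /* sections connected in cascade */
            y = sos_sample(&sec[i], y);
        printf("y[%d] = %f\n", n, y);
    }
    return 0;
}
```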
Parallel Form
A direct partial-fraction expansion of the transfer function in z leads to the parallel form structure. Assuming simple poles pk, the transfer function H(z) can be expressed as

H(z) = K + Σ(k=1..N) Ak / (1 - pk·z^-1)

where K is a constant and Ak are the partial-fraction coefficients (residues). The two basic parallel realizations of a 3rd-order IIR transfer function are shown in Figure 9.
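Assuming simple real poles as above, a parallel realization can be sketched as a bank of first-order sections whose outputs are added together with the constant term; the residues, poles and constant below are illustrative only:

```c
#include <stdio.h>

#define NSEC 2  /* number of parallel first-order sections (illustrative) */

/* One first-order section A_k / (1 - p_k z^-1), realized recursively. */
typedef struct {
    double A;   /* partial-fraction residue */
    double p;   /* (real) pole */
    double y1;  /* previous output of this section */
} par_sec_t;

static double par_sample(par_sec_t *s, double x)
{
    double y = s->A * x + s->p * s->y1;
    s->y1 = y;
    return y;
}

int main(void)
{
    /* illustrative constant K and sections, not taken from the text */
    double K = 0.1;
    par_sec_t sec[NSEC] = { { 0.6, 0.5, 0.0 }, { -0.3, -0.25, 0.0 } };
    double x[] = { 1.0, 0.0, 0.0, 0.0 };

    for (int n = 0; n < 4; ++n) {
        double y = K * x[n];                 /* direct path */
        for (int i = 0; i < NSEC; ++i)
            y += par_sample(&sec[i], x[n]);  /* sections add up in parallel */
        printf("y[%d] = %f\n", n, y);
    }
    return 0;
}
```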
Fixed-point Representation
One possibility for handling numbers with fractional parts is to add bits after the decimal
point: The first bit after the decimal point is the halves place, the next bit the quarters
place, the next bit the eighths place, and so on.
Suppose that we want to represent 1.625(10). We would want 1 in the ones place,
leaving us with 0.625. Then we want 1 in the halves place, leaving us with
0.625 − 0.5 = 0.125. No quarters will fit, so put a 0 there. We want a 1 in the eighths
place, and we subtract 0.125 from 0.125 to get 0.
To represent 1.625, we would use the first 24 bits to indicate 1, and we'd use the
remaining 8 bits to represent 0.625. Thus, our 32-bit representation would be:
00000000 00000000 00000001 10100000.
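The following C sketch shows this 24.8 fixed-point convention (the scale factor 256 and the helper names are ours): a value is stored as an integer count of 1/256 steps, so 1.625 becomes 416, i.e. the bit pattern 0x000001A0 shown above:

```c
#include <stdio.h>
#include <stdint.h>

/* 32-bit fixed point with 24 integer bits and 8 fractional bits:
   a value is stored as the integer round(value * 256). */
#define FRAC_BITS 8
#define SCALE (1u << FRAC_BITS)   /* 256 */

static uint32_t to_fixed(double v)     { return (uint32_t)(v * SCALE + 0.5); }
static double   from_fixed(uint32_t f) { return (double)f / SCALE; }

int main(void)
{
    uint32_t f = to_fixed(1.625);
    printf("1.625 -> 0x%08X (%u/256)\n", (unsigned)f, (unsigned)f);  /* 0x000001A0 = 416 */
    printf("back  -> %f\n", from_fixed(f));
    return 0;
}
```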
Fixed-point representation works reasonably well as long as you work with numbers
within the supported range. The 32-bit fixed-point representation described above can
represent any multiple of 1/256 from 0 up to 2^24 ≈ 16.7 million. But programs frequently
need to work with numbers from a much broader range. For this reason, fixed-point
representation isn't used very often in today's computing world.
A notable exception is financial software. Here, all computations must be represented
exactly to the penny, and in fact further precision is rarely desired, since results are
always rounded to the penny. Moreover, most applications have no requirement for
large amounts (like trillions of dollars), so the limited range of fixed-point numbers isn't
an issue. Thus, programs typically use a variant of fixed-point representation that
represents each amount as an integer multiple of 1/100, just as the fixed-point
representation described above represents each number as a multiple of 1/256.
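A minimal sketch of that idea, with made-up amounts: prices are held as 64-bit integer counts of cents, so additions are exact and only the final display divides by 100:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* amounts stored as integer numbers of cents (multiples of 1/100) */
    int64_t price_cents = 1999;          /* $19.99 */
    int64_t tax_cents   = 160;           /* $1.60  */
    int64_t total_cents = price_cents + tax_cents;

    printf("total: $%lld.%02lld\n",
           (long long)(total_cents / 100), (long long)(total_cents % 100));
    return 0;
}
```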
Normalized floating-point
Floating-point representation is an alternative technique based on scientific notation.
Floating-point basics
Though we'd like to use scientific notation, we'll base our scientific notation on powers of 2, not powers of 10, because we're working with computers that prefer binary. For example, 5.5(10) is 101.1(2) in binary, and it converts to a binary scientific notation of 1.011(2) × 2^2. (In converting to binary scientific notation here, we moved the decimal point to the left two places, just as we would in converting 101.1(10) to scientific notation: it would be 1.011(10) × 10^2.)
Once we have a number in binary scientific notation, we still must have a technique for mapping it into a set of bits. First, let us define the two parts of scientific representation: in 1.011(2) × 2^2, we call 1.011(2) the mantissa (or the significand), and we call 2 the exponent. In this section we'll use 8 bits to store such a number.
We use the first bit to represent the sign (1 for negative, 0 for positive), the next four bits
for the sum of 7 and the actual exponent (we add 7 to allow for negative exponents),
and the last three bits for the mantissa's fractional part. Note that we omit the integer
part of the mantissa: Since the mantissa must have exactly one nonzero bit to the left of
its decimal point, and the only nonzero bit is 1, we know that the bit to the left of the
decimal point must be a 1. There's no point in wasting space in inserting this 1 into our
bit pattern, so we include only the bits of the mantissa to the right of the decimal point.
For our example of 5.5(10) = 1.011(2) × 2^2, we add 7 to 2 to arrive at 9(10) = 1001(2) for the exponent bits. Into the mantissa bits we place the bits following the decimal point of the scientific notation, 011. This gives us 0 1001 011 as the 8-bit representation of 5.5(10).
We call this floating-point representation because the values of the mantissa bits float
along with the decimal point, based on the exponent's given value. This is in contrast to
fixed-point representation, where the decimal point is always in the same place among
the bits given.
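A small C sketch of the 8-bit encoding described above (sign bit, exponent biased by 7, three mantissa bits); the helper name is ours, and zero, negative numbers, rounding ties and out-of-range exponents are ignored for brevity:

```c
#include <stdio.h>
#include <math.h>

/* Encode a positive value in the 8-bit format described above:
   1 sign bit, 4 exponent bits (actual exponent + 7), 3 mantissa bits.
   A rough sketch only: zero, negative numbers, rounding ties and
   out-of-range exponents are not handled. */
static unsigned encode8(double v)
{
    int e = 0;
    while (v >= 2.0) { v /= 2.0; ++e; }   /* normalize to the form 1.xxx */
    while (v <  1.0) { v *= 2.0; --e; }

    unsigned frac = (unsigned)round((v - 1.0) * 8.0);  /* 3 fraction bits */
    if (frac == 8) { frac = 0; ++e; }                  /* rounding carried over */
    return ((unsigned)(e + 7) << 3) | frac;            /* sign bit stays 0 */
}

int main(void)
{
    printf("5.5 -> 0x%02X (expected 0 1001 011 = 0x4B)\n", encode8(5.5));
    return 0;
}
```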
Suppose we want to represent −96(10).
a. First we convert our desired number to binary: 1100000(2).
b. Then we convert this to binary scientific notation: 1.100000(2) × 2^6.
c. Then we fit this into the bits.
1. We choose 1 for the sign bit since the number is negative.
2. We add 7 to the exponent and place the result into the four exponent bits.
For this example, we arrive at 6 + 7 = 13(10) = 1101(2).
3. The three mantissa bits are the first three bits following the leading 1: 100.
(If it happened that there were 1 bits beyond the 1/8's place, we would
need to round the mantissa to the nearest eighth.)
Thus we end up with 1 1101 100.
Conversely, suppose we want to decode the number 0 0101 100.
1. We observe that the number is positive, and the exponent bits represent
0101(2) = 5(10). This is 7 more than the actual exponent, and so the actual
exponent must be −2. Thus, in binary scientific notation, we have 1.100(2) × 2^−2.
2. We convert this to binary: 1.100(2) × 2^−2 = 0.011(2).
3. We convert the binary into decimal: 0.011(2) = 1/4 + 1/8 = 3/8 = 0.375(10).
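The decoding direction can be sketched the same way; the function below (again a sketch with our own names) rebuilds the value from the sign, the biased exponent and the three stored mantissa bits with their implicit leading 1:

```c
#include <stdio.h>

/* Decode the 8-bit format: 1 sign bit, 4 exponent bits (bias 7),
   3 mantissa fraction bits with an implicit leading 1. */
static double decode8(unsigned bits)
{
    int sign = (bits >> 7) & 1;
    int exp  = (int)((bits >> 3) & 0xF) - 7;          /* remove the bias of 7 */
    double mant = 1.0 + ((bits >> 2) & 1) / 2.0
                      + ((bits >> 1) & 1) / 4.0
                      + ( bits       & 1) / 8.0;
    double v = mant;
    while (exp > 0) { v *= 2.0; --exp; }
    while (exp < 0) { v /= 2.0; ++exp; }
    return sign ? -v : v;
}

int main(void)
{
    printf("0 0101 100 -> %f\n", decode8(0x2C));  /* expected 0.375 */
    printf("0 1001 011 -> %f\n", decode8(0x4B));  /* expected 5.5   */
    return 0;
}
```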
Representable numbers
This 8-bit floating-point format can represent a wide range of both small numbers and
large numbers. To find the smallest possible positive number we can represent, we
would want the sign bit to be 0, we would place 0 in all the exponent bits to get the
smallest exponent possible, and we would put 0 in all the mantissa bits. This gives us
0 0000 000, which represents
1.000(2) × 2^(0−7) = 2^−7 ≈ 0.0078(10).
To determine the largest positive number, we would want the sign bit still to be 0, we
would place 1 in all the exponent bits to get the largest exponent possible, and we
would put 1 in all the mantissa bits. This gives us 0 1111 111, which represents
1.111(2) × 2^(15−7) = 1.111(2) × 2^8 = 111100000(2) = 480(10).
Thus, our 8-bit floating-point format can represent positive numbers from about
0.0078(10) to 480(10). In contrast, the 8-bit two's-complement representation can only
represent positive numbers between 1 and 127.
But notice that the floating-point representation can't represent all of the numbers in its range; this would be impossible, since eight bits can represent only 2^8 = 256 distinct values, and there are infinitely many real numbers in the range to represent. What's going on? Let's consider how to represent 51(10) in this scheme. In binary, this is 110011(2) = 1.10011(2) × 2^5. When we try to fit the mantissa into the 3-bit portion of our scheme, we find that the last two bits won't fit: we would be forced to round to 1.101(2) × 2^5, and the resulting bit pattern would be 0 1100 101. That rounding means that we're not representing the number precisely. In fact, 0 1100 101 translates to
1.101(2) × 2^(12−7) = 1.101(2) × 2^5 = 110100(2) = 52(10).
Thus, in our 8-bit floating-point representation, 51 equals 52! That's pretty irritating, but
it's a price we have to pay if we want to be able to handle a large range of numbers with
such a small number of bits.
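The effect can be reproduced with plain arithmetic; the snippet below rebuilds 51 from its full mantissa 1.10011(2) and from the rounded 3-bit mantissa 1.101(2):

```c
#include <stdio.h>

int main(void)
{
    /* 51 = 110011(2) = 1.10011(2) * 2^5; only three fraction bits can be
       stored, so the mantissa is rounded from 1.10011(2) to 1.101(2). */
    double exact   = (1.0 + 1.0/2 + 0.0/4 + 0.0/8 + 1.0/16 + 1.0/32) * 32.0;
    double rounded = (1.0 + 1.0/2 + 0.0/4 + 1.0/8) * 32.0;

    printf("exact value:  %g\n", exact);    /* 51 */
    printf("stored value: %g\n", rounded);  /* 52 */
    return 0;
}
```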
(By the way, in rounding numbers that are exactly between two possibilities, the typical policy is to round so that the final mantissa bit is 0. For example, taking the number 19, we end up with 1.0011(2) × 2^4, and we would round this up to 1.010(2) × 2^4 = 20(10). On the other hand, rounding up the number 21 = 1.0101(2) × 2^4 would lead to 1.011(2) × 2^4, leaving a 1 in the final bit of the mantissa, which we want to avoid; so instead we round down to 1.010(2) × 2^4 = 20(10). Doubtless you were taught in grade school to round up all the time; computers don't do this because if we consistently round up, all those roundings will bias the total of the numbers upward. Rounding so the final bit is 0 ensures that exactly half of the possible numbers round up and exactly half round down.)
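The same round-half-to-even policy is the default rounding mode of IEEE floating-point hardware, so it can be demonstrated with rint(): with exponent 4 and three mantissa bits the representable values are multiples of 2, and both 19 and 21 land on 20:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* With exponent 4 and three mantissa bits, representable values are
       multiples of 2^(4-3) = 2.  rint() uses the processor's default
       round-to-nearest-even mode, which is exactly the policy described. */
    double step = 2.0;
    printf("19 -> %g\n", rint(19.0 / step) * step);  /* 20 */
    printf("21 -> %g\n", rint(21.0 / step) * step);  /* 20 */
    return 0;
}
```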
While a floating-point representation can't represent all numbers precisely, it does give
us a guaranteed number of significant digits. For this 8-bit representation, we get a
single digit of precision, which is pretty limited. To get more precision, we need more
mantissa bits. Suppose we defined a similar 16-bit representation with 1 bit for the sign
bit, 6 bits for the exponent plus 31, and 9 bits for the mantissa.
This representation, with its 9 mantissa bits, happens to provide three significant digits.
Given a limited length for a floating-point representation, we have to compromise
between more mantissa bits (to get more precision) and more exponent bits (to get a
wider range of numbers to represent). For 16-bit floating-point numbers, the 6-and-9
split is a reasonable tradeoff of range versus precision.
The table below summarizes the quantization step and the achievable accuracy of numbers represented in the fixed-point format for several word lengths.

Table 1 Quantization of numbers represented in the fixed-point format

Bit number | Range of numbers | Quantization step        | Maximum quantization error | Number of exact decimal points
4          | (-1, +1)         | 0.125                    | 0.0625                     | 1
8          | (-1, +1)         | 0.0078125                | 0.00390625                 | 2
16         | (-1, +1)         | 3.0517578125*10^-5       | 1.52587890625*10^-5        | 4
32         | (-1, +1)         | 4.6566128730774*10^-10   | 2.3283064365387*10^-10     | 9
64         | (-1, +1)         | 1.0842021724855*10^-19   | 5.4210108624275*10^-20     | 19
Table 2 below provides the basic information on floating-point representation for several
different lengths.
Table 2 Quantization of numbers represented in the floating-point format
Bit number | Mantissa size | Exponent size | Number band                                             | Number of exact decimal points
16         | ..            | ..            | ..                                                      | ..
32         | 23            | 8             | 2.3*10^-38 .. 3.4*10^38 (down to 1.4*10^-45 with reduced precision) | 6-7
The coefficients of a linear-phase FIR filter are symmetric, which means that the corresponding pairs of coefficients will be quantized to the same value. As a result, the symmetry of the impulse response remains unchanged.
After all that has been said, it is easy to notice that the finite word length used for representing coefficients and the samples being processed causes problems such as:
Coefficient quantization errors;
Sample quantization errors (quantization noise); and
Overflow errors.
Coefficient Quantization
Coefficient quantization results in the FIR filter changing its transfer function. The positions of the FIR filter zeros are also changed, whereas the positions of its poles remain unchanged, as they are all located at z = 0 and quantization has no effect on them. The conclusion is that quantization of FIR filter coefficients cannot cause the filter to become unstable, as is the case with IIR filters.
Even though there is no danger of FIR filter destabilization, it may happen that the transfer function deviates to such an extent that it no longer meets the specifications, which further means that the resulting filter is not suitable for the intended implementation.
The FIR filter quantization errors cause the stopband attenuation to become lower. If it drops below the limit defined by the specifications, the resulting filter is useless.
Transfer function changes occurring due to FIR filter coefficient quantization are more pronounced for high-order filters. The reason for this is the fact that the spacing between the zeros of the transfer function gets smaller as the filter order increases, and even slight changes of the zero positions then affect the FIR filter frequency response.
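A quick way to see the size of coefficient quantization errors is to round a set of coefficients to a fixed number of fractional bits and compare; the coefficients below are illustrative, not taken from the text:

```c
#include <stdio.h>
#include <math.h>

/* Quantize a coefficient to 'bits' fractional bits (fixed-point rounding). */
static double quantize(double c, int bits)
{
    double scale = pow(2.0, bits);
    return round(c * scale) / scale;
}

int main(void)
{
    /* illustrative FIR coefficients, not taken from the text */
    double b[] = { 0.112, -0.287, 0.541, -0.287, 0.112 };
    int n = sizeof b / sizeof b[0];

    for (int k = 0; k < n; ++k) {
        double q = quantize(b[k], 6);   /* keep 6 fractional bits */
        printf("b[%d] = %+f -> %+f (error %+f)\n", k, b[k], q, b[k] - q);
    }
    return 0;
}
```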
Sample Quantization
Another problem caused by the finite word length is the quantization of samples performed at the multiplier outputs (after filtering). The process of filtering can be represented as a sum of multiplications performed on the filter coefficients and the signal samples appearing at the filter input. Figure 2-5-1 illustrates the block diagram of input signal filtering along with the quantization of the result.
Example:
Assume that it is necessary to filter input samples using a second-order filter. Such a filter has three coefficients: {0.7, 0.8, 0.7}.
The input samples are: { ..., 0.9, 0.7, 0.1, ... }
By analyzing the steps of the input sample filtering process, shown in Table 3 below, it is easy to understand how an overflow occurs in the second step. The final sum is greater than 1.
Table 3 Overflow

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1.19
0.7                 | 0.1          | 1.19 + 0.07 = 1.26
As the range of values defined by the fixed-point representation is between -1 and +1, the results of the filtering process will be as shown in Table 4.

Table 4 Overflow effects

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1.19 -> -0.81 (overflow)
0.7                 | 0.1          | -0.81 + 0.07 = -0.74
As mentioned, an overflow occurs in the second step. Instead of the desired value +1.19, the result is the undesirable negative value -0.81. The difference of -2 between these two values is explained in Figure 10.
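The wrap-around behaviour can be reproduced with the example's own numbers; the sketch below emulates a two's-complement accumulator limited to [-1, +1) by adding or subtracting 2 whenever the running sum leaves the range:

```c
#include <stdio.h>

/* Wrap a value into the two's-complement fixed-point range [-1, +1)
   the way an overflowing accumulator does: the value jumps by +/-2. */
static double wrap(double v)
{
    while (v >= 1.0) v -= 2.0;
    while (v < -1.0) v += 2.0;
    return v;
}

int main(void)
{
    double b[] = { 0.7, 0.8, 0.7 };   /* coefficients from the example */
    double x[] = { 0.9, 0.7, 0.1 };   /* input samples from the example */
    double acc = 0.0;

    for (int k = 0; k < 3; ++k) {
        acc = wrap(acc + b[k] * x[k]);
        printf("step %d: %+.2f\n", k + 1, acc);   /* 0.63, -0.81, -0.74 */
    }
    return 0;
}
```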
Now assume that the third input sample is -0.5 instead of 0.1, so that the input samples are { ..., 0.9, 0.7, -0.5, ... }. The desired intermediate results are shown in Table 5.

Table 5 Desired intermediate results

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1.19
0.7                 | -0.5         | 1.19 - 0.35 = 0.84

As seen, some intermediate results exceed the given range and two overflows occur. Refer to Table 6.
Table 6 Obtained intermediate results

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1.19 -> -0.81 (positive overflow)
0.7                 | -0.5         | -0.81 - 0.35 = -1.16 -> +0.84 (negative overflow)
So, in spite of the fact that two overflows have occurred, the final result remains unchanged. The reason for this is the nature of these two overflows: the first one has decremented the final result by 2, whereas the second one has incremented it by 2. In this way, the overflow effect is annulled. The first overflow is called a positive overflow, whereas the latter is called a negative overflow.
Note:
If the number of positive overflows is equal to the number of negative overflows, the
final result will not be changed, i.e. the overflow effect is canceled.
Overflow causes rapid oscillations in the output samples, which further causes high-frequency components to appear in the output spectrum. There are several ways to lessen the overflow effects. The two most commonly used are scaling and saturation.
It is possible to scale the FIR filter coefficients to avoid overflow. A necessary and sufficient condition on the FIR filter coefficients in this case is given by the following expression:

|b0| + |b1| + ... + |bN| ≤ 1

where bk are the FIR filter coefficients and N is the filter order.
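Assuming the condition above (the sum of the absolute values of the coefficients must not exceed 1), a simple sketch checks it and rescales the coefficients when it is violated; the price paid is a lower overall gain of the filter:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double b[] = { 0.7, 0.8, 0.7 };   /* coefficients from the example */
    int n = sizeof b / sizeof b[0];

    /* the sum of absolute values decides whether an overflow is possible */
    double sum = 0.0;
    for (int k = 0; k < n; ++k)
        sum += fabs(b[k]);

    if (sum > 1.0) {
        printf("sum of |b[k]| = %.2f > 1, scaling by 1/%.2f\n", sum, sum);
        for (int k = 0; k < n; ++k)
            b[k] /= sum;              /* the scaled filter can no longer overflow */
    }
    for (int k = 0; k < n; ++k)
        printf("b[%d] = %f\n", k, b[k]);
    return 0;
}
```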
The second method, saturation, replaces any intermediate result that falls outside the allowed range with the nearest range limit instead of letting it wrap around. Consider again the filtering of the input samples { ..., 0.9, 0.7, 0.1, ... } with the coefficients {0.7, 0.8, 0.7}; the desired intermediate results are shown in Table 7.

Table 7 Desired intermediate results

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1.19
0.7                 | 0.1          | 1.19 + 0.07 = 1.26

As the range of values defined by the fixed-point representation is between -1 and +1, and the saturation characteristic is used as well, the intermediate results are as shown in Table 8.
Table 8 Intermediate results with saturation

Filter coefficients | Input sample | Intermediate result
0.7                 | 0.9          | 0.63
0.8                 | 0.7          | 0.63 + 0.56 = 1
0.7                 | 0.1          | 1 + 0.07 = 1
The resulting sum is not correct, but the error is far smaller than when there is no saturation:
Without saturation: error = 1.26 - (-0.74) = 2
With saturation: error = 1.26 - 1 = 0.26
As seen from the example above, the saturation characteristic lessens an overflow
effect and attenuates undesirable components in the output spectrum.
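The same computation with a saturating accumulator, using the example's coefficients and samples, reproduces the 0.26 error quoted above:

```c
#include <stdio.h>

/* Saturation characteristic: clip the accumulator to the end of the
   fixed-point range instead of letting it wrap around. */
static double saturate(double v)
{
    if (v >  1.0) return  1.0;
    if (v < -1.0) return -1.0;
    return v;
}

int main(void)
{
    double b[] = { 0.7, 0.8, 0.7 };   /* coefficients from the example */
    double x[] = { 0.9, 0.7, 0.1 };   /* input samples from the example */
    double acc = 0.0;

    for (int k = 0; k < 3; ++k) {
        acc = saturate(acc + b[k] * x[k]);
        printf("step %d: %+.2f\n", k + 1, acc);   /* 0.63, 1.00, 1.00 */
    }
    printf("error vs. exact 1.26: %.2f\n", 1.26 - acc);   /* 0.26 */
    return 0;
}
```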
A round-off error, also called rounding error, is the difference between the calculated
approximation of a number and its exact mathematical value due to rounding. This is a
form of quantization error. One of the goals of numerical analysis is to estimate errors in
calculations, including round-off error, when using approximation equations and/or
algorithms, especially when using finitely many digits to represent real numbers (which
in theory have infinitely many digits).
When a sequence of calculations subject to rounding error is made, errors may
accumulate, sometimes dominating the calculation. In ill-conditioned problems,
significant error may accumulate.
Representation error
The error introduced by attempting to represent a number using a finite string of digits is
a form of round-off error called representation error. Here are some examples of
representation error in decimal representations:
Notation | Representation                    | Approximation           | Error
1/7      | 0.142 857 142 857...              | 0.142 857               | 0.000 000 142 857...
ln 2     | 0.693 147 180 559 945 309 41...   | ..                      | ..
log10 2  | 0.301 029 995 663 981 195 21...   | ..                      | ..
e        | 2.718 281 828 459 045 235 36...   | 2.718 281 828 459 045   | 0.000 000 000 000 000 235 36...
π        | 3.141 592 653 589 793 238 46...   | 3.141 592 653 589 793   | 0.000 000 000 000 000 238 46...
Rounding several times in succession can increase the error. For example, rounding 9.945309 to two decimal places (9.95) and then to one decimal place (10.0) gives a total error of 0.054691, whereas rounding 9.945309 to one decimal place (9.9) in a single step introduces less error (0.045309). This commonly occurs when performing arithmetic operations.
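A classic illustration of accumulated representation error: 0.1 has no exact binary representation, so adding the single-precision value 0.1f a thousand times does not give exactly 100:

```c
#include <stdio.h>

int main(void)
{
    /* 0.1 has no exact binary representation, so every addition carries a
       small representation error; over many additions the errors accumulate. */
    float sum = 0.0f;
    for (int i = 0; i < 1000; ++i)
        sum += 0.1f;

    printf("sum   = %.6f\n", sum);            /* close to, but not exactly, 100 */
    printf("error = %.6f\n", sum - 100.0f);
    return 0;
}
```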