by
Mohammad Ariful Haque
DOCTOR OF PHILOSOPHY
2009
CANDIDATE'S DECLARATION
It is hereby declared that this dissertation or any part of it has not been submitted
elsewhere for the award of any degree or diploma.
Board of Examiners

1. Prof. Md. Kamrul Hasan
   Department of Electrical & Electronic Engineering
   BUET, Dhaka 1000
   Chairman (Supervisor)

2. Prof. M. Rezwan Khan
   The Vice Chancellor
   United International University, Dhaka 1209
   Member

3.
   Member

4.
   Member

5.
   Member

6. Prof. Satya Prasad Majumder
   Head of the Department
   Department of Electrical & Electronic Engineering
   BUET, Dhaka 1000
   Member (Ex-officio)

7. Prof. Keikichi Hirose
   Department of Information and Communication Engineering
   The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
   Member (External)
Dedication
To the people who are working for real change for humanity.
Contents

Acknowledgements  xviii
Abstract  xix

1 Introduction
1.1 Motivation
1.2
1.2.1 Background noise
1.2.2 Reverberation
1.3
1.4 Dereverberation problem
1.5 Research Overview
1.6 Dissertation Outline  11

2 Literature Review  13
2.1 Non-channel-information-based Techniques  14
2.1.1 Beamforming  14
2.1.2 LP residual processing  15
2.1.3 Spectral enhancement  17
2.1.4 LIME  18
2.1.5 HERB  20
2.2 Channel-information-based Techniques  20
2.2.1 Direct equalization  21
2.2.2  22
2.3 Conclusion  25

3 Multichannel LMS Algorithm for Blind Channel Identification: Robustness Issue  26
3.1  26
3.1.1 Problem formulation  26
3.1.2 Identifiability condition  27
3.2  28
3.2.1  29
3.2.2  30
3.2.3  32
3.3  33
3.4  36
3.5 Convergence Analysis  37
3.6  42
3.7 Conclusion  46

4  47
4.1  48
4.2  49
4.3  55
4.3.1  55
4.3.2  58
4.4 Simulation Results  60
4.4.1  60
4.4.2  62
4.4.3  64
4.5 Conclusion  67

5  68
5.1  68
5.1.1  69
5.1.2  72
5.1.3  75
5.1.4  77
5.1.5 Simulation results  79
5.1.6  82
5.2  82
5.2.1 Simulation results  86
5.2.2  89
5.2.3  91
5.2.4  92
5.3 Conclusion  93

6  95
6.1  95
6.1.1 Delay-and-sum beamforming  96
6.1.2 Channel shortening  97
6.1.3
6.1.4
6.2
6.2.2
6.2.3
6.2.4
6.2.5
6.2.6
6.3 Conclusion  129

7  130
7.1 Conclusions  130
7.2  134
136
List of Tables

5.1  72
5.2  75
5.3  87
6.1
6.2 Results of SNR, DRR and PESQ improvement with and without delay-and-sum beamformer  104
6.3
6.4
6.5
6.6
6.7
6.8
6.9
6.10 Quality of the dereverberated speech in terms of WSS for the proposed and other state-of-the-art techniques  122
6.11 Quality of the dereverberated speech in terms of PESQ for the proposed and other state-of-the-art techniques  123
6.12 Quality of the dereverberated speech for the real acoustic channels in terms of LLR  125
6.13 Quality of the dereverberated speech for the real acoustic channels in terms of segSNR  125
6.14 Quality of the dereverberated speech for the real acoustic channels in terms of WSS  126
6.15 Quality of the dereverberated speech for the real acoustic channels in terms of PESQ  126
List of Figures

1.1
1.2
1.3
1.4
1.5
1.6
1.7  10
2.1  14
2.2  17
2.3 LIME structure.  19
2.4  23
3.1  27
3.2  41
3.3  43
3.4  44
3.5  45
4.1  54
4.2  58
4.3 Comparison of the computational complexities of the proposed VSS-MCFLMS and NMCFLMS algorithms  60
4.4  61
4.5  62
4.6  62
4.7  63
4.8  64
4.9  65
4.10 NPM of the VSS-MCFLMS and NMCFLMS algorithms for SIMO FIR system at SNR = 25 dB  65
5.1  66
5.2
5.3
5.4  80
5.5  81
5.6  90
5.7  91
5.8  92
5.9 (a) True acoustic channel obtained from the MARDY. (b) Estimated channel using the Spectrally Constrained algorithm.  92
6.1  96
6.2
6.3
6.4
6.5 Comparison of mean time per iteration for the proposed and infinity-norm algorithms.  103
6.6
6.7 Block diagram of the signal path and noise path for the kth channel.  108
6.8
6.10 Spectrogram of the (a) clean speech (b) noisy reverberated speech at 30 dB SNR (c) denoised speech (d) dereverberated using the proposed method.  120
6.11 Quality of the dereverberated speech at different block-lengths of the proposed zero forcing equalization.  120
6.12 Impulse responses of real reverberant acoustic channels. The length of each impulse response is L = 4400.  124
6.13 Convergence profile of the robust NMCFLMS algorithm for time-varying channels.  129
Glossary

AIR  Acoustic Impulse Response
Avg-Seg-SNR  Average Segmental Signal-to-Noise Ratio
AR  Auto Regressive
BCI  Blind Channel Identification
CR  Cross-Relation
DRR  Direct to Reverberation Ratio
FIR  Finite Impulse Response
GSC  Generalized Sidelobe Canceller
HERB  Harmonicity-based Dereverberation
ISS
LIME
LLR  Log-Likelihood Ratio
LMS  Least-Mean-Square
LP  Linear Prediction
LPC  Linear Prediction Coefficients
LS  Least-Squares
MARDY  Multichannel Acoustic Reverberation Database at York
MCLMS  Multichannel Least-Mean-Square
MCFLMS  Multichannel Frequency-domain Least-Mean-Square
MINT  Multiple-input/output Inverse Theorem
MIMO  Multiple-Input Multiple-Output
MLP
MMSE  Minimum Mean-Square Error
MRE  Mutually Referenced Equalizers
NMCFLMS  Normalized Multichannel Frequency-domain Least-Mean-Square
NPM  Normalized Projection Misalignment
PESQ  Perceptual Evaluation of Speech Quality
RMCFLMS
RNMCFLMS  Robust Normalized Multichannel Frequency-domain Least-Mean-Square
RT  Reverberation Time
SIMO  Single-Input Multiple-Output
SNR  Signal-to-Noise Ratio
SOS  Second-Order Statistics
VSS  Variable Step-Size
ZFE  Zero-Forcing Equalizer
Acknowledgments
All praises and thanks belong to Allah, the creator and manager of this world, for
His innumerable favors and bounties on me. It is He who has granted me physical and
mental capabilities to complete the dissertation, without anything or anyone compelling
Him to do so.
I cannot find appropriate words to convey my appreciation and gratitude to Prof.
Md. Kamrul Hasan for allowing me to work with him. It was his idea to work with
dereverberation, an interesting and challenging problem. His guidance and amazing
insight helped me to overcome many obstacles throughout the period of my research.
I am grateful for his patience, support and encouragement. His care was extended not
only to academic affairs, but also during some tumultuous period of my life. For this,
I am ever grateful to him.
I would like to thank the esteemed committee members for being willing to take on the extra
burden of reading and discussing the dissertation with me. This gave me a deeper
insight into the research topic. I am personally grateful to Prof. Aminul Hoque, Head
of the department of EEE, for his encouragement to complete the work. I also express
my gratitude for the members of committee for advanced studies and research (CASR)
for approving the financial grant requested for the research work.
I also feel lucky for the excellent research environment we have in the DSP lab.
I got acquainted with many signal processing topics ranging from speech to image
and biomedical signal processing from the discussions and presentations of the fellow
students. I am particularly thankful to Toufiqul Islam for the interactions I had with
him on channel shortening techniques.
Finally, this dissertation is as much my family's effort as it is mine. In fact, it is a
result of the motivation and support provided by my parents, brother, sister, and my
wife.
Abstract
Reverberation is one of the primary factors that degrade the quality of speech
signals when recorded by a distant microphone in order to facilitate hands-free
communication. Undoing the effect of reverberation is still a challenging problem,
especially when additive noise and time-varying acoustic channels are considered. In
this dissertation, several multimicrophone dereverberation techniques are developed
that can dereverberate the recorded speech as well as improve the signal-to-noise
ratio (SNR) considering a practical acoustic environment. The methods are based
on the adaptive estimation of the long acoustic impulse responses (AIRs) using the
multichannel LMS (MCLMS) algorithm. Although the MCLMS algorithm is attractive
for its simplicity and computational efficiency, it suffers from a slow convergence rate,
step-size ambiguity, and, last but not least, a lack of robustness in the presence
of noise. A variable-step-size frequency-domain MCLMS algorithm is proposed that
can ensure stability and optimal convergence speed in both noise-free and noisy
conditions. To improve the noise-robustness of the class of MCLMS algorithms, two
novel solutions, namely, the excitation-driven MCLMS and spectrally constrained MCLMS
algorithms, are proposed that can successfully estimate the long AIRs with reasonable
accuracy.
Based on adaptive estimation of the AIRs, two different dereverberation techniques
are proposed. In the first approach, dereverberation is achieved by suppressing the
late reverberation using channel shortening technique and the SNR is improved by
delay-and-sum beamforming. The proposed shortening algorithm is also optimized
so that it makes a trade-off between shortening performance and spectral distortion
in the dereverberated speech. In the second approach, the power of the speech
components in the received microphone signals is first enhanced by an eigenfilter,
and then a block-adaptive zero-forcing equalizer is employed to eliminate the channel
distortion introduced by the AIRs and eigenfilter. The eigenfilter is efficiently estimated
avoiding the tedious Cholesky factorization and it also resists spectral nulls so that
noise amplification is mitigated at the output of the zero-forcing equalizer. Extensive
experiments are conducted, using both simulated and real reverberant acoustic
channels, which demonstrate that the proposed methods can offer better speech quality
and SNR improvement as compared to the state-of-the-art dereverberation techniques.
Chapter 1
Introduction
1.1 Motivation
Hands-free operation is one of the key aspects of next-generation speech communication.
The main user benefit of hands-free operation is the ability to walk freely without
wearing a headset or holding a microphone.
[Figure: a hands-free recording scenario, with a speaker, a fan, and a microphone.]
1.2
Figure 1.2: Acoustic wave propagation from the speaker to the microphone.
1.2.1 Background noise
Background noise is the most common unwanted signal in the recorded sound, generally
arising from fans, traffic, audio equipment, or other speakers present in the room. One
of the widely used properties of a noise signal is the assumption that it is stationary.
This means that its statistical properties, such as mean and variance, do not
change with time. A good statistical model of stationary noise in the time domain
is a zero-mean Gaussian process. Under this model, known as Gaussian noise,
every sample of the noise signal is a random value with a Gaussian
probability density function given by

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right)   (1.1)

where \sigma^2 is the variance of the noise signal. Engine noise and air-conditioner or fan
noise are good examples of stationary noise. There are non-stationary noises too, such
as the sound of doors opening and closing, passing cars, etc. In this work, only
stationary Gaussian noise is considered.
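As a small illustration of this noise model, the following pure-Python sketch (the sample size and variance are arbitrary choices) draws zero-mean Gaussian samples per eq. (1.1) and checks that the mean and variance are essentially the same in two disjoint segments, which is what stationarity implies:

```python
import random
import statistics

def gaussian_noise(n, sigma, seed=0):
    """Sample n values of zero-mean stationary Gaussian noise, per eq. (1.1)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]

noise = gaussian_noise(100_000, sigma=0.5)

# Stationarity check: mean and variance barely change between segments.
first, second = noise[:50_000], noise[50_000:]
print(round(statistics.mean(first), 2), round(statistics.mean(second), 2))
print(round(statistics.pvariance(first), 2), round(statistics.pvariance(second), 2))
```

Both segments show a mean near 0 and a variance near sigma squared (0.25 here), consistent with the stationary model assumed throughout this work.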
1.2.2 Reverberation
[Figure: a typical acoustic impulse response, showing the direct path, the early reflections, and the late reflections (amplitude versus time, 0-300 ms).]
1.3
In fact, speech and audio appear more pleasant to the listener when some reverberation
is present [1]. However, in highly reverberant environments the intelligibility of the
speech signal drops considerably [2]. The effects of reverberation on speech are clearly
visible in the spectrogram as well as in the time waveform.
Fig. 1.4 shows the
spectrogram and waveform of a clean speech signal. It is observed from the spectrogram
that the clean speech has vivid harmonic structure whereas the time-waveform shows
that phonemes are well separated in time. The spectrogram and waveform of the
reverberated signal are shown in Fig. 1.5. The distortion of the speech signal that is
caused by reverberation is clearly visible. The frequency spectrum has been smeared
which is observed from the spectrogram, and the phonemes are overlapped, which
is visible from both the spectrogram and the time waveform. Due to this overlapping,
the empty spaces between words and syllables are now filled by reverberation. These
distortions cause an audible difference between the clean and the reverberant speech,
and degrade the perceptual quality and intelligibility of the speech.
The distortions resulting from the reverberation seem to go unnoticed by the normal
1.4 Dereverberation problem
1.5 Research Overview
[Figure: hands-free dereverberation scenario, with a speaker, a fan, and a microphone array feeding the dereverberation system.]

[Figure: block diagram of the multimicrophone system. The source signal s(n) passes through the channels H_1(z), ..., H_M(z); additive noise v_1(n), ..., v_M(n) corrupts the outputs y_1(n), ..., y_M(n) to give the microphone signals x_1(n), ..., x_M(n), which feed the signal enhancing block and the dereverberation block, driven by a robust channel estimate, to produce the enhanced signal z(n) and the dereverberated speech.]
1.6 Dissertation Outline
minimizes the misalignment of the estimated channel vector with the true one in each
iteration. It has been demonstrated that the proposed VSS guarantees the stability
of the MCLMS algorithm. Performance comparison of the proposed algorithm with
a state-of-the-art normalized MCFLMS (NMCFLMS) algorithm shows that the VSS-MCFLMS algorithm is more noise-robust without sacrificing the speed of convergence
and computational efficiency.
None of the time- and frequency-domain MCLMS algorithms are sufficiently robust
for estimating the AIRs in the noisy condition. Two novel solutions are presented in
Chapter 5 that improve the noise-robustness of the class of MCLMS algorithms. The
first one is termed the excitation-driven MCLMS algorithm and the second one is called
the spectrally-constrained MCLMS algorithm. It is demonstrated that the latter algorithm
can successfully estimate the AIRs in a time-varying acoustic environment with noise.
Chapter 6 addresses the problem of speech dereverberation utilizing the estimates
of AIRs. Two dereverberation techniques are presented. The first approach focuses
on the suppression of late reflections, based on the fact that dereverberation does not
need complete equalization of the acoustic channel; therefore, a shortened channel,
which requires less computation with acceptable performance, can serve the purpose.
The proposed shortening algorithm is optimized so that it makes a trade-off between
shortening performance and spectral distortion in the dereverberated speech. In the
second approach, both early and late reverberations are eliminated using a zero forcing
equalizer and the SNR is improved using an eigenfilter. The eigenfilter is efficiently
computed avoiding the tedious Cholesky decomposition, solely from the estimates of
AIRs. The design of the eigenfilter also resists spectral nulls in the equivalent channel
so that noise amplification is significantly mitigated at the output of the equalization
process. Then the block-adaptive zero-forcing equalizer eliminates the channel distortion
introduced by the AIRs and the eigenfilter. The proposed technique is found effective for
dereverberation in situations where the speaker moves from his/her position, causing
frequent changes in the AIR. Extensive simulation results are presented using both
simulated and real reverberant channels that demonstrate the superior performance of
the proposed methods as compared to the state-of-the-art dereverberation techniques.
The conclusion, Chapter 7, summarizes the main contributions of this work and
gives directions for further research.
Chapter 2
Literature Review
The effect of reverberation on speech perception was first reported in a patent by Ryall
in 1938 [7]. Since then, so many researchers have contributed to the dereverberation
problem that it has become a complicated task, if not an impossible one, to categorize all these
methods. The classification may be based on different criteria such as the number of
required microphones (single-channel or multi-channel), the need for channel estimation
(channel-estimation based or non-channel-estimation based), or the way the technique
affects the reverberation (those affecting entire reverberation or those affecting late
reverberation). An extensive survey of the dereverberation techniques is presented
in [8] by grouping them in two major classes based on whether the AIRs need to be
estimated or not. The first category that does not require channel estimation is termed
as reverberation suppression technique. The second category that requires channel
knowledge is termed as reverberation cancelation technique. Although the survey seems
to be comprehensive, the classification does not fit the nomenclature: not all the
techniques that are based on channel estimation are capable of canceling reverberation;
rather, they can only reduce the reverberation effect. For this reason, we categorize the
dereverberation techniques as channel-information-based and non-channel-information-based techniques, regardless of whether they cancel or suppress the reverberation.
[Figure 2.1: beamformer structure, with source signal s(n), adaptive filter w(n), and error signal e(n).]
2.1 Non-channel-information-based Techniques
2.1.1 Beamforming
adaptive filter w(n) will try to match the interference in the adaptive branch as closely
as possible to the interference in the nonadaptive branch. The optimal filter w(n) is
found by minimizing the energy of the error signal

(2.1)

Thus, with the GSC-based
2.1.2 LP residual processing
The speech signal is often modeled as the output of a time-varying all-pole filter excited
by a random noise for unvoiced speech and quasi-periodic pulses for voiced speech. The
all-pole filter coefficients can be estimated through Linear Prediction (LP) analysis of
the recorded speech and are commonly called Linear Prediction Coefficients (LPC).
The excitation sequence, or LP residual, can be obtained by inverse filtering of the
speech signal. The motivation for the LP residual based dereverberation techniques
is the observation that in reverberant environments, the LP residual of voiced speech
segments contains the original impulses followed by several other peaks due to multipath reflections. Furthermore, an important assumption is made that the LPCs are
unaffected by reverberation. Consequently, dereverberation is achieved by attenuating
the peaks in the excitation sequence due to multi-path reflections, and synthesizing
the enhanced speech waveform using the modified LP residual and the time-varying
all-pole filter with coefficients calculated from the reverberant speech.
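The LP analysis-synthesis loop described above can be sketched in a few lines of Python. This is a minimal illustration on a synthetic AR(2) signal, not the implementation used in the cited works; the model order and the Levinson-Durbin routine are standard textbook choices:

```python
import random

def lpc(x, order):
    """Levinson-Durbin recursion: LP coefficients a (a[0] = 1) from autocorrelation."""
    N = len(x)
    r = [sum(x[i] * x[i + k] for i in range(N - k)) for k in range(order + 1)]
    a, err = [1.0], r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        a = [1.0] + [a[j] + k * a[m - j] for j in range(1, m)] + [k]
        err *= 1.0 - k * k
    return a

def residual(x, a):
    """LP residual by inverse filtering: e(n) = sum_j a[j] x(n - j)."""
    p = len(a) - 1
    return [sum(a[j] * x[n - j] for j in range(p + 1) if n - j >= 0)
            for n in range(len(x))]

def synthesize(e, a):
    """Excite the all-pole filter 1/A(z) with e(n) to rebuild the waveform."""
    p = len(a) - 1
    x = []
    for n in range(len(e)):
        x.append(e[n] - sum(a[j] * x[n - j] for j in range(1, p + 1) if n - j >= 0))
    return x

# Synthetic AR(2) "speech-like" signal driven by white noise.
rng = random.Random(1)
x = [0.0, 0.0]
for _ in range(2000):
    x.append(0.6 * x[-1] - 0.2 * x[-2] + rng.gauss(0, 1))
x = x[2:]

a = lpc(x, order=2)
e = residual(x, a)        # excitation sequence (LP residual)
x_hat = synthesize(e, a)  # analysis-synthesis round trip
print(max(abs(u - v) for u, v in zip(x, x_hat)))
```

The round trip reproduces the signal to floating-point precision. On real reverberant speech, the residual e(n) would additionally contain the multi-path peaks that the techniques above attenuate before resynthesis.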
Yegnanarayana and Murthy proposed a single-microphone technique to
dereverberate speech [11], [12], and provided a comprehensive study on the effects of
reverberation on the LP residual. The technique is based on analysis of short (2 ms)
segments of data to enhance the regions in the speech signal having low Signal to
Reverberation Ratio (SRR) components. The short segment analysis shows that SRR is
different in different segments of speech. The processing technique involves identifying
and manipulating the LP residual in three different regions of the speech signal,
namely, high SRR region, low SRR region and only reverberation component region. A
weighting function is derived to modify the LP residual. The weighted residual samples
are used to excite the time-varying LP all-pole filter to obtain perceptually enhanced
speech.
Experiments performed by Gillespie [13] showed that the kurtosis of the LP residual
is a reasonable measure of reverberation.
[Figure 2.2: block diagram of spectral-subtraction-based dereverberation. The reverberated signal y(k) is transformed by the STFT to Y(w, m); a suppression gain G(w, m), computed from the late-reverberant energy estimate P(w, m), is applied; and the inverse STFT of Z(w, m) yields the dereverberated signal z(k).]
2.1.3 Spectral enhancement
The spectral enhancement techniques achieve dereverberation by modifying the short-time spectrum of the received microphone signal. The block diagram of the spectral
subtraction based dereverberation technique is illustrated in Fig. 2.2. In this technique,
an estimate of the late reverberant energy, P(w, m), is obtained directly from the
received reverberant signal, and the suppression gain is computed as

G(w, m) = \frac{|Y(w, m)|^2 - P(w, m)}{|Y(w, m)|^2}.   (2.2)

If G(w, m) ≤ 0, then it is replaced by a small positive number. Now the STFT of the
dereverberated speech is obtained as

Z(w, m) = G(w, m) Y(w, m).   (2.3)

The dereverberated signal, z(k), is reconstructed from the estimated STFT, Z(w, m),
through the inverse STFT.
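The gain-and-resynthesis step can be sketched as follows, assuming the common power-subtraction rule for the gain in (2.2) before applying (2.3); the per-bin spectra and late-reverberant estimates below are hypothetical toy values standing in for real STFT frames:

```python
def suppression_gain(y_power, late_power, floor=1e-3):
    """Spectral-subtraction style gain G = (|Y|^2 - P) / |Y|^2, floored so that
    non-positive gains are replaced by a small positive number."""
    g = (y_power - late_power) / y_power if y_power > 0 else floor
    return max(g, floor)

# One STFT frame: hypothetical bins Y(w, m) and late-reverberant energies P(w, m).
Y = [4 + 0j, 2 + 2j, 0.5 + 0j, 1 - 1j]
P = [1.0, 6.0, 0.1, 0.5]

# Z(w, m) = G(w, m) Y(w, m), per eq. (2.3).
Z = [suppression_gain(abs(y) ** 2, p) * y for y, p in zip(Y, P)]
print([round(abs(z), 3) for z in Z])  # → [3.75, 0.707, 0.3, 1.061]
```

Note how the second bin, where the late-reverberant estimate dominates, is strongly attenuated while the others are only mildly scaled; an inverse STFT of the gained frames would then yield the time-domain output.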
Lebart et al. proposed a single-microphone spectral enhancement technique for
speech dereverberation [17] which only requires an estimate of the reverberation time
to calculate the reverberation energy. Lebart et al. assumed that the reverberation time
was frequency independent, and implicitly assumed that the energy related to the direct
sound could be ignored. Wu et al. proposed a two-stage approach for multi-microphone
dereverberation [18]. In the first stage, the LP residual enhancement technique proposed
by Gillespie [13] was used to enhance the Direct to Reverberation Ratio (DRR). In the
second stage, spectral subtraction was used to reduce late reverberation. They used
a heuristic function to estimate the late reverberant energy, thereby assuming that
the first stage was able to reduce a significant amount of reverberation. In [19] a
single-microphone solution was proposed by the same authors using a similar two-stage
approach.
2.1.4 LIME
1. The matrix Q is calculated as Q = (E{u(n−1)u^T(n−1)})^+ E{u(n−1)u^T(n)},
   where u(n) is the multimicrophone received-signal vector at the nth instant and
   A^+ denotes the Moore-Penrose generalized inverse of a matrix A.
[Figure 2.3: LIME structure. The received signals u_1(n), ..., u_M(n) pass through prediction filters w_1(z), ..., w_M(z) to form the prediction error e(n), which is filtered by 1/a(z), the inverse of the estimated AR polynomial, to produce the dereverberated estimate ŝ(n).]
2.1.5 HERB
In the design of a
dereverberation filter, HERB explicitly uses the fact that the source signal has a
harmonic structure. The HERB dereverberation filter, W (k), is calculated as follows:
W(k) = A( X̂(l, k) / X(l, k) )   (2.4)

where X(l, k) and X̂(l, k) are the discrete STFTs of an observed reverberant signal and
of the output of an adaptive harmonic filter at time frame l and frequency bin k,
respectively. Here A(·) is a function that calculates the weighted average of X̂(l, k)/X(l, k) for each
k over different time frames. The adaptive harmonic filter is a time-varying filter
that extracts frequency components whose frequencies correspond to multiples of the
fundamental frequency of a short speech segment. The filter, W (k), has been proven
to approximate the inverse filter of the acoustic transfer function between a speaker
and a microphone. In [26] Kinoshita et al. evaluated the effect on speech intelligibility,
and the potential to use HERB as a preprocessing algorithm for ASR. In both cases
HERB seems to be able to decrease the Word Error Rate (WER) of the ASR system.
The main disadvantage is that they required more than 5000 reverberant words, i.e.,
more than 60 minutes of speech data, to acquire the dereverberation filter under the
assumption that the system is time-invariant.
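The averaging in (2.4) can be illustrated with a toy sketch. The |X|^2 weighting used for A(·) here is an assumption (one common weighting choice, not necessarily the one used in the cited work), and the spectra are hypothetical:

```python
def herb_filter(X, X_hat):
    """HERB-style dereverberation filter, in the spirit of eq. (2.4): for each
    frequency bin k, a weighted average of X_hat(l, k) / X(l, k) over time
    frames l, weighted here by |X(l, k)|^2 (an assumed choice for A)."""
    n_bins = len(X[0])
    W = []
    for k in range(n_bins):
        num = sum((abs(X[l][k]) ** 2) * (X_hat[l][k] / X[l][k])
                  for l in range(len(X)))
        den = sum(abs(X[l][k]) ** 2 for l in range(len(X)))
        W.append(num / den)
    return W

# Toy check: if the harmonic-filter output is a fixed per-bin scaling of the
# observation, the weighted average recovers exactly that scaling.
X = [[1 + 1j, 2 + 0j], [0 + 2j, 1 - 1j]]   # frames l = 0, 1; bins k = 0, 1
scale = [0.5 + 0j, 2 + 0j]
X_hat = [[scale[k] * X[l][k] for k in range(2)] for l in range(2)]
print(herb_filter(X, X_hat))  # → [(0.5+0j), (2+0j)]
```

In a real system the ratios vary from frame to frame, and the averaging is what makes W(k) approximate the inverse of the acoustic transfer function, as stated above.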
2.2
Channel-information-based Techniques
The inversion/equalization of the AIRs can fall under one of two broad categories.
2.2.1
Direct equalization
The direct equalization methods bypass the channel identification stage and attempt to
find the inverse of the AIRs directly from the microphone signals. Bakir et al. proposed
a multichannel direct equalization technique for speech dereverberation based on the
Mutually Referenced Equalizers (MRE) [27]. In MRE, second-order statistics of each
channel's output are used to find a set of equalizers, one for each possible delay. Letting
v_0, v_1, v_2, v_3, and v_4 represent the multichannel equalizers with 0, 1, 2, 3, and 4 samples
of delay, respectively, we can write the MRE cost function as

J = \sum_{i=0}^{3} E\{ (v_i^T x_n - v_{i+1}^T x_{n+1})^2 \}   (2.5)

where x_n is the multichannel received signal vector at the nth sample and x_{n+1} represents
the one-sample-advanced received signal vector. The MRE equalizers can be obtained by
minimizing this cost function using the LMS or RLS adaptive algorithms. Regarding
the application of MRE equalizers to speech dereverberation, two relevant points arise.
First, the number of equalizers needs to be increased to make the method robust to noise in real
environments, which increases the computation time enormously. Second, the algorithm
is sensitive to channel order mismatch and is thus impractical for real situations.
A correlation-based multichannel direct inverse filtering technique was proposed by
Furuya et al. in [28]. Here, the inverse filters were estimated using the correlation matrix
between the multichannel received signals, assuming that the source signal is stationary and
statistically white. Since the whiteness assumption does not hold for speech input, the
received signal needs to be pre-whitened before calculating the coefficients of the
inverse filter. The whitening filter is usually estimated by long-term averaging of the
reverberant speech spectrum with a short-time span. The filter thus obtained only
corresponds to the magnitude spectrum of the AR system transfer function; hence
the pre-whitening becomes erroneous and causes improper inversion of
the AIRs. Moreover, the inverse filter can only reduce the energy in the early reflections,
and significant energy remains in the late reflections; therefore, post-processing is
necessary to reduce the late reverberation.
2.2.2
The methods in this category are the most common and the most studied. BCI has
long been a topic of interest in communication theory. The reason is that in wireless
environments, the cost of channel training and the rapid changes in the channel make
classical channel identification too costly and limited in value. The first generation
of BCI methods relied on higher-order statistics (HOS) to estimate the channel.
However, these methods suffer from one or more problems of slow convergence, high
computational complexity, or local minima. The first breakthrough for equalization
methods came when Tong et al. published results on the feasibility of BCI based solely
on the second-order statistics (SOS) of the output (receiver) signal [29]. The authors
showed that with a single-input multiple-output (SIMO) FIR channel model, and
assuming the input s(n) is independent and identically distributed (i.i.d.), it is
possible to identify all the FIR channels in the SIMO system. A SIMO system is easily
obtained from a SISO system by either temporal or spatial oversampling. Since
the report of Tong et al., several closed-form batch solutions for BCI have been
proposed and reviewed in [30], [31], [32].
Gannot and Moonen [33] use subspace methods for dereverberation both in the
fullband and in the subband domains. Huang and Benesty proposed the cross-relation
between the different microphone signals as an error function for adaptive filters and
used it to derive multichannel LMS and Newton adaptive filters both in the time
domain [34] and in the frequency domain [35]. The main shortcoming of the MCLMS
algorithm, however, is related to the selection of an appropriate step-size, which greatly
influences the speed, final misalignment, and stability of the algorithm. The step-size
depends on the power of the microphone signals, and hence the optimum value varies
with the acoustic environment. This dependency was relaxed by proposing
the normalized multichannel frequency-domain LMS (NMCFLMS) algorithm, in which
[Figure 2.4: NPM (dB) versus number of iterations for the NMCFLMS algorithm at SNR = 25, 30, and 40 dB.]
Several approaches other than direct inversion have been studied. Least-squares
(LS) inverse filters can be designed for speech dereverberation by minimizing the error
function ‖h(n) ∗ g(n) − δ(n − k)‖² [37], which can also be applied in an adaptive
framework [38]. Here, h(n) and g(n) represent the AIR and the equalizer impulse response,
framework [38]. Here, h(n) and g(n) represent the AIR and equalizer impulse response,
respectively. Although LS technique is more noise robust than direct inversion, the
advantage is obtained at the expense of computational complexity as the minimum of
the error function is to be searched for a wide range of delays. Homomorphic inverse
filtering has been investigated [37],[39], where the impulse response is decomposed into a
minimum phase component and an all-pass component. Consequently, magnitude and
phase are equalized separately, where an exact inverse can be found for the magnitude,
while the phase can be equalized, e.g., using matched filtering [39]. It is important to
note that magnitude compensation alone results in audible distortions in the processed
speech signal [36],[39].
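The LS design just described can be sketched by forming the convolution matrix of the channel and solving the normal equations for the equalizer. The channel taps, equalizer length, and target delay below are hypothetical short examples, not values from the cited works:

```python
def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for the normal equations."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ls_inverse(h, Lg, k):
    """LS equalizer g of length Lg minimizing ||h * g - delta(n - k)||^2."""
    n = len(h) + Lg - 1
    # Convolution matrix H: column j is h delayed by j samples.
    H = [[h[i - j] if 0 <= i - j < len(h) else 0.0 for j in range(Lg)]
         for i in range(n)]
    d = [1.0 if i == k else 0.0 for i in range(n)]
    HtH = [[sum(H[i][a] * H[i][b] for i in range(n)) for b in range(Lg)]
           for a in range(Lg)]
    Htd = [sum(H[i][a] * d[i] for i in range(n)) for a in range(Lg)]
    return solve(HtH, Htd)

h = [1.0, 0.5, 0.25]           # hypothetical short (minimum-phase) impulse response
g = ls_inverse(h, Lg=16, k=4)  # equalizer with a 4-sample target delay
eq = convolve(h, g)            # equalized channel, close to delta(n - 4)
err = sum(v * v for i, v in enumerate(eq) if i != 4) + (eq[4] - 1.0) ** 2
print(err < 1e-3)
```

For a long AIR the search over the target delay k, mentioned above as the source of the method's computational cost, would repeat this solve for many candidate delays.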
Speech dereverberation does not need complete equalization of the acoustic
channel; therefore, a shortened channel, which requires less computation with
acceptable performance, can serve the purpose. The LS minimization [40] is a very
popular technique for channel shortening; however, it suffers from severe distortion
of the equalized channel, showing nonuniform spectral attenuation.
To overcome
(2.6)
Thus, exact inverse filtering can be performed. However, it has been observed that
the MINT method has limited value for practical dereverberation problems. Even if
the channel estimate contains moderate estimation errors, equalization using MINT
inversion introduces significant spectral distortion.
2.3 Conclusion
All the existing dereverberation techniques can be divided into two broad classes based
on whether they equalize the AIR or not. The fundamental limitation of the approaches
that do not equalize the AIR is that they cannot eradicate the cause of reverberation
and hence always give suboptimal performance. Therefore, the better approach from
a theoretical point of view would be to equalize the AIRs that caused the reverberation
using a proper inverse filtering technique. The AIRs can be equalized using an inverse
filter directly obtained from the received microphone signals; however, such methods
are very sensitive to additive noise.
Speech dereverberation can be perfectly done through blind identification folowed
by equalization of the AIRs using MINT method. But the MINT method requires
that the AIRs are to be exactly known in advance, which is a very difficult task in a
practical acoustic environment. The single-channel inversion of the AIRs are not as
much sensitive as the MINT method, however, narrowband noise amplification occurs
due to the presence of spectral nulls in the AIRs. In the subsequent chapters, we
present robust multichannel blind adaptive algorithms that can estimate the AIRs
with reasonable accuracy in noisy conditions. Then, utilizing these adaptive
estimates, we consider the equalization of the AIRs while mitigating the noise
amplification problem and preserving the quality of the speech signal.
Chapter 3
Multichannel LMS Algorithm for
Blind Channel Identification:
Robustness Issue
Generally, a blind identification technique aims to retrieve the unknown parameters of a
channel from the received signal only. At first glance, the problem may seem impossible
to solve. How is it possible to distinguish the signal from the channel when neither
is known? The beauty of blind channel identification rests on the exploitation of
structures of the channel and properties of the input to separate the input from the
channel.
3.1
3.1.1
Consider a speech signal recorded inside an echoic room using a linear array of
microphones. The block diagram of the speech acquisition system is shown in Fig. 3.1.
The received signals at the microphones can be modeled as convolutional mixtures of
the speech signal and the impulse responses of the acoustic paths between the source and
the microphones:

y_i(n) = h_i(n) * s(n)    (3.1)

x_i(n) = y_i(n) + v_i(n),    i = 1, 2, ..., M    (3.2)

[Figure 3.1: Block diagram of the speech acquisition system: the source s(n) drives the
acoustic channels H_1(z), H_2(z), ..., H_M(z); each channel output y_i(n) is corrupted by
additive noise v_i(n) to produce the microphone signal x_i(n).]
where M is the number of microphones, and s(n), y_i(n), x_i(n), v_i(n) and h_i(n) denote,
respectively, the clean speech, the reverberant speech, the reverberant speech corrupted
by background noise, the observation noise, and the impulse response from the source to the
ith microphone. Using vector notation, (3.1) can be written as

y_i(n) = h_i^T s(n)    (3.3)

where h_i = [h_{i,0} h_{i,1} ⋯ h_{i,L−1}]^T denotes the length-L impulse response vector of
the ith channel and s(n) = [s(n) s(n−1) ⋯ s(n−L+1)]^T.
A BCI algorithm estimates h_i, i = 1, 2, ..., M, solely from the observations x_i(n),
n = 0, 1, ..., N−1, where N denotes the data length.
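As an illustration, the convolutive SIMO model in (3.1)-(3.3) can be simulated numerically. The following sketch is not part of the dissertation: it uses white Gaussian noise in place of speech, randomly drawn impulse responses, and arbitrary values of M, L, N and the SNR.

```python
import numpy as np

rng = np.random.default_rng(0)

M, L, N = 3, 8, 1000              # microphones, channel length, data length (arbitrary)
s = rng.standard_normal(N)         # source signal (white noise in place of speech)
h = rng.standard_normal((M, L))    # impulse responses h_i(n)

# y_i(n) = h_i(n) * s(n): reverberant (noise-free) channel outputs, Eq. (3.1)
y = np.array([np.convolve(s, h[i])[:N] for i in range(M)])

# x_i(n) = y_i(n) + v_i(n): observations corrupted by additive noise, Eq. (3.2)
snr_db = 20.0
noise_power = y.var(axis=1, keepdims=True) / 10 ** (snr_db / 10)
v = np.sqrt(noise_power) * rng.standard_normal((M, N))
x = y + v

# vector form, Eq. (3.3): y_i(n) = h_i^T s(n) with s(n) = [s(n), ..., s(n-L+1)]^T
n = 500
s_vec = s[n::-1][:L]
assert np.allclose(h[0] @ s_vec, y[0, n])
```

The final check confirms that the sample-wise convolution agrees with the vector form in (3.3).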
3.1.2
Identifiability condition
It follows from (3.3) that, in the absence of noise, a given output sequence y_m(n) can at best
determine a unique input s(n) and a unique channel impulse response h_m(n) up to an unknown
scalar. Given this constraint, the identifiability conditions can be listed as follows [31]:
1. The channel transfer functions do not contain any common zeros.
2. The autocorrelation matrix of the source signal is of full rank.
3. N ≥ 3L + 1.
The identifiability conditions shown above essentially ensure the following intuitive
requirements:
1. The channels cannot be identical; they must be sufficiently different from each other.
2. The input must be a sufficiently complex sequence; it cannot be constant or a single sinusoid.
3. There must be enough output samples available. A set of output data
cannot provide sufficient information about a system having a larger set of
unknown parameters.
3.2
The notion of blind channel identification has been known since the early 1980s. Research
interest in BCI grew during the 1990s, when Tong et al. [29]
explored the possibility of BCI using second-order statistics (SOS). As the SOS contain
sufficient information for blind identification, many other approaches for BCI have been
developed, such as the least-squares approach [30], the maximum-likelihood method [31],
and the subspace method [33].
Although these methods can estimate the channel impulse response, they are generally
computationally intensive and difficult to implement in the adaptive mode. Among the
various techniques proposed so far, the adaptive multichannel least-mean-square (MCLMS)
algorithm [34] outperforms
the aforementioned ones. The beauty of LMS lies in its low computational complexity
and efficiency in real-time applications. In recent years, adaptive filtering in the
frequency domain has attracted a great deal of research interest with a view to reducing
the computational complexities of the convolution and correlation operations needed in
the time-domain algorithm. The frequency-domain implementation of the multichannel
Newton (MCN) algorithm known as the normalized multichannel frequency-domain
LMS (NMCFLMS) has been proposed as an efficient and effective method for BCI
[35]. The main shortcoming of the frequency-domain MCLMS algorithm, however,
is related to the selection of an appropriate step-size, which greatly influences the speed,
final misalignment and stability of the algorithm. Although the step-size ambiguity
is resolved to some extent using the normalizing factor of the NMCFLMS algorithm,
it cannot ensure optimal convergence speed. Moreover, the algorithm diverges from
the desired solution even in a moderate SNR environment [46]. In the subsequent
chapters, a class of step-size-optimized robust MCLMS algorithms will be developed for
blind estimation of the AIRs.
3.2.1
First, we briefly describe the basic MCLMS algorithm proposed in [34]. The method is
based on the cross-relation (CR) between the received signals and different channels
in the noise-free case. The CR is as follows:
yi (n) hj = yj (n) hi .
(3.4)
(3.5)
30
(3.6)
where, eii = 0. The LMS-type adaptive algorithms estimate h by minimizing the cost
function in (3.6). The update equation of the MCLMS algorithm is given by
ĥ(n+1) = ĥ(n) − μ ∇J(n)    (3.7)

where μ is the step-size and the gradient vector is

∇J(n) = [ (∂J(n)/∂ĥ_1)^T ⋯ (∂J(n)/∂ĥ_k)^T ⋯ (∂J(n)/∂ĥ_M)^T ]^T.

The kth partial gradient follows from (3.6) as

∂J(n)/∂ĥ_k = ∂[Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} e_ij²(n)] / ∂ĥ_k
= Σ_{i=1}^{k−1} 2 e_ik(n) x_i(n) − Σ_{j=k+1}^{M} 2 e_kj(n) x_j(n)
= 2 Σ_{i=1}^{M} e_ik(n) x_i(n)

where the last step follows from the facts that e_jk = −e_kj and e_kk = 0. We may express
this equation concisely in matrix form as

∂J(n)/∂ĥ_k = 2 X(n) e_k(n)    (3.8)
where X(n) = [x_1(n) x_2(n) ⋯ x_M(n)] and e_k(n) = [e_1k(n) e_2k(n) ⋯ e_Mk(n)]^T. It is
to be mentioned here that the channel estimate is always normalized after each update
in order to avoid a trivial estimate with all zero elements [34].
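A compact sketch of the MCLMS iteration, assuming the error and gradient definitions in (3.5)-(3.8); the channels, input signal and step-size below are arbitrary illustrative choices, not the dissertation's experimental setup:

```python
import numpy as np

def mclms_step(x_vecs, h_hat, mu):
    """One time-domain MCLMS update (Eqs. 3.5-3.8).
    x_vecs: (M, L) regressors [x_i(n), ..., x_i(n-L+1)];
    h_hat:  (M, L) current channel estimates."""
    A = x_vecs @ h_hat.T
    e = A - A.T                      # e[i, j] = x_i^T h_j - x_j^T h_i, e_kk = 0
    grad = 2.0 * e.T @ x_vecs        # row k: dJ/dh_k = 2 * sum_i e_ik * x_i
    h_new = h_hat - mu * grad
    return h_new / np.linalg.norm(h_new), e  # normalize to avoid the zero solution

rng = np.random.default_rng(2)
M, L, N = 3, 5, 2000
s = rng.standard_normal(N)
h = rng.standard_normal((M, L))
x = np.array([np.convolve(s, h[i])[:N] for i in range(M)])  # noise-free case

# at the true channel (up to scale) every cross-relation error is zero
x_vecs = np.array([x[i, 100::-1][:L] for i in range(M)])
_, e_true = mclms_step(x_vecs, h / np.linalg.norm(h), mu=0.0)
assert np.allclose(e_true, 0.0)

# run the adaptation from a random start
h_hat = rng.standard_normal((M, L))
h_hat /= np.linalg.norm(h_hat)
for n in range(L - 1, N):
    x_vecs = np.array([x[i, n::-1][:L] for i in range(M)])
    h_hat, _ = mclms_step(x_vecs, h_hat, mu=1e-4)
cos = abs(h.ravel() @ h_hat.ravel()) / np.linalg.norm(h)
print("alignment with true channel:", round(float(cos), 3))
```

The first assertion verifies the key property of the cost (3.6): it vanishes exactly at the true channels in the noise-free case.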
3.2.2
Expressing the gradient in terms of the data correlation matrix R̃(n), the update
equation can be written as

ĥ(n+1) = ĥ(n) − 2μ R̃(n) ĥ(n)    (3.9)

where

R̃(n) = [ Σ_{i≠1} R_ii(n)    −R_21(n)          ⋯   −R_M1(n)
          −R_12(n)           Σ_{i≠2} R_ii(n)   ⋯   −R_M2(n)
          ⋮                  ⋮                 ⋱   ⋮
          −R_1M(n)           −R_2M(n)          ⋯   Σ_{i≠M} R_ii(n) ].

Taking the mathematical expectation of (3.9), we obtain

h̄(n+1) = h̄(n) − 2μ R h̄(n)    (3.10)

where h̄(n) = E{ĥ(n)} and R = E{R̃(n)}, and we assume statistical independence
between R̃(n) and ĥ(n).
The autocorrelation matrix R can be diagonalized as

R = U Λ U^T    (3.11)

where U is the unitary matrix whose columns are the eigenvectors of R and Λ is a
diagonal matrix with diagonal elements λ_k, 1 ≤ k ≤ ML, equal to the eigenvalues of
R. Substituting (3.11) into (3.10) and premultiplying by U^T, we obtain
h̄°(n+1) = (I − 2μΛ) h̄°(n)    (3.12)

where I denotes the identity matrix and h̄°(n) = U^T h̄(n) represents an orthogonal
mapping of h̄(n) in the transformed domain. The set of ML first-order difference
equations is now decoupled. Therefore, the solution of the kth equation can be
obtained as [49]

h̄°_k(n) = C_k (1 − 2μλ_k)^n u(n),    k = 1, 2, ..., ML    (3.13)

where h̄°_k(n), k = 1, 2, ..., ML, are the components of h̄°(n), C_k is an arbitrary constant
that depends on the initial value of h̄(n), and u(n) is the unit step function. Now the
channel estimate h̄(n) can be expressed as
h̄(n) = U h̄°(n) = [u_1 ⋯ u_k ⋯ u_ML] [h̄°_1(n) ⋯ h̄°_k(n) ⋯ h̄°_ML(n)]^T.    (3.14)

For a convergent adaptation, |1 − 2μλ_k| < 1 for every k, so each component h̄°_k(n)
decays exponentially; the component associated with the minimum eigenvalue λ_min decays
slowest. After a sufficiently large number of iterations N, therefore,

h̄°_k(N) ≈ 0 for all λ_k ≫ λ_min.    (3.15)

Substituting (3.15) into (3.14), the final estimate of the channel can be approximated as

h̄(N)|_{N very large} ≈ C_min u_min

where u_min is the eigenvector corresponding to the minimum eigenvalue λ_min. Therefore,
we find that the MCLMS algorithm converges to the eigenvector that corresponds to
the minimum eigenvalue of the data correlation matrix.
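The convergence result above can be illustrated numerically: in the noise-free case the stacked true channel vector lies in the null space of the block correlation matrix in (3.9), so the eigenvector of the minimum eigenvalue recovers the channels up to a scale. A sketch with arbitrary random channels (not the dissertation's data):

```python
import numpy as np

rng = np.random.default_rng(3)
M, L, N = 3, 5, 2000
s = rng.standard_normal(N)
h = rng.standard_normal((M, L))
x = np.array([np.convolve(s, h[i])[:N] for i in range(M)])

# stack the regressors [x_i(n), ..., x_i(n-L+1)] for all valid n
Xs = [np.array([x[i, n::-1][:L] for n in range(L - 1, N)]) for i in range(M)]
Rblk = [[Xs[i].T @ Xs[j] / (N - L + 1) for j in range(M)] for i in range(M)]

# assemble the block matrix of (3.9): sum_{i != k} R_ii on the diagonal,
# -R_{lk} in off-diagonal block (k, l)
R = np.zeros((M * L, M * L))
for k in range(M):
    for l in range(M):
        if k == l:
            R[k*L:(k+1)*L, l*L:(l+1)*L] = sum(Rblk[i][i] for i in range(M) if i != k)
        else:
            R[k*L:(k+1)*L, l*L:(l+1)*L] = -Rblk[l][k]

hv = h.ravel()
assert np.allclose(R @ hv, 0.0)        # the true channel lies in the null space of R
w, V = np.linalg.eigh(R)
u_min = V[:, np.argmin(w)]             # eigenvector of the minimum eigenvalue
cos = abs(u_min @ hv) / np.linalg.norm(hv)
assert cos > 0.9                       # ... and it recovers the channels up to a scale
```

Because the per-sample cross-relation errors are exactly zero at the true channels, R hv = 0 holds to machine precision even with finite data.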
3.2.3
The time-domain MCLMS algorithm exploits the channel diversity and minimizes
a cross-relation error criterion between the different microphone signals to obtain
the desired channel estimate. The main advantage of the MCLMS algorithm is the
algebraic simplicity, which makes it a potential choice in many applications requiring
BCI. However, the time-domain MCLMS algorithm has the following limitations,
which make it unsuitable for practical applications.
3.3
In the frequency-domain implementation, the signals are processed in blocks of L
samples. The output of the jth channel estimate driven by the ith microphone signal
over the mth block can be written as

ỹ_ij(m) = C_xi(m) ĥ^10_j(m)    (3.16)

where m is the block time index, C_xi(m) is a 2L × 2L circulant matrix,

ỹ_ij(m) = [y_ij(mL−L) y_ij(mL−L+1) ⋯ y_ij(mL) ⋯ y_ij(mL+L−1)]^T,

C_xi(m) = [ x_i(mL−L)      x_i(mL+L−1)   ⋯   x_i(mL−L+1)
            x_i(mL−L+1)    x_i(mL−L)     ⋯   x_i(mL−L+2)
            ⋮              ⋮             ⋱   ⋮
            x_i(mL)        x_i(mL−1)     ⋯   x_i(mL+1)
            ⋮              ⋮             ⋱   ⋮
            x_i(mL+L−1)    x_i(mL+L−2)   ⋯   x_i(mL−L) ]

and ĥ^10_j(m) = [ĥ_j^T(m) 0^T_{L×1}]^T is the zero-padded channel estimate.
The linear-convolution part of the block output is obtained by windowing:

y_ij(m) = W^01_{L×2L} C_xi(m) ĥ^10_j(m)    (3.17)

where

y_ij(m) = [y_ij(mL) y_ij(mL+1) ⋯ y_ij(mL+L−1)]^T
W^01_{L×2L} = [0_{L×L} I_{L×L}]
W^10_{2L×L} = [I_{L×L} 0_{L×L}]^T
ĥ_j(m) = [ĥ_{j,0}(m) ĥ_{j,1}(m) ⋯ ĥ_{j,L−1}(m)]^T
where I denotes an identity matrix and 0 is a matrix of zeros. A block error signal
based on the cross-relation between the ith and jth channels is determined as
e_ij(m) = y_ij(m) − y_ji(m)
= W^01_{L×2L} [C_xi(m) W^10_{2L×L} ĥ_j(m) − C_xj(m) W^10_{2L×L} ĥ_i(m)].    (3.18)
Let F_{L×L} be the discrete Fourier transform (DFT) matrix of size L × L. Then the
block error sequence in the frequency domain can be expressed as

ē_ij(m) = F_{L×L} e_ij(m)
= W̃^01_{L×2L} [D_xi(m) W̃^10_{2L×L} h̃_j(m) − D_xj(m) W̃^10_{2L×L} h̃_i(m)]    (3.19)
where we have used the fact that a circulant matrix is diagonalized by the DFT matrix,
so that C_xi(m) can be decomposed as

C_xi(m) = F^{−1}_{2L×2L} D_xi(m) F_{2L×2L}    (3.20)

where D_xi(m) is a diagonal matrix whose elements are obtained from the DFT
coefficients of the first column of C_xi(m), and
W̃^01_{L×2L} = F_{L×L} W^01_{L×2L} F^{−1}_{2L×2L}
W̃^10_{2L×L} = F_{2L×2L} W^10_{2L×L} F^{−1}_{L×L}
h̃_j(m) = F_{L×L} ĥ_j(m).
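The decomposition (3.20) is the standard diagonalization of a circulant matrix by the DFT; it can be verified numerically as follows (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
L = 8
c = rng.standard_normal(2 * L)             # first column of the 2L x 2L circulant

# build the circulant matrix C whose (p, q) entry is c[(p - q) mod 2L]
C = np.array([[c[(p - q) % (2 * L)] for q in range(2 * L)] for p in range(2 * L)])

# Eq. (3.20): C = F^{-1} D F, with D diagonal holding the DFT of the first column
F = np.fft.fft(np.eye(2 * L), axis=0)      # DFT matrix
D = np.diag(np.fft.fft(c))
C_rebuilt = np.linalg.inv(F) @ D @ F
assert np.allclose(C, C_rebuilt)

# consequence: a circulant-matrix product is just an FFT-domain multiplication
v = rng.standard_normal(2 * L)
assert np.allclose(C @ v, np.fft.ifft(np.fft.fft(c) * np.fft.fft(v)).real)
```

This is precisely why the frequency-domain algorithm replaces the convolution and correlation operations by element-wise multiplications of DFT coefficients.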
The frequency-domain cost function J_f(m) using the frequency-domain block error
signal ē_ij(m) is defined as

J_f(m) = Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} ē^H_ij(m) ē_ij(m)    (3.21)
where the superscript H denotes the Hermitian transpose. The MCFLMS algorithm approaches the
desired solution by moving along the opposite direction of the gradient at each iteration.
The update equation of the MCFLMS algorithm is given by

h̃_k(m+1) = h̃_k(m) − μ_f ∇J_k(m),    k = 1, 2, ..., M    (3.22)
where μ_f is the step-size in the frequency domain and the gradient vector ∇J_k(m) can
be obtained as

∇J_k(m) = ∂J_f(m) / ∂h̃*_k(m)
= W̃^10_{L×2L} Σ_{i=1}^{M} D^*_xi(m) W̃^01_{2L×L} ē_ik(m)    (3.23)

which, in the zero-padded 2L-point representation used for the update, is commonly
approximated as

∇J^10_k(m) ≈ Σ_{i=1}^{M} D^*_xi(m) W̃^01_{2L×L} ē_ik(m),    k = 1, 2, ..., M.    (3.24)
Concatenating the M impulse response vectors into a longer one, we can write the
update equation for the MCFLMS algorithm as
h̃(m+1) = h̃(m) − μ_f ∇J_f(m)    (3.25)

where

h̃(m) = [h̃_1^T(m) h̃_2^T(m) ⋯ h̃_M^T(m)]^T
∇J_f(m) = [∇J_1^T(m) ∇J_2^T(m) ⋯ ∇J_M^T(m)]^T.
3.4
The MCFLMS algorithm converges to the desired solution with a fast convergence
speed; however, its performance depends critically on the choice of a proper
step-size, which influences the speed of convergence as well as the final misalignment
error. The selection of the step-size depends on the power of the microphone signals,
and hence the algorithm requires re-tuning whenever the acoustic environment changes.
Consequently, the normalized MCFLMS (NMCFLMS) algorithm was proposed that
relaxes the dependency of step-size parameter on the signal power. The algorithm also
reduces the eigenvalue spread of the autocorrelation matrix of the input signal and
thus accelerates the convergence speed.
The update equation of the NMCFLMS algorithm is expressed as
ĥ^10_k(m+1) = ĥ^10_k(m) − ρ P^−1_k(m) Σ_{i=1}^{M} D^*_xi(m) ē^01_ik(m),    k = 1, 2, ..., M    (3.26)
where

P_k(m) = Σ_{i=1,i≠k}^{M} D^*_xi(m) D_xi(m)

ĥ^10_k(m) = W̃^10_{2L×L} h̃_k(m)

ē^01_ik(m) = W̃^01_{2L×L} ē_ik(m).

Here 0 < ρ < 2 is the step-size parameter of the NMCFLMS algorithm. The power
spectrum P_k(m) of the channel outputs is usually computed using a recursive scheme:

P_k(m) = λ P_k(m−1) + (1 − λ) Σ_{i=1,i≠k}^{M} D^*_xi(m) D_xi(m),    k = 1, 2, ..., M    (3.27)

where λ is a smoothing parameter.
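The recursion (3.27) amounts to exponential smoothing of instantaneous power spectra across blocks. A small sketch with white-noise blocks and arbitrary parameters (M, L, λ), not taken from the dissertation's experiments:

```python
import numpy as np

rng = np.random.default_rng(5)
M, L = 3, 16
lam = 0.8                                   # smoothing parameter (lambda)
P = [np.zeros(2 * L) for _ in range(M)]     # P_k(m): one 2L-point spectrum per channel

for m in range(50):                         # stream of signal blocks
    X = np.fft.fft(rng.standard_normal((M, 2 * L)), axis=1)
    for k in range(M):
        # instantaneous power, summed over all channels except k
        inst = sum(np.abs(X[i]) ** 2 for i in range(M) if i != k)
        P[k] = lam * P[k] + (1 - lam) * inst   # Eq. (3.27)

# for unit-variance white noise, E|X_i(f)|^2 = 2L, so the smoothed estimate
# settles near (M - 1) * 2L for this setup
print(float(np.mean(P[0])))
```

The smoothing trades tracking speed against variance of the power estimate, exactly the role λ plays in the NMCFLMS normalization.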
3.5 Convergence Analysis
Using the relation W̃^10_{L×2L} = F_{L×L} W^10_{L×2L} F^{−1}_{2L×2L}, we get

h̃_k(m+1) = h̃_k(m) − ρ W̃^10_{L×2L} P^−1_k(m) Σ_{i=1}^{M} D^*_xi(m) W̃^01_{2L×L} ē_ik(m),    k = 1, 2, ..., M.    (3.28)
For the ease of convergence analysis with noise, the update equation of the NMCFLMS
algorithm can be rewritten as

h̃_k(m+1) = h̃_k(m) − 2ρ W̃^10_{L×2L} P^−1_k(m) W̃^10_{2L×L} W̃^10_{L×2L} Σ_{i=1}^{M} D^*_xi(m) W̃^01_{2L×L} ē_ik(m)
= h̃_k(m) − 2ρ P̃_k(m) Σ_{i=1}^{M} D^*_xi(m) W̃^01_{2L×L} ē_ik(m)    (3.29)

where W̃^10_{2L×L} W̃^10_{L×2L} ≈ (1/2) I_{2L×2L} has been used and

P̃_k(m) = W̃^10_{L×2L} P^−1_k(m) W̃^10_{2L×L}.
Now, using the observation data correlation matrix, (3.29) can be expressed as [35]

h̃_k(m+1) = h̃_k(m) − 2ρ P̃_k(m) [−R̃_1k(m) −R̃_2k(m) ⋯ Σ_{i≠k} R̃_ii(m) ⋯ −R̃_Mk(m)] h̃(m),
k = 1, 2, ..., M    (3.30)

where

h̃(m) = [h̃_1^T(m) h̃_2^T(m) ⋯ h̃_M^T(m)]^T

and the entries R̃_ij(m) are given by

R̃_ij(m) = S^H_xi(m) S_xj(m)

with

S^H_xi(m) = W̃^10_{L×2L} D^*_xi(m) W̃^01_{2L×L}
S_xi(m) = W̃^01_{L×2L} D_xi(m) W̃^10_{2L×L}.
Concatenating the M equations in (3.30), we obtain

h̃(m+1) = h̃(m) − 2ρ P̃(m) R̃_x(m) h̃(m)    (3.31)

where

P̃(m) = [ P̃_1(m)   0        ⋯   0
          0        P̃_2(m)   ⋯   0
          ⋮        ⋮        ⋱   ⋮
          0        0        ⋯   P̃_M(m) ]

and R̃_x(m) is defined as

R̃_x(m) = [ Σ_{i≠1} R̃_ii(m)   −R̃_21(m)          ⋯   −R̃_M1(m)
            −R̃_12(m)          Σ_{i≠2} R̃_ii(m)   ⋯   −R̃_M2(m)
            ⋮                  ⋮                 ⋱   ⋮
            −R̃_1M(m)          −R̃_2M(m)          ⋯   Σ_{i≠M} R̃_ii(m) ].
Taking the mathematical expectation of (3.31), we obtain

h̄(m+1) = h̄(m) − 2ρ P R_x h̄(m)    (3.32)

where h̄(m) = E{h̃(m)}, P = E{P̃(m)} and R_x = E{R̃_x(m)}. To relate the channel
estimate with the eigenvectors of the clean data correlation matrix, we expand the
noisy data autocorrelation matrix R_x as

R_x = R_y + R_v + R_yv + R_vy = R_y + R_n    (3.33)

where R_n = R_v + R_yv + R_vy. Here R_y and R_v denote the clean
data and noise autocorrelation matrices, respectively, and R_yv and R_vy denote the
crosscorrelation matrices between them. What follows is the eigen-analysis of the
NMCFLMS algorithm with noise.
Since the autocorrelation matrix Ry is Hermitian, it can be represented as
R_y = U_y Λ_y U^H_y    (3.34)
Substituting (3.33) and (3.34) into (3.32), we get

h̄(m+1) = h̄(m) − 2ρ P U_y Λ_y U^H_y h̄(m) − 2ρ P R_n h̄(m).    (3.35)
Premultiplying (3.35) by U^H_y, it can be represented as

h̄°(m+1) = h̄°(m) − 2ρ (Λ'_y + U^H_y P R_n U_y) h̄°(m)
= h̄°(m) − 2ρ T h̄°(m)    (3.36)
where

h̄°(m) = U^H_y h̄(m)    (3.37)
Λ'_y = P' Λ_y    (3.38)
P' = Diag[U^H_y P U_y]    (3.39)
T = Λ'_y + U^H_y P R_n U_y.    (3.40)
Therefore, assuming T has a full set of eigenvectors, it can be
diagonalized as T = V D V^{−1}, where V and D are the matrices whose columns and
diagonal elements are, respectively, the eigenvectors and eigenvalues of T. Substituting
T in (3.36) and premultiplying by V^{−1}, we obtain
g(m+1) = (I − 2ρD) g(m)    (3.41)

where

g(m) = V^{−1} h̄°(m).    (3.42)

The set of ML first-order difference equations in (3.41) is now decoupled. The
solution of the kth equation can be obtained as

g_k(m) = C_k (1 − 2ρd_k)^m u(m),    k = 1, 2, ..., ML    (3.43)
Figure 3.2: Amplitude distribution of the transform coefficients, g_k(m), at the end of
60000 iterations.
where Ck is an arbitrary constant which depends on the initial value of g(m), and
u(m) is a unit step function. Clearly, g_k(m) converges to zero exponentially for all d_k
satisfying |1 − 2ρd_k| < 1, except for d_k = 0.
In the absence of noise, T = Λ'_y = D and V becomes an identity matrix. Therefore,
the final value of g_k(m) for the noise-free case becomes

g_k(∞) = 0 when d_k ≠ 0, and g_k(∞) = C_k when d_k = 0.    (3.44)
The profile of g_k(m) for the noise-free case after 60000 iterations is shown in Fig. 3.2,
which justifies the well-known result in (3.44). Using (3.37), (3.42) and (3.44), the
final estimate becomes

h̄(∞) = U_y V g(∞) = U_y g(∞) = C_k u_{ky}|_{λ_k = 0}    (3.45)
3.6
In the presence of noise, however, we can see from (3.40) that the diagonal matrix Λ'_y is
additionally corrupted by the noise term U^H_y P R_n U_y. The resultant matrix T in (3.40)
would be diagonal only if R_n and P were diagonal matrices with equal diagonal entries.
However, by definition, Rv is a matrix with unequal diagonal entries. Also, in practical
cases, Rv contains off-diagonal entries and Ryv , Rvy are non-zero matrices. Therefore,
R_n = R_v + R_yv + R_vy and, in turn, T would be non-diagonal matrices containing unequal
diagonal entries. As a result, in the presence of noise, none of the diagonal entries in matrix
D would be practically zero. Therefore, from (3.43), we can deduce that for a stable
system, gk (m), k = 1, 2, . . . , M L would decay exponentially to zero with iterations
unless a constraint is applied. Thus it appears that no fruitful final output can be
obtained from the NMCFLMS algorithm in the noisy case. However in this analysis,
rather than the actual values of gk (m), k = 1, 2, . . . , M L, the relative values of the
elements are important. Using (3.43), the ratio of the kth component to the i-th one
can be expressed as

g_k(m) / g_i(m) = (C_k / C_i) [(1 − 2ρd_k) / (1 − 2ρd_i)]^m    (3.46)

where i, k = 1, 2, ..., ML. Taking i as the index of the slowest-decaying mode, after a
sufficiently large number of iterations

g_k(m)|_{m very large} = β when k = i, and 0 when k ≠ i    (3.47)

where β is an arbitrary constant. Using (3.42) and (3.47), we can now conclude that
after a sufficiently large number of iterations, h̄° would be equal to a scaled version of
the ith column of V.
[Figure 3.3: (a) Profile of the eigenvalues d_i and (b) profile of the transform
coefficients g_k(m) in the noisy case.]
Then, (3.42) and (3.37) give the final estimate of h̄ in the noisy case as

h̄|_{m very large} = U_y h̄°(m) = U_y V g(m) = β U_y v_i
= β [u_1y u_2y ⋯ u_MLy] [v_1i v_2i ⋯ v_MLi]^T
= β Σ_{k=1}^{ML} v_ki u_ky.    (3.48)
The above equation reveals that, unlike the noise-free solution in (3.45), the
estimate in the noisy case is a weighted sum of the eigenvectors of R_y, where the weights
are the elements of the ith column of the matrix V, i.e., v_i. To vividly show the relation
between the true and the noisy estimate, we rewrite (3.48) as

h̄ = β_0 h + h̄_noisy    (3.49)

where β_0 h is the true channel vector (up to a scale) and h̄_noisy is the contribution
from noise.
Figure 3.4: Amplitude distribution (normalized with respect to the 1st coefficient) of
the elements of v_i (i = 1) after 15000 iterations for SNR = 15 dB. Here, in addition to
the 1st element (corresponding to the zero eigenvalue in the noise-free case), the 2nd, 3rd,
6th, 7th, 9th, 16th, 17th and 18th elements are also significant.
Here, the true impulse response vector h is a scaled version of u_1y (e.g., h = (1/c) u_1y).
Figure 3.5: (a) Magnitude spectrum of the true channel, (b) Magnitude spectrum of the
linear combination of all the eigenvectors of R_y according to the weight profile shown
in Fig. 3.4, (c) Estimated magnitude spectrum using the NMCFLMS, (d) Estimated
impulse responses (concatenated) using the two methods in the time-domain.
The magnitude spectrum of the linear combination of all the eigenvectors of R_y is very
close to that of the NMCFLMS estimate. The small difference between the two curves can
be attributed to incomplete convergence of the adaptive algorithm. It is also interesting
to observe the narrowband shape of the magnitude spectra of the estimates at 15 dB SNR
in Figs. 3.5(b) and (c). This shape, in contrast to the uniform spectrum of the true channel
shown in Fig. 3.5(a), is due to the additional eigenvectors (i.e., the 2nd, 3rd, 6th, 7th, 9th,
16th, 17th and 18th, etc.) with dominant narrowband characteristics. The relative weight
of the eigenvector of the noise-free solution is small compared to the unified strength of the
other vectors in the estimate. The presence of noise thus invokes other eigenvectors in the
solution and deemphasizes the relative effect of the desired one when the SNR is below a certain
threshold value.
3.7
Conclusion
In this chapter, we reviewed the blind channel identification technique and the
identifiability conditions of BCI. We introduced the time- and frequency-domain
multichannel LMS algorithms as effective algorithms for BCI. A detailed convergence
analysis of the NMCFLMS algorithm was also presented, which gave a generalized view of
the final solution in both the noise-free and noisy conditions. It has been shown that
the final solution of the NMCFLMS algorithm comes from the weighted combination
of all the eigenvectors of the clean data correlation matrix. The presence of noise
paves the way for the other eigenvectors to become dominant over the eigenvector
corresponding to the minimum eigenvalue. As a result, the conventional minimum
mean-square cross-relation error solution cannot ensure the desired channel estimate in
the presence of noise.
Chapter 4
Variable Step-size Multichannel
Frequency-Domain LMS for Blind
Identification of FIR Channels
The choice of step-size is a critical factor in blind identification of SIMO channels using
the MCFLMS algorithm. The proper step-size is dependent on the signal power and
it influences the speed of convergence as well as the final misalignment error. The
NMCFLMS algorithm can relax the dependency of step-size parameter on the signal
power, however, it cannot ensure the appropriate step-size for optimal convergence.
We propose a variable-step-size MCFLMS (VSS-MCFLMS) algorithm which optimizes
the performance of the algorithm in each iteration to achieve minimum misalignment
between the true and estimated channel vectors. The proposed VSS ensures the minimum
mean-squared-error solution in the mean for both noise-free and noisy conditions, and is
more noise-robust than the NMCFLMS algorithm. Using theoretical analysis
and numerical examples, it is shown that this step-size guarantees the stability of the
algorithm.
4.1
The update equation of the conventional MCFLMS algorithm can be expressed as (Eq.
3.24)
h̃_k(m+1) = h̃_k(m) − μ_f W̃^10_{L×2L} Σ_{i=1}^{M} D^*_xi(m) W̃^01_{2L×L} ē_ik(m),    k = 1, 2, ..., M.
Concatenating the M impulse response vectors into a longer one, we can write the
update equation for the MCFLMS algorithm as

h̃(m+1) = h̃(m) − μ_f ∇J_f(m)    (4.1)

where

h̃(m) = [h̃_1^T(m) h̃_2^T(m) ⋯ h̃_M^T(m)]^T
∇J_f(m) = [∇J_1^T(m) ∇J_2^T(m) ⋯ ∇J_M^T(m)]^T.

To derive the variable step-size, we consider the squared misalignment after the update,

ε(m+1) = || h̃(m+1) − (1/γ) h ||²    (4.2)

where γ is a constant used to resolve the scaling ambiguity between h and h̃(m+1).
The MCFLMS update equation for an adaptive step-size μ_f(m) can be written as

h̃(m+1) = h̃(m) − μ_f(m) ∇J_f(m).    (4.3)

Substituting (4.3) into (4.2) and expanding,

ε(m+1) = || h̃(m) − (1/γ) h ||² − 2μ_f(m) [h̃(m) − (1/γ) h]^H ∇J_f(m)    (4.4)
+ μ_f²(m) ||∇J_f(m)||².    (4.5)

Setting the derivative of ε(m+1) with respect to μ_f(m) to zero gives the minimizing
step-size

μ_f(m) = [h̃^H(m) − (1/γ) h^H] ∇J_f(m) / ||∇J_f(m)||².    (4.6)

Since the true channel vector h is unknown, the term containing h is neglected; near
the desired solution the gradient tends to be orthogonal to h, so that

(1/γ) h^H ∇J_f(m) ≈ 0.    (4.7)

The practical variable step-size is therefore

μ_f^opt(m) = h̃^H(m) ∇J_f(m) / ||∇J_f(m)||².    (4.8)
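The behaviour of the step-size (4.8) can be checked on a toy problem. The sketch below uses a random Hermitian positive semidefinite matrix as a stand-in for the data correlation matrix, so that the gradient has the form ∇J = R h̃ as in (4.9); it confirms that the step-size is real and that the update removes the component of the estimate along the gradient direction:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 32
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
R = A.conj().T @ A                    # Hermitian PSD stand-in for the data matrix
h_hat = rng.standard_normal(n) + 1j * rng.standard_normal(n)

grad = R @ h_hat                      # gradient of the form R(m) h(m)

# Eq. (4.8): mu(m) = h^H grad / ||grad||^2; real because h^H R h is real
inner = h_hat.conj() @ grad
assert abs(inner.imag) < 1e-6 * abs(inner)
mu = inner.real / np.linalg.norm(grad) ** 2

h_next = h_hat - mu * grad
# the updated estimate is orthogonal to the gradient direction
assert abs(h_next.conj() @ grad) < 1e-6 * np.linalg.norm(grad) ** 2
```

The orthogonality property is exactly what makes (4.8) the misalignment-minimizing choice along the gradient direction.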
In this work, we investigate the effectiveness of the MCFLMS algorithm in (4.1) with
the fixed step-size μ_f replaced by μ_f^opt(m), in both noise-free and noisy conditions.
We also give a performance comparison of the proposed VSS-MCFLMS with the
NMCFLMS [35] in different noisy environments and show the superiority of our
method. Before doing so, the stability and convergence analysis of the proposed
VSS-MCFLMS and the algorithmic difference between the two approaches in the presence
of noise are of interest.
4.2
In this section, we give a theoretical analysis of the mean convergence of the
VSS-MCFLMS algorithm. In particular, we focus on the mechanism by which the adaptive
algorithm for BCI converges to the eigenvector corresponding to the minimum
eigenvalue in both the noise-free and noisy cases.
The gradient can be expressed as

∇J_f(m) = R̃(m) h̃(m)    (4.9)

where R̃(m) is defined as [35]

R̃(m) = [ Σ_{i≠1} R̃_ii(m)   −R̃_21(m)          ⋯   −R̃_M1(m)
          −R̃_12(m)          Σ_{i≠2} R̃_ii(m)   ⋯   −R̃_M2(m)
          ⋮                  ⋮                 ⋱   ⋮
          −R̃_1M(m)          −R̃_2M(m)          ⋯   Σ_{i≠M} R̃_ii(m) ].
Substituting (4.9) into (4.3), the update equation of the VSS-MCFLMS algorithm can
be written as

h̃(m+1) = h̃(m) − μ_f(m) R̃(m) h̃(m).    (4.10)

Taking the mathematical expectation,

h̄(m+1) = h̄(m) − μ̄_f R h̄(m)    (4.11)

where h̄(m) = E{h̃(m)}, R = E{R̃(m)}, and μ̄_f denotes the step-size in the mean.
The matrix R can be diagonalized as

R = U Λ U^H    (4.12)
where U is the unitary matrix whose columns are the eigenvectors of R and Λ is a
diagonal matrix with diagonal elements λ_k, 1 ≤ k ≤ ML, equal to the eigenvalues of
R. Substituting (4.12) into (4.11), we obtain

h̄°(m+1) = (I − μ̄_f Λ) h̄°(m)    (4.13)

where

h̄°(m) = U^H h̄(m).    (4.14)
The set of ML first-order difference equations in (4.13) is now decoupled. Therefore,
the solution of the kth equation can be obtained as

h̄°_k(m) = D_k (1 − μ̄_f λ_k)^m u(m),    k = 1, 2, ..., ML    (4.15)
where h̄°_k(m), k = 1, 2, ..., ML, are the components of h̄°(m), and D_k is an arbitrary
constant. The channel estimate can then be expanded as

h̄(m) = U h̄°(m) = [u_1 ⋯ u_k ⋯ u_ML] [h̄°_1(m) ⋯ h̄°_k(m) ⋯ h̄°_ML(m)]^T    (4.16)
= u_1 h̄°_1(m) + ⋯ + u_k h̄°_k(m) + ⋯ + u_ML h̄°_ML(m).    (4.17)
As all the components except the one corresponding to the minimum eigenvalue decay
away, the final estimate can be expressed as

h̄(∞) = U h̄°(∞) = [u_1 ⋯ u_min ⋯ u_ML] [0 ⋯ D_min ⋯ 0]^T    (4.18)
= D_min u_min.    (4.19)
In the transformed domain, the mean step-size can be expressed as

μ̄_f = h̄^H(m) R h̄(m) / [h̄^H(m) R² h̄(m)]    (4.20)
= Σ_{k=1}^{ML} |h̄°_k(m)|² λ_k / Σ_{k=1}^{ML} |h̄°_k(m)|² λ_k².    (4.21)

Now, we verify the similarity between μ_f(m) obtained in (4.8) and that of (4.21).
Substituting ∇J_f(m) = R̃(m) h̃(m) into (4.8), the step-size becomes

μ_f(m) = h̃^H(m) R̃(m) h̃(m) / [h̃^H(m) R̃²(m) h̃(m)]
= [h̃°(m)]^H Λ(m) h̃°(m) / [h̃°(m)]^H Λ²(m) h̃°(m)
= Σ_{k=1}^{ML} |h̃°_k(m)|² λ_k(m) / Σ_{k=1}^{ML} |h̃°_k(m)|² λ_k²(m).    (4.22)
This shows that although the step-size is computed from complex vectors, it is a real number.
Equations (4.21) and (4.22) are similar in form, which reveals that the minimum-norm
solution of (4.19) is identical to the LS solution of (4.2). Therefore, the variable step-size
represented by (4.8) gives the fastest convergence speed for the MCFLMS algorithm
in the noise-free case.
We now discuss the stability of the MCFLMS algorithm when the proposed variable
step-size is adopted. From (4.15) we see that the VSS-MCFLMS algorithm would become
unstable if, for any value of k, |1 − μ̄_f λ_k| exceeded 1, since h̃°_k(m) would then rise
exponentially. In that case, according to (4.22), μ_f(m) becomes (neglecting all
other components compared to the rising one)

μ_f(m) = |h̃°_k(m)|² λ_k / [|h̃°_k(m)|² λ_k²] = 1 / λ_k.    (4.23)

Therefore, we find that the proposed μ_f(m) is auto-regulated: it forces the blowing-up
component to decay as quickly as possible and thus ensures the stability of the
MCFLMS algorithm. To verify the above statement, a fixed step-size μ_f that causes
divergence was arbitrarily selected for a random multichannel system at SNR = 20 dB
and compared with the estimated variable step-size μ_f(m). The results are shown in
Fig. 4.1. The norm of h̃(m) blows up under the fixed step-size, whereas the adaptation
remains stable under the variable step-size.
[Figure 4.1: Profiles of the step-size, the norm of the channel estimate, and the NPM
versus iterations for the VSS-MCFLMS algorithm.]
In the noisy case, the ratio of the kth orthogonal component to the minimum-eigenvalue
one follows from (4.15) as

h̄°_k(m) / h̄°_min(m) = (D_k / D_min) [(1 − μ̄_f λ_k) / (1 − μ̄_f λ_min)]^m.    (4.24)

Now, for any k except the one which corresponds to the minimum eigenvalue, we can write
|1 − μ̄_f λ_k| < |1 − μ̄_f λ_min|. As a result, the expression in (4.24) diminishes exponentially
to zero as m tends to infinity. It means that after a large number of iterations all the
orthogonal components become negligible compared to the minimum-eigenvalue
component. Therefore, similar to the noise-free case, we can conclude that when
the received data are corrupted by noise, the MCFLMS algorithm converges to the
eigenvector corresponding to the minimum eigenvalue of the data correlation matrix.
4.3 VSS-MCFLMS vs. NMCFLMS: Algorithmic Difference
In this section, we highlight the difference between the proposed VSS-MCFLMS and
the NMCFLMS algorithms in terms of (i) final solution in noisy environments and (ii)
computational complexity.
4.3.1

The update equation of the NMCFLMS algorithm can be written as

ĥ^10_k(m+1) = ĥ^10_k(m) − ρ P^−1_k(m) Σ_{i=1}^{M} D^*_xi(m) W̃^01_{2L×L} ē_ik(m),    k = 1, 2, ..., M    (4.25)

where

P_k(m) = Σ_{i=1,i≠k}^{M} D^*_xi(m) D_xi(m)    (4.26)
and

P̃_k(m) = W̃^10_{L×2L} P^−1_k(m) W̃^10_{2L×L}.

In concatenated form, the NMCFLMS update can be written as

h̃(m+1) = h̃(m) − 2ρ P̃(m) R̃(m) h̃(m)    (4.27)

where

P̃(m) = [ P̃_1(m)   0        ⋯   0
          0        P̃_2(m)   ⋯   0
          ⋮        ⋮        ⋱   ⋮
          0        0        ⋯   P̃_M(m) ].    (4.28)
Taking the expectation,

h̄(m+1) = h̄(m) − 2ρ P R h̄(m)    (4.29)

where P = E{P̃(m)} and we assume statistical independence between P̃(m), R̃(m) and h̃(m).
Substituting (4.12) into (4.29) and premultiplying by U^H, we obtain

h̄°(m+1) = h̄°(m) − 2ρ U^H P U Λ h̄°(m)
≈ (I − 2ρ Λ_p Λ) h̄°(m)    (4.30)

where

Λ_p = Diag[U^H P U]    (4.31)

and Diag[·] refers to a diagonal matrix formed from the diagonal elements of U^H P U. We have
found that U^H P U is very close to a diagonal matrix. Therefore, our approximation in
(4.30) introduces insignificant error.
Observing (4.13) and (4.30), we can clearly visualize the algorithmic difference
between the proposed VSS-MCFLMS and the NMCFLMS. We find that an additional
multiplying factor Λ_p appears in the NMCFLMS algorithm, which modulates the
eigenvalues of the data correlation matrix. From (4.31) we can derive the analytic
expression of the diagonal components of Λ_p, which can be expressed in vector form as
λ_p = [ u²_11 p_1 + u²_12 p_2 + ⋯ + u²_1(ML) p_ML
  u²_21 p_1 + u²_22 p_2 + ⋯ + u²_2(ML) p_ML
  ⋮ ]

where u_kl denotes the (k, l)th element of U and p_l the lth diagonal element of P.
Figure 4.2: (a) Eigenvalue profile of the data correlation matrix. (b) Scaling factor p .
(c) Resultant eigenvalue profile of the NMCFLMS algorithm.
In the NMCFLMS case, the eigenvalue at position 31 becomes the minimum one. As
a result, the NMCFLMS algorithm will misconverge completely, which can be verified
from the adaptive solutions in the simulation section.
4.3.2
The optimal step-size can also be evaluated in the zero-padded 2L-point representation
of the algorithm,

ĥ^10(m+1) = ĥ^10(m) − μ_f ∇J^10_f(m)    (4.32)

where

ĥ^10(m) = W̃^10_{2L×L} h̃(m)    (4.33)

∇J^10_f(m) = W̃^10_{2L×L} ∇J_f(m).    (4.34)

Premultiplying (4.33) and (4.34) by W̃^10_{L×2L}, we get

h̃(m) = W̃^10_{L×2L} ĥ^10(m)    (4.36)

∇J_f(m) = W̃^10_{L×2L} ∇J^10_f(m)    (4.37)

where we have used the relation W̃^10_{L×2L} W̃^10_{2L×L} = I_{L×L}. Substituting (4.36)
and (4.37) into (4.8),

μ_f(m) = [W̃^10_{L×2L} ĥ^10(m)]^H W̃^10_{L×2L} ∇J^10_f(m) / ||W̃^10_{L×2L} ∇J^10_f(m)||²    (4.38)

and, using the approximation [W̃^10_{L×2L}]^H W̃^10_{L×2L} ≈ 0.25 I_{2L×2L},    (4.39)

μ^opt_f(m) = [ĥ^10(m)]^H ∇J^10_f(m) / ||∇J^10_f(m)||².    (4.40)

Therefore, the optimal step-size of the proposed VSS-MCFLMS can be evaluated using
the 2L representation of the algorithm given in (4.32). With this optimal step-size,
(4.32) can be written as

ĥ^10(m+1) = ĥ^10(m) − μ^opt_f(m) ∇J^10_f(m).    (4.41)
[Figure 4.3: Computational load versus channel length L for the VSS-MCFLMS (2L and
L representations) and the NMCFLMS (2L) algorithms.]
4.4 Simulation Results
The normalized projection misalignment (NPM) is used as the performance measure:

NPM(m) = 20 log10( ||ε(m)|| / ||h|| )

where ε(m) = h − [h^T ĥ(m) / (ĥ^T(m) ĥ(m))] ĥ(m) is the projection misalignment vector.
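The NPM defined above is invariant to the scaling ambiguity of BCI, since it measures only the component of h that cannot be explained by ĥ(m). A direct implementation as a sketch; the test vectors below are arbitrary:

```python
import numpy as np

def npm_db(h_true, h_est):
    """Normalized projection misalignment (NPM) in dB between the true and
    estimated concatenated channel vectors; invariant to the scale of h_est."""
    h_true, h_est = np.ravel(h_true), np.ravel(h_est)
    # projection misalignment vector: part of h_true not explained by h_est
    eps = h_true - (h_true @ h_est) / (h_est @ h_est) * h_est
    return 20.0 * np.log10(np.linalg.norm(eps) / np.linalg.norm(h_true))

h = np.array([1.0, -0.5, 0.25, 0.0])           # toy channel vector (arbitrary)
almost_scaled = 3.0 * h + np.array([0.0, 1e-9, 0.0, 0.0])
assert npm_db(h, almost_scaled) < -100          # near-perfect (scaled) estimate
assert -10 < npm_db(h, h + np.array([0.0, 0.0, 0.0, 1.0])) < 0  # poor estimate
```

A perfectly scaled estimate drives the NPM towards minus infinity, while an estimate with an extra, unmatched component yields an NPM of only a few dB below zero.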
4.4.1
First, we present blind channel identification results for a random-coefficient SIMO FIR
system. The impulse responses are generated using the randn function of MATLAB.
[Figure 4.4: Impulse responses of the random-coefficient SIMO system.]
Figure 4.5: NPM profiles of the VSS-MCFLMS and NMCFLMS algorithms for an M = 3
channel, L = 32 random-coefficient SIMO system at SNR = 20 dB.
Figure 4.6: NPM profiles of the VSS-MCFLMS and NMCFLMS algorithms for an M = 3
channel, L = 32 random-coefficient SIMO system at SNR = 40 dB.
4.4.2
A virtual acoustic room is used throughout the dissertation for evaluating the
conventional and proposed channel estimation as well as speech dereverberation
algorithms. The dimensions of the room are taken to be (5 × 4 × 3) m. The schematic
diagram of the room is shown in Fig. 4.7. A linear array consisting of M = 5
[Figure 4.7: Schematic diagram of the virtual acoustic room, showing the speaker and
the linear microphone array.]
Figure 4.8: Virtual sources in a rectangular room. The dotted line from the source
to the listener represents a reflected sound path which is equivalent to the free field
contribution from the indicated virtual source.
4.4.3
Figure 4.9: Noise robustness of the NMCFLMS and the proposed VSS-MCFLMS
algorithms for an M = 5 channel, L = 128 coefficient acoustic system at low SNR.
[Figures: NPM profiles of the VSS-MCFLMS and NMCFLMS algorithms versus iterations
for the acoustic system.]
4.5 Conclusion
In this chapter, we have proposed a variable step-size (VSS) multichannel
frequency-domain LMS (MCFLMS) algorithm. The expression of the VSS has been derived in such
a way that it minimizes the misalignment of the estimated channel vector with the
true one at each iteration. It has been demonstrated that the proposed variable
step-size guarantees the stability of the MCFLMS algorithm. The convergence analysis has
revealed that the VSS-MCFLMS algorithm is more noise-robust than the
NMCFLMS algorithm. In spite of this relative robustness, the VSS-MCFLMS algorithm
is unable to estimate the acoustic channels with speech input even at a moderate SNR of
20 dB. Therefore, we need to improve the robustness of the class of MCLMS algorithms
for speech dereverberation.
Chapter 5
Noise Robust Multichannel Time- and Frequency-Domain LMS-type
Algorithms
In this chapter, we present two novel solutions to improve the noise robustness
of multichannel LMS-type algorithms, in both time- and frequency-domain
implementations. The proposed algorithms are termed the excitation-driven MCLMS
algorithm and the spectrally constrained MCLMS algorithm. The former converges to a
steady-state multi-eigenvector solution instead of the traditional single-eigenvector
solution and thus provides improved robustness. The second approach
relies on the fact that the misconvergence characteristic is associated with nonuniform
spectral attenuation of the estimated channel coefficients. Therefore, a novel cost
function is formulated that inherently opposes such spectral attenuation and thus
contributes to ameliorating the misconvergence of the MCLMS algorithm.
5.1
Excitation-Driven MCLMS Algorithm
It is demonstrated in Chap. 3 that the final estimate of the MCLMS algorithm comes
from only one eigenvector that corresponds to the minimum eigenvalue of the data
68
69
correlation matrix. Here we show that the single eigenvector solution cannot produce
a reasonable estimate of the channel impulse response when observations are corrupted
by noise.
5.1.1
Let the noisy data correlation matrix be decomposed as

R = R_y + R_{yv} + R_{vy} + R_v = R_y + R_n,   (5.1)

so that the desired channel estimate can be defined as

h_d = R^{-1} R_n h,   (5.2)

where R_y and R_v denote the clean data correlation matrix and the noise correlation matrix, respectively, and R_{yv} and R_{vy} denote the cross-correlation matrices between them. For a SIMO system, the true channel vector lies in the null space of the clean data correlation matrix. Therefore, R_y h = 0, and using the eigendecomposition R = U \Lambda U^T, we can write (5.2) as

h_d = U \Lambda^{-1} U^T R_n h.   (5.3)
In order to achieve this solution in the final estimate of the adaptive algorithm, let us simplify R_n. Although R_n is a non-diagonal matrix, its off-diagonal components are generally much smaller than the diagonal ones. If we assume that the signal and
noise are uncorrelated, Rn reduces to Rv . Now, in a multichannel speech enhancement
system, noise may be introduced from the system level (sampling jitter, quantization
noise) as well as from the environment. The system level noise is usually uncorrelated.
70
The environmental noise, in the presence of multiple noise sources, contains both
correlated and uncorrelated noise.
Considering only the uncorrelated noise component, R_v takes the block-diagonal form

R_v = \begin{bmatrix} \sum_{i \neq 1} \sigma_{v_i}^2 I_{L \times L} & 0_{L \times L} & \cdots & 0_{L \times L} \\ 0_{L \times L} & \sum_{i \neq 2} \sigma_{v_i}^2 I_{L \times L} & \cdots & 0_{L \times L} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{L \times L} & 0_{L \times L} & \cdots & \sum_{i \neq M} \sigma_{v_i}^2 I_{L \times L} \end{bmatrix},

where \sigma_{v_i}^2 is the noise power in the ith channel. Therefore, we can write (5.3) as

h_d \approx U \Lambda^{-1} U^T R_v h.   (5.4)
Now, the diagonal components of R_v are divided into M blocks with L identical elements each. Each block corresponds to a specific channel, and its value for the jth channel can be defined as \sigma_j = \sum_{i \neq j} \sigma_{v_i}^2. Since the sensors are close to one another and \sigma_j is the sum of the noise powers on all channels except the jth one, we can approximate \sigma_1 \approx \sigma_2 \approx \cdots \approx \sigma_M = \sigma. Then R_v reduces to \sigma I, a diagonal matrix with equal components, which significantly reduces the systemic complexity and computational load for the adaptive implementation of the noise robust MCLMS algorithm.
With this approximation, the desired estimate becomes

h_d \approx \sigma U \Lambda^{-1} h^o = \sigma \begin{bmatrix} u_1 & \cdots & u_k & \cdots & u_{ML} \end{bmatrix} \begin{bmatrix} h^o_1/\lambda_1 \\ \vdots \\ h^o_k/\lambda_k \\ \vdots \\ h^o_{ML}/\lambda_{ML} \end{bmatrix}   (5.5)

= \sigma \sum_{k=1}^{ML} \frac{h^o_k}{\lambda_k} u_k,   (5.6)
where h^o = U^T h and h^o_k is the kth component of h^o. Therefore, we find that the
desired channel estimate is a weighted combination of all the eigenvectors with weight
profile inversely proportional to the eigenvalues. This relationship is a generalization
of the single-eigenvector solution applicable for the noise-free case. Since the minimum
eigenvalue is zero in the noise-free case, the corresponding eigenvector should receive
infinite weight in the linear combination of the eigenvectors. As a result, the MCLMS
algorithm, which always converges to the eigenvector corresponding to the smallest
eigenvalue, can converge to the desired solution only in the noise-free condition.
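The claim can be checked numerically. The sketch below (NumPy; the dimensions and noise powers are hypothetical toy values, not the dissertation's experiment) builds a clean correlation matrix with the true channel in its null space, adds unequal per-channel noise on the diagonal, and compares the smallest-eigenvalue eigenvector with the 1/λ-weighted combination of all eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32                                   # toy dimension (stands in for ML)
h = rng.standard_normal(n)
h /= np.linalg.norm(h)                   # true unit-norm channel vector

# Clean correlation matrix with h in its null space (R_y h = 0).
P = np.eye(n) - np.outer(h, h)
G = rng.standard_normal((n, n))
Ry = P @ (G @ G.T) @ P

# Non-uniform diagonal noise correlation (unequal noise powers).
Rv = np.diag(rng.uniform(0.5, 1.5, n))
R = Ry + Rv                              # noisy data correlation matrix

lam, U = np.linalg.eigh(R)               # eigenvalues in ascending order

h_single = U[:, 0]                         # eigenvector of smallest eigenvalue
h_weighted = U @ ((U.T @ (Rv @ h)) / lam)  # all eigenvectors, weights ~ 1/lambda

def npm_db(h_true, h_est):
    # projection misalignment in dB, blind to the scaling ambiguity
    e = h_true - (h_true @ h_est) / (h_est @ h_est) * h_est
    return 20 * np.log10(np.linalg.norm(e) / np.linalg.norm(h_true) + 1e-300)
```

Since R_y h = 0 implies R h = R_v h, the weighted combination here recovers h up to numerical precision, while the single eigenvector is rotated away from h by the unequal noise powers.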
Now, let us see the impact of the different approximations on the formulation of the desired estimate. Table 5.1 shows the NPM values for M = 5 acoustic channels of length L = 512, obtained using (5.3)-(5.5), and compares these results with the conventional single-eigenvector solution of the MCLMS algorithm. Eq. (5.3) corresponds to the ideal solution, which gives -218.6 dB NPM. Considering the signal and noise to be uncorrelated, we obtain a final estimate of -23.3 dB. Next, due to the assumption that the diagonal components of R_v are equal (made only for analytical simplification), the NPM of the desired solution increases to -18.3 dB. Although the approximation in each stage introduces some degree of error, the weighted combination of the eigenvectors in the noisy condition is a much better solution than the single-eigenvector solution corresponding to the smallest eigenvalue.
Table 5.1: NPM of the desired solution under successive approximations: M = 5, L = 512

    Solution            NPM (dB)
    Equation (5.3)      -218.6
    Equation (5.4)      -23.3
    Equation (5.5)      -18.3
    MCLMS solution      0.0

5.1.2
The proposed robust MCLMS (RMCLMS) algorithm augments the MCLMS update with an excitation term:

\hat{h}(n+1) = \hat{h}(n) - \mu \nabla J(n) + \beta \bar{h}(n),   (5.7)

where \bar{h}(n) = [\bar{h}_1^T(n) \; \bar{h}_2^T(n) \; \cdots \; \bar{h}_M^T(n)]^T acts as an excitation function for the original MCLMS algorithm, coupled through a tunable parameter \beta which is estimated as in [48]:

\beta = \frac{\bar{h}^T(n) \nabla J(n)}{||\bar{h}(n)||^2}.   (5.8)

Here, \bar{h}(n) resembles the true channel vector in its large dominant components and is determined from the adaptive estimate as

\bar{h}_{i,l}(n) = \begin{cases} \mathrm{sign}[\hat{h}_{i,l}(n)], & |\hat{h}_{i,l}(n)| > 0.75 \max_l |\hat{h}_{i,l}(n)| \\ 0, & \text{otherwise,} \end{cases}   (5.9)

where the maximum is taken over the L estimated coefficients of the ith channel. After the initial transient phase, we can approximate \bar{h}(n) as a constant vector, because the positions of the large dominant components in the channel estimate are essentially fixed within the first few iterations.
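A minimal sketch of the thresholding rule in (5.9), for a single channel (function and variable names are illustrative, not from the dissertation):

```python
import numpy as np

def excitation_vector(h_hat, thresh=0.75):
    """Keep only the sign of coefficients whose magnitude exceeds
    `thresh` times the largest magnitude of the channel estimate;
    set all other positions to zero (cf. (5.9))."""
    h_hat = np.asarray(h_hat, dtype=float)
    peak = np.abs(h_hat).max()
    return np.where(np.abs(h_hat) > thresh * peak, np.sign(h_hat), 0.0)

h_hat = np.array([0.05, 0.9, -0.1, -0.8, 0.2])   # toy channel estimate
h_bar = excitation_vector(h_hat)                 # -> [0., 1., 0., -1., 0.]
```

For a multichannel estimate, the rule is applied per channel using that channel's own maximum.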
Therefore, the following convergence analysis of the proposed RMCLMS algorithm can be carried out with \bar{h} held fixed. The update equation can be written as

\hat{h}(n+1) = \hat{h}(n) - 2\mu R \hat{h}(n) + \beta \bar{h},   (5.10)

and transforming with U^T,

\hat{h}^o(n+1) = \hat{h}^o(n) - 2\mu \Lambda \hat{h}^o(n) + \beta \bar{h}^o,   (5.11)

where \hat{h}^o(n) = U^T \hat{h}(n) and \bar{h}^o = [\bar{h}^o_1 \; \bar{h}^o_2 \; \cdots \; \bar{h}^o_{ML}]^T = U^T \bar{h}. The set of ML first-order difference equations in (5.11) is now decoupled. Therefore, the kth equation in the transformed-vector domain can be expressed as

\hat{h}^o_k(n+1) = (1 - 2\mu\lambda_k) \hat{h}^o_k(n) + \beta \bar{h}^o_k.   (5.12)
The solution of the first-order difference equation (5.12) can be written as

\hat{h}^o_k(n) = h_{kH}(n) + h_{kP},   (5.13)

where h_{kH}(n) and h_{kP} denote the homogeneous and particular solutions of \hat{h}^o_k(n), respectively. Similar to (4.15), the homogeneous solution can be written as

h_{kH}(n) = C_k (1 - 2\mu\lambda_k)^n u(n).   (5.14)

The particular solution must satisfy

2\mu\lambda_k h_{kP} = \beta \bar{h}^o_k,   (5.15)

and therefore

h_{kP} = \frac{\beta \bar{h}^o_k}{2\mu\lambda_k}.

Substituting the values of h_{kH}(n) and h_{kP} into (5.13), the solution of the kth difference equation of the robust MCLMS algorithm can be written as

\hat{h}^o_k(n) = C_k (1 - 2\mu\lambda_k)^n u(n) + \frac{\beta \bar{h}^o_k}{2\mu\lambda_k}.

In the presence of noise, none of the \lambda_k is zero, and therefore the transient term represented by (5.14) decays exponentially to zero for any k. Therefore, the final value of \hat{h}^o_k(n) can be obtained as

\hat{h}^o_k(\infty) = \frac{\beta \bar{h}^o_k}{2\mu\lambda_k}, \quad k = 1, 2, \ldots, ML.   (5.16)
Utilizing the relation \hat{h}(n) = U \hat{h}^o(n) and (5.16), the final estimate of the channel can be expressed as

\hat{h}(\infty) = U \begin{bmatrix} \beta \bar{h}^o_1 / (2\mu\lambda_1) \\ \beta \bar{h}^o_2 / (2\mu\lambda_2) \\ \vdots \\ \beta \bar{h}^o_{ML} / (2\mu\lambda_{ML}) \end{bmatrix} = \frac{\beta}{2\mu} U \begin{bmatrix} 1/\lambda_1 & 0 & \cdots & 0 \\ 0 & 1/\lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\lambda_{ML} \end{bmatrix} \bar{h}^o = \frac{\beta}{2\mu} U \Lambda^{-1} \bar{h}^o.   (5.17)
We now compare the final solution given by (5.17) resulting from the RMCLMS
algorithm with the desired solution in (5.5). Both expressions represent the final
solution as a weighted combination of all the eigenvectors with similar weight profile
which include constant terms, inverse of the eigenvalue, and a transform-domain
variable. It is known that the MCLMS-type algorithms can only estimate the channel
impulse response with a scaling factor ambiguity [34]. As a result, the difference in
Table 5.2: Comparison of the final solution using the conventional and robust MCLMS algorithms: M = 5, L = 512, SNR = 10 dB

    Algorithm           NPM (dB)
    MCLMS solution      0.0
    RMCLMS solution     -9.3
the constant terms (\sigma versus \beta/2\mu) is not significant at all. The second term, the inverse of the eigenvalue 1/\lambda_k, is common to both expressions. Finally, by definition, \bar{h}^o resembles the true transform vector h^o. Therefore, the proposed RMCLMS algorithm can reasonably approach the desired solution in terms of a weighted combination of the noisy eigenvectors. The error introduced by approximating h^o by \bar{h}^o is illustrated in Table 5.2 using a numerical example. Here, we see that the NPM value achieved by the RMCLMS solution is -9.3 dB, which is a good estimate of the channel for the given 10 dB noise level. On the contrary, the single eigenvector corresponding to the smallest eigenvalue shows almost 0 dB NPM, a figure that indicates a complete failure of the conventional MCLMS algorithm to estimate the channel in a noisy condition.
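For reference, the NPM figures quoted throughout can be computed with the standard projection-misalignment measure (a sketch; the function name is ours):

```python
import numpy as np

def npm_db(h, h_hat):
    """Normalized projection misalignment in dB: project h onto the
    estimate before differencing, so the score ignores the scaling
    ambiguity inherent in blind channel identification."""
    h = np.asarray(h, dtype=float)
    h_hat = np.asarray(h_hat, dtype=float)
    eps = h - (h @ h_hat) / (h_hat @ h_hat) * h_hat
    return 20.0 * np.log10(np.linalg.norm(eps) / np.linalg.norm(h) + 1e-300)

h = np.array([1.0, 0.5, -0.25])
assert npm_db(h, 3.0 * h) < -100     # a scaled copy is a perfect estimate
```

The projection step is what makes 0 dB (no alignment at all) and strongly negative values (near-perfect alignment) the two extremes seen in Tables 5.1 and 5.2.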
5.1.3
The step size controls the speed of convergence as well as stability of the LMS algorithm.
We first formulate a variable step size for the RMCLMS algorithm that guarantees the
stability and at the same time ensures fast decay of the transient response, giving
rapid convergence to the steady-state solution. The update equation of the RMCLMS
algorithm in the transform domain is given by (5.11). For the stability of the algorithm, we should choose a step size, \mu(n), that causes the transient terms in (5.11) to decay rapidly with iteration. To that end, \mu(n) is selected such that the squared norm of the transient part of (5.11) is minimized at each iteration. Then, a cost function J_o(n) is
defined as

J_o(n) = ||(I - 2\mu(n)\Lambda) \hat{h}^o(n)||^2 = \sum_{k=1}^{ML} |[1 - 2\mu(n)\lambda_k] \hat{h}^o_k(n)|^2.   (5.18)
Setting \partial J_o(n)/\partial \mu(n) = 0 yields

\mu(n) = \frac{\sum_{k=1}^{ML} \lambda_k |\hat{h}^o_k(n)|^2}{2 \sum_{k=1}^{ML} \lambda_k^2 |\hat{h}^o_k(n)|^2} = \frac{[U^T \hat{h}(n)]^T \Lambda [U^T \hat{h}(n)]}{2 [U^T \hat{h}(n)]^T \Lambda^2 [U^T \hat{h}(n)]}.   (5.19)

Using R = U \Lambda U^T, this can be written in terms of the original variables as

\mu(n) = \frac{\hat{h}^T(n) R \hat{h}(n)}{2 \hat{h}^T(n) R^T R \hat{h}(n)}.   (5.20)

Replacing R by its instantaneous estimate R(n) gives the practical form

\mu(n) = \frac{\hat{h}^T(n) R(n) \hat{h}(n)}{2 \hat{h}^T(n) R^T(n) R(n) \hat{h}(n)}.   (5.21)

Since \nabla J(n) = 2 R(n) \hat{h}(n), the variable step-size can be computed directly from the gradient as

\mu(n) = \frac{\hat{h}^T(n) \nabla J(n)}{||\nabla J(n)||^2 + \delta},   (5.22)

where \delta is a small regularization constant. For a single mode k, the optimum step size reduces to

\mu_k = \frac{\lambda_k |\hat{h}^o_k(n)|^2}{2 \lambda_k^2 |\hat{h}^o_k(n)|^2} = \frac{1}{2\lambda_k}.   (5.23)
The proportionate variant assigns each coefficient an individual gain:

\hat{h}_i(n+1) = \hat{h}_i(n) - \mu(n) G_i(n) \nabla J_i(n) + \beta \bar{h}_i(n),   (5.24)

where G_i(n) = \mathrm{diag}[g_{i,0}(n) \; g_{i,1}(n) \; \ldots \; g_{i,l}(n) \; \ldots \; g_{i,L-1}(n)], and g_{i,l}(n) is expressed as

g_{i,l}(n) = \frac{1-\alpha}{2L} + \frac{(1+\alpha)|\hat{h}_{i,l}(n)|}{2 \sum_{l=0}^{L-1} |\hat{h}_{i,l}(n)| + \delta}.   (5.25)
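A sketch of the gain computation in (5.25) (the symbols α and δ follow the usual proportionate-LMS convention; treat the exact names as our reconstruction):

```python
import numpy as np

def proportionate_gains(h_hat, alpha=0.5, delta=1e-8):
    """Per-tap gains: a uniform floor (1-alpha)/(2L) plus a share
    proportional to each tap's magnitude, so dominant taps adapt
    faster; the gains sum to ~1 when delta is negligible."""
    h_abs = np.abs(np.asarray(h_hat, dtype=float))
    L = h_abs.size
    return (1.0 - alpha) / (2.0 * L) + (1.0 + alpha) * h_abs / (2.0 * h_abs.sum() + delta)

g = proportionate_gains(np.array([1.0, 0.1, 0.0, -0.5]))
```

This is why the proportionate version speeds up convergence on sparse acoustic responses: near-zero taps receive only the small floor gain.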
5.1.4
The excitation-driven approach can also be implemented in the frequency domain. Writing the time-domain robust update compactly as

\hat{h}(m+1) = \hat{h}(m) - \rho_f \nabla J_f(m) + \beta_f \bar{h}(m),   (5.26)

we formulate this update equation with a length-2L vector instead of a length-L vector, which reduces the cost of computation as shown in Sec. 4.3.2. To get an equivalent length-2L vector update equation of the MCFLMS algorithm, we pre-multiply (5.26) by W^{10}_{2L \times L}, which gives

\hat{h}^{10}(m+1) = \hat{h}^{10}(m) - \rho_f \nabla J_f^{10}(m) + \beta_f \bar{h}^{10}(m),   (5.27)

where

\hat{h}^{10}(m) = W^{10}_{2L \times L} \hat{h}(m),   (5.28)

W^{10}_{2L \times L} = F_{2L \times 2L} [I_{L \times L} \; 0_{L \times L}]^T F^{-1}_{L \times L}.   (5.29)

In (5.30)-(5.33), the frequency-domain excitation vector \bar{h}^{10}(m) is defined analogously to (5.9): the excitation components take unit magnitude at the dominant positions of the current estimate and zero elsewhere, and are transformed as \bar{h}^{10}_k(m) = W^{10}_{2L \times L} \bar{h}_k(m), k = 1, 2, \ldots, M. The coupling factor is estimated as

\beta_f(m) = \frac{\rho_f [\bar{h}^{10}(m)]^H \nabla J_f^{10}(m)}{||\nabla J_f^{10}(m)||^2 + \delta}.   (5.34)

The proportionate frequency-domain variant is

\hat{h}_i(m+1) = \hat{h}_i(m) - \rho_f(m) G_i(m) \nabla J_{f,i}(m) + \beta_f(m) \bar{h}_i(m), \quad i = 1, \ldots, M,   (5.35)
where

\nabla J_{f,i}(m) = W^{10}_{L \times 2L} \nabla J_i^{10}(m), \qquad \bar{h}_i(m) = W^{10}_{L \times 2L} \bar{h}_i^{10}(m),

W^{10}_{L \times 2L} = F_{L \times L} [I_{L \times L} \; 0_{L \times L}] F^{-1}_{2L \times 2L},

and G_i(m) = \mathrm{diag}[g_{i,0}(m) \; g_{i,1}(m) \; \ldots \; g_{i,l}(m) \; \ldots \; g_{i,L-1}(m)], where

g_{k,l}(m) = \frac{1-\alpha}{2L} + \frac{(1+\alpha)|\hat{h}_{k,l}(m)|}{2 \sum_{l=0}^{L-1} |\hat{h}_{k,l}(m)| + \delta}.
The normalized robust frequency-domain (RNMCFLMS) update for each channel can then be written as

\hat{h}_k^{10}(m+1) = \hat{h}_k^{10}(m) - \rho P_k^{-1}(m) \sum_{i=1}^{M} D^*_{x_i}(m) e^{01}_{ik}(m) + \beta_f(m) \bar{h}_k^{10}(m), \quad k = 1, 2, \ldots, M,   (5.36)

where

P_k(m) = \sum_{i=1, i \neq k}^{M} D^*_{x_i}(m) D_{x_i}(m).
5.1.5
Simulation results
[Figure 5.1: NPM (dB) versus iteration n for the FSS-, VSS-, and PVSS-MCLMS algorithms and their robust (R) counterparts: panels (a) and (b).]

[Figure 5.2: NPM (dB) versus frame index m for the NMCFLMS, VSS-MCFLMS, and PVSS-MCFLMS algorithms and their robust counterparts (RNMCFLMS, VSS-RMCFLMS, PVSS-RMCFLMS): (a) random input at 10 dB SNR; (b) speech input at 15 dB SNR.]
In both cases, the conventional algorithms show good initial convergence, as revealed by the lower NPM values in the early stage of iterations. But following this apparent convergence, the NPM starts to increase until complete misconvergence. On the contrary, the proposed algorithms converge to a steady-state solution with no sacrifice in the speed of convergence. The accuracy of the final estimate is, however, dictated by the noise level of the observation data. Moreover, the proportionate version can improve the final misalignment performance while keeping the same speed of convergence.
Next, we present comparative performance results for the same acoustic systems considered in Chapter 4 using the frequency-domain MCLMS algorithm. In Fig. 5.2, we show the results of channel estimation at 10 dB and 15 dB SNR using random and speech inputs, respectively. The advantage of the frequency domain is readily understood from the number of iterations required to reach final convergence as compared to the time-domain algorithm. However, the misconvergence phenomenon is still prevalent in the frequency domain. Here we find that the proposed excitation function added to the original update equation brings noise robustness to the channel estimation for all variants of the MCFLMS algorithm. In particular, in Fig. 5.2(a) we note that the NMCFLMS algorithm, which shows the highest speed of convergence with the least robust characteristics, now outperforms all the other algorithms.
5.1.6
5.2
Spectrally Constrained MCFLMS Algorithm
It is observed that both the NMCFLMS and VSS-MCFLMS algorithms give a good initial estimate of the channels, followed by rapid divergence from this estimate in the presence of additive noise. This misconvergence is associated with the nonuniform spectral attenuation of the estimated channel impulse response, as illustrated in Section 3.6. Therefore, we propose a modified cost function J_mod(m) = J_f(m) + \lambda(m) J_p(m), where J_f (cf. Eq. 3.21) and J_p are the original and penalty cost functions, respectively, coupled through the Lagrange multiplier \lambda(m). The penalty function that can ameliorate the misconvergence of the MCFLMS-type algorithms is defined in this work as
maximize J_p(m) = \prod_{i=1}^{ML} |\hat{h}_i(m)|^2   (5.37)

subject to

|\hat{h}_1(m)|^2 + |\hat{h}_2(m)|^2 + \cdots + |\hat{h}_{ML}(m)|^2 = \frac{1}{ML},   (5.38)
where (5.38) is ensured by the unit norm constraint imposed on the update equation. Now, substituting (5.38) into (5.37) to eliminate |\hat{h}_{ML}(m)|^2, we obtain

J_p(m) = |\hat{h}_1(m)|^2 \, |\hat{h}_2(m)|^2 \cdots |\hat{h}_{ML-1}(m)|^2 \left[ \frac{1}{ML} - |\hat{h}_1(m)|^2 - \cdots - |\hat{h}_{ML-1}(m)|^2 \right].   (5.39)
Differentiating (5.39) with respect to \hat{h}_k(m), we get

\nabla J_{p,k}(m) = 2 \hat{h}_k(m) \left\{ \frac{1}{ML} - |\hat{h}_1(m)|^2 - \cdots - 2|\hat{h}_k(m)|^2 - \cdots - |\hat{h}_{ML-1}(m)|^2 \right\} \prod_{i=1, i \neq k}^{ML-1} |\hat{h}_i(m)|^2.   (5.40)
We know that the penalty function J_p(m) will be either maximized or minimized when \nabla J_{p,k}(m) = 0 for all k. From (5.40), we see that \nabla J_{p,k}(m) can be zero under two different conditions.

1. When \hat{h}_k(m) = 0 for any k, \nabla J_{p,k}(m) becomes zero. But this condition minimizes the penalty function.

2. When the following condition is satisfied,

|\hat{h}_1(m)|^2 + \cdots + 2|\hat{h}_k(m)|^2 + \cdots + |\hat{h}_{ML-1}(m)|^2 = \frac{1}{ML},   (5.41)

\nabla J_{p,k}(m) also becomes zero. However, this condition maximizes the penalty function.
Therefore, it is clear that if we maximize the penalty function by moving along its gradient, the latter condition will be satisfied. Formulating Eq. (5.40) for each value of k, we may obtain (ML - 1) simultaneous linear equations of the same form as (5.41). Adding all such equations, we get

|\hat{h}_1(m)|^2 + \cdots + |\hat{h}_k(m)|^2 + \cdots + |\hat{h}_{ML-1}(m)|^2 = \frac{ML - 1}{M^2 L^2}.   (5.42)
Subtracting (5.42) from (5.41), we obtain the condition for penalty function maximization as |\hat{h}_k(m)|^2 = 1/(M^2 L^2), which means the penalty function will be maximum when the estimated channel coefficients have uniform magnitude spectra in the frequency domain. Therefore, to combat the nonuniform spectral attenuation problem in the misconvergence phase, spectral flatness can be attached as a constraint to the original cost function J_f(m) via the penalty term J_p(m) using the Lagrange multiplier. The total regularized cost function to be minimized can be defined as J_t(m) = J_f(m) + \lambda(m)\{-J_p(m)\}. The negative sign before J_p(m) ensures that while J_t(m) is minimized, J_p(m) is maximized. The adaptive update rule for this constrained minimization can be readily obtained as

\hat{h}(m+1) = \frac{\hat{h}(m) - \rho_f(m) \nabla J_f(m) + \lambda(m) \rho_f(m) \nabla J_p(m)}{\sqrt{ML} \, ||\hat{h}(m)||},   (5.43)

where the denominator enforces the norm constraint (5.38).
The beauty of the proposed penalty function is that its gradient remains almost inactive compared to the original signal gradient in the initial phase of iterations. This phenomenon stems from the fact that the true channel vector, whether acoustic or random, is spectrally wideband. Thus, the original cost function is expected to be better minimized with a wideband estimate of the channel. This leads to an almost negligible gradient of the penalty term and thereby no noticeable effect on the update equation. However, when misconvergence starts because of nonuniform spectral attenuation in the estimate, the initially dormant gradient of the penalty term becomes active, enforces spectral flatness, and eventually eradicates the misconvergence. In order to simplify the expression of the penalty gradient, we take the natural logarithm of both sides of (5.37). This does not relax the functionality of the penalty term.
The penalty cost function then becomes

J_p(m) = \sum_{i=1}^{ML} \ln(|\hat{h}_i(m)|^2).   (5.44)
Its complex gradient with respect to \hat{h}_k(m) is

\frac{\partial J_p(m)}{\partial \hat{h}_k(m)} = \left( \frac{\partial}{\partial \hat{h}^r_k(m)} + j \frac{\partial}{\partial \hat{h}^i_k(m)} \right) J_p(m) = \frac{2}{|\hat{h}_k(m)|^2} \hat{h}_k(m),

where \hat{h}^r_k(m) = \mathrm{real}\{\hat{h}_k(m)\} and \hat{h}^i_k(m) = \mathrm{imag}\{\hat{h}_k(m)\}. Therefore, we can write \nabla J_p(m) as

\nabla J_p(m) = Q(m) \hat{h}(m),   (5.45)

where Q(m) is a diagonal matrix with diagonal elements 2/|\hat{h}_k(m)|^2, k = 1, 2, \ldots, ML.
The coupling factor \lambda(m) is estimated such that the total gradient becomes zero (\nabla J_t(m) = 0) in the steady-state condition. This gives \nabla J_f(m) = \lambda(m) \nabla J_p(m) and, premultiplying both sides by \nabla J_p^H(m), we can obtain \lambda(m) as

\lambda(m) = \frac{\nabla J_p^H(m) \nabla J_f(m)}{||\nabla J_p(m)||^2}.   (5.46)
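The penalty gradient and coupling factor of (5.44)-(5.46) reduce to a few lines (a sketch; the small `delta` guard against division by zero is our addition):

```python
import numpy as np

def penalty_gradient(h_hat, delta=1e-12):
    """Gradient of J_p = sum_k ln|h_k|^2, i.e. Q(m) h_hat with diagonal
    entries 2 / |h_k|^2 (cf. (5.45))."""
    return 2.0 * h_hat / (np.abs(h_hat) ** 2 + delta)

def coupling_factor(grad_f, grad_p):
    """lambda(m) = grad_p^H grad_f / ||grad_p||^2 (cf. (5.46))."""
    return np.real(np.vdot(grad_p, grad_f)) / np.real(np.vdot(grad_p, grad_p))

# For a spectrally flat estimate (|h_k| = 1) the penalty gradient has
# uniform magnitude 2, exerting no differential pull on the spectrum.
h_flat = np.exp(1j * np.array([0.0, 1.0, 2.0, 3.0]))
g = penalty_gradient(h_flat)
```

When the estimate's spectrum sags at some bin, the 2/|h_k|^2 weighting pulls that bin up hardest, which is the mechanism that opposes nonuniform spectral attenuation.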
Similarly, the spectral constraint J_p(m) can also be attached to the update equation of the normalized MCFLMS (NMCFLMS) algorithm in order to improve its noise robustness. The update equation of the original NMCFLMS algorithm can be expressed as [35]

\hat{h}_k^{10}(m+1) = \hat{h}_k^{10}(m) - \rho P_k^{-1}(m) J_k^{01}(m), \quad k = 1, \ldots, M,   (5.47)

where

\hat{h}_k^{10}(m) = F_{2L \times 2L} [I_{L \times L} \; 0_{L \times L}]^T \hat{h}_k(m),   (5.48)

P_k(m) = \sum_{i=1, i \neq k}^{M} D^*_{x_i}(m) D_{x_i}(m),   (5.49)

J_k^{01}(m) = \sum_{i=1}^{M} D^*_{x_i}(m) W^{01}_{2L \times L} e_{ik}(m).   (5.50)
Concatenating the M impulse response vectors into a longer one, we can write the update equation as

\hat{h}^{10}(m+1) = \hat{h}^{10}(m) - \rho P^{-1}(m) J^{01}(m).   (5.51)

Attaching the spectral penalty, the proposed spectrally constrained (robust) NMCFLMS update becomes

\hat{h}^{10}(m+1) = \hat{h}^{10}(m) - \rho \left[ \nabla J_n^{01}(m) - \lambda_n(m) \nabla J_p^{10}(m) \right],   (5.52)

where

\nabla J_n^{01}(m) = P^{-1}(m) J^{01}(m),   (5.53)

\nabla J_p^{10}(m) = Q^{10}(m) \hat{h}^{10}(m),   (5.54)

\lambda_n(m) = \frac{[\nabla J_p^{10}(m)]^H \nabla J_n^{01}(m)}{||\nabla J_p^{10}(m)||^2}.   (5.55)
In this work, the AIRs will be estimated using (5.52) for speech dereverberation. The extra computational cost required to implement the proposed penalty term is not significant. For example, the total number of multiplications and divisions required by the NMCFLMS algorithm is 4M^2 L + 5M L^2 + 4ML per iteration, whereas the increase in the computational cost due to the added penalty term is only 5ML + 1. The implementation of our blind channel estimation algorithm is shown in Table 5.3.
5.2.1
Simulation results
Table 5.3 (excerpt), Step 2: for k = 1, 2, \ldots, M compute

e_{ik}(mL) = x_i^T(m) \hat{h}_k(m) - x_k^T(m) \hat{h}_i(m),
e_{ik}(m) = [e_{ik}(mL) \; e_{ik}(mL+1) \; \cdots \; e_{ik}(mL+L-1)]^T,
\tilde{e}_{ik}(m) = F_{L \times L} \, e_{ik}(m).
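Step 2 is built on the familiar cross-relation error; a time-domain sketch (toy signals, illustrative names):

```python
import numpy as np

def cross_relation_error(x_i, x_k, h_hat_k, h_hat_i):
    """Cross-relation error e_ik = x_i * h_hat_k - x_k * h_hat_i
    (* is linear convolution); it vanishes when the estimates equal
    the true channels and the observations are noise-free."""
    return np.convolve(x_i, h_hat_k) - np.convolve(x_k, h_hat_i)

rng = np.random.default_rng(1)
s = rng.standard_normal(200)             # unknown source
h1 = np.array([1.0, 0.4, 0.1])           # toy true channels
h2 = np.array([0.8, -0.3, 0.2])
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)

e = cross_relation_error(x1, x2, h2, h1)  # true channels -> ~zero error
```

The identity x_1 * h_2 = s * h_1 * h_2 = x_2 * h_1 is what every MCLMS-type cost function in this chapter minimizes.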
Figure 5.3: NPM profile of the spectrally constrained (RNMCFLMS with \rho = 0.5 and \rho = 0.25, RVSS-MCFLMS) and conventional (NMCFLMS with \rho = 0.5, VSS-MCFLMS) MCFLMS algorithms for acoustic channel identification with white Gaussian input at SNR = 10 dB.
Figure 5.4: NPM profile of the spectrally constrained and conventional MCFLMS algorithms for acoustic channel identification with speech input at SNR = 15 dB.
without sacrificing the speed of convergence, for the reason explained after (5.43). It can also be seen that the step-size parameter \rho acts as a trade-off between the convergence speed and the final misalignment error for the proposed RNMCFLMS algorithm.

We now present the NPM profile of the estimated channel using speech input at SNR 15 dB in Fig. 5.4. For all the algorithms we see good initial convergence. With increased iterations, the NMCFLMS completely misconverges. As stated earlier, the VSS-MCFLMS is more robust than the NMCFLMS and shows a slow misconverging trend. On the contrary, the proposed spectrally constrained algorithms converge steadily toward the final solution.

[Figure 5.5: NPM (dB) versus iterations for the conventional and spectrally constrained algorithms.]
5.2.2
Now, the spectrally constrained RNMCFLMS algorithm will be used for estimating the long AIRs commonly encountered in the dereverberation problem. The AIRs were generated considering the same virtual room described in Section 4.4.2 for a reverberation time T60 = 0.55 s and then truncated to a length of 4400. The sampling frequency was 8 kHz. For the speech input, we consider a female speech signal, sampled at 8 kHz. For the noise, we consider a computer-generated Gaussian random sequence.
Fig. 5.5 shows the channel estimation results with the conventional and proposed RNMCFLMS algorithms at 25 dB SNR with speech input. The conventional algorithm shows good initial convergence, as revealed by the lower NPM values in the early stage of iterations. But following this apparent convergence, the NPM starts to increase until complete misconvergence. On the contrary, the spectrally constrained algorithm converges to a steady-state solution with almost no sacrifice in the speed of convergence. The accuracy of the final estimate is, however, dictated by the noise level of the observation data.

Figure 5.6: (a) The true 5 acoustic channels with 4400 coefficients generated by the image model. (b) Estimated channels using the spectrally constrained RNMCFLMS algorithm. (c) Estimated channels using the original NMCFLMS algorithm.
Fig. 5.6 illustrates the true 5 acoustic channels with 4400 coefficients generated by the image model and those estimated using the spectrally constrained and conventional NMCFLMS algorithms. It is observed that the RNMCFLMS algorithm gives a close estimate of the AIRs, as indicated by the NPM level of around -8 dB. However, without this constraint, the algorithm fails to estimate the channels even in the moderate noise of 25 dB, as shown in Fig. 5.6(c).
Fig. 5.7 shows the convergence profile of the RNMCFLMS algorithm in a time-varying condition. Here, the source position was shifted four times to the left, by 1 cm at each step, during the channel estimation process. The notches in the NPM curve show the instants when the AIRs changed. It is observed that the algorithm steadily converges to the final solution despite frequent changes in the AIRs, without significant perturbation.

Figure 5.7: Channel estimation profile with iterations for time-varying channels using the spectrally constrained RNMCFLMS algorithm.
5.2.3
[Figure 5.8: NPM (dB) versus iterations m.]

Figure 5.9: (a) True acoustic channels obtained from the MARDY database. (b) Estimated channels using the spectrally constrained algorithm.
5.2.4
[Figure 5.10: Channel estimation profile (NPM in dB versus iteration index) for real reverberant channels with M = 5 and M = 8 microphones.]
As the number of microphones increases, the number of zeros common to all the channels is reduced. As a result, the estimation quality improves. However, the more channels there are, the higher the computational complexity. Fig. 5.10 shows the channel estimation profile for real reverberant channels with 5 and 8 microphones. The final NPM value with 8 microphones is around 1.21 dB better than that obtained using 5 microphones.
5.3
Conclusion
In this chapter, we have proposed two novel solutions to improve the noise robustness of both the NMCFLMS and the VSS-MCFLMS algorithms. The first algorithm, termed the excitation-driven MCLMS, converges in the steady state to a weighted combination of all the eigenvectors, giving a noise-robust solution. However, the algorithm is not suitable for the dereverberation problem, as the AIR does not remain time-invariant long enough in a practical situation to allow steady-state convergence. The second technique is free from this limitation, as it can estimate the time-varying AIRs required for speech dereverberation. We have proposed a novel cost function that inherently opposes the nonuniform spectral attenuation resulting from the noisy update vector and thus contributes to ameliorating the misconvergence of the MCFLMS algorithm.
Chapter 6
Robust Speech Dereverberation
Using Channel Information
Robust channel estimation algorithms developed in the previous chapters will now
be utilized for speech dereverberation. We present two different techniques that can
dereverberate speech as well as improve SNR of the signal recorded by an array of
microphones. In the first technique, the focus is primarily on the suppression of
late reverberation, whereas in the second one, the elimination of both early and late reflections is targeted. The proposed techniques do not require a priori information about the AIRs, the location of the source and microphones, or statistical properties of the speech/noise, which are common assumptions in the related literature.
6.1
Dereverberation Using Channel Shortening
Speech dereverberation does not need complete equalization of the acoustic channel; therefore, a shortened channel, which requires less computation while giving acceptable performance, can serve the purpose. The least-squares (LS) minimization method is very common, but it suffers from severe distortion in the perceptual quality of speech and in the signal-to-noise ratio.

[Figure 6.1: Schematic diagram of the proposed dereverberation model. The source s(n) passes through M acoustic channels H_1(z), ..., H_M(z); additive noise v_k(n) gives the microphone signals x_k(n), which are combined into \bar{x}(n) and filtered by the shortening filter W(z) to produce the output \hat{s}(n).]

The schematic diagram of the proposed dereverberation model is shown in Fig. 6.1. The method works in two stages: delay-and-sum beamforming followed by channel shortening.
6.1.1
Delay-and-sum beamforming
In the first stage, we perform delay-and-sum beamforming: the signals received by the microphone array are time-aligned with respect to each other and then added together. The output of the delay-and-sum beamformer can be expressed as

\bar{x}(n) = \frac{1}{M} \sum_{k=1}^{M} x_k(n - \tau_k) = \frac{1}{M} \sum_{k=1}^{M} [y_k(n - \tau_k) + v_k(n - \tau_k)],   (6.1)
where \tau_k is the delay required to compensate for the propagation time of the kth channel. We assume that \tau_k is known to us. Since y_k(n) = s^T(n) h_k, we can write (6.1) as

\bar{x}(n) = s^T(n) \left[ \frac{1}{M} \sum_{k=1}^{M} h_{k,\tau_k} \right] + \bar{v}(n) = s^T(n) \bar{h} + \bar{v}(n),   (6.2)

where h_{k,\tau_k} is the \tau_k-samples-delayed version of h_k and \bar{v}(n) = \frac{1}{M} \sum_{k=1}^{M} v_k(n - \tau_k). Therefore, the beamformer output can be viewed as the output of an equivalent single channel \bar{h}. The noise, being uncorrelated with the speech source, is attenuated at the output of the beamformer when the signals are superimposed. Thus, delay-and-sum beamforming improves the signal-to-noise ratio of the received microphone signal.
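A minimal sketch of (6.1) (NumPy; the circular shift via `np.roll` stands in for an ideal integer-sample delay, and the signals are synthetic):

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Time-align each microphone signal by removing its propagation
    delay (in samples) and average over the M channels (cf. (6.1))."""
    M = len(signals)
    aligned = [np.roll(x, -d) for x, d in zip(signals, delays)]
    return sum(aligned) / M

rng = np.random.default_rng(0)
s = rng.standard_normal(1000)                        # source signal
delays = [0, 3, 7]                                   # assumed-known tau_k
mics = [np.roll(s, d) + 0.5 * rng.standard_normal(s.size) for d in delays]
out = delay_and_sum(mics, delays)
# Averaging M aligned copies preserves the signal while cutting the
# uncorrelated noise power by about 1/M, i.e. an SNR gain of ~10 log10(M) dB.
```

This is the SNR-improvement mechanism described above: the coherent signal survives the average, the independent noise partially cancels.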
6.1.2
Channel shortening
The shortening filter shown in Fig. 6.1 minimizes the energy in the late reflections and hence reduces the reverberation effect. To design the shortening filter, we assume that there is no significant variation in the AIRs until the LS minimization algorithm converges. Let \hat{h}, w and c represent the estimated equivalent SISO channel, the shortening filter, and the equalized channel impulse responses, of lengths L_h, L_w and L_c, respectively. Therefore, we can write

c = \hat{H} w,

where \hat{H} is the tall convolution matrix of \hat{h}, which is L_c \times L_w Toeplitz. The equalized channel response c can be divided into two parts: an early portion c_early and a late portion c_late. Therefore, the cost function for minimizing the energy in the late portion can be written as

J_{late} = \frac{c_{late}^T c_{late}}{c^T c} = \frac{w^T \hat{H}^T D^2 \hat{H} w}{w^T \hat{H}^T \hat{H} w} = \frac{w^T A w}{w^T B w},   (6.3)

where D is a diagonal selection matrix that retains the late part of c, A = \hat{H}^T D^2 \hat{H}, and B = \hat{H}^T \hat{H}.
Keeping the denominator fixed through a norm constraint, the problem reduces to minimizing [58]

J_{late}(l) = w^T(l) A w(l).   (6.4)

The gradient-descent update is

w(l+1) = w(l) - \mu \nabla J_{late}(l),   (6.5)

where \nabla J_{late}(l) = (A + A^T) w(l). Here, the step-size \mu governs the stability and convergence speed of the algorithm. Now, we propose a variable step-size that ensures optimal performance in the adaptation process. Let w_{opt} be the optimum equalizer. We define a cost function

J_\mu(l) = [w_{opt} - w(l+1)]^T [w_{opt} - w(l+1)],   (6.6)

which measures the distance between w(l+1) and w_{opt} at each iteration. Substituting (6.5) into (6.6) and setting \partial J_\mu(l)/\partial \mu(l) = 0, we obtain the expression of the variable step-size as

\mu_{adap}(l) = \frac{\nabla J_{late}^T(l) [w(l) - w_{opt}]}{||\nabla J_{late}(l)||^2},   (6.7)

where ||\cdot|| is the l2 norm. In the above expression, w_{opt} is unknown. However, it can be easily shown that

\nabla J_{late}^T(l) w_{opt} = w^T(l)(A + A^T) w_{opt} = 0,   (6.8)

so that the variable step-size reduces to

\mu_{adap}(l) = \frac{\nabla J_{late}^T(l) w(l)}{||\nabla J_{late}(l)||^2}.   (6.9)
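The iteration (6.5) with the variable step-size (6.9) can be sketched as follows (a toy single-channel example with an exponentially decaying impulse response; the unit-norm renormalization is our stand-in for the constraint that keeps the trivial solution w = 0 away):

```python
import numpy as np

def conv_matrix(h, Lw):
    """Tall (len(h)+Lw-1) x Lw Toeplitz convolution matrix of h."""
    H = np.zeros((len(h) + Lw - 1, Lw))
    for j in range(Lw):
        H[j:j + len(h), j] = h
    return H

rng = np.random.default_rng(0)
h = rng.standard_normal(64) * 0.97 ** np.arange(64)   # decaying toy channel
Lw, early = 32, 8

H = conv_matrix(h, Lw)
d = np.zeros(H.shape[0]); d[early:] = 1.0             # selects the late taps
A = H.T @ (d[:, None] * H)                            # A = H^T D^2 H

w = np.zeros(Lw); w[0] = 1.0                          # start from an impulse
for _ in range(2000):
    grad = 2.0 * A @ w                                # grad of w^T A w (A symmetric)
    mu = (grad @ w) / (grad @ grad + 1e-12)           # variable step-size (6.9)
    w -= mu * grad
    w /= np.linalg.norm(w)                            # keep ||w|| = 1

late_energy = w @ A @ w                               # energy left in c_late
```

Starting from an impulse (so that c = h), the iteration drives the late-tap energy well below its initial value.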
Similar to Section 5.2, a spectral-flatness penalty can be attached to the shortening cost:

maximize J_{pc}(l) = \prod_{i=1}^{L_c} |c_i(l)|^2 = \prod_{i=1}^{L_c} |\hat{h}_i w_i(l)|^2   (6.10)

subject to

|c_1(l)|^2 + |c_2(l)|^2 + \cdots + |c_{L_c}(l)|^2 = \frac{1}{L_c},   (6.11)

where c_i, \hat{h}_i and w_i represent the ith elements of the L_c-point DFTs of c, \hat{h} and w, respectively. Eq. (6.11) can be easily ensured by imposing the unit-norm constraint on the shortened channel vector c. Maximizing the penalty function J_{pc}(l) ensures near spectral flatness of c. The proof of this statement comes from the fact that the product of a set of elements becomes maximum only when all the elements are equal (provided that their sum remains constant). In order to simplify the expression of the penalty gradient, we take the natural logarithm of both sides of (6.10), which does not relax the functionality of the penalty term. Therefore, we can rewrite the penalty cost function as

J_{pc}(l) = \sum_{i=1}^{L_c} \ln(|c_i(l)|^2).   (6.12)

Its gradient with respect to the ith DFT coefficient of the filter is

\frac{\partial J_{pc}(l)}{\partial w_i(l)} = \frac{2}{w_i(l)},   (6.13)

so that, in (6.14)-(6.15), the penalty gradient is again written as a diagonal scaling of the current filter spectrum and combined with \nabla J_{late}(l) through a coupling factor in the update equation, analogously to (5.43).
6.1.3
Simulation results
Figure 6.2: Channel impulse responses: (a) original, (b) estimated using the robust MCLMS algorithm, (c) shortened channel using the estimated impulse responses.
We now evaluate the proposed shortening technique and compare the result with infinity-norm optimization, both using the estimated channels, in terms of direct-to-reverberant energy ratio (DRR), perceptual evaluation of speech quality (PESQ), and signal-to-noise ratio (SNR). DRR is a popular objective measure of room reverberation. If reverberation is considered as noise, DRR is similar to SNR. The DRR is measured as

DRR = 10 \log_{10} \frac{\sum_n d^2(n)}{\sum_n r^2(n)},   (6.16)

where d(n) and r(n) are the direct sound and the reverberant part of the recorded signal, respectively. The time boundary between the direct sound and the reverberant part is usually taken as 50 ms. Table 6.3 shows that the proposed algorithm gives a higher DRR as compared to
infinity-norm optimization. PESQ scores measure the level of speech distortion at the output of the shortened channel. The advantage of PESQ is that it maintains a good correlation with subjective scores over a very wide range of conditions. The highest PESQ value is 5, which indicates that the compared signal is identical to the original signal. The proposed technique gives a better PESQ score and a higher SNR as compared to the infinity-norm optimization. Both the PESQ and SNR results indicate that channel shortening can be a good approach for speech dereverberation.
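A sketch of the DRR measurement in (6.16), applied to an impulse response with a 50 ms direct/reverberant boundary (the toy response and its numbers are illustrative):

```python
import numpy as np

def drr_db(h, fs, boundary_ms=50.0):
    """DRR of an impulse response: energy before the boundary (direct
    sound and early part) over energy after it (reverberant tail)."""
    h = np.asarray(h, dtype=float)
    nb = int(round(boundary_ms * 1e-3 * fs))
    return 10.0 * np.log10(np.sum(h[:nb] ** 2) / np.sum(h[nb:] ** 2))

fs = 8000
h = np.zeros(4000)
h[0] = 1.0                                    # strong direct path
h[400:] = 0.01 * 0.999 ** np.arange(3600)     # decaying reverberant tail
drr = drr_db(h, fs)                           # clearly positive for this h
```

Shortening raises DRR precisely by shrinking the denominator (the tail energy) while leaving the numerator largely intact.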
Figure 6.3: Convergence profile of the shortening algorithm with fixed step-sizes (\mu_{fixed} = 0.01, 0.05, 0.13) and the proposed variable step-size \mu_{adap}.
Figure 6.4: Frequency spectra of the original and shortened channels: (a) original, (b) shortened using the proposed spectral constraint algorithm, (c) shortened using LS minimization.
Although it may appear that the performance improvement of the proposed algorithm over the infinity-norm optimization is not significant in terms of PESQ and SNR, the strength of the proposed algorithm lies in its computational efficiency. For example, the proposed algorithm converges in 500 iterations, whereas the infinity-norm optimization requires 50,000 iterations to produce steady-state results. Moreover, the mean time per iteration is higher for the infinity-norm
Table 6.3: DRR, PESQ and SNR results for reverberated (rev) speech and speech dereverberated by the infinity-norm and proposed algorithms (TIMIT database)

                 DRR (dB)                    PESQ                     SNR (dB)
             rev   ∞-norm  proposed   rev   ∞-norm  proposed   rev   ∞-norm  proposed
    Female   6.56  12.65   13.54      2.09  2.44    2.47       20    25.30   25.80
             5.36  13.72   15.12      1.98  2.26    2.30       20    22.27   23.00
             4.31  13.67   14.97      1.98  2.17    2.28       20    22.04   22.55
             5.37  13.30   14.27      1.99  2.14    2.30       20    23.06   23.94
    Male     5.09  13.64   14.93      2.09  2.28    2.31       20    23.43   24.00
             5.40  13.29   14.40      2.22  2.33    2.41       20    23.18   23.54
             4.11  12.08   13.05      2.12  2.17    2.29       20    23.73   24.03
             4.86  12.36   13.21      2.32  2.38    2.43       20    24.40   24.78
Figure 6.5: Comparison of mean time per iteration for the proposed and infinity-norm algorithms as a function of the channel length L.
algorithm than for the proposed one. The mean time per iteration for the two compared methods is shown in Fig. 6.5, measured on a system comprising a 2.4 GHz Intel(R) Core(TM)2 Quad CPU with 996 MB RAM. Therefore, the proposed algorithm is much faster than the infinity-norm optimization.
Now we investigate the effectiveness of the beamformer in the proposed dereverberation model. The shortening filter could be designed from each of the estimated AIRs instead of obtaining it from the equivalent SISO channel after the beamforming operation.
Table 6.2: Results of SNR, DRR and PESQ improvement with and without the delay-and-sum beamformer

    Shortening filter             SNR (dB)   DRR (dB)   PESQ
    Reverberated (no filter)      20         6.55       2.09
    Individual filter, Ch-1       23.62      12.92      2.15
    Individual filter, Ch-2       17.53      12.19      2.18
    Individual filter, Ch-3       23.41      11.86      2.52
    Individual filter, Ch-4       18.38      12.49      2.15
    Individual filter, Ch-5       18.27      12.95      2.35
    Delay-and-sum beamformed      25.86      13.53      2.47
Table 6.2 compares the dereverberation performance of the individual and delay-and-sum shortening filters in terms of SNR, PESQ and DRR.
The SNR of the reverberated and dereverberated signals is estimated as the ratio
of the energy of the speech component to that of the noise component in the
respective signal. The results show that the shortening filter followed by delay-and-sum
beamforming gives the highest improvement in SNR and DRR. The PESQ points are
also better than most of the individual outputs. Considering all these factors, it can be
remarked that the beamformer plays a positive role in the dereverberation technique.
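The delay-and-sum operation discussed above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the steering delays are assumed to be known integers, whereas in practice they would be derived from the estimated AIRs or the microphone geometry.

```python
import numpy as np

def delay_and_sum(x, delays):
    # x: (M, N) array of microphone signals; delays[m]: integer steering
    # delay (in samples) that time-aligns channel m with the reference channel.
    M, _ = x.shape
    y = np.zeros(x.shape[1])
    for m in range(M):
        y += np.roll(x[m], -delays[m])   # advance channel m by its delay
    return y / M                          # average the aligned channels
```

Aligning the direct-path components before summation reinforces the speech while averaging down uncorrelated noise, which is consistent with the beamformed shortening filter attaining the best SNR in Table 6.2.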
Finally we present speech dereverberation performance using the real reverberant
channels obtained from MARDY. The average improvement in DRR is 7.25 dB, PESQ
improves by 0.38 points and improvement in SNR is 4.96 dB.
6.1.4
Table: DRR, PESQ and SNR of the reverberated (rev) and proposed dereverberated
speech at 30 dB input SNR (TIMIT database)

            DRR (dB)         PESQ            SNR (dB)
Speech      rev    prop.     rev    prop.    rev   prop.
Female 1    8.70   15.87     2.18   2.65     30    34.74
Female 2    8.57   15.94     2.21   2.61     30    34.77
Female 3    8.25   15.20     2.25   2.61     30    35.04
Female 4    9.64   17.10     2.31   2.65     30    35.40
Male 1      7.63   15.11     2.55   2.89     30    34.81
Male 2      7.98   15.48     2.13   2.55     30    35.40
Male 3      7.46   14.17     2.33   2.73     30    35.08
Male 4      8.92   16.34     2.44   2.79     30    34.44
the AIRs. The psychoacoustic properties of the human auditory system allow us to
keep the early reverberation unchanged, but the late reverberation must be suppressed
as much as possible to improve the perceptual quality of the speech
signal. In spite of these advantages, the channel shortening technique has the following
limitations.
1. The shortening filter obtained from the LS minimization is a narrowband filter.
As a result, the early reflections do not remain unchanged after the shortening
process. This introduces speech distortion in the dereverberated signal.
2. The early reverberation causes a spectral distortion called the coloration effect. The
shortening filter cannot eliminate this distortion.
3. No sophisticated SNR improvement technique can be incorporated with the
shortening filter. The reason is that the SNR improvement filter heavily distorts
the AIR, and hence the distortion in the early portion of the shortened channel
is so severe that the perceptual quality of the speech drastically falls.
6.2
6.2.1
If the ZFE is cascaded just after the channel estimation stage, the SNR deteriorates
at the output of the equalizer due to severe noise amplification near the spectral nulls.
Therefore, an SNR improvement scheme is essential before the channel equalization
stage. To this end, we propose a modified eigenfilter in this section. The modification
is made in two ways. First, the conventional design of eigenfilters is computationally
expensive. Therefore, an efficient eigensolver technique is proposed that reduces the
cost of computation by avoiding the Cholesky decomposition. Second, the equivalent
Figure 6.6: Block diagram of the proposed dereverberation technique: the source s(n)
passes through the channels h_1, ..., h_M with additive noise v_1(n), ..., v_M(n); the
received signals x_1(n), ..., x_M(n) are processed by a block DFT stage, channel
estimation, and a block IDFT (F^{-1}) to produce the output ŝ(n).
channel becomes extremely narrowband when the AIRs are filtered through the
eigenfilters. As a result, the speech signals get distorted at the output of the eigenfilters.
To overcome this limitation as well as remove the spectral nulls, a frequency-domain
constraint is attached to the eigenfilters that improves the quality of the dereverberated
speech.
It is reported in the literature that an eigenfilter that maximizes the SNR can
be obtained from the eigenvector of the data correlation matrix corresponding to the
largest eigenvalue [51]. However, speech dereverberation is a blind problem, and we
do not have access to the clean speech signal. One may use the correlation matrix obtained
from the received microphone signals, under the assumption that the desired signal (here,
the speech signal) is a wide-sense stationary (WSS) random process. But speech is
highly nonstationary, and the WSS assumption does not hold. Therefore, the
eigenfilter (the eigenvector corresponding to the largest eigenvalue) estimated from the
data correlation matrix is not a proper choice.
Figure 6.7: Block diagram of the signal path and noise path for the kth channel.
We propose an improved eigenfilter technique utilizing the estimates of the AIRs
that enhances the energy in the signal path as compared to that of the noise path.
Let g_k(n) of length L_g represent the eigenfilter in the kth channel. Now, we can
separate the signal and noise paths as shown in Fig. 6.7, where z_k^{signal} and z_k^{noise} are the
speech and noise components at the output of the filter g_k, respectively. If we design
g_k(n) such that the energy of the signal path is maximized relative to that of the
noise path, the SNR will increase at the output of the eigenfilter. Let h_{eq}(n) of length
L_{eq} = L + L_g - 1 represent the equivalent channel impulse response at the output of
the eigenfilter block. Then, we can write

h_{eq}(n) = H(n)\,g(n)    (6.17)
where g(n) = [g_1^{T}(n)\; g_2^{T}(n)\; \cdots\; g_M^{T}(n)]^{T} is the composite eigenfilter and H(n) =
[H_1(n)\; H_2(n)\; \ldots\; H_M(n)] is the composite convolution matrix, where H_i(n) of size
L_{eq} \times L_g is the convolution matrix of h_i(n). Now, the desired objective function can
be written as

J_c(n) = \frac{h_{eq}^{T}(n)\,h_{eq}(n)}{g^{T}(n)\,g(n)} = \frac{g^{T}(n)\,A(n)\,g(n)}{g^{T}(n)\,g(n)}    (6.18)
where A(n) = H^{T}(n)H(n). The optimal method for maximizing the signal-path energy
finds g(n) so as to maximize g^{T}(n)A(n)g(n) while satisfying g^{T}(n)g(n) = 1. We
see, therefore, that this problem may be viewed as an eigenvalue problem, and the
optimum FIR filter that maximizes J_c(n) can be obtained as

g_{opt}(n) = q_{max}    (6.19)
where q_{max} is the eigenvector associated with the largest eigenvalue of A(n). We can
easily estimate A(n) from the estimates of the AIRs, \hat{h}(n). Since the AIRs may vary with
time, a fixed q_{max} obtained from A(n) at a particular instant would not work for the
entire speech waveform. Therefore, we need to update the matrix A(n) with a new
set of \hat{h}(n) at regular intervals. Moreover, a sharp change in the AIRs usually gives
an abrupt rise in the cost function, and this fluctuation may also be used for updating
the matrix A(n) with a new set of \hat{h}(n) whenever the AIRs change.
An iterative algorithm is proposed here for finding the eigenfilter, which gives a
number of advantages over the one-pass solution. First, we can avoid the computationally
intensive Cholesky decomposition, which may become unstable for large A(n) [58].
Second, the optimum eigenfilter is extremely narrowband in the frequency domain,
causing speech distortion in the output. Moreover, spectral nulls are present in the
equivalent channels, which cause significant noise amplification at the output of the
equalization process.
(6.20)
for finding the eigenvector corresponding to the maximum eigenvalue. The update
equation at the lth iteration can be expressed as

g^{l+1}(n) = \mu\,\hat{A}(n)\,g^{l}(n)    (6.21)

where \mu = 1/\mathrm{Tr}\{\hat{A}(n)\} is the step size and \mathrm{Tr}\{\cdot\} represents the trace of a matrix. The
proof of (6.21) is provided in the Appendix, which shows that the algorithm converges in
the mean to the eigenvector of \hat{A}(n) corresponding to the largest eigenvalue. In order
to enforce spectral flatness of g^{l}(n) in the frequency domain, we formulate a penalty
function similar to (5.37), which is

maximize \; J_{fc}(n) = \prod_{i=1}^{L_g} |\bar{g}^{l}_{i}(n)|^{2}    (6.22)
where \bar{g}^{l}_{i}(n) represents the ith element of the L_g-point discrete Fourier transform (DFT)
of g^{l}(n). Maximizing the penalty function J_{fc}(n) tries to make every
component of \bar{g}^{l}(n) uniform in the frequency domain. Thus it resists spectral nulls
in the equivalent channel impulse response, h_{eq}. In order to simplify the expression of
the penalty gradient, we take the natural logarithm of both sides of (6.22), which does not
relax the functionality of the penalty term. Therefore, we can rewrite the penalty cost
function as

J_{fc}(n) = \sum_{i=1}^{L_g} \ln\big(|\bar{g}^{l}_{i}(n)|^{2}\big).    (6.23)
Differentiating (6.23) with respect to \bar{g}^{l}_{k}(n) gives the kth element of the penalty gradient,

\frac{\partial J_{fc}(n)}{\partial \bar{g}^{l}_{k}(n)} = \frac{2}{\mathrm{conj}\{\bar{g}^{l}_{k}(n)\}}.    (6.24)

Stacking these elements for k = 1, \ldots, L_g gives the gradient vector

\nabla J_{fc}(n) = \big[\,\partial J_{fc}(n)/\partial\bar{g}^{l}_{1}(n)\; \cdots\; \partial J_{fc}(n)/\partial\bar{g}^{l}_{L_g}(n)\,\big]^{T}.    (6.25)

(6.26)
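The iterative eigenfilter with the spectral-flatness penalty can be sketched as below. This is a hedged sketch, not the thesis code: the combination weight `alpha`, the impulse initialization and the per-iteration renormalization are assumptions, and the penalty DFT is taken over the whole filter vector.

```python
import numpy as np

def design_eigenfilter(A, alpha=0.5, n_iter=500):
    # A: matrix A(n) = H^T(n) H(n) built from the estimated AIRs.
    N = A.shape[0]
    mu = 1.0 / np.trace(A)          # step size of (6.21)
    g = np.zeros(N)
    g[0] = 1.0                      # impulse initialization (assumed): flat spectrum
    for _ in range(n_iter):
        G = np.fft.fft(g)           # current frequency response of the filter
        pen = np.real(np.fft.ifft(2.0 / np.conj(G)))  # F^{-1} of the (6.24) gradient
        g = alpha * mu * (A @ g) + (1.0 - alpha) * pen
        g /= np.linalg.norm(g)      # keep g^T g = 1
    return g
```

With `alpha` close to 1 the update reduces to plain power iteration toward the dominant eigenvector; the penalty term pulls the solution toward a spectrally flat filter.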
Step 1  Pre-compute \hat{A}(n) using the estimated channels, \hat{h}(n).
Step 2  Obtain \mu = 1/\mathrm{Tr}\{\hat{A}(n)\}.
Step 3  Compute \hat{A}(n)\,g^{l}(n) and \nabla J_{fc}(n) as defined in (6.25).
Step 4  Update g:
        g^{l+1}(n) = \alpha\,\mu\,\hat{A}(n)\,g^{l}(n) + (1-\alpha)\,F^{-1}\nabla J_{fc}(n)

6.2.2

Dereverberation of speech requires blind equalization of the AIRs. Among the various
linear equalization techniques proposed in the literature, the zero-forcing equalizer
(ZFE) and the minimum mean-square error (MMSE) equalizer are the most common
[60]. Although the MMSE technique is more noise robust than the ZFE, this advantage
comes at the expense of computational complexity, as the minimum of the error
function has to be searched over a wide range of delays. The ZFE is computationally
efficient and gives direct equalization of the AIRs in the frequency domain. However,
the ZFE can lead to considerable noise amplification, which makes it unsuitable for
practical applications. The proposed eigenfilter along with the frequency-domain
constraint can effectively compensate for such SNR degradation by providing adequate
signal power enhancement in the previous stage. Moreover, the ZFE is implemented
in block-adaptive mode for the reasons stated after (6.27).
We can obtain an estimate of the equivalent channel from the speaker to the output of
the eigenfilter, h_{eq}(n), from the estimates of the AIRs, \hat{h}(n), and the impulse response
of the eigenfilter, g(n). Let \bar{h}_{eq} represent the equivalent channel vector in the frequency
domain. Therefore, the kth frequency component of the required ZFE can be expressed
as
a(e^{j\omega_k}) = \frac{\mathrm{conj}\{h_{eq}(e^{j\omega_k})\}}{|h_{eq}(e^{j\omega_k})|^{2} + \delta}    (6.27)

where h_{eq}(e^{j\omega_k}) is the kth frequency component of \bar{h}_{eq}. A small positive number, \delta, is
added in the denominator to avoid division by zero. However, this simple ZFE is not
implementable in practice. The main reason is that the DFT size for obtaining heq
should be, at least, the sum of the lengths of the signal vector and the channel impulse
response vector minus one. But the length of the speech signal is usually undefined.
It is not practically possible to store the entire speech waveform and then perform
zero forcing equalization. Moreover, the AIRs are slowly time-varying. We cannot
assume the same h_{eq} for the entire speech signal. The above-mentioned problems can
be resolved in two different ways. The first is a time-domain filter obtained from
the IDFT of h_{eq}(e^{j\omega_k}). However, the zeros of the AIRs are very close to the unit circle
in the z-plane, and hence the FIR approximation of the inverse filter becomes of very high
order. In other words, the noncausal part of the inverse filter is prohibitively large
for causal implementation. For example, the length of the noncausal part of a typical
inverse filter is around 2.5 × 10^5 taps, which requires a 31.25 s delay for causal implementation.
The second approach is a block-adaptive ZFE in the frequency domain. Although a
block delay is unavoidably introduced in this case, such a delay is smaller than that
of the causal implementation in the time domain. For example, with a block size
of 9L and L = 4400, the block delay is 4.95 s. In this work, we propose a block ZFE
utilizing the overlap-save method [61].
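The delay figures quoted above follow directly from the 8 kHz sampling rate:

```python
fs = 8000                 # sampling frequency (Hz)
L = 4400                  # channel length (samples)

noncausal = 2.5e5         # taps in the noncausal part of a typical inverse filter
print(noncausal / fs)     # 31.25 s delay for a causal time-domain implementation

block = 9 * L             # block size of the frequency-domain ZFE
print(block / fs)         # 4.95 s block delay
```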
The combined output of the signal power enhancement eigenfilters, as shown in Fig.
6.1, can be expressed as

z(n) = x_1(n) * g_1(n) + \cdots + x_M(n) * g_M(n) = s(n) * h_{eq}(n) + v(n)    (6.28)

where h_{eq}(n) = h_1(n) * g_1(n) + \cdots + h_M(n) * g_M(n) and v(n) = v_1(n) * g_1(n) + \cdots +
v_M(n) * g_M(n). The improved SNR at the output of the eigenfilter can be estimated
as

(6.29)
The power of the signal term in (6.28) is significantly enhanced as compared to that
of the noise term due to eigenfiltering in the previous stage. Therefore, the noise term
can be ignored in the derivation of the ZFE for simplification.
Now, we formulate a suitable transformation that converts a block of data at the output
of the eigenfilter into a direct product of the source signal vector and the equivalent
channel vector. Then it becomes easy to dereverberate that block of data by canceling
out the effect of h_{eq}(n) using its estimate. Let \bar{z}(m) represent a vector of length pL
(p \ge 2, an integer) that results from the circular convolution of the source signal
vector, s, and the equivalent channel zero-padded to length pL, h^{10}_{eq}(m) = W^{10}_{pL \times L}\, h_{eq}(m):

\bar{z}(m) = C_s(m)\, h^{10}_{eq}(m)    (6.30)
We can find by inspection that the last (p-1)L points of \bar{z}(m) are identical to the linear
convolution between s and h_{eq}, which can be represented as

z(m) = W^{01}_{(p-1)L \times pL}\, \bar{z}(m) = W^{01}_{(p-1)L \times pL}\, C_s(m)\, W^{10}_{pL \times L}\, h_{eq}(m)    (6.31)

where

z(m) = [\,z(m(p-1)L)\; \cdots\; z((m+1)(p-1)L - 1)\,]^{T}

and

W^{01}_{(p-1)L \times pL} = [\,0_{(p-1)L \times L}\;\; I_{(p-1)L \times (p-1)L}\,],
W^{10}_{pL \times L} = [\,I_{L \times L}\;\; 0_{L \times (p-1)L}\,]^{T}.    (6.32)
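The identity behind (6.31) — that the last (p-1)L samples of a length-pL circular convolution coincide with the linear convolution — can be checked numerically (toy sizes here, not the thesis values):

```python
import numpy as np

p, L = 3, 8                        # block factor and channel length (toy sizes)
rng = np.random.default_rng(1)
s = rng.standard_normal(p * L)     # source block of length pL
h_eq = rng.standard_normal(L)      # equivalent channel of length L

# Circular convolution of length pL, i.e. C_s(m) h_eq^{10} in (6.30)
z_bar = np.fft.ifft(np.fft.fft(s) * np.fft.fft(h_eq, p * L)).real

z_lin = np.convolve(s, h_eq)       # linear convolution for reference

# The last (p-1)L points of the circular result equal the linear convolution
assert np.allclose(z_bar[L:], z_lin[L:p * L])
```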
Since C_s(m) is a circulant matrix, it can be diagonalized as C_s(m) = F^{-1}_{pL \times pL}\, D_s(m)\, F_{pL \times pL},
where D_s(m) is a diagonal matrix whose elements are obtained from the DFT
coefficients of the first column of C_s(m). Substituting D_s(m) into (6.31) and taking the
DFT of z(m), we obtain the frequency-domain block vector as

\tilde{z}(m) = \tilde{W}^{01}_{(p-1)L \times pL}\, D_s(m)\, \tilde{h}^{10}_{eq}(m)    (6.33)

where

\tilde{z}(m) = F_{(p-1)L \times (p-1)L}\, z(m)
\tilde{W}^{01}_{(p-1)L \times pL} = F_{(p-1)L \times (p-1)L}\, W^{01}_{(p-1)L \times pL}\, F^{-1}_{pL \times pL}
\tilde{h}^{10}_{eq}(m) = F_{pL \times pL}\, W^{10}_{pL \times L}\, h_{eq}(m).

Multiplying both sides of (6.33) by \tilde{W}^{01}_{pL \times (p-1)L}, we get

\tilde{W}^{01}_{pL \times (p-1)L}\, \tilde{z}(m) = \tilde{W}^{01}_{pL \times pL}\, D_s(m)\, \tilde{h}^{10}_{eq}(m)    (6.34)

where

\tilde{W}^{01}_{pL \times (p-1)L} = F_{pL \times pL}\, [\,0_{(p-1)L \times L}\;\; I_{(p-1)L \times (p-1)L}\,]^{T}\, F^{-1}_{(p-1)L \times (p-1)L}

\tilde{W}^{01}_{pL \times pL} = \tilde{W}^{01}_{pL \times (p-1)L}\, \tilde{W}^{01}_{(p-1)L \times pL}
                             = F_{pL \times pL} \begin{bmatrix} 0_{L \times L} & 0_{L \times (p-1)L} \\ 0_{(p-1)L \times L} & I_{(p-1)L \times (p-1)L} \end{bmatrix} F^{-1}_{pL \times pL}.

For acoustic channels, L is usually very large, and hence \tilde{W}^{01}_{pL \times pL} can be approximated as

\tilde{W}^{01}_{pL \times pL} \approx \frac{p-1}{p}\, I_{pL \times pL}.    (6.35)
With this approximation, (6.34) becomes

\tilde{W}^{01}_{pL \times (p-1)L}\, \tilde{z}(m) \approx \frac{p-1}{p}\, D_s(m)\, \tilde{h}^{10}_{eq}(m).    (6.36)

Now, the right-hand side of (6.36) is the product of the source data matrix and the
equivalent channel vector in the frequency domain. The term (p-1)/p is simply a scalar
quantity; therefore, its effect can be neglected in the derivation. Now, the block-adaptive
ZFE that can compensate for \tilde{h}^{10}_{eq}(m) in \tilde{z}(m) can be easily obtained as

\tilde{A}(m) = \mathrm{diag}\{\,a_1,\; a_2,\; \ldots,\; a_i,\; \ldots,\; a_{pL}\,\}    (6.37)

where a_i = \mathrm{conj}\{\tilde{h}^{10}_{eq,i}(m)\}\,/\,(|\tilde{h}^{10}_{eq,i}(m)|^{2} + \delta) and \tilde{h}^{10}_{eq,i}(m) is the ith component of the
vector \tilde{h}^{10}_{eq}(m). An estimate of \tilde{A}(m) can be obtained from the estimated AIRs and
the eigenfilter impulse response. Therefore, the source signal block \hat{s}_b(m) can be extracted
from \tilde{z}(m) as

\hat{s}_b(m) = F^{-1}_{pL \times pL}\, \hat{\tilde{A}}(m)\, \tilde{W}^{01}_{pL \times (p-1)L}\, \tilde{z}(m)    (6.38)

where

\hat{s}_b(m) = [\,\hat{s}(m(p-1)L - L)\; \cdots\; \hat{s}(m(p-1)L)\; \cdots\; \hat{s}((m+1)(p-1)L - 1)\,]^{T}.

Now, we can obtain the dereverberated speech block at the output of the equalizer
corresponding to each element of z(m) as

\hat{s}(m) = W^{01}_{(p-1)L \times pL}\, \hat{s}_b(m)    (6.39)

(6.40)
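A minimal overlap-save implementation of this block-adaptive ZFE might look as follows. It is a sketch under simplifying assumptions: the equivalent channel is treated as fixed over the whole signal, and the regularized inverse of (6.37) is applied per DFT bin.

```python
import numpy as np

def block_zfe(z, h_eq, p=3, delta=0.05):
    # z: eigenfilter output; h_eq: estimated equivalent channel (length L);
    # p: block factor, so the DFT size is pL and each block yields (p-1)L
    # new output samples (overlap-save).
    L = len(h_eq)
    N, hop = p * L, (p - 1) * L
    H = np.fft.fft(h_eq, N)
    A = np.conj(H) / (np.abs(H) ** 2 + delta)   # regularized inverse, eq. (6.37)
    out = []
    for start in range(0, len(z) - N + 1, hop):
        Z = np.fft.fft(z[start:start + N])
        s_hat = np.fft.ifft(A * Z).real
        out.append(s_hat[L:])                   # discard the first L (wrapped) samples
    return np.concatenate(out) if out else np.zeros(0)
```

With a well-conditioned minimum-phase channel and a small `delta`, the equalized blocks closely recover the source, at the cost of the (p-1)L-sample block delay discussed above.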
6.2.3
Simulation results
The synthetic acoustic channels were generated using the image model [53], and the real
reverberant channels were obtained from the multichannel acoustic reverberation
database at York (MARDY) [57]. We also present comparative dereverberation
performance using the correlation-based multichannel inversion with spectral
subtraction (ISS) [28], multichannel linear prediction (MLP) [22] and infinity-norm
minimization (∞-norm) [41] algorithms. For speech input, a number of both
female and male utterances, sampled at 8 kHz, were used. The objective measures
used to evaluate the quality of speech are the log-likelihood ratio (LLR), average
segmental SNR (segSNR), weighted spectral slope (WSS) and perceptual evaluation of
speech quality (PESQ).
Figure 6.8: (a) Original channel, (b) estimated channel, and (c) IDFT sequence of the
equalized channel at the output of the ZFE.
6.2.4
The room dimensions were taken to be (5 × 4 × 3) m. A linear array consisting of
M = 5 microphones with a uniform separation of d = 0.2 m was used in the experiment.
The first microphone was positioned at (1.0, 1.5, 1.6) m, and the locations of the other
microphones are obtained by successively adding d = 0.2 m to the y-coordinate of
the first microphone. The initial position of the speaker was fixed at (2.0, 1.2, 1.6) m.
The wall reflection coefficients were 0.9 for all walls, the ceiling and the floor. The length of each
impulse response was L = 4400 samples, and the reverberation time was T60 = 0.55 s.
The additive noise was white zero-mean Gaussian. In all cases, α = 0.5,
δ = 0.05 and an eigenfilter length L_g = 1100 were used. The block length for the ZFE was 9L,
which means a block delay of 4.95 s.
Fig. 6.8 (a) depicts the original channel and (b) the estimated channel using the
Figure 6.9: Frequency spectrum of the equivalent channel from the speaker to the
output of the beamformer (a) without spectral constraint (b) with spectral constraint,
α = 0.5.
Table 6.6: Effect of the eigenfilter on the dereverberation performance of the proposed
scheme

Cases          SNR     LLR     segSNR   WSS     PESQ
Reverberated   20      0.791   3.58     59.63   2.010
Case 1         19.49   0.697   0.74     39.89   2.380
Case 2         22.13   0.511   0.24     40.75   2.454
Case 3         24.58   0.485   0.20     41.49   2.553
robust NMCFLMS algorithm at 20 dB SNR. The NPM between the original and
estimated channels is 7.8 dB. The direct inversion using the MINT method fails
to equalize the AIRs with such an estimate. Fig. 6.8 (c) shows the IDFT sequence of
the equalized channel at the output of the ZFE using the proposed method. We see
that the equalized channel is nearly impulse-like, and both the early and late reflections
are significantly attenuated. As a result, we can say that the dereverberation of the
dereverberation performance of our algorithm. The results reveal that the proposed
technique is slightly dependent on this parameter giving almost similar performance
when the block-length is varied.
Figure 6.10: Spectrogram of the (a) clean speech (b) noisy reverberated speech at 30
dB SNR (c) denoised speech (d) dereverberated using the proposed method.
Figure 6.11: Objective scores (LLR, segSNR and PESQ) of the proposed scheme as a
function of the ZFE block length (in multiples of L).
Table 6.7: Dereverberation performance of the proposed scheme using the estimated
channels at different input SNRs (TIMIT database)

NPM (dB)          SNR (dB)        LLR           segSNR          WSS            PESQ
(est. channel)   input  output   rev    derev  rev    derev    rev    derev   rev   derev
8.23              25     28.2    0.74   0.39   1.15   1.55     51.00  32.88   2.23  2.86
7.76              20     23.0    0.94   0.59   1.34   1.08     50.54  35.58   2.16  2.75
7.60              15     18.0    1.25   0.84   1.77   0.57     49.64  36.71   2.05  2.58
5.33              10     12.7    1.66   1.32   2.63  -0.04     49.06  38.79   1.84  2.32
Table 6.8: Quality of the dereverberated speech in terms of LLR for the proposed
and other state-of-the-art techniques (TIMIT database)

Speech      rev     ∞-norm   ISS     MLP     proposed
Female 1    1.225   1.017    0.989   0.993   0.751
Female 2    1.265   1.074    1.156   0.930   0.915
Female 3    1.039   0.875    1.055   0.709   0.746
Female 4    1.064   0.876    0.935   0.831   0.749
Male 1      0.869   0.690    0.609   0.655   0.545
Male 2      1.118   0.861    1.001   0.918   0.698
Male 3      1.030   0.750    0.804   0.766   0.638
Male 4      1.290   1.057    0.959   0.974   0.886
Table 6.9: Quality of the dereverberated speech in terms of segSNR for the proposed
and other state-of-the-art techniques (TIMIT database)

Speech      rev     ∞-norm   ISS     MLP     proposed
Female 1    3.44    4.93     1.31    2.03    2.41
Female 2    4.51    4.97     1.29    3.69    0.55
Female 3    3.53    3.72     2.64    2.57    0.39
Female 4    3.69    3.69     2.02    4.33    0.45
Male 1      5.89    4.42     2.50    2.76    1.40
Male 2      4.92    5.14     2.33    3.62    1.27
Male 3      5.18    5.34     4.29    2.29    0.49
Male 4      5.30    6.36     2.39    3.96    0.69
Table 6.10: Quality of the dereverberated speech in terms of WSS for the proposed
and other state-of-the-art techniques (TIMIT database)

Speech      rev     ∞-norm   ISS     MLP     proposed
Female 1    68.44   60.34    61.22   64.15   40.99
Female 2    70.58   61.22    80.49   66.85   45.77
Female 3    65.78   58.67    54.71   56.36   42.15
Female 4    61.64   55.36    47.33   62.27   36.82
Male 1      48.56   48.65    45.16   29.85   33.47
Male 2      59.24   52.46    79.26   45.50   39.45
Male 3      51.96   49.07    53.41   42.82   34.47
Male 4      55.62   52.77    72.00   39.37   38.55
Table 6.11: Quality of the dereverberated speech in terms of PESQ for the proposed
and other state-of-the-art techniques (TIMIT database)

Speech      rev     ∞-norm   ISS     MLP     proposed
Female 1    1.883   1.852    2.117   1.756   2.634
Female 2    1.823   2.181    2.051   1.923   2.600
Female 3    1.893   1.964    2.095   1.935   2.624
Female 4    2.061   2.166    2.333   1.978   2.750
Male 1      2.329   2.327    2.346   2.246   2.923
Male 2      1.542   2.169    1.594   2.024   2.572
Male 3      1.960   2.240    2.030   1.962   2.741
Male 4      2.063   2.315    1.958   2.207   2.659
Now, we compare the performance of the proposed method with other state-of-the-art
dereverberation techniques (the ∞-norm, ISS and MLP methods) in Tables 6.8 to
6.11. For the ∞-norm method, the length of the shortening filter was taken as 1100, the
step size in the update equation was set to 0.00001, and the iteration was continued
until the cost function reached a steady-state minimum value. The ISS and MLP
methods were implemented with the same parameters as in the respective papers.
An SNR of 20 dB was considered for evaluating all the techniques. The results show that
the proposed technique performs better than the compared methods in terms of LLR,
segSNR, WSS and PESQ. The average improvement in the LLR is 0.372 point, which is
0.160, 0.196 and 0.106 points better than the ∞-norm, ISS and MLP methods, respectively.
The average improvement in the segSNR is 4.76 dB, which is 5.02, 2.55 and 3.36 dB better
than the ∞-norm, ISS and MLP methods, respectively. The average improvement in
the WSS score is 21.26, which is 15.85, 22.73 and 11.93 points better than the ∞-norm, ISS
and MLP methods, respectively. The average improvement in PESQ is 0.743 point,
which is 0.536, 0.622 and 0.684 points better than the ∞-norm, ISS and MLP methods,
Figure 6.12: Impulse responses of real reverberant acoustic channels. The length of
each impulse response is L = 4400.
respectively.
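The reported average PESQ improvement can be cross-checked against the per-utterance rev and proposed columns of Table 6.11:

```python
rev      = [1.883, 1.823, 1.893, 2.061, 2.329, 1.542, 1.960, 2.063]
proposed = [2.634, 2.600, 2.624, 2.750, 2.923, 2.572, 2.741, 2.659]
avg_impr = sum(p - r for p, r in zip(proposed, rev)) / len(rev)
print(avg_impr)   # approximately 0.74, consistent with the 0.743 quoted in the text
```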
6.2.5
Table 6.12: Quality of the dereverberated speech for the real acoustic channels in terms
of LLR (TIMIT database)

Speech      rev     ∞-norm   ISS     MLP     proposed
Female 1    0.954   0.910    0.790   0.978   0.748
Female 2    1.080   1.077    1.179   0.923   0.815
Female 3    0.831   0.882    0.870   0.690   0.638
Female 4    0.849   0.863    0.822   0.849   0.666
Male 1      0.692   0.783    0.786   0.643   0.512
Male 2      0.954   0.819    1.130   0.890   0.749
Male 3      0.778   0.735    1.130   0.770   0.665
Male 4      1.049   1.030    1.000   0.988   0.789
Table 6.13: Quality of the dereverberated speech for the real acoustic channels in terms
of segSNR (TIMIT database)

Speech      rev     ∞-norm   ISS     MLP     proposed
Female 1    2.24    2.49     1.82    5.03    1.75
Female 2    3.29    3.36     1.32    6.50    0.39
Female 3    2.79    1.98     2.36    5.55    1.83
Female 4    2.69    2.06     4.17    6.07    2.40
Male 1      4.62    4.97     2.50    5.07    1.96
Male 2      5.24    4.77     2.66    6.71    0.50
Male 3      5.08    2.94     3.09    5.72    0.14
Male 4      4.84    5.32     1.75    6.43    0.02
Table 6.14: Quality of the dereverberated speech for the real acoustic channels in terms
of WSS (TIMIT database)

Speech      rev     ∞-norm   ISS     MLP     proposed
Female 1    43.19   41.75    51.01   55.00   33.46
Female 2    48.08   50.48    61.53   66.00   43.43
Female 3    42.96   45.32    56.78   61.66   32.34
Female 4    42.23   42.27    44.73   58.19   32.58
Male 1      29.35   33.38    30.63   30.69   28.47
Male 2      43.79   42.32    49.07   44.72   40.00
Male 3      34.06   36.75    45.86   41.62   31.71
Male 4      39.91   42.12    37.39   39.50   35.66
Table 6.15: Quality of the dereverberated speech for the real acoustic channels in terms
of PESQ (TIMIT database)

Speech      rev     ∞-norm   ISS     MLP     proposed
Female 1    2.534   2.691    2.301   1.840   2.739
Female 2    2.469   2.573    2.340   1.939   2.681
Female 3    2.576   2.669    2.344   1.937   2.818
Female 4    2.490   2.700    2.573   2.042   2.792
Male 1      2.650   2.758    2.755   2.301   2.896
Male 2      2.412   2.504    2.091   2.030   2.721
Male 3      2.510   2.671    2.390   2.020   2.711
Male 4      2.580   2.678    2.566   2.237   2.741
The average improvement in the LLR is 0.200 point, which is 0.190, 0.266 and 0.144 points better than
the ∞-norm, ISS and MLP methods, respectively. The average improvement in the
segSNR is 4.81 dB, which is 4.45, 3.42 and 6.84 dB better than the ∞-norm, ISS and MLP
methods, respectively. The average improvement in the WSS score is 5.74, which is
7.09, 12.41 and 14.96 points better than the ∞-norm, ISS and MLP methods, respectively.
The average improvement in the PESQ is 0.235 point, which is 0.107, 0.342 and 0.719
points better than the ∞-norm, ISS and MLP methods, respectively. The average
improvement in SNR for these utterances using the real acoustic channels was 2.43 dB.
The inferior performance of the compared methods can be explained as follows.
The ISS assumes that the source signal is white, which does not hold for speech
input. Therefore, the received signal is prewhitened before calculating the coefficients
of the inverse filter. The technique proposed in [28] for estimating the whitening
filter is based on the magnitude spectrum of the autoregressive (AR) system of the
speech signal. Since the phase spectrum of the AR system function is ignored, the
prewhitening becomes erroneous, which causes improper inversion of the
AIRs. Moreover, the presence of noise, which is not considered in the ISS, further
deteriorates its performance. The MLP method [22] estimates the AR parameters
of the speech from the characteristic polynomials of the prediction matrix calculated
using the correlation between the current samples and a one-sample-delayed version
of the multichannel received signals. The prediction matrix was estimated from a
2 s speech segment. However, the AR parameters cannot be assumed stationary for
such a long duration. As a result, the estimated AR parameters are an average of
the actual variables, which deteriorates the perceptual quality of the dereverberated
speech. Moreover, in our implementation of the MLP method, the characteristic
polynomials of the prediction matrix tend to diverge from the actual values when the
AIRs exceed a few hundred taps. For comparison purposes, we have used known AR
parameters for simulating the MLP method. The method further assumes that at
least one microphone is closer to the speaker than to the noise source in order to obtain
the source LP residual from the noisy received signal. However, this assumption does
not hold for incoherent noise, and the dereverberated output was found to be severely
noise corrupted.
6.2.6
Time-varying condition
In a realistic environment, the acoustic channels are time-varying. A slight movement
of the speaker's head, which is very natural during conversation, causes the AIRs to
change. An adaptive channel estimation algorithm can track the time-varying
channels. Therefore, the proposed dereverberation technique is suitable for changing
acoustic conditions. In order to simulate the time-varying condition, the length of each
impulse response was taken to be L = 2400 samples, corresponding to a reverberation
time T60 = 0.3 s. Fig. 6.13 shows the NPM of the channel estimation in which the source
position was shifted six times from the original position. In the first four cases, the
speaker moved to the left by 1 cm at each step, and in the last two cases, the speaker
moved to the right by 1 cm at each step. The notches in the NPM curve show the
instants when the speaker moved. We find that the algorithm maintains a good NPM
level despite frequent changes in the AIRs and quickly converges to the previous level
after the movement of the speaker. In order to visualize how fast the algorithm can
track the time-varying AIRs, the speaker was moved at a faster pace in subsequent
experiments. The convergence profile of the adaptive channel estimation algorithm is
shown in Fig. 6.13 (a) to (d). Here, we see that the algorithm requires around 20 blocks
of data to converge to the previous NPM level after the speaker has moved. Since each
block of data requires L new speech samples, the algorithm requires around
6 seconds (for an 8000 Hz sampling frequency) to converge in the time-varying condition.
In other words, if the AIRs remain the same for around 20 blocks of data, we can obtain
an estimate of the AIRs from the noisy received signal. Since the estimated channels
show a good NPM value, the dereverberation performance using these estimates would
Figure 6.13: Convergence profile of the robust NMCFLMS algorithm for time-varying
channels.
be reasonable.
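The quoted re-convergence time follows directly from the block size and sampling rate:

```python
fs = 8000        # sampling frequency (Hz)
L = 2400         # channel length in the time-varying experiment (samples)
blocks = 20      # blocks of new data needed to regain the previous NPM level
print(blocks * L / fs)   # 6.0 seconds of fresh speech per re-convergence
```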
6.3
Conclusion
In this chapter, we have presented two speech dereverberation techniques suitable for
a noisy environment with slowly varying long acoustic channels. In the first approach,
a channel shortening algorithm was utilized to suppress the energy in the late
reflections. A step-size optimized iterative shortening algorithm was proposed that
maintains a trade-off between shortening performance and spectral distortion of the
dereverberated speech. In the second dereverberation approach, a block-adaptive
zero-forcing equalizer along with eigenfilter-based signal power enhancement was
proposed that eliminates both early and late reverberation. The technique was found
effective for practical room impulse responses and time-varying acoustic environments.
Chapter 7
Conclusions and Further Research
The speaker enjoys a natural way of communication using hands-free systems. This is
because the microphones are placed a certain distance away from the speaker, and there is
no need to wear a headset or hold a microphone during conversation. However,
this freedom of movement usually comes at the expense of increased background noise
and reverberation recorded by the distant microphones. These contaminations may
lead to a total loss of intelligibility of the speech signal. Since the early days of acoustic
signal processing, researchers have developed numerous algorithms to counteract the
detrimental effects of reverberation and background noise, but their performance is
limited and only a few of these are useful in practice. In this dissertation, several
multi-microphone speech dereverberation techniques have been developed using robust
acoustic channel estimation and equalization in order to improve the performance
of hands-free systems. This chapter summarizes the obtained results, highlights the
contributions and provides a guideline for future research work.
7.1
Conclusions
Although various
alternatives are available in the literature, the blind channel identification and
for a wide range of SNRs and channel lengths. Acoustic channels were obtained from
the image model developed by Allen et al. as well as from experimental data stored
in the MARDY database [57]. The noise considered was computer-generated additive white
Gaussian noise. The channel estimation accuracy was measured by the normalized
projection misalignment (NPM) index. The perceptual quality of the dereverberated
speech was measured using a variety of objective measures, such as the LLR, average
segmental SNR, WSS and PESQ measures. Both female and male utterances, randomly
taken from the TIMIT database, have been used to evaluate and compare the performance
of the proposed techniques.
The channel estimation results of the proposed robust MCLMS algorithms show
that noise robustness was achieved with almost no sacrifice in the speed of
convergence. However, the final NPM was dictated by the noise level in the received
signal. The spectrally constrained NMCFLMS algorithm achieved a steady-state NPM
of 8 dB when a 5-channel acoustic system with 4400-coefficient-long impulse responses
was estimated with speech input at 25 dB SNR. The NPM decreased to 5 dB when the
SNR was 10 dB. The algorithm can also track the variation in the AIRs when the speaker
moves slowly from his/her original position. For real reverberant channels, the obtained
final NPM was 7 dB at 30 dB SNR. The proposed dereverberation technique can provide
around 3 dB SNR improvement in the 10 to 25 dB range. The quality of the dereverberated
speech was significantly improved as compared to the state-of-the-art techniques.
7.2
Future Research
In this section, we provide some suggestions and guidelines for future research work.
The effectiveness of the proposed noise-robust MCLMS algorithm is studied
and verified considering white Gaussian noise. A practical acoustic environment, however,
includes colored background noise, which violates certain assumptions of the developed
algorithms. The simulation environment becomes more realistic when colored noise is
considered. Thus there is room for improvement of the blind channel identification
Appendix A
Iterative solution for finding the
eigenvector corresponding to the
largest eigenvalue
The iterative update equation for finding the eigenvector of \hat{A} corresponding to the
largest eigenvalue can be formulated as

g(l+1) = \mu\,\hat{A}\,g(l)    (A.1)

\hat{A} can be diagonalized as

\hat{A} = U \Lambda U^{T}    (A.2)

where U is the unitary matrix whose columns are the eigenvectors of \hat{A} and \Lambda is a
diagonal matrix whose diagonal elements \lambda_k, 1 \le k \le L_g, are the eigenvalues of \hat{A}.
Substituting (A.2) into (A.1) and premultiplying by U^{T}, we obtain

g^{o}(l+1) = \mu \Lambda\, g^{o}(l)    (A.3)

where g^{o}(l) = U^{T} g(l). The set of L_g first-order difference equations in (A.1) is now
decoupled. Therefore, the solution of the kth equation can be obtained as [49]

g^{o}_{k}(l) = C_k\, (\mu\lambda_k)^{l}\, u(l)    (A.4)
where gko (l) is the component of go (l), Ck is an arbitrary constant that depends on the
initial value of g(l), and u(l) is the unit step function. Now g(l) can be obtained as,
g(l) = Ugo (l)
g1o (l)
h
=
u 1 . . . u k . . . u Lg
..
.
i
o
gk (l)
..
.
gLo g (l)
(A.5)
b
where, uk is the eigenvector corresponding to the eigenvalue k . Since = 1/T r{A},
we have k < 1 for all k. As a result, gko (l) decays with each iteration, where the rate
of decay is dependent on the value of k . The larger the value of k , the smaller the
rate of decay. Therefore, after a large number of iterations, the final value of gko (l) can
be expressed as
gko (N )|N large = k , when k 6= max
= max , when k = max
(A.6)
where represents a small number and max k . Substituting (A.6) into (A.5), the
final estimate of the channel can be approximated as
g(N )|N large max umax
where umax is the eigenvector corresponding to the largest eigenvalue max .
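The iteration above is, in essence, the power method. As a minimal numerical sketch (using NumPy, with a hypothetical test matrix), the snippet below applies μ Â repeatedly as in (A.1); the iterate is renormalized at each step purely for numerical stability, since with μ = 1/Tr{Â} every component would otherwise decay toward zero as (A.4) shows, while still aligning with u_max:

```python
import numpy as np

def dominant_eigenvector(A, num_iters=200):
    """Power iteration: repeatedly apply mu*A, as in (A.1).

    The iterate is renormalized each step for numerical stability;
    without renormalization the vector decays in magnitude but its
    direction still converges to u_max, per (A.6)-(A.7).
    """
    mu = 1.0 / np.trace(A)       # step size mu = 1/Tr{A}
    g = np.ones(A.shape[0])      # arbitrary nonzero initial vector
    for _ in range(num_iters):
        g = mu * (A @ g)         # one iteration of (A.1)
        g /= np.linalg.norm(g)   # keep unit norm
    return g

# Symmetric test matrix; compare against a direct eigendecomposition
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
g = dominant_eigenvector(A)
w, V = np.linalg.eigh(A)
u_max = V[:, np.argmax(w)]
print(abs(g @ u_max))  # close to 1: g is aligned with u_max
```

Convergence requires that the initial vector have a nonzero component along u_max and that the largest eigenvalue be strictly separated from the rest; the rate is governed by the ratio of the two largest eigenvalues.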
List of Publications
Journal
1. M. A. Haque and M. K. Hasan, "Noise robust multichannel frequency-domain
LMS-type algorithms for blind channel identification," IEEE Signal Processing
Letters, vol. 15, pp. 305-308, 2008.
2. M. A. Haque and M. K. Hasan, "Robust multichannel LMS-type algorithms
with fast decaying transient for blind identification of acoustic channels," IET
Signal Processing (formerly IEE Proceedings, UK), vol. 2, no. 4, pp. 431-441,
Dec. 2008.
3. M. A. Haque and M. K. Hasan, "Variable step-size multichannel frequency-domain
LMS algorithm for blind identification of finite impulse response systems," IET
Signal Processing (formerly IEE Proceedings, UK), vol. 1, no. 4, pp. 182-189, 2007.
4. M. A. Haque, M. S. A. Bashar, P. A. Naylor, K. Hirose and M. K. Hasan,
"Energy constrained frequency-domain normalized LMS algorithm for blind
channel identification," Signal, Image and Video Processing (SIViP), Springer
(UK), pp. 203-213, 2007.
International Conferences
1. M. A. Haque and M. K. Hasan, "Performance comparison of the frequency-domain
multichannel normalized and variable step-size LMS algorithms," in Proc.
Bibliography
[1] F. A. Everest, The Master Handbook of Acoustics, 4th ed., McGraw-Hill, 2001.
[2] T. Houtgast and H. J. M. Steeneken, "A review of the MTF concept in room
acoustics and its use for estimating speech intelligibility in auditoria," Journal of
the Acoustical Society of America, vol. 77, no. 3, pp. 1069-1077, Mar. 1985.
[3] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localisation,
The MIT Press, 1983.
[4] I. J. Tashev, Sound Capture and Processing: Practical Approaches, John Wiley
& Sons, West Sussex, U.K., 2009.
[5] J. H. L. Hansen and M. A. Clements, "Constrained iterative speech enhancement
with application to speech recognition," IEEE Trans. Signal Process., vol. 39, no.
4, pp. 795-805, Apr. 1991.
[6] M. Omologo, P. Svaizer and M. Matassoni, "Environmental conditions and acoustic
transduction in hands-free speech recognition," Speech Communication, vol. 25, no.
3, pp. 75-95, Aug. 1998.
[7] L. E. Ryall, "Improvements in electric signal amplifiers incorporating voice-operated
devices," G.B. Patent No. 509613, 1939.
[8] E. A. P. Habets, Single- and Multi-Microphone Speech Dereverberation using
Spectral Enhancement, Ph.D. dissertation, Technische Universiteit Eindhoven,
2007.
[9] S. Gannot, D. Burshtein and E. Weinstein, "Signal enhancement using beamforming
and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol.
49, no. 8, pp. 1614-1626, 2001.
[10] J. Bitzer, K. Simmer and K. D. Kammeyer, "Theoretical noise reduction limits of
the generalized sidelobe canceller for speech enhancement," in Proc. IEEE
International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
Mar. 1999, vol. 5, pp. 2965-2968.