K_l[n] = \frac{R_{yy,l}^{-1}[n-1]\, Y[n]}{\lambda_l^2[n] + Y[n]^H R_{yy,l}^{-1}[n-1]\, Y[n]}    (4)

and:

R_{yy,l}[n] = \sum_{k=0}^{n} \frac{Y[k]\, Y[k]^H}{\lambda_l^2[k]}    (5)
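Since the derivation leading into Eqs. (4)-(5) is truncated here, the following is a minimal numerical sketch, assuming Y[n] is the complex observation vector for frequency bin l and lambda_l^2[n] is a per-frame power estimate; the function name and array layouts are illustrative, not from the paper. Note that the cumulative sum in Eq. (5) is equivalent to a rank-one recursive update.

```python
import numpy as np

def update_statistics(R_prev, Y, lam_sq):
    """One step of Eqs. (4)-(5) for frequency bin l: compute the gain
    K_l[n] from the previous covariance, then fold frame n into the
    weighted covariance. Names and layouts are illustrative.

    R_prev : (D, D) complex array, R_{yy,l}[n-1]
    Y      : (D,) complex observation vector Y[n]
    lam_sq : float, power estimate lambda_l^2[n]
    """
    # Eq. (4): K_l[n] = R^{-1}[n-1] Y[n] / (lambda^2[n] + Y[n]^H R^{-1}[n-1] Y[n])
    RinvY = np.linalg.solve(R_prev, Y)          # R_{yy,l}^{-1}[n-1] Y[n]
    K = RinvY / (lam_sq + np.vdot(Y, RinvY))    # np.vdot conjugates its first argument
    # Eq. (5) written as a recursion: R[n] = R[n-1] + Y[n] Y[n]^H / lambda_l^2[n]
    R = R_prev + np.outer(Y, Y.conj()) / lam_sq
    return K, R
```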
Y^p[3n] = \sum_{t=T_l}^{T_h} \sum_{c=1}^{C} X_c[3n + t]\, H^p_{c,t}    (8)

Here, t \in \{T_l, \ldots, T_h\} denotes the time index of the AR filter, and H^p_{c,t} is the complex filter of the factoring layer for the p-th look direction, channel c, and AR filter context t. The advantage of using an AR formulation is that the network can potentially learn spatial filters with a longer timespan. The output of the factoring layer is subsampled to 33 Hz and is passed to the CLP layer. Unlike the WS approach, the output of the CLP layer needs to be stacked only across the P look directions.
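As a concrete illustration of Eq. (8), here is a sketch of the factored AR filtering for a single frequency bin. The function name, array layouts, and zero-padding at the signal edges are assumptions made for this sketch, not details from the paper.

```python
import numpy as np

def factored_ar_layer(X, H, T_l, T_h):
    """Sketch of Eq. (8): Y^p[3n] = sum_t sum_c X_c[3n+t] H^p_{c,t}.

    X : (C, N) complex inputs X_c[n] for C channels
    H : (P, C, T_h - T_l + 1) complex filters H^p_{c,t} for P look directions
    Returns Y : (P, number of kept frames), i.e. Y^p[3n] for n = 0, 1, ...
    """
    C, N = X.shape
    P = H.shape[0]
    frames = range(0, N, 3)                    # keep every third frame (the "3n" subsampling)
    Y = np.zeros((P, len(frames)), dtype=complex)
    for m, n in enumerate(frames):
        for j, t in enumerate(range(T_l, T_h + 1)):
            if 0 <= n + t < N:                 # zero-pad outside the utterance (an assumption)
                Y[:, m] += H[:, :, j] @ X[:, n + t]    # sum over channels c
    return Y
```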
5. Grid-LSTMs

The output of the fCLP layer is passed to a Grid-LSTM [9], a type of two-dimensional LSTM that uses separate LSTMs to model the variations across time and frequency [17]. However, at each time-frequency bin, the grid frequency LSTM (gF-LSTM) uses the state of the grid time LSTM (gT-LSTM) from the previous time step, and similarly the gT-LSTM uses the state of the gF-LSTM from the previous frequency step. The motivation for exploring the Grid-LSTM is the potential benefit of having separate LSTMs to model the correlations in time and frequency. In this work, a bidirectional Grid-LSTM [18] is adopted: it uses a bidirectional LSTM in the frequency direction to mitigate the directional dependency incurred by a unidirectional LSTM, while keeping a unidirectional LSTM in the time direction so that the speech signal can still be processed in an online fashion. Furthermore, in [18] we found that bidirectional frequency processing allows the use of non-overlapping filters, which substantially reduces computation costs.

The bidirectional Grid-LSTM consists of a forward Grid-LSTM and a backward Grid-LSTM. The forward processing (fwd) consists of the following steps:
q_{u,t,k}^{(fwd)} = W_{um}^{(fwd,t)} m_{t-1,k}^{(fwd,t)} + W_{um}^{(fwd,k)} m_{t,k-1}^{(fwd,k)}, \quad u \in \{i, f, c, o\}    (9)

i_{t,k}^{(fwd,s)} = \sigma(W_{ix}^{(fwd,s)} x_{t,k} + q_{i,t,k}^{(fwd)} + b_i^{(fwd,s)})    (10)

f_{t,k}^{(fwd,s)} = \sigma(W_{fx}^{(fwd,s)} x_{t,k} + q_{f,t,k}^{(fwd)} + b_f^{(fwd,s)})    (11)

c_{t,k}^{(fwd,t)} = f_{t,k}^{(fwd,t)} \odot c_{t-1,k}^{(fwd,t)} + i_{t,k}^{(fwd,t)} \odot g(W_{cx}^{(fwd,t)} x_{t,k} + q_{c,t,k}^{(fwd)} + b_c^{(fwd,t)})    (12)

c_{t,k}^{(fwd,k)} = f_{t,k}^{(fwd,k)} \odot c_{t,k-1}^{(fwd,k)} + i_{t,k}^{(fwd,k)} \odot g(W_{cx}^{(fwd,k)} x_{t,k} + q_{c,t,k}^{(fwd)} + b_c^{(fwd,k)})    (13)

o_{t,k}^{(fwd,s)} = \sigma(W_{ox}^{(fwd,s)} x_{t,k} + q_{o,t,k}^{(fwd)} + b_o^{(fwd,s)})    (14)

m_{t,k}^{(fwd,s)} = o_{t,k}^{(fwd,s)} \odot h(c_{t,k}^{(fwd,s)})    (15)

where the superscript s \in \{t, k\} distinguishes the gT-LSTM from the gF-LSTM, \odot denotes element-wise multiplication, and \sigma, g, h are the gate and cell activation functions.
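The forward recurrences (9)-(15) can be collected into a single step function. Below is a sketch, taking g and h to be tanh and sigma to be the logistic sigmoid (common choices), with a hypothetical parameter-dictionary layout; it is an illustration of the equations, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grid_lstm_fwd_step(x, m_t, m_k, c_t, c_k, P):
    """One forward Grid-LSTM step at time-frequency bin (t, k), Eqs. (9)-(15).

    x        : input x_{t,k}
    m_t, c_t : gT-LSTM output and cell state from bin (t-1, k)
    m_k, c_k : gF-LSTM output and cell state from bin (t, k-1)
    P        : parameter dict with keys ('Wm', u, s), ('Wx', u, s), ('b', u, s)
               -- a hypothetical layout chosen for this sketch.
    Returns {'t': (m, c), 'k': (m, c)} for the next time / frequency steps.
    """
    # Eq. (9): recurrent pre-activations q_u, shared by both sub-LSTMs
    q = {u: P[('Wm', u, 't')] @ m_t + P[('Wm', u, 'k')] @ m_k for u in 'ifco'}

    out = {}
    for s, c_prev in (('t', c_t), ('k', c_k)):
        i = sigmoid(P[('Wx', 'i', s)] @ x + q['i'] + P[('b', 'i', s)])  # Eq. (10)
        f = sigmoid(P[('Wx', 'f', s)] @ x + q['f'] + P[('b', 'f', s)])  # Eq. (11)
        # Eqs. (12)-(13): separate cell states along time (s='t') and frequency (s='k')
        c = f * c_prev + i * np.tanh(P[('Wx', 'c', s)] @ x + q['c'] + P[('b', 'c', s)])
        o = sigmoid(P[('Wx', 'o', s)] @ x + q['o'] + P[('b', 'o', s)])  # Eq. (14)
        out[s] = (o * np.tanh(c), c)                                    # Eq. (15)
    return out
```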
For the backward processing (bwd), we have a separate set of weight parameters W^{(bwd,\cdot)} and b^{(bwd,\cdot)}. In the above equations, instead of the LSTM output m^{(bwd,k)}_{t,k-1} and cell state c^{(bwd,k)}_{t,k-1} of the previous, (k-1)-th, frequency block, the LSTM output m^{(bwd,k)}_{t,k+1} and cell state c^{(bwd,k)}_{t,k+1} of the next, (k+1)-th, frequency block are used.

At each time step t, we concatenate the Grid-LSTM cell outputs m^{(s)}_{t,k} for all frequency blocks k and give them to a linear dimensionality-reduction layer, followed by an LDNN.
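Putting the pieces together, the following sketch runs one time step of the bidirectional frequency scan using the grid_lstm_fwd_step helper above: the backward pass applies the same update with its own weights while scanning frequency blocks in reverse, so block k sees state from block k+1. All names and state layouts are illustrative assumptions.

```python
import numpy as np

def bigrid_one_timestep(x_t, time_state_fwd, time_state_bwd, P_fwd, P_bwd, D):
    """One time step t over K frequency blocks (illustrative sketch).

    x_t          : list of K inputs x_{t,k}
    time_state_* : lists of K (m, c) pairs carried over from time t-1
    D            : cell dimension
    Returns the concatenated per-block outputs and the new time states.
    """
    K = len(x_t)
    outputs = [None] * K
    new_fwd, new_bwd = [None] * K, [None] * K

    m, c = np.zeros(D), np.zeros(D)              # fwd frequency state
    for k in range(K):                           # forward: block k-1 feeds block k
        m_t, c_t = time_state_fwd[k]
        out = grid_lstm_fwd_step(x_t[k], m_t, m, c_t, c, P_fwd)
        new_fwd[k], (m, c) = out['t'], out['k']
        outputs[k] = [out['t'][0], out['k'][0]]  # collect m^{(t)}, m^{(k)}

    m, c = np.zeros(D), np.zeros(D)              # bwd frequency state
    for k in reversed(range(K)):                 # backward: block k+1 feeds block k
        m_t, c_t = time_state_bwd[k]
        out = grid_lstm_fwd_step(x_t[k], m_t, m, c_t, c, P_bwd)
        new_bwd[k], (m, c) = out['t'], out['k']
        outputs[k] += [out['t'][0], out['k'][0]]

    # concatenate all per-block outputs before the linear reduction layer
    return np.concatenate([np.concatenate(o) for o in outputs]), new_fwd, new_bwd
```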
6. Experimental Details

6.1. Corpora

We conduct experiments on about 18,000 hours of noisy training data consisting of 22 million English utterances. This data set is created by artificially corrupting clean utterances using a room simulator to add varying degrees of noise and reverberation. The clean utterances are anonymized and hand-transcribed voice search queries, and are representative of Google's voice search traffic. Noise signals, which include music and ambient noise sampled from YouTube and recordings of daily life environments, are added to the clean utterances at SNRs ranging from 0 to 30 dB, with an average SNR of 11 dB. Reverberation is simulated using the image model [19]; room dimensions and microphone array positions are randomly sampled from 3 million possible room configurations, with RT60s ranging from 0 to 900 ms and an average RT60 of 500 ms. The simulation uses a 2-channel linear microphone array with an inter-microphone spacing of 71 millimeters. Both noise and target speaker locations change between utterances; the distance between the sound source and the microphone array varies between 1 and 8 meters.

We evaluate our models using simulated, rerecorded, and real noisy farfield data. For the simulated and rerecorded sets, around 15 hours (13K utterances) of anonymized voice-search utterances were used. For the simulated sets, noise is added using the room simulator with a room configuration distribution that approximately matches the training configurations. The noise snippets and room configurations do not overlap with training. For the rerecorded sets, we played the clean eval set and noise recordings separately in a living room setting (approximately 200 ms RT60) and mixed them artificially at SNRs ranging from 0 dB to 20 dB.

For the real farfield sets, we sample anonymized and hand-transcribed queries directed towards Google Home. The set consists of approximately 22,000 utterances and is typically at a higher SNR compared to the artificially created sets. We also present a WER breakdown under various noise conditions in Section 7.
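The noise-mixing step of the corruption pipeline can be illustrated with a generic SNR-scaling utility. This is a simplification: the paper's room simulator also adds reverberation, and its exact procedure is not specified here.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR, then mix.
    A generic sketch of SNR-controlled corruption, not the paper's simulator."""
    noise = np.resize(noise, clean.shape)       # loop / trim noise to the utterance length
    p_clean = np.mean(clean.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```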
6.2. Architecture

All experiments in this paper use CFFT or log-mel features computed with a 32-ms window shifted every 10 ms. Low frame rate (LFR) [16] models are used: at the current frame t, these features are stacked with 3 frames to the left and downsampled to a 30-ms frame rate. An LDNN architecture [11] is used consistently in all experiments, and consists of 4 LSTM layers with 1,024 cells per layer unless otherwise indicated, and a DNN layer with 1,024 hidden units. During training, the network is unrolled for 20 time steps and trained with truncated backpropagation through time. In addition, the output state label is delayed by 5 frames, as we have observed that information about future frames improves the prediction of the current frame [16]. All neural networks are trained with the cross-entropy criterion, using asynchronous stochastic gradient descent (ASGD) optimization [20].
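A minimal sketch of the LFR stacking and downsampling described above, assuming a (frames x dims) feature array; the edge handling (repeating the first frame) is an assumption.

```python
import numpy as np

def lfr_stack(features, left_context=3, factor=3):
    """Stack each frame with `left_context` frames to its left, then keep
    every `factor`-th frame (10 ms -> 30 ms). Names are illustrative.

    features : (N, D) array of per-frame features
    Returns  : (ceil(N / factor), D * (left_context + 1)) stacked features
    """
    N, _ = features.shape
    # pad at the start so the first frame has a full left context
    padded = np.concatenate([np.repeat(features[:1], left_context, axis=0), features])
    # column block i holds frame t - (left_context - i); the last block is frame t
    stacked = np.concatenate([padded[i:i + N] for i in range(left_context + 1)], axis=1)
    return stacked[::factor]
```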
7. Results

Table: WER (%) on the evaluation sets, without and with dereverberation (Drvb).

Model        clean   simulated noisy   rerecorded   rerecorded noisy
No Drvb      11.7    18.2              20.5         32.2
With Drvb    11.7    17.5              19.7         30.1