Abstract
We propose the Neural Transformation Machine (NTram), a novel architecture for sequence-to-sequence learning, which performs the task through a series of nonlinear transformations from the representation of the input sequence (e.g., a Chinese sentence) to the final output sequence (e.g., its English translation). Inspired by the recent Neural Turing Machines [8], we store the intermediate representations in stacked layers of memories, and use read-write operations on the memories to realize the nonlinear transformations of those representations. The transformations are designed in advance, but their parameters are learned from data. Through layer-by-layer transformations, NTram can model the complicated relations necessary for applications such as machine translation between distant languages. The architecture can be trained with normal back-propagation on parallel texts, and the learning can be easily scaled up to a large corpus. NTram is broad enough to subsume the state-of-the-art neural translation model in [2] as a special case, while significantly improving upon that model with its deeper architecture. Remarkably, NTram, being purely neural network-based, achieves performance comparable to the traditional phrase-based machine translation system (Moses) with a small vocabulary and a modest parameter size.
Introduction
This work was done while the first author was an intern at Noah's Ark Lab, Huawei Technologies.
Controller: a controller operates the read-heads and write-head; its mechanism is discussed in Section 2.1. For simplicity of modeling, NTram is only allowed to read from the memory of a lower layer and write to the memory of a higher layer. As another constraint, reading from a memory can only be performed after the writing to it completes, following a convention similar to that of NTM [8].
Read-heads: a read-head fetches values from the corresponding memory following the instructions of the controller, and its return in turn feeds the controller's state machine (see Section 2.1). NTram allows multiple read-heads for one controller, potentially with different addressing strategies (see Section 2.2).
Write-head: a write-head simply takes the instruction from the controller and modifies the values at specific locations.
2.1 Controller
The core of the controller is a state machine, implemented as a Long Short-Term Memory RNN (LSTM) [10], with its state at time $t$ denoted as $s_t$. With $s_t$, the controller determines the reading and writing at time $t$, while the return of the reading in turn takes part in updating the state. For simplicity, only one reading and one writing are allowed at each time step, but more than one read-head is allowed (see Section 3.1 for an example). Now suppose that for one particular
instance (index omitted for notational simplicity), the system reads from the R-memory ($M_r$, with $N_r$ units) and writes to the W-memory (denoted $M_w$, with $N_w$ units):

R-memory: $M_r = \{x^r_1, x^r_2, \ldots, x^r_{N_r}\}$,
W-memory: $M_w = \{x^w_1, x^w_2, \ldots, x^w_{N_w}\}$,

with $x^r_n \in \mathbb{R}^{d_r}$ and $x^w_n \in \mathbb{R}^{d_w}$. The main equations for the controller are then
Read vector: $r_t = F_r(M_r, s_t; \Theta_r)$
Write vector: $v_t = F_w(s_t; \Theta_w)$
State update: $s_{t+1} = F_{dyn}(s_t, r_t; \Theta_{dyn})$

where $F_{dyn}(\cdot)$, $F_r(\cdot)$ and $F_w(\cdot)$ are respectively the operators for the dynamics, the reading and the writing (note that our definition of writing is slightly different from that in [8]), parameterized by $\Theta_{dyn}$, $\Theta_r$, and $\Theta_w$.
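To make the read-update-write loop concrete, here is a minimal sketch of one controller step, assuming PyTorch; the class and parameter names are our own illustration rather than the paper's code, and the read operator is passed in so that any of the addressing strategies discussed below can be plugged in.

```python
import torch.nn as nn

class Controller(nn.Module):
    """State machine of one NTram-style layer: read from R-memory, emit a write vector,
    then update the LSTM state with the read result (a sketch, not the original code)."""
    def __init__(self, d_r, d_w, d_state, read_op):
        super().__init__()
        self.read_op = read_op                  # F_r(M_r, s_t): returns the read vector r_t
        self.f_w = nn.Linear(d_state, d_w)      # F_w(s_t): write vector v_t
        self.f_dyn = nn.LSTMCell(d_r, d_state)  # F_dyn: state update s_{t+1}

    def step(self, M_r, state):
        h, c = state                            # s_t (LSTM hidden and cell states)
        r_t = self.read_op(M_r, h)              # read vector
        v_t = self.f_w(h)                       # write vector
        state = self.f_dyn(r_t, (h, c))         # s_{t+1}
        return r_t, v_t, state
```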
In the remainder of this section, we will discuss several relatively simple yet effective
choices of reading and writing strategies.
2.2 Reading Strategies
Location-based Addressing: With location-based addressing (L-addressing), the read-head simply follows the original order of the units in the R-memory, returning $x^r_t$ at time $t$.

Content-based Addressing: With content-based addressing (C-addressing), the return of reading is a weighted sum over all units of the R-memory,

$r_t = \sum_{n=1}^{N_r} \tilde{g}(s_t, x^r_n; \Theta_r)\, x^r_n, \qquad \tilde{g}(s_t, x^r_n; \Theta_r) = \frac{g(s_t, x^r_n; \Theta_r)}{\sum_{n'=1}^{N_r} g(s_t, x^r_{n'}; \Theta_r)},$
where $g(s_t, x^r_n; \Theta_r)$, implemented as a deep neural network (DNN), gives the un-normalized affiliation score for unit $x^r_n$ in the R-memory. It is strongly related to the automatic alignment mechanism first introduced in [2] and to the general attention models discussed in computer vision [9]. One particular advantage of content-based addressing is that it provides a way to do global reordering of the sequence, and therefore introduces great flexibility in learning the representation.
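A possible PyTorch sketch of C-addressing reading, compatible with the Controller sketch above; the two-layer scoring network is only a stand-in for the DNN $g$, whose exact form is not specified here.

```python
import torch
import torch.nn as nn

class ContentRead(nn.Module):
    """C-addressing: r_t is a normalized, attention-like weighted sum over all R-memory units."""
    def __init__(self, d_r, d_state, d_hidden=128):
        super().__init__()
        self.g = nn.Sequential(                        # un-normalized affiliation score g(s_t, x_n^r)
            nn.Linear(d_state + d_r, d_hidden), nn.Tanh(), nn.Linear(d_hidden, 1))

    def forward(self, M_r, s_t):
        # M_r: (batch, N_r, d_r); s_t: (batch, d_state)
        s_exp = s_t.unsqueeze(1).expand(-1, M_r.size(1), -1)
        scores = self.g(torch.cat([s_exp, M_r], dim=-1))   # (batch, N_r, 1)
        weights = torch.softmax(scores, dim=1)              # normalized weights over the N_r units
        return (weights * M_r).sum(dim=1)                   # r_t: (batch, d_r)
```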
Hybrid Addressing: With hybrid addressing (H-addressing) for reading, we essentially use two read-heads (this can easily be extended to more), one with L-addressing and the other with C-addressing. At each time $t$, the controller simply concatenates the returns of the two read-heads as the final return:

$r_t = \big[\, x^r_t,\; \sum_{n=1}^{N_r} \tilde{g}(s_t, x^r_n; \Theta_r)\, x^r_n \,\big].$

It is worth noting that with H-addressing, the tempo of the state machine is determined by the L-addressing read-head, and the W-memory created in writing therefore has the same number of locations. As shown later, H-addressing can readily be extended to allow the read-heads to work on different memories.
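Under the same illustrative conventions, H-addressing for reading is just the concatenation of the L-addressing return (the unit at the current position) with the C-addressing return:

```python
import torch

def hybrid_read(M_r, s_t, t, content_read):
    """H-addressing: r_t = [x_t^r, C-addressing return] (illustrative sketch)."""
    x_t = M_r[:, t, :]                    # L-addressing read-head: the unit at position t
    c_t = content_read(M_r, s_t)          # C-addressing read-head (e.g., ContentRead above)
    return torch.cat([x_t, c_t], dim=-1)  # concatenated return fed to the controller
```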
2.3 Writing Strategies

Location-based Addressing: With L-addressing for writing, the write-head simply puts the write vector into the location at the current time, i.e., $x^w_t = v_t$.
Content-based Addressing: In a way similar to C-addressing for reading, the locations of the units to be written are determined through a gating network $g(s_t, x^w_{n,t}; \Theta_w)$, and the values in the W-memory at time $t$ are given by

$\tilde{x}^w_{n,t} = \tilde{g}(s_t, x^w_{n,t}; \Theta_w)\, F_w(s_t; \Theta_w), \qquad x^w_{n,t} = (1-\alpha)\, x^w_{n,t-1} + \tilde{x}^w_{n,t}, \qquad n = 1, 2, \ldots, N_w,$

where $x^w_{n,t}$ stands for the values of the $n$-th location in the W-memory at time $t$, $\alpha$ is the forgetting factor (similarly defined as in [8]), and $\tilde{g}$ is the normalized weight (with the un-normalized score $g$ also implemented as a DNN) given to the $n$-th location at time $t$.
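A sketch of this gated write in the same setting; treating the forgetting factor as a fixed scalar and using a small two-layer network for the un-normalized score are our own simplifications of the parameterization.

```python
import torch
import torch.nn as nn

class ContentWrite(nn.Module):
    """C-addressing write: gate the write vector into every W-memory location, with forgetting."""
    def __init__(self, d_w, d_state, alpha=0.5, d_hidden=128):
        super().__init__()
        self.alpha = alpha                        # forgetting factor (assumed fixed here)
        self.f_w = nn.Linear(d_state, d_w)        # F_w(s_t): write vector
        self.g = nn.Sequential(                   # un-normalized score for each location
            nn.Linear(d_state + d_w, d_hidden), nn.Tanh(), nn.Linear(d_hidden, 1))

    def forward(self, M_w, s_t):
        # M_w: (batch, N_w, d_w); s_t: (batch, d_state)
        s_exp = s_t.unsqueeze(1).expand(-1, M_w.size(1), -1)
        weights = torch.softmax(self.g(torch.cat([s_exp, M_w], dim=-1)), dim=1)
        update = weights * self.f_w(s_t).unsqueeze(1)   # normalized weight times F_w(s_t)
        return (1.0 - self.alpha) * M_w + update        # new values of all W-memory locations
```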
2.4
Clearly the read-write operations defined above in this section transform the representation in the R-memory into the representation in the W-memory, with the spatial structure embedded in a way shaped jointly by the design (in specifying the strategy and the implementation details) and the later supervised learning (in tuning the parameters). (This spatial structure is meant to encode the temporal structure of the input and output sequences.) We argue that the memory offers more representational flexibility for encoding sequences with complicated structures, and that the transformation introduced above provides more modeling flexibility for sequence-to-sequence learning.
As the most conventional special case, if we use L-addressing for both reading and writing, we actually recover the familiar structure of units found in an RNN with stacked layers [16]. Indeed, as illustrated in the accompanying figure, this read-write strategy invokes a relatively local dependency based on the original spatial order in the R-memory. It is not hard to show that we can recover the deep RNN models in [16] after stacking layers of read-write operations of this kind.
C-addressing, however, be it for reading or writing, offers a means for major reordering of the units, while H-addressing can add to it the spatial structure of the lower-layer memory. A memory with a designed inner structure gives more representational flexibility for sequences than a fixed-length vector, especially when coupled with an appropriate reading strategy for composing the memory of the next layer in a deep architecture, as shown later. On the other hand, the learned memory-based representation is in general less universal than the fixed-length representation, since it typically needs a particular reading strategy to decode the information.
In this paper, we consider four types of transformations induced by combinations of the read and write addressing strategies, listed pictorially in Figure 2. Notice that 1) we include only one combination with C-addressing for writing, since it is computationally expensive to optimize when combined with C-addressing reading (see Section 3.2 for some analysis), and 2) for one particular read-write strategy there are still a fair number of implementation details to be specified, which are omitted due to the space limit. One can easily design different read/write strategies, for example a particular way of H-addressing for writing.
As illustrated in Figure 3 (left panel), the stacking is straightforward: we simply apply one transformation on top of another, with the W-memory of the lower layer being the R-memory of the upper layer. Based on the memory layers stacked in this manner, we can define the entire deep architecture of NTram, with its diagram in Figure 3 (right panel). Basically, it starts with the symbol sequence (Layer-0), then moves to the sequence of word embeddings (Layer-1), and goes through layers of transformations to reach the final intermediate layer (Layer-L), which is read by the output layer. The operations in the output layer, which relies on another LSTM to generate the target sequence, are similar to a memory read-write, with however the following two differences (a minimal code sketch of the whole feed-forward pass is given after Figure 3):

it predicts the symbols of the target sequence, and takes the guess as part of the input to update the state of the generating LSTM, whereas in a memory read-write there is no information flow from higher layers to lower layers;

since the target sequence is in general of a different length from the top-layer memory, it is limited to pure C-addressing reading, and relies on the built-in mechanism of the generating LSTM to stop (e.g., after generating an End-of-Sentence token).
Figure 3: Illustration of stacked layers of memory (left) and the overall diagram of NTram
(right).
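Below is a condensed, hypothetical sketch of the overall feed-forward pass described above, assuming PyTorch: each element of `layer_transforms` stands for one memory-to-memory transformation built from the read/write machinery of Section 2, `read_top` is a pure C-addressing read, and greedy decoding with a fixed maximum length replaces the End-of-Sentence stopping mechanism.

```python
import torch
import torch.nn as nn

class NTramSketch(nn.Module):
    """Symbols (Layer-0) -> embeddings (Layer-1) -> stacked memory transforms -> generating LSTM."""
    def __init__(self, vocab_src, vocab_tgt, d, layer_transforms, read_top):
        super().__init__()
        self.embed_src = nn.Embedding(vocab_src, d)    # Layer-1: word embeddings
        self.embed_tgt = nn.Embedding(vocab_tgt, d)    # embeddings of previously guessed symbols
        self.layers = nn.ModuleList(layer_transforms)  # each maps an R-memory to a W-memory
        self.read_top = read_top                       # pure C-addressing read of the top memory
        self.dec = nn.LSTMCell(2 * d, d)               # generating LSTM of the output layer
        self.proj = nn.Linear(d, vocab_tgt)

    def forward(self, src, bos_id, max_len=50):
        memory = self.embed_src(src)                   # (batch, N, d)
        for layer in self.layers:                      # layer-by-layer transformation
            memory = layer(memory)
        batch, d = src.size(0), memory.size(-1)
        h = memory.new_zeros(batch, d)
        c = memory.new_zeros(batch, d)
        y = src.new_full((batch,), bos_id)
        logits_all = []
        for _ in range(max_len):                       # EOS-based stopping omitted in this sketch
            r = self.read_top(memory, h)               # C-addressing read from the top memory
            h, c = self.dec(torch.cat([self.embed_tgt(y), r], dim=-1), (h, c))
            logits = self.proj(h)
            y = logits.argmax(dim=-1)                  # the guess is fed back into the LSTM
            logits_all.append(logits)
        return torch.stack(logits_all, dim=1)          # (batch, max_len, vocab_tgt)
```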
Memories at different layers can be equipped with different read-write strategies, and even for the same strategy, the configurations and learned parameters are in general different. This is in contrast to DNNs, for which the transformations at different layers are more homogeneous (mostly linear transforms followed by a nonlinear activation function). As shown by our empirical study (Section 5), a sensible architecture design in combining the nonlinear transformations can greatly affect the performance of the model, about which, however, little is known and further research is needed.
3.1 Cross-Layer Reading
In addition to the generic read-write strategies in Section 2.4, we also introduce cross-layer reading into NTram for more modeling flexibility. In other words, for writing in any Layer-$\ell$, NTram allows reading from any layer lower than $\ell$, instead of just Layer-$(\ell-1)$. More specifically, we consider the following two cases.
Memory-Bundle: A Memory-Bundle, as shown in Figure 4 (left panel), concatenates the units of two aligned memories in reading, regardless of the addressing strategy. Formally, the $n$-th location in the bundle of memory Layer-$\ell'$ and Layer-$\ell''$ would be

$x^{(\ell'+\ell'')}_n = \big[\, x^{(\ell')}_n,\; x^{(\ell'')}_n \,\big].$
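In the same illustrative setting, a Memory-Bundle is just a location-wise concatenation of two aligned memories:

```python
import torch

def memory_bundle(mem_a, mem_b):
    """Concatenate the units of two aligned memories location by location (cross-layer reading)."""
    assert mem_a.size(1) == mem_b.size(1), "bundled memories must have the same number of locations"
    return torch.cat([mem_a, mem_b], dim=-1)   # bundle unit n = [unit n of layer l', unit n of layer l'']
```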
3.2 Optimization
For any designed architecture, the parameters to be optimized include $\{\Theta_{dyn}, \Theta_r, \Theta_w\}$ for each controller, the parameters of the LSTM in the output layer, and the word embeddings. Since the reading from each memory can only be done after the writing to it completes, the feed-forward process can be described at two scales: 1) the flow from the memory of a lower layer to the memory of a higher layer, and 2) the forming of a memory at each layer, controlled by the corresponding state machine. Accordingly, in optimization the flow of the correction signal also propagates at two scales:

On the cross-layer scale: the signal starts from the output layer and propagates from higher layers to lower layers, down to Layer-1 for the tuning of the word embeddings;

On the within-layer scale: the signal is back-propagated through time (BPTT), controlled by the corresponding state machine (LSTM). In optimization, there is a correction for each reading or writing on each location in a memory, which makes C-addressing more expensive than L-addressing, since it in general involves all locations in the memory at each time $t$.
The optimization can be done via standard back-propagation (BP), aiming to maximize the likelihood of the target sequence. In practice, we use standard stochastic gradient descent (SGD [14]) with mini-batches (of size 80) and with the learning rate controlled by AdaDelta [22].
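As a rough sketch of this training setup (assuming PyTorch; the model interface, padding index and data iterator are placeholders, and teacher-forcing details are model-specific):

```python
import torch
import torch.nn as nn

def train(model, batches, epochs=10, pad_id=0):
    """Maximize the likelihood of target sequences with mini-batch SGD; AdaDelta controls the rate."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    optimizer = torch.optim.Adadelta(model.parameters())   # learning rate controlled by AdaDelta [22]
    for _ in range(epochs):
        for src, tgt in batches:                            # mini-batches (size 80 in the paper)
            logits = model(src, tgt)                        # model-specific (teacher-forced) forward pass
            loss = criterion(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
            optimizer.zero_grad()
            loss.backward()                                 # BP across layers and BPTT within layers
            optimizer.step()
```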
We discuss four representative special cases of NTram, named Arc-I, II, III and IV, as novel deep architectures for machine translation. We also show that RNNsearch [2] can be described in the framework of NTram as a relatively shallow special case.
Arc-I: The first proposal, including two variants (Arc-Ihyb and Arc-Iloc), is designed to demonstrate the effect of C-addressing in intermediate memory layers, with its diagram shown in the accompanying figure. It employs L-addressing reading from memory Layer-1 (the embedding layer), and L-addressing writing to Layer-2. After that, Arc-Ihyb writes to Layer-3 (L-addressing) based on its H-addressing reading (two read-heads) on Layer-2, while Arc-Iloc uses L-addressing to read from Layer-2. Once Layer-3 is formed, it is put together with Layer-2 into a Memory-Bundle, from which the output layer reads (C-addressing) to predict the target sequence.
Arc-II: As a variant of Arc-Ihyb, this architecture is designed to investigate the effect of H-addressing reading from different layers of memory (the Short-Cut in Section 3.1). It uses the same strategy as Arc-Ihyb in generating memory Layer-1 and Layer-2, but differs in generating Layer-3, where Arc-II uses C-addressing reading on Layer-2 but L-addressing reading on Layer-1. Once Layer-3 is formed, it is put together with Layer-2 as a Memory-Bundle, which is then read by the output layer for predicting the target sequence.
Arc-III: We use this design to study a deeper architecture and a more complicated addressing strategy. Arc-III generates Layer-1, Layer-2 and Layer-3 in the same way as Arc-II. After that it uses two read-heads combined with an L-addressing write to generate Layer-4, where the two read-heads consist of an L-addressing read-head on Layer-1 and a C-addressing read-head on the Memory-Bundle of Layer-2 and Layer-3. After the generation of Layer-4, it puts Layer-2, 3 and 4 together into a bigger Memory-Bundle for the output layer. Arc-III, with four intermediate layers, is the deepest among the four special cases.
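For reference, the read/write choices of the special cases described above can be summarized compactly as follows. This is a paraphrase of the prose in this section; Arc-IV is not detailed in this text and is therefore omitted, and the write addressing is listed as L-addressing wherever the prose states or implies it.

```python
# How each intermediate memory layer is formed, and what the output layer reads.
ARCHITECTURES = {
    "Arc-I_loc": {
        "Layer-2": "L-read(Layer-1) -> L-write",
        "Layer-3": "L-read(Layer-2) -> L-write",
        "output":  "C-read(bundle(Layer-2, Layer-3))",
    },
    "Arc-I_hyb": {
        "Layer-2": "L-read(Layer-1) -> L-write",
        "Layer-3": "H-read(Layer-2) -> L-write",
        "output":  "C-read(bundle(Layer-2, Layer-3))",
    },
    "Arc-II": {
        "Layer-2": "L-read(Layer-1) -> L-write",
        "Layer-3": "C-read(Layer-2) + L-read(Layer-1) -> L-write",
        "output":  "C-read(bundle(Layer-2, Layer-3))",
    },
    "Arc-III": {
        "Layer-2": "L-read(Layer-1) -> L-write",
        "Layer-3": "C-read(Layer-2) + L-read(Layer-1) -> L-write",
        "Layer-4": "L-read(Layer-1) + C-read(bundle(Layer-2, Layer-3)) -> L-write",
        "output":  "C-read(bundle(Layer-2, Layer-3, Layer-4))",
    },
}
```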
4.1
Experiments

5.1 Setup
Dataset and Evaluation Metrics: Our training data consist of 1.25M sentence pairs extracted from LDC corpora. The bilingual training data contain 27.9 million Chinese words and 34.5 million English words. We choose the NIST MT Evaluation test set 2002 (MT02) as our development set, and the NIST MT Evaluation test sets 2003 (MT03), 2004 (MT04) and 2005 (MT05) as our test sets. The numbers of sentences in NIST MT02, MT03, MT04 and MT05 are 878, 919, 1788 and 1082, respectively. We use the case-insensitive 4-gram NIST BLEU score as our evaluation metric, and the sign-test [6] as the statistical significance test.
Preprocessing: We perform word segmentation for Chinese with the Stanford NLP toolkit, and English tokenization with the tokenizer from Moses.
Systems               MT03     MT04     MT05     Average
RNNsearch (default)   29.02    31.25    28.32    29.53
RNNsearch (best)      30.28    31.72    28.52    30.17
Arc-Iloc              28.98    32.02    29.53*   30.18
Arc-Ihyb              30.14    32.70*   29.40*   30.75
Arc-II                31.27*   33.02*   30.63*   31.64
Arc-III               30.15    33.46*   29.49*   31.03
Arc-IV                26.32    28.18    26.23    26.91
Moses                 31.61    33.48    30.75    31.95

Table 1: BLEU-4 scores (%) of the NMT baselines RNNsearch (default) and RNNsearch (best), the NTram architectures (Arc-I, II, III and IV), and the phrase-based SMT system (Moses). The * indicates that the result is significantly (p < 0.05) better than that of RNNsearch (best), while boldface numbers indicate the best results on the corresponding test set.
In the training of the neural networks, we limit the source and target vocabularies to the most frequent 16K words in Chinese and English, covering approximately 95.8% and 98.3% of the two corpora, respectively. All out-of-vocabulary words are mapped to a special token UNK.
5.2
5.3 Results
The main results of the different models are given in Table 1. RNNsearch (best) is about 1.7 points behind Moses in BLEU on average (the reported Moses does not include any language model trained on a separate monolingual corpus), which is consistent with the observations made by other authors on different machine translation tasks [2, 11]. Remarkably, some sensible designs of NTram (e.g., Arc-II) can already achieve performance comparable to Moses, with only 42M parameters, while RNNsearch (best) has 46M parameters.
Clearly, most of the NTram architectures (Arc-Ihyb, Arc-II and Arc-III) yield better performance than the NMT baselines. Among them, Arc-II outperforms the best NMT baseline, RNNsearch (best), by about 1.5 BLEU on average with fewer parameters. This suggests that the deeper architectures in NTram help to capture the transformations of representations essential to machine translation.
5.4 Discussion
A further comparison between Arc-Ihyb and Arc-Iloc (which have similar parameter sizes) suggests that C-addressing reading plays an important role in learning a powerful transformation between intermediate representations, which is necessary for translation between language pairs with vastly different syntactic structures. This conjecture is further supported by the good performance of Arc-II and Arc-III, both of which have C-addressing read-heads on their intermediate memory layers.
Clearly, the BLEU scores of Arc-IV are considerably lower than those of the other architectures, and a closer analysis shows that its translations are usually about 15% shorter than those of Arc-I to Arc-III. One possibility is that our particular implementation of C-addressing for writing in Arc-IV (Section 4) is hard to optimize, and might need to be guided by another write-head or by some smart initialization. This is left for future exploration.
As another observation, cross-layer reading almost always helps. The performance of Arc-I, II, III and IV consistently drops after removing the Memory-Bundle and Short-Cut (results omitted here), even after broadening the memory units to keep the parameter size unchanged.
We have also observed some failure cases in architecture design for NTram, most notably when there is only a single read-head with C-addressing for writing to a memory that is the sole layer going to the next stage. A comparison of this design with H-addressing (results omitted here) suggests that another read-head with L-addressing can prevent the transformation from going astray, by adding the tempo of a memory with a clearer temporal structure.
Conclusion
We propose NTram, a novel architecture for sequence-to-sequence learning, which is stimulated by the recent work of Neural Turing Machine[8] and Neural Machine Translation [2].
NTram builds its deep architecture for processing sequence data on the basis of a series
of transformations induced by the read-write operations on a stack of memories. This new
architecture significantly improves the expressing power of models in sequence-to-sequence
learning, which is verified by our empirical study on benchmark machine translation tasks.
References
[1] M. Auli, M. Galley, C. Quirk, and G. Zweig. Joint language and translation modeling with recurrent neural networks. In Proceedings of EMNLP, pages 1044–1054, 2013.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
[3] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP, pages 740–750, 2014.
[4] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724–1734, 2014.
[5] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of NIPS Deep Learning and Representation Learning Workshop, 2014.
[6] M. Collins, P. Koehn, and I. Kucerova. Clause restructuring for statistical machine translation. In Proceedings of ACL, pages 531–540, 2005.
[7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[8] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
[9] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
[12] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL, Interactive Poster and Demonstration Sessions, pages 177–180, Prague, Czech Republic, June 2007.
[13] P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proceedings of NAACL, pages 48–54, 2003.
[14] Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.
[15] F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.
[16] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. In Proceedings of ICLR, 2014.
[17] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[18] A. Stolcke et al. SRILM: An extensible language modeling toolkit. In Proceedings of ICSLP, volume 2, pages 901–904, 2002.
[19] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[20] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. arXiv preprint arXiv:1412.7449, 2014.
[21] M. Wang, Z. Lu, H. Li, W. Jiang, and Q. Liu. genCNN: A convolutional architecture for word sequence prediction. arXiv preprint arXiv:1503.05034, 2015.
[22] M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.