Joss Moorkens
School of Applied Language and Intercultural Studies, Dublin City University, Ireland.
Email: joss.moorkens@dcu.ie
Twitter: @jossmo
http://orcid.org/0000-0003-0766-0071
https://www.linkedin.com/in/jossmo/
What to expect from Neural Machine Translation: A practical in-class
translation evaluation exercise
The rise of neural machine translation (NMT) has been accompanied by a good deal
of media hyperbole about neural networks and machine learning, some of which has
suggested that several professions, including translation, may be under threat. This
evaluation exercise is intended to empower students and to help them understand the
strengths and weaknesses of this new technology. Students’ findings across several
language pairs mirror those from published research, such as improved fluency and
word order in NMT output, alongside some unpredictable problems of omission and
mistranslation.
Subject classification codes: include these here if the journal requires them
Introduction
new and updated tools and technologies. Where these technologies can help translators
to “maintain high levels of productivity and offer value-added services” (Olohan 2007,
59), it is incumbent on translation trainers to ensure that students are made aware of
their usefulness in order to maximise their agency as translators, and to fulfil industry
employment needs. One particularly disruptive change in recent years has been the
translations, have become commonplace. Although the first wave of NMT publications
appeared relatively recently (Bahdanau, Cho and Bengio 2014; Cho et al. 2014), NMT
and deployment (Wu et al. 2016). NMT is also a statistical paradigm, with systems
trained on human translations. Nonetheless, claims that NMT is, according to the title of
a 2016 Google research paper, “bridging the gap between human and machine
translation” (Wu et al. 2016), or that Microsoft have achieved “human parity on
automatic Chinese to English news translation” (Hassan et al. 2018), have led to a great
deal of media hyperbole about the potential uses of NMT and related displacement of
determinism in media reports about machine learning, this may lead translators to fear
NMT, and engender a sense of powerlessness due to a perception that the technology
“gets precedence and is inevitable” (Cadwell, O’Brien, and Teixeira 2017, 17). O’Brien,
however, suggests that the “increasing technologisation of the profession is not a threat,
but an opportunity to expand skill sets and take on new roles” (2012, 118). The points
that may benefit from the skills of the translator, as highlighted by Kenny and Doherty
(2014), still hold true for NMT. With this point in mind, I contend that helping students
intervention.
MT tends to be perceived negatively by many translators due to low quality
expectations, imposition of the technology without any choice to opt out, and fear of
being replaced (Cadwell, O’Brien, and Teixeira 2017). It was originally envisaged that
MT would replace human translators (Hutchins 1986), and although that intention may
not still hold true (Way 2018), the perception remains. As MT has moved towards
statistical methods, the MT process has become more difficult to explain. NMT can be
particularly difficult for students and scholars to conceptualise, not least because neural
networks are complex and NMT output can be unpredictable (Arthur, Neubig and
Nakamura 2016)1. This makes it all the more important for translation students to
familiarise themselves with and demystify NMT output, and to become aware that,
despite the hype about machine learning, NMT output has many weaknesses as
well as strengths.
comparatively evaluate statistical and neural MT output for one language pair. It was set
following a large scale comparative evaluation carried out as part of the TraMOOC EU-
funded project (Castilho et al. 2017b), and scaled down so that students could complete
the evaluations within two hours. The exercise was designed based on the constructivist
paradigm so that students could independently learn about - and build an “evaluative
experience at using several translation quality assessment (TQA) techniques. For most
students, this was also a first experience of MT post-editing. The exercise was carried
out with second year undergraduate students of translation and repeated with a cohort of
Translation Studies PhD students in another university. Students’ reports showed many
1 See Forcada (2017) for an accessible introduction to the technology behind NMT.
insights into what may be expected of each technology. A lecture on MT, incorporating
NMT and published NMT evaluations from research and from industry, followed one
week after the in-class comparative task for the undergraduate cohort, providing context
for their findings, and explaining (at the level of concepts rather than mathematical
equations) the processes of creating and training an NMT system that lead to the type of
output that they had evaluated. Their findings meant that they had personal experience
of the systems discussed, and also mirrored the findings of the large-scale, randomised
TraMOOC evaluation across four language pairs, as well as those of a number of other
recently published evaluations.
The following section details the technological and industrial context of the
evaluations, and the potential employability benefits of learning about NMT. Thereafter,
the evaluation task is described, with results from both cohorts summarised, followed
by some in-class feedback and discussion points that were raised by the exercise.
commonplace some years before, the first published NMT papers appeared in 2014.2
Researchers began to create single neural networks, tuned to maximise the “probability
of a correct translation given a source sentence” (Bahdanau, Cho and Bengio 2014, 1)
based on contextual information from the source text and previously produced target
2 Forcada and Ñeco (1997) had, in fact, suggested a method that was effectively a precursor to NMT.
with translation of unknown source words by breaking words into subword particles,
(Sennrich, Haddow and Birch 2016), improving the quality of NMT output. When NMT
systems were entered into competitive MT environments in 2016, they scored above
SMT for many language pairs (English-German and vice versa, English-Czech,
English-Russian; see Bojar et al. 2016) despite the comparatively few years of NMT
forward in quality.
Comparative evaluation and the state of the art for machine translation
Several papers subsequently appeared, using automatic and human evaluation methods
to compare NMT with SMT. All highlight the low number of word order errors found in
NMT output and associated improved scores for fluency. Bentivogli et al. (2016, 265)
found that NMT had “significantly pushed ahead the state of the art”. In their
evaluation, technical post-editing effort (in terms of the number of edits) for English to
German was reduced on average by 26% when using NMT rather than the best-
performing SMT system, with fewer overall errors, and notably fewer word order and
verb placement errors. Wu et al. (2016) used automatic evaluation and human ranking
of 500 Wikipedia segments that had been machine-translated from English into Spanish,
languages. The publicity surrounding this publication and the subsequent move to NMT
by Google Translate for these language pairs led to a spike in NMT hype (Castilho et al.
2017a). Several further language pairs have now moved to Google NMT, prompting
translation quality.
A detailed evaluation by Popović (2017) of SMT and NMT output for English-
German and vice-versa found that NMT produced fewer overall errors, and that NMT
output contained improved verb order and verb forms, with fewer verbal omissions.
NMT also performed strongly regarding morphology and word order, particularly on
articles, phrase structure, English noun collocations, and German compound words. She
reported, however, that NMT was less successful in translating prepositions, ambiguous
Castilho et al. (2017a) found inconsistent evaluation results for adequacy and
addition and mistranslation in NMT output in several language pairs and three domains.
They conclude that, although NMT represents a significant improvement “for some
language pairs and specific domains, there is still much room for research and
improvement” (110). They caution that “overselling a technology that is still in need of
more research may cause negativity about MT”, and more specifically, a “wave of
Employability considerations
The outlook for employment in translation in the short-to-medium term looks to be
relatively positive. The U.S. Bureau of Labor Statistics expect an 18% increase in jobs
for translators and interpreters in the U.S.A. between 2016 and 2026, equating to 12,100
new positions. The demand for translators is predicted to vary depending on speciality
or language pair, and opportunities “should be plentiful for interpreters and translators
specializing in healthcare and law, because of the critical need for all parties to fully
work with MT output. Post-editors of NMT in Castilho et al. (2017b, 11) “found NMT
errors more difficult to identify”, whereas “word order errors and disfluencies requiring
revision were detected faster” in SMT output. Familiarity with NMT output, especially
considering the increasing importance of speed in the translator workplace (see Bowker
and McBride, 2017), should improve the efficiency of NMT error identification.
At present, a move is underway among MT providers to entirely neural or hybrid neural
methods: Microsoft have released online NMT engines, Kantan MT has created an
NMT product called NeuralFleet, and the ModernMT project has moved to NMT,3 despite
being led by one of the creators of the popular Moses SMT system (Koehn et al. 2007).
This all suggests that professional linguists will soon find themselves in workflows that
incorporate NMT to a greater or lesser extent. Learning about NMT should empower
the translator “as an agent who is very much present throughout” such a workflow,
rather than taking a “limited or reductive role” (Kenny and Doherty 2014, 290).
Gaspari, Almaghout and Doherty (2015) identified MT, TQA, and post-editing as
described in the following section incorporates each of these skills, and is also intended
participants gained TQA experience using three metrics. The first employs the construct of adequacy,
3 See https://github.com/ModernMT/MMT.
commonly used (along with fluency, both measured using a Likert-type scale) for
human evaluations of MT quality. The second metric employs error annotation using a
simple typology of errors, common among research and industry models but little-used
in academic scenarios.4 The third is post-editing effort, using one of the three categories
of effort identified by Krings (2001). Temporal effort, or time spent post-editing, is
commonly used, often to highlight the
Evaluation task
This evaluation task was set for a cohort of 46 second-year translation undergraduates,
with a two-hour time limit. These students had a small amount of translation experience
their marks for the module, with the remainder awarded on the basis of results of a
demonstrate awareness of appropriate tools that assist in the translation process, and
awareness of the historical development of CAT tools and their importance in modern-
day translation practice. They should be able to apply translation memory and
terminology management tools at a basic level, and to explain the interaction between
students had learned about TQA and the history of MT, but had not yet learned about
4 See Doherty, Gaspari, Moorkens, and Castilho (2018) and Lommel (2018) for further discussion of these.
SMT and NMT in any detail. All students had used online MT without any knowledge of the underlying technology.
The task was repeated with a cohort of 9 participants from 14 attendees at a PhD
Summer School, where postgraduate students and staff from another university
performed the evaluation in two one-hour blocks. The rationale for this was to see
whether the exercise would be useful for a different group with a background in
translation studies and more translation experience. The first cohort used the Microsoft
Translator MT Try & Compare website,5 and the second cohort were given the option of
using this same site or Google Translate for NMT and Google Sheets for SMT, as some
of the participants’ language pairs were not included in the Microsoft site.
Task description
The most popular Machine Translation (MT) paradigm over the past 15 years or
so has been Statistical Machine Translation (SMT). More recently, there have
been claims that Neural Machine Translation (NMT), a statistical paradigm that
5 From August 2017, the site at https://translator.microsoft.com/neural/ allowed users to compare NMT and SMT output. This changed in March 2018 so that users can test the research systems described in Hassan et al. (2018). At the time of writing (May 2018), users can still access Google SMT via Google Sheets. Due to these being free online tools, the MT
(2) Adequacy
They were provided with a list of available languages for the exercise based on those
Within the allotted time, students were asked to choose a page from Wikipedia, copy 20
segments in one of the languages supported by the MT systems (again, their choice),
aiming for two groups of 10 with similar sentence length. While Wikipedia articles
could not be considered “appropriate, authentic texts” (Kenny 2007, 204) for a
translation workflow, they were suggested for a time-limited task to avoid time-
consuming terminology searches. Students could translate material on a topic that was
familiar and interesting to them. The students had 10 segments translated using NMT
and 10 using SMT, and copied the output to their documents, making sure to clearly
differentiate the segments produced by SMT and NMT systems. The students then
(1) Post-editing effort (time spent, measured using the computer clock,
noting Total Editing Time in Word before and after evaluation, or ideally
6 Arabic, Chinese (simplified), English, French, German, Italian, Japanese, Korean, Portuguese,
(1) None of it
(2) Little of it
(3) Most of it
(4) All of it
and additions:
number, or case
Tip: Use the highlighter tool and count the number of occurrences for
each error.
Students were asked to work out their average post-editing time per segment for
each MT paradigm over the ten segments, the average segment adequacy score for each
paradigm, and the frequency of each error type by dividing the total number of
occurrences by the number of segments. They were then asked to state a preference for
one MT system or the other for their language pair, explaining their reasons using
examples from the exercise. They were also asked to
suggest scenarios in which these MT systems might be useful. The marking criteria
Examples provided: 5
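The averages the students were asked to compute are simple arithmetic. As a minimal sketch, the Python snippet below (not part of the original task; the segment data are invented, standing in for one student's annotations of one paradigm) shows the calculation.

```python
# Illustrative sketch of the students' calculations: average post-editing
# time per segment, average adequacy (1-4 Likert scale), and per-segment
# frequency of each error type. The data below are invented.

def summarise(segments):
    """Return (avg. PE time, avg. adequacy, error frequency per segment)."""
    n = len(segments)
    avg_time = sum(s["pe_seconds"] for s in segments) / n
    avg_adequacy = sum(s["adequacy"] for s in segments) / n
    error_types = ["mistranslation", "omission", "addition", "word_order"]
    freq = {e: sum(s["errors"].get(e, 0) for s in segments) / n
            for e in error_types}
    return avg_time, avg_adequacy, freq

# Two invented segments post-edited from NMT output:
nmt_segments = [
    {"pe_seconds": 40, "adequacy": 4, "errors": {"mistranslation": 1}},
    {"pe_seconds": 60, "adequacy": 3, "errors": {"omission": 1, "word_order": 1}},
]
time_, adequacy, freq = summarise(nmt_segments)
print(time_, adequacy, freq["omission"])  # 50.0 3.5 0.5
```

Running the same function over the SMT segments then allows a side-by-side comparison of the two paradigms, as in the students' reports.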
The second cohort (PhD students) were not graded for the exercise, and findings were
presented orally, as the four-day summer school format was time-limited and no grades
were awarded.
Students’ results
The study by Castilho et al. (2017b) on which this evaluation task was loosely based
found (for English to Portuguese, Greek, German, and Russian, in the educational
domain) that fluency is improved and word order errors are fewer using NMT when
compared with SMT. NMT produces fewer morphological errors and fewer segments
that require any editing. The authors found no clear improvement for omission or
mistranslation when using NMT, nor did they find any improvement in post-editing
throughput.
Students’ results, presented here with the caveat that this was not a controlled
study, but rather an exercise for them to analyse MT output, followed the same lines as
the published research study using professional translator participants, despite the
different language pairs and domains.7 Some 93% of each cohort stated a preference for NMT.
Language pairs chosen by cohort one were English to French (8 students), French to
English (19 students), German to English (13 students), Spanish to English (4 students),
English, and English to Spanish (2), Chinese (2), Turkish, and Russian. In cohort one,
62% of undergraduate students spent less time post-editing the NMT output. Average
rating for adequacy (the extent to which the meaning expressed in the source fragment
appears in the translation fragment) for SMT was 2.95 and for NMT 3.46 (where 3 =
most of it and 4 = all of it). Overall errors were fewer for NMT (10.60 as opposed to
18.79), and students found fewer word order errors (NMT: 1.75, SMT: 3.91), fewer
omission errors (NMT: 1.44, SMT: 2.86), fewer addition errors (NMT: 1.02, SMT: 1.66),
The results for cohort two were similar, with mean NMT adequacy rated at 2.9
and mean SMT adequacy rated at 2.0. Further evaluation results for students in this
cohort are in Table 1. The student evaluating Arabic-English output also rated errors
from 1 (low) to 4 (high) for gravity, with NMT errors averaging at 1.8 and SMT errors
averaging 3.7. The Chinese to Spanish translations were most likely translated in two
stages (Chinese to English, English to Spanish) using English as a pivot language, due
to the lack of availability of bilingual texts to use as MT training data in the Chinese-
Spanish language pair. This process would unavoidably limit MT output quality.
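The two-stage pivot process can be illustrated with a minimal sketch. The `translate` function and toy word tables below are hypothetical stand-ins (the pipeline actually used by the online systems is not public); the point is only that errors in the first stage propagate into the second.

```python
# Conceptual sketch of pivot translation. The toy translation tables and
# the single-word `translate` function are invented for illustration.

TOY_TABLES = {
    ("zh", "en"): {"猫": "cat"},
    ("en", "es"): {"cat": "gato"},
}

def translate(word, src, tgt):
    """Look up a single word in a toy translation table."""
    return TOY_TABLES[(src, tgt)][word]

def pivot_translate(word, src, tgt, pivot="en"):
    """Translate via a pivot language: src -> pivot -> tgt.
    Any error in the first stage is carried into the second, which is
    why pivoting places an upper bound on output quality."""
    return translate(translate(word, src, pivot), pivot, tgt)

print(pivot_translate("猫", "zh", "es"))  # gato
```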
7 Most Wikipedia topics chosen by the students were geographical locations, with less obvious choices including Russian Blue cats, Women in Nazi Germany, and Korean pop group Girls’ Generation.
** Place Table 1 about here **
Some of the students were more assiduous with their annotation than others,
with some comments revealing interesting observations about the comparison. The
comments presented here were considered representative of the research reports, and
were selected to support the points raised in the quantitative analysis. Although students
mostly preferred the NMT output, it became clear to them that errors could still be
expected using the neural paradigm. One student wrote that “Neural Translation tool
proved more effective, however of course it was not perfect and contained some errors
which I spent 20 minutes in post editing time correcting.” Another student found
mistranslations via both MT engines, for example when “the word ‘wetting’ was used
[paradigms] were also made plural when there was no reason to do so.” In general,
students were surprised at the high quality of NMT output: “I found NMT to produce
surprisingly good results in the case of long sentences, consisting of multiple clauses
(which are known to generally cause many problems during machine translation).” One
student suggested that NMT output was of higher quality “due to the multiple
German at all, removing the need for a German-speaking post-editor.” On the other
hand, a student with a preference for SMT found NMT errors difficult to identify,
whereas the “types of errors [found in SMT output] were not difficult to correct”. More
common was the student who found repairs such as “correcting gender agreement (e.g.:
par un camarade > par une camarade, from Neural [output]) and fixing flaws in word
order (e.g. et (suprême) extrêmement confiant en soi même (de soi-confiant.) (w.o),
from Neural [output])” to be far easier than finding “extremely precise vocabulary and
fixing tense issues (e.g. les sentiments pour l’un l’autre n’est (sont) jamais explicités,
from Statistical [output]).” One PhD student with a preference for NMT in English-
Russian also wrote that it is “easier to spot mistakes in SMT, but easier to correct errors
The average mark (out of 20) for the undergraduate cohort was 15.3 or 77%,
showing that they engaged well with the exercise and crafted reports that demonstrated
an excellent understanding of the strengths and weaknesses of NMT output for their
chosen language pair. The average mark for the module overall was 62%, so their
efforts in the comparative evaluation served to boost their mark overall, particularly for
surprise at the high quality and fluency of NMT, particularly for morphologically
complex languages that have proved troublesome for MT systems such as Arabic,
Russian, and even German. They also noted, with some relief, problems of omission
and mistranslation in NMT output. They did not consider NMT to be a threat to
translators as of yet, but were concerned that improvements in MT quality over time
may make it an attractive option in some scenarios such as for news translation or
software documentation. Most of the students were nonetheless positive about the
technology, and would be interested in working with NMT in future. They enjoyed
Two students noted that NMT had produced neologisms in their target languages
(Spanish and Turkish), which initially looked like comprehensible compound words, yet
they could not be found in dictionaries. This may be due to the process mentioned
previously of training using words broken down into smaller chunks or subword units
(Sennrich, Haddow and Birch 2016), so as to better translate words that are rare or do
not appear in the training data.
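As a rough illustration of why such plausible-looking coinages can arise, the sketch below splits an unseen word into known subword units using a greedy longest-match rule. This is a simplification, not the actual byte-pair encoding algorithm of Sennrich, Haddow and Birch (2016), and the toy vocabulary is invented.

```python
# Simplified illustration of subword segmentation (not actual byte-pair
# encoding): greedily split an unseen word into the longest subword
# units found in a toy vocabulary. A system that translates at the
# level of such units can recombine them into novel "words".

VOCAB = {"un", "trans", "lat", "able", "ion", "s"}

def segment(word, vocab):
    """Greedily split `word` into the longest subword units in `vocab`."""
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                units.append(word[i:j])
                i = j
                break
        else:
            units.append(word[i])  # fall back to a single character
            i += 1
    return units

print(segment("untranslatable", VOCAB))  # ['un', 'trans', 'lat', 'able']
```

A word absent from the training data is thus handled piece by piece, and the same mechanism can generate target-language compounds that look well formed but appear in no dictionary.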
In the follow-up lecture and discussion, students engaged with this complex
topic, asking questions related to - and building on - their own experience of working
with NMT output. Students discussed scenarios where use of NMT may be appropriate
– for perishable texts, as a springboard for ideas during rush-jobs – and those where
for high risk and literary texts, for translation jobs that involve regulatory compliance,
for example. As suggested by Massey (2017), the areas of ethics and risk proved to be a
useful starting point for further discussion: at present, machine learning copies human
As mentioned previously, this was not a controlled experiment, but rather an in-
class exercise carried out as part of a standard computer-aided translation module. The
such, there is little point in detailed data analysis. No biographical information was
requested from participants, nor any information about their language ability level. The
undergraduate cohort had no experience of research design, and had not considered the
importance of the order in which they had completed the tasks, or the effects of limited
post-editing experience. When asked, most said that they had completed the SMT
evaluation first, which may have caused their post-editing speed to be slower for that
paradigm. Conversely, the task order may also have put the NMT evaluation at a
disadvantage, as Wikipedia articles tend to begin simply, in the general domain, before
language. Also, even though students were asked to aim for two groups of ten sentences
of similar length, several admitted afterwards that they had not taken this instruction
Concluding remarks
Previous work has shown the benefits of hands-on experience working with MT in
(Doherty and Kenny, 2014). As NMT is a relatively new technology, and the technical
barriers to building and training NMT systems remain high, this comparative
evaluation task was developed as a way for students to understand the level of
translation quality that could be expected from current standard (SMT) and incoming
(NMT) state-of-the-art automatic translation systems, using evaluation metrics that are
standard in research and industry. As NMT use becomes more commonplace, with
The two cohorts who took part in this evaluation gained experience at translation
between source and target text, and error annotation using a simple typology of errors.
The students also experienced the task of post-editing, and were introduced to the
concept of post-editing effort, using one of the three measures of effort developed by Krings (2001).
highly likely to be disruptive in the coming years. Rather than fearing the incoming
instead be empowered to discuss its relative merits and drawbacks from their own –
Acknowledgements
This research is supported by the ADAPT Centre for Digital Content Technology, funded under
the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European
Regional Development Fund.
References:
Bentivogli, L., A. Bisazza, M. Cettolo, and M. Federico. 2016. “Neural versus Phrase-
Based Machine Translation Quality: a Case Study.” In Proceedings of Conference on
Empirical Methods in Natural Language Processing (EMNLP 2016), 257-267.
(http://arxiv.org/abs/1608.04631).
Castilho, S., J. Moorkens, F. Gaspari, I. Calixto, J. Tinsley, and A. Way. 2017a. “Is
Neural Machine Translation the New State-of-the-Art?” The Prague Bulletin of
Mathematical Linguistics 108: 109-120. doi: 10.1515/pralin-2017-0013
Cho, K., B. van Merrienboer, D. Bahdanau, and Y. Bengio. 2014. “On the Properties of
Neural Machine Translation: Encoder-Decoder Approaches”. Computing Research
Repository, abs/1409.1259
Doherty, S., and D. Kenny. 2014. “The Design and Evaluation of a Statistical Machine
Translation Syllabus for Translation Students.” The Interpreter and Translator Trainer
8:2, 276-294, doi:10.1080/1750399X.2014.937571
Doherty, S., F. Gaspari, J. Moorkens, and S. Castilho. 2018. “On Education and Training
in Translation Quality Assessment.” In Translation Quality Assessment: From Principles
to Practice, edited by J. Moorkens, S. Castilho, F. Gaspari, and S. Doherty. Berlin: Springer.
doi:10.1007/978-3-319-91241-7_5
Hassan, H., A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang,
M. Junczys-Dowmunt, W. Lewis, M. Li, S. Liu, T.-Y. Liu, R. Luo, A. Menezes,
T. Qin, F. Seide, X. Tan, F. Tian, L. Wu, S. Wu, Y. Xia, D. Zhang, Z. Zhang, and
M. Zhou. 2018. “Achieving Human Parity on Automatic Chinese to English News
Translation.” Computing Research Repository arXiv:1803.05567
(https://arxiv.org/abs/1803.05567).
Hurtado Albir, A. 2007. “Competence-based Curriculum Design for Training
Translators.” The Interpreter and Translator Trainer 1 (2): 163-195. doi:
10.1080/1750399X.2007.10798757
Kenny, D. 2007. “Translation Memories and Parallel Corpora: Challenges for the
Translation Trainer.” In Across Boundaries: International Perspectives on Translation,
edited by D. Kenny and K. Ryou, 192–208. Newcastle-upon-Tyne: Cambridge Scholars
Publishing.
Kenny, D., and S. Doherty. 2014. “Statistical Machine Translation in the Translation
Curriculum: Overcoming Obstacles and Empowering Translators.” The Interpreter and
Translator Trainer 8 (2), 295–315. doi:10.1080/1750399X.2014.936112.
Krings, H. P. 2001. Repairing Texts. Kent, OH: Kent State University Press.
Plitt, M., and F. Masselot. 2010. “A Productivity Test of Statistical Machine Translation
Post-Editing in a Typical Localisation Context.” The Prague Bulletin of Mathematical
Linguistics 93: 7-16.
Popović, M. 2017. “Comparing Language Related Issues for NMT and PBMT between
German and English.” The Prague Bulletin of Mathematical Linguistics 108: 209-220.
doi: 10.1515/pralin-2017-0021
…      SMT 2.1
EN-ES  NMT 3.1  Mostly mistranslation (3); omissions (elaborate, not easy to detect)  Mistranslation (9) and word order (2), omission (2), addition (3)
       SMT 1.7
ZH-ES  NMT 2.4  Mistranslation (8), word …  Mistranslation (10), word order …