Joss Moorkens
School of Applied Language and Intercultural Studies, Dublin City University, Ireland.
Email: joss.moorkens@dcu.ie
Twitter: @jossmo
http://orcid.org/0000-0003-0766-0071
https://www.linkedin.com/in/jossmo/
What to expect from Neural Machine Translation: A practical in-class
translation evaluation exercise
The rise of neural machine translation (NMT) has been accompanied by a good deal
of media hyperbole about neural networks and machine learning, some of which has
suggested that several professions, including translation, may be under threat. This
evaluation exercise is intended to empower students and to help them understand the
strengths and weaknesses of this new technology. Students’ findings across several
language pairs mirror those from published research, such as improved fluency and
word order in NMT output, alongside some unpredictable problems of omission and
mistranslation.
Subject classification codes: include these here if the journal requires them
Introduction
new and updated tools and technologies. Where these technologies can help translators
to “maintain high levels of productivity and offer value-added services” (Olohan 2007,
59), it is incumbent on translation trainers to ensure that students are made aware of
their usefulness in order to maximise their agency as translators, and to fulfil industry
employment needs. One particularly disruptive change in recent years has been the
translations, have become commonplace. Although the first wave of NMT publications
appeared relatively recently (Bahdanau, Cho and Bengio 2014; Cho et al. 2014), NMT
and deployment (Wu et al. 2016). NMT is also a statistical paradigm, with systems
trained on human translations. Nonetheless, claims that NMT is, according to the title of
a 2016 Google research paper, “bridging the gap between human and machine
translation” (Wu et al. 2016), or that Microsoft have achieved “human parity on
automatic Chinese to English news translation” (Hassan et al. 2018), have led to a great
deal of media hyperbole about the potential uses of NMT and related displacement of
determinism in media reports about machine learning, this may lead translators to fear
NMT, and engender a sense of powerlessness due to a perception that the technology
“gets precedence and is inevitable” (Cadwell, O’Brien, and Teixeira 2017, 17). O’Brien,
however, suggests that the “increasing technologisation of the profession is not a threat,
but an opportunity to expand skill sets and take on new roles” (2012, 118). The points
that may benefit from the skills of the translator, as highlighted by Kenny and Doherty
(2014), still hold true for NMT. With this point in mind, I contend that helping students
intervention.
MT tends to be perceived negatively by many translators due to low quality
expectations, imposition of the technology without any choice to opt out, and fear of
being replaced (Cadwell, O’Brien, and Teixeira 2017). It was originally envisaged that
MT would replace human translators (Hutchins 1986), and although that intention may
not still hold true (Way 2018), the perception remains. As MT has moved towards
statistical methods, the MT process has become more difficult to explain. NMT can be
particularly difficult for students and scholars to conceptualise, not least because neural
networks are complex and NMT output can be unpredictable (Arthur, Neubig and
Nakamura 2016)1. This makes it all the more important for translation students to
familiarise themselves with and demystify NMT output, and to become aware that,
despite the hype about machine learning, NMT output has many weaknesses as
well as strengths.
comparatively evaluate statistical and neural MT output for one language pair. It was set
following a large scale comparative evaluation carried out as part of the TraMOOC EU-
funded project (Castilho et al. 2017b), and scaled down so that students could complete
the evaluations within two hours. The exercise was designed based on the constructivist
paradigm so that students could independently learn about - and build an “evaluative
experience at using several translation quality assessment (TQA) techniques. For most
students, this was also a first experience of MT post-editing. The exercise was carried
out with second year undergraduate students of translation and repeated with a cohort of
Translation Studies PhD students in another university. Students’ reports showed many
1 See Forcada (2017) for an accessible introduction to the technology behind NMT.
insights into what may be expected of each technology. A lecture on MT, incorporating
NMT and published NMT evaluations from research and from industry, followed one
week after the in-class comparative task for the undergraduate cohort, providing context
for their findings, and explaining (at the level of concepts rather than mathematical
equations) the processes of creating and training an NMT system that lead to the type of
output that they had evaluated. Their findings meant that they had personal experience
of the systems discussed, and also mirrored the findings of the large-scale, randomised
TraMOOC evaluation across four language pairs, as well as those of a number of other
recently published evaluations.
The following section details the technological and industrial context of the
evaluations, and the potential employability benefits of learning about NMT. Thereafter,
the evaluation task is described, with results from both cohorts summarised, followed
by some in-class feedback and discussion points that were raised by the exercise.
commonplace some years before, the first published NMT papers appeared in 2014.2
Researchers began to create single neural networks, tuned to maximise the “probability
of a correct translation given a source sentence” (Bahdanau, Cho and Bengio 2014, 1)
based on contextual information from the source text and previously produced target
2 Forcada and Ñeco (1997) had, in fact, suggested a method that was effectively a precursor to NMT.
with translation of unknown source words by breaking words into subword particles,
(Sennrich, Haddow and Birch 2016), improving the quality of NMT output. When NMT
systems were entered into competitive MT environments in 2016, they scored above
SMT for many language pairs (English-German and vice versa, English-Czech,
English-Russian; see Bojar et al. 2016) despite the comparatively few years of NMT
forward in quality.
Comparative evaluation and the state of the art for machine translation
Several papers subsequently appeared, using automatic and human evaluation methods
to compare NMT with SMT. All highlight the low number of word order errors found in
NMT output and associated improved scores for fluency. Bentivogli et al. (2016, 265)
found that NMT had “significantly pushed ahead the state of the art”. In their
evaluation, technical post-editing effort (in terms of the number of edits) for English to
German was reduced on average by 26% when using NMT rather than the best-
performing SMT system, with fewer overall errors, and notably fewer word order and
verb placement errors. Wu et al. (2016) used automatic evaluation and human ranking
of 500 Wikipedia segments that had been machine-translated from English into Spanish,
languages. The publicity surrounding this publication and the subsequent move to NMT
by Google Translate for these language pairs led to a spike in NMT hype (Castilho et al.
2017a). Several further language pairs have now moved to Google NMT, prompting
translation quality.
A detailed evaluation by Popović (2017) of SMT and NMT output for English-
German and vice-versa found that NMT produced fewer overall errors, and that NMT
output contained improved verb order and verb forms, with fewer verbal omissions.
NMT also performed strongly regarding morphology and word order, particularly on
articles, phrase structure, English noun collocations, and German compound words. She
reported, however, that NMT was less successful in translating prepositions, ambiguous
Castilho et al. (2017a) found inconsistent evaluation results for adequacy and
addition and mistranslation in NMT output in several language pairs and three domains.
They conclude that, although NMT represents a significant improvement “for some
language pairs and specific domains, there is still much room for research and
improvement” (110). They caution that “overselling a technology that is still in need of
more research may cause negativity about MT”, and more specifically, a “wave of
Employability considerations
The outlook for employment in translation in the short-to-medium term looks to be
relatively positive. The U.S. Bureau of Labor Statistics expect an 18% increase in jobs
for translators and interpreters in the U.S.A. between 2016 and 2026, equating to 12,100
new positions. The demand for translators is predicted to vary depending on speciality
or language pair, and opportunities “should be plentiful for interpreters and translators
specializing in healthcare and law, because of the critical need for all parties to fully
work with MT output. Post-editors of NMT in Castilho et al. (2017b, 11) “found NMT
errors more difficult to identify”, whereas “word order errors and disfluencies requiring
revision were detected faster” in SMT output. Familiarity with NMT output, especially
considering the increasing importance of speed in the translator workplace (see Bowker
and McBride, 2017), should improve the efficiency of NMT error identification.
At present, a move is underway among MT providers to entirely neural or hybrid neural
methods: Microsoft have released online NMT engines, Kantan MT has created an
NMT product called NeuralFleet, and the ModernMT project has moved to NMT,3 despite
being led by one of the creators of the popular Moses SMT system (Koehn et al. 2007).
This all suggests that professional linguists will soon find themselves in workflows that
incorporate NMT to a greater or lesser extent. Learning about NMT should empower
the translator “as an agent who is very much present throughout” such a workflow,
rather than taking a “limited or reductive role” (Kenny and Doherty 2014, 290).
Gaspari, Almaghout and Doherty (2015) identified MT, TQA, and post-editing as
described in the following section incorporates each of these skills, and is also intended
participants gained TQA experience using three metrics. The first employs the construct of adequacy,
3 See https://github.com/ModernMT/MMT.
commonly used (along with fluency, both measured using a Likert-type scale) for
human evaluations of MT quality. The second metric employs error annotation using a
simple typology of errors, common among research and industry models but little-used
in academic scenarios.4 The third is post-editing effort, using one of the three categories
of effort identified by Krings (2001). Temporal effort, or time spent post-editing, is
commonly used, often to highlight the
Evaluation task
This evaluation task was set for a cohort of 46 second-year translation undergraduates,
with a two-hour time limit. These students had a small amount of translation experience
their marks for the module, with the remainder awarded on the basis of results of a
demonstrate awareness of appropriate tools that assist in the translation process, and
awareness of the historical development of CAT tools and their importance in modern-
day translation practice. They should be able to apply translation memory and
terminology management tools at a basic level, and to explain the interaction between
students had learned about TQA and the history of MT, but had not yet learned about
4 See Doherty, Gaspari, Moorkens, and Castilho (2018) and Lommel (2018) for further discussion of these.
SMT and NMT in any detail. All students had used online MT without any knowledge of the underlying technology.
The task was repeated with a cohort of 9 participants from 14 attendees at a PhD
Summer School, where postgraduate students and staff from another university
performed the evaluation in two one-hour blocks. The rationale for this was to see
whether the exercise would be useful for a different group with a background in
translation studies and more translation experience. The first cohort used the Microsoft
Translator MT Try & Compare website,5 and the second cohort were given the option of
using this same site or Google Translate for NMT and Google Sheets for SMT, as some
of the participants’ language pairs were not included in the Microsoft site.
Task description
The most popular Machine Translation (MT) paradigm over the past 15 years or
so has been Statistical Machine Translation (SMT). More recently, there have
been claims that Neural Machine Translation (NMT), a statistical paradigm that
5 From August 2017, the site at https://translator.microsoft.com/neural/ allowed users to compare NMT and SMT output. This changed in March 2018 so that users can test the research systems described in Hassan et al. (2018). At the time of writing (May 2018), users can still access Google SMT via Google Sheets. Due to these being free online tools, the MT
(2) Adequacy
They were provided with a list of available languages for the exercise based on those
Within the allotted time, students were asked to choose a page from Wikipedia, copy 20
segments in one of the languages supported by the MT systems (again, their choice),
aiming for two groups of 10 with similar sentence length. While Wikipedia articles
could not be considered “appropriate, authentic texts” (Kenny 2007, 204) for a
translation workflow, they were suggested for a time-limited task to avoid time-
consuming terminology searches. Students could translate material on a topic that was
familiar and interesting to them. The students had 10 segments translated using NMT
and 10 using SMT, and copied the output to their documents, making sure to clearly
differentiate the segments produced by SMT and NMT systems. The students then
(1) Post-editing effort (time spent, measured using the computer clock,
noting Total Editing Time in Word before and after evaluation, or ideally
6 Arabic, Chinese (simplified), English, French, German, Italian, Japanese, Korean, Portuguese,
(1) None of it
(2) Little of it
(3) Most of it
(4) All of it
and additions:
number, or case
Tip: Use the highlighter tool and count the number of occurrences for
each error.
Students were asked to work out their average post-editing time per segment for
each MT paradigm over the ten segments, the average segment adequacy score for each
paradigm, and the frequency of each error type by dividing the total number of
occurrences by the number of segments. They were then asked to state a preference for
one MT system or the other for their language pair, explaining their reasons using
examples from the exercise. They were also asked to
suggest scenarios in which these MT systems might be useful. The marking criteria
Examples provided: 5
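The averages the students were asked to compute are simple arithmetic. As a minimal sketch, the Python snippet below (not part of the original task; the segment data are invented, standing in for one student's annotations of one paradigm) shows the calculation.

```python
# Illustrative sketch of the students' calculations: average post-editing
# time per segment, average adequacy (1-4 Likert scale), and per-segment
# frequency of each error type. The data below are invented.

def summarise(segments):
    """Return (avg. PE time, avg. adequacy, error frequency per segment)."""
    n = len(segments)
    avg_time = sum(s["pe_seconds"] for s in segments) / n
    avg_adequacy = sum(s["adequacy"] for s in segments) / n
    error_types = ["mistranslation", "omission", "addition", "word_order"]
    freq = {e: sum(s["errors"].get(e, 0) for s in segments) / n
            for e in error_types}
    return avg_time, avg_adequacy, freq

# Two invented segments post-edited from NMT output:
nmt_segments = [
    {"pe_seconds": 40, "adequacy": 4, "errors": {"mistranslation": 1}},
    {"pe_seconds": 60, "adequacy": 3, "errors": {"omission": 1, "word_order": 1}},
]
time_, adequacy, freq = summarise(nmt_segments)
print(time_, adequacy, freq["omission"])  # 50.0 3.5 0.5
```

Running the same function over the SMT segments then allows a side-by-side comparison of the two paradigms, as in the students' reports.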
The second cohort (PhD students) were not graded for the exercise, and findings were
presented orally, as the four-day summer school format was time-limited and no grades
were awarded.
Students’ results
The study by Castilho et al. (2017b) on which this evaluation task was loosely based
found (for English to Portuguese, Greek, German, and Russian, in the educational
domain) that fluency is improved and word order errors are fewer using NMT when
compared with SMT. NMT produces fewer morphological errors and fewer segments
that require any editing. The authors found no clear improvement for omission or
mistranslation when using NMT, nor did they find any improvement in post-editing
throughput.
Students’ results, presented here with the caveat that this was not a controlled
study, but rather an exercise for them to analyse MT output, followed the same lines as
the published research study using professional translator participants, despite the
different language pairs and domains.7 Some 93% of each cohort stated a preference for NMT.
Language pairs chosen by cohort one were English to French (8 students), French to
English (19 students), German to English (13 students), Spanish to English (4 students),
English, and English to Spanish (2), Chinese (2), Turkish, and Russian. In cohort one,
62% of undergraduate students spent less time post-editing the NMT output. Average
rating for adequacy (the extent to which the meaning expressed in the source fragment
appears in the translation fragment) for SMT was 2.95 and for NMT 3.46 (where 3 =
most of it and 4 = all of it). Overall errors were fewer for NMT (10.60 as opposed to
18.79), and students found fewer word order errors (NMT: 1.75, SMT: 3.91), fewer
omission errors (NMT: 1.44, SMT: 2.86), fewer addition errors (NMT: 1.02, SMT: 1.66),
The results for cohort two were similar, with mean NMT adequacy rated at 2.9
and mean SMT adequacy rated at 2.0. Further evaluation results for students in this
cohort are in Table 1. The student evaluating Arabic-English output also rated errors
from 1 (low) to 4 (high) for gravity, with NMT errors averaging at 1.8 and SMT errors
averaging 3.7. The Chinese to Spanish translations were most likely translated in two
stages (Chinese to English, English to Spanish) using English as a pivot language, due
to the lack of availability of bilingual texts to use as MT training data in the Chinese-
Spanish language pair. This process would unavoidably limit MT output quality.
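The two-stage pivot process can be illustrated with a minimal sketch. The `translate` function and toy word tables below are hypothetical stand-ins (the pipeline actually used by the online systems is not public); the point is only that errors in the first stage propagate into the second.

```python
# Conceptual sketch of pivot translation. The toy translation tables and
# the single-word `translate` function are invented for illustration.

TOY_TABLES = {
    ("zh", "en"): {"猫": "cat"},
    ("en", "es"): {"cat": "gato"},
}

def translate(word, src, tgt):
    """Look up a single word in a toy translation table."""
    return TOY_TABLES[(src, tgt)][word]

def pivot_translate(word, src, tgt, pivot="en"):
    """Translate via a pivot language: src -> pivot -> tgt.
    Any error in the first stage is carried into the second, which is
    why pivoting places an upper bound on output quality."""
    return translate(translate(word, src, pivot), pivot, tgt)

print(pivot_translate("猫", "zh", "es"))  # gato
```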
7 Most Wikipedia topics chosen by the students were geographical locations, with less obvious choices including Russian Blue cats, Women in Nazi Germany, and Korean pop group Girls’ Generation.
** Place Table 1 about here **
Some of the students were more assiduous with their annotation than others,
with some comments revealing interesting observations about the comparison. The
comments presented here were considered representative of the research reports, and
were selected to support the points raised in the quantitative analysis. Although students
mostly preferred the NMT output, it became clear to them that errors could still be
expected using the neural paradigm. One student wrote that “Neural Translation tool
proved more effective, however of course it was not perfect and contained some errors
which I spent 20 minutes in post editing time correcting.” Another student found
mistranslations via both MT engines, for example when “the word ‘wetting’ was used
[paradigms] were also made plural when there was no reason to do so.” In general,
students were surprised at the high quality of NMT output: “I found NMT to produce
surprisingly good results in the case of long sentences, consisting of multiple clauses
(which are known to generally cause many problems during machine translation).” One
student suggested that NMT output was of higher quality “due to the multiple
German at all, removing the need for a German-speaking post-editor.” On the other
hand, a student with a preference for SMT found NMT errors difficult to identify,
whereas the “types of errors [found in SMT output] were not difficult to correct”. More
common was the student who found repairs such as “correcting gender agreement (e.g.:
par un camarade > par une camarade, from Neural [output]) and fixing flaws in word
order (e.g. et (suprême) extrêmement confiant en soi même (de soi-confiant.) (w.o),
from Neural [output])” to be far easier than finding “extremely precise vocabulary and
fixing tense issues (e.g. les sentiments pour l’un l’autre n’est (sont) jamais explicités,
from Statistical [output]).” One PhD student with a preference for NMT in English-
Russian also wrote that it is “easier to spot mistakes in SMT, but easier to correct errors
The average mark (out of 20) for the undergraduate cohort was 15.3 or 77%,
showing that they engaged well with the exercise and crafted reports that demonstrated
an excellent understanding of the strengths and weaknesses of NMT output for their
chosen language pair. The average mark for the module overall was 62%, so their
efforts in the comparative evaluation served to boost their mark overall, particularly for
surprise at the high quality and fluency of NMT, particularly for morphologically
complex languages that have proved troublesome for MT systems such as Arabic,
Russian, and even German. They also noted, with some relief, problems of omission
and mistranslation in NMT output. They did not consider NMT to be a threat to
translators as of yet, but were concerned that improvements in MT quality over time
may make it an attractive option in some scenarios such as for news translation or
software documentation. Most of the students were nonetheless positive about the
technology, and would be interested in working with NMT in future. They enjoyed
Two students noted that NMT had produced neologisms in their target languages
(Spanish and Turkish), which initially looked like comprehensible compound words, yet
they could not be found in dictionaries. This may be due to the process mentioned
previously of training using words broken down into smaller chunks or subword units
(Sennrich, Haddow and Birch 2016), so as to better translate words that are rare or do
not appear in the training data.
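As a rough illustration of why such plausible-looking coinages can arise, the sketch below splits an unseen word into known subword units using a greedy longest-match rule. This is a simplification, not the actual byte-pair encoding algorithm of Sennrich, Haddow and Birch (2016), and the toy vocabulary is invented.

```python
# Simplified illustration of subword segmentation (not actual byte-pair
# encoding): greedily split an unseen word into the longest subword
# units found in a toy vocabulary. A system that translates at the
# level of such units can recombine them into novel "words".

VOCAB = {"un", "trans", "lat", "able", "ion", "s"}

def segment(word, vocab):
    """Greedily split `word` into the longest subword units in `vocab`."""
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                units.append(word[i:j])
                i = j
                break
        else:
            units.append(word[i])  # fall back to a single character
            i += 1
    return units

print(segment("untranslatable", VOCAB))  # ['un', 'trans', 'lat', 'able']
```

A word absent from the training data is thus handled piece by piece, and the same mechanism can generate target-language compounds that look well formed but appear in no dictionary.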
In the follow-up lecture and discussion, students engaged with this complex
topic, asking questions related to - and building on - their own experience of working
with NMT output. Students discussed scenarios where use of NMT may be appropriate
– for perishable texts, as a springboard for ideas during rush-jobs – and those where
for high risk and literary texts, for translation jobs that involve regulatory compliance,
for example. As suggested by Massey (2017), the areas of ethics and risk proved to be a
useful starting point for further discussion: at present, machine learning copies human
As mentioned previously, this was not a controlled experiment, but rather an in-
class exercise carried out as part of a standard computer-aided translation module. The
such, there is little point in detailed data analysis. No biographical information was
requested from participants, nor any information about their language ability level. The
undergraduate cohort had no experience of research design, and had not considered the
importance of the order in which they had completed the tasks, or the effects of limited
post-editing experience. When asked, most said that they had completed the SMT
evaluation first, which may have caused their post-editing speed to be slower for that
paradigm. Conversely, the task order may also have put the NMT evaluation at a
disadvantage, as Wikipedia articles tend to begin simply, in the general domain, before
language. Also, even though students were asked to aim for two groups of ten sentences
of similar length, several admitted afterwards that they had not taken this instruction
Concluding remarks
Previous work has shown the benefits of hands-on experience working with MT in
(Doherty and Kenny, 2014). As NMT is a relatively new technology, and the technical
barriers to building and training NMT systems remain high, this comparative
evaluation task was developed as a way for students to understand the level of
translation quality that could be expected from current standard (SMT) and incoming
(NMT) state-of-the-art automatic translation systems, using evaluation metrics that are
standard in research and industry. As NMT use becomes more commonplace, with
The two cohorts who took part in this evaluation gained experience at translation
between source and target text, and error annotation using a simple typology of errors.
The students also experienced the task of post-editing, and were introduced to the
concept of post-editing effort, using one of the three measures of effort developed by Krings (2001).
highly likely to be disruptive in the coming years. Rather than fearing the incoming
instead be empowered to discuss its relative merits and drawbacks from their own –
Acknowledgements
This research is supported by the ADAPT Centre for Digital Content Technology, funded under
the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European
Regional Development Fund.
References:
Bentivogli, L., A. Bisazza, M. Cettolo, and M. Federico. 2016. “Neural versus Phrase-
Based Machine Translation Quality: a Case Study.” In Proceedings of Conference on
Empirical Methods in Natural Language Processing (EMNLP 2016), 257-267.
(http://arxiv.org/abs/1608.04631).
Castilho, S., J. Moorkens, F. Gaspari, I. Calixto, J. Tinsley, and A. Way. 2017a. “Is
Neural Machine Translation the New State-of-the-Art?” The Prague Bulletin of
Mathematical Linguistics 108: 109-120. doi: 10.1515/pralin-2017-0013
Cho, K., B. van Merrienboer, D. Bahdanau, and Y. Bengio. 2014. “On the Properties of
Neural Machine Translation: Encoder-Decoder Approaches”. Computing Research
Repository, abs/1409.1259
Doherty, S., and D. Kenny. 2014. “The Design and Evaluation of a Statistical Machine
Translation Syllabus for Translation Students.” The Interpreter and Translator Trainer
8:2, 276-294, doi:10.1080/1750399X.2014.937571
Doherty, S., F. Gaspari, J. Moorkens, and S. Castilho. 2018. “On Education and Training
in Translation Quality Assessment.” In Translation Quality Assessment: From Principles
to Practice, edited by J. Moorkens, S. Castilho, F. Gaspari, and S. Doherty. Berlin: Springer.
doi:10.1007/978-3-319-91241-7_5
Hassan, H., A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang,
M. Junczys-Dowmunt, W. Lewis, M. Li, S. Liu, T.-Y. Liu, R. Luo, A. Menezes,
T. Qin, F. Seide, X. Tan, F. Tian, L. Wu, S. Wu, Y. Xia, D. Zhang, Z. Zhang, and
M. Zhou. 2018. “Achieving Human Parity on Automatic Chinese to English News
Translation.” Computing Research Repository arXiv:1803.05567
(https://arxiv.org/abs/1803.05567).
Hurtado Albir, A. 2007. “Competence-based Curriculum Design for Training
Translators.” The Interpreter and Translator Trainer 1 (2): 163-195. doi:
10.1080/1750399X.2007.10798757
Kenny, D. 2007. “Translation Memories and Parallel Corpora: Challenges for the
Translation Trainer.” In Across Boundaries: International Perspectives on Translation,
edited by D. Kenny and K. Ryou, 192–208. Newcastle-upon-Tyne: Cambridge Scholars
Publishing.
Kenny, D., and S. Doherty. 2014. “Statistical Machine Translation in the Translation
Curriculum: Overcoming Obstacles and Empowering Translators.” The Interpreter and
Translator Trainer 8 (2), 295–315. doi:10.1080/1750399X.2014.936112.
Krings, H. P. 2001. Repairing Texts. Kent, OH: Kent State University Press.
Plitt, M., and F. Masselot. 2010. “A Productivity Test of Statistical Machine Translation
Post-Editing in a Typical Localisation Context.” The Prague Bulletin of Mathematical
Linguistics 93: 7-16.
Popović, M. 2017. “Comparing Language Related Issues for NMT and PBMT between
German and English.” The Prague Bulletin of Mathematical Linguistics 108: 209-220.
doi: 10.1515/pralin-2017-0021
…      SMT 2.1
EN-ES  NMT 3.1  Mostly mistranslation (3); omissions (elaborate, not easy to detect)  Mistranslation (9) and word order (2), omission (2), addition (3)
       SMT 1.7
ZH-ES  NMT 2.4  Mistranslation (8), word …  Mistranslation (10), word order …