
What to expect from Neural Machine Translation: A practical in-class translation evaluation exercise

Joss Moorkens

School of Applied Language and Intercultural Studies, Dublin City University, Ireland.

Office phone: +353-1-7007477

Email: joss.moorkens@dcu.ie

Twitter: @jossmo

http://orcid.org/0000-0003-0766-0071

https://www.linkedin.com/in/jossmo/

Machine translation is currently undergoing a paradigm shift from statistical to


neural network models. Neural machine translation (NMT) is difficult to
conceptualise for translation students, especially without context. This article
describes a short in-class evaluation exercise to compare statistical and neural
MT, including details of student results and follow-on discussions. As part of this
exercise, students carry out evaluations of two types of MT output using three
translation quality assurance (TQA) metrics: adequacy, post-editing productivity,
and a simple error taxonomy. In this way, the exercise introduces NMT, TQA,
and post-editing. In our module, a more detailed explanation of NMT followed
the evaluation.

The rise of NMT has been accompanied by a good deal of media hyperbole about
neural networks and machine learning, some of which has suggested that several
professions, including translation, may be under threat. This evaluation exercise
is intended to empower the students, and help them understand the strengths and
weaknesses of this new technology. Students’ findings using several language
pairs mirror those from published research, such as improved fluency and word
order in NMT output, with some unpredictable problems of omission and
mistranslation.

Keywords: Machine translation; neural MT; neural networks; machine translation


evaluation; translation technology


Introduction

Teaching third-level students about translation technology is complicated by the

dynamic technological environment, with standard practices regularly superseded by

new and updated tools and technologies. Where these technologies can help translators

to “maintain high levels of productivity and offer value-added services” (Olohan 2007,

59), it is incumbent on translation trainers to ensure that students are made aware of
their usefulness in order to maximise their agency as translators, and to fulfil industry

employment needs. One particularly disruptive change in recent years has been the

advent of neural machine translation (NMT).

Since the early 2000s, statistical MT (SMT) systems, trained on human

translations, have become commonplace. Although the first wave of NMT publications

appeared relatively recently (Bahdanau, Cho and Bengio 2014; Cho et al. 2014), NMT

has quickly gained a foothold in both academia and industry by outperforming

statistical systems in competitions (Bojar et al. 2016) and in well-publicised research

and deployment (Wu et al. 2016). NMT is also a statistical paradigm, with systems

trained on human translations. Nonetheless, claims that NMT is, according to the title of

a 2016 Google research paper, “bridging the gap between human and machine

translation” (Wu et al. 2016), or that Microsoft have achieved “human parity on

automatic Chinese to English news translation” (Hassan et al. 2018), have led to a great

deal of media hyperbole about the potential uses of NMT and related displacement of

human translators (Castilho et al. 2017a). In conjunction with widespread technological

determinism in media reports about machine learning, this may lead translators to fear

NMT, and engender a sense of powerlessness due to a perception that the technology

“gets precedence and is inevitable” (Cadwell, O’Brien, and Teixeira 2017, 17). O’Brien,

however, suggests that the “increasing technologisation of the profession is not a threat,

but an opportunity to expand skill sets and take on new roles” (2012, 118). The points

of potential intervention in the SMT preparation, training, and post-editing processes

that may benefit from the skills of the translator, as highlighted by Kenny and Doherty

(2014), still hold true for NMT. With this point in mind, I contend that helping students

to learn about new technologies, including NMT, is a positive and empowering

intervention.
MT tends to be perceived negatively by many translators due to low quality

expectations, imposition of the technology without any choice to opt out, and fear of

being replaced (Cadwell, O’Brien, and Teixeira 2017). It was originally envisaged that

MT would replace human translators (Hutchins 1986), and although that intention may

not still hold true (Way 2018), the perception remains. As MT has moved towards

statistical methods, the MT process has become more difficult to explain. NMT can be

particularly difficult for students and scholars to conceptualise, not least because neural

networks are complex and NMT output can be unpredictable (Arthur, Neubig and

Nakamura 2016)1. This makes it all the more important for translation students to

familiarise themselves with and demystify NMT output, and to become aware that,

despite the hype about machine learning, NMT output has many weaknesses as well as strengths.

1 See Forcada (2017) for an accessible introduction to the technology behind NMT.

This article reports a practical in-class translation evaluation exercise to

comparatively evaluate statistical and neural MT output for one language pair. It was set

following a large scale comparative evaluation carried out as part of the TraMOOC EU-

funded project (Castilho et al. 2017b), and scaled down so that students could complete

the evaluations within two hours. The exercise was designed based on the constructivist

paradigm so that students could independently learn about - and build an “evaluative

awareness” (Massey 2017) of - a cutting-edge translation technology, while also gaining

experience at using several translation quality assessment (TQA) techniques. For most

students, this was also a first experience of MT post-editing. The exercise was carried

out with second year undergraduate students of translation and repeated with a cohort of

Translation Studies PhD students in another university. Students’ reports showed many insights into what may be expected of each technology. A lecture on MT, incorporating

NMT and published NMT evaluations from research and from industry, followed one

week after the in-class comparative task for the undergraduate cohort, providing context

for their findings, and explaining (at the level of concepts rather than mathematical

equations) the processes of creating and training an NMT system that lead to the type of

output that they had evaluated. Their findings meant that they had personal experience of the systems discussed; those findings also mirrored the results of the large-scale, randomised TraMOOC evaluation across four language pairs, and of a number of other recently published comparative evaluations.

The following section details the technological and industrial context of the

emergence of NMT, including summarised results from previously published

evaluations, and the potential employability benefits of learning about NMT. Thereafter,

the evaluation task is described, with results from both cohorts summarised, followed

by some in-class feedback and discussion points that were raised by the exercise.

Technological and industrial context

Although the application of neural networks to speech recognition had become commonplace some years earlier, the first published NMT papers appeared in 2014.2 Researchers began to create single neural networks, tuned to maximise the “probability of a correct translation given a source sentence” (Bahdanau, Cho and Bengio 2014, 1) based on contextual information from the source text and previously produced target text.

2 Forcada and Ñeco (1997) had, in fact, suggested a method that was effectively a precursor to NMT some years before.
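In conceptual terms (a standard formulation of this idea, not notation used in this article or the papers cited), the system selects the target sentence that is most probable given the source sentence, producing it one word at a time:

\[
p(y \mid x) \;=\; \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, x)
\]

Here \(x\) is the source sentence, \(y = (y_1, \ldots, y_T)\) the target sentence, and each target word \(y_t\) is conditioned on the source text and on the target words already produced, which are exactly the two sources of contextual information described above.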


New techniques were developed to cope with variable sentence lengths and to help with translation of unknown source words by breaking words into subword units

(Sennrich, Haddow and Birch 2016), improving the quality of NMT output. When NMT

systems were entered into competitive MT environments in 2016, they scored above

SMT for many language pairs (English-German and vice versa, English-Czech,

English-Russian; see Bojar et al. 2016) despite the comparatively few years of NMT

development, leading to great anticipation within the MT research community of a leap

forward in quality.

Comparative evaluation and the state of the art for machine translation

Several papers subsequently appeared, using automatic and human evaluation methods

to compare NMT with SMT. All highlight the low number of word order errors found in

NMT output and associated improved scores for fluency. Bentivogli et al. (2016, 265)

found that NMT had “significantly pushed ahead the state of the art”. In their

evaluation, technical post-editing effort (in terms of the number of edits) for English to

German was reduced on average by 26% when using NMT rather than the best-

performing SMT system, with fewer overall errors, and notably fewer word order and

verb placement errors. Wu et al. (2016) used automatic evaluation and human ranking

of 500 Wikipedia segments that had been machine-translated from English into Spanish,

French, Simplified Chinese, and vice-versa, concluding that NMT strongly

outperformed other approaches, improving translation quality for morphologically rich

languages. The publicity surrounding this publication and the subsequent move to NMT

by Google Translate for these language pairs led to a spike in NMT hype (Castilho et al.

2017a). Several further language pairs have now moved to Google NMT, prompting

Burchardt et al. (2017, 169) to note a “striking improvement” in English-German

translation quality.
A detailed evaluation by Popović (2017) of SMT and NMT output for English-

German and vice-versa found that NMT produced fewer overall errors, and that NMT

output contained improved verb order and verb forms, with fewer verbal omissions.

NMT also performed strongly regarding morphology and word order, particularly on

articles, phrase structure, English noun collocations, and German compound words. She

reported, however, that NMT was less successful in translating prepositions, ambiguous

English words, and continuous English verbs.

Castilho et al. (2017a) found inconsistent evaluation results for adequacy and

post-editing effort, and comparatively higher numbers of identified errors of omission,

addition and mistranslation in NMT output in several language pairs and three domains.

They conclude that, although NMT represents a significant improvement “for some

language pairs and specific domains, there is still much room for research and

improvement” (110). They caution that “overselling a technology that is still in need of

more research may cause negativity about MT”, and more specifically, a “wave of

discontent and suspicion among translators” (118).

Employability considerations
The outlook for employment in translation in the short-to-medium term looks to be

relatively positive. The U.S. Bureau of Labor Statistics expects an 18% increase in jobs

for translators and interpreters in the U.S.A. between 2016 and 2026, equating to 12,100

new positions. The demand for translators is predicted to vary depending on speciality

or language pair, and opportunities “should be plentiful for interpreters and translators

specializing in healthcare and law, because of the critical need for all parties to fully

understand the information communicated in those fields” (Bureau of Labor Statistics,

2018). The translation industry as a whole continues to report growth, with an

increasing proportion of turnover coming from post-editing of MT (Lommel and


DePalma, 2016). The likelihood is therefore that many graduate translators will have to

work with MT output. Post-editors of NMT in Castilho et al. (2017b, 11) “found NMT

errors more difficult to identify”, whereas “word order errors and disfluencies requiring

revision were detected faster” in SMT output. Familiarity with NMT output, especially

considering the increasing importance of speed in the translator workplace (see Bowker

and McBride, 2017), should improve the efficiency of NMT error identification.

At present, MT providers are moving to entirely neural or hybrid neural MT systems. As mentioned, Google Translate has been increasingly adopting neural methods, Microsoft have released online NMT engines, KantanMT has created an NMT product called NeuralFleet, and the ModernMT project has moved to NMT,3 despite being led by one of the creators of the popular Moses SMT system (Koehn et al. 2007).

3 See https://github.com/ModernMT/MMT.

This all suggests that professional linguists will soon find themselves in workflows that

incorporate NMT to a greater or lesser extent. Learning about NMT should empower

the translator “as an agent who is very much present throughout” such a workflow,

rather than taking a “limited or reductive role” (Kenny and Doherty 2014, 290).

Gaspari, Almaghout and Doherty (2015) identified MT, TQA, and post-editing as

underrepresented skills in translator training programmes generally. The exercise

described in the following section incorporates each of these skills, and is also intended

to contribute towards the translator’s technological competence (EMT Network 2017)

and instrumental competence (Hurtado Albir 2007). More specifically, student

participants gained TQA experience using three metrics. The first employs the construct

of adequacy, a functional measure of equivalence between source and target text,

commonly used (along with fluency, both measured using a Likert-type scale) for

human evaluations of MT quality. The second metric employs error annotation using a

simple typology of errors, common among research and industry models but little-used

in academic scenarios.4 The third is post-editing effort, using one of the three categories

of effort introduced by Krings (2001): temporal, technical, and cognitive effort.

Temporal effort, or time spent post-editing, is commonly used, often to highlight the

benefit of MT as a translation aid despite a lack of enthusiasm from translator users

(Plitt and Masselot 2010; Guerberof 2012).

4 See Doherty, Gaspari, Moorkens, and Castilho (2018) and Lommel (2018) for a further discussion of these.

Evaluation task

This evaluation task was set for a cohort of 46 second-year translation undergraduate

students at Dublin City University as part of a Computer-Aided Translation module

with a two-hour time limit. These students had a small amount of translation experience

in a classroom environment and no experience of post-editing. It contributed to 20% of

their marks for the module, with the remainder awarded on the basis of results of a

Translation Memory project. Following this module, students should be able to

demonstrate awareness of appropriate tools that assist in the translation process, and

awareness of the historical development of CAT tools and their importance in modern-

day translation practice. They should be able to apply translation memory and

terminology management tools at a basic level, and to explain the interaction between

translation memory and terminology management tools. Prior to this evaluation,

students had learned about TQA and the history of MT, but had not yet learned about

SMT and NMT in any detail. All students had used online MT without any knowledge

of the MT paradigm employed. Their attitudes to MT in the classroom were broadly

positive, but were not measured using any tests or questionnaires.

The task was repeated with a cohort of 9 participants from 14 attendees at a PhD

Summer School, where postgraduate students and staff from another university

performed the evaluation in two one-hour blocks. The rationale for this was to see

whether the exercise would be useful for a different group with a background in

translation studies and more translation experience. The first cohort used the Microsoft

Translator MT Try & Compare website5, and the second cohort were given the option of

using this same site or Google Translate for NMT and Google Sheets for SMT, as some

of the participants’ language pairs were not included in the Microsoft site.

5 From August 2017, the site at https://translator.microsoft.com/neural/ allowed users to compare NMT and SMT output. This changed in March 2018 so that users can test the research systems described in Hassan et al. (2018). At the time of writing (May 2018), users can still access Google SMT via Google Sheets. As these are free online tools, the MT systems involved are liable to change without warning.

Task description

Students were given the following preamble:

The most popular Machine Translation (MT) paradigm over the past 15 years or

so has been Statistical Machine Translation (SMT). More recently, there have

been claims that Neural Machine Translation (NMT), a statistical paradigm that

carries out more operations simultaneously, can produce improved output.

You will compare SMT and NMT using three evaluation methods:

(1) Post-editing effort

(2) Adequacy

(3) Error typology

They were provided with a list of available languages for the exercise based on those

available on the Microsoft site.6

6 Arabic, Chinese (simplified), English, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish.

Within the allotted time, students were asked to choose a page from Wikipedia, copy 20

segments in one of the languages supported by the MT systems (again, their choice),

aiming for two groups of 10 with similar sentence length. While Wikipedia articles

could not be considered “appropriate, authentic texts” (Kenny 2007, 204) for a

translation workflow, they were suggested for a time-limited task to avoid time-

consuming terminology searches. Students could translate material on a topic that was

familiar and interesting to them. The students had 10 segments translated using NMT

and 10 using SMT, and copied the output to their documents making sure to clearly

differentiate the segments produced by SMT and NMT systems. The students then

carried out three evaluations based on the following instructions:

(1) Post-editing effort (time spent, measured using the computer clock,

noting Total Editing Time in Word before and after evaluation, or ideally

a phone stopwatch). This is known as Temporal Post-Editing Effort.



(2) Adequacy: How much of the meaning expressed in the source fragment

appears in the translation fragment?

(1) None of it

(2) Little of it

(3) Most of it

(4) All of it

(3) Error typology: Mark word order errors, mistranslations, omissions,

and additions:

• Word order errors: incorrect word order at phrase or word level
• Mistranslations: incorrectly translated word, wrong gender, number, or case
• Omissions: word(s) from the ST have been omitted from the TT
• Additions: word(s) not in the ST have been added in the TT

Tip: Use the highlighter tool and count the number of occurrences for

each error.

Students were asked to work out their average post-editing time per segment for

each MT paradigm over the ten segments, the average segment adequacy score for each

paradigm, and the frequency of each error type by dividing the total number of

occurrences by the number of segments (presumably 10).
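Purely as an illustration of the arithmetic involved (students worked by hand or in a spreadsheet; all data values below are invented), the per-paradigm summary figures could be computed as follows:

```python
# Illustrative sketch of the students' calculations for ONE MT paradigm.
# All data values are invented; students recorded their own figures by hand.
pe_times = [74, 102, 65, 88, 91, 120, 59, 83, 77, 95]  # post-editing seconds per segment
adequacy = [3, 4, 3, 2, 4, 3, 3, 4, 3, 4]              # 1 = none ... 4 = all of the meaning
errors = {"word order": 4, "mistranslation": 9, "omission": 2, "addition": 1}

n = len(pe_times)  # number of segments, here 10
print(f"Average post-editing time: {sum(pe_times) / n:.1f} seconds/segment")
print(f"Average adequacy: {sum(adequacy) / n:.2f}")
for error_type, count in errors.items():
    print(f"{error_type}: {count / n:.2f} occurrences/segment")
```

Running the same calculation over the ten NMT segments and the ten SMT segments gives the two sets of averages that students compared in their reports.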

Deliverables and Marking Criteria


Students were asked to submit a short report (up to 500 words), in which they were to

state a preference for one MT system or the other for their language pair and to explain their reasons, using examples from the exercise. They were also asked to

suggest scenarios in which these MT systems might be useful. The marking criteria

provided for the exercise were as follows (marks out of 20):

• Credible results for each evaluation: 5
• Reasoned preference for an MT paradigm: 5
• Examples provided: 5
• Appropriate scenario for MT system(s): 3
• Originality and critical input: 2

The second cohort (PhD students) were not graded for the exercise, and their findings were presented orally, as the four-day summer school format left no scope for awarding grades or credits.

Students’ results

The study by Castilho et al. (2017b) on which this evaluation task was loosely based

found (for English to Portuguese, Greek, German, and Russian, in the educational

domain) that fluency is improved and word order errors are fewer using NMT when

compared with SMT. NMT produces fewer morphological errors and fewer segments

that require any editing. The authors found no clear improvement for omission or

mistranslation when using NMT, nor did they find any improvement in post-editing

throughput.

Students’ results, presented here with the caveat that this was not a controlled

study, but rather an exercise for them to analyse MT output, followed the same lines as

the published research study using professional translator participants, despite the
different language pairs and domains.7 In each cohort, 93% of students stated a preference for NMT.

Language pairs chosen by cohort one were English to French (8 students), French to

English (19 students), German to English (13 students), Spanish to English (4 students),

and English to Spanish (2 students). Cohort two chose Chinese to Spanish, Arabic to

English, and English to Spanish (2), Chinese (2), Turkish, and Russian. In cohort one,

62% of undergraduate students spent less time post-editing the NMT output. Average

rating for adequacy (the extent to which the meaning expressed in the source fragment

appears in the translation fragment) for SMT was 2.95 and for NMT 3.46 (where 3 =

most of it and 4 = all of it). Overall errors were fewer for NMT (10.60 as opposed to

18.79), and students found fewer word order errors (NMT: 1.75, SMT 3.91), fewer

omission errors (NMT: 1.44, SMT 2.86), fewer addition errors (NMT: 1.02, SMT 1.66),

and fewer mistranslations in NMT (NMT: 6.78, SMT 10.01).

The results for cohort two were similar, with mean NMT adequacy rated at 2.9

and mean SMT adequacy rated at 2.0. Further evaluation results for students in this

cohort are in Table 1. The student evaluating Arabic-English output also rated errors

from 1 (low) to 4 (high) for gravity, with NMT errors averaging 1.8 and SMT errors

averaging 3.7. The Chinese to Spanish translations were most likely translated in two

stages (Chinese to English, English to Spanish) using English as a pivot language, due

to the lack of availability of bilingual texts to use as MT training data in the Chinese-

Spanish language pair. This process would unavoidably limit MT output quality.
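Purely as an illustration of what pivoting means in practice (the translate function below is an invented stand-in, not the interface of any real MT service):

```python
def translate(text, source, target):
    """Invented stand-in for a call to an MT system; any real API would differ."""
    return f"<{source}->{target} MT output for: {text}>"

# Two-stage pivot translation: Chinese -> English, then English -> Spanish.
# Errors introduced in the first stage are carried into, and compounded by, the second.
english = translate("你好，世界", source="zh", target="en")
spanish = translate(english, source="en", target="es")
print(spanish)
```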

7 Most Wikipedia topics chosen by the students were geographical locations, with less obvious choices including Russian Blue cats, Women in Nazi Germany, and Korean pop group Girls’ Generation.
** Place Table 1 about here **

Some of the students were more assiduous with their annotation than others,

with some comments revealing interesting observations about the comparison. The

comments presented here were considered representative of the research reports, and

were selected to support the points raised in the quantitative analysis. Although students

mostly preferred the NMT output, it became clear to them that errors could still be

expected using the neural paradigm. One student wrote that “Neural Translation tool

proved more effective, however of course it was not perfect and contained some errors

which I spent 20 minutes in post editing time correcting.” Another student found

mistranslations via both MT engines, for example when “the word ‘wetting’ was used

instead of "humidity" in the neural machine translation.” “Some words in both

[paradigms] were also made plural when there was no reason to do so.” In general,

students were surprised at the high quality of NMT output: “I found NMT to produce

surprisingly good results in the case of long sentences, consisting of multiple clauses

(which are known to generally cause many problems during machine translation).” One

student suggested that NMT output was of higher quality “due to the multiple

operations it is capable of performing simultaneously.”

One of the students believed that German-English NMT was of a quality

sufficient to make monolingual post-editing feasible “by someone with no knowledge of

German at all, removing the need for a German-speaking post-editor.” On the other

hand, a student with a preference for SMT found NMT errors difficult to identify,

whereas the “types of errors [found in SMT output] were not difficult to correct”. More

common was the student who found repairs such as “correcting gender agreement (e.g.:

par un camarade > par une camarade, from Neural [output]) and fixing flaws in word
order (e.g. et (suprême) extrêmement confiant en soi même (de soi-confiant.) (w.o),

from Neural [output])” to be far easier than finding “extremely precise vocabulary and

fixing tense issues (e.g. les sentiments pour l’un l’autre n’est (sont) jamais explicités,

from Statistical [output]).” One PhD student with a preference for NMT in English-

Russian also wrote that it is “easier to spot mistakes in SMT, but easier to correct errors

in NMT (they are fewer, but harder to spot at times)”.

The average mark (out of 20) for the undergraduate cohort was 15.3 or 77%,

showing that they engaged well with the exercise and crafted reports that demonstrated

an excellent understanding of the strengths and weaknesses of NMT output for their

chosen language pair. The average mark for the module overall was 62%, so their

efforts in the comparative evaluation served to boost their mark overall, particularly for

some of the weaker students.

Feedback and discussion points

In discussions following on from this comparative evaluation, students expressed

surprise at the high quality and fluency of NMT, particularly for morphologically

complex languages, such as Arabic, Russian, and even German, that have proved troublesome for MT systems. They also noted, with some relief, problems of omission

and mistranslation in NMT output. They did not consider NMT to be a threat to

translators as yet, but were concerned that improvements in MT quality over time

may make it an attractive option in some scenarios such as for news translation or

software documentation. Most of the students were nonetheless positive about the

technology, and would be interested in working with NMT in future. They enjoyed

post-editing MT output, and found it easier than translating from scratch.

Two students noted that NMT had produced neologisms in their target languages

(Spanish and Turkish), which initially looked like comprehensible compound words, yet
they could not be found in dictionaries. This may be due to the process mentioned

previously of training using words broken down into smaller chunks or subword units

(Sennrich, Haddow and Birch 2016), so as to better translate words that are rare or do

not appear in the MT training data.
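The method of Sennrich, Haddow and Birch (2016) is based on byte pair encoding: the most frequent adjacent symbol pairs in the training data are merged, step by step, into larger subword units. The sketch below illustrates the core idea only; it is a toy version that omits end-of-word markers and the other refinements of the published algorithm:

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe_merges(words, num_merges):
    """Learn merge operations from a toy corpus (greatly simplified BPE)."""
    vocab = Counter(tuple(word) for word in words)  # words as character sequences
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = Counter({tuple(merge_pair(s, best)): f for s, f in vocab.items()})
    return merges

def segment(word, merges):
    """Split a possibly unseen word into subword units using the learned merges."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

corpus = ["low", "low", "lower", "newest", "widest", "newer"]
merges = learn_bpe_merges(corpus, 10)
print(segment("lowest", merges))  # e.g. ['low', 'est'] -- a word never seen whole
```

Because the decoder outputs sequences of such subword units, it can concatenate them into plausible-looking words that never occurred in the training data, which is one likely source of the neologisms the students observed.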

In the follow-up lecture and discussion, students engaged with this complex

topic, asking questions related to - and building on - their own experience of working

with NMT output. Students discussed scenarios where use of NMT may be appropriate

– for perishable texts, as a springboard for ideas during rush-jobs – and those where

NMT would currently be highly inappropriate, such as where transcreation is required,

for high-risk and literary texts, or for translation jobs that involve regulatory compliance.
for example. As suggested by Massey (2017), the areas of ethics and risk proved to be a

useful starting point for further discussion: at present, machine learning copies human

activities with an increasing level of intelligence but no consciousness, and as such

cannot independently consider ethics or evaluate risk.

Limitations of this evaluation

As mentioned previously, this was not a controlled experiment, but rather an in-

class exercise carried out as part of a standard computer-aided translation module. The

purpose was not to publish a detailed comparative evaluation of MT paradigms. As

such, there is little point in detailed data analysis. No biographical information was

requested from participants, nor any information about their language ability level. The

undergraduate cohort had no experience of research design, and had not considered the

importance of the order in which they had completed the tasks, or the effects of limited

post-editing experience. When asked, most said that they had completed the SMT
evaluation first, which may have caused their post-editing speed to be slower for that

paradigm. Conversely, the task order may also have put the NMT evaluation at a

disadvantage, as Wikipedia articles tend to begin simply, in the general domain, before

becoming more complex as they describe a topic in more detail, using domain-specific

language. Also, even though students were asked to aim for two groups of ten sentences

of similar length, several admitted afterwards that they had not taken this instruction

into account in their evaluations.

Concluding remarks

Previous work has shown the benefits of hands-on experience working with MT in

improving the levels of confidence and self-efficacy among translation students

(Doherty and Kenny, 2014). As NMT is a relatively new technology, and the technical

barriers to building and training NMT systems remain high, this comparative

evaluation task was developed as a way for students to understand the level of

translation quality that could be expected from current standard (SMT) and incoming

(NMT) state-of-the-art automatic translation systems, using evaluation metrics that are

standard in research and industry. As NMT use becomes more commonplace, with

availability in a wider range of languages, the opportunity opens to incorporate this

technology into other translation modules, as suggested by Mellinger (2017).

The two cohorts who took part in this evaluation gained experience at translation

quality assessment using the construct of adequacy, a functional measure of equivalence

between source and target text, and error annotation using a simple typology of errors.

The students also experienced the task of post-editing, and were introduced to the

concept of post-editing effort, using one of the three measures of effort, as developed by

Krings (2001). Finally, the students gained hands-on experience of a cutting-edge MT


paradigm that is only beginning to impinge on the profession of translation, but is

highly likely to be disruptive in the coming years. Rather than fearing the incoming

technology, amidst widespread technological determinism, we hope that they will

instead be empowered to discuss its relative merits and drawbacks from their own –

albeit limited – personal experience.

Acknowledgements

This research is supported by the ADAPT Centre for Digital Content Technology, funded under

the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European

Regional Development Fund.

References:

Arthur, P., G. Neubig, and S. Nakamura. 2016. “Incorporating Discrete Translation


Lexicons into Neural Machine Translation.” In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing, 1557–1567.

Bahdanau, D., K. Cho, and Y. Bengio. 2014. “Neural Machine Translation by Jointly


Learning to Align and Translate.” Computing Research Repository, abs/1409.0473.
https://arxiv.org/abs/1409.0473

Bentivogli, L., A. Bisazza, M. Cettolo, and M. Federico. 2016. “Neural versus Phrase-
Based Machine Translation Quality: a Case Study.” In Proceedings of Conference on
Empirical Methods in Natural Language Processing (EMNLP 2016), 257-267.
(http://arxiv.org/abs/1608.04631).

Bojar, O., R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno


Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Neveol, M. Neves, M. Popel,
M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri.
2016. “Findings of the 2016 Conference on Machine Translation.” In Proceedings of
the First Conference on Machine Translation, 131–198.

Bowker, L., and C. McBride. 2017. “Précis-writing as a form of speed training for


translation students.” The Interpreter and Translator Trainer 11(4): 259-279. doi:
10.1080/1750399X.2017.1359758.

Burchardt, A., V. Macketanz, J. Dehdari, G. Heigold, J.-T. Peter, and P. Williams. 2017. “A Linguistic Evaluation of Rule-Based, Phrase-Based, and Neural MT Engines.” The Prague Bulletin of Mathematical Linguistics 108: 159-170. doi: 10.1515/pralin-2017-0017

Bureau of Labor Statistics. 2018. Occupational outlook handbook: Interpreters and


translators. Washington, DC: U.S. Department of Labor. Retrieved from
https://www.bls.gov/ooh/media-and-communication/interpreters-and-translators.htm

Cadwell, P., S. O’Brien, and C. S. C. Teixeira. 2017. “Resistance and accommodation:


factors for the (non-) adoption of machine translation among professional translators.”
Perspectives 26(3): 301-321. doi: 10.1080/0907676X.2017.1337210

Castilho, S., J. Moorkens, F. Gaspari, I. Calixto, J. Tinsley, and A. Way. 2017a. “Is
Neural Machine Translation the New State-of-the-Art?” The Prague Bulletin of
Mathematical Linguistics 108: 109-120. doi: 10.1515/pralin-2017-0013

Castilho, S., J. Moorkens, F. Gaspari, R. Sennrich, V. Sosoni, P. Georgakopoulou, P.


Lohar, A. Way, A. Valerio Miceli Barone, and M. Gialama. 2017b. “A Comparative
Quality Evaluation of PBSMT and NMT using Professional Translators.” In
Proceedings of MT Summit 2017.

Cho, K., B. van Merrienboer, D. Bahdanau, and Y. Bengio. 2014. “On the Properties of
Neural Machine Translation: Encoder-Decoder Approaches”. Computing Research
Repository, abs/1409.1259
Doherty, S., and D. Kenny. 2014. “The Design and Evaluation of a Statistical Machine
Translation Syllabus for Translation Students.” The Interpreter and Translator Trainer
8:2, 276-294, doi:10.1080/1750399X.2014.937571

Doherty, S., F. Gaspari, J. Moorkens, and S. Castilho. 2018. “On education and training
in Translation Quality Assessment.” Translation Quality Assessment: From principles
to practice, edited by J. Moorkens, S. Castilho, F. Gaspari, S. Doherty. Berlin: Springer.
doi:10.1007/978-3-319-91241-7_5

EMT Network. 2017. European Master’s in Translation Competence Framework 2017.


https://ec.europa.eu/info/sites/info/files/emt_competence_fwk_2017_en_web.pdf

Forcada, M. 2017. “Making sense of neural machine translation”. Translation Spaces


6(2): 291–309.

Forcada, M., and R. P. Ñeco. 1997. “Recursive hetero-associative memories for translation.”


Biological and Artificial Computation: From Neuroscience to Technology, edited by J.
Mira, R. Moreno-Díaz, and J. Cabestany, 453-462. Berlin: Springer.

Gaspari, F., H. Almaghout, and S. Doherty. 2015. “A survey of machine translation


competences: Insights for translation technology educators and practitioners.”
Perspectives 23 (3): 333-358, doi: 10.1080/0907676X.2014.979842

Guerberof, A. 2012. Productivity and quality in the post-editing of outputs from


translation memories and machine translation. PhD dissertation.

Hassan, H., A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang, M. Junczys-Dowmunt, W. Lewis, M. Li, S. Liu, T.-Y. Liu, R. Luo, A. Menezes, T. Qin, F. Seide, X. Tan, F. Tian, L. Wu, S. Wu, Y. Xia, D. Zhang, Z. Zhang, and M. Zhou. 2018. “Achieving Human Parity on Automatic Chinese to English News Translation.” Computing Research Repository arXiv:1803.05567v1 (https://arxiv.org/abs/1803.05567).
Hurtado Albir, A. 2007. “Competence-based Curriculum Design for Training
Translators.” The Interpreter and Translator Trainer 1 (2): 163-195. doi:
10.1080/1750399X.2007.10798757

Hutchins, W. J. 1986. Machine Translation: Past, present, future. Chichester: Ellis


Horwood.

Kenny, D. 2007. “Translation Memories and Parallel Corpora: Challenges for the
Translation Trainer.” In Across Boundaries: International Perspectives on Translation,
edited by D. Kenny and K. Ryou, 192–208. Newcastle-upon-Tyne: Cambridge Scholars
Publishing.

Kenny, D., and S. Doherty. 2014. “Statistical Machine Translation in the Translation
Curriculum: Overcoming Obstacles and Empowering Translators.” The Interpreter and
Translator Trainer 8 (2), 295–315. doi:10.1080/1750399X.2014.936112.

Koehn, P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan,


W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst. 2007.
“Moses: Open Source Toolkit for Statistical Machine Translation”. In Proceedings of
Annual Meeting of the Association for Computational Linguistics (ACL).

Krings, H. P. 2001. Repairing Texts. Kent, OH: Kent State University Press.

Lommel, A., and D. A. DePalma. 2016. “Post-Editing Goes Mainstream.” Common Sense


Advisory Report.

Lommel, A. 2018. “The Multidimensional Quality Metrics and Dynamic Quality


Framework.” Translation Quality Assessment: From principles to practice, edited by J.
Moorkens, S. Castilho, F. Gaspari, S. Doherty. Berlin: Springer. doi:10.1007/978-3-
319-91241-7_6

Massey, G. 2017. “Machine learning: Implications for translator education.” In


Proceedings of CIUTI Forum 2017: Short- and long-term impact of artificial
intelligence on language professions.
Mellinger, C. D. 2017. “Translators and machine translation: knowledge and skills gaps
in translator pedagogy.” The Interpreter and Translator Trainer 11(4): 280-293.
doi:10.1080/1750399X.2017.1359760.

O’Brien, S. 2012. “Translation as human‐computer interaction.” Translation Spaces


1(1):101-122. doi: 10.1075/ts.1.05obr

Olohan, M. 2007. “Economic Trends and Developments in the Translation Industry.”


The Interpreter and Translator Trainer 1 (1): 37–63.
doi:10.1080/1750399X.2007.10798749.

Plitt, M., and F. Masselot. 2010. “A Productivity Test of Statistical Machine Translation
Post-Editing in a Typical Localisation Context.” The Prague Bulletin of Mathematical
Linguistics 93: 7-16.

Popović, M. 2017. “Comparing Language Related Issues for NMT and PBMT between
German and English.” The Prague Bulletin of Mathematical Linguistics 108: 209-220.
doi: 10.1515/pralin-2017-0021

Sennrich, R., B. Haddow, and A. Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1715–1725.

Way, A. 2018. “Traditional and emerging use-cases for machine translation.”


Translation Quality Assessment: From principles to practice, edited by J. Moorkens, S.
Castilho, F. Gaspari, S. Doherty. Berlin: Springer. doi:10.1007/978-3-319-91241-7_8

Wu, Y., M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao,


Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws,
Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C.Young, J.
Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. 2016.
“Google's Neural Machine Translation System: Bridging the Gap between Human and
Machine Translation.” Computing Research Repository arXiv:1609.08144
(https://arxiv.org/abs/1609.08144 ).
Language pair | Mean adequacy (NMT / SMT) | NMT errors | SMT errors
AR-EN | 3.2 / 2.1 | Stylistic, ‘awkward’ phrasing | Compound errors, ‘gibberish’
EN-IT | 2.7 / 2.1 | All mistranslations (9) | Mostly mistranslations (25)
EN-ES | 3.1 / 1.9 | Mostly mistranslation (3) and omission (2) | Mistranslation (9) and word order (3)
EN-RU | 3.1 / 1.7 | 8 mistranslations, 2 omissions (elaborate, not easy to detect) | Mistranslations (11), word order (2), omission (2), addition (3)
ZH-ES | 2.4 / 2.3 | Mistranslation (8), word order (3), omission (7) | Mistranslation (10), word order (8), omission (10)

Table 1. Cohort two findings for adequacy and error annotation.
