2.1 Annotation Guidelines
Translation quality is a multifaceted concept which encompasses adequacy (i.e., the translation conveys the meaning of the source), fluency (i.e., the translation is a fluent utterance in the target language), and other aspects, some of them defined on a per-application basis, e.g., vocabulary usage, language register, or post-editing effort. The ultimate goal of this study is to automatically improve on-line SMT quality by incorporating casual users' correction feedback (cf. Section 5), so one has to be cautious when deciding which instances are useful. To determine which fragment (trans or feed) is a better translation of src, we followed the nine rules described below, partially derived from Pighin et al. [2012]. Table 1 includes some real examples for illustration.
Feed is considered useful if:
1. feed is strictly more adequate than trans (even if still imperfect); or
2. src had a small typo, causing the translator to fail, but feed is clearly the expected translation.

Feed is considered useless if:
3. feed is indeed a comment (e.g., about the translator);
4. feed presents valid corrections or translation alternatives, but is not phrased as a grammatical sentence (e.g., "X should be better translated as Y");
5. feed is identical to trans;
6. feed is better than trans in some parts but worse in others;
7. src is too noisy for a translation to be feasible (e.g., it contains grammar errors or junk content); or
8. feed contains only noise.

To these, we add an additional rule:
9. If both trans and feed are valid translations of src, select the most natural according to the available context.

2.2 Corpus Annotation
[1] This project is focused on the development of machine translation systems that can respond rapidly and intelligently to user feedback (http://www.faust-fp7.eu).
[2] The FFF and FFF+ corpora are freely available at http://www.faust-fp7.eu/faust/Main/DataReleases
Table 1: Instances of source, translation and human feedback tuples. ID corresponds to the rule number.

ID | source                  | automatic translation     | user feedback                   | explanation
1  | I am still in a meeting | Soy todava en una reunion | Aun estoy en una reunion        | Verb fixed (still a diacritic missing)
2  | Relentlessy             | Relentlessy               | implacablemente                 | Typo in source
3  | He s cinema             | El s cine                 | Ustedes no saben nada de ingles | Unrelated comment
4  | saw                     | vio                       | vieron o vio                    | The entire sentence is not a valid translation of src
5  | -                       | -                         | -                               | Identical sentences
We model feedback filtering as a binary classification problem: tuples (src, trans, feed) have to be assigned a positive label iff feed is a better translation of src than trans. We consider four sets of very simple features to characterize the tuple fields as well as the relationships among them: surface, back-translation, noise-based, and similarity-based. The first two sets are derived from those described in [Pighin et al., 2012]. Back-translation and similarity-based features use the texts resulting from translating either trans or feed back into the source language. Text pre-processing includes case folding and diacritic elimination.
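For illustration, a minimal sketch of this classification setup follows. It is not the paper's implementation: extract_features is a placeholder for the four feature families described below, and the toy tuples are the examples from Table 1.

```python
from sklearn.svm import SVC

def extract_features(src, trans, feed):
    # Placeholder feature vector; in the paper this would concatenate the
    # surface, back-translation, noise-based and similarity-based features.
    return [len(src.split()), len(trans.split()), len(feed.split()),
            float(trans.lower() == feed.lower())]

tuples = [
    # Rule 1 (useful): the feedback fixes the verb.
    ("I am still in a meeting", "Soy todava en una reunion",
     "Aun estoy en una reunion"),
    # Rule 3 (useless): the feedback is an unrelated comment.
    ("He s cinema", "El s cine", "Ustedes no saben nada de ingles"),
]
labels = [1, 0]  # 1 iff feed is a better translation of src than trans

X = [extract_features(s, t, f) for s, t, f in tuples]
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```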
Surface features
These features consider src, trans, feed, and ilang, the interface language (not considered in previous work); i.e., information available at operation time.
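The concrete list of surface features is not reproduced in this excerpt; the sketch below illustrates the kind of operation-time signals such features can capture. All feature names, and the assumed English-to-Spanish direction, are invented for illustration.

```python
# Hypothetical surface features computable from src, trans, feed and the
# interface language ilang alone, i.e., at operation time.
def surface_features(src, trans, feed, ilang):
    ns, nt, nf = len(src.split()), len(trans.split()), len(feed.split())
    return {
        "len_src": ns,
        "len_ratio_feed_trans": nf / max(nt, 1),
        "len_ratio_feed_src": nf / max(ns, 1),
        # Token overlap between feed and trans (0 = fully rewritten).
        "tok_overlap_feed_trans":
            len(set(feed.lower().split()) & set(trans.lower().split()))
            / max(nf, 1),
        # Feedback that merely copies src back is rarely useful.
        "feed_equals_src": float(feed.strip() == src.strip()),
        # Assumed en->es direction: is the interface in the target language?
        "ilang_is_target": float(ilang == "es"),
    }
```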
Noise-based features
These binary features are designed to indicate whether any of the text fragments is likely to include noisy sections. Some of them also attempt to capture length-based translation difficulty.
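The exact indicators are not listed in this excerpt; the following sketch shows plausible binary noise flags of the kind described. The thresholds and the flag names are illustrative assumptions, not the paper's feature set.

```python
import re

# Hypothetical binary noise indicators for one text fragment, plus a
# length-based proxy for translation difficulty.
def noise_flags(text, max_easy_len=20):
    tokens = text.split()
    alpha = sum(c.isalpha() for c in text)
    return {
        "empty": len(tokens) == 0,
        # Mostly non-alphabetic characters suggests junk content.
        "mostly_nonalpha": alpha < 0.5 * max(len(text), 1),
        # A character repeated many times in a row (e.g., "heeeeelp").
        "char_runs": re.search(r"(.)\1{3,}", text) is not None,
        # Very long inputs are harder to translate and to judge.
        "too_long": len(tokens) > max_easy_len,
    }
```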
Similarity-based features
This set assesses the resemblance of the elements in the tuple to each other on the basis of different similarity metrics, designed to be a more robust option than Levenshtein-based features. These features are based on concepts borrowed from cross-language information retrieval and machine translation. Short character n-grams are considered a useful choice when comparing texts across languages [McNamee and Mayfield, 2004]. Pseudo-cognates and the relative length of translations are well-known options for aligning bilingual corpora [Simard et al., 1992; Gale and Church, 1993].
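A sketch of such cross-language similarity features follows, assuming the back-translation of trans or feed into the source language is already available. The n-gram order and the 4-character pseudo-cognate prefix are illustrative choices, not the paper's parameters.

```python
# Character n-gram, pseudo-cognate and length-ratio similarities between a
# source text and a back-translation, in the spirit of the cited work.
def char_ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dice(a, b):
    return 2 * len(a & b) / max(len(a) + len(b), 1)

def similarity_features(src, back_trans):
    return {
        # Short character n-grams compare well across related languages
        # [McNamee and Mayfield, 2004].
        "char_3gram_dice": dice(char_ngrams(src), char_ngrams(back_trans)),
        # Pseudo-cognates: words sharing a 4-character prefix
        # [Simard et al., 1992].
        "pseudo_cognates": dice(
            {w[:4] for w in src.lower().split() if len(w) > 3},
            {w[:4] for w in back_trans.lower().split() if len(w) > 3}),
        # Relative length, as in length-based alignment [Gale and Church, 1993].
        "len_ratio": len(back_trans) / max(len(src), 1),
    }
```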
Table 2: Micro-averaged evaluation for model tuning, for three SVM kernels and the basic (top block) and enhanced (bottom block) feature sets, with and without feature selection (values without selection in parentheses). The number of features maintained after convergence of the feature selection process is given in column |M|.

basic feature set:
kernel | Acc.        | F1          | Prec.       | Rec.        | |M|
linear | 66.5 (63.8) | 69.9 (64.8) | 64.8 (64.7) | 76.0 (65.0) | 21
poly   | 67.5 (61.0) | 73.7 (69.1) | 63.0 (58.2) | 89.0 (85.0) | 24
RBF    | 60.8 (60.4) | 70.3 (70.0) | 57.5 (57.3) | 90.6 (90.2) | 34

enhanced feature set:
kernel | Acc.        | F1          | Prec.       | Rec.        | |M|
linear | 70.5 (65.9) | 73.6 (69.3) | 68.1 (64.3) | 79.9 (75.2) | 43
poly   | 70.3 (66.3) | 75.8 (72.3) | 65.2 (62.5) | 90.6 (85.8) | 60
RBF    | 71.1 (67.9) | 74.5 (71.6) | 68.1 (65.6) | 82.3 (78.7) | 52
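A comparable tuning loop can be sketched as follows. The paper's exact selection procedure is not specified in this excerpt; scikit-learn's greedy backward selection and the synthetic stand-in data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in for the development data: 200 tuples, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Compare the three kernels, with and without backward feature selection.
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel)
    base = cross_val_score(clf, X, y, cv=5).mean()
    # Greedily drop features while cross-validated accuracy does not
    # degrade; the surviving feature count plays the role of |M| in Table 2.
    selector = SequentialFeatureSelector(
        clf, direction="backward", n_features_to_select="auto", tol=0.0)
    X_sel = selector.fit_transform(X, y)
    with_sel = cross_val_score(clf, X_sel, y, cv=5).mean()
    print(f"{kernel}: acc {base:.3f} -> {with_sel:.3f}"
          f" (|M| = {X_sel.shape[1]})")
```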
Accuracy and F1 (mean ± standard deviation over the three kernel classifiers), number of instances, and percent classified positive for each set of instances, grouped by source length in tokens.

#tokens | Acc.        | F1          | size | %pos
1       | 74.5 ± 2.66 | 32.2 ± 7.48 | 115  | 23.5
2-5     | 61.0 ± 3.80 | 66.4 ± 3.14 | 199  | 54.3
6-10    | 74.5 ± 0.98 | 84.1 ± 0.51 | 102  | 70.6
11-15   | 65.4 ± 1.06 | 76.0 ± 1.82 | 54   | 57.4
16-20   | 65.3 ± 2.31 | 76.7 ± 2.33 | 25   | 64.0
Figure 1: Precision-recall curve over the development partition for the three kernels (precision vs. recall axes). Values for the SVM classification threshold are highlighted:

kernel | Acc. | F1   | Prec. | Rec.
linear | 69.9 | 73.2 | 65.2  | 83.3
poly   | 71.2 | 75.3 | 65.3  | 88.9
radial | 72.6 | 75.6 | 67.4  | 86.1
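A curve like Figure 1 can be traced by sweeping a threshold over the SVM decision values on the development partition. The sketch below uses synthetic data as a stand-in for the real development set; the split sizes are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in data: train on the first 200, develop on the rest.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
X_dev, y_dev = X[200:], y[200:]

clf = SVC(kernel="linear").fit(X[:200], y[:200])
scores = clf.decision_function(X_dev)  # signed margin per instance
prec, rec, thr = precision_recall_curve(y_dev, scores)

# The point at threshold 0 corresponds to the default SVM classification,
# i.e., the highlighted values in Figure 1.
default = np.searchsorted(thr, 0.0)
print(f"P={prec[default]:.3f} R={rec[default]:.3f} at the default threshold")
```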
Firstly, Base Faust performs significantly better than the baseline translator optimized over standard text (Base News). We take Base Faust as the reference for the following comparisons. Secondly, Faust+Feed (all) is unable to improve performance, showing that one cannot simply add all collected feedback instances without filtering out the noisy cases. Instead, selecting a subset of instances according to the classifiers (class. or 50%) improves the results over Base Faust consistently across all the evaluation metrics. The quality/quantity trade-off appears to be an important aspect when enriching the baseline system. The top-50% cut obtains the best results in all metrics. This is slightly more conservative (and hence more precision-oriented) than selecting the instances classified positively, which amount to 61.2% in this case. The optimal selection threshold should be studied empirically in future work. Finally, statistical tests performed using bootstrapping [Riezler and Maxwell, 2005] indicate that the BLEU and NIST scores of Faust+Feed (50%) are better than those of Base Faust with high probability (confidence levels of 0.99 and 0.95) in both versions of the test set.
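A paired bootstrap test in the spirit of [Riezler and Maxwell, 2005] can be sketched as follows. The metric function is pluggable (corpus BLEU or NIST in the paper); the toy unigram-precision scorer and the sample count here are illustrative.

```python
import random

# Paired bootstrap resampling: estimate how often system A beats system B
# on a corpus-level metric when test sentences are resampled with
# replacement.
def paired_bootstrap(metric, hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    rng = random.Random(seed)
    idx = range(len(refs))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]  # resample with replacement
        a = metric([hyps_a[i] for i in sample], [refs[i] for i in sample])
        b = metric([hyps_b[i] for i in sample], [refs[i] for i in sample])
        wins += a > b
    return wins / n_samples  # approximate confidence that A outperforms B

# Toy corpus-level metric (unigram precision); replace with BLEU or NIST.
def unigram_precision(hyps, refs):
    match = total = 0
    for h, r in zip(hyps, refs):
        rtoks = set(r.split())
        match += sum(t in rtoks for t in h.split())
        total += len(h.split())
    return match / max(total, 1)
```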
Table 5: Results obtained by all baseline and feedback-enriched SMT systems on the Faust test set.

Faust Raw           | BLEU  | NIST | TER   | MTR   | ULC
Base News           | 32.86 | 7.98 | 53.39 | 53.30 | 38.87
Base Faust          | 34.47 | 8.28 | 51.76 | 54.87 | 42.54
Faust+Feed (all)    | 34.41 | 8.27 | 51.15 | 55.32 | 42.93
Faust+Feed (class.) | 34.85 | 8.39 | 50.61 | 55.82 | 44.05
Faust+Feed (50%)    | 35.22 | 8.41 | 50.19 | 55.83 | 44.59

Faust Clean         | BLEU  | NIST | TER   | MTR   | ULC
Base News           | 36.75 | 8.40 | 48.71 | 56.66 | 39.29
Base Faust          | 38.64 | 8.68 | 46.91 | 58.42 | 42.91
Faust+Feed (all)    | 38.61 | 8.64 | 46.80 | 58.60 | 42.79
Faust+Feed (class.) | 39.17 | 8.76 | 45.74 | 59.29 | 44.28
Faust+Feed (50%)    | 39.49 | 8.80 | 45.42 | 59.31 | 44.78
[7] Experiments with the polynomial and RBF kernel-based classifiers did not improve over the results of the linear kernel. Pending further study, this could be explained by the ability of linear models to achieve higher levels of precision (cf. Figure 1).
[8] http://nlp.lsi.upc.edu/asiya/
References