A corpus-based view of similarity and difference in translation
Mona Baker
Centre for Translation & Intercultural Studies, University of Manchester
Corpus-based research throws up a number of methodological challenges.
Many of these are evident in any type of research which attempts to compare
authentic data of any kind, but the difficulties are accentuated by the
availability of vast amounts of data in this case. In particular, questions
relating to how one selects the features to be compared and, more
importantly, how the findings may be interpreted, invite us to elaborate our
methodology far more explicitly than in other types of research. The
accessibility of the same body of data to other researchers also means that
(a) the findings can be assessed and challenged in other studies, and (b) other
researchers can invoke different, and perhaps more plausible, explanations of
the same findings by appealing to parameters that may have been downplayed
or ignored in previous studies. These issues have been extensively debated in
the literature on corpus linguistics, but rarely if ever in the context of
corpus-based translation studies. A small-scale study involving comparisons
between corpora of translated and non-translated texts in English in terms of
frequency and distribution of recurring lexical patterns is used to examine
some methodological issues in corpus-based translation research and suggest
different ways in which the same findings may be interpreted depending on
the variables on which individual researchers choose to focus.
(c) John Benjamins

Delivered by Ingenta
on: Mon, 27 Mar 2006 08:54:30
to: Chinese University of Hong Kong
IP: 137.189.174.203
Keywords: translation, corpus-based translation studies, style, literary translators, methodology, lexical patterns
International Journal of Corpus Linguistics. John Benjamins Publishing Company.
Introduction
For a number of years now, I have been involved in a research project which attempts, among other things, to compare translated and non-translated English text on the basis of a computerized collection of translated English (Translational English Corpus)1 and a similar computerized collection of non-translated English text. The latter is a subset of a much larger corpus available commercially, namely the British National Corpus. Research based on these two computerized collections of text, or some subset of them, has been widely published in various journals and collections, by myself and other colleagues (see in particular Laviosa 1998a; Laviosa-Braithwaite 1997; Baker 2000; Olohan & Baker 2000; Olohan 2001). It has also informed research done by various postgraduate students at the University of Manchester and elsewhere, some of which is reported in readily available publications (see, for example, Kenny 2000a, 2000b, 2001; Bosseaux 2001).
The Translational English Corpus has been used not only to compare translated and non-translated English, but also to inform stylistic comparisons between individual translators represented in the corpus. Of the four subcorpora which constitute TEC (namely fiction, biography, inflight magazines and
news), fiction and biography taken together constitute what might be broadly
seen as the narrative subcorpus and lend themselves particularly well to stylistic analyses. The question of how individual translators behave linguistically is an important aspect of the study I will describe shortly.
Since both corpora are currently fluid in terms of size and hence composition,2 it is always important to establish the parameters of the corpora which inform each study. Details of the corpora used in the current study as well as the composition of the translational corpus in terms of the output of individual translators are presented later in this article.
There are then essentially two broad questions addressed by the type of research that concerns us here: one is whether the patterning of translated text is
significantly different from the patterning of non-translated text, and the other
is whether there are patterns of variation within the corpus of translated text
in terms of the linguistic behaviour of individual translators. These questions
are ambitious in scope and difficult to answer reliably at the moment. Initially,
the difficulty of arriving at reasonably reliable findings had to do with insufficient data. When we only had one or two million words and very little of the
output of any specific translator, there was simply not enough data to allow us
to conduct descriptive studies of most features. We are now beginning to experience a different kind of difficulty, namely that there is so much of some types
of evidence and data that what we really need at the moment is much more research time and more researchers to be able to follow up the many threads and
avenues that this resource is opening up, and in order to come up with plausible explanations of the patterns that are emerging from our studies. This is an interesting difficulty to have, too much rather than too little to go on,3 and no less challenging than the difficulty of not having sufficient data to inform our research.
At any rate, the project requires a certain level of investment in research
time and funding, both of which are currently in very short supply, at least in
the context of British academia. Nevertheless, it is important for researchers in
the current climate to press on with carefully thought out research agendas in
spite of financial and other obstacles, and so we continue to do our best with
the current resources.
Methodology in corpus research

In what follows, I would like to use a small-scale study on recurring lexical patterns to demonstrate the potential of both types of corpus research (comparison of translated and non-translated language, and identifying patterns of stylistic variation in the work of individual translators) for offering insights into various aspects of translational behaviour. I also want to use this study to explore some of the methodological challenges and problems which are involved in this type of research generally.
Corpus-based research has become very popular among translation studies scholars in recent years, with several groups of researchers working on corpus projects in various countries and involving different languages, including Finnish, German, Italian, Spanish and Brazilian Portuguese.4 As with any area of study that becomes attractive to researchers, there is always a danger of uncritical application of the methodology, of applying the approach without being cautious enough about how one interprets the findings and about how far a particular methodology can take us before we have to switch to other methodologies to complement our research. Although corpus-based research is an approach that I have been advocating and believe on the whole to offer a very powerful research programme for translation studies, what I would like to do here is to explore a number of the more problematic aspects of this methodology and suggest concrete ways in which we can nuance some of the findings of this type of research.
In presenting and questioning the methodology I used in the small-scale
study I wish to report on here, I will try to highlight a number of things, all of
which are relevant in any kind of research, but I will be focusing on these issues
in the context of corpus-based translation research specifically:
(a) the inherent methodological difficulties of this type of research, bearing
in mind that all research methods have their difficulties, weaknesses and
blind spots;
(b) the complexity of the issues involved and the difficulty of coming up with
plausible explanations for the patterns we manage to identify;
(c) the potentially conflicting but also potentially illuminating ways of interpreting the same set of findings, depending on the contextual parameters
we decide to appeal to in offering an explanation for the patterns we choose
to foreground.
I am therefore inviting readers to view this article essentially as an exercise in
methodology, but I also hope to offer them some insight into what seem at this
stage to be interesting patterns of difference and variation between translated
and non-translated English, and among individual translators. How we ultimately verify these patterns and how we interpret them is the real challenge
that I want to focus on in the following sections.
The study
The two corpora used in this study both consist, broadly, of narrative text (see Figure 1).
Corpus of Translated English (Subset of TEC)
Fiction
Biography
Total Size: 6,613,456 words/tokens
94 files, all full texts
Corpus of Non-translated English (Subset of BNC)
Fiction (imaginative domain)
Total Size: 6,423,325 words/tokens
171 files, some full texts but mostly extracts of approximately 40,000 words on average
Figure 1. Overview of Subcorpora Used in the Study

TEC = Translated English; BNC = Non-translated English
There are at least two important methodological issues here, both of which
concern the basis of comparison.5 The first problem is that although the two
corpora are very similar in size, the BNC consists largely of extracts while TEC
consists of full texts (this is why we have far fewer texts in TEC even though
it is a slightly larger corpus). This has implications for the way we interpret
some of the findings I will present shortly, and it is therefore important to keep
this difference in the composition of the two corpora in mind. At the same
time, we must also recognize that imbalances of this type are inevitable. Furthermore, they are not specific to corpus-based studies. It is in the nature of
any type of comparison, any attempt to look for similarities and differences,
that what is being compared can never be totally balanced in every respect. It is
also a feature of full-text corpora, and particularly those of literary
texts, that they cannot be balanced even internally: literary texts vary tremendously in their lengths, and if we are to include full texts in our corpus (which
is desirable for many reasons),6 we have to accept that the individual texts will
be seriously imbalanced in terms of size. In addition to this internal imbalance
in the corpus, there is also the imbalance between full-text corpora like TEC
and those, like the BNC, which are made up of extracts. The first methodological difficulty then, which is largely inevitable, concerns the imbalance in size between and within the two corpora which provide the basis of comparison.
The second problem concerns the composition of the corpora in terms of genres. The BNC subcorpus on which the current study is based consists of fiction only. TEC, on the other hand, consists of fiction and biography,7 which I have chosen to group together under the heading of narrative. These are the
kinds of decisions and compromises that researchers have to make all the time
in the course of conducting descriptive studies. I could defend my decision
to include biography in the TEC corpus on the basis that (a) it is narrative
and in this sense shares many features with fiction as a genre, and (b) the distinction between fiction and biography is deliberately being blurred by many
contemporary authors of biography. Whether or not readers accept this type
of justification or find it plausible, the important issue to bear in mind here
is that decisions of this type clearly have implications for the way we interpret
any findings that we present in our research.
Even if we accept the decision to include biography in the TEC corpus, there is still the question of the comparability of fictional texts in BNC
and TEC: fiction is far too broad a category to ensure a reliable basis of
comparison.8 The BNC corpus used in the current study consists of a selection of texts/extracts from the imaginative domain; these were individually
scrutinized to match as closely as possible the fictional texts in TEC, in terms
of type of fiction, year of publication, and so on.
Overall Number of Translators in the Narrative Subcorpus of TEC
57 (individual); 4 (team)
Best Represented Translators (no. of texts/words)
Giovanni Pontiero (6 texts; 562,292 words)
Dorothy Blair (6 texts; 462,888 words)
Peter Bush (5 texts; 296,146 words)
Lawrence Venuti (4 texts; 214,098 words)
Figure 2. Translators in the Narrative Subcorpus of TEC
At any rate, the methodological difficulties I want to stress here concern the imbalance between and within the corpora in terms of size and genres. It would take far too much space to print a full list of the texts in each corpus, but details of many titles, authors and (where applicable) translators featuring in each corpus will naturally emerge in the course of presenting some of the findings later in this article. In addition, it may be helpful at this stage to provide some information on the composition of the narrative section of TEC in terms of the individual translators represented in it (Figure 2).
Details of this type are important to bear in mind when attempting to explain or evaluate any patterns we might identify in our research, as will become apparent shortly.
Lexical patterns in translated text

Many claims have been made by translation scholars about translations being
different in a number of ways from non-translated text. The ways in which
translations are claimed to be different are often not clearly articulated, but
they include, for example, the assumption that translators are more conservative in their use of language; that they tend to prefer more standard forms of
the language; that there tends to be a raising of the level of formality in translation; that translated text is sanitized (in terms of translators avoiding certain
features such as regionalisms and irregular spelling); and that translators tend
to produce more uniform texts, for example by avoiding disruption of tense
sequences, etc.9 A particularly well known and explicitly articulated claim underpins much of Lawrence Venuti's work, namely that translators in the Anglo-American world specifically favour fluency because this is the strategy most
valued by their immediate readership (Venuti 1995).
However fuzzy some of these notions might appear to be, if there is any truth in them, particularly the claim of fluency as the overriding strategy in Anglo-American translations, we ought to be able to trace the impact of such strategies on the lexical make-up of translated English text.10 Given the type of corpora now available to researchers, especially TEC, it should then be interesting to look at recurring lexical patterns, on the assumption that, for example, if Anglo-American translators did favour fluency as an overall strategy then we might reasonably expect to find a higher level of recurrence of fixed or semi-fixed lexical phrases in translated as opposed to non-translated English text. And it should also be interesting to explore how individual translators respond to this type of social pressure to produce fluent (and hence unmarked) language, that is, language that does not draw attention to itself at the lexical level.
This is an important issue, and one which is acknowledged in Venuti's work on fluency as a favoured strategy in Anglo-American translations as well as Gideon Toury's broader work on norms: whatever the overall pattern might prove to be, there will always be individual translators who opt to use different strategies, to go against the norm. Hence the interest in moving beyond the description of overall patterns to study patterns of variation among individual translators.
What we are looking for then is recurring lexical patterns or lexical phrases in translated and non-translated English, and patterns of variation in the use of these recurring phrases among individual translators. There are various ways
in which a researcher can get at this kind of data. I do not propose to offer a
full-blown description of the mechanics of pulling out repeated patterns and
comparing their frequency at this stage, though this exercise too throws up
various methodological issues that are interesting to debate.11 However, it is
important to point out that the kind of software I have had available to me
for this study (and I believe is available generally) is quite crude, because it
only allows the identification of exact repetitions, and only if the researcher
specifies a very precise number of words.12 For example, if one asks for a list
of all instances of four-word repetitions, this will throw up phrases such as 'in
the event that' and 'in the event of' but not 'in any event', because this phrase
consists of three rather than four words. Similarly, a request for a list of three-word repetitions will not return phrases such as 'in the event that' or 'in the event
of'. Moreover, the software does not identify discontinuous patterns such as 'in
the [unlikely] event of'.
We are clearly working with fairly crude software at the moment, and
therefore have to be very cautious in making any claims at this stage. Having
said that, and to place this exercise in a realistic context, it is also important to
point out that the software in question is nowhere near as crude as trying to
find patterns of this type manually.
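The limitation described above (exact repetitions of a fixed length only) can be illustrated with a minimal sketch. This is not the software used in the study, which the article does not name; the function names `tokenize` and `repeated_ngrams` and the sample sentences are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def repeated_ngrams(tokens, n, min_count=2):
    """Count every contiguous n-word sequence; keep those seen at least min_count times."""
    counts = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Invented sample text; punctuation is ignored, so chains may cross sentence boundaries.
sample = (
    "In the event that it rains we stay. In the event of fire we leave. "
    "In any event we meet. In the event that it snows we stay. "
    "In the event of flood we leave. In any event we talk. "
    "In the unlikely event of delay we wait. In the unlikely event of error we stop."
)

tokens = tokenize(sample)
four = repeated_ngrams(tokens, 4)   # request for four-word repetitions
three = repeated_ngrams(tokens, 3)  # request for three-word repetitions

print("in the event that" in four)  # a four-word chain is found in the four-word list
print("in any event" in four)       # a three-word phrase cannot appear in that list
print(four["in the event of"])      # the discontinuous 'in the unlikely event of' is not counted here
```

A request for one length says nothing about phrases of another length, and an intervening word such as 'unlikely' produces a different chain altogether, which is why lists of several lengths have to be consulted side by side.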
To get around the software problem (at least partly) for the purpose of
this study, I pulled out lists of phrases of various lengths, but mostly 3-word,
4-word, and 5-word phrases; examples can be seen in Appendix 1, including
an example of a two-word phrase with a punctuation mark (that is,). I also
selected some lexical phrases from each list to analyze more closely on the basis
of full concordances for each corpus.
Because of the crudeness of the software, the lists generated by the program contained a great deal of noise: in this case combinations of words which are not recognizable as fixed or semi-fixed lexical phrases, such as 'the monument to the battle' and 'the shrine of the lady', which are listed as occurring 32 and 22 times respectively in TEC. At this stage, no principled way or robust methodology suggests itself for selecting specific phrases to analyze in detail, nor for systematically weeding out irrelevant patterns. Very broadly, however, I tried to follow two principles in selecting from the two sets of lists generated by the software some 50 or so patterns for closer analysis to inform this methodological exploration:
(a) all patterns selected had to be recognizable as recurring lexical phrases of English (e.g. 'in other words', 'once and for all', 'at the same time', 'as a matter of fact') rather than phrases that are clearly tied to the theme of a single text. Examples of the latter include 'history of the siege of', which occurs 44 times in TEC. This is part of the title of a book, The History of the Siege of Lisbon, by José Saramago, translated from Portuguese by Giovanni Pontiero. Similarly, the curious expression 'I reflected in the wing chair' is the most frequent 6-word pattern in TEC, but it was not selected for analysis because all 144 instances occur in the same text (Cutting Timber, by Thomas Bernhard, translated from German by Ewald Osers).
(b) phrases relating to temporal and spatial orientation (such as in the middle
of, for the first time, at the end of, in front of the, the end of the month) were
ignored, because they occur with very high frequency in both corpora.
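Principle (a) was applied by hand in this study, but part of it, namely setting aside chains that occur in one text only, could in principle be approximated automatically. The following is a minimal sketch under that assumption; the function names, the `min_texts` threshold and the three toy texts are hypothetical and are not taken from the study.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Frequency of each contiguous n-word chain in one text."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def phrase_range(texts, n):
    """Map each chain to (total frequency, number of texts it occurs in).

    texts is a dict of text name -> list of tokens.
    """
    totals = Counter()
    spread = Counter()
    for tokens in texts.values():
        counts = ngram_counts(tokens, n)
        totals.update(counts)         # add the per-text frequencies
        spread.update(counts.keys())  # add one for each text containing the chain
    return {phrase: (totals[phrase], spread[phrase]) for phrase in totals}


def drop_single_text_phrases(stats, min_texts=2):
    """Keep only chains attested in at least min_texts different texts."""
    return {p: v for p, v in stats.items() if v[1] >= min_texts}


# Toy corpus, invented for illustration only.
texts = {
    "text_a": "i reflected in the wing chair and i reflected in the wing chair".split(),
    "text_b": "once and for all he said once and for all".split(),
    "text_c": "she decided once and for all to leave".split(),
}

stats = phrase_range(texts, 4)
kept = drop_single_text_phrases(stats)
print(kept)
```

A filter of this kind only removes chains tied to a single text; deciding whether the remaining chains are recognizable lexical phrases of English still requires the researcher's judgement.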
At any rate, the material I am about to discuss (some of which is detailed
in Appendix 1) should not be interpreted as systematic findings, and I stress
that I am not presenting it as such. What I am trying to do here is explore
the kind of questions that researchers can sensibly try to address (at least in
part) using the corpus methodology, and how they might go about refining
their questions and following up the various research threads that this type of
resource can bring to their attention.
The crude nature of the software aside, a number of patterns caught my
attention as I began to select some phrases and look at their concordances more
closely. Given the methodological focus of this article, it seems reasonable to
organize the discussion under headings that might guide a methodology for
identifying and assessing patterns of this type in general, rather than under the
individual patterns selected for analysis.
Overall frequency and number of recurring lexical phrases in both corpora
As a first step, it seems reasonable to establish whether there is a noticeable
difference between the two corpora in terms of the overall number and frequencies of the lexical patterns we have chosen to focus on. We may assume,
for instance, that if translators into English did favour fluency as an overall
strategy, this preference would be reflected in a higher reliance on recurring, familiar lexical phrases of the language: frequent use of recognizable, fixed or semi-fixed lexical phrases must be a major way of producing an impression of fluency in a text.
I have already stressed the unreliability of the lists generated by the software I have available at the moment, so we cannot rely on an automatic comparison of frequencies of all phrases generated by the program in this particular
study. Nevertheless, the lists do suggest that a significant difference might exist between the two corpora in this respect. At least for the patterns that I selected and decided to examine more closely, the difference sometimes seems
staggering. Here are some examples of differences in the overall frequencies of
different types of lexical phrases that occur in the two corpora:
Phrase                     TEC   BNC
at the same time           669   323
in the middle of the       401   209
from time to time          394   137
on the other hand          347   150
that is,                   288   119
in other words             161    36
that is to say             129    31
once and for all           120    26
when it comes to            78    35
at the edge of the          67    46
I thought to myself         43    12
in a manner of speaking     40    10
This does not mean that there are no differences in the other direction: some
patterns must occur more frequently in BNC than in TEC, though these are not
as easy to spot as the patterns listed above, and many others like them. Examples of this type, though the difference does not seem so significant, include the
7-word phrase out of the corner of his eye, which occurs 18 times in BNC and 13
times in TEC, and the 6-word phrase on the other side of the, which occurs 126
times in BNC and 117 times in TEC. What we need is a piece of software that
can run through both lists and automatically identify significant differences in
the frequencies of phrases which occur in both corpora.13 We would then have
to examine these carefully and make some sense of them in terms of the other
issues I will be tackling next. But the point I'm making here is that overall frequency is only one issue to consider in this respect. It is merely a starting point, but one we cannot afford to ignore.
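One way such a comparison might be automated is sketched below. The article does not name a statistic; the log-likelihood (G2) measure used here is simply one common choice in corpus linguistics and is an assumption of this sketch, as are the function names. The corpus sizes come from Figure 1, the frequencies for 'at the same time' (669 in TEC, 323 in BNC) and 'out of the corner of his eye' (13 in TEC, 18 in BNC) from the figures reported above, and corpus size in words is used as a rough stand-in for the number of possible chain positions.

```python
import math

TEC_SIZE = 6_613_456  # words/tokens in the translated subcorpus (Figure 1)
BNC_SIZE = 6_423_325  # words/tokens in the non-translated subcorpus (Figure 1)

# Chi-square critical value for 1 degree of freedom at p < 0.05.
CRITICAL_05 = 3.84


def per_million(freq, corpus_size):
    """Frequency normalised to occurrences per million words."""
    return freq / corpus_size * 1_000_000


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-corpus log-likelihood (G2) for one phrase.

    Expected frequencies assume the phrase is equally likely in both corpora.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# (TEC frequency, BNC frequency) for two of the phrases discussed above.
phrases = {
    "at the same time": (669, 323),
    "out of the corner of his eye": (13, 18),
}

results = {
    phrase: log_likelihood(tec, bnc, TEC_SIZE, BNC_SIZE)
    for phrase, (tec, bnc) in phrases.items()
}

for phrase, g2 in results.items():
    flag = "above" if g2 > CRITICAL_05 else "below"
    print(f"{phrase}: G2 = {g2:.2f} ({flag} the 3.84 threshold)")
```

On these figures the difference for 'at the same time' comes out well above the threshold and that for 'out of the corner of his eye' below it, which matches the impression that the latter difference 'does not seem so significant'. A statistic of this kind flags candidates only; it does not explain them.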
Distribution across texts
Apart from differences between the two corpora in terms of overall frequencies
of individual lexical phrases, the next question concerns the distribution of
individual phrases across the texts which constitute each corpus: irrespective of
the overall frequency of a given phrase, is it evenly distributed across different
texts, or does it occur with higher frequency in some texts rather than others?
The examples in Appendix 1 suggest that the distribution of repeated lexical phrases may prove somewhat less even in translated text, with individual
texts showing what appear to be relatively high levels of repetition of the same
expression in many cases. The most striking example is the repetition of 'that
is,' in Shaun Whiteside's translation Notebooks (63 occurrences). Appendix 1
also includes a full concordance of the 63 instances, which readers may wish to
examine closely at their leisure.
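How evenly a phrase is spread across texts can also be expressed as a single figure. The sketch below uses the 'deviation of proportions' measure (DP) proposed by Gries, which is a technique brought in here for illustration and not one used in the study; the counts and text lengths are invented. DP compares the share of a phrase's occurrences found in each text with that text's share of the corpus: values near 0 indicate an even spread, values approaching 1 indicate concentration in a few texts.

```python
def deviation_of_proportions(counts, lengths):
    """Gries's DP for one phrase.

    counts[i]  = occurrences of the phrase in text i
    lengths[i] = length of text i in words
    """
    total_count = sum(counts)
    total_length = sum(lengths)
    if total_count == 0:
        raise ValueError("the phrase does not occur in any text")
    return 0.5 * sum(
        abs(count / total_count - length / total_length)
        for count, length in zip(counts, lengths)
    )


# Invented figures: three texts of 100,000, 50,000 and 50,000 words.
lengths = [100_000, 50_000, 50_000]

even = deviation_of_proportions([10, 5, 5], lengths)     # occurrences proportional to length
skewed = deviation_of_proportions([20, 0, 0], lengths)   # all occurrences in one text

print(even, skewed)
```

Because the measure takes text length into account, it separates the effect of a long text from genuine over-use of a phrase within it.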
Assuming this pattern of uneven distribution holds as we examine more
data, the next thing a researcher might find useful to establish is whether there
is a plausible reason for the high frequency of a specific lexical phrase or of
several lexical phrases in a specific text or indeed in the work of a specific
translator. For example, is the high frequency partly a function of the length
of the text? I have already drawn attention to the imbalance both between and
within the corpora informing this study in terms of text lengths, so this is one
issue to bear in mind when interpreting some of the emerging patterns.
It is possible to run the software program I used in this study on individual
texts as well as a corpus of texts. I have done so with a number of individual
texts by Giovanni Pontiero and Peter Bush, who are among the best represented
translators in the corpus. My initial reaction is that the frequency of use of lexical phrases is not entirely a function of the length of the text. Even in the longest
translation by Peter Bush, there seems to be very little repetition of specific lexical phrases. This is not the case in translations by Giovanni Pontiero, as we can
see by comparing the top of the lists of 4-word phrases in one of Bush's translations (Realms of Strife, by Juan Goytisolo; 96,600 words in total) and one of
Pontiero's (The History of the Siege of Lisbon, by José Saramago; 125,713 words).
List of individual four-word chains (Realms of Strife; cut-off point 5):
17 ON_THE_RUE_POISSONNIERE
14 AT_THE_END_OF
14 ON_THE_RUE_DE
12 IN_THE_COURSE_OF
11 FOR_THE_FIRST_TIME
9 AS_A_RESULT_OF
9 CASA_DE_LAS_AMERICAS
9 THE_CASA_DE_LAS
8 A_MEMBER_OF_THE
8 WITH_A_GROUP_OF
7 BY_THE_IDEA_OF
7 FLAT_ON_THE_RUE
7 IN_ONE_OF_THE
7 IN_THE_COMPANY_OF
7 IN_THE_FIELD_OF
7 ON_THE_EVE_OF
7 THE_FIRST_TIME_IN
6 A_FEW_DAYS_LATER
6 A_GROUP_OF_FRIENDS
6 FROM_TIME_TO_TIME
6 IN_RELATION_TO_THE
6 MY_RETURN_TO_PARIS
6 ON_MY_RETURN_TO
6 ON_THE_OTHER_HAND
6 THE_RUE_DE_BIEVRE
6 TO_MEET_UP_WITH
6 TO_THE_POINT_OF
6 TURNED_OUT_TO_BE
6 WAS_NOT_AT_ALL
5 AT_THE_SAME_TIME
List of individual four-word chains (The History of the Siege of Lisbon; cut-off point 5):
48 HISTORY_OF_THE_SIEGE
48 THE_SIEGE_OF_LISBON
44 OF_THE_SIEGE_OF
36 THE_HISTORY_OF_THE
25 THE_PORTA_DE_FERRO
19 THAT_IS_TO_SAY
18 AS_IF_HE_WERE
18 ESCADINHAS_DE_SAO_CRISPIM
18 IT_SAYS_HERE_THAT
18 THE_ESCADINHAS_DE_SAO
15 IT_IS_TRUE_THAT
14 TO_BE_ABLE_TO
12 AS_FAR_AS_THE
12 AT_THE_SAME_TIME
12 FOR_THE_FIRST_TIME
12 OF_THE_HISTORY_OF
10 IF_HE_WERE_TO
10 OUR_LORD_JESUS_CHRIST
9 AS_IF_HE_HAD
9 A_MANNER_OF_SPEAKING
9 FROM_TIME_TO_TIME
9 IN_A_MANNER_OF
8 AND_AT_THAT_MOMENT
8 IF_WE_WERE_TO
8 IN_THE_CASE_OF
8 IN_THE_DIRECTION_OF
8 IN_THE_PRESENCE_OF
8 IT_IS_DIFFICULT_TO
8 MILAGRE_DE_SANTO_ANTONIO
8 NO_MORE_THAN_A
8 ON_THE_ESCADINHAS_DE
8 ON_THE_OTHER_HAND
8 THE_BISHOP_OF_OPORTO
8 THE_FACT_IS_THAT
8 WERE_IT_NOT_FOR
8 WHEN_IT_COMES_TO
7 AT_DEAD_OF_NIGHT
7 DO_MILAGRE_DE_SANTO
7 GUILLAUME_OF_THE_LONG
7 IN_A_STATE_OF
7 IN_FRONT_OF_THE
7 IN_ORDER_TO_BE
7 OF_DOM_AFONSO_HENRIQUES
7 OF_THE_PORTA_DE
7 ONCE_AND_FOR_ALL
7 ON_THE_OTHER_SIDE
7 ON_TOP_OF_THE
7 RUA_DO_MILAGRE_DE
7 TAKING_INTO_ACCOUNT_THE
7 THERE_MUST_HAVE_BEEN
7 THE_MONTE_DA_GRAÇA
7 THE_RUA_DO_MILAGRE
7 YOU_ONLY_HAVE_TO
6 AND_THEN_IT_MIGHT
6 BEAR_IN_MIND_THAT
6 DOM_AFONSO_HENRIQUES_WAS
6 DR_MARIA_SARA_WHO
6 IN_THE_MIDDLE_OF
6 IT_MIGHT_BE_ST
6 IT_WOULD_HAVE_BEEN
6 MARIA_SARA_AND_RAIMUNDO
6 NOT_TO_MENTION_THE
6 OVER_AND_OVER_AGAIN
6 SARA_AND_RAIMUNDO_SILVA
6 THAT_HE_SHOULD_HAVE
6 THEN_IT_MIGHT_BE
6 THERE_WILL_BE_NO
6 THE_ARCHBISHOP_OF_BRAGA
6 THE_CONQUEST_OF_SANTAREM
6 THE_DIRECTION_OF_THE
6 THE_FAITHFUL_TO_PRAYER
6 THE_PROOFS_OF_THE
6 THE_SHEET_OF_PAPER
6 WHAT_IS_YOUR_NAME
5 ALL_THE_MORE_SO
Even if we allow for the difference in overall length of the two texts, there
does seem to be a greater tendency to rely on fixed and semi-fixed lexical
phrases in Pontiero's translation. This type of strategy becomes easier to identify when we examine several works by the same translator and find that the
same lexical phrases are used again and again in different texts by different authors. I will return to the question of individual translators and their stylistic
preferences shortly.
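Allowing for the difference in length can be made concrete by converting raw counts to a rate per 100,000 words. The sketch below does this for three chains that appear in both lists, using the counts as read from the lists above and the text lengths given earlier (96,600 and 125,713 words); the function name and the choice of 100,000 words as the base are assumptions made for illustration.

```python
# Text lengths in words, as given in the discussion of the two translations.
LENGTHS = {
    "Realms of Strife": 96_600,
    "The History of the Siege of Lisbon": 125_713,
}

# Raw counts read from the two four-word lists:
# (Realms of Strife, The History of the Siege of Lisbon).
COUNTS = {
    "AT_THE_SAME_TIME": (5, 12),
    "FROM_TIME_TO_TIME": (6, 9),
    "ON_THE_OTHER_HAND": (6, 8),
}


def per_100k(count, length):
    """Occurrences per 100,000 words of running text."""
    return count / length * 100_000


rates = {
    chain: (
        per_100k(bush, LENGTHS["Realms of Strife"]),
        per_100k(pontiero, LENGTHS["The History of the Siege of Lisbon"]),
    )
    for chain, (bush, pontiero) in COUNTS.items()
}

for chain, (bush_rate, pontiero_rate) in rates.items():
    print(f"{chain}: {bush_rate:.2f} vs {pontiero_rate:.2f} per 100,000 words")
```

Normalising in this way controls only for length. It leaves open the other explanations considered in this section, such as characterization or features carried over from the source text.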
Leaving the question of individual translators and their stylistic preferences
aside for a moment, and staying with the question of individual texts, there are
other reasons why a lexical phrase or set of phrases might occur with higher
frequencies in individual texts, apart from the overall length of the text in question. One such reason has to do with strategies of characterization. I have not
found any examples of this in TEC but the only two instances of noticeably high
frequencies of specific lexical phrases in an individual text that I have identified
in the BNC subcorpus (for the phrase that is to say) can be explained in these
terms. As can be seen in Appendix 1, this expression occurs 17 times in the
36,433-word extract from The Remains of the Day by Kazuo Ishiguro. The narrator in this book is a butler; he is portrayed as a very old-fashioned character who is obsessed with accuracy and detail and is therefore constantly rewording what he says in order to be more accurate. The second example is of the same expression being used 6 times in the 43,859-word extract from Nice Work by David Lodge, where the author or narrator himself draws attention to a particular character's use of the expression. In both cases the reuse of a lexical phrase is a conscious, deliberate strategy on the part of the writer. Strategies of this type could in principle also account for higher frequencies of specific phrases in individual texts in TEC.
By contrast, if fluency was a favoured strategy in translation then we would
not expect the high frequency of a lexical phrase in a given translation to be associated with a particular character. A good example of this is Infanta (by Bodo
Kirchhoff, translated by John Brownjohn), where the 21 uses of the expression
in other words are spread across the speech of five different characters, as well
as the voice of the narrator. This is of course where the distinction between
fiction (with its many voices and characters) and biography (where we often
have far fewer voices represented) becomes important, and the inclusion of
biography in this study may then raise specific problems.
As far as translations are concerned, another possible explanation for the
high frequency of a specific lexical phrase in an individual text could be that it
is a direct carrying over of a feature of the source text: it could have nothing
to do with a translator's attempt (conscious or otherwise) to use familiar or
unmarked lexical phrases to give an impression of fluency. This is clearly one
avenue that many translation scholars would be keen to explore, but it is precisely this tendency to refer everything back to the source text that the Translational English Corpus project was designed to counterbalance. I therefore wish
to highlight other ways in which we might explore questions of this type on a
larger scale, without being unduly restricted to one source text and one target
text at a time, even if ultimately we will want to go back to the source text in
some cases to seek further or complementary explanations.
Other interesting patterns might capture our attention as we sift through
the data, and these patterns might lead us to seek explanations outside the direct source/target text relationship. For example, the expression in other words
occurs 21 times in a biography translated by Carol Maier (Delirium and Destiny: A Spaniard in Her Twenties). Of these 21 instances, two occur in Carol
Maier's own afterword. Might this suggest a stylistic quirk of the translator
rather than a carrying over of a feature in the source text?
Distribution across translators
Next is the issue of distribution across translators. I have already mentioned
that even the longest translation by Peter Bush seems sparing in its repetition
of individual lexical phrases. In fact, some of the best represented translators
in TEC are conspicuous by their absence or very marginal presence in the various
concordances I have examined closely. Peter Bush and Lawrence Venuti,
for example, do not seem to make heavy use of any particular fixed or semi-fixed
lexical phrases. This may or may not be confirmed by further and closer
analysis of lexical phrases other than the ones I have managed to study so far. If
it were to be confirmed, it would not come as a surprise to those familiar with
the translators in question: both are very conscious of their use of language and
have repeatedly argued that translators should not pander to the expectations
of an Anglo-American readership.
At any rate, in terms of focusing on individual translators rather than individual texts, some of the questions we may wish to address are as follows. If
an expression occurs with high frequency in the work of a specific translator,
could it be because:
(a) it is a favourite expression/quirk of the translator independent of the style
of the author? Translators are writers, and like other writers may have their
particular favoured expressions. Since TEC is specifically designed to represent
several works by the same translator working with different authors, we should
be able to establish in most cases whether the frequent use of a set expression
is a feature of the output of a particular translator, irrespective of source text
author, or whether it only occurs in translations of a specific author.
(b) it reflects a translation strategy, rather than simply stylistic preference on
the part of the translator? For example, I specifically chose to look more closely
at three lexical phrases used for glossing/explicating: that is, / that is to say /
in other words. Giovanni Pontiero, the best represented translator in the corpus, uses that is to say 47 times and in other words 31 times overall. The concordance of that is to say in Appendix 1 shows that Pontiero uses this type of
glossing expression fairly heavily in practically all his translations. Bearing in
mind that some of these are translations of the Portuguese author José Saramago and some are translations of the very different Brazilian author Clarice
Lispector, this is an interesting pattern which might be worth looking into in
more detail.
(c) Finally, but this would require much more extensive and detailed study of
the work of a specific translator, there is the question of whether we can identify
an overall tendency for a given translator to rely heavily on fixed or semi-fixed
lexical phrases throughout his or her work. This would be an attempt to explore
not so much the overall question of whether fluency is a preferred strategy in
English translations but whether it is a preferred strategy of a specific translator.
We can only explore this question of course, or rather any question relating to a
specific translator, if we have several works by the translator in the corpus. This
means that, for example, although the repetition of the expression that is, 63
times in Shaun Whiteside's translation (Notebooks) is striking, there is little we
can say about this because it is the only translation we currently have by him in
TEC.14
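The first of these questions, whether an expression recurs in a translator's output irrespective of the source text author, can be approached by grouping per-file counts by translator. The sketch below is only an illustration under stated assumptions: the names ROWS, profile_by_translator and spans_authors are hypothetical, and the data are the TEC entries listed for that is to say in Appendix 1, which records only the files with the highest counts, so the totals it yields are partial (41 for Pontiero, against the 47 reported above for the corpus as a whole).

```python
from collections import defaultdict

# (file, translator, source author, occurrences of "that is to say"),
# copied from the TEC entries in Appendix 1; not the full concordance
ROWS = [
    ("fn000005", "Giovanni Pontiero", "José Saramago", 19),
    ("fn000006", "Giovanni Pontiero", "Clarice Lispector", 15),
    ("fn000018", "Michael Hulse", "Elfriede Jelinek", 12),
    ("fn000008", "Sophie Bennett", "Hoda Barakat", 9),
    ("fn000024", "Terry Hale & Liz Heron", "various", 9),
    ("fn000007", "Giovanni Pontiero", "José Saramago", 7),
]


def profile_by_translator(rows):
    """Group counts by translator and record which authors are involved."""
    profile = defaultdict(lambda: {"total": 0, "texts": 0, "authors": set()})
    for _file, translator, author, count in rows:
        entry = profile[translator]
        entry["total"] += count
        entry["texts"] += 1
        entry["authors"].add(author)
    return dict(profile)


def spans_authors(profile, translator):
    """True if the listed occurrences come from more than one source author."""
    return len(profile[translator]["authors"]) > 1


profile = profile_by_translator(ROWS)
for name, entry in profile.items():
    print(name, entry["total"], entry["texts"], sorted(entry["authors"]))
```

A translator for whom spans_authors returns True is a candidate for closer study, nothing more; a translator represented by a single text, as in the Whiteside case, cannot be assessed in this way at all.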
The temporal dimension
It would also be interesting to look into the development of a translator's style
over time. I say this because, for instance, there are six translations by Giovanni
Pontiero in TEC, the first published in 1986 and the last in 1996, with a spread
of 10 years between them. The first translation, The Hour of the Star by Clarice
Lispector (28,580 words), is admittedly much shorter than the other five, but it
simply does not figure in any of the concordances I have examined, including
concordances of glossing expressions such as that is to say and in other words,
which feature prominently in Pontiero's other translations. It would be interesting
to use a resource such as TEC to explore whether there is evidence of a
change of style and strategy over time in the case of a particular translator.
Summary and conclusions

In summing up what I have tried to do in this exploratory study, I would like
to start by stressing that in corpus work, as in research generally, figures and frequencies are only a starting point. We need to take a closer look at the data and
get a feel for the texts and what is happening in them, as well as the people
who produce these texts, in order to move beyond low-level description to situated explanation. The value of raw figures and frequencies is simply that they
draw our attention to some features that are likely to be worth investigating in
more detail.15 They offer one rationale for selecting features to focus on, but
they cannot offer an interpretation of those features, nor does documenting
such quantitative features in itself provide a justification for undertaking the
research in the first place. Indeed, in corpus work, as in any other type of research,
the real challenge lies in two things: one is how a researcher might select
features to focus on, and the other is how he or she might interpret what is
found in the data.
In terms of the first issue, researchers working with corpora must realize
that just because the computer appears to be objectively churning out data
this does not mean that the process of selecting what one focuses on is not just
as subjective and just as variable as it is in any other type of research. Here, as
elsewhere in research, we all create our object of study. Indeed, one thing that
is interesting to monitor in this type of research is the way in which one's own
perspective as a researcher creates the object of research and contextualizes the
findings. For example, of all the wealth of potentially interesting data that an
exercise such as this can throw up, I decided to focus on a number of expressions that are typically used for glossing or explicating (that is, that is to say, in
other words); another researcher might have chosen to focus on completely different types of expression. Moreover, my attempt to explain the patterns that I
saw emerging as I examined the distribution of these phrases in translated and
non-translated text focused on the translator and the individual text, where another researcher might have been more inclined to focus on something like the
source language (maybe the texts that have a higher incidence of repetitions are
translations from a particular source language or languages), or the translator's
gender. I chose to focus largely on individual translators, irrespective of their
gender or the languages from which they translate.
Secondly, irrespective of the issue of subjectivity in selecting what to focus
on, the question of how one arrives at plausible explanations of whatever one
chooses to find is just as complex and elusive in corpus work as it is in all
research. The computer can help us locate textual features, but it
cannot explain them. The onus of interpretation still lies with the researcher.
Where the corpus methodology does score highly in my view is in allowing
a higher level of transparency. Corpus-based work, if done responsibly, at least
has the virtue of being transparent and allowing other researchers not only to
check the validity of the basic claims being made but also to offer different
interpretations of the same data.
And finally, I would like to stress again that corpus-based research in principle takes textual material as a starting point, but this does not mean that it
necessarily ignores or sets out to downplay the human element. Nor should it
be seen as a free-standing methodology that does not need to be
complemented by other methods of research. Like any other methodology, it
can only take us so far, and no further.
Acknowledgements
I am grateful to the following for assistance in undertaking this piece of research.
For access to the chains program (authored by Isabel Barth): Professor
Michael Stubbs, Universität Trier. For software development and maintenance
of TEC: Saturnino Luz, Trinity College Dublin. For computational support:
Paul Johnston, UMIST. For administrative assistance on the TEC project: Gabriela
Saldanha, former MPhil student at the Centre for Translation & Intercultural
Studies, UMIST, currently PhD student at Dublin City University.
Notes
1. TEC is held at the Centre for Translation and Intercultural Studies, University of Manchester (http://www.art.man.ac.uk/SML/ctis/research/tec.htm). For a detailed description of
this corpus, see Laviosa (1998b); Baker (1999); Olohan & Baker (2000).
2. TEC is being enlarged on a regular basis. As the size of TEC grows, my colleagues and
I have also been adding more texts to the BNC subcorpus that we use in our studies. The
sizes of both corpora, and hence their composition, may therefore vary from one study to
another, but details of such variation are provided where relevant.
3. See Sinclair (1986) on the issue of too much evidence.
4. See Laviosa (1998a) and Olohan (2000) for examples of such studies.
5. See Kilgariff (2001) for a detailed discussion of issues relating to the comparison of
corpora.
6. For a very good and accessible discussion of various issues relating to corpus compilation,
including the pros and cons of opting for full texts vs text extracts, see Kenny (2001, Chapter
5, especially pp. 105–117).
7. And biography, in turn, includes a number of sub-genres which the compilers of TEC
chose to treat as one broad genre: biographies, autobiographies, memoirs, and books which
consist of correspondence between well-known personalities. An example of the latter is
The Boulez-Cage Correspondence, translated from French by Robert Samuels and published
by Cambridge University Press.
8. There is of course ultimately no ideal basis of comparison, whatever phenomena we are
attempting to compare and whatever standards of comparison we choose to use. Since every
phenomenon, every event, and every text is by nature unique, comparability will always
remain a relative issue.
9. For an overview of some of these claims, see Baker (1996, 1999).
10. As well of course as its syntactic make-up and a host of other features on the discourse
level.
11. For details of computational methods used in capturing the data presented in this study,
readers may contact Paul Johnston in the first instance (Paul.Johnston@umist.ac.uk).
12. The software in question is called Chains (authored by Isabel Barth, February 2001). It
is not available commercially. The program identifies chains of words which recur in a text
or corpus. A chain is a sequence of word-forms: either two-word pairs (i.e. sequences of two
adjacent word-forms) or longer chains of repeated word-forms (e.g. a five-word sequence).
The program proceeds through the text, with a moving window, identifying each x-word
sequence (as specified by the user). Each chain is then checked against stored sequences, and
the program eventually prints out a list of all x-word sequences and their frequency.
13. Something similar to what I'm envisaging here exists for lists of individual words, but
not for phrases. This is the compare wordlists function in Wordsmith Tools. The procedure
compares all the words in two lists, already generated by the wordlist program, and reports
on all those which appear significantly more often in one than the other, including words
which appear more than a minimum number of times in one even if they do not appear at
all in the other.
14. One of the more interesting aspects of corpus work is that it can throw up unexpected
patterns of this type, which can then feed back into the design of the corpus itself. In this
case, the TEC team is now actively seeking other translations by Shaun Whiteside to include
in the corpus.
15. On the issue of using raw frequencies without recourse to measures of statistical significance, see Danielsson (2001, 2003).
16. A character who, rather awkwardly for me, doesn't herself believe in the concept of
character. That is to say (a favourite phrase of her own), Robyn Penrose, Temporary Lecturer
in English Literature at the University of Rummidge, holds that character is a bourgeois
myth, an illusion created to reinforce the ideology of capitalism.
17. What is more, I learned from Mr. Tomero Alarcón that a substantial part of Delirio y
destino may well have been dictated to an amanuensis, whose identity is now unknown. In
other words, my developing hunch about Zambrano's sentences had been correct: not only
was the book written quickly and its narrative mixed with both philosophical thinking
and poetic association, parts of it may also have been spoken and simultaneously recorded.
I have deliberately referred to that goal in terms of an occurrence rather than a product,
because what results when delirium (with its precarious and potentially rewarding consequences) and destiny (with its precarious and potentially rewarding possibility) interact is
the occurrence I most wanted my translation to convey. In other words, I wanted to translate, above all, Zambrano's razón poética, which is present in Delirio y destino more as an
event, a manifestation of what Giles Deleuze has discussed, also in terms of writing, as a
possibility of life that invokes the oppressed bastard race that ceaselessly stirs beneath
dominations, resisting everything that crushes and oppresses.
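The moving-window procedure described in the note on the Chains program can be sketched in a few lines. What follows is an illustrative reconstruction from that description only, not the Chains software itself, which is not publicly available: the function name chains, the min_freq cut-off and the crude tokenization (lower-casing, keeping only letters and apostrophes) are all assumptions introduced here.

```python
import re
from collections import Counter


def chains(text, n, min_freq=2):
    """List recurring n-word sequences and their frequencies.

    A window of n adjacent word-forms is moved through the text one
    word at a time; every sequence seen is counted, and only those
    occurring at least min_freq times are returned.
    """
    if n < 2:
        raise ValueError("a chain needs at least two word-forms")
    # Assumed tokenization: lower-case, letters and apostrophes only.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    return [(chain, freq) for chain, freq in counts.most_common()
            if freq >= min_freq]


sample = "That is to say, he was late. That is to say, he was very late."
for chain, freq in chains(sample, 4):
    print(freq, chain)
```

Note that in this sketch the window runs across sentence boundaries and punctuation, so some of the sequences it reports will not be lexical phrases in any meaningful sense; as stressed in the main text, the resulting list is a starting point for closer reading rather than a finding in itself.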
References
Baker, M. (2000). Towards a Methodology for Investigating the Style of a Literary Translator. Target, 12 (2), 241–266.
Baker, M. (1999). The Role of Corpora in Investigating the Linguistic Behaviour of
Professional Translators. International Journal of Corpus Linguistics, 4 (2), 281–298.
Baker, M. (1998). Réexplorer la langue de la traduction: une approche par corpus. Meta, 43
(4), 480–485.
Baker, M. (1996). Corpus-based Translation Studies: the Challenges that Lie Ahead. In
H. Somers (Ed.), Terminology, LSP and Translation (pp. 175–186). Amsterdam &
Philadelphia: John Benjamins.
Baker, M. (1995). Corpora in Translation Studies: An Overview and Some Suggestions for
Future Research. Target, 7 (2), 223–243.
Baker, M. (1993). Corpus Linguistics and Translation Studies. Implications and Applications. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and Technology: In
Honour of John Sinclair (pp. 233–250). Amsterdam: John Benjamins.
Bosseaux, Ch. (2001). A Study of the Translator's Voice and Style in the French Translations
of Virginia Woolf's The Waves. In M. Olohan (Ed.), CTIS Occasional Papers, Volume 1
(pp. 55–75). Manchester: CTIS, UMIST.
Danielsson, P. (2001). The Automatic Identification of Meaningful Units in Language.
Doctoral Dissertation, Department of Swedish, Göteborg University, Sweden.
Danielsson, P. (2003). Automatic Extraction of Meaningful Units from Corpora: A Corpus-driven Approach Using the Word stroke. International Journal of Corpus Linguistics, 8
(1), 109–127.
Gellerstam, M. (1986). Translationese in Swedish Novels Translated from English. In L.
Wollin & H. Lindquist (Eds.), Translation Studies in Scandinavia (pp. 88–95). Lund:
CWK Gleerup.
Kenny, D. (2001). Lexis and Creativity in Translation. Manchester: St. Jerome.
Kenny, D. (2000a). Lexical Hide-and-Seek: looking for creativity in a parallel corpus. In M.
Olohan (Ed.), Intercultural Faultlines. Research Models in Translation Studies I: Textual
and Cognitive Aspects (pp. 93–104). Manchester: St. Jerome.
Kenny, D. (2000b). Translators at Play: Exploitations of Collocational Norms in German-English Translation. In B. Dodd (Ed.), Working with German Corpora (pp. 143–160).
Birmingham: University of Birmingham Press.
Kenny, D. (1997). (Ab)normal Translations: a German-English Parallel Corpus for Investigating Normalization in Translation. In B. Lewandowska-Tomaszczyk & P. J. Melia
(Eds.), Practical Applications in Language Corpora. PALC '97 Proceedings (pp. 387–392).
Łódź: Łódź University Press.
Kilgariff, A. (2001). Comparing Corpora. International Journal of Corpus Linguistics, 6 (1),
97–132.
Laviosa, S. (Ed.). (1998a). L'Approche basée sur le corpus/The Corpus-based Approach. Special
Issue of Meta, 43 (4).
Laviosa, S. (1998b). The English Comparable Corpus: A Resource and a Methodology. In L.
Bowker, M. Cronin, D. Kenny & J. Pearson (Eds.), Unity in Diversity: Current Trends in
Translation Studies (pp. 101–112). Manchester: St. Jerome Publishing.
Laviosa, S. (1997). How Comparable Can Comparable Corpora Be? Target, 9 (2), 289–319.
Laviosa-Braithwaite, S. (1997). Investigating Simplification in an English Comparable
Corpus of Newspaper Articles. In K. Klaudy & J. Kohn (Eds.), Transferre Necesse Est
(pp. 531–540). Budapest: Scholastica.
Laviosa-Braithwaite, S. (1995). Comparable Corpora: Towards a Corpus Linguistic
Methodology for the Empirical Study of Translation. In M. Thelen & B. Lewandowska-Tomaszczyk (Eds.), Translation and Meaning (Part 3) (pp. 153–163). Maastricht:
Hogeschool Maastricht.
Olohan, M. (2001). Spelling out the Optionals in Translation: A Corpus Study. UCREL
Technical Papers, 13, 423–432.
Olohan, M. (Ed.). (2000). Intercultural Faultlines. Research Models in Translation Studies I:
Textual and Cognitive Aspects. Manchester: St. Jerome Publishing.
Olohan, M. & Baker, M. (2000). Reporting that in Translated English: Evidence for
Subconscious Processes of Explicitation. Across Languages & Cultures, 1 (2), 141–158.
Sinclair, J. (1986). First throw away your evidence. In G. Leitner (Ed.), The English Reference
Grammar (pp. 56–65). Tübingen: Max Niemeyer.
Sinclair, J. (1996). The Search for Units of Meaning. Textus, IX, 75–106.
Stubbs, M. (1996). Text and Corpus Analysis. Oxford: Basil Blackwell.
Toury, G. (1995). Descriptive Translation Studies and Beyond. Amsterdam & Philadelphia:
John Benjamins.
Venuti, L. (1995). The Translator's Invisibility. London & New York: Routledge.
Appendix 1
Key
Lexical pattern in large bold italics. Details for BNC (corpus of non-translated
English) & TEC (corpus of translated English). Detailed information on files
accounting for a high percentage of occurrences in either corpus: filename, followed by number of occurrences of expression in file, total extent of file, name
of translator, source language, title of published translation, author of source
text.
that is,
BNC: Total: 119 instances. Maximum 5 in 1 text.
TEC: Total: 288 instances. Maximum 63 in 1 text.
bb000003 (63)
Extent: 78,144 words
Shaun Whiteside, German. Notebooks 1924–1954 (Wilhelm Furtwängler)
fn000071 (31)
Extent: 123,865
Nancy Roberts, Arabic. Beirut Nightmares (Ghada Samman)
fn000008 (17)
Extent: 72,239
Sophie Bennett, Arabic. The Stone of Laughter (Hoda Barakat)
bb000005 (16)
Naomi Seidman, Hebrew. Conversations with Dvora. An Experimental Biography of the First
Modern Hebrew Woman Writer (Amia Lieblich)
fn000011 (12)
Extent: 27,770
Margaret Jull Costa, Portuguese. Lucio's Confession (Mario de Sa Carneiro)
that is to say
BNC: Total instances: 31; maximum 17 in one file.
ar3 (17)
Kazuo Ishiguro, The Remains of the Day (voice of butler throughout; obsessed with detail
and accuracy; part of characterization)
any (6)
David Lodge, Nice Work (5 instances in the voice of Robyn Penrose, the boring academic; 1
instance of narrator commenting on the character's use of the expression; again part of
characterization)16
TEC: Total instances: 129; maximum 19 in one file.
fn000005 (19)
Extent: 125,713
Giovanni Pontiero, Portuguese. The History of the Siege of Lisbon (José Saramago)
fn000006 (15)
Extent: 193,720
Giovanni Pontiero, Brazilian Portuguese. Discovering the World (Clarice Lispector)
fn000018 (12)
Extent: 80,530
Michael Hulse, German. Wonderful, Wonderful World (Elfriede Jelinek)
fn000008 (9)
Extent: 72,239
Sophie Bennett, Arabic. The Stone of Laughter (Hoda Barakat)
fn000024 (9)
Extent: 116,273
Terry Hale & Liz Heron, French. The Dedalus Book of French Horror: The 19th Century (various)
fn000007 (7)
Extent: 142,178
Giovanni Pontiero, Portuguese. The Gospel According to Jesus Christ (José Saramago)
in other words
BNC: Total: 36 instances; maximum 3 in one text.
TEC: Total: 162 instances; maximum 21 (in two texts).
bb000011 (21)
Extent: 120,643
Carol Maier, Spanish. Delirium and Destiny: A Spaniard in Her Twenties (Maria Zambrano)
Note: 2 of the 21 instances in bb000011 are in Carol Maier's own afterword.17
fn000020 (21)
Extent: 144,659
John Brownjohn, German. Infanta (Bodo Kirchhoff)
Note: multiplicity of voices; not part of characterization.
bb000012 (12)
Extent: 70,521
Robert Samuels, French. The Boulez-Cage Correspondence (Pierre Boulez and John Cage)
fn000006 (11)
Extent: 193,720
fn000058 (11)
Extent: 64,103
Ewald Osers, German. Cutting Timber (Thomas Bernhard)
once and for all

BNC: Total instances: 26; maximum 2 in 1 file.
TEC: Total instances: 120; maximum 10 in one file.
fn000014 (10)
Extent: 80,900
Ines Rieder and Jill Hannum, German. Violetta (Pieke Bierman)
fn000005 (7)
Extent: 125,713
fn000007 (6)
Extent: 142,178
Giovanni Pontiero, Portuguese. The Gospel According to Jesus Christ (José Saramago)
fn000071 (6)
Extent: 123,865
when it comes to
BNC: Total instances: 35; maximum 3 (1 file)
TEC: Total instances: 78; maximum 26 (1 file)
fn000006 (26)
Extent: 193,720
fn000005 (8)
Extent: 125,713
fn000004 (5)
Extent: 114,826
Giovanni Pontiero, Portuguese. The Stone Raft (José Saramago)
I thought to myself
BNC: Total instances: 12; maximum 4 (2 files)
TEC: Total instances: 43; maximum 20 in 1 file
fn000006 (20)
Extent: 193,720
fn000071 (7)
Extent: 123,865
in a manner of speaking
BNC: Total instances: 10; maximum 2 in 1 file.
TEC: Total instances: 40; maximum 9 in 1 file.
fn000005 (9)
Extent: 125,713
Giovanni Pontiero, Portuguese. The History of the Siege of Lisbon (José Saramago)
fn000058 (7)
Extent: 64,103
Ewald Osers, German. Cutting Timber (Thomas Bernhard)