Introduction
For a number of years now, I have been involved in a research project which attempts, among other things, to compare translated and non-translated English
text on the basis of a computerized collection of translated English (Transla-
International Journal of Corpus Linguistics. John Benjamins Publishing Company.
Mona Baker
interesting difficulty to have too much rather than too little to go on3 and
no less challenging than the difficulty of not having sufficient data to inform
our research.
At any rate, the project requires a certain level of investment in research
time and funding, both of which are currently in very short supply, at least in
the context of British academia. Nevertheless, it is important for researchers in
the current climate to press on with carefully thought out research agendas in
spite of financial and other obstacles, and so we continue to do our best with
the current resources.
There are at least two important methodological issues here, both of which
concern the basis of comparison.5 The first problem is that although the two
corpora are very similar in size, the BNC consists largely of extracts while TEC
consists of full texts (this is why we have far fewer texts in TEC even though
it is a slightly larger corpus). This has implications for the way we interpret
some of the findings I will present shortly, and it is therefore important to keep
this difference in the composition of the two corpora in mind. At the same
time, we must also recognize that imbalances of this type are inevitable. Furthermore, they are not specific to corpus-based studies. It is in the nature of
any type of comparison, any attempt to look for similarities and differences,
that what is being compared can never be totally balanced in every respect. It is
also a feature of full-text corpora, particularly those of literary
texts, that they cannot be balanced even internally: literary texts vary tremendously in their lengths, and if we are to include full texts in our corpus (which
is desirable for many reasons),6 we have to accept that the individual texts will
be seriously imbalanced in terms of size. In addition to this internal imbalance
in the corpus, there is also the imbalance between full-text corpora like TEC
and those, like the BNC, which are made up of extracts. The first methodological difficulty then, which is largely inevitable, concerns the imbalance in size
between and within the two corpora which provide the basis of comparison.
The second problem concerns the composition of the corpora in terms of
genres. The BNC subcorpus on which the current study is based consists of fiction only. TEC, on the other hand, consists of fiction and biography,7 which I
have chosen to group together under the heading of narrative. These are the
kinds of decisions and compromises that researchers have to make all the time
kinds of decisions and compromises that researchers have to make all the time
in the course of conducting descriptive studies. I could defend my decision
to include biography in the TEC corpus on the basis that (a) it is narrative
and in this sense shares many features with fiction as a genre, and (b) the distinction between fiction and biography is deliberately being blurred by many
contemporary authors of biography. Whether or not readers accept this type
of justification or find it plausible, the important issue to bear in mind here
is that decisions of this type clearly have implications for the way we interpret
any findings that we present in our research.
Even if we accept the decision to include biography in the TEC corpus, there is still the question of the comparability of fictional texts in BNC
and TEC: fiction is far too broad a category to ensure a reliable basis of
comparison.8 The BNC corpus used in the current study consists of a selection of texts/extracts from the imaginative domain; these were individually
scrutinized to match as closely as possible the fictional texts in TEC, in terms
of type of fiction, year of publication, and so on.
Overall Number of Translators in the Narrative Subcorpus of TEC
57 (individual); 4 (team)
Best Represented Translators (no. of texts/words)
Giovanni Pontiero (6 texts; 562,292 words)
Dorothy Blair (6 texts; 462,888 words)
Peter Bush (5 texts; 296,146 words)
Lawrence Venuti (4 texts; 214,098 words)
However fuzzy some of these notions might appear to be, if there is any
truth in them, particularly the claim of fluency as the overriding strategy in
Anglo-American translations, we ought to be able to trace the impact of such
strategies on the lexical make-up of translated English text.10 Given the type of
corpora now available to researchers, especially TEC, it should then be interesting to look at recurring lexical patterns, on the assumption that, for example,
if Anglo-American translators did favour fluency as an overall strategy then we
might reasonably expect to find a higher level of recurrence of fixed or semi-fixed lexical phrases in translated as opposed to non-translated English text.
And it should also be interesting to explore how individual translators respond
to this type of social pressure to produce fluent (and hence unmarked) language: language that does not draw attention to itself at the lexical level.
This is an important issue, and one which is acknowledged in Venuti's work
on fluency as a favoured strategy in Anglo-American translations as well as
Gideon Toury's broader work on norms: whatever the overall pattern might
prove to be, there will always be individual translators who opt to use different
strategies, to go against the norm. Hence the interest in moving beyond the
description of overall patterns to study patterns of variation among individual
translators.
What we are looking for then is recurring lexical patterns or lexical phrases
in translated and non-translated English, and patterns of variation in the use of
these recurring phrases among individual translators. There are various ways
in which a researcher can get at this kind of data. I do not propose to offer a
full-blown description of the mechanics of pulling out repeated patterns and
comparing their frequency at this stage, though this exercise too throws up
various methodological issues that are interesting to debate.11 However, it is
important to point out that the kind of software I have had available to me
for this study (and I believe is available generally) is quite crude, because it
only allows the identification of exact repetitions, and only if the researcher
specifies a very precise number of words.12 For example, if one asks for a list
of all instances of four-word repetitions, this will throw up phrases such as "in
the event that" and "in the event of" but not "in any event", because this phrase
consists of three rather than four words. Similarly, a request for a list of three-word repetitions will not return phrases such as "in the event that" or "in the event
of". Moreover, the software does not identify discontinuous patterns such as "in
the [unlikely] event of".
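The behaviour described here can be illustrated with a minimal sketch. It is not the program used in this study: the function name, the tokenization rule, the cut-off parameter and the sample text are all assumptions made for illustration only. It counts exact, contiguous sequences of a fixed number of words and nothing else.

```python
from collections import Counter
import re


def repeated_chains(text, n, cutoff=2):
    """Count exact n-word sequences; keep those occurring at least `cutoff` times.

    Hypothetical helper: lower-cases the text and treats any run of
    letters or apostrophes as a word (an assumption, not the study's rule).
    """
    words = re.findall(r"[a-z']+", text.lower())
    chains = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {" ".join(chain): count for chain, count in chains.items() if count >= cutoff}


# Invented sample text, built only to reproduce the cases discussed above.
sample = (
    "In the event that it rains we stay. In the event of fire we leave. "
    "In any event we meet. In the event that it snows we stay. "
    "In the event of flood we leave. In any event we talk. "
    "In the unlikely event of error we stop."
)

four_word = repeated_chains(sample, 4)
three_word = repeated_chains(sample, 3)
```

On this sample the four-word list contains "in the event that" and "in the event of" but cannot contain "in any event", the three-word list contains "in any event" but neither of the longer phrases, and the discontinuous "in the unlikely event of" is not counted as an instance of "in the event of". The four-word list also contains chains that straddle sentence boundaries, such as "we stay in the", which gives a sense of how much sifting lists of this kind require.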
We are clearly working with fairly crude software at the moment, and
therefore have to be very cautious in making any claims at this stage. Having
said that, and to place this exercise in a realistic context, it is also important to
point out that the software in question is nowhere near as crude as trying to
find patterns of this type manually.
To get around the software problem (at least partly) for the purpose of
this study, I pulled out lists of phrases of various lengths, but mostly 3-word,
4-word, and 5-word phrases; examples can be seen in Appendix 1, including
an example of a two-word phrase with a punctuation mark (that is,). I also
selected some lexical phrases from each list to analyze more closely on the basis
of full concordances for each corpus.
Because of the crudeness of the software, the lists generated by the program
contained a great deal of noise: in this case combinations of words which are
not recognizable as fixed or semi-fixed lexical phrases, such as "the monument
to the battle" and "the shrine of the lady", which are listed as occurring 32 and 22
times respectively in TEC. At this stage, no principled way or robust methodology suggests itself for selecting specific phrases to analyze in detail, nor for
systematically weeding out irrelevant patterns. Very broadly, however, I tried to
follow two principles in selecting from the two sets of lists generated by the software some 50 or so patterns for closer analysis to inform this methodological
exploration:
their questions and following up the various research threads that this type of
resource can bring to their attention.
The crude nature of the software aside, a number of patterns caught my
attention as I began to select some phrases and look at their concordances more
closely. Given the methodological focus of this article, it seems reasonable to
organize the discussion under headings that might guide a methodology for
identifying and assessing patterns of this type in general, rather than under the
individual patterns selected for analysis.
Overall frequency and number of recurring lexical phrases
in both corpora
As a first step, it seems reasonable to establish whether there is a noticeable
difference between the two corpora in terms of the overall number and frequencies of the lexical patterns we have chosen to focus on. We may assume,
for instance, that if translators into English did favour fluency as an overall
strategy, this preference would be reflected in a higher reliance on recurring,
familiar lexical phrases of the language: frequent use of recognizable, fixed or
semi-fixed lexical phrases must be a major way of producing an impression of
fluency in a text.
I have already stressed the unreliability of the lists generated by the software
I have available at the moment, so we cannot rely on an automatic comparison of frequencies of all phrases generated by the program in this particular
study. Nevertheless, the lists do suggest that a significant difference might exist between the two corpora in this respect. At least for the patterns that I selected and decided to examine more closely, the difference sometimes seems
staggering. Here are some examples of differences in the overall frequencies of
different types of lexical phrases that occur in the two corpora:
                             TEC    BNC
at the same time             669    323
in the middle of the         401    209
from time to time            394    137
on the other hand            347    150
that is,                     288    119
in other words               161     36
that is to say               129     31
once and for all             120     26
when it comes to              78     35
at the edge of the            67     46
I thought to myself           43     12
in a manner of speaking       40     10
This does not mean that there are no differences in the other direction: some
patterns must occur more frequently in BNC than in TEC, though these are not
as easy to spot as the patterns listed above, and many others like them. Examples of this type, though the difference does not seem so significant, include the
7-word phrase "out of the corner of his eye", which occurs 18 times in BNC and 13
times in TEC, and the 6-word phrase "on the other side of the", which occurs 126
times in BNC and 117 times in TEC. What we need is a piece of software that
can run through both lists and automatically identify significant differences in
the frequencies of phrases which occur in both corpora.13 We would then have
to examine these carefully and make some sense of them in terms of the other
issues I will be tackling next. But the point I'm making here is that overall frequency is only one issue to consider in this respect. It is merely a starting point,
but one we cannot afford to ignore.
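One way such a tool might work is sketched below, using the log-likelihood statistic commonly applied to frequency comparisons between two corpora. This is an illustrative assumption, not a method used in this study. The phrase counts are those quoted above, but the corpus sizes are hypothetical placeholders (the two corpora are described only as very similar in size), so the resulting values are indicative at best.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood (G2) for one phrase observed in two corpora.

    Expected frequencies assume the phrase is equally likely in both corpora.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)

    def term(observed, expected):
        # A zero count contributes nothing to the statistic.
        return observed * math.log(observed / expected) if observed else 0.0

    return 2 * (term(freq_a, expected_a) + term(freq_b, expected_b))


# ASSUMPTION: placeholder corpus sizes in words, not the real figures.
TEC_SIZE = 1_000_000
BNC_SIZE = 1_000_000

# (TEC count, BNC count) as reported in the text.
counts = {
    "at the same time": (669, 323),
    "out of the corner of his eye": (13, 18),
    "on the other side of the": (117, 126),
}

CRITICAL_VALUE = 3.84  # chi-square, 1 degree of freedom, p < 0.05

scores = {
    phrase: log_likelihood(tec, bnc, TEC_SIZE, BNC_SIZE)
    for phrase, (tec, bnc) in counts.items()
}
flagged = [phrase for phrase, score in scores.items() if score > CRITICAL_VALUE]
```

Under the equal-size assumption only "at the same time" is flagged, while the two phrases that are slightly more frequent in BNC fall well below the threshold. A statistic of this kind says nothing about why a difference arises, which is where the remaining questions come in.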
Distribution across texts
Apart from differences between the two corpora in terms of overall frequencies
of individual lexical phrases, the next question concerns the distribution of
individual phrases across the texts which constitute each corpus: irrespective of
the overall frequency of a given phrase, is it evenly distributed across different
texts, or does it occur with higher frequency in some texts rather than others?
The examples in Appendix 1 suggest that the distribution of repeated lexical phrases may prove somewhat less even in translated text, with individual
texts showing what appear to be relatively high levels of repetition of the same
expression in many cases. The most striking example is the repetition of "that
is," in Shaun Whiteside's translation Notebooks (63 occurrences). Appendix 1
also includes a full concordance of the 63 instances, which readers may wish to
examine closely at their leisure.
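A crude way of putting a figure on unevenness of this kind is to ask what share of all occurrences of a phrase in a corpus comes from the single text that uses it most. The sketch below does this for the four phrases for which Appendix 1 gives totals and maxima in both corpora. The measure and the function name are assumptions made for illustration; the measure ignores both the length of the texts and the number of texts in each corpus.

```python
def max_text_share(total, max_in_one_text):
    """Share of a phrase's occurrences contributed by its most frequent text."""
    return max_in_one_text / total


# (total, maximum in one text) per corpus, as listed in Appendix 1.
appendix = {
    "that is,": {"BNC": (119, 5), "TEC": (288, 63)},
    "in other words": {"BNC": (36, 3), "TEC": (162, 21)},
    "when it comes to": {"BNC": (35, 3), "TEC": (78, 26)},
    "I thought to myself": {"BNC": (12, 4), "TEC": (43, 20)},
}

shares = {
    phrase: {corpus: max_text_share(*figures) for corpus, figures in corpora.items()}
    for phrase, corpora in appendix.items()
}

for phrase, by_corpus in shares.items():
    print(f"{phrase:22s} BNC {by_corpus['BNC']:.0%}   TEC {by_corpus['TEC']:.0%}")
```

For all four phrases the share is higher in TEC than in BNC, which is consistent with the impression of less even distribution in the translated texts, though four phrases are far too few to support more than that.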
Assuming this pattern of uneven distribution holds as we examine more
data, the next thing a researcher might find useful to establish is whether there
is a plausible reason for the high frequency of a specific lexical phrase or of
several lexical phrases in a specific text or indeed in the work of a specific
translator. For example, is the high frequency partly a function of the length
of the text? I have already drawn attention to the imbalance both between and
within the corpora informing this study in terms of text lengths, so this is one
issue to bear in mind when interpreting some of the emerging patterns.
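The effect of text length can be checked roughly by converting raw counts into a rate per 10,000 words. The sketch below does so for the TEC files listed under "that is," in Appendix 1; the per-10,000 scaling and the function name are assumptions made for illustration.

```python
def per_10k(occurrences, extent_in_words):
    """Occurrences of a phrase per 10,000 words of running text."""
    return occurrences / extent_in_words * 10_000


# filename: (occurrences of "that is,", extent in words), from Appendix 1.
that_is_tec = {
    "bb000003": (63, 78_144),
    "fn000071": (31, 123_865),
    "fn000008": (17, 72_239),
    "bb000005": (16, 135_019),
    "fn000011": (12, 27_770),
}

rates = {name: per_10k(*figures) for name, figures in that_is_tec.items()}
ranked = sorted(rates, key=rates.get, reverse=True)

for name in ranked:
    print(f"{name}: {rates[name]:.2f} per 10,000 words")
```

Ranked by rate rather than by raw count, the shortest file (fn000011, 27,770 words) moves from last place to second, while bb000003 remains clearly ahead of the rest. Length matters, then, but does not by itself account for the pattern.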
It is possible to run the software program I used in this study on individual
texts as well as a corpus of texts. I have done so with a number of individual
texts by Giovanni Pontiero and Peter Bush, who are among the best represented
translators in the corpus. My initial reaction is that the frequency of use of lexical phrases is not entirely a function of the length of the text. Even in the longest
translation by Peter Bush, there seems to be very little repetition of specific lexical phrases. This is not the case in translations by Giovanni Pontiero, as we can
see by comparing the top of the lists of 4-word phrases in one of Bush's translations (Realms of Strife, by Juan Goytisolo; 96,600 words in total) and one of
Pontiero's (The History of the Siege of Lisbon, by José Saramago; 125,713 words).
List of individual four-word chains (cut-off point 5):

Frequencies only (the corresponding chains are missing):
17, 14, 14, 12, 11, 9, 9, 9, 8, 8, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6

6  TO_MEET_UP_WITH
6  TO_THE_POINT_OF
6  TURNED_OUT_TO_BE
6  WAS_NOT_AT_ALL
5  AT_THE_SAME_TIME

7  DO_MILAGRE_DE_SANTO
7  GUILLAUME_OF_THE_LONG
7  IN_A_STATE_OF
7  IN_FRONT_OF_THE
7  IN_ORDER_TO_BE
7  OF_DOM_AFONSO_HENRIQUES
7  OF_THE_PORTA_DE
7  ONCE_AND_FOR_ALL
7  ON_THE_OTHER_SIDE
7  ON_TOP_OF_THE
7  RUA_DO_MILAGRE_DE
7  TAKING_INTO_ACCOUNT_THE
7  THERE_MUST_HAVE_BEEN
7  THE_MONTE_DA_GRAÇA
7  THE_RUA_DO_MILAGRE
7  YOU_ONLY_HAVE_TO
6  AND_THEN_IT_MIGHT
6  BEAR_IN_MIND_THAT
6  DOM_AFONSO_HENRIQUES_WAS
6  DR_MARIA_SARA_WHO
6  IN_THE_MIDDLE_OF
6  IT_MIGHT_BE_ST
6  IT_WOULD_HAVE_BEEN
6  MARIA_SARA_AND_RAIMUNDO
6  NOT_TO_MENTION_THE
6  OVER_AND_OVER_AGAIN
6  SARA_AND_RAIMUNDO_SILVA
6  THAT_HE_SHOULD_HAVE
6  THEN_IT_MIGHT_BE
6  THERE_WILL_BE_NO
6  THE_ARCHBISHOP_OF_BRAGA
6  THE_CONQUEST_OF_SANTAREM
6  THE_DIRECTION_OF_THE
6  THE_FAITHFUL_TO_PRAYER
6  THE_PROOFS_OF_THE
6  THE_SHEET_OF_PAPER
6  WHAT_IS_YOUR_NAME
5  ALL_THE_MORE_SO
Even if we allow for the difference in overall length of the two texts, there
does seem to be a greater tendency to rely on fixed and semi-fixed lexical
phrases in Pontiero's translation. This type of strategy becomes easier to identify when we examine several works by the same translator and find that the
same lexical phrases are used again and again in different texts by different authors. I will return to the question of individual translators and their stylistic
preferences shortly.
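The kind of check involved can be sketched with the file-level entries given in Appendix 1: group the entries by translator and phrase, and keep the pairs that turn up in more than one text. The data layout and the variable names are assumptions made for illustration, and since the Appendix lists only the files that account for a high share of occurrences, the result is necessarily partial.

```python
from collections import defaultdict

# (phrase, translator, filename, occurrences), copied from Appendix 1.
APPENDIX_ROWS = [
    ("that is,", "Shaun Whiteside", "bb000003", 63),
    ("that is,", "Nancy Roberts", "fn000071", 31),
    ("that is,", "Sophie Bennett", "fn000008", 17),
    ("that is,", "Naomi Seidman", "bb000005", 16),
    ("that is,", "Margaret Jull Costa", "fn000011", 12),
    ("in other words", "Carol Maier", "bb000011", 21),
    ("in other words", "John Brownjohn", "fn000020", 21),
    ("in other words", "Robert Samuels", "bb000012", 12),
    ("in other words", "Giovanni Pontiero", "fn000006", 11),
    ("in other words", "Ewald Osers", "fn000058", 11),
    ("when it comes to", "Giovanni Pontiero", "fn000006", 26),
    ("when it comes to", "Giovanni Pontiero", "fn000005", 8),
    ("when it comes to", "Giovanni Pontiero", "fn000004", 5),
    ("I thought to myself", "Giovanni Pontiero", "fn000006", 20),
    ("I thought to myself", "Nancy Roberts", "fn000071", 7),
    ("I thought to myself", "Ewald Osers", "fn000058", 7),
]

# Collect the set of files in which each translator uses each phrase.
texts_by_pair = defaultdict(set)
for phrase, translator, filename, _count in APPENDIX_ROWS:
    texts_by_pair[(translator, phrase)].add(filename)

# Keep only phrases that a translator uses in two or more different texts.
recurring = {
    pair: len(files) for pair, files in texts_by_pair.items() if len(files) >= 2
}
```

On these entries the only pair that survives is Giovanni Pontiero's use of "when it comes to", which appears in three of his translations, two of them of novels by the same author and one of a work by a different author.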
Leaving the question of individual translators and their stylistic preferences
aside for a moment, and staying with the question of individual texts, there are
other reasons why a lexical phrase or set of phrases might occur with higher
frequencies in individual texts, apart from the overall length of the text in question. One such reason has to do with strategies of characterization. I have not
found any examples of this in TEC but the only two instances of noticeably high
frequencies of specific lexical phrases in an individual text that I have identified
in the BNC subcorpus (for the phrase that is to say) can be explained in these
terms. As can be seen in Appendix 1, this expression occurs 17 times in the
36,433-word extract from The Remains of the Day by Kazuo Ishiguro. The narrator in this book is a butler; he is portrayed as a very old-fashioned character
who is obsessed with accuracy and detail and is therefore constantly rewording
what he says in order to be more accurate. The second example is of the same
expression being used 6 times in the 43,859-word extract from Nice Work by
David Lodge, where the author or narrator himself draws attention to a particular character's use of the expression. In both cases the reuse of a lexical
phrase is a conscious, deliberate strategy on the part of the writer. Strategies
of this type could in principle also account for higher frequencies of specific
phrases in individual texts in TEC.
By contrast, if fluency was a favoured strategy in translation then we would
not expect the high frequency of a lexical phrase in a given translation to be associated with a particular character. A good example of this is Infanta (by Bodo
Kirchhoff, translated by John Brownjohn), where the 21 uses of the expression
"in other words" are spread across the speech of five different characters, as well
as the voice of the narrator. This is of course where the distinction between
fiction (with its many voices and characters) and biography (where we often
have far fewer voices represented) becomes important, and the inclusion of
biography in this study may then raise specific problems.
As far as translations are concerned, another possible explanation for the
high frequency of a specific lexical phrase in an individual text could be that it
is a direct carrying over of a feature of the source text: it could have nothing
to do with a translator's attempt (conscious or otherwise) to use familiar or
unmarked lexical phrases to give an impression of fluency. This is clearly one
avenue that many translation scholars would be keen to explore, but it is precisely this tendency to refer everything back to the source text that the Transla-
Acknowledgements
I am grateful to the following for assistance in undertaking this piece of research. For access to the chains program (authored by Isabel Barth): Professor
Michael Stubbs, Universität Trier. For software development and maintenance
of TEC: Saturnino Luz, Trinity College Dublin. For computational support:
Paul Johnston, UMIST. For administrative assistance on the TEC project: Gabriela
Saldanha, former MPhil student at the Centre for Translation & Intercultural
Studies, UMIST, currently PhD student at Dublin City University.
Notes
. TEC is held at the Centre for Translation and Intercultural Studies, University of Manchester (http://www.art.man.ac.uk/SML/ctis/research/tec.htm). For a detailed description of
this corpus, see Laviosa (1998b); Baker (1999); Olohan & Baker (2000).
. TEC is being enlarged on a regular basis. As the size of TEC grows, my colleagues and
I have also been adding more texts to the BNC subcorpus that we use in our studies. The
sizes of both corpora, and hence their composition, may therefore vary from one study to
another, but details of such variation are provided where relevant.
. A character who, rather awkwardly for me, doesn't herself believe in the concept of
character. That is to say (a favourite phrase of her own), Robyn Penrose, Temporary Lecturer
in English Literature at the University of Rummidge, holds that character is a bourgeois
myth, an illusion created to reinforce the ideology of capitalism.
. What is more, I learned from Mr. Tomero Alarcón that a substantial part of Delirio y
destino may well have been dictated to an amanuensis, whose identity is now unknown. In
other words, my developing hunch about Zambrano's sentences had been correct: not only
was the book written quickly and its narrative mixed with both philosophical thinking
and poetic association, parts of it may also have been spoken and simultaneously recorded.
I have deliberately referred to that goal in terms of an occurrence rather than a product,
because what results when delirium (with its precarious and potentially rewarding consequences) and destiny (with its precarious and potentially rewarding possibility) interact is
the occurrence I most wanted my translation to convey. In other words, I wanted to translate, above all, Zambrano's razón poética, which is present in Delirio y destino more as an
event, a manifestation of what Giles Deleuze has discussed, also in terms of writing, as a
possibility of life that invokes the oppressed bastard race that ceaselessly stirs beneath
dominations, resisting everything that crushes and oppresses.
Baker, M. (1998). Réexplorer la langue de la traduction: une approche par corpus. Meta, 43
(4), 480-485.
Baker, M. (1996). Corpus-based Translation Studies: the Challenges that Lie Ahead. In
H. Somers (Ed.), Terminology, LSP and Translation (pp. 175-186). Amsterdam &
Philadelphia: John Benjamins.
Baker, M. (1995). Corpora in Translation Studies: An Overview and Some Suggestions for
Future Research. Target, 7 (2), 223-243.
Baker, M. (1993). Corpus Linguistics and Translation Studies. Implications and Applications. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and Technology: In
Honour of John Sinclair (pp. 233-250). Amsterdam: John Benjamins.
Bosseaux, Ch. (2001). A Study of the Translator's Voice and Style in the French Translations
of Virginia Woolf's The Waves. In M. Olohan (Ed.), CTIS Occasional Papers, Volume 1
(pp. 55-75). Manchester: CTIS, UMIST.
Danielsson, P. (2001). The Automatic Identification of Meaningful Units in Language.
Doctoral Dissertation, Department of Swedish, Göteborg University, Sweden.
Danielsson, P. (2003). Automatic Extraction of Meaningful Units from Corpora: A Corpus-driven Approach Using the Word stroke. International Journal of Corpus Linguistics, 8
(1), 109-127.
Appendix 1
Key
Lexical pattern in large bold italics. Details for BNC (corpus of non-translated
English) & TEC (corpus of translated English). Detailed information on files
accounting for a high percentage of occurrences in either corpus: filename, followed by number of occurrences of expression in file, total extent of file, name
of translator, source language, title of published translation, author of source
text.
that is,
BNC: Total: 119 instances. Maximum 5 in 1 text.
TEC: Total: 288 instances. Maximum 63 in 1 text.
bb000003 (63)
Extent: 78,144 words
Shaun Whiteside, German. Notebooks 1924-1954 (Wilhelm Furtwängler)
fn000071 (31)
Extent: 123,865
Nancy Roberts, Arabic. Beirut Nightmares (Ghada Samman)
fn000008 (17)
Extent: 72,239
Sophie Bennett, Arabic. The Stone of Laughter (Hoda Barakat)
bb000005 (16)
Extent: 135,019 words
Naomi Seidman, Hebrew. Conversations with Dvora. An Experimental Biography of the First
Modern Hebrew Woman Writer (Amia Lieblich)
fn000011 (12)
Extent: 27,770
Margaret Jull Costa, Portuguese. Lucio's Confession (Mario de Sa Carneiro)
that is to say
BNC: Total instances: 31; maximum 17 in one file.
ar3 (17)
Extent: 36,433 words
Kazuo Ishiguro, The Remains of the Day (voice of butler throughout; obsessed with detail
and accuracy; part of characterization)
any (6)
Extent: 43,859 words
David Lodge, Nice Work (5 instances: voice of Robyn Penrose, the boring academic; 1
instance of narrator commenting on the character's use of the expression; again part of
characterization)16
in other words
BNC: Total: 36 instances; maximum 3 in one text.
TEC: Total: 162 instances; maximum 21 (in two texts).
bb000011 (21)
Extent: 120,643
Carol Maier, Spanish. Delirium and Destiny: A Spaniard in Her Twenties (Maria Zambrano)
Note: 2 of the 21 instances in bb000011 are in Carol Maier's own afterword.17
fn000020 (21)
Extent: 144,659
John Brownjohn, German. Infanta (Bodo Kirchhoff)
Note: multiplicity of voices; not part of characterization.
bb000012 (12)
Extent: 70,521
Robert Samuels, French. The Boulez-Cage Correspondence (Pierre Boulez and John Cage)
fn000006 (11)
Extent: 193,720
Giovanni Pontiero, Brazilian Portuguese. Discovering the World (Clarice Lispector)
fn000058 (11)
Extent: 64,103
Ewald Osers, German. Cutting Timber (Thomas Bernhard)
when it comes to
BNC: Total instances: 35; maximum 3 (1 file)
TEC: Total instances: 78; maximum 26 (1 file)
fn000006 (26)
Extent: 193,720
Giovanni Pontiero, Brazilian Portuguese. Discovering the World (Clarice Lispector)
fn000005 (8)
Extent: 125,713
Giovanni Pontiero, Portuguese. The History of the Siege of Lisbon (José Saramago)
fn000004 (5)
Extent: 114,826
Giovanni Pontiero, Portuguese. The Stone Raft (José Saramago)
I thought to myself
BNC: Total instances: 12; maximum 4 (2 files)
TEC: Total instances: 43; maximum 20 in 1 file
fn000006 (20)
Extent: 193,720
Giovanni Pontiero, Brazilian Portuguese. Discovering the World (Clarice Lispector)
fn000071 (7)
Extent: 123,865
Nancy Roberts, Arabic. Beirut Nightmares (Ghada Samman)
fn000058 (7)
Extent: 64,103
Ewald Osers, German. Cutting Timber (Thomas Bernhard)