
QUARTERLY

Founded 1966


CONTENTS

ARTICLES
Using Language Corpora in Initial Teacher Education: Pedagogic Issues and Practical Applications 389
Anne O'Keeffe and Fiona Farr
A Corpus-Based Study of Idioms in Academic Speech 419
Rita Simpson and Dushyanthi Mendis
A Corpus Analysis of Would-Clauses Without Adjacent If-Clauses 443
Stefan Frazier
Amplifier Collocations in the British National Corpus: Implications for English Language Teaching 467
Graeme Kennedy
A Combined Corpus and Systemic-Functional Analysis of the Problem-Solution Pattern in a Student and Professional Corpus of Technical Writing 489
Lynne Flowerdew

BRIEF REPORTS AND SUMMARIES


The Corpus of English as Lingua Franca in Academic Settings 513
Anna Mauranen
Designing a Corpus for Translation and Language Teaching: The CEXI Experience 528
Silvia Bernardini
The International Corpus of Learner English: A New Resource for Foreign Language Learning and Teaching and Second Language Acquisition Research 538
Sylviane Granger
The Multimedia Adult ESL Learner Corpus 546
Stephen Reder, Kathryn Harris, and Kristen Setzler

TESOL QUARTERLY

Volume 37, Number 3

Autumn 2003

REVIEWS
Corpus Linguistics Texts 559
An Introduction to Corpus Linguistics
Graeme Kennedy
Corpus Linguistics
Tony McEnery and Andrew Wilson
Corpus Linguistics: Investigating Language Structure and Use
Douglas Biber, Susan Conrad, and Randi Reppen
Reviewed by Marie Helt
Edited Volumes on Corpus Linguistics 562
Corpus Linguistics in North America
Rita C. Simpson and John M. Swales (Eds.)
Learner English on Computer
Sylviane Granger (Ed.)
Reviewed by Federica Barbieri and Suzanne Eckhardt
Useful Web Sites for Corpus Linguistics in TESOL 564
Reviewed by Leslie Boyd, Jennifer Garland, Carol Radich, and Deborah Saari
Information for Contributors 569
TESOL Order Form
TESOL Membership Application


A Journal for Teachers of English to Speakers of Other Languages and of Standard English as a Second Dialect

Editor
CAROL A. CHAPELLE, Iowa State University

Guest Editor
SUSAN CONRAD, Portland State University

Brief Reports and Summaries Editors


SUSAN CONRAD, Portland State University

Reviews Editor
SUSAN CONRAD, Portland State University

Assistant Editor
ELLEN GARSHICK, TESOL Central Office

Assistant to the Editor


LILY COMPTON, Iowa State University

Editorial Advisory Board


J. D. Brown, University of Hawaii at Manoa
Suresh Canagarajah, Baruch College, City University of New York
Ryuko Kubota, The University of North Carolina at Chapel Hill
Constant Leung, King's College London
John Levis, Iowa State University
Jo A. Lewkowicz, University of Hong Kong
Brian Lynch, Portland State University
Paul Kei Matsuda, University of New Hampshire
Lourdes Ortega, Northern Arizona University
James E. Purpura, Teachers College, Columbia University
Miyuki Sasaki, Nagoya Gakuin University
Norbert Schmitt, University of Nottingham
Mack Shelley, Iowa State University
Kelleen Toohey, Simon Fraser University
Jessica Williams, University of Illinois at Chicago

Additional Readers

Svenja Adolphs, John Armbrust, Guy Aston, Dwight Atkinson, Geoff Barnbrook, Douglas Biber, Jen Burges, Pat Byrd, Ron Carter, Marianne Celce-Murcia, Winnie Cheng, Ulla Connor, Jeff Connor-Linton, Viviana Cortes, Sylvie DeCock, Jan DeCarrico, Dan Douglas, Susan Fitzmaurice, Gwyneth Fox, Volker Hegelheimer, Rebecca Hughes, Ken Hyland, Chris Kennedy, Manfred Krug, Merja Kytö, Brian Lynch, Michaela Mahlberg, Michael McCarthy, Mary McGroarty, Charles Meyer, Bernard Mohan, Nadja Nesselhauf, Alan Partington, Kristin Precht, Carol Radich, Randi Reppen, Sarah Rilling, Ute Römer, Betty Samraj, Michael Stubbs, John Swales, Hongyin Tao, Chris Tardy, Elena Tognini-Bonelli

Credits

Advertising arranged by Sarah Trujillo, TESOL Central Office, Alexandria, Virginia U.S.A. Typesetting by Capitol Communication Systems, Inc., Crofton, Maryland U.S.A. Printing and binding by Pantagraph Printing, Bloomington, Illinois U.S.A.
Copies of articles that appear in the TESOL Quarterly are available through ISI Document Solution, 3501 Market Street, Philadelphia, Pennsylvania 19104 U.S.A. Copyright 2003 Teachers of English to Speakers of Other Languages, Inc. US ISSN 0039-8322 (print), ISSN 1545-7249 (online)


TESOL is an international professional organization for those concerned with the teaching of English as a second or foreign language and of standard English as a second dialect. TESOL's mission is to develop the expertise of its members and others involved in teaching English to speakers of other languages to help them foster effective communication in diverse settings while respecting individuals' language rights. To this end, TESOL articulates and advances standards for professional preparation and employment, continuing education, and student programs; links groups worldwide to enhance communication among language specialists; produces high-quality programs, services, and products; and promotes advocacy to further the profession. Information about membership and other TESOL services is available from TESOL Central Office at the address below.
TESOL Quarterly is published in Spring, Summer, Autumn, and Winter. Contributions should be sent to the Editor or the appropriate Section Editors at the addresses listed in the Information for Contributors section. Publisher's representative is Paul Gibbs, Director of Publications. All material in TESOL Quarterly is copyrighted. Copying without the permission of TESOL, beyond the exemptions specified by law, is an infringement involving liability for damages.
Reader Response: You can respond to the ideas expressed in TESOL Quarterly by writing directly to editors and staff at tq@tesol.org. This will be a read-only service, but your opinions and ideas will be read regularly. You may comment on the topics raised in The Forum on an interactive bulletin board at http://communities.tesol.org/tq.
TESOL Home Page: You can find out more about TESOL services and publications by accessing the TESOL home page on the World Wide Web at http://www.tesol.org/.
Advertising in all TESOL publications is arranged by Sarah Trujillo, TESOL Central Office, 700 South Washington Street, Suite 200, Alexandria, Virginia 22314 USA, Tel. 703-836-0774. Fax 703-836-7864. E-mail tesol@tesol.org.

OFFICERS AND BOARD OF DIRECTORS 2002–2003


President
AMY SCHLESSMAN, Evaluation, Instruction, Design, Tucson, AZ USA; Northern Arizona University, Flagstaff, AZ USA

President-elect
MICHELE SABINO, University of Houston Downtown, Houston, TX USA

Past President
MARY LOU McCLOSKEY, Atlanta, GA USA

Secretary
CHARLES S. AMOROSINO, JR., Alexandria, VA USA

Treasurer
MARTHA EDMONDSON, Washington, DC USA

Mark Algren, University of Kansas, Lawrence, KS USA
Neil J. Anderson, Brigham Young University, Provo, UT USA
Mary Ann Boyd, Illinois State University (Emerita), Towanda, IL USA
Ayşegül Daloğlu, Middle East Technical University, Ankara, Turkey
Eric Dwyer, Florida International University, Miami, FL USA
Bill Eggington, Brigham Young University, Provo, UT USA
Mabel Gallo, Instituto Cultural Argentino Norteamericano, Buenos Aires, Argentina
Aileen Gum, City College, San Diego, CA USA
Jun Liu, University of Arizona, Tucson, AZ USA
Lucilla LoPriore, Italian Ministry of Education, Rome, Italy
Anne V. Martin, ESL Consultant/Instructor, Syracuse, NY USA
Jo Ann Miller, Universidad del Valle de Mexico, Col. Copilco el Bajo, Mexico DF, Mexico
Betty Ansin Smallwood, Center for Applied Linguistics, Washington, DC USA


Editor's Note
On behalf of the TESOL Quarterly readership, I thank Susan Conrad for editing this excellent special-topic issue showing the multifaceted connections between corpus linguistics and TESOL. Please note the Call for Abstracts on page 442 for the Autumn 2005 special-topic issue on Reconceptualizing Pronunciation in TESOL: Intelligibility, Identity, and World Englishes.

Carol A. Chapelle

In This Issue

Corpus linguistics is an approach to investigating language that is characterized by the use of large collections of texts (spoken, written, or both) and computer-assisted analysis methods. The approach encompasses great diversity in the kinds of research questions addressed, the specific techniques employed, and the contexts in which it is applied. Furthermore, because it is a relatively new approach, new corpora and new techniques are constantly under development. Assembling a special-topic issue on corpus linguistics in TESOL was, therefore, a somewhat daunting thought. Besides showcasing quality research in corpus linguistics, I hoped that this issue would introduce TESOL Quarterly readers to some of the diversity within corpus linguistics; look ahead to new corpora, projects, and research questions; and present resources for readers who wanted to become more involved with corpus linguistics. Most important, I wanted the issue to show that corpus linguistics is not the domain of computer geeks interested in arcane language trivia but rather addresses issues central to the TESOL profession. However, the entirely open submission process meant there was no guarantee about the issue's coverage.

I am pleased to say that this special-topic issue fulfills the roles that I envisioned. The articles cover a wide variety of important research questions, teaching applications, techniques, and contexts. The Brief Reports and Summaries describe new corpus development projects that will expand the research questions and teaching applications of corpus linguistics. And the reviews point readers to books and Web sites that will help them learn more about this approach. I am pleased, too, that the diversity in the authors' affiliations (from seven countries spanning Asia, Europe, North America, and the Pacific) speaks to the appeal of corpus linguistics throughout many regions of the world.

The articles are diverse in the types of language phenomena that they cover, their methodologies, and their contexts. All of the articles, however, demonstrate that studies within corpus linguistics contain both quantitative analyses (to show what is common and what is unusual in language) and qualitative, functional interpretations of how the language is used.

Anne O'Keeffe and Fiona Farr discuss the use of corpora in language teacher education courses. They introduce some common tools in corpus linguistics, such as concordancing and word lists, and discuss the training that students need to become comfortable using corpus materials. They use corpus linguistics to help prospective teachers in three major ways: developing pedagogic expertise, acquiring language expertise, and raising sociocultural awareness. Drawing on their own experience compiling a corpus of Irish English, they offer practical considerations for getting started with corpus-based investigations.

Rita Simpson and Dushyanthi Mendis present the results of a study of idioms in speech in university settings, based on analysis of the Michigan Corpus of Academic Spoken English. Their quantitative analysis of the most common idioms found no great differences in frequencies for monologic versus interactive speech or for academic divisions, such as humanities versus hard sciences. They found three salient pragmatic functions of the idioms (paraphrase, emphasis, and metalanguage) that are particularly relevant to the academic context. Their sample teaching materials demonstrate that the application of corpus research to materials development can help ESL learners gain experience with language in naturally occurring contexts.

Stefan Frazier focuses on a grammatical structure in American English: clauses that contain the modal would and appear in hypothetical and counterfactual environments. His quantitative analysis shows that, contrary to the typical treatment in ESL/EFL textbooks, the majority of would-clauses do not have adjacent if-clauses. Turning to the use of these would-clauses without adjacent if-clauses, he finds six basic functions. His analysis shows the contribution that a corpus-based study can make to reconsidering the design of grammar-teaching materials.

Graeme Kennedy investigates collocations (words that tend to appear with other words) in the British National Corpus. He analyzed 24 adverbs of degree and the adjectives or participles that they modify (e.g., incredibly exciting, utterly desolate). Using a statistical measure of the strength of co-occurrence of the two words, called the mutual information measure, he shows that the adverbs have strong tendencies to occur with words that share certain grammatical or semantic characteristics. He argues that giving due attention to collocations in language teaching is important for developing learners' fluency.

Lynne Flowerdew illustrates how corpus analyses can be combined with other analytical perspectives and demonstrates the insights that can be gained from comparing expert and novice corpora. Concentrating on the problem-solution rhetorical pattern in technical writing, she used corpus-based word frequency analyses to show that lexical choices provide linguistic evidence for this rhetorical pattern. She then categorized occurrences of one common word, problem, based on a systemic-functional system for defining causal relations. Flowerdew found that student writing mirrors certain aspects of the professional writing, but the comparison also reveals limitations in the students' lexicon, limitations that are prime areas to address in the classroom.

Also in this issue:

Anna Mauranen describes a project to make a corpus of English spoken as a lingua franca in university settings in Finland. This corpus is one of the first to address the need for corpora that show the target for EFL learners whose goal is not to speak with native speakers but to interact in communities where English is a lingua franca. Mauranen discusses the need for such corpora, the challenges of compiling the corpus, and the types of questions that will be answered when it is analyzed.

Silvia Bernardini describes a 1-million-word corpus for English-Italian translation students. She outlines the design, which makes multiple types of comparisons possible among originals and translations, and illustrates how she uses the corpus to teach sociocultural insights, discourse-structuring expressions, and lexical patterns. The project has the potential to be replicated in other contexts.

Sylviane Granger describes the design of the International Corpus of Learner English, outlining the learner and task variables in this 2.5-million-word corpus of texts written by EFL university undergraduates. The careful design of the corpus and sampling of students with 11 different L1s make contrastive interlanguage analyses and error analyses possible on a much larger scale than was previously feasible.

Stephen Reder, Kathryn Harris, and Kristen Setzler describe what may well be the first of a new generation of corpora: the Multimedia Adult ESOL Learner Corpus. The corpus is notable for containing language produced by very low level learners in language classrooms and for the fact that the transcribed language remains linked to video recordings. Users can therefore not only work with the transcription but also access the corresponding audio and video, recorded with multiple cameras and microphones in the classroom.

Reviews:

Marie Helt reviews three well-known introductory texts: An Introduction to Corpus Linguistics (Graeme Kennedy), Corpus Linguistics (Tony McEnery and Andrew Wilson), and Corpus Linguistics: Investigating Language Structure and Use (Douglas Biber, Susan Conrad, and Randi Reppen).

Federica Barbieri and Suzanne Eckhardt review two edited collections of corpus-based investigations: Corpus Linguistics in North America (Rita Simpson and John Swales, Eds.) and Learner English on Computer (Sylviane Granger, Ed.), which uses parts of the corpus described in Granger's Brief Report in this issue.

Leslie Boyd, Jennifer Garland, Carol Radich, and Deborah Saari review Web sites that they have found useful for learning about corpus linguistics in three regards: general information, concordancing software, and teaching applications.

The high quality of this issue owes much to the collective effort of many individuals. I am grateful to the numerous contributors of abstracts and manuscripts, who provided a wide array of work from which to choose; to the reviewers of the abstracts and manuscripts, whose comments and suggestions were invaluable in strengthening this issue; to the authors, who were consistently prompt and positive in responding to suggestions and inquiries; and to Carol Chapelle, for sharing her editorial experience and enthusiasm for the issue.

Susan Conrad


Using Language Corpora in Initial Teacher Education: Pedagogic Issues and Practical Applications
ANNE O'KEEFFE and FIONA FARR
University of Limerick
Limerick, Ireland

The vast increase in the number of corpus-based materials, such as dictionaries and grammars, attests to the importance of corpus linguistics to English language description. Developments are also evident in the use of corpora in the classroom in data-driven learning (Johns, 1991). These rapid developments in the use of language-related technology have not been matched by updated practices in teacher education. This article makes a case for the inclusion of corpus linguistics in initial language teacher education to enhance teachers' research skills and language awareness. The authors offer examples of corpus-based tasks for increasing students' understanding of word classes, register-related grammatical choices, and socioculturally conditioned grammatical choices. Practical considerations for the integration of language corpora in a teacher education program are outlined.

Applied linguists working on technology-related issues have for some time noted the relevance of technological changes in the digital global economy for TESOL (e.g., Chapelle, 2001; Cummins, 2000; Warschauer, 2000). Literacy is no longer just about reading and writing. Society now demands multiliteracies (Warschauer, 2000), which include a high proficiency in digital and online competencies (see also Doering & Beach, 2002; Pennington, 2001). Consequently, language teacher educators have a fundamental obligation to educate teachers in a way that empowers them to work in the modern world. The initiation and implementation of many national educational policies and directives targeted at teacher education institutions are testament to such an obligation.

Writing about in-service and preservice foreign language teaching, Murray (1998) and Barnes and Murray (1999) argue that information and communication technology (ICT) can no longer be an added extra but rather [is] an intrinsic part of a teacher's methodological repertoire (Barnes & Murray, 1999, p. 167). They conclude that this transition must occur in the initial teacher training period to have the greatest effect (p. 167) because many novice teachers are too busy with other matters in the first years of teaching to assume the task of developing and integrating ICT into their teaching and learning. Many researchers concur that promoting critical attitudes and developing conceptual as well as practical frameworks for technology in language learning is the key to meaningful future technology use (see, e.g., Egbert, Paulus, & Nakamichi, 2002; Meskill, Mossop, DiAngelo, & Pasquale, 2002). Doering and Beach (2002) argue that it is primarily through active participation with technology as opposed to receiving instruction about technology that preservice teachers learn to recognize the value of technology tools (p. 128). Others (e.g., Egbert et al., 2002; Murray, 1998; Tammelin, 2001) point out that mastery of ICT skills can also foster a positive attitude, increased confidence, and teacher empowerment.

However, including technology skills in teacher education adds a complex layer of issues to what is already a full curriculum. This article begins to address some of these issues by describing how students learn to use technology to exploit the linguistic data contained in corpora of English in English language teaching courses at the University of Limerick, in Ireland, where we have been integrating corpora into teacher education courses since 1997.

CORPUS LINGUISTICS AND LANGUAGE TEACHING


Many applied linguists who conduct corpus linguistic research are convinced of its significant impact on the field. According to McCarthy (2001), corpus linguistics represents cutting-edge change in terms of scientific techniques and methods, and probably foreshadows even more profound technological shifts that will impinge upon our long-held notions of education, roles of teachers, the cultural context of the delivery of educational services and the mediation of theory and technique (p. 125). Examination of large quantities of spoken and written texts has revealed language patterns and uses that had hitherto eluded intuition, and has resulted in improved dictionaries (see Fox, 1998) and grammars (see Biber, Johansson, Leech, Conrad, & Finegan, 1999, the Longman Grammar of Spoken and Written English [LGSWE], a grammar that draws on a corpus of 40 million words).

In addition, numerous studies have shown that the language presented in textbooks is often based on faulty intuition about how people use language. Holmes (1988, p. 40), for example, looks at epistemic modality in ESL textbooks as compared with corpus data and finds that many textbooks devote an unjustifiably large amount of attention to modal verbs at the expense of alternative linguistic strategies. Boxer and Pickering (1995) contrast speech acts in textbook dialogues with real, spontaneous encounters found in a corpus. Carter (1998) compares real data from the Cambridge and Nottingham Corpus of Discourse in English (CANCODE) with dialogues from textbooks, finding that the textbook dialogues lack core spoken language features such as discourse markers, vague language, ellipsis, and hedges. Kettermann (1995) highlights the mismatch between actual language use and the prescription in pedagogical grammars that reported speech constructions should follow the backshift rule for tenses (see also Baynham, 1991, 1996; McCarthy, 1998). Hughes and McCarthy (1998) look at the use of past perfect verb forms and find that across a wide range of speakers in CANCODE, the past perfect has a broader and more complex function in spoken discourse than hitherto described. Corpus descriptions have also enhanced the understanding of units of fixed phrasing, collocation, and language patterning (Aston, 1995; Murison-Bowie, 1996; Sinclair, 1991; Svartvik, 1991).

Despite these and many other findings from corpus research, Svartvik (1991) points out that the attitude to the use of corpora in linguistic research has had its ups and downs (p. 555). Many practitioners and applied linguists point to the problems of adopting corpus-based material in the language classroom (see, e.g., Cook, 1998; Owen, 1996; Prodromou, 1997a, 1997b; Seidlhofer, 1999; Widdowson, 2000). Sinclair (1991), for example, makes the case for the use of real language in the classroom by asserting that one does not study all of botany by making artificial flowers (p. 6). However, Widdowson (2000) warns that just because corpus data are real, one should not assume that using such data in the classroom will bring with it more reality (p. 7). The reality that corpus findings represent is, he argues, third- rather than first-person reality, and problems arise when partial description of decontextualised language (p. 7) is used to determine language prescription for the classroom. We argue, however, that in the end it is teachers who will engage in the process of recontextualising corpora and any useful findings from corpus-based description. It is teachers who will mediate between corpus-based content and the needs of the learners in their individual classroom contexts. To do this, teachers will need to be able to make informed decisions, and, not least of all, they will need to be able to assess the validity of the arguments that are made in relation to corpus findings and corpus use.

Carter and McCarthy (1995) and others have argued that language corpora are a useful resource for teachers and learners (p. 144). However, Tribble (2000) notes that despite the best efforts of people like Tim Johns, Guy Aston, John Flowerdew and myself not many teachers seem to be using corpora in their classrooms (p. 31). We argue that if corpus applications and corpus findings are to reach the right audience (i.e., language learners), they must be integrated at the very core of teacher education courses (see also Chapelle, 2001; Conrad, 2000). In the context of teacher education for teachers who are speakers of English as a lingua franca (ELF), Seidlhofer (1999) comments,
Teachers who have a good idea as to what options are in principle available to them, and have learnt to evaluate these critically, sceptically and confidently, are unlikely to be taken in by the absolute claims and exaggerated promises often made by any one educational philosophy, linguistic theory, teaching method or textbook. (p. 240)

However, many teacher educators themselves have not had extensive experience with corpora. We therefore hope that our experience in integrating corpus linguistics into teacher education courses is informative.

CORPUS APPLICATIONS FOR TEACHER EDUCATION


We discuss our understanding of corpora in teacher education through the three characteristics Sternberg and Horvath (1995) ascribe to an expert teacher. To be an expert teacher, one must be more knowledgeable, be more efficient, and have better insight than nonexperts (either experienced or inexperienced). Whether or not one accepts this characterization, the responsibility of initial teacher education courses is ultimately to aim to produce teachers who have at least started their journey along the road to expertise. To do so, we at the University of Limerick have attempted to increase students' pedagogic, linguistic, and sociocultural awareness by examining how linguistic choices are realized in the ESL/EFL classroom. We came to use corpora for this purpose because the materials that we had been using for methodological skills acquisition (i.e., commercially available classroom transcripts and video recordings) have two major shortcomings: (a) They have traditionally lent themselves almost exclusively to qualitative scrutiny, the conclusions of which may sometimes be elusive to and oversubjectified by inexperienced students; and (b) they fail to allow the practices of teaching to be interpreted within their contexts of realisation because many local contextualization cues are lost in their reproduction and extraction for third-party analysis operating in far-removed realities. In other words, nonpresent third parties in different educational or cultural surrounds cannot easily capture in their entirety the sociocultural and environmental factors that create and cast the lesson. This is particularly true in our Irish context, as many of the teacher education materials available commercially are either British or U.S. produced and often mismatch the conditions experienced by our students.

To rectify the contextual mismatch, we have been engaged in the process of building our own English language teaching classroom corpus to use in teacher education. For example, Farr (2002) reports on a study in which teachers' classroom interactions were recorded and then transcribed to form a minicorpus, which in turn was used as the basis for analysis of the correlation between question forms and productivity (i.e., the length of student response in numbers of words) in the language classroom. Our classroom corpus will ultimately include four types of transcriptions: experienced teachers operating in different sociocultural settings from our students, experienced teachers operating in the same sociocultural settings, other students operating in different sociocultural settings, and our students during their on-site teaching practice sessions. Another area of application for corpora in language teacher education that we look at is raising linguistic awareness (relating to the knowledge category as detailed by Sternberg & Horvath, 1995). However, in addition to pedagogical and linguistic awareness, and fundamental to the evolution of corpus use in the context of English language classrooms around the world, teachers need to develop a critical awareness of what corpus findings represent. As we illustrate, corpus investigations can engender enquiry in prospective teachers so that they do not readily accept corpus findings as absolute truths.

TECHNOLOGICAL EXPERTISE FOR CORPUS EXPLORATION


To work with corpora, students need some basic technological expertise. At first, corpus linguistics can seem very daunting, and teacher educators should be careful not to frighten students off with seemingly complex statistics and computations. It is crucial, we have found, to start with a basic distinction between a corpus, which is essentially a collection of texts (see Biber, Conrad, & Reppen, 1998), and the software that one can use to analyse it. Teachers who choose to use corpora in their language classrooms will need to be discerning about software and corpora, and, at the most basic level, they will need to know the common functions and applications of the available software.

Concordancing
We always begin with concordancing as it is a core tool for analysis in corpus linguistics. Concordancing is the process of using software to search for all the occurrences of one word (or phrase) in a corpus. All of the occurrences are presented with the node word/phrase (the one searched for) in the centre of the line, with seven or eight words presented at either side of the node word. Depending on the software, the number of words at either side of the node word or phrase can be adjusted to allow for more context. The sample of concordance lines for the word made in Figure 1 was produced with the Collins Cobuild (n.d.) Corpus Concordance Sampler (freely available online; see the Appendix for Web sites and software relevant to corpus linguistics). It provides 40 examples based on any or all of the following corpora: British books, ephemera, radio, newspapers, magazines (26 million words); U.S. books, ephemera, and radio (9 million words); and British transcribed speech (10 million words). Apart from free Internet concordancing sites, many commercially available software packages allow the user to go back to the original source text of any one of the lines or at least provide a much larger amount of context if required.

A key manipulation of a concordance involves sorting alphabetically to the left and to the right of the node word or phrase. Figures 2 and 3 show an example produced with WordSmith Tools (Scott, 1996) analysing the Corpus of Spoken Professional American English (CSPAE, 2000, a 2-million-word corpus on CD-ROM made up of academic discussions, committee meetings, and White House press conferences). The node word is still made, but this time we present the line samples in two different sorting formats: sorted to the left (Figure 2) and sorted to the right (Figure 3) of the node word. By looking to the left and to the right of a word with our students, we find more information about the grammatical and collocational patterns that emerge for the word. Comparing left and right concordance lines of the same word whets students' appetites, and they are soon gripped by evolving patterns of collocation, that is, the tendency of words to combine with other words.
The study of collocation is one of the main
FIGURE 1 Extract of Concordance Lines for the Word Made
Eighteen western governments have made a joint protest to the Burmese
to come to London for it. Smith had made a unilateral declaration of
I understand what you mean. I made a list of every regret I could think
associated products similar to those made by Cooper. Before expending money
Basso, a New York designer who has made clothes for Elizabeth Taylor and
000lb bomb. [p] The terrorists' home-made device was discovered in a van just
and several hundred submissions were made either in person or in writing. [p]
also get help with interest on loans made for financing essential repairs or
wok. This impressively solid pan is made from carbon steel with easy-care non
changed costs thousands. Home-made gift check whether it is genuine or
word. Once all the words have been made, have them close their holders and
forms of alternative treatment have made headlines. The first, based on shark

Note. Generated with the Collins Cobuild Corpus Concordance Sampler (n.d.).

FIGURE 2 Concordance Lines of Made Sorted 1L and 2L


   uestions. Somehow this math could be  made  a lot more specific, and we could b
 about the fact of whether it should be  made  a bit more explicit. One reason I r
cuments of the DNC that ought not to be  made  a matter of public record because
   , are we have decisions already been  made  about the fact that this is going
second question. The statement has been  made  a couple of times that parents sho
 ing that's in jeopardy, he's certainly  made  a, I think, a concerted effort to s
e is one area in which President Chirac  made  a specific point about the U.S. ro
ho doesn't think that President Clinton  made  a bold move. But Chapter 1, page 1
         GOLAN: I think we as a country  made  a commitment to spend money on TIMS
u know now Deputy Secretary of Defense,  made  a key recommendation that the Defe
s though, just as the point was earlier  made  about the greater accessibility of
    y it. It's an eighth grade test. Ed  made  a good suggestion that I thought ev
   . Yes, in fact, that in fact, I even  made  a suggestion for this meeting that

Note. Generated with WordSmith Tools (Scott, 1996) using data from the Corpus of Spoken Professional American English (2000). 1L = first word to the left; 2L = second word to the left.

applications of concordancing. Fox (1998) gives the example of high and tall. Even though they are roughly synonymous, they cannot always be used interchangeably; for example, one can say a high building but not a high man. Similarly, McCarthy (1990) gives the example of blonde, which is very likely to collocate with hair but unlikely to occur with wallpaper or car. Stevens (1995) suggests that using concordances with students can develop cognitive and analytic skills for solving real-language problems. However, we find that learners need some training before they can make the most of concordance lines, including seeing collocational patterns.

Reading a concordance line takes a little getting used to. The instinctive reaction is to try to read it in detail in the usual way, from left to right. We have found it is best to skim it initially from top to bottom, looking only
FIGURE 3 Concordance Lines of Made Sorted 1R and 2R
         GOLAN: I think we as a country  made  a  commitment to spend money on TIMS
 lear. He believes it's important. He's  made  a  commitment to get it done by the
rticularly sort of concerned with this,  made  a  commitment at the beginning of t
of the lack of effort and they now have  made  a  commitment. But they can answer
     EINWAND: I don't think that Ed has  made  a  compelling case to back away fro
 inced over the last hour that anyone's  made  a  compelling case that we gain anyt
  t come to that conclusion. He has not  made  a  conclusion of that. It's Senator
  n intelligence activity in Bosnia. We  made  a  condition of our train-and-equip
 on who will listen with whom they have  made  a  connection with in their freshmen
   ther participate in that process. We  made  a  couple of determinations sugge
   maybe a verbatim. I don't recall who  made  a  couple of changes in the language
second question. The statement has been  made  a  couple of times that parents sho
             ctions. VOICE: But we have  made  a  deal? MYERS: Were n
   tion. And in schools where they have  made  a  decision not to use, they shouldn

Note. Generated with WordSmith Tools (Scott, 1996) using data from the Corpus of Spoken Professional American English (2000). 1R = first word to the right; 2R = second word to the right.

at the central patterns and working outward from them. For example, doing this with the concordance lines for made in Figure 3 reveals that it collocates frequently with a case, a commitment, a decision, and so on. Thompson (1995) provides some activities for practising skimming concordance lines in class and for developing strategies for guessing the general context from sample line fragments. Fox (1998) notes that the use of concordances in the classroom is in its infancy as a language teaching technique (p. 43), and she provides many useful examples of their application and noteworthy considerations for their use. Other ideas for using concordances in class are given by Flowerdew (1996), Johns (1997), Stevens (1991), Tribble (1997), and Tribble and Jones (1990, 1997), among others. A number of Web sites also provide online samples and sample activities (see the Appendix).
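The concordance display and the left/right sorting described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not how the Cobuild sampler or WordSmith Tools work internally: the function names (kwic, sort_right, sort_left, show), the character-based context window, and the sample text are all assumptions made for the example.

```python
import re

def kwic(text, node, width=30):
    """Return (left, node, right) context triples for every hit of `node`."""
    hits = []
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

def sort_right(hits):
    """Sort on the first, then second, word to the right of the node (1R, 2R)."""
    return sorted(hits, key=lambda h: h[2].lower().split()[:2])

def sort_left(hits):
    """Sort on the first, then second, word to the left of the node (1L, 2L)."""
    return sorted(hits, key=lambda h: h[0].lower().split()[::-1][:2])

def show(hits, width=30):
    """Print the hits with the node word aligned in a central column."""
    for left, node, right in hits:
        print(f"{left:>{width}} {node} {right}")

# Invented sample text, not taken from any of the corpora discussed.
sample = ("We made a decision early. They made a commitment to act. "
          "She has made a case for change. He certainly made a bold move.")
show(sort_right(kwic(sample, "made")))
```

Sorting to the right groups the lines by what follows the node word, which is what makes patterns such as made a case or made a commitment visible when the lines are skimmed from top to bottom.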

Word Frequency Lists


Another function common to corpus software is the calculation of word frequency lists (or word lists) in any batch of texts. We find that it is important to focus on this function as it facilitates enquiry in our students. If they have learned this function, when they see a statistic from corpus linguistics, they can use the corpora available to them to compare findings across language varieties and contexts, and soon they become aware that contextual factors are paramount in analyses of corpora.

Figure 4 presents a typical activity we might do with our students. We compared the word frequencies of the following sets of data: (a) shop encounters in Ireland (8,500 words from the Limerick Corpus of Irish English [L-CIE, 2003]), (b) female friends chatting (40,000 words from the L-CIE), (c) the Australian Corpus of English (1 million words of written Australian English; see ICAME Collection of English Language Corpora, 2000), and (d) the 10 most frequent words from the Cambridge International Corpus based on a 100,000-word sample of newspapers and magazines as presented in McCarthy (1998, pp. 122–123).

From just the first 10 words of these data sets, our students can see a divide between spoken and written language. In the spoken results, they find markers of the interactive nature of spoken English, such as I, you, yeah (as a response token), like, please, and thanks. Comparing the Australian written corpus results with the first 10 words from the Cambridge International Corpus, trainees find that the results are almost identical. The other important issue highlighted by this short comparison is that even though both of the first word lists are from the Irish spoken corpus (L-CIE), they are not identical. The shop data have obvious traces of context with high-frequency items, including thanks, please, and the discourse marker now.

FIGURE 4 Comparison of Word Frequencies for the 10 Most Frequent Words Across Four Different Data Sets

Rank  Shop (L-CIE)  Friends (L-CIE)  Australian Corpus of English  Cambridge International Corpus (McCarthy, 1998)
      Spoken        Spoken           Written                       Written
1     you           I                the                           the
2     of            and              of                            to
3     is            the              and                           of
4     thanks        to               to                            a
5     it            was              a                             and
6     I             you              in                            in
7     please        it               is                            is
8     the           like             for                           for
9     yeah          that             that                          it
10    now           he               was                           that

Note. L-CIE = Limerick Corpus of Irish English.
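A word frequency comparison of the kind shown in Figure 4 can be approximated as follows. This is a minimal sketch under stated assumptions: the tokeniser is a simple regular expression, the helper names word_list and distinctive are invented for illustration, and the two miniature samples are made up rather than drawn from L-CIE or any other corpus.

```python
import re
from collections import Counter

def word_list(text, top=10):
    """Return the `top` most frequent word forms with their raw counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top)

def distinctive(text_a, text_b, top=10):
    """Words in the top list of text_a that are absent from the top list of text_b."""
    top_b = {word for word, _ in word_list(text_b, top)}
    return [word for word, _ in word_list(text_a, top) if word not in top_b]

# Invented miniature samples standing in for a spoken and a written data set.
shop = "Now is that everything? Yeah thanks. Two euro please. Thanks. Now thanks."
written = "The cost of the repairs is set out in the report of the committee."

print(word_list(shop, 3))
print(distinctive(shop, written, 3))
```

Listing the items that appear in one top-10 list but not in another is one simple way for trainees to see the traces of context (for example, thanks and now in service encounters) discussed above.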

CORPUS APPLICATIONS TO THE ACQUISITION OF PEDAGOGIC PRACTICE


As mentioned above, Sternberg and Horvath (1995) present three characteristics associated with the prototypical category of expert teacher: (a) teaching knowledge, (b) teaching efficiency, and (c) teaching insight. Within this framework, we have structured classroom corpus tasks in our programme. We present and discuss samples of these below. We draw upon the corpora for developing learning activities that attempt to develop these three areas of expertise.

Acquiring Teaching Knowledge


Three types of knowledge are necessary for expert teaching, according to Shulman (1987). The first is content knowledge of the subject matter to be taught. Suggestions for how this knowledge, that is, knowledge of English, can be acquired with the aid of corpora are offered in the section Corpus Applications in Raising Linguistic Awareness. The second type is pedagogic knowledge, which includes skills such

as classroom management and motivational strategies (e.g., effective questions, nomination, instructions, student groupings, classroom organisation, use of teaching aids, lesson planning). Finally, content-specific teaching knowledge (Sternberg & Horvath, 1995, p. 11) includes applying teaching knowledge in a specific sociocultural and organisational setting. This knowledge tends to be more tacit (Freeman, 1991) and therefore more elusive to acquisition, but it is nonetheless a determining feature of a distinguishable expert teacher (Sternberg & Horvath, 1995, p. 12).

We use a series of tasks that incorporate the use of classroom corpora for advancing students' pedagogic and content-specific teaching knowledge of effective questioning strategies. Using the example shown in Figure 5, trainees start by looking at questioning patterns in our classroom corpus. They investigate the correlation between a question type and its productivity (they quickly notice, e.g., how much more productive referential questions are than yes/no questions). They are then asked, in Task (c), to look more broadly at the placement of

FIGURE 5 Sample Material Based on the Limerick Corpus of Irish English for Raising Awareness of Pedagogic Knowledge
a) Run concordances of questions used in the classroom corpus to determine their frequencies (wh- questions can be extracted by searching each of the wh- questions individually, and yes/no and intonation questions can be found by searching ?).
b) Analyse and compare the productivity of each question type by running an analysis of student responses in terms of length and quality (use up to 10 examples of each question type).
c) How does each type fit in the typical initiation-response-follow-up (IRF; Sinclair & Coulthard, 1975) classroom exchange structure? Use the KWIC facility (see note a) to help with your analysis.
d) Compare and contrast the place of questions in the IRF model with their place in other discourse structures in two additional registers of your choice from L-CIE.
e) Investigate how questioning integrates with other strategies, for example, nomination or gesture using both the transcriptions and video recordings in a qualitative way. Pay particular attention to the contextual and pragmatic factors at play.
f) Compare data from Subcorpus X (expert teachers) with Subcorpus Y (nonexpert teachers; see note b) and comment on good and bad practice in context.
g) Transcribe part of one of your teaching practice lessons where you are eliciting from students using questions. Analyse your questioning strategies and note your reflections in your teaching journals to form the basis of a comparative discussion with your peers in the coming weeks.
a. Key word in context. Instead of viewing only short concordance lines, students can view an extended context for each occurrence of the search term.
b. We have found it useful sometimes to use data from expert and nonexpert teachers (instead of experienced versus inexperienced teachers) so that we do not establish a belief that inexperience equates with lack of expertise, and vice versa.

questions + response + follow-up for each question type within Sinclair and Coulthard's (1975) initiation-response-feedback model. Task (d) asks students to compare the question patterns across nonclassroom contexts so that they see how different the structure is elsewhere. For example, in casual conversation, it would be unusual for a friend to ask a question and to follow up the answer with an evaluation like very good. This comparison brings to light how predetermined teacher-led classroom discourse can be. Task (e) focuses the students on the broader realm of classroom management by asking them to look at the combination of strategies that are employed in questioning, such as asking the question, scanning, and then nominating. By comparing questioning patterns between expert and nonexpert teachers in Task (f), the students can discern effective and ineffective practices. Task (g) initiates a longer term reflective process in which students use their own data and reflect on their own strategies.

We have found our classroom corpus to be very useful because it allows us to conduct quantitative and qualitative analysis of almost any aspect of classroom interactions. Wegerif, Mercer, and Rojas-Drummond (1999), for example, provide excellent commentary on and description of how they have applied corpus techniques to the comparative analysis of the effectiveness of different teaching approaches in a Mexican context. They empirically examine the influence that the teacher's sociocultural approach has on the development of problem-solving skills among students. In quantitative and qualitative analyses of a transcribed video and audio corpus of classroom language, they investigate the corpus of talk from two classrooms employing different teaching methodologies over a period of 1 year in relation to the problem-solving skills the students develop during this period.
Verbal strategies are isolated, allowing the researchers to uncover techniques used for the social construction of knowledge through scaffolding the pupils' engagement in independent problem-solving and reasoning (p. 133).
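The quantitative side of Tasks (a) and (b) in Figure 5 might be approximated as in the sketch below. It rests on loud assumptions: question type is guessed from the first word only (wh- word versus anything else), productivity is reduced to response length in words, and the question-response pairs are invented rather than taken from the classroom corpus. The qualitative judgement of response quality that Task (b) also asks for is not captured.

```python
WH_WORDS = {"what", "when", "where", "which", "who", "whom", "whose", "why", "how"}

def question_type(question):
    """Crude classification: 'wh' if the first word is a wh- word, else 'other'."""
    first = question.strip().split()[0].lower()
    return "wh" if first in WH_WORDS else "other"

def mean_response_length(exchanges):
    """Average student response length (in words) for each question type."""
    totals = {}
    for question, response in exchanges:
        qtype = question_type(question)
        count, words = totals.get(qtype, (0, 0))
        totals[qtype] = (count + 1, words + len(response.split()))
    return {qtype: words / count for qtype, (count, words) in totals.items()}

# Invented teacher question / student response pairs for illustration only.
exchanges = [
    ("What did you do at the weekend?", "I went to Galway with my sister"),
    ("Did you like it?", "Yes"),
    ("Why do you think so?", "Because the people were friendly"),
    ("Is that clear?", "Yes teacher"),
]
print(mean_response_length(exchanges))
```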

Acquiring Teaching Efficiency


Through other activities, we attempt to engender awareness of efficiency in students. The short activity shown in Figure 6, which has instruction giving as its focus, is based on the notion of teacher modes (see McCarthy & Walsh, 2003), whereby teachers are said to have various modes of talk in the classroom. By assessing and increasing their awareness of these modes, teachers can improve classroom competence. Here we focus on the instructional mode, in which teachers are giving instructions to the students. First, we ask students to generate a word list using our classroom corpus and then to isolate all the verbs within this

FIGURE 6 Sample Material Based on the Limerick Corpus of Irish English for Raising Awareness of Pedagogic Efficiency
a) Run a word frequency list for the classroom corpus and isolate all the verbs.
b) Identify which verbs are likely to be used when the teacher is in instruction mode, and run concordances of their imperative forms to test your hypothesis.
c) Search for any other key word(s) you think may be used frequently when giving instructions, for example, Let's, can you/we, please.
d) Isolate three instruction-giving episodes and examine their entire contexts to comment on the language, procedures, and pacing. Find examples of redundancies or inaccuracies in the teacher's instructions, and comment on the pace of delivery.
e) Rewrite the instructions in a way that you consider to be more efficient.

list. Task (b) asks students to predict which of these verbs are used in giving instructions and to check their predictions by means of concordancing. Tasks (b) and (c) focus the students on the imperative nature of instructional talk, and Tasks (d) and (e) focus qualitatively on the need to conduct instructional episodes with precision and clarity. We find that our students begin to develop the desired reflectivity and insight from this activity because it provides them with a framework within which to measure their practice.
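Task (c) in Figure 6, searching for key words and phrases that may signal instruction giving, could be sketched as a simple phrase count over teacher turns. The marker list comes from the examples in the task; the function name count_markers and the teacher turns are invented for illustration and are not drawn from the classroom corpus.

```python
import re
from collections import Counter

# Candidate instruction markers, taken from the examples given in Task (c).
MARKERS = ["let's", "can you", "can we", "please"]

def count_markers(teacher_turns, markers=MARKERS):
    """Count whole-word occurrences of each marker across the teacher's turns."""
    counts = Counter()
    for turn in teacher_turns:
        lowered = turn.lower()
        for marker in markers:
            pattern = r"\b" + re.escape(marker) + r"\b"
            counts[marker] += len(re.findall(pattern, lowered))
    return counts

# Invented teacher turns for illustration only.
turns = [
    "Let's look at page ten, please.",
    "Can you open your books?",
    "Please work in pairs. Can you hear me?",
]
print(count_markers(turns))
```

A count like this only points to candidate episodes; the qualitative work in Tasks (d) and (e) still requires reading each instruction in its full context.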

Acquiring Teaching Insight


Insight is the ability to solve problems in creative and effective ways. Sternberg and Horvath (1995, p. 14) give the example of teachers using analogy to help students understand difficult concepts. Instances of successful teacher insight skills can be isolated through qualitative analysis of classroom corpora with expert teachers, for example, by asking questions such as: In this lesson how does the teacher effectively explain differences in use between the various conditional structures in English? Relate your answers to the teacher presentation stage of the lesson and also to subsequent student production. Even more beneficial is the remedial self-examination of novice teachers' transcripts for parts of the lesson where they encountered difficulties not anticipated during preparation.

In the example shown in Figure 7, we again use our local classroom corpus to focus on a typical classroom dilemma that all novice teachers can relate to: A student asks for a detailed lexical explanation that the teacher has not anticipated. Tasks (a) and (b) first ask students to draw on the standard dictionary resource to find the difference between the problematic words and then

FIGURE 7 Sample Material Based on the Limerick Corpus of Irish English for Raising Awareness of Pedagogic Insight
Student: What's the difference between collaborate and cooperate?
Trainee: Well collaborate is generally used for something which is negative and cooperate is more positive.
Student: So can I say I am cooperating with Maria on this project? Collaborate would be wrong here?
Trainee: Well yes, no, mm I'm not too sure. What does the dictionary say? Let's check.

a) Use a dictionary to find the differences in meaning between these two words.
b) Use any large corpus from the electronic library to establish how these near-synonyms differ in terms of use and lexical patterns.
c) Redesign the part of the lesson in the extract above to make it more effective.

to use a corpus concordancer to compare their patterns in contexts of use. Through this activity, students see how concordancing can greatly enhance a dictionary definition by allowing many patterns of use, in many contexts, to be viewed at once. Task (c) leads students inductively back to classroom application.
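For Task (b) in Figure 7, one way to compare the lexical patterns of two near-synonyms is to count the words that immediately follow each of them. The sketch below is an assumption-laden illustration: right_collocates is a hypothetical helper, the regular expressions are a rough way of catching inflected forms, and the sentences are invented, so the counts say nothing about how collaborate and cooperate actually behave in a real corpus.

```python
import re
from collections import Counter

def right_collocates(text, node_pattern, span=1):
    """Count the words occurring within `span` tokens to the right of the node."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if re.fullmatch(node_pattern, token):
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Invented sentences; a real investigation would load a large corpus instead.
sample = ("They collaborated with the enemy. She cooperates with the police. "
          "We collaborate on research. He cooperated with investigators. "
          "They collaborated on a book.")

print(right_collocates(sample, r"collaborat\w*"))
print(right_collocates(sample, r"cooperat\w*"))
```

Widening span (for example, to 3) brings in the nouns that follow the preposition, which is usually where differences in use between near-synonyms start to show.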

CORPUS APPLICATIONS IN RAISING LINGUISTIC AWARENESS


All teachers in an initial language teacher education programme expect to attain a high level of descriptive linguistic competence for the language they are going to teach. Gabrielatos (2002/2003) argues that if teachers are to become more than skilled materials operators, then teacher education needs to focus more consistently on research skills, as well as language analysis and its implications for ELT (p. 3). Corpora offer great potential for developing language awareness and research skills within teacher education (see Coniam, 1997; Hunston, 1995; Kennedy, 1995). The following examples illustrate corpus activities from our teacher education program intended to develop students' understanding of word classes, register, and socioculturally conditioned choices of word classes in context. From our experience, a grammatically tagged corpus (one where all of the items used have been labeled for their word class) is a very useful supplement to the development of critical knowledge of the English syntactic system. A useful sequence of activities is as follows:

1. Students are presented, either deductively or inductively, with the theory of word classes, including information on meaning, distribution, and inflection taken from a variety of grammar reference books.
2. They practise identifying the word classes in pedagogically designed texts.
3. They are presented with an untagged version of a text from a corpus and, in groups or individually, try to identify the word classes.
4. They check their answers against the tagged version of the same corpus, carefully examine any inconsistencies, and use them as the basis of a search for a particular word to further test their hypotheses. For example, they may examine the classification of the word right, which can function in different ways in different contexts.

This process develops a sense of enquiry, leading from the students' own research question to inductive exploration of a corpus as a problem-solving resource. Both the ICAME Collection of English Language Corpora (2000) and the International Corpus of English: The British Component (ICE-GB; 1998) contain a rich supply of grammatically tagged data. A tagged corpus also proves a very useful resource for the independent study of syntax whereby the tagging serves as a ready-made answer key that students can consult. A sample activity with concordance lines, which we use to develop awareness of lexis and word classes, is shown in Figure 8. Such activities develop language awareness inductively and frequently lead students to form more research questions. Many student investigations, from our experience, lead to interesting comparisons across large-scale corpora available to students in our electronic library. Sometimes these mini research projects initiate a line of enquiry that can lead to the research question for an undergraduate project or even a graduate thesis.
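Step 4 of the sequence above, checking one's own word class labels against a tagged corpus and then following up a single word such as right, can be sketched as follows. The word_TAG format, the tag labels, and the helper names are assumptions made for this example; the tagged corpora mentioned in the text each have their own formats and tagsets, which would need a different parser.

```python
from collections import Counter

def parse_tagged(tagged_text):
    """Split hypothetical 'word_TAG' items into (word, tag) pairs."""
    pairs = []
    for item in tagged_text.split():
        word, _, tag = item.rpartition("_")
        pairs.append((word.lower(), tag))
    return pairs

def tags_for(word, pairs):
    """Count how often `word` carries each tag in the tagged text."""
    return Counter(tag for w, tag in pairs if w == word.lower())

def mismatches(student_tags, pairs):
    """List positions where the student's label differs from the corpus tag."""
    return [(i, word, guess, tag)
            for i, ((word, tag), guess) in enumerate(zip(pairs, student_tags))
            if guess != tag]

# Invented tagged sentence and student answers, for illustration only.
tagged = "turn_VB right_RB at_IN the_DT right_JJ corner_NN"
pairs = parse_tagged(tagged)
student = ["VB", "JJ", "IN", "DT", "JJ", "NN"]

print(tags_for("right", pairs))
print(mismatches(student, pairs))
```

Each mismatch gives the student a concrete starting point for a follow-up search on that word across the corpus.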

Register-Specific Linguistic Choices


Although concordance-based searches and investigations can provide the basis for many insights into lexical patterns and profiles, there is also scope to explore grammatical patterns using a corpus. Figure 9 displays a task that focuses students on a grammatical item commonly presented in textbooks: question tags. This task also aims to develop a sense of questioning about corpus findings. Here the general aim is to show how results vary depending on the type of corpus used; these differences highlight the importance of contextual factors and of cross-checking findings.


FIGURE 8 Sample Material Based on the Limerick Corpus of Irish English for Raising Lexical Awareness
Below are concordance lines for the word dead.
a) Identify its different word classes from these examples.
b) Do any collocational patterns emerge from this evidence?
c) Divide the different examples into positive and negative meanings.
d) What synonyms could be used for the intensifier uses of dead?
e) Identify the examples of idioms based on the word dead. Use a corpus to find some more.
by this time Pa wouldve been well at a street corner and shoot you trees some of them didnt take enough ground to bury our seven people were shot and Bernie is all the possums will be left up there he pays this tribute to the poet youre great height. chances are youd be over your sounds sounds a dleton Murray couldnt compete with the addressing her pretend it started off the guys still living with her and said Stan was you know i mean the guys police believe that they were also shot job under the table um and do it e hands are distraught winds waking the g the ultimate shot in bowls either the oh but thats its ant nd it and were both at a ts of ways especially as his mother was everyone has a way of burying their three now and er hes been dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead dead 7:00 of course 8:48 seven a great many big ones which and an eighth and he got him from thirty and so its like er you and so forth stanza four before you hit the ground body huh bore so far brother and felt resentful brother in her journal she said but hes sitting on the couch and but then the telegram said but they this other by the same trio cheap cymbalic reeds at the edge draw or the trail of the easy once you get used to it end um er for eleven hours

Using the example of question tags, we present findings across various corpora: the U.S. CSPAE (2000; White House press conferences and academic meetings); the Wellington Corpus of Spoken New Zealand English (WSC; see Holmes, Vine, & Johnson, n.d.); the Bergen Corpus of London Teenage Language (COLT; see University of Bergen, 2000); and the Lancaster/IBM Spoken English Corpus (SEC, n.d.). We first ask students to compare these findings from spoken corpora with those from written sources so that they see how rare question tags are in writing (in fact, they are used only in direct speech or in cases where the author addresses the reader). The spoken findings that we present show that question tags are vastly more frequent in COLT, but in Tasks (c) and (d) students see that it would be erroneous to assume that question tags are


FIGURE 9 Sample Material Based on the Limerick Corpus of Irish English for Raising Awareness of Grammatical Patterns
In the graph below are the results for question tags ending in you? from:
• two subcorpora within the Corpus of Spoken Professional American English (CSPAE): 1 million words of White House press conferences and 1 million words of academic discussions and meetings
• the Wellington Spoken Corpus (WSC) from New Zealand (1 million words)
• the Corpus of London Teenage Language (COLT) (1 million words)
• the Lancaster/IBM Spoken English Corpus (SEC) (55,000 words; these results have been normalised; see note a)
[Bar graph: number of question tags ending in you? (vertical axis from 0 to 350) for White House, Acad. meetings, WSC, COLT, and SEC]

Investigate the use of question tags in these and other spoken and written corpora to address the following questions:
a) Are question tags more frequent in other spoken language corpora compared to written data?
b) How are question tags used in written language?
c) Do you think question tags are used less frequently in American English?
d) What is the impact of context of use on the frequency?
e) Use any two corpora to compare findings for question tags ending in I, he, she, it, we, they.
f) What lessons can be learnt about care needed in selecting a corpus for your research?

a. To make frequency results comparable, they need to be normalised as follows: In the SEC we found three question tags ending in you? This was divided by the total corpus size (55,000 words) and multiplied by 1 million, resulting in 54.5. This figure is then comparable with the other results, which are all from 1-million-word corpora.

a British phenomenon. The fact that the U.S. data came from more formal contexts than the British data affects the results. Tasks (e) and (f) focus on the need to compare data across corpora and to consider the effect of the context of the different corpora.
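The search and the normalisation step described in the note to Figure 9 can be sketched together. The regular expression is a deliberately crude assumption about what counts as a question tag ending in you? (an auxiliary, optionally negated, followed by you and a question mark); it will miss some tags and may catch short elliptical questions, so real results would need manual checking. The per_million function reproduces the arithmetic given in the note.

```python
import re

# Rough pattern: auxiliary (optionally negated) + "you" + "?". An assumption,
# not the search actually used to produce the figures reported in Figure 9.
TAG_YOU = re.compile(
    r"\b(?:(?:do|did|are|were|have|had|could|would|should)(?:n't)?"
    r"|can(?:'t)?|will|won't)\s+you\s*\?",
    re.IGNORECASE,
)

def count_you_tags(text):
    """Count candidate question tags ending in 'you?'."""
    return len(TAG_YOU.findall(text))

def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count / corpus_size * 1_000_000

# Invented sample; only the 3-in-55,000 figure below comes from the text.
sample = ("You like it, don't you? You can't swim, can you? "
          "Do you like it? You will come, won't you?")
print(count_you_tags(sample))
print(round(per_million(3, 55_000), 1))
```

Normalising in this way is what allows a 55,000-word corpus such as the SEC to be set alongside the 1-million-word corpora in the same graph.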

Sociocultural Grammatical Choices


As we have stated, a central issue for the evolution of corpus use in English language classrooms around the world is the development of a critical awareness of what corpus findings represent. As illustrated above, structured corpus tasks can promote enquiry in novice teachers so that they do not readily accept corpus findings as absolutes. We feel strongly that the scrutinising of corpus findings, especially those from large-scale corpora, needs to be given overt attention. In particular, we stress the need to consider the sociocultural factors from which corpus data come, as these factors reveal much about how language is pragmatically sensitive to context. In this section, we give practical illustrations of how corpora can be of benefit in raising awareness of the sociocultural diversities that often underlie language use.

In one teacher education activity, we compare the frequency of modal verbs in British and American English, presented in the LGSWE (Biber et al., 1999), with Irish English in the L-CIE and New Zealand English in the WSC (Figure 10). One of the noticeable differences is the high occurrence of would in the Irish data. The high frequency of would in Irish English corresponds to uses of would that go beyond the canonical characterisations in standard English. However, the Irish data also include would as a hedge consistent

FIGURE 10 Distribution of Modal Verbs Across Three Corpora (per Million Words)
[Bar graph: frequency per million words (vertical axis from 0 to 5,000) of would, can, could, should, might, will, may, must, and shall in the LGSWE, L-CIE, and WSC]

Note. LGSWE = Longman Grammar of Spoken and Written English; L-CIE = Limerick Corpus of Irish English; WSC = Wellington Corpus of Spoken New Zealand English.

with other varieties of English; for example, in this extract from an encounter between a teacher educator and a novice teacher (following a teaching practice observation), the teacher educator uses would in a convergent sense instead of making a more direct statement like You should have allowed them to work through it all (for further discussion, see Farr & O'Keeffe, 2002):
Trainer: Do you think it would have been possible at all to just leave them work through them all?
Trainee: . . . I would say so.
Trainer: Mm.
Trainee: Given your time I would say so.
Trainer: Umhum.

However, would is also used in Irish English where hedging might not be expected in many other varieties of English, and its use is thus central to the sociocultural level of the interaction. Irish speakers appear to be very tentative, far beyond the demands of the interaction itself, even in situations where the propositional content of the utterance is unquestionable. For example, in the extract below from an Irish radio call-in show, a caller hedges about her hair colour, which was black but is now brown:
Caller: . . . I would have had black hair you know my hair would be brownish now . . .
Presenter: Right.

In another example, two friends reminisce, and would is again used for a factual statement about the location of a shop (Swamp refers to a chain of clothing stores):
Speaker 1: Speaker 2: Where was it? Upper William Street. William Street. Across the road from ah. What's the name of it? Coffee place. Coffee. It would be across the road from say Swamp now. She used to take me in there and I used to get to drink coffee. I used to love it.

This analysis of local language use in contrast to British/American use allowed us to discover and explore a layer of tentativeness in Irish interactions, in which downtoning of indisputable facts appears to be a sociocultural norm. It is not unreasonable to expect advanced English learners, particularly as they become more proficient, to become better at recognizing such sociocultural nuances in the language they hear if teachers help them develop an interest in and ability to work with naturally occurring data. However, Rundell (1997) raises a very pertinent, broadly related question: whether imposing the idiosyncratic

linguistic features of one specific dialect of English is really an appropriate model for a majority of learners (p. 97). In reply to his own scepticism, he points to the importance of recognizing that the specific ways in which people encode meaning reflect deeply embedded cultural characteristics. We argue that initial teacher education programs must address this level of language variation and that prospective teachers need to learn cross-corpora comparison skills in order to facilitate critical investigation of the transferability and application of corpus findings to the broader sociocultural context of their learners.

PRACTICAL ISSUES IN USING CORPORA IN TEACHER EDUCATION


Developing these and other activities over the past years has raised a number of practical issues, including the pros and cons of building versus buying a corpus; the inclusion of spoken versus written data; the choice of small versus large corpora; the inclusion of native speaker, nonnative speaker, or learner language; and the use of paper versus online corpus work for the students.

Build Versus Buy


Many corpora are now commercially available, and some can even be purchased for under U.S.$100. As we have illustrated, having a wide variety of corpora allows for more in-depth investigation across variables such as social context and language variety. However, in some cases, the varieties or contexts one wishes to include may never have been compiled into a corpus (e.g., as was the case for us with Irish English; see also Aston, 1997; Maia, 1997), and the only solution is to build one's own corpus. Much has been written on the principles of corpus design (see Biber et al., 1998; Crowdy, 1993, 1994; Hunston, 2002; McCarthy, 1998); however, the serious resource implications of building a corpus, especially a spoken corpus, are worth emphasizing. As an example, in building L-CIE, a 1-million-word spoken corpus, the following core costs needed to be budgeted carefully:

1. collection of data: Individuals need to be paid to record the data. One hour of recording includes 10,000–15,000 words (depending on the type of talk). We therefore needed to record more than 100 hours of material to ensure that we would get 1 million words. The most cost-effective means of collecting data that we have found is to pay a set price per tape, for example, $30 per 1-hour tape, rather than costing the person-hours involved in the collection of 1 hour of data.
2. transcription of data: The data then need to be transcribed. The cost of transcription, which depends on the level of detail desired, was at a minimum $150 per hour of tape (i.e., around $15,000 for 1 million words).
3. corpus research associate: This person is responsible for the day-to-day maintenance and building of the corpus. Duties may include managing the collection and transcription of data, ensuring that recordings cover the desired distribution of contexts, cataloguing cassettes, making a database of header information, and filing speaker information and consent forms for all spoken data collected (consent forms serve as legal contracts in which the individuals who are recorded stipulate that their conversations may be used for research and possibly in the development of pedagogical materials). Ideally, this person should work full-time for the first year in the collection of a 1-million-word spoken corpus and remain part-time thereafter to maintain and update the corpus.

Written corpora are easier to compile because they do not involve recording and transcription (though if the original texts are not electronic, the time and cost of scanning must be factored in). At the same time, written corpora have the added concern of copyrights. In sum, a corpus compilation project, whether spoken or written, is not to be undertaken without considerable planning and financial resources.
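The arithmetic behind the collection and transcription figures above can be sketched in a few lines of Python. This is only an illustrative calculation, not a budgeting tool we used: the class and field names are hypothetical, the default rates are the example figures quoted above, and the research associate's salary is deliberately left out because no figure is given for it.

```python
import math
from dataclasses import dataclass


@dataclass
class SpokenCorpusBudget:
    """Rough cost estimate for recording and transcribing a spoken corpus."""
    target_words: int = 1_000_000
    words_per_hour: int = 10_000               # conservative end of the 10,000-15,000 range
    collection_cost_per_hour: float = 30.0     # set price per 1-hour tape
    transcription_cost_per_hour: float = 150.0  # minimum rate quoted above

    def hours_needed(self) -> int:
        # Round up: a partly filled tape still has to be recorded and paid for.
        return math.ceil(self.target_words / self.words_per_hour)

    def collection_cost(self) -> float:
        return self.hours_needed() * self.collection_cost_per_hour

    def transcription_cost(self) -> float:
        return self.hours_needed() * self.transcription_cost_per_hour

    def total(self) -> float:
        # Excludes staff costs such as the corpus research associate.
        return self.collection_cost() + self.transcription_cost()


budget = SpokenCorpusBudget()
print(budget.hours_needed())        # 100
print(budget.collection_cost())     # 3000.0
print(budget.transcription_cost())  # 15000.0
print(budget.total())               # 18000.0
```

Changing `words_per_hour` to 15,000 shows how strongly the estimate depends on the type of talk recorded, which is one reason to budget from the conservative end of the range.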

Spoken Versus Written


In general, because of the availability of data in electronic form, a written corpus is much easier to assemble than a spoken one; therefore, more written corpora exist. McCarthy (1998) accounts for the dearth of spoken corpora in light of costs (as discussed above), access to appropriate and representative speech data situations, quality of recording, time involved in transcription, and difficult decisions in relation to the level of detail to include in transcription, among other factors. However, we would argue that the efforts and resources are justifiable on the grounds of the need to reassess language interpretation and pedagogy to account for spoken as well as written norms. Some of our sample tasks have highlighted some of the many differences between findings from written versus spoken corpora, and, indeed, there are many differences within spoken corpora depending on the context and variety. It is crucial for students to be in a position to compare corpus findings across spoken versus written corpora from as many varieties as possible. Too often classroom descriptions of the English language are based on written norms. For this reason alone, the effort of assembling a spoken corpus is worth making. A small, specialised corpus can be assembled at a relatively low cost. For example, our classroom corpus comes from recorded data that teachers and students have donated and that we have transcribed ourselves. Though the corpus amounts to fewer than 100,000 words, it is rich in spoken data from our local context.

Small Versus Large


Whether to use a large, generalized corpus or a small, specialized corpus depends on the teacher educator's particular needs. Fox (1998) remarks that a corpus is nothing more nor less than a collection of texts input into a computer, and the number of texts will depend upon the uses that will be made of the corpus (p. 25). To examine a relatively infrequent word and investigate generality of lexical use, a large, representative corpus is necessary so that adequate occurrences can be found from which to draw some conclusions about typical features (see, e.g., Coxhead, 2000). If, on the other hand, the object of enquiry is a word or structure that is quite common, smaller corpora may suffice, and the smaller they are, the easier they are to handle and exploit. Also, as Tribble (1997) suggests, a small corpus may be necessary if a specialized language register is involved. Small corpora are useful for training students in corpus techniques and methods, and they often allow the user to access contextual or pragmatic information about the spoken or written text. In addition, their limits are clearer, as they cannot claim to represent an entire language, and they therefore discourage the user from overgeneralizing. Aston (1997) makes an interesting and very practical distinction between the usefulness of small and large corpora: A large corpus is necessary for developing references, but for data-driven learning (Johns, 1991) in the classroom, where the aims and needs are much more specific and localized, the smaller corpora are as good if not better. Even linguists who have traditionally favored large representative corpora exclusively now recognize the place of smaller data collections (Tribble, 1997). Of course, another advantage for the teacher educator is that such corpora are cheaper and easier to construct or buy. At one time, a 1-million-word corpus was considered large. Today, the Bank of English (Collins Cobuild, n.d.) has more than 500 million words. What constitutes a large or a small corpus today depends on whether one is referring to a spoken or a written corpus. In very general terms we adhere to the following guidelines: For spoken corpora, anything over 1 million words is moving into the larger range; for written corpora, anything below 5 million words is quite small. However, it is often the design of the corpus as opposed to its size that determines its suitability; for example, a corpus containing only highly technical engineering language will be largely inappropriate for novice language teachers wanting to investigate the vocabulary of everyday casual conversation. Therefore, although size is an issue, it should be considered hand-in-hand with design appropriate to the long- and short-term pedagogic needs of the students. (For a full discussion of size and diversity in corpus design, see Biber et al., 1998, 1999; Coxhead, 2000; Hunston, 2002; McCarthy, 1998; Sinclair, 1991; Thomas & Short, 1996.)
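The relationship between corpus size and the number of occurrences one can expect to find may be made concrete with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and the frequencies (2 and 5,000 occurrences per million words) are invented round numbers standing in for a rare and a common item, not figures drawn from any corpus.

```python
import math


def expected_hits(freq_per_million: float, corpus_words: int) -> float:
    """Expected number of occurrences of an item in a corpus of a given size."""
    return freq_per_million * corpus_words / 1_000_000


def min_corpus_size(freq_per_million: float, hits_wanted: int) -> int:
    """Smallest corpus (in words) expected to yield the desired number of hits."""
    return math.ceil(hits_wanted * 1_000_000 / freq_per_million)


RARE = 2.0        # hypothetical: 2 occurrences per million words
COMMON = 5_000.0  # hypothetical: 5,000 occurrences per million words

# A classroom-sized corpus of 100,000 words:
print(expected_hits(RARE, 100_000))    # well under one occurrence
print(expected_hits(COMMON, 100_000))  # several hundred occurrences

# Corpus size needed to expect 50 examples of the rare item:
print(min_corpus_size(RARE, 50))       # 25000000
```

On these assumed figures, a common item is amply attested in a small corpus, whereas a rare one calls for tens of millions of words, which is the practical point behind the distinction drawn above.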

Native Speaker Versus Nonnative Speaker and Learner Corpora


The corpus developer also has to decide on the varieties of English that should be included. Prodromou (1997a), among others, raises the possibility that corpora of native speaker language may present problems for nonnative teachers. He asks, "What about the non-native speaker teacher, faced with varieties of English and cultures he or she can, by definition, never master, never own?" (p. 5). (For further discussion of native speaker ownership of the English language, see Flowerdew, 2000; Graddol, 1999; Nero, 2000; Seidlhofer, 2001; Warschauer, 2000.) One answer to this question is to have more corpora of English spoken in contexts where English is a lingua franca, such as the one described by Mauranen in this issue. Seidlhofer (1999, 2001) details a corpus development project called the Vienna-Oxford International Corpus of English (VOICE), which aims to collect approximately half a million words of spoken data from speakers who make use of ELF. Such corpora will facilitate the profiling of ELF as a robust variety that is independent of English as a native language. VOICE may, according to Seidlhofer (2001), establish something like an index of communicative redundancy (p. 147). Learner corpora are collections of texts produced by writers or speakers while they are learners. Granger (1998) advances theoretical and practical arguments for the place of learner corpora in the language classroom for studying phenomena such as interlanguage, fossilization, patterns of error, and cross-linguistic similarities and differences. Biber and Reppen (1998), Granger and Tribble (1998), and Milton (1998) outline useful procedures for using corpora as a supplementary tool for language learners, whereby students compare and analyse native speaker and learner data as a means of improving their language. Future teacher education may benefit from a recent large-scale international corpus project focusing on the written English of learners from many different L1 backgrounds, the International Corpus of Learner English (ICLE; see Granger, 1996, 1998, 1999; Granger, Hung, & Petch-Tyson, 2002). In 1995, a corpus of spoken learner English, The Louvain International Database of Spoken English Interlanguage, was set up to complement the ICLE project (see De Cock, 1998a, 1998b, 2000). Reder, Harris, and Setzler (this issue) describe a multimedia corpus of low-level learner language, with applications for second language acquisition studies as well as teacher education.

Handouts Versus Hands On


A very practical but important decision to make when using corpus evidence for pedagogic purposes is whether to prepare and print out the data for students in class or to give students access to the data on the computer. Of course, the latter assumes the ready availability of adequate levels of technology and support. In institutions where technological support may be a concern, students may be able to use the many online self-instructional options available (see the Appendix). Leech (1997) outlines the advantages of both the paper-based and the computer-based approaches as follows: Prepared printouts, which allow wider access to the data by more students, are most effective in lowering the affective filter of technophobic students and save class time as the teacher does the preliminary work prior to the lesson. On the other hand, using the computer in class can promote a more learner-centred approach, provides an open-ended supply of data, and allows for more tailored and customised learning; it also teaches strategies for learning with corpora beyond the classroom. Johns (1991), in describing the data-driven approach, strongly advocates the hands-on use of corpora by students because it makes the whole experience the epitome of induction. An additional argument for having students engage in concordancing is that it aims to give them control over their learning and build their competence by giving them access to the facts of linguistic performance (see Stevens, 1995). If practical reasons make it impossible for students to use computers themselves, Willis (1998) outlines at length the procedures that can be adopted for the use of paper-based concordances in the classroom. Educators who are familiar with inductive instruction will appreciate its effectiveness but will also recognise the increased time investment required. In shorter teacher education courses, already under time pressure, inductive instruction may not be a luxury one can afford. In our teacher education programmes, we have balanced both approaches and have found that starting with printouts and working up to computer use promotes a more progressive, inductive approach, which students tend to prefer. They need to understand the theoretical and practical applications before they become sidetracked or overwhelmed by the technology. Furthermore, using both instructional modes in teacher education programmes provides a richer variety of experience and presents students with more options for their own future teaching.
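For readers unfamiliar with what a concordancer produces, the following Python sketch shows one simple way of generating keyword-in-context (KWIC) lines that could be printed as a handout. It is a minimal illustration under our own assumptions, not the method used by any of the packages listed in the Appendix: the function name `kwic`, the character-based context window, and the sample sentences are all invented for the example.

```python
import re


def kwic(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return concordance lines with the keyword aligned in a fixed column."""
    # Collapse line breaks and repeated spaces so the context reads cleanly.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so every keyword starts at column `width`.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines


# Invented sample text standing in for a transcript.
sample = """I would say that is fine. She would not go there.
Would you like some tea? They said they would call later."""

for line in kwic(sample, "would", width=20):
    print(line)
```

The same lines can be pasted into a worksheet for paper-based work or generated live by students, which makes a sketch of this kind usable in either of the two modes discussed above.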

CONCLUSION
In this article we have outlined practical and theoretical aspects related to the integration of language corpora as an electronic resource in initial teacher education. Without doubt, language corpora will continue to develop as an influence in language pedagogy. Many instructional materials, including software, dictionaries, and grammars, have been corpus based in recent years. For this reason alone, all teachers should learn about corpora. Beyond this, however, the more teachers know about corpora and how to use them, the more they will be empowered to (a) evaluate corpus-based materials more objectively and (b) question publishers and academics about specific details of the corpora they use. Native and nonnative teachers need to learn to manipulate language corpora for their own pedagogic ends and to evaluate findings that are presented as facts. Corpus-using teachers will be better placed for the sociocultural mediation and pedagogic recontextualization of these resources and findings in their language classrooms of the future. At the same time, much work remains for teacher educators in further developing methodological principles for the use of corpora and empirically evaluating corpus-based approaches and their effect on learning.
ACKNOWLEDGMENTS
We are grateful to the anonymous reviewers for detailed comments on an initial draft of this article. We also thank Susan Conrad, Gwyneth Fox, and Michael McCarthy for their feedback and encouragement on earlier versions.

THE AUTHORS
Anne O'Keeffe is course leader in EFL/TEFL at Mary Immaculate College, University of Limerick. Her research centres on small spoken language corpora and particularly on how they can be used to explore sociocultural nuance. Along with her colleagues, she is involved in building the Limerick Corpus of Irish English, and she has recently completed a PhD on the discourse of radio call-in.

Fiona Farr is director for the MA in English language teaching at the University of Limerick. Her research interest is in the application of spoken language corpora and language varieties. She is part of a research group building the Limerick Corpus of Irish English and is completing a PhD on the discourse of language teacher education.

REFERENCES
Aston, G. (1995). Corpora in language pedagogy: Matching theory and practice. In G. Cook & B. Seidlhofer (Eds.), Principle and practice in applied linguistics: Studies in honour of H. G. Widdowson (pp. 257–270). Oxford: Oxford University Press.
Aston, G. (1997). Small and large corpora in language learning. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), PALC '97: Practical applications in language corpora (pp. 51–62). Lodz, Poland: Lodz University Press.
Barnes, A., & Murray, L. (1999). Developing the pedagogical ICT competence of modern foreign languages teacher trainees. Situation: All change and plus ça change. Journal of IT for Teacher Education, 8, 165–180.
Baynham, M. (1991). Speech reporting as discourse strategy: Some issues of acquisition and use. Australian Review of Applied Linguistics, 14, 87–114.
Baynham, M. (1996). Direct speech: What's it doing in non-narrative discourse? Journal of Pragmatics, 25, 61–81.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Essex, England: Longman.
Biber, D., & Reppen, R. (1998). Comparing native and learner perspectives on English grammar: A study of complement clauses. In S. Granger (Ed.), Learner English on computer (pp. 145–158). London: Longman.
Boxer, D., & Pickering, L. (1995). Problems in the presentation of speech acts in ELT materials: The case of complaints. ELT Journal, 49, 99–158.
Carter, R. (1998). Orders of reality: CANCODE, communication and culture. ELT Journal, 52, 43–56.
Carter, R., & McCarthy, M. J. (1995). Grammar and the spoken language. Applied Linguistics, 16, 141–158.
Chapelle, C. A. (2001). ELT, technology and change. In A. Pulverness (Ed.), IATEFL 2001 Brighton conference selections (pp. 9–18). Kent, England: International Association of Teachers of English as a Foreign Language.
Collins Cobuild. (n.d.). Corpus concordance sampler. Retrieved May 30, 2003, from http://titania.cobuild.collins.co.uk/form.html
Coniam, D. (1997). A practical introduction to corpora in a teacher training language awareness programme. Language Awareness, 6, 199–207.
Conrad, S. (2000). Will corpus linguistics revolutionize grammar teaching in the 21st century? TESOL Quarterly, 34, 548–560.
Cook, G. (1998). The uses of reality: A reply to Ronald Carter. ELT Journal, 52, 57–63.
Corpus of spoken professional American English [CD-ROM]. (2000). Houston, TX: Athelstan. (Available from http://www.athel.com/cspa.html)
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34, 213–238.
Crowdy, S. (1993). Spoken corpus design. Literary and Linguistic Computing, 8, 259–265.
Crowdy, S. (1994). Spoken corpus transcription. Literary and Linguistic Computing, 9, 25–28.
Cummins, J. (2000). Academic language learning, transformative pedagogy, and information technology: Towards a critical balance. TESOL Quarterly, 34, 537–547.

De Cock, S. (1998a). Corpora of learner speech and writing and ELT. In A. Usoniene (Ed.), Proceedings from the International Conference on Germanic and Baltic Linguistic Studies and Translation (pp. 56–66). Vilnius, Lithuania: Homo Liber.
De Cock, S. (1998b). A recurrent word combination approach to the study of formulae in the speech of native and non-native speakers of English. International Journal of Corpus Linguistics, 3, 59–80.
De Cock, S. (2000). Repetitive phrasal chunkiness and advanced EFL speech and writing. In C. Mair & M. Hundt (Eds.), Corpus linguistics and linguistic theory: Papers from the Twentieth International Conference on English Language Research on Computerized Corpora (ICAME 20), Freiburg im Breisgau 1999 (pp. 51–68). Amsterdam: Rodopi.
Doering, A., & Beach, R. (2002). Preservice English teachers acquiring literacy practices through technology tools. Language Learning and Technology, 6, 127–146. Retrieved June 2, 2003, from http://llt.msu.edu/vol6num3/doering/default.html
Egbert, J., Paulus, T. M., & Nakamichi, Y. (2002). The impact of CALL instruction on classroom computer use: A foundation for rethinking technology in teacher education. Language Learning and Technology, 6, 108–126. Retrieved June 2, 2003, from http://llt.msu.edu/vol6num3/pdf/egbert.pdf
Farr, F. (2002). Classroom interrogations – how productive? Teacher Trainer, 16, 19–23.
Farr, F., & O'Keeffe, A. (2002). Would as a hedging device in an Irish context: An intra-varietal comparison of institutionalised spoken interaction. In R. Reppen, S. Fitzpatrick, & D. Biber (Eds.), Using corpora to explore linguistic variation (pp. 25–48). Amsterdam: Benjamins.
Flowerdew, J. (1996). Concordancing in language learning. In M. Pennington (Ed.), The power of CALL (pp. 97–113). Houston, TX: Athelstan.
Flowerdew, J. (2000). Discourse community, legitimate peripheral participation, and the nonnative-English-speaking scholar. TESOL Quarterly, 34, 127–150.
Fox, G. (1998). Using corpus data in the classroom. In B. Tomlinson (Ed.), Materials development in language teaching (pp. 25–43). Cambridge: Cambridge University Press.
Freeman, D. (1991). To make the tacit explicit: Teacher education, emerging discourses, and conceptions of teaching. Teaching and Teacher Education, 7, 439–454.
Gabrielatos, C. (2002/2003). Grammar, grammars and intuitions in ELT: A second opinion. IATEFL Issues, 170, 2–3.
Graddol, D. (1999). The decline of the native speaker. AILA Review, 13, 57–68.
Granger, S. (1996). Learner English around the world. In S. Greenbaum (Ed.), Comparing English world-wide (pp. 13–24). Oxford: Clarendon Press.
Granger, S. (Ed.). (1998). Learner English on computer. London: Longman.
Granger, S. (1999). Use of tenses by advanced EFL learners: Evidence from an error-tagged computer corpus. In H. Hasselgard & S. Oksefjell (Eds.), Out of corpora: Studies in honour of Stig Johansson (pp. 191–202). Amsterdam: Rodopi.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer learner corpora, second language acquisition and foreign language teaching. Amsterdam: Benjamins.
Granger, S., & Tribble, C. (1998). Learner corpus data in the foreign language classroom: Form-focused instruction and data-driven learning. In S. Granger (Ed.), Learner English on computer (pp. 199–209). London: Longman.
Holmes, J. (1988). Doubt and certainty in ESL textbooks. Applied Linguistics, 9, 21–44.
Holmes, J., Vine, B., & Johnson, G. (n.d.). The Wellington corpus of spoken New Zealand English. Retrieved May 30, 2003, from http://www.vuw.ac.nz/lals/wgtn_crps_spkn_NZE.htm

Hughes, R., & McCarthy, M. J. (1998). From sentence to discourse: Discourse grammar and English language teaching. TESOL Quarterly, 32, 263–287.
Hunston, S. (1995). Grammar in teacher education: The role of a corpus. Language Awareness, 4, 15–31.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.
ICAME collection of English language corpora [CD-ROM]. (2000). Bergen, Norway: Norwegian Computing Centre for the Humanities. (Available from http://www.hit.uib.no/icame.html)
International corpus of English: The British component (ICE-GB) [CD-ROM]. (1998). (Available from http://www.ucl.ac.uk/english-usage/ice-gb/)
Johns, T. (1991). Should you be persuaded – two samples of data driven learning materials. English Language Research Journal, 4, 1–16.
Johns, T. (1997). Contexts: The background, development and trialling of a concordance-based CALL program. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 100–115). London: Longman.
Kennedy, C. (1995). Wish you were here: Little texts and language awareness. Language Awareness, 4, 161–172.
Kettermann, B. (1995). Concordancing in English language teaching. TELL and CALL, 4, 4–15.
Lancaster/IBM Spoken English Corpus (SEC) tag-set. (n.d.). Retrieved May 30, 2003, from http://www.comp.leeds.ac.uk/amalgam/tagsets/sec.html
Leech, G. (1997). Teaching and language corpora: A convergence. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 1–23). London: Longman.
Limerick Corpus of Irish English. (2003). Retrieved July 23, 2003, from http://www.mic.ul.ie/lcie
Maia, B. (1997). Do-it-yourself corpora . . . with a little help from your friends. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), PALC '97: Practical applications in language corpora (pp. 403–410). Lodz, Poland: Lodz University Press.
McCarthy, M. J. (1990). Vocabulary. Oxford: Oxford University Press.
McCarthy, M. J. (1998). Spoken language and applied linguistics. Cambridge: Cambridge University Press.
McCarthy, M. J. (2001). Issues in applied linguistics. Cambridge: Cambridge University Press.
McCarthy, M. J., & Walsh, S. (2003). Discourse. In D. Nunan (Ed.), Classroom-based language teaching methodology (pp. 173–195). New York: McGraw-Hill.
Meskill, C. J., Mossop, S., DiAngelo, R., & Pasquale, K. (2002). Expert and novice teachers talking technology: Precepts, concepts and misconcepts. Language Learning and Technology, 6, 46–57. Retrieved June 2, 2003, from http://llt.msu.edu/vol6num3/meskill/default.html
Milton, J. (1998). Exploiting L1 and interlanguage corpora in the design of an electronic language learning and production environment. In S. Granger (Ed.), Learner English on computer (pp. 186–198). London: Longman.
Murison-Bowie, S. (1996). Linguistic corpora and language teaching. Annual Review of Applied Linguistics, 16, 182–199.
Murray, L. (1998). CALL and Web training with teacher self-empowerment: A departmental and long-term approach. Computers and Education, 31, 17–23.
Nero, S. J. (2000). The changing faces of English: A Caribbean perspective. TESOL Quarterly, 34, 483–510.
Owen, C. (1996). Do concordances need to be consulted? ELT Journal, 50, 219–224.

Pennington, M. (2001). Writing minds and talking fingers: Doing literacy in an electronic age. In CALL in the 21st century [CD-ROM]. Kent, England: International Association of Teachers of English as a Foreign Language.
Prodromou, L. (1997a). Corpora: The real thing? English Teaching Professional, 5, 2–6.
Prodromou, L. (1997b). From corpus to octopus. IATEFL Newsletter, 137, 18–21.
Rundell, M. (1997). Understatement and indirectness in English: From corpus evidence to classroom practice. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), PALC '97: Practical applications in language corpora (pp. 90–98). Lodz, Poland: Lodz University Press.
Scott, M. (1996). WordSmith Tools (Version 3.0) [Computer software]. Oxford: Oxford University Press. (Available from http://www.liv.ac.uk/~ms2928/)
Seidlhofer, B. (1999). Double standards: Teacher education in the expanding circle. World Englishes, 18, 233–245.
Seidlhofer, B. (2001). Closing a conceptual gap: The case for a description of English as a lingua franca. International Journal of Applied Linguistics, 11, 133–158.
Shulman, L. S. (1987). Knowledge and teaching: Foundations of the new reform. Harvard Educational Review, 19, 4–14.
Sinclair, J. M. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, J., & Coulthard, M. (1975). Towards an analysis of discourse: The English used by teachers and pupils. Oxford: Oxford University Press.
Sternberg, R. J., & Horvath, J. A. (1995). A prototype view of expert teaching. Educational Researcher, 24, 9–17.
Stevens, V. (1991). Classroom concordancing: Vocabulary materials derived from relevant, authentic text. ESP Journal, 10, 35–46.
Stevens, V. (1995). Concordancing with language learners: Why? When? What? CLL Journal, 6, 2–10.
Svartvik, J. (1991). What can real spoken data teach teachers of English? In J. A. Alatis (Ed.), Linguistics and language pedagogy: The state of the art (pp. 555–565). Washington, DC: Georgetown University Press.
Tammelin, M. (2001). Empowering the language teacher through ICT training and media education: Case HSEBA. In CALL in the 21st century [CD-ROM]. Kent, England: International Association of Teachers of English as a Foreign Language.
Thomas, J., & Short, M. (Eds.). (1996). Using corpora for language research. New York: Longman.
Thompson, G. (1995). Collins Cobuild concordance sampler 3: Reporting. London: HarperCollins.
Tribble, C. (1997). Improvising corpora for ELT: Quick and dirty ways of developing corpora for language teaching. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), PALC '97: Practical applications in language corpora (pp. 106–117). Lodz, Poland: Lodz University Press. Retrieved June 2, 2003, from http://web.bham.ac.uk/johnstf/palc.htm
Tribble, C. (2000). Practical uses for language corpora in ELT. In P. Brett & G. Motteram (Eds.), A special interest in computers: Learning and teaching with information and communications technologies (pp. 31–41). Kent, England: International Association of Teachers of English as a Foreign Language.
Tribble, C., & Jones, G. (1990). Concordances in the classroom. London: Longman.
Tribble, C., & Jones, G. (1997). Concordances in the classroom: Using corpora in language education. Houston, TX: Athelstan.
University of Bergen. (2000). Bergen corpus of London teenage language. Retrieved May 30, 2003, from http://www.hit.uib.no/colt/
Warschauer, M. (2000). The changing global economy and the future of English teaching. TESOL Quarterly, 34, 511–535.

Wegerif, R., Mercer, N., & Rojas-Drummond, S. (1999). Language for the social construction of knowledge: Comparing classroom talk in Mexican preschools. Language and Education, 13, 133–150.
Widdowson, H. G. (2000). On the limitations of linguistics applied. Applied Linguistics, 21, 3–25.
Willis, J. (1998). Concordances in the classroom without a computer: Assembling and exploiting concordances of common words. In B. Tomlinson (Ed.), Materials development in language teaching (pp. 44–66). Cambridge: Cambridge University Press.

APPENDIX Web Sites and Software for Corpus Linguistics


Corpora
American National Corpus
http://americannationalcorpus.org/

Australian Corpus of English


Available in ICAME Collection of English Language Corpora (2000). Peters, P., & Smith, A. (n.d.). Manual of information to accompany the Australian Corpus of English (ACE). http://khnt.hit.uib.no/icame/manuals/ace/INDEX.HTM

Bergen Corpus of London Teenage Language (COLT)


http://www.hit.uib.no/colt/; available in ICAME Collection of English Language Corpora

British National Corpus


http://info.ox.ac.uk/bnc/

Corpus of Spoken Professional American English (CSPAE)


http://www.athel.com/cspa.html

ICAME Collection of English Language Corpora


http://www.hit.uib.no/icame.html

English-Norwegian Parallel Corpus


http://www.hd.uib.no/enpc.html

English-Swedish Parallel Corpus


http://www.englund.lu.se/research/corpus/corpus/webtexts.html

International Corpus of English: The British Component (ICE-GB)


http://www.ucl.ac.uk/english-usage/ice-gb/

International Corpus of Learner English


http://www.abo.fi/fak/hf/enge/icle.htm
Granger, S. (n.d.). International corpus of learner English – ICLE. http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/icle.htm

IViE Corpus
Grabe, E., & Slater, A. (2002). The prosodically transcribed IViE corpus on-line. http://www.phon.ox.ac.uk/~esther/ivyweb/search_trans.html

Lancaster/IBM Spoken Corpus of English (SEC)


http://www.comp.leeds.ac.uk/amalgam/tagsets/sec.html; available in ICAME Collection of English Language Corpora

Limerick Corpus of Irish English (L-CIE)


http://www.mic.ul.ie/lcie

Longman Spoken American Corpus


http://www.longman-elt.com/dictionaries/corpus/lcaspoke.html

Longman Learners Corpus


http://www.longman-elt.com/dictionaries/corpus/lclearn.html

Louvain International Database of Spoken English Interlanguage (LINDSEI)


http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Lindsei/lindsei.htm

Michigan Corpus of Academic Spoken English (MICASE)


http://www.hti.umich.edu/m/micase/

Tractor
http://www.tractor.de/faq.htm

Wellington Corpus of Spoken New Zealand English


http://www.vuw.ac.nz/lals/wgtn_crps_spkn_NZE.htm; available in ICAME Collection of English Language Corpora

Concordancing Software
Collins Cobuild. Corpus concordance sampler. (n.d.). http://titania.cobuild.collins.co.uk/form.html
Conc (Version 1.80). (2000). Dallas, TX: SIL International. (Available from http://www.sil.org/computing/conc/)
MonoConc Pro. (2000). Houston, TX: Athelstan. (Available from http://www.athel.com/mono.html)
Multiconcord. (1998). Hants, England: CFL Software Development. (Available from http://web.bham.ac.uk/johnstf/lingua.htm)
Scott, M. (1996). WordSmith Tools (Version 3.0). Oxford: Oxford University Press. (Available from http://www.liv.ac.uk/~ms2928/ and http://www4.oup.co.uk/isbn/0-19-459286-3)
UltraFind. (1997). London: Ultradesign Technologies. (Available from http://www.ultradesign.com/ultrafind/ultrafind.html)
Watt, R. J. C. (2002). Concordance (Version 3.0). (Available from http://www.rjcw.freeserve.co.uk/)

Suggestions for Classroom Use of Concordancing


Hot Potatoes home page. (n.d.). http://web.uvic.ca/hrd/halfbaked
Ruthven-Stuart, P. (2000). How to use concordances in teaching English: Some suggestions. http://www.nsknet.or.jp/~peterr-s/concordancing/usingconcs.html
Ruthven-Stuart, P. (2000). Online concordance quizzes. http://www.nsknet.or.jp/~peterr-s/concordancing/onlineconcquiz/online_conc_quizzes.html

Corpus Linguistics
Corpus linguistics pages. (1998). http://info.ox.ac.uk/bnc/corpora.html#Corpus
Barlow, M. (n.d.). Parallel corpora. http://www.ruf.rice.edu/~barlow/para.html
The Fifth Teaching and Language Corpora Conference. (2003). http://www.sslmit.unibo.it/talc/
The Tuscan Word Centre. (2003). http://www.twc.it/
University of Birmingham Centre for Corpus Linguistics. http://clg1.bham.ac.uk/

Corpus Linguistics Tutorials


Aston University, School of Languages and European Studies. (2000). Introduction to text analysis. http://www.les.aston.ac.uk/txtintro.html
Ball, C. N. (1997). Tutorial: Concordances and corpora. http://www.georgetown.edu/cball/corpora/tutorial.html
University of Essex, W-3 Corpora Project. (2000). The W3-corpora site. http://clwww.essex.ac.uk/w3c/

A Corpus-Based Study of Idioms in Academic Speech


RITA SIMPSON and DUSHYANTHI MENDIS
University of Michigan Ann Arbor, Michigan, United States

A mastery of idioms is often equated with native speaker fluency (Fernando, 1996; Schmitt, 2000; Wray, 2000), but it is difficult for language teachers and material writers to make principled decisions about which idioms should be taught, given the vast inventory of idioms in a native speaker's repertoire. This article addresses the advantages and limitations of a corpus-based approach to researching and teaching idioms in a specific genre by drawing on a specialized corpus of 1.7 million words of academic discourse, the Michigan Corpus of Academic Spoken English. We argue that evidence from such a corpus can be quite informative for language teachers when the primary target language domain matches that of the corpus. In terms of pedagogical applications, we demonstrate the use of corpus data to construct teaching materials aimed not only at helping students learn unfamiliar idioms but also at raising their awareness of the speech contexts idioms occur in and the discourse functions they perform.

TESOL QUARTERLY Vol. 37, No. 3, Autumn 2003

The teaching of idioms raises a number of challenging practical and research-related questions. What is an idiom? Are idioms worth teaching, and, if so, why? If idioms should be taught, which of the thousands in English should be included in any particular course or textbook, and how should they be taught? Researchers such as Fernando (1996), Wray (1999), and Schmitt (2000) equate mastery of idioms with successful language learning and native speaker fluency, a perception that many language learners share and that often translates into a desire to acquire as many idioms as possible. Moreover, the mention of the word idiom conjures up language that is thought to be entertaining, engaging, casual, charming, colorful, and memorable. If such perceptions are not sufficient reasons for teaching these expressions, idioms also perform many important discourse functions, further warranting their inclusion in an ESL curriculum.

The inventory of idioms in a native speaker's repertoire is indeed vast, and therefore any individual idiom occurs relatively rarely and unpredictably in any given stretch of discourse. How then can language teachers and material writers make principled decisions about which idioms are worth teaching? As Biber and Conrad (2001), Mauranen (2002), and others have demonstrated, a corpus can be used to identify the most important linguistic exemplars to teach. Although no single corpus can provide a comprehensive selection of idioms, a corpus is arguably a much better starting point than an invented list of idioms, in part because such lists are by and large entirely devoid of a coherent focus on a particular language domain, such as business or academic English.

This article reports the results of a corpus-based study of the idioms of academic speech with the aim of examining the feasibility of identifying those worth teaching to students of English for academic purposes (EAP). It identifies the idioms that occur in academic speech, analyzes their functions, and offers some implications for teaching based on the corpus data.

THE STUDY OF IDIOMS


The majority of research on idioms has looked at them as a lexical phenomenon that is equally relevant across registers of English. More recently, however, attention has been directed toward idioms as a more register-specific linguistic feature.

Idioms as Formulaic Language


A number of studies consider idioms as one subcategory of the more general lexical phenomenon of formulaic language (Nattinger & DeCarrico, 1992; Moon, 1998; Wray, 1999, 2000, 2002; Wray & Perkins, 2000). A recurring theme throughout this literature is that an ability to understand and use formulaic language (including idioms) appropriately is a key to nativelike fluency. In fact, according to Fernando (1996), "No translator or language teacher can afford to ignore idioms or idiomaticity if a natural use of the target language is an aim" (p. 234). Pawley and Syder (1983) make a strong case for the daunting nature of the task learners face in figuring out which grammatically possible utterances are commonly used by native speakers (that is, which are idiomatic) and which utterances, though grammatically possible, are not nativelike. Wray (1999) supports this claim, adding that the absence of formulaic sequences in learners' speech results in unidiomatic-sounding speech. Nattinger and DeCarrico (1992) take this argument a step further and present a typology and pragmatic analysis of what they call lexical phrases, along with a number of suggestions for incorporating them into the L2 curriculum.

Wray (1999, 2000, 2002) and Wray and Perkins (2000) approach the study of formulaic language primarily from a psycholinguistic perspective, arguing convincingly for a function-based account of formulaic language that acknowledges the sociopragmatic and interactional purpose of such expressions. Wray (1999) maintains that formulaic language benefits both comprehension and production, in part because such expressions appear to be stored and retrieved as holistic, unanalyzed chunks and thus contribute to economy of expression. As she puts it, "The whole point of selecting a prefabricated string is to bypass analysis" (p. 480). As stated above, because idioms are considered to be a subset of formulaic language, these claims and findings for formulaic expressions are applicable to idioms as well.

Functional Descriptions of Idioms


Early studies of idioms tended to focus on formal properties such as typologies based on semantic and syntactic criteria (e.g., Makkai, 1972; Weinreich, 1969) whereas more recent work, beginning with Strässler (1982), has turned to an analysis of the pragmatic, interactional, and discourse-level features of idioms (Fernando, 1996; McCarthy, 1998; Moon, 1998). McCarthy demonstrates that idioms are highly interactive items and cannot always be identified by their formal properties. He also claims that idioms are not used randomly or without motivation and convincingly argues that they should be looked at as communicative devices rather than as mere quirks of the language (p. 146). He goes on to identify a number of socio-interactional functions for selected idioms in his corpus.

Although several in-depth studies of idioms have been conducted, to date none has examined idioms in a specialized corpus with specific pedagogical aims. Moon (1998) presents one of the most in-depth corpus-based studies of fixed expressions and idioms in English, but as she acknowledges, her corpus has certain limitations: It consists almost exclusively of written texts, and more than two thirds of those texts are of journalistic writing. McCarthy's (1998) discussion of idioms is based on a relatively large spoken corpus consisting of predominantly interactive spoken data from a number of different registers. As McCarthy points out, the distribution and function of idioms in spoken discourse are areas that need more research. Our research, which drew on a specialized corpus of both interactive and monologic speech, therefore fills a gap not yet sufficiently addressed: the study of idioms in speech from a specific institutional context. In particular, the research questions for this study were (a) How many idioms occur in the various subregisters within academic spoken language? (b) What functions do these idioms perform?

METHOD
We began this research project to find out whether idioms occurred at all in academic speech and, if so, what conclusions could be drawn based on a corpus of under 2 million words, given that a majority of idioms have frequencies in the range of 1 token or fewer per million words (Moon, 1998). The methods of this research extended beyond that question to look at precisely how many idioms were found and what functions they served.

The Corpus
Our research was based on the Michigan Corpus of Academic Spoken English (MICASE), a specialized corpus of contemporary speech recorded at the University of Michigan between 1997 and 2001 (Simpson, Briggs, Ovens, & Swales, 2002). MICASE, which is freely available and searchable via the Web, contains 197 hours of recorded speech, totaling about 1.7 million words in 152 speech events. These speech events range from large lectures, to dissertation defenses, to one-on-one office-hour interactions and small peer-led study group sessions, and each transcript is categorized along several dimensions, including primary discourse mode and academic division. Primary discourse mode is a three-way classification referring to the level of interactivity, labeled monologic, interactive, or mixed. Academic division refers to one of four divisions defined according to the University of Michigan graduate school's classification of departments: Humanities and Arts, Social Sciences and Education, Biological and Health Sciences, and Physical Sciences and Engineering. These two parameters were the dimensions of variation we analyzed in the quantitative part of the present study. MICASE also contains information about speaker attributes such as age, gender, and academic role (e.g., junior faculty, senior faculty, undergraduate or graduate student), but these were not taken into account in this study.


Procedures
We began by examining ESL textbooks to identify idioms, but this was not very successful. We therefore returned to a basic definition of idioms and searched the corpus. In the exploratory phase of our research, we compiled lists of idioms from three ESL textbooks aimed at university-level learners (Madden & Rohlck, 1997; McCarthy & O'Dell, 1997; Redman & Shaw, 1999) that had been published around the same time MICASE was being compiled, that is, between 1997 and 2001. We found, however, that only about 25% of the idioms in these lists occurred in MICASE. This is partly because MICASE is a relatively small corpus, and the frequency of any given idiom in naturally occurring discourse is typically low. In addition, as we noted above, the selection criteria used by textbook authors for including idioms are somewhat unprincipled and idiosyncratic; thus it is not entirely surprising that there was little overlap between these lists and the idioms found in MICASE.

Defining Idioms
Because we needed to find idioms in the corpus that were not drawn from any list, we developed criteria for deciding what an idiom is. The relatively narrow definition we adopted resulted in a manageable database of examples to examine in detail. The most prevalent description of an idiom is a group of words that occur in a more or less fixed phrase and whose overall meaning cannot be predicted by analyzing the meanings of its constituent parts. Starting from the premise that an idiom is a multiword expression, we used three criteria: compositeness or fixedness, institutionalization, and semantic opacity, all of which are also noted by Fernando (1996), McCarthy (1998), and Moon (1998) in their definitions of an idiom.

Compositeness or fixedness means that the individual lexical units of these expressions are usually set and cannot easily be replaced or substituted for. Idioms such as off the deep end, odds and ends, and making out like bandits are all examples of such fixed expressions (attested in MICASE). Institutionalization refers to the conventionalization of what was initially an ad hoc, novel expression (Fernando, 1996), resulting in its currency and acceptance among the wider discourse community rather than by a small subcommunity. Semantic opacity indicates that the meaning of such expressions is not transparent based on the sum of their constituent parts. For example, the individual words in the idioms tongue in cheek, on the ball, and put a spin on it provide no clues to their composite meaning.

In applying the test of semantic opacity to an expression, we were frequently reminded of McCarthy's (1998) assertion that the boundary between the opaque, idiomatic meaning of a fixed expression and its transparent, more literal meaning is often blurred. As almost all researchers of idioms have noted, defining the phenomenon to be studied is indeed problematic, and we do not claim to have solved this problem. What we have done is to arrive collaboratively at an agreed-on working definition that we refined as we proceeded.

Searching the Corpus


With these criteria in mind, we found idiom tokens by intensively reading a selection of the transcripts in the corpus and then searching the entire corpus for those tokens or variations thereof (e.g., twisted/twisting my/your/his/her arm, arm twisting) using a concordance program (WordSmith Tools; Scott, 1996) in order to obtain overall frequency counts for each idiom. Transcripts were randomly selected from each of the major academic division and speech event categories, and each one was read through in its entirety. Although we do not claim to have uncovered all the idioms in MICASE, we have read through at least half of the transcripts in this manner and are confident that we have a representative and relatively complete inventory.

To compile our master list of idioms, we used the three criteria supported by the collective intuition of three raters to eliminate some expressions. For example, expressions that may be considered metaphorical but are not idioms (e.g., a sad showing) were eliminated, as were phrasal verbs (e.g., catch on, bounce back, jump into) because these expressions constitute a separate subcategory often considered to be simply idiomatic verb phrases. In addition, we eliminated certain binomial expressions that we considered to be semantically transparent (for example, pick and choose, little by little, trial and error), but we included such binomial expressions as odds and ends, nuts and bolts, and crash and burn because we felt that they satisfied the criterion of semantic opacity.

This master list of idioms was entered into a database (created with Microsoft Access) that was linked to the existing database of all the speech events in the corpus. The linked database allowed us to compile various sorted lists of all the idioms and the transcripts they occurred in, which we used to obtain the frequency distributions.
Finally, we used a discourse-analytic approach to identify and illustrate the primary pragmatic functions of a selection of idioms in the corpus.
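The variant search described above can be approximated with a regular expression. The following is only a minimal sketch, not the WordSmith Tools procedure the study used: the pattern, the pronoun list, the `concordance` helper, and the sample text are all hypothetical and serve only to illustrate how one query can retrieve several surface forms of an idiom together with some surrounding context.

```python
import re

# Hypothetical pattern for variants of "twist someone's arm"; the inflections
# and pronouns listed here are illustrative assumptions, not the authors'
# actual search terms.
ARM_TWIST = re.compile(
    r"\b(?:twist(?:s|ed|ing)?\s+(?:my|your|his|her|our|their)\s+arms?"
    r"|arm[- ]twisting)\b",
    re.IGNORECASE,
)


def concordance(text, pattern, width=30):
    """Return one keyword-in-context line per match of pattern in text."""
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# Invented sample text, not taken from MICASE.
sample = (
    "well nobody was twisting my arm about it. "
    "a little arm twisting can help, but she twisted his arm anyway."
)
hits = concordance(sample, ARM_TWIST)
for line in hits:
    print(line)
```

Each retrieved line would still need to be read in context, since a pattern of this kind cannot by itself separate idiomatic from literal uses.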

FINDINGS

Frequencies and Distribution of Idioms in MICASE


Two main questions guided the quantitative analysis. First, are there more idioms overall in the interactive transcripts of the corpus than in the monologic ones? Idioms are often assumed to be more prevalent in informal and interactive communication; certainly, a glance at many of the popular ESL textbooks purporting to teach idioms lends support to this belief. Second, do the overall frequencies of idioms show any noticeably differing trends across the four academic divisions? Preliminary forays into the corpus led us to suspect that we would find more idioms in the humanities and social science divisions than in the hard sciences, perhaps reflecting the stereotypical expectation of encountering colorful and rhetorically sophisticated speech in the arts and humanities and technical speech in the hard sciences and engineering. Both of these assumptions appear to be unfounded, according to our results.

In all, we found 238 idiom types (unique idioms), with 562 tokens in the corpus, that met our criteria (see Table 1, which shows four frequency ranges and the number of types in each range). Of these, 123, or over half of all types, occurred only once, and only 23, or 10%, occurred more than four times in the corpus. As for the frequency of occurrence across the four academic divisions and primary discourse modes, the results of our research show a remarkably patternless distribution (see Table 2). Expressed in terms of either instances (i.e., omitting repetitions of the same idiom in a given transcript) or total tokens (i.e., counting all repetitions) per 100,000 words, neither the humanities nor the hard sciences show striking differences, and, similarly, idiom frequencies were only slightly higher in the monologic than in the interactive speech events.
TABLE 1
Idiom Types in MICASE by Frequency of Occurrence

Range of occurrences    Range of occurrences            No. of idiom
(raw frequency)         (frequency per million          types in range
                        words, approximate)
10–17                   5.8–10.0                          8
5–9                     3.0–5.3                          15
2–4                     1.2–2.4                          92
1                       0.6                             123

Note. Total idiom types = 238; total idiom tokens = 562.
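The per-million column in Table 1 is a simple normalization of the raw counts by corpus size. The sketch below assumes a round corpus size of 1.7 million words (the approximate figure given for MICASE); the function and variable names are illustrative. Because the published values are labeled approximate, the recomputed figures should be expected to be close to, but not necessarily identical with, the printed ones.

```python
CORPUS_WORDS = 1_700_000  # assumption: the rounded MICASE size cited in the article


def per_million(raw_count, corpus_words=CORPUS_WORDS):
    """Convert a raw frequency into occurrences per million words."""
    return raw_count / corpus_words * 1_000_000


# (lowest raw count, highest raw count, number of idiom types), from Table 1
TABLE_1 = [(10, 17, 8), (5, 9, 15), (2, 4, 92), (1, 1, 123)]

for low, high, n_types in TABLE_1:
    print(
        f"raw {low}-{high}: "
        f"{per_million(low):.1f} to {per_million(high):.1f} per million words, "
        f"{n_types} types"
    )

# The type counts in the four ranges should add up to the reported total.
total_types = sum(n_types for _, _, n_types in TABLE_1)
print("total idiom types:", total_types)
```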

TABLE 2
Frequency Distributions of Idioms in MICASE by Primary Discourse Mode and Academic Division

                                                               Total               Per 100,000 words
                                                         Idiom       Idiom        Idiom       Idiom
Category                         Transcripts    Words    instances(a) tokens(b)   instances(a) tokens(b)

Primary discourse mode
  Monologic/panel                     70       688,875      203        256           30          37
  Interactive                         57       714,906      180        227           25          32
  Mixed                               25       284,323       60         72           21          25
Academic division
  Inter-/nondepartmental              13       156,171       58         70           37          45
  Social Sciences & Education         33       394,780      123        162           31          41
  Physical Sciences & Engineering     36       358,729       85        113           24          32
  Humanities & Arts                   37       447,487      110        137           24          31
  Biological & Health Sciences        33       330,937       67         73           20          22

(a) One token of an idiom occurring in a transcript, not counting multiple tokens. (b) All occurrences of idioms.
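The per-100,000-word rates in Table 2 can be recomputed from the word counts and the idiom totals. The following sketch does so for the three discourse modes, using figures as read from Table 2; the names are illustrative, and the assignment of each number to the total or the rate columns is an interpretation of the table. Recomputed values agree with the printed ones to within about one unit (the monologic instance rate, for example, comes out at roughly 29.5 against a printed 30), presumably because of rounding.

```python
# (category, words, idiom instances, idiom tokens), as read from Table 2
TABLE_2_BY_MODE = [
    ("Monologic/panel", 688_875, 203, 256),
    ("Interactive", 714_906, 180, 227),
    ("Mixed", 284_323, 60, 72),
]


def rate_per_100k(count, words):
    """Normalize a raw count to a rate per 100,000 words."""
    return count / words * 100_000


# Map each category to (instance rate, token rate).
rates = {
    name: (rate_per_100k(instances, words), rate_per_100k(tokens, words))
    for name, words, instances, tokens in TABLE_2_BY_MODE
}

for name, (instance_rate, token_rate) in rates.items():
    print(f"{name:<16} instances {instance_rate:5.1f}   tokens {token_rate:5.1f}")
```

Normalizing in this way is what makes categories of very different sizes comparable, such as the mixed transcripts (about 284,000 words) and the interactive ones (about 715,000 words).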

As further evidence that idioms are distributed across a wide range of academic speech events, even after a not quite exhaustive search, as described above, we found that 11 transcripts of a variety of speech events contained 10 or more idioms. The speech events included in these transcripts were as diverse as an organic chemistry study group, a meeting between a graduate student and his adviser, a class on ethics issues in journalism, and a large public lecture by a recent Nobel laureate in physics.

In addition to the two primary, corpus-internal comparisons, another general question motivating the quantitative analysis was whether or not idioms would turn out to be more or less frequent in academic speech than in any other spoken genre. We speculated that idioms might prove to be rare in academic speech compared with corpora of general conversation, but no studies have attempted to quantify the number of idioms in any genre. Moon (1998) compared the frequencies of 24 idioms in the entire Bank of English (a continually growing corpus, currently containing over 400 million words of spoken and written English; see Collins Cobuild, n.d.) with their frequencies in the 20-million-word subcorpus of conversation and found only five of the items to be more common in conversation. She sums up this finding by speculating that people may be impressionistically overreporting high incidences of idioms in conversation (pp. 72–73). In light of these findings, we would surmise that idioms are neither rare nor particularly frequent in academic speech or any of its subgenres. The overall frequency of idiom tokens in MICASE, using our rather narrow criteria, is about 330 total tokens per million words, or 260 per million not counting repetitions within the same speech event. Thus taken as a whole, these items constitute a not-insignificant feature of the lexical landscape of academic speech. At the same time, quantitative comparisons of idioms are difficult to assess because of the relative infrequency of any given idiom combined with the difficulty of ensuring that other researchers have followed similar criteria in deciding what counts as an idiom.
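The difference between total tokens and instances (tokens with repetitions inside one speech event collapsed) can be made concrete with a small counting sketch. The hit records and transcript identifiers below are invented for illustration and do not come from MICASE; only the figures of 562 tokens and roughly 1.7 million words are taken from the article.

```python
# Hypothetical search hits as (transcript_id, idiom) pairs; invented data.
hits = [
    ("LEC01", "carrot and stick"),
    ("LEC01", "carrot and stick"),  # repetition within the same transcript
    ("LEC01", "out of whack"),
    ("SGR02", "out of whack"),
]


def count_tokens(records):
    """Every occurrence counts, including repetitions."""
    return len(records)


def count_instances(records):
    """Each idiom counts at most once per transcript."""
    return len(set(records))


def per_million(count, corpus_words):
    """Normalize a raw count to a rate per million words."""
    return count / corpus_words * 1_000_000


tokens = count_tokens(hits)
instances = count_instances(hits)
print("tokens:", tokens, "instances:", instances)

# Using the article's own totals: 562 tokens in roughly 1.7 million words.
overall = per_million(562, 1_700_000)
print(f"overall token rate: about {overall:.0f} per million words")
```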

Pragmatic Functions
As mentioned above, McCarthy (1998) strongly supports a discourse-oriented approach to the study of idioms in spoken language, which he points out has not been thoroughly researched for the occurrence and functions of idioms. Accordingly, we discuss the most salient discourse functions associated with certain idioms in MICASE that best embody these functions. Of the functions identified here, several (paraphrase, emphasis, and metalanguage) are particularly relevant in the context of academic discourse. However, by identifying one particular discourse function for an idiom, we do not mean to overlook the fact that a single idiomatic expression often performs more than one function. Moon (1998) refers to this phenomenon as cross-functioning, or the use of an expression in functions other than its primary or most obvious one in the discourse. Many of the idioms we discuss below exhibit such cross-functioning.

Evaluation
Like McCarthy (1998), who illustrates evaluative uses of idioms in conversational interactions, we found idioms used for these purposes in academic discourse. Two examples from MICASE that illustrate this function are out of whack and threw her for a loop. Example 1, while providing an evaluation, is also an instance of what McCarthy refers to as the observation plus comment function (p. 133), in which evaluative comments are preceded by a factual observation. Example 2 illustrates the observation made by Strässler (1982) and supported by McCarthy that when speakers use an idiom for evaluation, it is much more likely to occur in a third-person context, especially if it represents a threat to face.

1. the paintings didn't seem to use the conventional skills of perspective the perspective is all out of whack here. um and that the paint itself is kind of slopped onto the canvas with these heavy dark art- outlines, without the. . . . (Visual Sources lecture)¹

2. they would just be like, what are you talking about? you know just, suck it up deal with it hang with it you know and, she_ it really threw her for a loop and she actually um, she came_ she, was in the math department she then chose to transfer to the statistics department (Women in Science conference panel)

We find, however, that under certain circumstances, face-threatening idioms can be used in a second-person context. Example 3 is taken from a study-group interaction between two students. The two speakers are peers and obviously friends, as the interaction is tempered with both sarcasm and humor, two important factors that mitigate the face threat implicit in the idiom.
3. S2: what were you saying?
S3: I was mumbling something to myself, I guess.
S2: oh I'm sorry that I interrupted your conversation with yourself.
S3: I thought I was talking to you but I guess not. <Laugh>
S2: you were just mean to me again. <S3 Laugh> oh you've got this evil side that rears its ugly head. . . . (math study group)

Description
We also found idioms used for description, a function that often overlaps with that of evaluation, representing an instance of cross-functioning, discussed above. Evaluative uses are often also descriptive (as in Example 1), but description does not always entail evaluation. Examples of idioms used in this manner in MICASE are hand in hand, run of the mill, and out of whack.
4. the Roman empire was not acquired for, economic reasons. this is very different, from European empires, uh in the eighteenth and nineteenth centuries. uh where both national pride and, uh notions of economic expansion go hand in hand. (lecture on sports in ancient Rome)

¹ Transcription conventions are as follows:
<>      contextual comment
[S1]    speaker identification
[SU]    unidentified speaker
. . .   words omitted
(xx)    unclear speech
word_   truncated speech

5. the reason why we like really good, premium ice creams, is because they have high fat content, that's the main thing that separates a premium ice cream from your run-of-the-mill ice cream. (Introduction to Psychology lecture)

6. if your thyroid is severely out of whack, it has all the exact same symptoms of depression. (Introduction to Psychopathology lecture)

The primary function of each of these expressions is descriptive, and in Examples 4 and 5 the idiom is used to highlight a contrast. Example 6 is an instance of a speaker using an idiom for economy of expression: the unanalyzable chunk out of whack here substitutes for not functioning properly or another more detailed, literal description.

Paraphrase
Another function closely related to description is that of providing a paraphrase or gloss of the discourse content. This function is particularly well suited to academic spoken discourse, given its heavy use of explication. Examples from MICASE illustrating this use are put up a stink, no mean feat, and a dime a dozen, which also illustrate McCarthy's (1998) observation-plus-comment function. Such paraphrasing uses of idioms often have the effect of reducing the distance between the speaker and listener, through the juxtaposition of either a formal academic word or a longer literal phrase with a more colloquial expression (as in Example 7). Moon (1998) discusses a similar interactional function in her analysis of idioms and other fixed expressions. The use of a casual, almost slang expression could indicate an attempt on the part of a speaker to reduce the formality and highly transactional nature of academic discourse.

7. so women knew that uh if they were labeled noncompliant if they were you know put up a stink about where they delivered their children, this might have a negative impact on health care for their whole family, . . . . (Medical Anthropology lecture)

8. at the beginning of airlifting women out it was not a very easy thing to do getting those women out of there to deliver them in hospitals was no mean feat. (Medical Anthropology lecture)

9. we conclude that uh, {[SU] we win.} these these uh, you know percentages of of favoring Bush are a dime a dozen they they they, they occur so often, that what's more important is is, a key decision of a reversal that Bush made. . . . (Ethics Issues in Journalism discussion/lecture)


Emphasis
Speakers also use idioms to emphasize content or reinforce an explanation, two functions that complement the goals of academic discourse particularly well. In such cases, there is a tendency for the speaker to repeat the idiom, often with truncation or creative variations. One such instance is the idiom carrot and stick, used by a lecturer six times in the span of about 5 minutes to explain the concept of rewards and punishments in the context of international relations. Another such case is Example 10, where the kitchen sink is used three times with variations to refer to a statistical concept. Put the heat on, discussed later, serves a similar function, where multiple speakers use variants of the idiom to reiterate or emphasize the same point. In fact, idioms are often associated with a certain amount of hyperbole, which is closely related to emphasis. Examples 8 and 9 above, identified as paraphrasing uses, both add a degree of emphasis as well.

10. mostly in, doing data analysis we're interested in posing particular questions that're interesting to us, and we're not interested in explaining the total variance in our outcome by throwing everything in but the kitchen sink. so, by looking only at R-square of how good is our model that would be the kind of what I would call the kitchen sink model, and in education data even the kitchen sink model is not gonna do you very well, the_ unless you have, unless your outcome is . . . . (Statistics in Social Science lecture)

Collaboration
Idioms can also be used to create collaborative discourse and establish a sense of solidarity within a group of speakers. Like emphasis, collaborative uses of idioms are often achieved through repetition. McCarthy (1998) found idioms performing the same function in his conversational data when participants express shared views and ideas. A striking example of this from MICASE is the way multiple speakers use put the heat on in discussing ethics in journalism. In fact, in this extended excerpt (Example 11), the discussion at its most animated point actually revolves around the idiom itself. The example also shows how certain idioms allow variations in form. The base form of the idiom, put the heat on, produces at least six creative or syntactic variants from different speakers. Thus we have under some heat, puts some heat on, put heat on, putting heat on themselves, heat put on them, and put more heat on them.
11. S9: just I, I uh tend to think why is it anybody's business, to see these photos? I mean, the guy is dead, (. . .) and if four drivers have died in the last 9 months, I'm sure NASCAR will take some steps to correct that and it, doesn't require the involvement of journalists.
S1: are they more likely to take steps if the press puts some heat on them?
SS: mhm yeah
S2: the press has already put heat on em they're putting heat on themselves. I don't think it's necessary.
S1: well they're obviously not having enough heat put on them because it keeps happening. {S2: I think (xx)} I mean for_ at least it's possible. yes?
S10: showing pictures of a dead body is not gonna put the heat on em that's necessary to make the changes though, I mean it's gonna be the reporting on the incident, you know it's th- one photo of a dead_ I mean I i just think that it's, definitely an invasion of privacy, that isn't necessary.
(several minutes/turns later:)
S1: yeah, there's the danger it will get published, number one number two it's not clear to me I don't know how you feel but it's not clear to me that, having someone look at this photo's gonna make any difference as you said, whether it's gonna put more heat on NASCAR I don't, it doesn't seem clear. (Ethics Issues in Journalism discussion/lecture)

Metalanguage
In academic discourse, the use of idioms in the metalanguage is particularly interesting. In her analysis of idioms, Fernando (1996), among others, observes that the academic register is a highly complex one and requires distinctive organization and signposting devices which function as creators of coherence and intelligibility (pp. 232–233). Moon (1998) mentions a similar organizational function for fixed expressions and idioms, in which they organize texts by signaling logical connections between, for example, propositions and summaries. Some of the idioms in our data that clearly function as such signals or signposting devices are go off on a tangent, on that note, cut to the chase, and train of thought. In Example 12 the speaker uses the idiom to signal that the commentary is about to move in a different though not entirely irrelevant direction. In Example 13 the idiom signals a change of focus, with the added implication that the discussion is drawing to a close. In Example 14 the idiom functions as an organizational device to create coherence in the discourse, and in Example 15 the idiom is a filler commenting on the apparent lack of coherence. In each of these examples, the meaning of the idiom is much more closely tied to the function than in the previous examples.


12. and the Basques by the way have been some in the history of Spain, if I may just uh, go off on a tangent the Basques have been, in Spanish history and Spanish religious history some of the most fervent Catholics. (Historical Linguistics lecture)

13. well on that note, have another cup of coffee on your way out and let's thank John and Ivette for a really nice (xx) (Ecological Agriculture colloquium)

14. what to do? what to do? lemme do it this way. you will never stand for this derivation I know it. <Laugh> not today. the weather's too nice, so I'm gonna cut to the chase and I might come back and fill in some details on Monday, (Chemical Engineering lecture)

15. um, uh... the issue, comes down to, lost my thou- I almost lost my train of thought here. um, how . . . . (graduate education advising)

PEDAGOGICAL APPLICATIONS
The discovery of a signicant number of idioms in a corpus of academic speech and, more importantly, the evidence that they perform a variety of important pragmatic functions provides the rationale for including them in an EAP curriculum. We therefore outline our general approach to the teaching of idioms and offer some suggestions for incorporating corpus data on idioms into classroom materials by introducing some corpus-based pedagogical materials that we developed for our own teaching. We then discuss some of the challenges the learner faces in attempting to acquire English idioms. Like Wray (2000), in reference to formulaic sequences more generally, we advocate striking a balance between a holistic approach that focuses on learning idioms as chunks, that is, paying attention only to their composite meaning, and an analytical approach that teaches the meaning of an idiom by explaining the meaning of its constituent parts. In L2 teaching, some idioms lend themselves better to the latter approach even though most native speakers store them as holistic chunks. For example, students are much more likely to understand and remember the idiom a drop in the bucket if they know what drop and bucket mean, as the metaphor is rather transparent even if speakers do not commonly associate the literal meaning with its composite, metaphorical meaning. Similarly transparent idioms include, for example, on the fringe, shift gears, and making out like bandits. In contrast to these are idioms like run of the mill, cut to the chase, one fell swoop, or on that note, none of which embodies transparent conceptual metaphors for which knowledge of the lexical components or origins of the expressions is likely to contribute to
learning and remembering their forms and meanings. The point here is that when the conceptual metaphor of a given idiom has a relatively high degree of transparency, it may well be advantageous, as Wray (2000) puts it, to "[legitimize] the classroom learner's inherent desire and ability to analyse" (p. 484). When the holistic meaning of the idiom is too far removed from the domains of the conceptual metaphor, however, it makes much more sense to resist the desire to analyze and encourage students to learn the idiom purely as a chunk. The teacher's task, then, is first to convince students to learn some, if not most, idioms as chunks and not attempt a constituent analysis, which becomes in many cases a futile exercise with little payoff in terms of learning outcomes. Second, as teachers, we must also recognize that some metaphorical idioms lend themselves to a constituent analysis and that we would do students a disservice to insist that they learn all idioms as unanalyzed chunks simply because this is the way native speakers seem to learn and process them.

Another strategy we advocate is to incorporate the notion of discourse function into the teaching of idioms. As mentioned above, some idioms are in fact used in quite predictable ways, so that the pragmatic function is integral to the meaning. For instance, on that note and shift gears are typically used at a discourse boundary, signaling the end of one topic or episode and a transition to another. Similarly, the idiom lost my train of thought is almost always used in a situation of disfluency. Other idioms, as previously noted, may be used for a variety of pragmatic functions, but even though the correspondence between function and meaning is not as strong, teaching idioms from a function-based perspective still offers significant advantages. Students should be made aware of the discourse functions (e.g., evaluation, emphasis, paraphrase) idioms are associated with and likewise should be taught to identify them.

Corpus-Based Teaching Materials


According to McCarthy (1998), idioms are highly interactive, engaging both the speaker and the listener, and are therefore best studied in context, yet they tend to be taken out of their contexts and taught as disembodied items. Moreover, if a context is provided, it tends to be an imagined and contrived one. Using real speech samples from contexts that learners will be exposed to has distinct advantages over using conventional methods for teaching idioms. Also, although the highly interactive nature of idiomatic expressions calls for a discourse-oriented pedagogical approach, for practical reasons such an approach may not be possible in every classroom. To compensate for the difficulty of constructing the appropriate interactional climate for the teaching of
idioms, McCarthy proposes the raising of students' awareness of idiom usage as a first step.

At several workshops on idioms we have conducted at our university's English Language Institute, students responded positively to an approach that began with consciousness raising and moved on to the introduction of idioms in authentic discourse contexts. First, we briefly discussed the nature of an idiom and had students identify opaque idiomatic expressions in excerpts from transcripts of spoken discourse. Next, we gave them the contexts of selected idioms from MICASE and helped them identify the lexical and semantic cues in a speaker's utterance that would be helpful in understanding the meaning of the idiom. In this exercise, we found that contextual extracts rich in discourse markers signaling, for example, similarity, contrast, and sequence lend themselves particularly well to teaching. Similarly, a high frequency of content words provides learners with additional contextual clues that help in processing the meaning of an idiom. The ideal context for pedagogical purposes, however, is one in which the idiom is preceded or followed by a paraphrase or gloss, such as in Examples 7–9.

In terms of pedagogical materials, some students responded well to a multiple-choice exercise designed to summarize the meaning of an unfamiliar idiom (see Exercise 1 in the Appendix). However, excerpts from the corpus that provided a context for each idiom proved to be the most popular (Exercise 2 in the Appendix). In selecting such excerpts for teaching, teachers must take care to choose excerpts that are rich in contextual clues, as described above. We also used audio recordings of selected excerpts from the corpus to facilitate listening comprehension and to attune students to how idiomatic expressions sound in native speaker interactions.
The excerpt in Example 11 illustrating the use of put the heat on was particularly effective as a listening exercise because the same idiom recurs frequently in an extended stretch of discourse. Based on our experience, we would not recommend the use of single, unedited concordance lines of spoken text as teaching materials. Such data were not as compelling and seemed to pose comprehension difficulties for the students. These difficulties can be avoided if the examples are cleaned up slightly and presented as individual excerpts with more than a single line of context.

Additional teaching suggestions arising from this study include comparing and discussing perceived differences in the communicative effects of using an idiom versus an alternative literal expression (with the caveat that such differences can sometimes be difficult to ascertain). For example, possible communicative effects as referred to in the section on pragmatic functions include exaggeration, informality, and rhetorical flair. Again, the examples illustrating paraphrase, in which the literal expression occurs with the idiomatic expression, are particularly useful
in this regard. Another valuable exercise would be one that showed the same idiom used for different discursive functions, to reinforce the idea that idioms are not necessarily bound to a single function. Finally, some general observations about the frequency of idioms may be worthwhile, such as the fact that they are not limited to strictly informal contexts, as illustrated by the Nobel laureate physics lecture mentioned above.

Another useful resource that we culled from this research was a list of 20 idioms (see Table 3) that lend themselves particularly well to an academic context, based primarily on their semantic content but also partly on frequency. Although some of these idioms have a low frequency in our corpus, we believe they provide a useful starting point in compiling a list of idioms for a future EAP curriculum. Finally, we present here a list of the most frequent idioms in MICASE: 32 idiom types that occurred four or more times in the corpus (Table 4). Whereas this list is not an authoritative or definitive list of the most common idioms in academic speech, given the stated limitations of a 1.7-million-word corpus, it is nevertheless a worthwhile resource. Some of these high-frequency idioms were used by only one speaker or in one speech event, or appear in contexts that do not lend themselves particularly well to teaching, but many are prime candidates for inclusion in an EAP or other advanced ESL curriculum.
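As a rough illustration only (not the procedure used in this study), a frequency list and multi-line teaching excerpts of the kind described above could be sketched in Python along the following lines. The pattern dictionary, the function names, and the default context size are assumptions made for the example, and any automatic matches would still need to be checked by hand for literal uses.

```python
import re
from collections import Counter

# Hypothetical search patterns; optional plurals and inflections are assumptions.
IDIOM_PATTERNS = {
    "bottom line": r"\bbottom line\b",
    "carrot(s) and stick(s)": r"\bcarrots? and sticks?\b",
    "shift gears": r"\bshift(?:s|ed|ing)? gears\b",
}


def count_idioms(transcripts, patterns=IDIOM_PATTERNS):
    """Count tokens of each idiom across an iterable of transcript strings."""
    counts = Counter()
    for text in transcripts:
        for name, pattern in patterns.items():
            counts[name] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def frequent_idioms(counts, min_tokens=4):
    """Return (idiom, tokens) pairs at or above the threshold, most frequent first."""
    return sorted(
        ((name, n) for name, n in counts.items() if n >= min_tokens),
        key=lambda pair: (-pair[1], pair[0]),
    )


def find_excerpts(utterances, pattern, context=2):
    """Return each matching utterance together with `context` utterances on either side."""
    regex = re.compile(pattern, re.IGNORECASE)
    excerpts = []
    for i, utterance in enumerate(utterances):
        if regex.search(utterance):
            start = max(0, i - context)
            excerpts.append(utterances[start:i + context + 1])
    return excerpts
```

Returning a window of neighbouring utterances, rather than a single concordance line, reflects the observation above that students coped better with excerpts offering more than one line of context.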

Difficulties for Learners


One final point to keep in mind is that idioms in authentic discourse do not always occur in their canonical forms. Such variations could pose difficulties in comprehension when learners encounter idioms as actually

TABLE 3
Particularly Useful Idioms for English for Academic Purposes Curricula

Idiom                            Total tokens in MICASE
bottom line                      17
the big picture                  7
carrot and stick                 7
chicken-and-egg question         1
come into play                   16
draw a line between              7
get a grasp of                   1
get a handle on                  4
get to the bottom of things      1
go off on a tangent              3
hand in hand                     8
hand-waving                      2
in a nutshell                    3
ivory tower                      3
litmus test                      1
on the same page                 4
play devil's advocate            3
shift gears                      1
split hairs                      4
thinking on my feet              1

TABLE 4
Idioms Occurring Four or More Times in MICASE

Idiom                                             Total tokens
1. bottom line                                    17
2. the big picture                                16
3. come into play                                 14
4. what the hell                                  12
5. down the line                                  11
6. what the heck                                  10
7. flip a coin; flip side of a/the same coin      10
8. on (the right) track                           9
9. knee-jerk                                      8
10. hand in hand                                  8
11. right (straight) off the bat                  7
12. carrot(s) and stick(s)                        7
13. draw a/the line (between)                     7
14. on target                                     7
15. thumbs up                                     7
16. fall in love                                  6
17. out the door                                  6
18. rule(s) of thumb                              6
19. take (something) at face value                6
20. beat to death                                 5
21. put the heat on                               5
22. a ballpark idea/guess                         4
23. come out of the closet                        4
24. full-fledged                                  4
25. get a handle on                               4
26. goes to show                                  4
27. nitty-gritty                                  4
28. on the same page                              4
29. ring a bell                                   4
30. split hairs                                   4
31. take (make) a stab at it                      4
32. take my/someone's word for it                 4

used by fluent speakers. A number of idioms are subject to truncation. Examples from MICASE include haven't the foggiest, where the word idea was left out; rearing its head, where the word ugly was left out; and simply carrot (used to refer to carrot and stick). Speakers subject idioms to what Barlow (2000) refers to as creative blending. We found flip side of the same coin, with the insertion of the word same (presumably for emphasis), and an even more creative derivational extension, coin flipping. We also found arm twisting, from twist someone's arm, and one nit to pick, from nitpicky. The kitchen sink model and under some heat, referred to above, are two more examples of this process. McCarthy (1998), commenting on this phenomenon, observes that speakers use idioms creatively by a process of unpacking them into their literal elements and exploiting these (p. 137).

We also found performance variations, a term we use to describe idiomatic expressions that are not instances of conscious creative manipulation or exploitation as described above but rather appear to be spoonerisms or the accidental substitution of a different verb, noun, or pronoun and, in at least one case, the use of the idiom for a meaning entirely different from its commonly understood one. Some examples are walking through a landmine instead of walking through a minefield, side in your thorns instead of a thorn in your side, and pick up where we took off instead of pick up where we left off. The idiom used with a different meaning was take a stab at, used to mean criticize instead of its usual attempt
to do something. Although none of these variants is likely to cause comprehension difficulties to a fluent or native speaker, a learner who uses a reference guide to idioms or a textbook with only the pure forms of an idiom could indeed encounter some difficulties with these forms. By using authentic examples and including naturally occurring variants, students can be alerted to the fact that despite the generally fixed nature of idioms, creative and unpredictable uses occur.

Another set of interesting but potentially problematic idioms are those that rely on and assume a specific cultural schema for interpretation. From the MICASE data, we identified cry wolf, in bed with, wearing the pants, the same ballpark, and revolving door as examples of such idioms. Each of these expressions is bound to a specific social or cultural context that may or may not be part of a learner's schema. For example, wearing the pants alludes to the person in a household who has the ultimate authority, assumed at one time to be the man, who was traditionally the one who would wear pants; by extension, the idiom refers to whoever has the implied authority in a particular situation. To understand the meaning of the idiom in bed with first necessitates associating the literal phrase with the concept of intimacy and then making the further analogy of two parties having a secret affair to two companies or organizations granting each other illicit favors or privileges. These idioms are among those that would best be taught through explanation and analysis rather than merely as unanalyzable chunks.
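To make the point about noncanonical forms concrete, the following Python sketch shows one way a materials writer might flag truncated or expanded variants alongside canonical forms when searching a transcript. It is an illustration under stated assumptions, not part of the study: the choice of canonical forms, the looser patterns, and the function name are all hypothetical, and a loose pattern will inevitably admit some false matches.

```python
import re

# Hypothetical entries: a strict pattern for the assumed canonical form and a
# looser one that tolerates a dropped or inserted word.
VARIANT_PATTERNS = {
    "rear its ugly head": {
        "canonical": r"\brear(?:s|ed|ing)? its ugly head\b",
        "loose": r"\brear(?:s|ed|ing)? its (?:\w+ )?head\b",
    },
    "flip side of the coin": {
        "canonical": r"\bflip side of the coin\b",
        "loose": r"\bflip side of (?:a|the) (?:\w+ )?coin\b",
    },
}


def classify_idiom_forms(utterance, patterns=VARIANT_PATTERNS):
    """Label each idiom found in the utterance as 'canonical' or 'variant'."""
    results = []
    for idiom, forms in patterns.items():
        if re.search(forms["canonical"], utterance, re.IGNORECASE):
            results.append((idiom, "canonical"))
        elif re.search(forms["loose"], utterance, re.IGNORECASE):
            results.append((idiom, "variant"))
    return results
```

Checking the strict pattern first means that an utterance is labeled a variant only when the canonical wording is absent, which keeps the two categories from overlapping.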

CONCLUSION
Our research has shown, for one, that idioms occur in academic speech and are not as rare a phenomenon as they might appear when taken as a whole. Secondly, the distribution of idioms in the subgenres of academic speech seems not to be predictable on the basis of categories of either level of interactiveness or academic division. Rather, we would conclude that the use of idioms seems to be a feature more of individual speakers' idiolects than of any linguistic or content-related categories. Some speakers in our corpus used idioms quite frequently whereas others rarely did, regardless of their socio-interactional roles. In terms of a functional discourse grammar of academic speech, the role of idioms cannot be ignored, as they clearly fulfill several important functions particularly relevant to the unique discourse features of this speech genre. Finally, we hope we have shown that a specialized corpus such as MICASE provides a rich resource for teaching materials that allow teachers not only to use authentic, attested examples of idioms in context but also to consider larger issues of discourse and sociopragmatics. This type of resource should go a long way toward relieving teachers of
the need to create contrived contexts for idioms and teach them as disembodied items. Two important methodological advantages ensue when a corpus can be consulted for examples of idiom usage. First, the idioms can be presented in authentic contexts rather than in the contrived ones often found in textbooks or thought up by teachers. A second and closely related methodological benefit of using a corpus is that idioms can then be taught from a discourse perspective rather than as isolated lexical items, with attention not only to their immediate context but also to their sociopragmatic and interactional features. Both points are important because another challenge of learning idioms is developing an awareness of when it is appropriate to use a particular idiom. A specialized corpus that directly reflects the students' universe of discourse is particularly valuable in this regard, as it provides not only attested examples of idioms in use but examples embedded in contexts that learners will find familiar and relevant.
ACKNOWLEDGMENTS
An earlier version of this article was presented at the 2002 conference of the American Association for Applied Linguistics in Salt Lake City. We gratefully acknowledge the contributions of Angela Komsic, our coauthor on that paper. We also thank the two anonymous reviewers and the editor of this special issue for their insightful suggestions for revising this article, and our colleagues at the English Language Institute, who provided helpful comments at different stages of our research.

THE AUTHORS
Rita Simpson is a research associate at the English Language Institute of the University of Michigan. She is currently project director for MICASE and served as the founding project manager from 1997 until 2002. Her research interests include discourse analysis and pragmatics of academic speech as well as various topics in corpus linguistics.

Dushyanthi Mendis is a senior lecturer at the Department of English at the University of Colombo. She is currently a doctoral student at the University of Michigan's Department of Linguistics, researching the use of metaphor in academic speech.

REFERENCES
Barlow, M. (2000). Usage, blends, and grammar. In M. Barlow & S. Kemmer (Eds.), Usage-based models of language (pp. 315–345). Stanford, CA: CSLI.
Biber, D., & Conrad, S. (2001). Quantitative corpus-based research: Much more than bean counting. TESOL Quarterly, 35, 331–336.
Collins Cobuild. (n.d.). Corpus concordance sampler. Retrieved May 30, 2003, from http://titania.cobuild.collins.co.uk/form.html
Fernando, C. (1996). Idioms and idiomaticity. Oxford: Oxford University Press.
Madden, C., & Rohlck, T. (1997). Discussion and interaction in the academic community. Ann Arbor: University of Michigan Press.
Makkai, A. (1972). Idiom structure in English. The Hague: Mouton.
Mauranen, A. (in press). Spoken corpus for an ordinary learner. In J. Sinclair (Ed.), Corpus linguistics and language learning. Amsterdam: Benjamins.
McCarthy, M. (1998). Spoken language and applied linguistics. Cambridge: Cambridge University Press.
McCarthy, M., & O'Dell, F. (1997). Vocabulary in use: Upper intermediate. Cambridge: Cambridge University Press.
Moon, R. (1998). Fixed expressions and idioms in English. Oxford: Clarendon Press.
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. Oxford: Oxford University Press.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and communication (pp. 191–226). New York: Longman.
Redman, S., & Shaw, E. (1999). Vocabulary in use: Intermediate. Cambridge: Cambridge University Press.
Schmitt, N. (2000). Vocabulary in language teaching. Cambridge: Cambridge University Press.
Scott, M. (1996). WordSmith Tools (Version 3.0) [Computer software]. Oxford: Oxford University Press. (Available from http://www.liv.ac.uk/~ms2928/)
Simpson, R. C., Briggs, S. L., Ovens, J., & Swales, J. M. (2002). The Michigan corpus of academic spoken English. Ann Arbor: The Regents of the University of Michigan. Retrieved June 4, 2003, from http://www.hti.umich.edu/m/micase/
Strässler, J. (1982). Idioms in English: A pragmatic analysis. Tübingen, Germany: Gunter Narr.
Weinreich, U. (1969). Problems in the analysis of idioms. In J. Puhvel (Ed.), The substance and structure of language (pp. 23–81). Berkeley: University of California Press.
Wray, A. (1999). Formulaic language in learners and native speakers. Language Teaching, 32, 213–231.
Wray, A. (2000). Formulaic sequences in second language teaching: Principle and practice. Applied Linguistics, 21, 463–489.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Wray, A., & Perkins, M. R. (2000). The functions of formulaic language: An integrated model. Language and Communication, 20, 1–28.

APPENDIX Corpus-Based Exercises


Exercise 1: Explaining the Meaning of Idioms
Circle the answer that best explains the meaning of the idiom.

1. take a stab at
   a) try to do   b) criticize   c) fail at   d) betray
2. on target
   a) fixed as an absolute   b) completely accurate   c) not moving   d) busy at work
3. shift gears
   a) wait for a few minutes   b) end the discussion   c) move to a different topic   d) start an argument
4. tune someone out
   a) ignore them   b) go somewhere with them   c) praise them   d) misunderstand them
5. keep tabs on
   a) agree with something   b) continue at the same pace   c) observe or record carefully   d) keep a secret
6. odds and ends
   a) the final events   b) strange events   c) harsh words   d) various small items
7. garden variety
   a) new and exciting   b) common type   c) forbidden or illegal   d) colorful
8. take the plunge
   a) commit to something important   b) make a large profit   c) set unrealistic goals   d) appear out of nowhere
9. off the wall
   a) a waste of time   b) dangerous or risky   c) odd or unexpected   d) very important
10. take someone to task
   a) scold or criticize them harshly   b) talk to them privately   c) buy them lunch   d) encourage them to succeed

Exercise 2: Idioms in Context


Now read the following excerpts from MICASE, and see if you can figure out the meaning of the idiom from its context.

1. there were some definitions in the reading anyone, wanna take a stab at it? mhm?
2. estimator three has the lowest overall variability, even though it might not be on target.
3. anything else about The American that we wanna touch on? well let me shift gears for a few minutes then to prep you for what we'll do next week.
4. my boss talks to himself all the time and I've learned to tune him out when he's doing that so I don't even hear what he says.
5. how about the roles of the other people in the group? cuz you've been kind of the minder in a way keeping tabs on things and keeping things going.
6. I've got a few pieces of business and odds and ends I have to deal with and I want to say a little bit about how things are going with the projects so I'd like to reconvene if we could at half past nine.
7. next slide shows we do, in addition, we do have your sort of garden variety of clear cutting, so there is some extensive forest cutting going on, although the present time, that's sort of stagnated because of . . . .
8. for the time of the post-doc, you know, others will expect you to function something like an assistant professor. so again it just depends on the institution but it's a wonderful opportunity for you to move to that next step before you take the plunge of getting on a tenure-track line.
9. wow, that's an interesting question um, you know, I mean the first thing that comes to my mind is this may be a little off the wall but the first thing that comes to mind is this has done a lot for me as a musician but I haven't had the time to actually be a musician, so I'm looking forward to how this will transpire in my practice and in my playing. this, really doesn't have much to do with, the academic part but the musician was sort of on hold for a year . . .
10. the poem basically carries that, you know that old-fashioned, reactionary idea that the primary purpose for women in life is to bear children and that's what this poem is all about. Diane Di Prima, another San Francisco poet, took him to task on the subject, and you'll find her poem on page three sixty-one. (. . .) so it's good to know that she was she was ready to take him to task, on that subject. that'd be kind of an interesting thing for you to study if you're talking about gender, relations and gender ideas in the Beat Generation would be to contrast, that poem by Gary Snyder with that poem, uh, by Diane Di Prima.
11. that fifty-fifty means in terms of the money, spent. half the money is spent on treatment, half the money is spent on law enforcement. maybe the half that's spent on treatment goes a lot farther. you get more bang for your buck in that fifty than you get in the other fifty.


A Corpus Analysis of Would-Clauses Without Adjacent If-Clauses


STEFAN FRAZIER
University of California Los Angeles, California, United States

This article reports the findings of a corpus analysis of a grammatical structure taught in intermediate- or advanced-level ESL/EFL texts: clauses that contain the modal would to signify hypothetical and counterfactual meaning. Contrary to the way these structures are represented in ESL/EFL textbooks (as would-clauses adjacent to conditional clauses with if), these corpus data indicate that would-clauses in counterfactual/hypothetical environments occur more often quite distant from or entirely without any corresponding if-clauses. Often the hypothetical and counterfactual conditions are present but are marked in ways other than by prototypical if-clauses. The study categorizes the conditional and hypothetical uses of would-clauses in spoken and written corpora, and it offers pedagogical suggestions based on the findings.

During the week of May 6, 2002, amid the child-abuse scandal that rocked the Catholic Church, the U.S. weekly magazine Newsweek ran a cover story entitled "What Would Jesus Do?" (Meacham, 2002). It is unlikely that anyone who read this headline asked, "What would Jesus do if what?" Framed as a question about a conditional result and marked by a hypothetical would-clause, the condition itself was immediately accessible to the magazine's readership through the dominance of the story in all media at the time. The findings of the present study demonstrate that counterfactual and hypothetical conditions are in fact very often, if not usually, contextually implied rather than explicitly stated in an environment adjacent to their consequent would-clauses. This usage of hypothetical/counterfactual conditionals is rarely covered in ESL/EFL grammar textbooks or in standard reference grammars. The purpose of this article, therefore, is to demonstrate empirically the quantitative prevalence of such would-clauses, provide a qualitative analysis of the tokens found in the corpora used for this project, and suggest pedagogical implications.

TESOL QUARTERLY Vol. 37, No. 3, Autumn 2003


MULTIPLE FUNCTIONS OF WOULD


The English modal would has several functions, often causing confusion for L2 learners as to its appropriate usage. One occurrence of would is in hypothetical conditional sentences, referring either to a present imaginary state or event, as in the following sentences:
1. It would be enough if committees were to meet. (Santa Barbara Corpus of American Spoken English [SBC] 0010)¹
2. If, say, the Russians intended to stop Tom Jones going to the pub, then Tom Jones would fight the Commies. (Brown corpus C)

Examples 1 and 2 are hypothetical conditionals in that they describe imaginary situations of relatively remote possibility. In Example 1 the speaker is in a business meeting with several other participants discussing the possible need for certain committees to meet and when their meeting times might be. The speaker's use of the hypothetical conditional over, say, a more real conditional (e.g., It'll be enough if committees meet) reflects a degree of tentativeness toward the possibility. In Example 2 the writer is using an imaginary person to support a point about a political issue.

The modal would is also used in referring to a past hypothetical or counterfactual state or event, in combination with the perfect aspect have + -en:
3. Like if I was, George Bush's, son like the first one? I probably wouldn't've even gone to school, you know I would've had all the money, I wouldn't've wanted, wanted to be a politician. (Michigan Corpus of Academic Spoken English [MICASE], Discussion [DIS] 115)
4. Yet had he not visited the girl at Saw Buck he would never had been involved in this latest tangle. (Brown N)

Note that in Example 4 the if-construction is replaced by subject-auxiliary inversion. Examples 3 and 4 are different from Examples 1 and 2 in that, apart from their different time frame, they refer to clear impossibilities: The speaker in Example 3 cannot become George Bush's son, and the person named he in Example 4 did, in fact, visit the girl at Saw Buck and cannot undo that past action. They are both thus counterfactual.

Examples 1–4 all exhibit both overt condition and result clauses, following the general formula
{if + [condition], or S-AUX inversion}, (then) would + [result].
¹ See the Method section for a description of these corpora.


In the first example, the conditional and result clauses are reversed, though this order is not considered typical (Ford & Thompson, 1986; Hwang, 1979). Quite often, however, the results of conditions appear in one of the following circumstances: (a) with conditional if-clauses that are distant from their corresponding would-clauses, (b) in the absence of any condition clause whatsoever, or (c) with clauses in which the condition is marked by a structure other than an if-clause, as in these instances:
5. Lynne: Could you imagine all those horses out there, if we had 'em all shod? that's
   Lenore: How many have you got.
   Lynne: Thirty bucks, for twenty horses-
   Lenore: Twenty.
   Lynne: Twenty-eight horses, you know? I mean, Jesus, that would be a th- nine hundred dollars, (SBC 0001)
6. But you could substitute anything in the Krebs cycle for pyruvate, and it would work. (DIS 175J)
7. When someone says, for example, "They took X rays to see that there was nothing wrong with me," it pays to consider how this statement would normally be made. (Brown F)

In Example 5, would in Lynne's last utterance is separated from its condition by several clauses and some amount of interactional work. In Example 6, the hypothetical condition is implied by the clause preceding the would-clause (and could be expressed as and if we did substitute something for pyruvate . . .). In Example 7 an if-clause could be tacked on at the end but would be superfluous: . . . if the statement were made.
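The distinction between adjacent, distant, and absent if-clauses can be approximated mechanically as a first pass over a corpus. The Python sketch below is an assumption-laden illustration rather than the coding procedure used in this study: the function name, the three labels, and the three-sentence look-back window are all hypothetical. It does not separate hypothetical would from other uses (e.g., past habitual or polite formulas), nor does it detect conditions marked by structures other than if; both still require manual analysis.

```python
import re

# Matches would, wouldn't, would've, and wouldn't've (contracted 'd is ignored
# because it is ambiguous between would and had).
WOULD = re.compile(r"\bwould(?:n't)?(?:'ve)?\b", re.IGNORECASE)
IF = re.compile(r"\bif\b", re.IGNORECASE)


def classify_would_sentences(sentences, window=3):
    """Label each sentence containing would by where the nearest if occurs.

    'adjacent': an if appears in the same sentence
    'distant':  an if appears within `window` preceding sentences
    'none':     no if appears in that span
    """
    labels = []
    for i, sentence in enumerate(sentences):
        if not WOULD.search(sentence):
            continue
        if IF.search(sentence):
            label = "adjacent"
        elif any(IF.search(prev) for prev in sentences[max(0, i - window):i]):
            label = "distant"
        else:
            label = "none"
        labels.append((i, label))
    return labels
```

A sentence-based window is only a proxy for the interactional distance illustrated in Example 5; in conversational data, counting speaker turns instead of sentences may be the more natural unit.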

WOULD IN REFERENCE GRAMMARS AND ESL/EFL TEXTBOOKS


Reference texts generally discuss conditional forms at only the sentence level, but they also provide important clues to the use of these forms at higher levels of discourse. These clues are essential to understanding would-clauses without overt conditional if-clauses. Quirk, Svartvik, Leech, and Greenbaum (1985) provide a complex and comprehensive accounting of all modal and conditional forms, noting specifically that would hypotheticals appear in many contexts without if (p. 234). Celce-Murcia and Larsen-Freeman (1999, chapter 27) present most of their conditional examples in single sentences, but the volume in general recommends the analysis of all grammatical structures at the discourse
level, and coverage on conditionals includes discussion of the Bull Framework² (p. 553) as well as short segments of authentic discourse (mostly from other sources, pp. 557–560). Finally, although Biber, Johansson, Leech, Conrad, and Finegan (1999) omit explicit mention of would-clauses in the absence of if-clauses, their analyses are all based on authentic data from corpus research and are therefore informed by larger units of discourse than the sentence.

The literature in functional linguistics also provides some clues to the pragmatic uses of conditionals in general (not just hypothetical/counterfactual conditionals), allowing a more contextual overview of the presuppositions that conditional structures carry. One important text is On Conditionals (edited by Traugott, ter Meulen, Reilly, & Ferguson, 1986). In this book, Fillenbaum (1986) and Van der Auwera (1986) discuss conditional sentences of inducements and deterrents in which the if-clause may drop the if and replace it with and or or before the result clause; for example, If you open the window, I'll kill you becomes Open the window and I'll kill you (Van der Auwera, 1986, p. 206). Haiman (1986) notes further that this kind of substitution renders the conditional clause paratactic (i.e., places it in connection with its result clause with no connecting words), raising it from subordinate to coordinate status, and warns that these structures should not be dismissed as marginal idiosyncracies of English (p. 218). This warning is certainly valid, and although Haiman (1986) does not discuss hypotheticals or counterfactuals specifically, the notion that conditionals without if-clauses carry different meanings from their if-clause equivalents is well worth further study.
A brief survey of coverage of hypothetical and counterfactual conditionals in several popular ESL textbooks (Carter, Hughes, & McCarthy, 2000; Celce-Murcia & Hilles, 1988; Danielson, Porter, & Hayden, 1990; Frodesen & Eyring, 2000; Raimes, 1998a, 1998b; Riggenbach & Samuda, 1997; Thewlis, 1997) shows a marked neglect of discussion on hypothetical or counterfactual would-clauses that do not occur in the same sentence as their supposed conditional if-clauses, and on the use of these clauses within larger units of discourse. Notable exceptions that use authentic discourse are the texts by Carter et al. and Frodesen and Eyring. However, almost all discussions of presentation, analysis, and practice in these texts offer invented sentences in the following forms:
² The Bull Framework provides a tabular method for determining the tense and aspect choices of verbs in different time frames. In an example taken from Celce-Murcia and Larsen-Freeman (1999), the present and future hypothetical or present counterfactual is marked by the simple past in the if-clause and a would structure in the result clause (If you mowed my lawn, I would give you $5). The past counterfactual takes the past perfect in the if-clause and a perfect-aspect would in the result clause (If you had mowed my lawn, I would have paid you $5 [p. 553]).


(even) if + [condition], (then) would + [result]
would + [result] (even) if + [condition]
had + [condition], would + [result] (with subject-verb inversion)
condition clauses with unless
condition clauses with as if, as though, in case, or provided that
result clauses with otherwise
wish sentences

Of these, only wish sentences do not contain an overt conditional, and these are generally covered in sections of their own.

Four of the texts reviewed go some distance toward explaining, presenting, or encouraging exposure to alternative instances of would-clauses. Riggenbach and Samuda (1997) include one (invented) conversation (pp. 375–376) and one practice exercise (p. 378) in which learners are presented with and encouraged to offer a series of would-clauses after a single initial if-clause (If we were on a desert island . . .), demonstrating at least that not all would-clauses need to be immediately adjacent to their corresponding if-clauses. (The unattested notion that an if-clause is necessary somewhere nearby is still reinforced, however.) Celce-Murcia and Hilles (1988) propose controlled exercises in which the instructor tells a personal past hypothetical story (such as If I had married my high school sweetheart . . ., p. 56), a genre that, if told naturally, is quite likely to display several instances of would have without a nearby if, thereby offering learners a relatively authentic example on which to model their own stories. The authors also suggest playing songs in the classroom that exhibit authentic models of conditional forms. Danielson et al. (1990) include a section entitled Conditional Sentences With Implied If-Clauses (p. 144), explaining at the paragraph level of written discourse the framing function of a single if-clause that dictates the presence of several following would-clauses. Finally, Frodesen and Eyring (2000) present one explanation box on the notion of the implied unreal condition (p. 237), though this box appears in a unit on modal perfect verbs rather than in the unit on conditionals and pertains only to past counterfactuals.

Several of the cited textbooks and analyses include comprehensive treatments of one technically hypothetical use of would, namely, the polite formulaic expressions (e.g., Would you like . . . ?, I'd ask you to . . .) important to the contextual analysis of conversation. This treatment is crucial to functional approaches to language pedagogy and indispensable to the language student's pragmatic competence. However, as the structures in question are near-frozen idiomatic expressions rather than those characteristic of novel syntactic structures, they are excluded from the analysis below.

Among the grammar accounts cited here is one standout: Carter et al.'s (2000) work, which is representative of a new generation of discourse-based approaches to grammar instruction. The authors make specific reference to the use of would in contexts of underlying conditional idea[s] (i.e., ideas in which an if-clause is not necessarily present) and contexts of volition (p. 42). In this discussion and elsewhere, the emphasis is on investigating grammatical structures in longer passages of authentic, corpus-drawn discourse and the benefits learners enjoy from inducing patterns of usage within these larger passages. Notwithstanding this one exception, the traditional instructional paradigm dominates: Would-clauses are almost always presented with adjacent or nearby if-clauses. The data examined in this article suggest that this treatment of would-clauses in ESL/EFL texts needs to be improved (as Carter et al., 2000, have begun to do). In particular, based on a corpus of oral and written language, the study addresses the following questions:
1. How many hypothetical or counterfactual would-clauses occur with adjacent if-clauses and in other positions?
2. What are the functions of the would-clauses occurring without adjacent if-clauses, and what are the relative frequencies of the functions?

METHOD
The investigation drew on a combination of quantitative and qualitative analyses of three corpora.

Corpora
The three corpora used for this project were
1. the approximately 1,014,000-word Brown corpus (Francis & Kucera, 1964; see also Francis, 1979)
2. the current approximately 140,000-word sample of the Santa Barbara Corpus of Spoken American English (SBC; see Chafe, DuBois, & Thompson, 1991; Tao, 2001)
3. 25,420 words in three transcripts of the Michigan Corpus of Academic Spoken English (MICASE; Simpson, Briggs, Ovens, & Swales, 1999) made from the audio data of classes at a major U.S. university

All three corpora consist of American English. The Brown corpus comprises 500 approximately 2,000-word pieces of written prose from a variety of sources. Though somewhat dated in subject matter and writing style (the selections are from the early to mid-1960s), the Brown data exhibit usages of hypothetical and counterfactual conditionals similar to contemporary usages and are therefore appropriate for analysis here. The SBC and the discussion (DIS) sections of MICASE represent the spoken data for this analysis. The SBC data cover informal conversational English. The portions of MICASE were chosen to demonstrate the use of hypothetical conditionals in demonstrations (Clark & Gerrig, 1990; see the Qualitative Analysis section). The three classes are discussion sections led by teaching assistants (TAs) in introductory anthropology (DIS 115), introductory biology (DIS 175J), and economics (DIS 280).

Procedures
Because of the relatively fixed use of the modal would in clauses expressing the result of hypothetical and counterfactual conditions, I was able to use concordancing software (MonoConc Pro, Barlow, 2000) to search for the term would and its contracted form 'd (with editing to weed out contractions of had). The search returned all occurrences of would with its surrounding context. The modal would occurs, however, in situations other than the focus here, so I eliminated tokens that conveyed historical past tense of will (especially in reported speech):
8. She shrugged her shoulders and said that Reuveni wanted her to marry him. I asked her if she would, and she said she would not. (Brown P)

directly quoted speech, inside quotation marks in the written data:


9. "This is a poor boy's bill," said Chapman. "Dallas and Fort Worth can vote bonds. This would help the little peanut districts." (Brown A)

habitual past:
10. Carolyn: But we got Missis Lindberg, who was like, the first granola woman I ever met.
Pam: Granola woman
Carolyn: (sigh) She would say, Let's go on a field trip. (SBC 004)

formulaic expressions (polite requests; see the discussion in the previous section):
11. Alina: Would you move, so I can come park my car. (SBC 006)


After identifying the occurrences, I determined their relationships to if-clauses and determined the function of would-clauses without adjacent if-clauses. The analysis involved all the tokens identified in the spoken corpora. However, complete analysis of the large number of tokens in the Brown corpus was not feasible; therefore, I selected one of every seven tokens from the concordance results for analysis. In all, 467 conditional and hypothetical would-clauses from the three corpora were included in the study (see Table 1).
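The search-and-sample procedure described above can be illustrated with a minimal Python sketch. It is not the MonoConc Pro workflow used in the study; the regular expression, the context width, the function names, and the sample sentence are all assumptions made for illustration. The pattern also captures negated forms such as wouldn't, and contractions of had still have to be weeded out by hand, as in the original procedure.

```python
import re

# Assumed pattern: "would" (optionally negated) or the contraction "'d",
# with a straight or curly apostrophe. Contracted "had" also matches
# and must still be removed manually.
WOULD_PATTERN = re.compile(
    r"\bwould(?:n['\u2019]t)?\b|(?<=\w)['\u2019]d\b", re.IGNORECASE
)

def concordance(text, width=40):
    """Return (left context, hit, right context) for every match."""
    lines = []
    for m in WOULD_PATTERN.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

def every_seventh(tokens, step=7):
    """Systematic sample: keep one of every `step` tokens, starting with the first."""
    return tokens[::step]

# Hypothetical sample text assembled from phrases quoted in the article.
sample_text = (
    "I asked her if she would, and she said she would not. "
    "Ideally I'd want them both. That would be good."
)
hits = concordance(sample_text)
for left, hit, right in hits:
    print(f"{left:>40} [{hit}] {right}")
```

Each hit is printed with its surrounding context so that excluded uses (reported speech, habitual past, polite formulas) can be identified by reading, which is a judgment the pattern itself cannot make.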

FINDINGS

Quantitative Analysis of Conditional Categories


For the quantitative analysis, I determined the conditional category of each would-clause (see Table 2):
adjacent if-clauses, in which an accompanying overt if-clause directly adjoins the would-clause. This category includes alternatives to straight if-constructions, such as even if-, only if-, and unless-clauses, as well as examples with subject-auxiliary inversion.
nonadjacent if-clauses, in which an accompanying overt if-clause is more than one clause distant from the would-clause
no overt conditional
alternative conditionals, in which the condition is marked in other ways than the use of an if-clause

TABLE 1
Token Count of Would-Clauses Included in the Study, by Corpus

Corpus             Tokens    Tokens per 1,000 corpus words
Spoken
  SBC              154       1.1
  MICASE            80       3.1
  Total            234       1.4
Written
  Brown            233a      1.6b
Overall total      467       1.6

Note. SBC = Santa Barbara Corpus of Spoken American English; MICASE = Michigan Corpus of Academic Spoken English. a Based on one seventh of the tokens in the corpus. b Calculated from all the tokens in the corpus.
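The normalized rates in Table 1 follow from dividing token counts by corpus size. The short Python sketch below reproduces them using the approximate corpus sizes given in the Corpora section. The Brown figure rests on an assumption: the full token count is not reported, so it is estimated here as seven times the 233 sampled tokens.

```python
def per_thousand(tokens, corpus_words):
    """Normalized frequency: tokens per 1,000 corpus words."""
    return tokens / corpus_words * 1000

# Token counts from Table 1; approximate corpus sizes from the Corpora section.
spoken = {
    "SBC": (154, 140_000),
    "MICASE": (80, 25_420),
}

for name, (tokens, words) in spoken.items():
    print(f"{name}: {per_thousand(tokens, words):.1f}")

spoken_tokens = sum(t for t, _ in spoken.values())
spoken_words = sum(w for _, w in spoken.values())
print(f"Spoken total: {per_thousand(spoken_tokens, spoken_words):.1f}")

# Assumption: the full Brown count is roughly seven times the sampled 233.
brown_estimate = 233 * 7
print(f"Brown (estimated): {per_thousand(brown_estimate, 1_014_000):.1f}")
```

Normalizing in this way is what allows the small MICASE sample to be compared with the much larger Brown corpus.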


TABLE 2
Would-Clauses by Conditional Category

                     Adjacent      Nonadjacent    No overt       Alternative
                     if-clauses    if-clauses     conditional    conditionals    Indeterminate
Corpus               n     %       n     %        n     %        n     %         n     %
Spoken (N = 234)     45    19      16    6.8      54    23       115   49        4     1.7
Written (N = 233)    54    23      3     1.3      16    6.9      160   69
Total (N = 467)      99    21      19    4.1      72    15       275   59        4     0.8

Note. Percent columns show the percentage of the total tokens in the corpus. Percentages may not add to 100% because of rounding.

The spoken data included several would-clauses whose category could not be determined because the speaker cut an utterance short. The clearest implication of the analysis of would-clauses shown in Table 2 is that much more often than not, conditional and hypothetical clauses with the modal would are not accompanied by if-clauses anywhere near them, much less in the same sentence. Would-clauses with nonadjacent if-clauses, those with no overt conditionals, and those with alternative conditionals (Columns 3, 4, and 5 of Table 2) together account for nearly 80% of the hypothetical/counterfactual would-environments in the corpus. This finding runs counter to the findings of Quirk et al. (1985) that the most typical context in which hypothetical would/should occurs (p. 234) is in the presence of if.
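The share of would-clauses without an adjacent if-clause can be checked from the counts in Table 2. The Python sketch below is illustrative only: the dictionary layout and function name are assumptions, the blank indeterminate cell for the written corpus is treated as zero, and the combined figure is summed from the spoken and written rows.

```python
# Counts from Table 2, by corpus type (blank written cell assumed to be 0).
table2 = {
    "spoken": {"adjacent": 45, "nonadjacent": 16, "no_overt": 54,
               "alternative": 115, "indeterminate": 4},
    "written": {"adjacent": 54, "nonadjacent": 3, "no_overt": 16,
                "alternative": 160, "indeterminate": 0},
}
NOT_ADJACENT = ("nonadjacent", "no_overt", "alternative")

def share_without_adjacent_if(counts):
    """Percentage of tokens in the three categories lacking an adjacent if-clause."""
    total = sum(counts.values())
    return 100 * sum(counts[c] for c in NOT_ADJACENT) / total

for corpus, counts in table2.items():
    print(f"{corpus}: {share_without_adjacent_if(counts):.1f}%")

combined = {c: table2["spoken"][c] + table2["written"][c] for c in table2["spoken"]}
print(f"combined: {share_without_adjacent_if(combined):.1f}%")
```

Under these assumptions the spoken and written shares both fall between 75% and 80%, in line with the "nearly 80%" reported above.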

Qualitative Analysis: The Functions of Would-Clauses Without Adjacent If-Clauses


Analysis of the three types of would-clauses without adjacent if-clauses revealed the following functions:
nonadjacent if-clauses: conditional frames
no overt conditional
  tentativeness and varying degrees of commitment
  emphatic negativity
alternative conditionals
  non-if conditions in hypothetical environments (including otherwise-clauses)
  non-if conditions in counterfactual environments (including wish-clauses)
  displaced perspectives in demonstrations

The following analysis covers these subcategories in turn, offering examples of each within their larger contexts. In many cases, the relevant conditionals are actually signaled very near the would-clause, but an analysis of their surrounding contexts provides a much richer accounting for the presence of these clauses than a sentence-level analysis. I do not discuss otherwise- and wish-clauses here as they are commonly covered in ESL/EFL textbooks and reference grammars.

Nonadjacent If-Clauses: Conditional Frames


Often a single if-condition acts as the trigger or frame for a series of results that pertain to that condition. As the series continues, the would-clauses appear at greater and greater distances from their initial overt condition. Thus, in Example 12, the speaker picks a hypothetical situation to exemplify a point. An initial framing if-clause foregrounds the situation, and several results of the initial condition follow:
12. I mean it just depends what makes you happy. Like if I was, George Bush's, son like the first one? I probably wouldn't've even gone to school. You know I would've had all the money, I wouldn't've wanted, wanted to be a politician. I would've been happy you know? get a job from my dad and (DIS 115)

In Example 13, a segment from a business meeting, the conversation has turned to the topic of how often to meet. Possibly because of the topic's tentative nature, there is a good deal of interactional pressure between the conversing participants, with occasional conversational side sequences. Phil's final analysis in his last turn thus comes at a sizable remove from his initial conditional statement.
13. Brad: One thing that comes to mind is, is the, well the question of meeting quarterly, is that enough. Meaning having, you know, the-- It would be enough if committees were to meet. If committees yeah. And that's the-- But-- That-- that's-- that was Paul's, uh, desire. And I think was uh-- that's what we wanna talk ab- we'll talk about that there too. And say look Mhm.

Phil: Brad: Phil: Brad: Phil: Brad: Phil: Brad:



Phil: Brad:

You know, is quarterly-- you know, we- would we need to meet every other month. Mhm. (SBC 0010)3

The result clause is accessible to Brad (as evidenced by his acknowledgment token Mhm in the last line) because it remains within the conditional context framed by his own earlier if-clause. The framing function of if-clauses often occurs in written articles in which the overall subject matter is a hypothetical possibility. (The written use of this framing device is covered in Danielson et al., 1990, although as Examples 12 and 13 show, the device is not limited to written language.) For example, at the time of this writing, voters in the city of Los Angeles were debating the possible secession of three of its urban areas, including the San Fernando Valley. The Los Angeles Times Op-Ed section ran several articles (as it did often over many months) weighing in on the hypothetical possibility of the Valley's secession. Example 14 is one of the many sentences, even entire paragraphs, that included would-clauses giving consequences of the hypothetical secession but needing no if-clause because of the (literal) framing of the issue on the page.
14. Valley cityhood would increase accountability, both in the Valley and in Los Angeles. Valley residents would have their own mayor and city council members who live in the Valley. L.A. residents would enjoy better representation because their council districts would be 40% smaller. (Katz, 2002, p. B13)

The condition in the first two sentences in Example 14 is actually marked intrasententially via the lexical noun phrase Valley cityhood, but that phrase would need to be expanded to a full if-clause if there were no framing conditional. (If-clauses indeed occur earlier and later in the article.)

No Overt Conditional: Tentativeness and Varying Degrees of Commitment


Commentators often refer tentatively to contexts of hypothetical possibility because, without any certainty that hypothetical events will actually take place (or took place in the past), speakers and writers cannot be sure that the condition will have (or had) any consequences. In Example 15, the author is hazarding a guess on the mental state of a writer drafting a text:
3 Many of the transcription markers in the SBC (e.g., indicating overlap, breathing sounds, transcribers' notes) have been omitted here for the sake of clarity of presentation. These markers are essential for other types of projects; for this corpus analysis of grammatical structures, however, they could be omitted with no detriment.


15. It is most probable that Freud and the Oedipus complex never entered his [D. H. Lawrence's] head in the writing of this story. He was simply writing a story that wanted to be told, and in the writing a childhood fantasy of his own emerged. He would not have cared why it emerged, he only wanted to capture a memory to play with it again in his imagination and somehow to fix and hold in the story the disturbing emotions that accompanied the fantasy. (Brown G)

The author of Example 15 does not use a would-clause in earlier sentences, appearing quite certain (perhaps as a literary critic indulging a certain expertise) that D. H. Lawrence was actually simply writing a story, that a childhood fantasy emerged, and that he only wanted to capture. . . . But the author of Example 15 is less certain as to exactly what Lawrence would or would not have cared about and makes a guess by using a would-clause. Such tentative suggestions are often marked with the idiomatic expression It would seem, as in the following example:
16. First of all there is ample area in East Greenwich already zoned in the classification similar to that which petitioner requested. This land is in various stages of development in several locations throughout the town. The demand for these lots can be met for some time to come. This would seem to indicate that we are trying neither to halt an influx of migrants nor are we setting up such standards for development that only the well-to-do could afford to buy land and build in the new sites. (Brown B)

The use of the infinitive structure after the copular seem is notable in Example 16. Thus one might argue that this sentence carries triple hypotheticality: the use of would, the tentative copula seem, and an infinitive, which according to Bolinger (1968) often carries a feature of hypothesis or potentiality (p. 124). Tentative suggestions occur in the spoken data as well. The MICASE examples selected for this project (discussion sections of university classes) offer contexts in which TAs lecture on certain topics. A TA is in some ways a professor's lieutenant, meaning that the professor actually has the final authority on questions; thus a TA's comments in a discussion section may sometimes take on a tentative tone:
17. TA: And I mean especially in a, in a really big society like ours in in a, state society, um, there are different, there are different, factors and different choices, that um, you know an advantage maybe there wouldn't be quite as many, um, but there're different, influences, that can, come in. (DIS 115)


A key word in Example 17 is maybe, one more marker (besides the would structure) of a tentative stance. Examples 15–17 index the speaker's or writer's degree of commitment toward his or her statement. Celce-Murcia and Larsen-Freeman (1999), in their analysis of reported speech, describe mental commitment (p. 691) as a factor in determining whether or not a speaker chooses a backshifted tense. I suggest that varying degrees of commitment can be indexed by many other grammatical structures; in the examples above, a would-clause outside the context of an if-clause can fulfill this purpose.

No Overt Conditional: Emphatic Negative Statements


In a less prominent, but still remarkable, usage of would in non-overt conditionals, speakers and writers show their denial of or disbelief toward a hypothetical or counterfactual possibility. In my corpus data, all the would structures in these situations were connected with the use of the adverbial never. In Example 18, Rebecca, a lawyer, is speaking with her client about a case they have brought against a man for lewd behavior on a subway:
18. Rebecca: And, especially, I have some young single men, on my jury panel.
Rickie: Mhm,
Rebecca: And, I, my worry is that they don't relate to what a woman feels whe[n so]mething like that is happening
Rickie: [Mhm].
Rebecca: Because their experience would be totally different if a man exposes [himself]
Rickie: [(SNIFF)]
Rebecca: Which, a man would never do that. (SBC 0008)

Rebecca is noting her emphatic attitude toward a hypothetical situation related to the actual case at hand. This emphatic negative usage also appears in writing, though much more rarely. The three written tokens in the current data appear only in fiction or popular lore in environments where the writer is appropriating the voice of a character in the text (i.e., taking on the perspective of one of the characters). Example 19 is an excerpt from a story by Ralph J. Salisbury (On the Old Santa Fe Trail to Siberia):
19. When Johnson ejaculated How's about my buying us all a nice cold Cocola, Ma'am? Mrs. Roebuck smilingly declined and began suddenly to go on about her son, who was onleh a little younguh than you bawhs. Johnson never would have believed she had a son that age. (Brown N)


Using the verb of consciousness believe, Salisbury is momentarily placing himself inside the mind of the character Johnson and reporting Johnson's attitude toward a particular situation.

Alternative Conditionals: Non-If Conditions in Hypothetical Environments


This category is the largest in the present data. In almost all of the cases, the would-clause indicating the consequences of a certain condition has an implied, covert conditional (like the Newsweek cover story mentioned in this article's introduction). The conditionals are marked grammatically in four primary ways: (a) with indefinite reference; (b) via infinitives, gerunds, or adverbials; (c) via anaphora referring to a previous implied condition; or (d) with the generic you. Examples of each of these subcategories follow.

Indefinite reference. These subtypes are much more common in the written than in the spoken data. Additionally, although the indefinite reference is generally made in the would-clause, the availability of a larger surrounding context paints a more accurate picture of the condition. In Example 20, the meaning of the would-clause would be uncertain if the referent to these were not provided, and its presence has bearing on the hypothetical nature of the would-clause:
20. Richards' view of the aesthetic experience might constitute a sixth variety: for him it constitutes, in part, the organization of impulses. A sketch of the emotional value of the study of literature would have to take account of all of these. (Brown G)

The indefinite reference to a sketch, marked with the indefinite article a, refers not to an actual sketch but (a figurative) one that could be drawn, that is, a hypothetical one. And the reference to these (i.e., impulses) is necessary in juxtaposing the real element in the sentence (emotional value) with the hypothetical (a sketch).

Infinitives, gerunds, and adverbials. Like indefinite references, these subtypes are much more common in the written data (but see Example 25 below). They often occur, as do indefinite references, with the condition and consequential elements marked adjacently; however, as above, looking at surrounding context leads to more insight.
21. If there is nothing evil in these things, if they get their moral complexion only from our feeling about them, why shouldn't they be greeted with a cheer? To greet them with repulsion would turn what before was neutral into something bad; it would needlessly bring badness into the world; and even on subjectivist assumptions that does not seem very bright. On the other hand, to greet them with delight would convert what before was neutral into something good; it would bring goodness into the world. (Brown J)

What is interesting here is the framing of the discourse within a real (as opposed to an unreal) condition in the first sentence followed by the possible consequences of this framing in the hypothetical (because there are two conflicting and unresolved possibilities). The direct hypothetical conditions leading to the would-clauses are implied by the infinitive structure within their own sentences (to greet), once again invoking Bolinger's (1968) principle, which attributes a sense of hypotheticality to infinitives. The same effect may be achieved by gerund constructions, however, somewhat undermining Bolinger's (1968) principle, which attributes to gerunds a more reified (p. 124) implication. The following example comes from a written discussion of the possible restructuring of a university division:
22. How well do faculty members govern themselves? There is little evidence that they are giving any systematic thought to a general theory of the optimum scope and nature of their part in government. They sometimes pay more attention to their rights than to their own internal problems of government. They, too, need to learn to delegate. Letting the administration take details off their hands would give them more time to inform themselves about education as a whole, an area that would benefit by more faculty attention. (Brown H)

The first would-clause in Example 22 is the result of an implied conditional in the gerund phrase Letting the administration take details off their hands. The whole sentence could be reformulated as If they let the administration take details off their hands, they would have more time to inform themselves. . . . But this would carry a different tone. Haiman (1986) points out that removing the word if from a conditional clause can raise the subordinate clause to a coordinate (i.e., make the matrix and subordinate clauses paratactic) and lend the sentence a different pragmatic value. In Example 22, the embedded nature of the gerund letting does not quite allow for paratactic status, but nominalizing the condition and placing it within the subject of the sentence (thus avoiding a subordinate clause) shows the stronger position of the writer regarding the possibility of the condition. An if-clause (and, arguably, an infinitive structure) would be more hypothetical and more concessive; in Example 22, the only hypothetical marker is the would construction itself. Example 22 also contains an adverbial that expresses a condition without if: by more faculty attention answers the question How would the area benefit? and could be paraphrased as if they had more faculty attention. Other common adverbials that mark conditions are ordinarily, without + [noun phrase], and ideally:
23. REQUIREMENTS: This help [subsidies to mineral miners] is offered to applicants who ordinarily would not undertake the exploration under present conditions or circumstances at their sole expense and who are unable to obtain funds from commercial sources on reasonable terms. (Brown H)

24. Without the good magazines, without their [the good magazines'] book reviews, their hospitality to European writers, without above all their awareness of literary standards, we might very well have had a generation of Krim's heroes -- Wolfes, Farrells, Dreisers, and I might add, Sandburgs and Frosts and MacLeishes in verse -- and then where would we be? Screwed, stewed, and tattooed, as Krim might say after reading a book about sailors. (Brown G; a negative critique of writer/critic Seymour Krim)

25. Phil: [He's a] banker
Brad: [Yeah].
Phil: And he'd be good on one hand. [But],
Brad: [Yeah].
Phil: I would like, ideally I'd want 'em both. (SBC 0010)

Anaphora referring to a previous implied hypothetical condition. This subcategory was prevalent in both the spoken and the written data. The spoken data include many would structures in environments in which the speaker is offering an evaluation or assessment of a hypothetical suggestion previously described, via a pronoun referring to that scenario. In Example 26, the interlocutors are discussing possibilities for cooking a meal, and in Marilyn's last turn, that refers to the idea (that she herself has offered) of preparing chutney sauce:
26. Marilyn: Well we could make
Pete: I mean, that doesn't matter, I suppose [it just]
Marilyn: [Oh], you know what, we have this neat [island] man[go sauce].
Roy: [Fields]
Pete: [Mm].
Roy: [(sneeze)]
Marilyn: Chutney [sauce].
Roy: [Chut]ney.
Marilyn: That would be good. (SBC 0003)


This subtype of implied conditional is not limited to the spoken data. Paragraph cohesion often requires the use of anaphora in written language as well, as in Example 27, in which the author is discussing the virtues of a positivist approach to philosophical argument:
27. The first argument is thus an ideal experiment in which we use the method of difference. It removes our present expression and shows that the badness we meant would not be affected by this, whereas on positivist grounds it should be. (Brown J)

Example 27 shows an interesting combination of real and unreal conditional structures in a hypothetical environment: The passage could be paraphrased as If we made the first argument, we would use the method of difference, which would remove our present expression and show. . . .

Generic you. Examples in this category appear only once in the written data but enough in the spoken to render them important. They seem to occur when the speaker or writer is adding moral commentary to something just said or written, as in Example 28, when Miles and Jamie are discussing the questionable sexual practices of an acquaintance of theirs who apparently refuses to engage in safe sex:
28. Miles: I- I was just really amazed to hear that, [people in their twenties]
Jamie: [We're innocent].
Miles: 'Cause I figure
Jamie: We don't have [sex].
Miles: [growing] up around here you would, know better. (SBC 0002)

Alternative Conditionals: Non-If Conditions in Counterfactual Environments


These instances of past counterfactual would-clauses without accompanying if-clauses appeared in the present data generally in descriptions of past events using either past tense or historic present tense (in which a past event is described using present tense). This function is commonly covered in ESL/EFL textbooks and thus will not be discussed further here. Quite often, however, a past counterfactual is used to provide a contrast between two times (either past and present or past and prior past). In Example 29, the character Rousseau is bemoaning how times have changed, employing the past counterfactual to highlight an essential difference (for him) between past and present:


29. What is slovenly about me? Rousseau asked. Is it because of my slovenliness that hair grows on my face? Surely it would grow there whether I washed myself or not. A hundred years ago I would have worn a beard with pride. And those without beards would have stood out as not dressed for the occasion. Now times have changed, and I must pretend that hair doesn't grow on my face. That's the fashion. (Brown K)

Alternative Conditionals: Displaced Perspectives in Demonstrations


As noted in the discussion of Example 17 above, the selection of the MICASE data (the three discussion sections of university lecture classes) is partly responsible for the prevalence of would conditionals indicating a certain degree of tentativeness. Also, in such settings a TA is charged with clarifying to students in a hands-on, practical manner the concepts they have learned and read about in their lectures. For example, in the biology discussion section that makes up part of MICASE, the TA is constantly drawing figures on the board and referring back to them. But a board diagram is merely an approximation of what is being described, not a fully authentic temporal or spatial rendition. In the discussion section data, these situations often call for the use of would structures, indexing a description of an object that exists, or an event that takes place, from a perspective displaced from its original reality. In Example 30, for instance, the TA is explaining a concept in an introductory biology course in genetics:
30. TA: Um, alternatively, we could show crossing over okay? And what that would look like, is let's say, you had a, reciprocal exchange between, the A gene on, chromatids two and three okay? What that would look like, ((0.6-sec pause while instructor writes)) would be something like this. B's, were not involved so they stay the same ((0.5-sec pause while instructor writes)) and you should be able to follow this through and see, why you wouldn't get, all parentals here. (DIS 175)

The would structure in Example 30 is not the only marker of a nonpresent, imaginary event: The TA also uses the fairly idiomatic Let's say . . . along with the past form of the verb (had), another way of marking a hypothetical condition. Yet another marker of tentativeness is the phrase something like this. Example 30 is actually the beginning of a long string of similar instances in which the TA is either drawing on the board while explaining the significance of the genetic event or referring to an existing diagram. What is a speaker doing in such a situation? Because the event in Example 30 is a description of the unreal or imaginary, the TA must momentarily place herself in an imaginative space, thus shifting her footing (Goffman, 1981), in this case from a literal description of an event too small to actually produce to a demonstration of the event. Although this function of would conditionals might correspond to the well-documented imaginary function of conditional structures, this accounting does not seem adequate. Clark and Gerrig (1990), in their detailed analysis of quotations as demonstrations, differentiate descriptions from demonstrations: Descriptions of past events or actions are verbal utterances that interlocutors comprehend by recognizing . . . [the speaker's] intentions in uttering (p. 765) those events or actions whereas demonstrations are partial physical reenactments of those past events or actions. Demonstrations work by enabling others to experience what it is like to perceive the things depicted (p. 765), and are non-serious (p. 770) and selective (p. 771). In this way, Clark and Gerrig account for verbal and physical resources in spoken as well as nonlinguistic quotations (pp. 781–782). A classroom is not the only place where such instances of demonstration can occur. Although no such instances turned up in the (nonclassroom) SBC used for this project, research in sociology and anthropology has demonstrated that people in face-to-face conversation constantly use hand gestures and other nonlinguistic means to provide their interlocutors with demonstrations (see, e.g., Goodwin, 1996, 2002).

Quantitative Analysis of Functional Categories


Tables 3 and 4 show the number of would-clauses in the data that represent each of the subcategories explained in the previous section. Data for the spoken corpora (SBC and MICASE) are shown in Table 3, and Table 4 shows the breakdown of the data from the written (Brown) corpus, further categorized by genre. In the spoken data (Table 3), the MICASE data (the discussion sections) showed a higher incidence of conditional would-clauses than the SBC data did. Those clauses are concentrated in Categories (a) (adjacent if-clauses) and (g) (displaced perspectives in demonstrations). In the written data (Table 4), the only concentration of note appears in Category B, editorial. A look at the occurrences confirms what one may intuit about editorial writing: It contains a good deal of speculation (with the grammatical help of conditional would-clauses) about present and future political and societal events and situations as well as many suggestions whose hypothetical results often appear in conditional grammatical environments. Note that a sizable majority of these appear with no nearby if-clauses.
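The percentages reported in Tables 3 and 4 are shares of each corpus's total would-clause tokens. As a minimal sketch (the function name `category_share` is an illustrative assumption, not part of the study), the Category (a) and (g) figures for the spoken corpora can be recomputed from the counts and totals reported in Table 3:

```python
def category_share(n, total_tokens):
    """Percentage of a corpus's would-clause tokens that fall in one category."""
    return round(100 * n / total_tokens, 1)

# Counts reported in Table 3: total tokens per corpus and the tokens
# in Categories (a) and (g).
totals = {"SBC": 154, "MICASE": 80, "Total": 234}
cat_a = {"SBC": 21, "MICASE": 24, "Total": 45}
cat_g = {"SBC": 0, "MICASE": 20, "Total": 20}

for corpus, total in totals.items():
    share_a = category_share(cat_a[corpus], total)
    share_g = category_share(cat_g[corpus], total)
    print(f"{corpus}: (a) {share_a}%  (g) {share_g}%")
```

Computed this way, Categories (a) and (g) together account for more than half of the MICASE tokens, which is the concentration the paragraph above describes. The table appears to report some of these shares rounded to whole numbers.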


TABLE 3
Would-Clauses in Spoken Corpora by Conditional Category

Corpus    Total tokens   Tokens per 1,000 words   (a) n   (a) %   (g) n   (g) %
SBC       154            1.1                      21      14      0       0
MICASE    80             3.1                      24      30      20      25.0
Total     234            1.4                      45      19      20      8.5

[Categories (b)-(f) and (h): the column positions of the remaining values are not recoverable from this copy. Each group gives n (%) for SBC, MICASE, and Total: 12 (7.8), 4 (5.0), 16 (6.8); 2 (1.2), 2 (2.5), 4 (1.7); 6 (3.8), 6 (7.5), 12 (5.1); 74 (48.0), 12 (15.0), 90 (38.5); 35 (22.7), 12 (15.0), 47 (20.0); 4 (2.6), 0 (0), 4 (2).]

Note. (a) adjacent if-clause; (b) conditional frames; (c) tentativeness and varying expressions of commitment; (d) emphatic negative statements; (e) non-if conditions in hypothetical environments (including otherwise clauses); (f) non-if conditions in counterfactual environments (including wish-clauses); (g) (spoken corpus only) displaced perspectives in demonstrations; (h) (spoken corpus only) indeterminate. Percentages may not add to 100% because of rounding.


TABLE 4
Breakdown of the Written Corpus (Brown Corpus) by Conditional Category and Genre

Corpus                                      Total tokens   Tokens per 1,000 words
I. Informative prose
A. Press: reportage (88,660)                20             1.6
B. Press: editorial (54,405)                27             3.5
C. Press: reviews (34,255)                  5              1
D. Religion (34,255)                        9              1.8
E. Skills & hobbies (72,540)                12             1.2
F. Popular lore (96,720)                    20             1.4
G. Belles lettres, biography (151,125)      26             1.2
H. Miscellaneous (60,450)                   13             1.6
J. Learned (161,200)                        34             1.5
II. Imaginative prose
K. General fiction (58,435)                 12             1.4
L. Mystery & detective fiction (48,360)     14             2
M. Science fiction (12,090)                 5              2.9
N. Adventure & Western fiction (58,435)     15             1.8
P. Romance & love story (58,435)            15             1.8
R. Humor (18,135)                           5              1.9
Total                                       233            1.6

[Category columns (a)-(f): the cell positions of the per-genre n and % values are not recoverable from this copy. Values as extracted: 1 7 3 2 1 6 5 26 60 22 17 30 0 0 0 0 0 0 0 2 0 0 0 2 7.0 10.0 0 0 0 0 0 1 5.0 18 16 2 6 9 9 90 59 40 67 75 45 1 2 0 1 1 2 12.0 13.0 0 7.0 7.0 7.0 13 5.6 0 0 1 1 0 3 7.0 7.0 1.3 0 0 0 31 21 17 1 8 0 1 0 1 1 0 7 1 0 0 1 0 0 3 29 20 33 27 40 23 1 0 0 4 3 0 3 13 14 23 4 9 4 5 6 2 140 50 100 68 33 64 80 33 40 40 60 1 0 1 5 0 0 2 3 1 20 8 0 7 2 4 1 5 4 2; 5.0 7.0 11.0 8.0 10.0 4.0 3.0 42.0 13.0 20.0 20.0 8.6; 54.]

Note. (a) adjacent if-clause; (b) conditional frames; (c) tentativeness and varying expressions of commitment; (d) emphatic negative statements; (e) non-if conditions in hypothetical environments (including otherwise clauses); (f) non-if conditions in counterfactual environments (including wish-clauses); (g) (spoken corpus only) displaced perspectives in demonstrations; (h) (spoken corpus only) indeterminate. Percentages may not add to 100% because of rounding.

CONCLUSION
As noted in the introduction, this study's findings would seem to dictate, at the very least, a move away from the common practice of teaching hypothetical and counterfactual results in would-clauses only at the sentence level and only as adjacent to overt if-clauses. Most often, these types of clauses do not appear together. Instead, hypotheticals and counterfactuals with would often follow conditions that are not overtly marked or are indicated by means other than if-clauses. In other words, although hypotheticals and counterfactuals are quite fixed in their use of the modal would to mark the hypothetical or counterfactual nature of a clause, the marking of the condition is much more variable. Furthermore, because this finding has emerged from a close analysis of written and spoken language as authentically produced (rather than as a result of intuition), future ESL/EFL textbook writers should look to similar authentic language for examples in their textbooks and should develop a new, comprehensive categorization of conditional structures based on corpora of authentic language. Finally, this initial research could extend to other sorts of conditionals, for example, those traditionally referred to as "real." An additional concern arising from these findings, and from the findings of much other research based on corpus data, is that grammarians need to examine the terminology heretofore used in grammatical descriptions. For example, many of the so-called result clauses exemplified in this paper do not express results in any meaningful sense; even the term conditional may come into question, given that some of the examples considered here (e.g., Examples 29 and 30) are not conditionals but rather irrealis (defined by Mithun, 1999, as "situations . . . purely within the realm of thought, knowable only through imagination" [p. 173]) or the like. In addition, the names of many of the categories I have devised from the present analysis may require revision as additional data are investigated.
However, until now ESL/EFL textbooks and reference grammars have offered little discussion on these matters. One of the aims of this article, and one of the promises of corpus linguistics in general, is to coax more analysis and discussion of, and more rethinking of categories for, this grammatical construct and others whose received descriptions may be misleading. The results of this study also speak to the benefits of the balance between a quantitative and a qualitative accounting of corpus data. With the rise in popularity of the personal computer, linguists, language teachers, and textbook writers can check the relative statistical prevalence of many grammatical and lexical structures. Information of this sort can often play a role in determining, for example, a useful sequence in which to teach certain grammar structures. But a quantitative study would not be complete without the complementary, more intensive functional analysis, which reveals the diversity of environments in which a structure may appear, its different behaviors in different environments, and the types of structures that occur nearby. As the present study demonstrates, a structure can have different meanings and uses in different contexts, and one of the functions of pedagogical grammar texts and students' textbooks is to clarify those nuances for the language learner and language teacher.
ACKNOWLEDGMENTS
I am greatly honored to have had the guidance of Marianne Celce-Murcia and Hongyin Tao, whose influence and suggestions pervade this article and are too numerous to mention in footnotes. I am also grateful to two anonymous TESOL Quarterly reviewers. This project was partially supported by a summer internship stipend from the graduate division at the University of California, Los Angeles.

THE AUTHOR
Stefan Frazier is a doctoral candidate in applied linguistics/TESL at the University of California, Los Angeles. His research interests include discourse analysis, classroom group-work interaction, ESL composition pedagogy, functional grammar, conversation analysis, corpus linguistics, and the functions of gesture and embodiment in talk-in-interaction.

REFERENCES
Barlow, M. (2000). MonoConc Pro (Version 2.0) [Computer software]. Houston, TX: Athelstan.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Harlow, England: Pearson Education.
Bolinger, D. (1968). Entailment and the meaning of structures. Glossa, 2, 119–127.
Carter, R., Hughes, R., & McCarthy, M. (2000). Exploring grammar in context. Cambridge: Cambridge University Press.
Celce-Murcia, M., & Hilles, S. (1988). Techniques and resources in teaching grammar. Oxford: Oxford University Press.
Celce-Murcia, M., & Larsen-Freeman, D. (1999). The grammar book: An ESL/EFL teacher's course (2nd ed.). Boston: Heinle & Heinle.
Chafe, W., Du Bois, J. W., & Thompson, S. (1991). Towards a corpus of spoken American English. In K. Aijmer & B. Altenberg (Eds.), English corpus linguistics: Studies in honour of Jan Svartvik (pp. 64–82). London: Longman.
Clark, H. H., & Gerrig, R. J. (1990). Quotations as demonstrations. Language, 66, 764–805.
Danielson, D., Porter, P., & Hayden, R. (1990). Using English: Your second language (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.
Fillenbaum, S. (1986). The use of conditionals in inducements and deterrents. In E. C. Traugott, A. ter Meulen, J. S. Reilly, & C. A. Ferguson (Eds.), On conditionals (pp. 179–196). Cambridge: Cambridge University Press.
Ford, C. E., & Thompson, S. (1986). Conditionals in discourse: A text-based study from English. In E. C. Traugott, A. ter Meulen, J. S. Reilly, & C. A. Ferguson (Eds.), On conditionals (pp. 353–372). Cambridge: Cambridge University Press.
Francis, W. N. (1979). A tagged corpus – problems and prospects. In S. Greenbaum, G. Leech, & J. Svartvik (Eds.), Studies in English linguistics (pp. 192–209). New York: Longman.
Francis, W. N., & Kucera, H. (1964). A standard corpus of present-day edited American English, for use with digital computers. Providence, RI: Brown University, Department of Linguistics.
Frodesen, J., & Eyring, J. (2000). Grammar dimensions 4 (2nd ed.). Pacific Grove, CA: Heinle & Heinle.
Goffman, E. (1981). Forms of talk. Philadelphia: University of Pennsylvania Press.
Goodwin, C. (1996). Transparent vision. In E. Ochs, E. A. Schegloff, & S. Thompson (Eds.), Interaction and grammar (pp. 370–404). Cambridge: Cambridge University Press.
Goodwin, C. (2002). Pointing as situated practice. In S. Kita (Ed.), Pointing: Where language, culture and cognition meet (pp. 217–241). Hillsdale, NJ: Erlbaum.
Haiman, J. (1986). Constraints on the form and meaning of the protasis. In E. C. Traugott, A. ter Meulen, J. S. Reilly, & C. A. Ferguson (Eds.), On conditionals (pp. 215–228). Cambridge: Cambridge University Press.
Hwang, M. O. (1979). A semantic and syntactic analysis of if conditionals. Unpublished master's thesis, University of California, Los Angeles.
Katz, R. (2002, May 13). Core values are the engine behind independence. Los Angeles Times, p. B13.
Meacham, J. (2002, May 6). What would Jesus do? Beyond the priest scandal: Christianity at a crossroads. Newsweek, 139, 22–32.
Mithun, M. (1999). The language of native North America. Cambridge: Cambridge University Press.
Quirk, R., Svartvik, J., Leech, G., & Greenbaum, S. (1985). A comprehensive grammar of the English language. New York: Longman.
Raimes, A. (1998a). Grammar troublespots: An editing guide for students. Cambridge: Cambridge University Press.
Raimes, A. (1998b). How English works: A grammar handbook with readings. Cambridge: Cambridge University Press.
Riggenbach, H., & Samuda, V. (1997). Grammar dimensions 2 (2nd ed.). Pacific Grove, CA: Heinle & Heinle.
Simpson, R. C., Briggs, S. L., Ovens, J., & Swales, S. L. (1999). The Michigan corpus of academic spoken English. Ann Arbor: The Regents of the University of Michigan.
Tao, H. (2001). Discovering the usual with corpora: The case of remember. In R. Simpson & J. Swales (Eds.), Linguistics in North America: Selections from the 1999 symposium (pp. 116–144). Ann Arbor: University of Michigan Press.
Thewlis, S. (1997). Grammar dimensions 3 (2nd ed.). Pacific Grove, CA: Heinle & Heinle.
Traugott, E. C., ter Meulen, A., Reilly, J. S., & Ferguson, C. A. (Eds.). (1986). On conditionals. Cambridge: Cambridge University Press.
Van der Auwera, J. (1986). Conditionals and speech acts. In E. C. Traugott, A. ter Meulen, J. S. Reilly, & C. A. Ferguson (Eds.), On conditionals (pp. 197–214). Cambridge: Cambridge University Press.


Amplifier Collocations in the British National Corpus: Implications for English Language Teaching
GRAEME KENNEDY
Victoria University of Wellington Wellington, New Zealand

This study examines how adverbs of degree tend to collocate with particular words in the 100-million-word British National Corpus and considers some possible implications for English language teaching. The mutual information measure is used to show the strength of the bond between 24 selected amplifiers such as extremely or greatly and other words (typically adjectives or participles such as rare or appreciated, which result in collocations such as extremely rare or greatly appreciated). Each amplifier is shown to collocate most strongly with particular words having particular grammatical and semantic characteristics. Research in cognitive science has shown the extent to which words and collocations become established as units of learning depending on the frequency with which they are experienced. In the light of the corpus-based evidence on the nature of collocations presented in this study, the teaching of collocations might be expected to have a more explicit and prominent place in the language teaching curriculum. In class, teachers can draw attention to collocations not only through direct teaching but also by maximizing opportunities to acquire them through an emphasis on autonomous implicit learning activities such as reading.

For many years a central concern for language teaching has been how to specify what people learn when they learn a language. The major units of language learning have often been assumed to be similar to the traditional levels and units of language description, namely, the sounds, words, and rules of grammar and discourse. But since the 1930s, a succession of leading figures in the profession have urged English language teachers to recognise that particular words tend to occur in the company of other words and that fluent use of a language depends on learning to use these word groups. The word idiom has long been familiar to language teachers as one of a number of terms for this phenomenon, but Palmer (1933), arguably the most influential English language teaching specialist of the 20th century, adopted the term collocation for recurring groups of words. He defined a collocation as "a succession of two or more words that must be learned as an integral whole and not pieced together from its component parts" (p. i) (e.g., on the whole). Palmer went so far as to suggest that "even a selection of common collocations . . . exceeds by far the popular estimate of the number of single words contained in an everyday vocabulary" (p. 13). The possibility that there are many more collocations to learn than there are words in a language perhaps helps explain why learning a language usually takes so long in comparison with other complex learning tasks.

Palmer's (1933) pioneering work on collocations in English language teaching was paralleled in different branches of the language sciences. Among a number of scholars who took account of the phenomenon of collocation, Firth (1957) emphasized the importance of both linguistic and situational context for the description of languages in his maxim, "You shall know a word by the company it keeps" (p. 195); Peters (1977, 1983) focused on the learning of groups of words as the units of first and second language acquisition; Hakuta (1974) and Wong-Fillmore (1976) explored the learning of routinized or formulaic speech by L2 learners; Nattinger (1980) applied the concept of formulaic speech to the development of curricula for language learners; and in one of the most widely cited articles of the past two decades in applied linguistics, Pawley and Syder (1983) proposed a collocational model involving lexicalized clauses to account for spoken language fluency. Corpus-based evidence was used by Sinclair (1991) to support what he called the idiom principle in language learning and use (characterized by the use of routinized combinations of words in speech and writing), and to highlight the neglect of collocations in the theory and practice of English language teaching. He suggested that whereas some linguistic processes (e.g., subordination) might appropriately be classed as grammatical (open-choice) phenomena, other items or processes (e.g., phrasal verbs or adjective modification) might appropriately be considered as lexical. Such lexicalization has occurred, for example, with very much in thank you very much; it is uncommon to find thank you much in English speech or writing.

Until relatively recently, the true complexity of collocations was largely hidden, and L2 professionals had no way of understanding their nature and the extent of their use. Over the past decade or so, however, revolutionary developments in the new technology of computer corpus linguistics, and the availability of huge collections of text in electronic form from spoken and written sources (and of English in particular), have made possible new insights into how words are distributed in a language. Increasingly sophisticated software for the analysis of corpora has allowed researchers to explore more deeply the nature of collocations and, in doing so, to reconceptualize the nature of vocabulary, challenge syntax-based approaches to language description and pedagogy, and throw light on the nature of language learning.

The purpose of this article is to explore aspects of how one group of degree adverbs (amplifiers) form collocations with other words in the British National Corpus (BNC), one of the largest and most representative corpora of a single variety of English currently available. More generally, the analysis aims to illustrate the way corpus-based descriptions can be used to increase the understanding of how English is structured, and the nature and extent of collocations in language learning and use. Finally, possible implications for L2 teaching are considered.

MODIFIER COLLOCATIONS
The use of adverbs of degree to modify adjectives and verbs has been described most comprehensively by Quirk, Greenbaum, Leech, and Svartvik (1985), whose analysis took account of earlier work by Bolinger (1972) and others. The framework used was essentially a semantic one, by which amplifiers, such as absolutely, completely, really, and very, were considered to express degrees of increasing intensification upwards from an assumed norm whereas downtoners, such as rather, a bit, somewhat, and quite, were considered as scaling the sense of an adjective downward from an assumed norm, often with a hedging or softening effect. In addition to expressing degrees of intensification for adjectives and verbs, another important function of amplifiers and downtoners is to serve as fillers to give speakers planning time and to assert epistemic meaning associated with speakers' level of confidence in the truth of their assertions. The present analysis is concerned specifically with the use of amplifiers. There are two major subcategories of amplifiers, namely, maximizers and boosters. Maximizers such as absolutely, completely, entirely, fully, perfectly, totally, and utterly maximally intensify the sense of an adjective or verb. One can be, for example, absolutely thrilled, completely unclear, totally devastated, or utterly ruthless. Boosters, on the other hand, signify less than maximal intensity. One can be very unclear, really annoyed, particularly helpful, extremely unwise, heavily sedated, highly skilled, incredibly stupid, deeply suspicious, enormously grateful, severely depressed, terribly sorry, very much appreciated, greatly outnumbered, and so on. About 50 of these two kinds of amplifiers are in common use in British English. In total, they occur about 3,700 times in every 1 million words in the BNC (described below), or one amplifier every 270 words.
Both maximizers and boosters are open-class words, and a small number of new ones evolve from time to time to replace those which have become ineffectual, as Quirk et al. (1985, p. 590) noted.
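The step from a per-million rate to "one amplifier every 270 words" is a simple reciprocal. The following is a minimal sketch of that arithmetic; the function names are illustrative assumptions, and the only inputs are the figures quoted above (about 3,700 occurrences per million words in a 100-million-word corpus).

```python
def words_per_occurrence(rate_per_million):
    """Average number of running words per occurrence, given a per-million rate."""
    return 1_000_000 / rate_per_million

def expected_occurrences(rate_per_million, corpus_words):
    """Approximate number of occurrences implied by a per-million rate."""
    return rate_per_million * corpus_words / 1_000_000

print(round(words_per_occurrence(3700)))               # 270
print(round(expected_occurrences(3700, 100_000_000)))  # 370000
```

The same conversion can be applied to the frequency of any individual amplifier reported per million words.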

Recent Corpus-Based Research on Amplifier Collocations


Among recent corpus-based studies of amplifiers, Paradis (1997) used a small corpus to explore a number of the phonological and semantic variables associated with the classification of adjectives and their degree modifiers, and the importance of collocational relationships between them. Following Allerton (1987), Paradis noted constraints on the possible combinations of degree modifiers and gradable adjectives (p. 41). She suggested that gradable adjectives such as nice or interesting can be modified by words such as very, rather, or terribly whereas nongradable (or absolute) adjectives, such as dead, excellent, or classical, cannot and instead are associated with maximizers such as absolutely, clearly, or perfectly. She argued that on semantic grounds, the gradable feature in the adjective must harmonize with the grading function of the degree modifier in terms of totality and scalarity to make a successful match (p. 158). Paradis also suggested that strong collocational associations can exist between particular modifiers and particular adjectives (e.g., terribly kind).

Lorenz (1999) suggested that collocational associations such as those between modifier and adjective lie at the heart of idiomaticity in English. He argued that the main function of modifiers is to assert the importance or relevance of the quality represented by the adjective and to elaborate on its core meaning. Based on a comparison of the characteristic modifier-adjective associations of native and nonnative speakers in corpora, he suggested that German learners of English may have a tendency to overuse particular modifiers and hyperbole.

An analysis of amplifier-adjective use is included in Biber, Johansson, Leech, Conrad, and Finegan's (1999) work, based on a large (40-million-word) corpus of American and British spoken and written English. The description shows that the most frequent amplifiers immediately preceding adjectives in British English conversation are very, so, really, and too followed by absolutely, bloody, damn, real, completely, and totally. In American English conversational genres the distribution is similar, with the exception of bloody, which is infrequent, and real, which is much more frequent than in British English. Extremely, highly, entirely, fully, incredibly, perfectly, strongly, and terribly also occur frequently in both regional varieties, especially in written, academic genres. Biber et al. (1999, p. 545) show that the most frequent amplifier-adjective collocations in British English conversation are very good, very nice, really good, really nice, and too bad, whereas in American English conversation really good, too bad, very good, real good, real quick, really bad, really nice, too big, and very nice are the most frequent.

Biber et al. (1999) suggest that speakers and writers have a variety of degree adverbs to choose from in modifying adjectives, and whereas some (e.g., fully and strongly) are not interchangeable, in many cases, there is little semantic difference between the degree adverbs. Thus the adverbs could be exchanged in the following pair of sentences with little or no change of meaning: That's completely different. It's totally different. On the other hand, they also suggest that even for similar degree adverbs, there are differing preferences across registers, and associations with different adjectives (p. 564). These suggestions remain somewhat speculative, however, until the appropriate data are examined, and therefore this study examines linguistic data that show precisely which degree adverbs tend to collocate with particular adjectives.

METHOD

The British National Corpus


The BNC (Leech, Rayson, & Wilson, 2001) is a 100-million-word structured collection of spoken and written texts. The corpus was compiled by a consortium of universities, publishers, and the British government in the 1990s to be representative of the spoken and written English used in Britain at the end of the 20th century. The BNC was selected for the present study mainly because of its size and because it incorporates grammatical tagging of each word, thus facilitating automatic retrieval and analysis, making it one of the most significant research tools currently available for the corpus-based study of a particular variety of English. It includes 90 million words of written English from eight genres (80% informative prose, 20% imaginative prose) and 10 million words of spoken English from four social class groupings, collected in 38 locations in the United Kingdom. One hundred million words is equivalent to about 10,000 hours of continuous speech, or what a person might be exposed to by listening to or reading English nonstop 8 hours per day for 3.5 years. The spoken and written texts in the corpus cover a wide range of domains of use, from classrooms, courtrooms, and boardrooms to radio chat shows, bedrooms, and pubs. The texts include casual conversation as well as more formal written genres from sources such as newspapers, biographies, and novels. Analyses of representative corpora of other varieties of English are, of course, likely to identify differences in the amplifier collocations found when compared with those found in this study, but indications are that the pervasiveness of the phenomenon of collocation will be nevertheless equally well established.


Procedure
Collocations associated with the 24 ampliers listed in Table 1 were the focus of the present study. The ampliers were selected largely because they are among the most frequent in the corpus. All except utterly, dead, severely, terribly, enormously, and incredibly occurred at least 3,000 times in the 100 million words of text in the BNC. As shown in Table 1, the most frequent of the 8 maximizers selected from the BNC are fully (with 89 occurrences per million words), completely, entirely, absolutely, totally, and perfectly. The most frequent boosters selected are very (with 1,228 occurrences per million words), really, particularly, clearly, highly, very much, extremely, badly, heavily, deeply, greatly, and considerably. Very occurs about once in every 800 words in spoken or written texts, accounting for over 40% of all boosters in the BNC. The other 15 boosters included in this study also account for about 40% of boosters in the corpus, with another 20 or so boosters, including far, so, too, and certain expletives (e.g., far outweighed, bloody marvellous, so cold, too much, fucking useless), accounting for the remaining 20%. The 24 ampliers and the words they modify were retrieved from the BNC using the BNCweb (2002) interface in order to explore collocational relationships between them. Following Sinclair (1991, p. 117), who argued that in collocational research it is desirable to identify collocates within a span of words each side of a key word in order to retrieve collocations that may have been separated by intervening words, the

TABLE 1 Frequency of Selected Ampliers in the British National Corpus (per Million Words) Maximizer fully completely entirely absolutely totally perfectly utterly dead Frequency 89 86 69 58 58 44 13 8 Booster very really particularly clearly highly very much extremely badly heavily deeply greatly considerably severely terribly enormously incredibly Frequency 1,228 476 219 153 91 80 68 43 41 37 33 30 18 13 8 8

472

TESOL QUARTERLY

present study adopted the window of two words on each side of the amplier. The consequence of such a window, however, was that whereas manual checking showed that most of the amplier collocates are adjectives or participles, some are verbs. That is, the automatic retrieval from the corpus would allow both I felt completely overshadowed by the others and She overshadowed completely all the others in her class. In the rst sentence, the adverb completely modies a participial adjective. In the second case completely modies a verb. Because a primary focus of the study was on the collocational bonds between particular words rather than between word classes, I decided to include both adjective and verb forms that collocate with the degree adverbs rather than including only the adverb-(participial) adjective collocations and excluding the others manually. The analysis involved only those collocations occurring at least ve times in the corpus and only the 40 collocates associated most strongly with each amplier. The statistical measure chosen to show the strength of the associations between ampliers and adjectives in the study was the mutual information (MI) measure (Church & Hanks, 1990). The MI measure compares the probability of two words occurring together through intention with the probability of the two words occurring together by chance. The actual frequency of co-occurrence of two words is compared with the predicted frequency of co-occurrence of the two words if each were randomly distributed in the corpus. If there is a genuine association between the two words, then the joint probability of occurrence will be much larger than chance, and consequently the MI score will be greater than zero. A ratio of 0 shows that the occurrence of two words together is highly unlikely to be linguistically important. The higher the ratio, the stronger the association between the words. 
An MI score greater than 2 can be considered high enough to show a substantial association between two words. The MI score is calculated with the following formula:

$$\mathrm{MI} = \log_2 \frac{f(n, c)\, N}{f(n)\, f(c)},$$

where f(n, c) is the number of times the collocation occurs, f(n) is the frequency of the amplifier, f(c) is the frequency of the adjective or other word modified, and N is the number of words in the corpus. Church, Gale, Hanks, Hindle, and Moon (1994, p. 174) argued that because the MI measure identifies the strength of association between word pairs in a corpus, the measure makes it possible to go further than simply finding the most frequently occurring collocations. Estimating the strength of the bond between words is considered to be a measure of idiomaticity. Because the ratio of joint probability of occurrence to chance occurrence can be large, the MI measure expresses the ratio as a logarithm. (Inevitable errors in automatic tagging and retrieval, and revisions in the tagging program used in the later releases of the corpus, do not have a substantial effect on the results but may account for a possible small margin of error in the rank orderings and MI scores presented below.)
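
The retrieval and scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software or procedure used in the study: the function names and default parameter names are hypothetical, and the sketch works on a plain list of word tokens with no part-of-speech tagging. It counts co-occurrences within two words on either side of an amplifier, keeps pairs that occur at least five times, and ranks them with the MI formula given above.

```python
import math
from collections import Counter


def mi_score(pair_freq, node_freq, collocate_freq, corpus_size):
    """MI = log2( f(n, c) * N / (f(n) * f(c)) ), as in the formula above."""
    return math.log2(pair_freq * corpus_size / (node_freq * collocate_freq))


def window_collocates(tokens, node, span=2):
    """Count the words found within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - span)
            end = min(len(tokens), i + span + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def strongest_collocates(tokens, node, span=2, min_freq=5, top_n=40):
    """Rank collocates of `node` by MI, keeping pairs seen at least `min_freq` times."""
    word_freq = Counter(tokens)
    pair_freq = window_collocates(tokens, node, span)
    scored = [
        (word, mi_score(f_nc, word_freq[node], word_freq[word], len(tokens)))
        for word, f_nc in pair_freq.items()
        if f_nc >= min_freq
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

With real data, `tokens` would be the full word sequence of the corpus, so that `len(tokens)` plays the role of N. A pair that co-occurs exactly as often as chance predicts scores 0, for example `mi_score(1, 10, 10, 100)`.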

RESULTS
At first glance, the amplifiers might indeed appear to be interchangeable to a large extent, as Biber et al. (1999) state, but close examination of the results in this study suggests that this is not the case. Whereas boosters such as very, really, particularly, highly, extremely, deeply, terribly, and incredibly may all appear to be synonymous and interchangeable, occurring before useful, interesting, obnoxious, or upset, boosters such as clearly, badly, heavily, greatly, considerably, and severely are not synonymous and interchangeable. Some amplifiers do not seem to fit comfortably with particular adjectives and are not likely to be found in a corpus or be considered acceptable by most native speakers of English (for example, completely easier, fully classical, badly dead, heavily unique, heavily frustrating, deeply valuable, deeply excellent, considerably kind, highly outnumbered, severely amazing, and enormously nice).

Maximizers
Table 2 shows the 40 most strongly bonded words associated with each of eight maximizers in the BNC. The collocational value in the adjacent column is the strength of the association in the corpus between the maximizer and the word, as represented by the MI measure. Each maximizer tends to collocate strongly with quite different words. For example, absolutely collocates most strongly with diabolical, with a collocational value of 5.89. By comparison, completely collocates most strongly with refitted, with a collocational value of 6.09. The data in Table 2 suggest that the adjectives and verbs each tend to bond most strongly with particular maximizers. Thus, whereas appalling is associated strongly with both absolutely and utterly, appalling is not preceded by entirely, fully, perfectly, or totally in Table 2. Besides the strength of the bonding, more general semantic and grammatical characteristics can be involved in the collocations, as these examples reveal:

Absolutely tends to be associated with adjectives that are used hyperbolically (e.g., fabulous, marvellous, fantastic, brilliant, filthy, freezing); the adjectives have both positive (wonderful) and negative (disgusting) semantic associations; only incredible has a negative prefix; 23% of the collocates have an -ous suffix; 15% have an -ed suffix.
Completely tends to be associated with abolition (e.g., eliminated, eradicated, wrecked); 23% of the collocates have a negative prefix; 10% have an out- or over- prefix; 78% have an -ed suffix.
Dead is found particularly with words that have positive associations; none of the collocates has a negative prefix; only two have an -ed suffix.
Entirely is found with words having positive or negative associations; 18% of the words have an -able or -ible suffix; 23% have an -ed suffix.
Fully has exclusively positive associations; 13% of the adjectives have an -able or -ible suffix; 78% have an -ed suffix.
Perfectly has exclusively positive associations; 28% of the adjectives end in -able or -ible; only 18% have an -ed suffix.
Totally tends to have mainly negative associations (e.g., unsuited, lacking, insane); 65% of the adjectives have a negative prefix; 45% have an -ed suffix.
Utterly has negative associations in 75% of the collocations (e.g., desolate, stupid, ruthless, miserable); 38% of the collocates have an -ed suffix.

TABLE 2
Forty Strongest Collocations With Selected Maximizers in the British National Corpus

absolutely: diabolical 5.89, knackered 5.75, gorgeous 5.33, livid 5.25, thrilled 5.04, devastated 5.02, frightful 5.01, ludicrous 4.98, immaculate 4.88, disgraceful 4.88, horrendous 4.87, disgusted 4.84, fabulous 4.68, marvellous 4.67, fantastic 4.59, terrific 4.58, delighted 4.57, disgusting 4.56, horrific 4.48, brilliant 4.36, hopeless 4.28, essential 4.26, ridiculous 4.24, appalling 4.18, filthy 4.17, furious 4.16, motionless 4.15, mint 4.06, delicious 4.05, outrageous 4.02, meaningless 3.97, vital 3.97, superb 3.96, wonderful 3.93, fascinating 3.92, useless 3.90, forbidden 3.88, freezing 3.81, incredible 3.79, crazy 3.67

completely: refitted 6.09, inelastic 5.85, outclassed 5.69, redesigned 5.51, refurbished 5.42, overhauled 5.35, eradicated 5.33, disorientated 5.27, renovated 5.18, mystified 5.16, sequenced 5.08, gutted 4.91, revamped 4.86, uninterested 4.76, untrue 4.76, overshadowed 4.66, healed 4.59, submerged 4.59, untouched 4.41, cured 4.34, lifeless 4.33, unrelated 4.26, wrecked 4.25, ignored 4.22, baffled 4.22, disregarded 4.21, bald 4.14, destroyed 4.11, numb 4.11, self-contained 4.10, obscured 4.06, devoid 4.05, irrelevant 4.03, overgrown 4.02, eliminated 4.01, automated 3.92, insane 3.91, forgotten 3.91, absorbed 3.90, ignorant 3.90

dead: chuffed 7.07, boring 4.81, drunk 4.21, funny 4.19, keen 4.02, straight 3.98, smart 3.81, lucky 3.79, calm 3.75, tired 3.71, easy 3.51, quiet 2.73, flat 3.69, slow 2.48, white 2.33, simple 2.31, serious 2.07, certain 2.05, right 2.04, nice 2.01, interesting 1.98, cold 1.93, set 1.88, wrong 1.81, black 1.51, level 1.30, sure 1.23, still 1.00

entirely: blameless 5.98, fortuitous 5.31, coincidental 5.31, altruistic 4.98, obliterated 4.88, absent 4.50, understandable 4.48, fanciful 4.47, financed 4.44, devoted 4.43, satisfactory 4.34, dissimilar 4.31, untrue 4.28, devoid 4.27, composed 4.18, predictable 4.13, hypothetical 4.08, arbitrary 3.98, consistent 3.96, unrelated 3.94, attributable 3.91, inappropriate 3.84, convincing 3.73, unsuitable 3.72, unexpected 3.63, discretionary 3.55, innocent 3.53, preoccupied 3.50, confined 3.49, incompatible 3.49, optional 3.41, feasible 3.40, focused 3.38, legitimate 3.37, compatible 3.36, ignorant 3.33, irrelevant 3.31, factual 3.29, accurate 3.24, different 3.22

fully: fledged 7.77, conversant 6.90, battened 6.37, clothed 6.22, air-conditioned 6.06, deductible 5.82, elucidated 5.77, configured 5.70, comprehended 5.61, automated 5.49, washable 5.27, equipped 5.24, programmable 5.22, operational 5.19, sighted 5.19, dilated 4.92, rigged 4.91, staffed 4.83, utilized 4.83, briefed 4.83, integrated 4.82, computerized 4.73, exploited 4.65, matured 4.61, carpeted 4.58, adjustable 4.55, inclusive 4.53, informed 4.43, aligned 4.40, computerised 4.36, compatible 4.35, manned 4.34, justified 4.26, licensed 4.26, loaded 4.26, implemented 4.22, turbulent 4.22, articulated 4.20, glazed 4.20, assimilated 4.19

perfectly: contestable 7.56, proportioned 6.75, manicured 6.12, spherical 5.84, groomed 5.68, timed 5.40, understandable 5.29, symmetrical 5.15, acceptable 5.07, complemented 4.97, intelligible 4.93, feasible 4.88, balanced 4.87, legitimate 4.74, elastic 4.71, respectable 4.70, sane 4.68, adequate 4.65, honest 4.57, capable 4.53, harmless 4.47, valid 4.44, reasonable 4.37, competitive 4.37, satisfactory 4.31, plausible 4.27, healthy 4.23, tailored 4.16, normal 4.12, safe 4.05, happy 4.02, compatible 4.02, competent 3.96, rational 3.90, transparent 3.78, honourable 3.78, innocent 3.76, sensible 3.74, genuine 3.66, smooth 3.60

totally: unsuited 6.23, unprepared 5.89, illegible 5.65, unsuitable 5.61, impractical 5.60, uncharacteristic 5.58, illogical 5.57, unacceptable 5.55, unconnected 5.51, devoid 5.49, unintelligible 5.38, symmetric 5.35, unfounded 5.34, untrue 5.33, unmoved 5.27, unjustified 5.25, oblivious 5.18, incomprehensible 5.16, unrelated 5.16, unconcerned 5.15, eclipsed 5.11, immersed 4.99, unrealistic 4.97, unaware 4.97, engrossed 4.91, inadequate 4.87, fucked 4.85, unexpected 4.77, one-sided 4.76, bemused 4.73, lacking 4.65, incapable 4.60, insane 4.59, alien 4.59, dependent 4.59, submerged 4.58, irrelevant 4.55, pissed off 4.53, disproportionate 4.51, baffled 4.46

utterly: desolate 5.42, disgraceful 5.31, irresponsible 4.80, ruthless 4.80, compelling 4.61, miserable 4.55, ridiculous 4.18, horrified 4.10, helpless 3.96, divorced 3.84, exhausted 3.82, alien 3.73, appalling 3.71, deserted 3.71, useless 3.68, convincing 3.62, foolish 3.62, charming 3.62, opposed 3.60, unexpected 3.51, transformed 3.45, shocked 3.40, defeated 3.38, absorbed 3.33, destroyed 3.21, confused 3.04, worn 3.04, mad 2.97, inadequate 2.94, silent 2.91, dependent 2.91, devoted 2.85, stupid 2.77, remote 2.72, brilliant 2.71, impossible 2.70, failed 2.65, convinced 2.59, false 2.47, felt 2.24

Note. Values are strength of the collocation as calculated by the MI measure.

Boosters
Like the maximizers, each of the 16 boosters tends to collocate strongly with quite different words. As shown in Table 3, badly bonds most strongly with mauled whereas clearly bonds most strongly with demarcated. As was the case with maximizers, almost any booster can in theory be associated with almost any adjective or verb. In reality, there are probabilities of occurrence, and learning and applying them contributes greatly to fluency.

Badly is particularly associated with damage (e.g., bruised, corroded, mutilated); 88% of the collocates end in -ed, and 35% end in the /id/ allomorph (e.g., corroded, wounded, ventilated).
Clearly tends to be associated with perception (e.g., visible, audible, labelled); 30% of the adjectives end in -able or -ible; 20% have a negative prefix (un-, in-, mis-); 53% have an -ed suffix; 28% end in an /id/ allomorph.
Considerably tends to be associated with change of state (e.g., loosened, broadened, slowed, altered); most of its collocates are positive, and only worsened and weakened have negative associations; 38% of the collocates are comparatives ending in -er (e.g., cheaper, easier, fewer); 65% have an -ed suffix.
Deeply is especially associated with indentation (e.g., ingrained, incised, entrenched, recessed, etched); 83% of the collocates have an -ed suffix; 25% end in an /id/ allomorph.
Enormously tends to be associated with change (e.g., varied, expanded); all the adjectives have positive associations except for difficult; 42% have an -ed suffix.
Extremely tends to be associated more with adjectives that have negative associations (e.g., difficult, risky, wasteful, dangerous) than with adjectives that have positive associations (e.g., versatile, valuable, fruitful, lucrative); 20% of the adjectives end in -ful; 15% have a negative prefix; only two end in -ed (distressed and agitated).
Greatly mainly has positive associations, with the exception of outnumbered, alarmed, and distressed; none of the collocates has a negative prefix; all 40 end in -ed; 40% have the /id/ allomorph.
Heavily tends to have more negative than positive associations (e.g., taxed, polluted, biased, skewed, contaminated); none of the collocates has a negative prefix; 98% end in -ed; 35% have the /id/ allomorph.
Highly tends to be associated especially with adjectives that have positive associations, with the exception of toxic, repellent, and contagious; 28% end in -able or -ible; only one word (improbable) has a negative prefix; 55% have an -ed suffix.
Incredibly tends to be associated with adjectives that express subjective judgement (e.g., sexy, naïve, handsome, brave, clever, boring, beautiful); some collocates have a -y suffix (e.g., lucky, funny, sexy, easy); only three have an -ed suffix.
Particularly has both positive (e.g., fond, helpful, valuable) and negative (e.g., galling, irksome, obnoxious) associations; six adjectives have -able or -ible suffixes; four have -ive suffixes; six end in -ing; only two (5%) have -ed suffixes.
Really has both positive (e.g., cute, nice, funny, tasty) and negative (e.g., scary, pathetic, horrible, vile) associations; 25% of the collocates have -y suffixes; 13% end in -ing; 15% end in -ed.
Severely is associated especially with constraint or damage (e.g., undernourished, curtailed, depleted, hampered, limited, wounded, injured, bruised); 98% of the collocates end in -ed.
Terribly is associated with words that have positive (e.g., excited, impressed, glad, nice, helpful, keen) as well as negative (e.g., unhappy, sad, depressed, angry) associations; 25% of the collocates end in -ed.
TABLE 3
Forty Strongest Collocations With Selected Boosters in the British National Corpus

badly: mauled 6.23, sprained 5.85, bruised 5.68, corroded 5.48, damaged 5.34, behaved 5.19, decomposed 5.05, unstuck 5.05, shaken 4.79, wounded 4.63, injured 4.59, ventilated 4.58, charred 4.57, mutilated 4.55, hurt 4.48, dented 4.46, burned 4.37, weathered 4.31, scarred 4.21, deformed 4.15, affected 4.02, bitten 3.98, leaking 3.90, swollen 3.89, beaten 3.80, deteriorated 3.71, treated 3.58, eroded 3.56, handled 3.55, drafted 3.51, disrupted 3.43, hit 3.38, stained 3.36, burnt 3.28, flooded 3.18, needed 3.16, lit 3.07, neglected 3.01, battered 2.97, crushed 2.88

clearly: demarcated 4.85, delineated 4.71, articulated 4.15, identifiable 4.07, signposted 4.06, visible 3.93, defined 3.89, distinguishable 3.83, audible 3.76, enunciated 3.76, definable 3.72, differentiated 3.57, discernible 3.24, demonstrated 3.23, recognisable 3.20, stated 3.06, labelled 2.76, marked 2.61, recognizable 2.61, understood 2.51, undesirable 2.49, advantageous 2.48, unambiguous 2.48, distinguished 2.45, desirable 2.43, observable 2.43, formulated 2.41, illustrated 2.37, evident 2.33, misunderstood 2.30, incompatible 2.24, influenced 2.23, documented 2.23, numbered 2.20, identified 2.18, unsatisfactory 2.17, inconsistent 2.09, inappropriate 2.04, correlated 2.03, unacceptable 2.00

considerably: lessened 7.09, worsened 7.01, broadened 6.28, varied 6.13, differed 5.93, expanded 5.86, slowed 5.84, strengthened 5.62, simplified 5.54, weakened 5.48, eased 5.35, enhanced 5.34, cheered 5.24, improved 5.21, widened 5.20, reduced 5.13, altered 5.11, brighter 5.09, enlarged 5.08, diminished 5.04, cheaper 5.02, risen 4.96, increased 4.73, benefited 4.71, decreased 4.46, higher 4.46, slower 4.39, shorter 4.38, less 4.36, longer 4.35, larger 4.34, lower 4.30, lighter 4.29, declined 4.28, smaller 4.26, changed 4.15, modified 4.13, easier 4.04, greater 4.01, fewer 3.96

deeply (collocates in rank order): engrained, ingrained, inhaled, exhaled, rooted, incised, entrenched, saddened, embedded, recessed, indebted, etched, indented, imbued, unpopular, flawed, resented, immersed, tanned, penetrated, regretted, suspicious, shocked, offended, ashamed, disturbed, resentful, blushed, disturbing, unhappy, implicated, researched, compromised, distressed, troubled, satisfying, wrinkled, influenced, sceptical, divided

enormously: varied 4.94, influential 4.51, impressed 4.38, strengthened 4.33, enhanced 4.17, expanded 4.09, grown 3.95, appealed 3.92, helped 3.82, valuable 3.67, increased 3.53, proud 3.43, contributed 3.42, successful 3.39, enjoyed 3.31, fat 3.30, helpful 3.27, grateful 3.24, popular 3.22, improved 3.14, exciting 3.03, expensive 2.81, complex 2.69, liked 2.52, rich 2.49, changed 2.27, powerful 2.26, important 1.91, difficult 1.85, wide 1.75, long 1.40, different 0.84, high 0.77

extremely: hard-working 3.90, time-consuming 3.88, versatile 3.86, rare 3.83, arduous 3.71, valuable 3.71, difficult 3.66, frustrating 3.65, distressing 3.61, wasteful 3.56, knowledgeable 3.55, unhelpful 3.51, durable 3.49, distressed 3.47, fruitful 3.42, risky 3.41, fortunate 3.40, helpful 3.39, unwise 3.35, wary 3.35, costly 3.35, reactive 3.32, painful 3.30, doubtful 3.29, lucrative 3.25, annoying 3.25, hazardous 3.24, agitated 3.24, stressful 3.21, useful 3.18, economical 3.16, unpopular 3.16, damaging 3.15, grateful 3.15, uncomfortable 3.13, dangerous 3.10, cautious 3.10, unpleasant 3.09, unlikely 3.08, tedious 3.08

greatly: facilitated 5.21, appreciated 5.07, outnumbered 5.03, admired 4.95, exaggerated 4.94, enhanced 4.93, enlarged 4.81, benefited 4.76, strengthened 4.64, simplified 4.63, elongated 4.62, indebted 4.62, influenced 4.56, expanded 4.50, improved 4.47, hindered 4.43, differed 4.28, accelerated 4.25, reduced 4.09, diminished 4.08, underestimated 4.08, amplified 4.06, impressed 4.06, respected 4.03, hampered 4.01, elaborated 4.01, contributed 3.99, varied 3.96, alarmed 3.94, assisted 3.93, increased 3.92, exacerbated 3.91, honoured 3.84, boosted 3.65, distressed 3.49, aided 3.48, relieved 3.39, altered 3.33, widened 3.32, encouraged 3.31

heavily (collocates in rank order): trafficked, cratered, sedated, accented, made-up, subsidized, indebted, muscled, populated, reliant, weighted, subsidised, laden, mortgaged, outnumbered, forested, fortified, censored, skewed, congested, taxed, polluted, burdened, infiltrated, wooded, soiled, biased, armoured, dependent, contaminated, influenced, guarded, criticised, veiled, stocked, laced, stacked, criticized, scented, discounted

highly: imageable 7.87, sexed 6.87, manoeuvrable 6.53, esteemed 6.45, politicised 6.40, commended 6.40, flammable 6.37, politicized 6.30, prized 6.27, inflammable 6.16, acclaimed 5.99, falsifiable 5.89, ritualised 5.86, reproducible 5.82, stylized 5.76, publicised 5.68, skilled 5.52, individualistic 5.52, problematical 5.50, improbable 5.46, centralized 5.42, purified 5.41, rugose 5.34, contagious 5.31, mechanised 5.28, motivated 5.28, polished 5.27, selective 5.24, readable 5.22, valued 5.21, nutritious 5.20, respected 5.20, correlated 5.19, impressionable 5.18, questionable 5.16, specialised 5.12, toxic 5.12, contentious 5.11, commendable 5.09, repellent 5.03

incredibly: sexy 4.70, naïve 4.53, handsome 3.75, boring 3.73, brave 3.67, exciting 3.65, lucky 3.55, stupid 3.55, efficient 3.47, clever 3.42, complicated 3.34, dangerous 3.21, thin 3.20, fast 3.12, beautiful 2.99, tired 2.86, expensive 2.84, funny 2.77, powerful 2.77, soft 2.64, slow 2.50, detailed 2.40, successful 2.36, difficult 2.34, simple 2.33, strong 2.28, easy 2.27, blue 2.25, complex 2.21, low 2.13, interesting 2.00, hard 1.83, short 1.82, high 1.40, important 1.30, large 0.75, long 0.67, small 0.66, good 0.46, little 0.40

particularly (collocates in rank order): galling, apposite, noteworthy, poignant, irksome, noticeable, onerous, virulent, vulnerable, susceptible, gratifying, heartening, prevalent, fond, instructive, obnoxious, scathing, acute, suited, useful, pertinent, pleasing, inspiring, adept, helpful, sensitive, apt, intractable, traumatic, attractive, valuable, striking, stressful, disadvantaged, rewarding, problematic, evident, memorable, timely, receptive

really (collocates in rank order): chuffed, naff, pissed off, scary, weird, groovy, annoying, uptight, tacky, wacky, cute, nice, funny, nasty, annoyed, pathetic, tasty, grown-up, obnoxious, horrible, boring, amazing, upset, scared, bothered, fed-up, naughty, disgusting, degrading, bad, vile, stupid, skinny, juicy, silly, hilarious, hurt, sexy, excited, frightening

severely: undernourished 6.50, censured 6.45, curtailed 6.26, reprimanded 6.09, disadvantaged 6.00, circumscribed 5.97, incapacitated 5.94, depleted 5.68, handicapped 5.66, hampered 5.63, demented 5.60, damaged 5.56, retarded 5.55, disrupted 5.55, dented 5.54, punished 5.53, impaired 5.35, disabled 5.32, restricted 4.94, criticised 4.82, eroded 4.82, constrained 4.72, shaken 4.62, strained 4.52, depressed 4.42, polluted 4.41, undermined 4.38, affected 4.28, wounded 4.24, weakened 4.22, injured 4.17, tortured 4.15, disturbed 4.13, bruised 4.09, ill 4.04, beaten 4.03, limited 3.91, burned 3.76, tested 3.75, diminished 3.69

terribly: homesick 3.36, sorry 3.15, excited 3.10, upset 2.92, unhappy 2.86, sad 2.86, embarrassing 2.58, boring 2.54, embarrassed 2.51, disappointed 2.49, painful 2.41, depressed 2.31, afraid 2.29, brave 2.29, worried 2.29, frightened 2.25, impressed 2.20, tired 2.17, expensive 2.11, hurt 2.06, complicated 1.98, ill 1.97, funny 1.95, glad 1.93, keen 1.93, proud 1.91, helpful 1.90, busy 1.87, wrong 1.86, exciting 1.84, important 1.82, angry 1.81, guilty 1.80, difficult 1.80, hard 1.74, dangerous 1.72, interested 1.72, clean 1.72, nice 1.71, interesting 1.71

very: choosy 4.25, fond 3.92, gratifying 3.69, likeable 3.68, clever 3.54, tiring 3.52, creditable 3.51, distressing 3.40, helpful 3.39, flattering 3.39, nice 3.34, pleasant 3.30, considerate 3.28, distressed 3.27, frustrating 3.27, tasty 3.26, grateful 3.24, difficult 3.21, careful 3.19, fussy 3.16, prim 3.16, methodical 3.15, enjoyable 3.13, handy 3.11, time-consuming 3.11, sad 3.08, astute 3.04, posh 3.03, observant 3.03, knowledgeable 3.01, possessive 2.99, proud 2.98, impressed 2.98, keen 2.97, upsetting 2.97, exciting 2.97, windy 2.97, versatile 2.96, amusing 2.95, low-key 2.93

very much: appreciated 3.47, alive 3.11, mistaken 3.07, liked 3.04, alike 2.77, enjoyed 2.76, admired 2.61, disliked 2.51, regretted 2.51, loved 2.36, easier 2.35, obliged 2.31, poorer 2.16, influenced 2.15, depended 2.13, quicker 2.08, in common 1.99, slower 1.91, longer 1.86, smaller 1.82, aware 1.81, stronger 1.77, harder 1.77, cheaper 1.72, safer 1.67, intact 1.65, worse 1.65, bigger 1.62, broader 1.53, larger 1.52, inclined 1.50, faster 1.47, afraid 1.47, shorter 1.47, impressed 1.44, okay 1.42, missed 1.38, wider 1.33, bigger 1.28, involved 1.27

MI values for deeply and heavily (two rank-ordered series of 40; which series belongs to which booster is not recoverable from the extracted layout):
5.31 4.86 4.64 4.59 4.32 4.21 4.17 4.17 3.96 3.94 3.89 3.80 3.77 3.76 3.74 3.63 3.59 3.57 3.56 3.52 3.36 3.36 3.26 3.24 3.23 3.22 3.21 3.19 3.13 2.95 2.89 2.88 2.86 2.85 2.85 2.74 2.73 2.73 2.64 2.54
7.34 6.89 6.24 5.88 5.67 5.66 5.52 5.43 5.34 5.32 5.20 5.12 5.03 4.98 4.86 4.72 4.68 4.63 4.57 4.57 4.33 4.33 4.32 4.14 4.12 4.08 4.06 4.01 4.01 3.99 3.99 3.91 3.89 3.89 3.88 3.83 3.79 3.79 3.78 3.68

MI values for particularly and really (two rank-ordered series of 40; assignment likewise not recoverable):
5.74 4.15 3.87 3.81 3.76 3.75 3.73 3.62 3.43 3.36 3.35 3.33 3.32 3.22 3.17 3.16 3.11 3.10 3.01 3.00 2.89 2.82 2.79 2.76 2.68 2.67 2.67 2.66 2.66 2.61 2.57 2.56 2.56 2.52 2.34 2.25 2.21 2.13 2.12 2.02
5.23 4.40 4.40 4.39 4.36 4.33 4.32 4.28 4.24 4.11 4.11 4.07 3.91 3.89 3.86 3.80 3.79 3.75 3.74 3.74 3.69 3.58 3.53 3.51 3.47 3.45 3.43 3.40 3.35 3.30 3.29 3.29 3.29 3.24 3.20 3.19 3.18 3.13 3.13 3.11

Note. Values are strength of the collocation as calculated by the MI measure.

Very is associated with words having generally positive associations (e.g., fond, clever, nice, tasty), but some adjectives have negative associations (e.g., distressed, difficult, sad); 23% end in -ing; 13% have a -y suffix; only two end in -ed.
Very much is typically associated with words having positive associations and with comparisons; 43% of the collocates end in -er; 38% have an -ed suffix.

In addition to supporting the view that words associated with amplifiers may be usefully characterized in collocational terms, the data presented here suggest that certain grammatical and semantic factors may also be involved in determining which modifiers are likely to be found preceding particular words. Gradability or scalarity has traditionally been seen as the most important of these factors. The data in this study suggest that account should also be taken of such factors as whether the meaning of the adjective or verb has negative or positive associations or whether the derivational history of the adjective has certain features (e.g., -ed, -ful, -less, -able, or -y suffixes). In addition, words most strongly associated with particular amplifiers are not necessarily the most frequently associated, as shown in Tables 2 and 3. Of the amplifiers listed by Biber et al. (1999, p. 545), only very nice, really nice, and really bad appear in the present analysis.
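
The suffix percentages reported for each amplifier are simple proportions over its collocate list. As a minimal sketch (the function name is an assumption made for illustration, and matching on spelling alone is a simplification that would miscount prefixes such as in-, where a word like influenced is not negative), the statement that all 40 collocates of greatly end in -ed can be checked against the greatly column of Table 3:

```python
# Collocates of "greatly" as listed in Table 3.
GREATLY = """
facilitated appreciated outnumbered admired exaggerated enhanced enlarged
benefited strengthened simplified elongated indebted influenced expanded
improved hindered differed accelerated reduced diminished underestimated
amplified impressed respected hampered elaborated contributed varied alarmed
assisted increased exacerbated honoured boosted distressed aided relieved
altered widened encouraged
""".split()


def share_with_suffix(words, suffix):
    """Proportion of words whose spelling ends in `suffix`."""
    return sum(word.endswith(suffix) for word in words) / len(words)


print(len(GREATLY), share_with_suffix(GREATLY, "ed"))  # prints: 40 1.0
```

The same function could be applied to any other column, bearing in mind that a spelling match is only a rough proxy for a morphological suffix.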

THE CORPUS AND THE NATURE OF LANGUAGE LEARNING


Corpus-based research on collocational relationships has shown the extent to which a language may be considered to be a vast and complex collection of groups of words that together function as lexemes and that in turn become associated with particular domains of use. The collocations described in this study reveal a minuscule part of the learning that is necessary in order to become a fluent user of English. A substantial part of linguistic competence appears to be based on a huge store of memories of previously encountered words and groups of words stored in units of use. Research in cognitive science has explored the mechanisms by which words and groups of words are established in memory (Kirsner, 1994). The frequency with which individuals experience words and groups of words significantly influences the extent to which these linguistic items are associated, stored, and retrievable from memory. By repeatedly coming across the same words occurring together in the same order, individuals store implicitly learned linguistic patterns or constructions (e.g., a lot of, over there, excuse me, don't think so, I think I'll go to bed, There's nothing worth watching on TV tonight). Very often the patterns these collocations represent have never been noticed by grammarians, and they have consequently not become part of descriptions of English. Francis, Hunston, and Manning (1996), however, illustrate something of the systematic complexity that corpus-based analysis can reveal. Learning to associate forms with forms, forms with semantic or pragmatic functions, and forms and functions with contexts requires huge amounts of exposure. The more frequently individuals are exposed to these units of use (which typically consist of tone units of up to about seven syllables), the faster they can process these units, the more often they exploit probabilistic knowledge about the use of those units, and the more fluent they become (Bybee & Hopper, 2001). The ability of young children to complete sentences from well-loved stories they have previously and repeatedly had read to them (or their concern when parts of such texts are omitted) illustrates the power of the frequency effect.

One of the issues that continues to be explored within applied linguistics is the relative importance of implicit knowledge and explicit knowledge in language learning. Implicit knowledge is, in part at least, the vast amount of information acquired unconsciously about language simply through exposure to the language being used. For example, no one tells a speaker that in English one is likely to completely forget to ring people, not just forget; that things tend to be totally unacceptable and not just unacceptable; that trucks are rarely just laden but tend to be heavily (or fully) laden; someone might become deeply suspicious or highly (rather than heavily) skilled; one is more likely to be incredibly lucky than highly lucky; and so on. The acquisition of such implicit knowledge may be characterized as learning without awareness. In arguing for the centrality of the lexicon in language acquisition, established by implicit processes, Kirsner (1994, p. 283) suggested that the amount of exposure to and use of lexical items by young language learners may be greatly underestimated and that a core of highly practised words and routines provides the basis of fluency. L2 learners often find it difficult to get enough exposure to such routines to establish comparable fluency.

Implicit knowledge seems to make up much of what is acquired in L1 acquisition, and this knowledge seems to be acquired especially from meaning-focused interaction, both input and output. By listening and reading, individuals experience many words and many strings of words, and eventually they recognise or recall the ones that are met most often or that have left some particularly striking reason for remembering them. The words and phrases experienced the most leave the strongest trace because language learning is essentially probabilistic learning, as is demonstrated by individuals' use of many of the same set phrases (including clichés) as used by parents and peers in speech communities.

Learning explicit knowledge, on the other hand, is learning with awareness. It includes the very large amount of information that typically forms the basis for the L2 curriculum, the words, patterns, and functions that are taught as a set of assumptions about what one needs to know in order to become a fluent user of the language. For example, learners might be reminded that in their L1 nouns are modified with adjectives placed before the noun (a hot day) whereas in their target language adjectives are placed after nouns. At its worst, a focus on explicit knowledge can result in inaccurate rules being taught to learners (e.g., a rule stating that different from is correct but different than and different to are incorrect). At its best, good evidence shows, explicit instruction, with focus on language items, patterns, and rules, can speed up learning by supporting implicit learning processes. In a comprehensive review of the research on the effectiveness of L2 instruction, Norris and Ortega (2000) have confirmed that explicit types of instruction are particularly effective in achieving formal accuracy. Furthermore, they confirmed that focus-on-form instruction results in significant and lasting effects especially if form is explicitly associated with communicative function. Ellis (1994) and Robinson (2001) have brought together interdisciplinary approaches to the respective roles of implicit and explicit instruction, and have discussed the kinds of knowledge acquired from each. It remains unclear, however, how collocational knowledge of the kind explored here can realistically be the focus of explicit instruction.

IMPLICATIONS FOR LANGUAGE TEACHING


Language teacher education tends to be centred on three main areas: (a) the factors that contribute to successful learning (e.g., the effect of learner attitudes, motivation, age, and personality and the strategies successful learners adopt); (b) the techniques available to accelerate the learning of the skills of listening, speaking, reading, and writing (usually called methodology); (c) the content of language learning (e.g., what words are most useful to know, how much of the grammar needs to be taught, and what parts should be taught first). In planning instruction, teachers need to keep in mind knowledge about how languages are learned, how to teach, and what learners learn. The first two of these, however, have tended to dominate language teacher education and research on second language acquisition for the past two decades while the content of language learning has tended to receive less attention.

Implicit Versus Explicit Curriculum


A language teaching curriculum is a set of assumptions about the explicit knowledge that learners of a language need to be exposed to. A curriculum typically includes explicit instruction on the language code (phonology, grammar, and vocabulary); the way to perform certain speech acts (e.g., how to apologise or seek clarification); development of proficiency in the skills of listening, speaking, reading, or writing; and appreciation of culture, ranging from food and favourite pastimes to societal values and historical, literary, and artistic achievements. The explicit curriculum is determined or influenced by several principal stakeholders, including curriculum designers, materials developers, classroom teachers, and, of course, language learners themselves. Many sources of information are used to decide what particular groups of learners need to know or be able to do. These include needs analysis, error analysis, and corpus-based analysis to discover, for example, the words most frequently used by native speakers, use often being a good indication of usefulness. Curriculum items then have to be turned into tasks and activities that will motivate learning.

Corpus-based analysis has shown, however, that in addition to the explicit curriculum, another kind of curriculum is imposed by the language itself. As the analysis of amplifier collocation in the present study illustrates, only now is a fuller picture of the complexity of what has to be learned emerging through corpus analysis. Essentially, learners are acquiring experience of which linguistic items typically occur in the company of other items. This curriculum need not be explicit and is indeed typically hidden, but it still has to be learned if fluency is to be achieved. Through exposure to language in use, learners unconsciously acquire implicit knowledge that forms the basis for their own language use.

Maximising Opportunities for Internalization


Teaching explicitly the kind of complexity revealed by the corpus is certainly not easy. The challenge for language teachers is to devise a curriculum that maximizes the opportunities for learners to get enough experience of the units of language in use in order to internalise them. Some of the collocations that contain the strongest bonds as measured by the MI score (see Tables 2 and 3) are in fact infrequent (e.g., badly mauled, deeply engrained, heavily trafficked, particularly galling, severely curtailed, badly decomposed, heavily forested, entirely blameless, fully conversant). From a pedagogical viewpoint, the most frequently occurring collocations, rather than those which are most strongly bonded, normally need to be learned first. Some explicit instruction in frequently occurring collocations containing particular amplifiers, taught as vocabulary, is nevertheless almost certainly worthwhile. These collocations can range from the highly frequent (e.g., very good, really good) to less frequent types such as completely clear, highly skilled, or clearly visible. The most infrequent collocations, however strong the bonding, are almost certainly best left for implicit learning.

Ensuring that language learners get frequent opportunities for internalizing prefabricated word groups is not the only task of the language teacher, but it is surely one of the most neglected. The L1 learner may receive 30,000–40,000 hours of exposure to and practice in using the language by the age of about 12 years. L2 learners may at best expect to get less than 10% of this amount of experience. Because frequency of experience significantly affects learning, the provision of systematic, repeated exposure to collocations in meaningful contexts lies at the heart of the teaching enterprise. Palmer (1925) urged his students to "memorize perfectly the largest number of common and useful word-groups" (p. 2) through repeated exposure to them over a period of time. It is perhaps ironical that after the 1960s, when language teachers rejected the worst excesses of audiolingualism (in which the notion of repetition had been reduced to oral drilling), there was a tendency to lose sight of the continuing importance of repeated exposure to and experience of the units of the language being learned.
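The contrast drawn in this section between strongly bonded and frequently occurring collocations can be made concrete with a small calculation. The following is only a minimal sketch, assuming the mutual information formulation of Church and Hanks (1990), in which the observed co-occurrence frequency is compared with the frequency expected by chance. The function name and every count below are hypothetical and are not taken from the BNC.

```python
import math


def mi_score(pair_freq: int, freq_x: int, freq_y: int, corpus_size: int) -> float:
    """Mutual information: log2 of observed over chance-expected co-occurrence."""
    expected = freq_x * freq_y / corpus_size
    return math.log2(pair_freq / expected)


# Hypothetical counts for illustration only (NOT BNC figures).
# A rare pair whose members seldom occur apart from each other:
rare_pair = mi_score(pair_freq=20, freq_x=2_000, freq_y=50, corpus_size=100_000_000)
# A frequent pair whose members also combine freely with other words:
common_pair = mi_score(pair_freq=5_000, freq_x=120_000, freq_y=80_000, corpus_size=100_000_000)

print(f"rare pair MI:   {rare_pair:.2f}")
print(f"common pair MI: {common_pair:.2f}")
```

Under these invented counts the rare pair receives the higher MI score even though it occurs far less often, which is why frequency of occurrence, and not bond strength alone, is the more useful guide to what should be taught first.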

Consciousness-Raising for Teachers


The most important outcome of corpus-based insights into what language learning entails may be in consciousness-raising for teachers. A language imposes its own curriculum on learners, although it is a curriculum that is normally hidden and may be different from what teachers think is being learned. For most learners, explicit instruction simply cannot be expected to provide enough exposure to establish fluency. As suggested above, explicit instruction may be particularly appropriate for very high frequency linguistic items and processes, and is perhaps especially valuable up to intermediate levels of proficiency. For the rest, learners have to get much of the necessary amount of experience as implicit knowledge, perhaps outside the classroom. The encouragement of autonomous learning, especially through reading, is obviously very important. Elley and Mangubhai (1983) showed how reading can contribute to language learning. The data presented here on amplifier collocations suggest why reading works. Beyond the 4,000–5,000 words learners tend to use in a preliterate context, most of the words and groups of words native speakers have learned have probably been acquired through reading.

CONCLUSION
The research reported in this study reveals how corpus-based analysis can throw light on the nature and extent of collocational bonding between words. Corpus-based analysis of language has clearly gone well beyond the word counting that characterized much of the corpus research until the 1990s but that nevertheless influenced curricula from the 1920s in both British and U.S. English language teaching traditions (Kennedy, 1998). There is now a better appreciation of the nature and scope of the language learning process and especially of how much there is to learn when one learns a language. From the 1970s the conventional wisdom has been that a language should not be taught as an unapplied system, and this insight was applied within the context of communicative language teaching. A major insight from manual corpus-based analyses by English teaching specialists such as Thorndike (1921), West (1953), and Fries (1940) in the days before corpora were available on computers was that languages should be taught as probabilistic systems in which all potential items are not equally weighted in the curriculum. Modern, computerized corpus-based analysis has continued in this tradition and has also shown the extent to which grammatical rules and processes can have lexical constraints, with lexical units often much bigger than dictionary headwords. At a professional level, language educators have been challenged by corpus linguistics to work out how to maximize the exposure learners need to acquire probabilistic implicit knowledge. In addition, data of the kind considered here can reveal something of the cognitive processes that lie behind language learning and use and that enable individuals to become fluent language users, and it is these insights that can be among the most satisfying of all.
ACKNOWLEDGMENTS
I am grateful to Sasha Calhoun for research assistance retrieving data from the corpus. An earlier analysis of 12 of the amplifiers discussed in this article is included in Kennedy (2002).

THE AUTHOR
Graeme Kennedy is professor of applied linguistics and director of the New Zealand Dictionary Centre at Victoria University of Wellington. He completed his PhD at the University of California, Los Angeles, and has held visiting positions in the United States, England, China, and Switzerland. He is general editor of A Dictionary of New Zealand Sign Language (Auckland University Press, 1997) and author of An Introduction to Corpus Linguistics (Longman, 1998).

REFERENCES
Allerton, D. J. (1987). English intensifiers and their idiosyncracies. In R. Steele & T. Threadgold (Eds.), Language topics: Essays in honour of Michael Halliday 2 (pp. 15–31). Amsterdam: Benjamins.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. London: Longman.
Bolinger, D. L. (1972). Degree words. The Hague: Mouton.
BNCweb: A Web-based interface to the British National Corpus. (2002). Retrieved June 5, 2003, from http://homepage.mac.com/bncweb/home.html
Bybee, J., & Hopper, P. (Eds.). (2001). Frequency and the emergence of linguistic structure. Amsterdam: Benjamins.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics, 16, 22–29.
Church, K. W., Gale, W., Hanks, P., Hindle, D., & Moon, R. (1994). Lexical substitutability. In B. T. S. Atkins & A. Zampolli (Eds.), Computational approaches to the lexicon (pp. 153–177). Oxford: Clarendon Press.
Elley, W., & Mangubhai, F. (1983). The impact of reading on second language learning. Reading Research Quarterly, 19, 53–67.
Ellis, N. (Ed.). (1994). Implicit and explicit learning of languages. London: Academic Press.
Firth, J. (1957). Papers in linguistics, 1934–1951. London: Oxford University Press.
Francis, G., Hunston, S., & Manning, E. (1996). Grammar patterns 1: Verbs. London: HarperCollins.
Fries, C. C. (1940). American English grammar (Monograph 10). New York: National Council of Teachers of English.
Hakuta, K. (1974). Prefabricated patterns and the emergence of structure in second language acquisition. Language Learning, 24, 287–298.
Kirsner, K. (1994). Second language vocabulary learning: The role of implicit processes. In N. Ellis (Ed.), Implicit and explicit learning of languages (pp. 283–311). London: Academic Press.
Kennedy, G. (1998). An introduction to corpus linguistics. London: Longman.
Kennedy, G. (2002). Absolutely diabolical or relatively straightforward: Modification of adjectives by degree adverbs in the British National Corpus. In A. Fischer, G. Tottie, & H. M. Lehmann (Eds.), Text types and corpora: Studies in honour of Udo Fries (pp. 151–163). Tübingen, Germany: Gunter Narr.
Leech, G., Rayson, P., & Wilson, A. (2001). Word frequencies in written and spoken English. London: Longman.
Lorenz, G. (1999). Adjective intensification – learners versus native speakers: A corpus study of argumentative writing. Amsterdam: Rodopi.
Nattinger, J. (1980). A lexical phrase grammar for ESL. TESOL Quarterly, 14, 337–344.
Norris, J., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50, 417–528.
Palmer, H. E. (1925). Conversation. Bulletin, 25.
Palmer, H. E. (1933). Second interim report on English collocations. Tokyo: Kaitakusha.
Paradis, C. (1997). Degree modifiers of adjectives in spoken British English. Lund, Sweden: Lund University Press.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and communication (pp. 191–226). London: Longman.
Peters, A. (1977). Language learning strategies: Does the whole equal the sum of the parts? Language, 53, 560–573.
Peters, A. (1983). The units of language acquisition. Cambridge: Cambridge University Press.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the English language. London: Longman.
Robinson, P. (Ed.). (2001). Cognition and second language instruction. Cambridge: Cambridge University Press.
Sinclair, J. M. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Thorndike, E. L. (1921). Teacher's workbook. New York: Columbia Teachers College.
West, M. (1953). A general service list of English words. London: Longman.
Wong-Fillmore, L. (1976). The second time around. Unpublished doctoral dissertation, Stanford University, Stanford, California.

A Combined Corpus and Systemic-Functional Analysis of the Problem-Solution Pattern in a Student and Professional Corpus of Technical Writing
LYNNE FLOWERDEW
Hong Kong University of Science and Technology Hong Kong, China

This article reports on research describing similarities and differences between expert and novice writing in the problem-solution pattern, a frequent rhetorical pattern of technical academic writing. A corpus of undergraduate student writing and one containing professional writing consisted of 80 and 60 recommendation reports, respectively, with each corpus totaling approximately 250,000 words. Drawing on two analytic perspectives, the methodology included searches for key words that provided linguistic evidence for the problem-solution pattern. A more delicate examination of the linguistic meanings encoded in the problem-solution reports involved a systemic-functional approach to analysis of evaluative texts, APPRAISAL (Martin, 2000, in press), as well as an analysis of the lexicogrammatical patterning of the key word problem within a framework of causal relations. Along with many similarities between the expert and the novice writing, findings highlight important differences in the use of problem within the causal relation patterns. Pedagogical implications are discussed.

The problem-solution rhetorical pattern appears frequently in technical reports and other academic writing, perhaps most notably when the author introduces the issue that the report or paper discusses as a problem and then presents the main point of the paper as a solution. As a consequence, successful academic English writers draw on important aspects of their rhetorical knowledge and strategic competence when they exploit the linguistic resources required to express the problem-solution pattern in a range of academic writing. Accordingly, ESL/EFL teachers need to understand this common pattern, particularly in terms of how it is realized linguistically by novice and expert writers. This study addresses that need through a large-scale corpus analysis that was conducted to identify individual lexical items that may realize the two most basic elements of the problem-solution pattern and through a subsequent detailed analysis of the lexicogrammatical patterning of the items realizing the problem-solution elements in the student and the professional corpus.

TESOL QUARTERLY Vol. 37, No. 3, Autumn 2003

PROBLEM-SOLUTION PATTERNS
The problem-solution pattern consists of four basic elements: situation, problem, solution, and evaluation (Hoey, 1983, 2001; Jordan, 1984; Winter, 1986). An advertisement for an Internet service illustrates the main features of the pattern together with their signaling devices:
TRYING TO WORK WITH THE INTERNET? IS THE INTERNET TURNING YOU INTO A MONSTER? LET MCIS HELP YOU CONTROL THE BEAST. MCIS is a Total Internet Solutions Provider and can assist you in the following areas: [A list follows.] (Hoey, 2001, p. 128)

As Hoey explains, the first sentence is the situation element, which in advertisements commonly invites readers to identify with the situation being described (in this case, working with the Internet). The second sentence signals the problem element through the use of the word monster, which is also triggered by the near-synonym beast in the third sentence. This third sentence offers a solution (MCIS) as well as a positive evaluation (help), which is reiterated in the final sentence (can assist you). This brief example demonstrates that the linguistic devices employed are an important consideration for identification of the pattern and that the separate elements may not be confined to one sentence or paragraph but can co-occur in a single sentence, as in Sentence 3 in the example.

The problem-solution rhetorical pattern has been associated with moves identified through genre analysis. They might be seen as overlapping with move structure analysis (see Paltridge, 2001, p. 72) or being subsumed within the structure of a particular move (see Flowerdew, 2000b). To illustrate the latter point, I consider a subsection from the Discussion section of a final-year engineering undergraduate report (see Figure 1). According to Swales (1990, pp. 172–173), the Discussion sections of research reports contain the following eight move structures: background information, statement of results, (un)expected outcome, reference to previous research, explanation, exemplification, deduction, and recommendation.

FIGURE 1
Example of the Problem-Solution Pattern in the Unexpected Outcome Move of a Discussion Section

Results Analysis Modifications

Situation: Although we could not test the concentration of oxygen in the seawater due to equipment failure we could observe that the fish in the tank lacked oxygen as most of them came up to the water surface for respiration.
Problem 1a + partial solution: The original air injection system integrated with the filter could not provide enough oxygen to the culture. We added an external air pump to improve the problem.
Problem 1b + solution: However, we could not inject air into the tank directly as foam might form. As a result, we added an air pump into the foam removal unit, allowing external air to be injected into the unit.
Problem 2 + solution + evaluation: In order to remove carbon dioxide from the culture, we put some seaweed in the tank. This is the most efficient way to remove carbon dioxide from the water.

Note. From Flowerdew (2000b, p. 378).

The subsection in Figure 1 would belong to the unexpected outcome move structure, as it deals with modifications made in the light of unexpected equipment failure. Within this move structure all elements of the problem-solution pattern are present as identified in Figure 1. (In fact, the extract is a slight elaboration of the basic pattern: The first solution proposed, addition of an external air pump, is only a partial one as it sets up another related problem to be solved: the method of injecting air into the tank.)

ANALYSIS OF PROBLEM-SOLUTION TEXTS


Despite the importance of the problem-solution pattern, its moves and linguistic patterns have not been analysed in any great detail through either a genre-based approach or a quantitative corpus analysis. However, neither method alone seems ideally suited to the kind of analysis that would be informative for English language teachers.

Genre Analysis and Corpus Analysis


Although analysis of the problem-solution pattern in technical writing might fall within the interests of genre researchers, no detailed, genre-oriented analysis of this pattern exists. Instead, most genre-oriented analyses have drawn on Swales' (1990) concept of move structures, an approach mainly developed for the introduction-method-results-discussion (IMRD) sections of reports (Bondi, 2001; Connor, Precht, & Upton, 2002; Gledhill, 2000; Marco, 2000; Tribble, 2002; Upton & Connor, 2001; for a detailed report on these approaches to genre-based corpus studies, see Flowerdew, 2002). Genre analysts have tended to disregard the problem-solution pattern, either because it is not relevant to the genre under investigation, such as job application letters (Upton & Connor, 2001), or because it does not accommodate the full range of rhetorical strategies writers use to convey their message (Swales, 1990). In a sense, genre analysis may consist of rather broad strokes for the more delicate lexicogrammatical positioning of moves within the problem-solution pattern.

Nor has the problem-solution pattern been the major focus of quantitative, corpus-based studies. The computer-assisted corpus studies based on a multidimensional statistical analysis of linguistic features have sought evidence for differences in registers based on differing co-occurrence of linguistic features (see Biber, 1988; Biber, Conrad, & Reppen, 1998; Biber, Conrad, Reppen, Byrd, & Helt, 2002). Register differences have been found between general categories such as narrative versus nonnarrative concerns and impersonal versus nonimpersonal style. For example, nouns and attributive adjectives have been found to be associated with informational production. This approach has not been attempted for generic patterns such as problem-solution, perhaps because the focus has been large-scale comparisons based on features identified as quantitatively salient across a corpus of texts. The features of interest for the problem-solution pattern are those which express the meanings that unfold gradually within a text.
Scott's (2000) small-scale quantitative corpus study may point to a more fruitful way of looking at the delicate patterns that realize the meanings expressed in problem-solution texts. Scott investigated the problem-solution pattern in a corpus of 1,783 Guardian newspaper feature articles from 1990 to 1994. Using the keywords function in WordSmith Tools (Scott, 1999), which identifies words of unusually high frequency through comparison with a reference corpus, he ascertained whether types such as problem and solution occurred as key words in the corpus in question. An additional research question was whether these key words, if indeed present, signaled a global problem-solution text structure or played a more local role. Scott found that problem occurred as a key word in only three of the articles. Interestingly, a closer examination of problem revealed that in many cases it had a purely local scope rather than serving as a marker of overall textual organization.
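The idea of a key word, a word of unusually high frequency in a text relative to a reference corpus, can be sketched in a few lines. The example below is an illustration only: it assumes the log-likelihood statistic, one statistic commonly used for keyness, and the source does not state which statistic or threshold was applied in the studies discussed. The function names, the threshold, and all counts are hypothetical.

```python
import math


def log_likelihood(freq_study: int, size_study: int, freq_ref: int, size_ref: int) -> float:
    """Log-likelihood keyness of one word in a study corpus versus a reference corpus."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def is_positive_key(freq_study: int, size_study: int, freq_ref: int, size_ref: int,
                    threshold: float = 6.63) -> bool:
    """True if the word is relatively MORE frequent in the study corpus and
    the keyness value reaches the threshold (6.63 corresponds to p < .01, 1 d.f.)."""
    more_frequent = freq_study / size_study > freq_ref / size_ref
    return more_frequent and log_likelihood(freq_study, size_study, freq_ref, size_ref) >= threshold


# Hypothetical counts: a word occurring 40 times in a 3,000-word report
# and 150 times in a 1-million-word reference corpus.
print(is_positive_key(40, 3_000, 150, 1_000_000))
```

A word that occurs at the same relative rate in both corpora receives a keyness value of zero, so only words that are disproportionately frequent in the text under study are flagged.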

Combining Corpus and Systemic-Functional Perspectives on Genre


Scott's (2000) corpus-based approach, which links specific key words to the generic structures of the problem-solution pattern, seems a useful beginning point for the study of this pattern. However, even broader analytic perspectives are necessary for a fuller understanding of the functions of writers' lexical choices in the problem-solution pattern. In earlier work (Flowerdew, 1998b), I suggested that combining corpus linguistic techniques with systemic-functional perspectives (Halliday, 1994; Halliday & Martin, 1993) might be a way to address the need for analyses of the generic patterns that would be useful to teachers. Such an approach combines the use of corpus linguistics software tools with a systemic-functional analysis to investigate how writers' lexicogrammatical choices express problems and solutions. This research uses the corpus linguistic method of identifying frequent key words and then categorizes the lexical realizations in the problem-solution texts from the perspective of systemic-functional linguistics.

Corpus linguistic methods are used to gather linguistic evidence for classifying reports as belonging to the problem-solution pattern. In particular, the keywords function in WordSmith Tools uncovers lexical items of unusually high frequency, as demonstrated by Scott (2000). A key-word list that contains a variety of lexis relating to the problem-solution pattern would be used as evidence for classifying a corpus as organized according to that pattern. Previously, this tool has been used to delineate particular genres on the basis of the key words (Bondi, 2001; Scott, 1997; Tribble, 2002), but it can be applied as well to uncovering key lexis for the problem-solution pattern.

The systemic-functional category of interest has been defined in terms of APPRAISAL (as described by Martin, 2000, in press), which was developed to analyze the linguistic choices people make to realize interpersonal functions in language.
Interpersonal functions concern how writers use language to evaluate and position themselves regarding situations and events. This framework has mainly been applied to the analysis of media discourse, casual conversation, and literature (see White, 2001, for an overview of these studies) but not to the analysis of student or professional report writing. The interpersonal meanings should play an important role in report writing because of the semantic character of problems and solutions (i.e., problems are bad, and solutions are typically portrayed as good). A means of classifying evaluative language, then, should reveal some aspects of the problem-solution that are relevant for the type of text under investigation.

The APPRAISAL system is used to code evaluative lexis as Inscribed or Evoking. The Inscribed option refers to lexis that is explicitly evaluative, in which the evaluation is encoded in the word. These types of words constitute a superordinate category and would include lexis such as problem and recommendation. The Evoking option, meanwhile, draws on ideational meaning to connote evaluation . . . by selecting meanings which invite a reaction (Martin, in press). An item in this category, such as noise, has an intrinsically less negative connotation than an Inscribed item, such as problem, although a reader's conventional interpretation may still include a negative connotation for the word when seen out of context. Also included in the Evoking category are words such as pollution and contamination, which, although engendering a negative reaction, act more as hyponyms of superordinate items such as problem in the Inscribed category.

The lexicogrammatical patterning of the Inscribed and Evoking keyword items for the problem element can also be described according to systemic-functional categories. The important lexicogrammatical pattern in a problem statement is the LOGICAL sequence of causal relations (Crombie, 1985; Halliday, 1994) because such a statement often entails some aspect of causativity. Lexicogrammatical patternings of the item problem (taken from the corpus of professional texts used in the research described here) can realize five causal relations: reason-result, means-result, grounds-conclusion, means-purpose, and condition-consequence (see Table 1). However, the problem-solution pattern does not necessarily entail causativity (although it is likely to), so noncausal categories in which the lexicogrammatical patternings of problem occur must be examined as well. For example, because problem is a superordinate type of lexis, it is commonly found in topic-like sentences (e.g., There are several solutions to the problem).
Through a combination of corpus and systemic-functional analyses, I addressed the following questions:

1. Does a key-word analysis provide linguistic evidence for classifying a student and a professional corpus as problem-solution based?
2. What kind of lexis realizes the problem and solution elements of the pattern in each corpus?
3. What functions do the lexicogrammatical patternings of the key words serve?
4. How do the lexicogrammatical patterns differ across the two corpora, and how do patterns found in the student corpus compare with those in another corpus?

TABLE 1
Framework of Causal Relations for Problem

Causal relation          Example
Reason-result            . . . export scheme will create a noise problem.
Means-result             . . . thereby averting an odour problem . . .
Grounds-conclusion       . . . and so flooding is not a serious problem.
Means-purpose            . . . in order to alleviate the problem of . . . .
Condition-consequence    If there is a problem with . . . .

Note. Based on Crombie (1985).

METHOD

Corpora
The problem-solution pattern and the problem element of the pattern were investigated in an undergraduate student corpus (STUCORP) and a professional corpus (PROFCORP) of recommendation-based technical reports. Each corpus consisted of approximately 250,000 words, with 80 reports in STUCORP and 60 reports in PROFCORP. The PROFCORP reports were commissioned by the Hong Kong Environmental Protection Department from various consultancy companies in Hong Kong. They documented the potential environmental impacts of the construction and operation of proposed buildings and facilities. These reports also contained a section on suggested measures to alleviate any possible adverse impacts.

The STUCORP reports were written by second- and third-year undergraduate students at a tertiary institution in Hong Kong as an assessed assignment in a technical communication skills course. Brief assignment guidelines given to students stipulated that they were to choose an area for investigation in which a problem or need could be identified based on evidence from secondary and primary source data (e.g., survey questionnaire, interview, observation) and offer a set of recommendations for solving or alleviating the identified problem. The topics of the STUCORP reports were quite wide ranging and mostly concerned different university departmental or service unit issues, such as an evaluation of the existing software or hardware in computer rooms or the lack of security measures in the laboratories. Unlike the PROFCORP reports, however, the STUCORP reports were unsolicited; the students wrote the reports on the basis of a perceived problem rather than in response to a request by a department to investigate an issue.

Most previous research on learner corpora has focused on error analysis in student writing (see, e.g., Granger, 1998). However, I was not concerned solely with an examination of deficiencies in student writing. Rather, I treated the student corpus as a corpus in its own right and examined the major findings from the perspective of whether the students appeared to be competent writers. I compared STUCORP and PROFCORP not only to establish students' deficiencies in writing but also to ascertain to what extent the student writing was like or unlike expert writing, bearing in mind the different contextual and situational features of each corpus. That is, I did not assume that differences necessarily indicated deficiencies in student writing. When differences arose, I made cross-comparisons with a reference corpus, the written component of the British National Corpus (BNC), which includes more diverse types of texts, to establish whether the differences were a specific feature of apprentice writing or could be considered features of competent writing. The written component of the BNC contains around 90 million words covering nine subject domains (for a detailed description, see Aston & Burnard, 1998, chapter 2).

Procedures
For the quantitative corpus analysis, I used WordSmith Tools (Scott, 1999) to find key words (identified by their frequency of occurrence in a specialized corpus relative to their frequency in a more general corpus) in each text and in each corpus. First, I treated each report in PROFCORP and STUCORP as a single text and compared it with the 1-million-word core written component of the BNC (as the full BNC was not available to me at the time) for extraction of the key words in order to determine whether the report reflected the problem-solution pattern. I then created a database of the key-word files for each corpus, listing the key key-words, or those that were most frequent over four or more texts. (The words that were key in three or fewer reports tended to be the names of environmental companies in PROFCORP and university departments in STUCORP.)

The systemic-functional analysis began with the APPRAISAL analysis of the key key-words extracted from STUCORP and PROFCORP. I categorized the key key-words as Inscribed or Evoking and as occurring in the problem or solution elements of the reports. Then, following Sinclair's (1991) idiom principle, I examined the collocational preferences and grammatical structures of selected key words using the concord option in WordSmith Tools. In addition, I classified the key words by their lexicogrammatical patternings within a systemic-functional linguistics framework, that of LOGICAL relations, the semantic relation of causativity (described in the section Lexicogrammatical Analysis of Problem).
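The step from per-report key words to key key-words amounts to counting, for each word, the number of reports in which it is key and keeping the words that reach the cutoff of four texts. A minimal sketch of that aggregation follows; it is not a description of how WordSmith Tools stores or processes its key-word files, and the function name, report labels, and key-word lists are all hypothetical.

```python
from collections import Counter


def key_key_words(keywords_by_text: dict[str, list[str]], min_texts: int = 4) -> dict[str, int]:
    """Return words that are key in at least `min_texts` texts,
    mapped to the number of texts in which they are key."""
    text_counts: Counter[str] = Counter()
    for keywords in keywords_by_text.values():
        # set() ensures a word is counted once per text, however often it is listed
        text_counts.update(set(keywords))
    return {word: n for word, n in text_counts.items() if n >= min_texts}


# Hypothetical key-word lists for four reports (illustration only).
reports = {
    "report_01": ["noise", "impacts", "acme"],
    "report_02": ["noise", "impacts", "dust"],
    "report_03": ["noise", "dust", "noise"],
    "report_04": ["noise", "impacts"],
}

result = key_key_words(reports)
print(result)
```

With this invented data only noise survives the cutoff; a name such as the hypothetical acme, key in a single report, drops out, which mirrors the observation that words key in three or fewer reports tended to be company or department names.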

RESULTS AND DISCUSSION

Key-Word and Key Key-Word Analyses


Each report in PROFCORP contained two or more Inscribed signals (e.g., recommended) and four or more Evoking items (e.g., noise, waste, traffic) occurring as key words. There was thus lexical evidence identifying the reports as belonging to the problem-solution pattern. Likewise, most of the reports in STUCORP contained key-word lexis for the pattern, but the focus tended to be more on the problem element, with problem occurring as a key word in the largest number of reports.

Although the occurrence of Inscribed and Evoking key key-word lexis in the corpora showed that they both comprised problem-solution based reports, the overall profiles of the patterning were somewhat different (see Table 2). In PROFCORP, the problem element tended to favour Evoking lexis whereas the solution element preferred the Inscribed lexis. Some of this evaluative lexis was key in a very large number of reports, which is not surprising given that the PROFCORP reports were relatively homogeneous in their subject matter.

In contrast, the Inscribed lexis dominated the signaling of the problem-solution pattern in STUCORP. The heavy reliance on superordinate terms such as problem, problems, and need for the problem element, and recommendations, solutions, and solution for the solution element, can be traced back to the rubrics for the assignment, which included these Inscribed items in the instructions. Students appear to have incorporated the metalanguage provided in the assignment guidelines into the writing of their recommendation reports to signal the pattern overtly. The influence of related texts on learner writing has been noted by other corpus linguists (see Hyland & Milton, 1997; McEnery & Kifle, 2002; Milton, 2001) and therefore should not be underestimated in the analysis of learner corpora of writing.

The smaller number of lexical items occurring as key key-words across a large number of reports in STUCORP is predictable in view of the fact that the student reports covered a much wider range of topics than those in PROFCORP. A greater variety of topics means that the same lexis would tend not to occur across reports and would not show up as key in four or more reports. Therefore, what initially appeared to be a paucity of Inscribed and Evoking signals in STUCORP, and hence a deficiency in students' writing, may not be so when contextual and situational factors are taken into account. A look at the context of situation (see Halliday & Martin, 1993), which encompasses the field (the content), tenor (the relationship between the writer and the reader), and mode (the kind of text) that would all influence lexical choice, is thus in order.

TABLE 2
Words Used as Inscribed and Evoking Signals for Problem and Solution Elements in STUCORP and PROFCORP

Problem element
  Inscribed
    STUCORP: problem (8), problems (4), need (4)
    PROFCORP: impacts (50), impact (26)
  Evoking
    STUCORP: insufficient (5), stolen (4)
    PROFCORP: noise (44), traffic (23), sewage (12), sewerage (6), waste (20), wastes (5), dust (20), pollution (10), emissions (10), sediments (10), odour (9), contaminated (14), contamination (4), effluent (6), discharge (5), discharges (5), NSRS (9), Dba (8), TSP (7), leachate (6), stormwater (5), groundwater (4), construction^a (47), landfill^a (10)

Solution element
  Inscribed
    STUCORP: recommendations (5), solutions (4), solution (4)
    PROFCORP: mitigation (43), measures (30), proposed (28), recommended (27), recommendations (8), treatment (9), options (5), plan (5), scheme (5), minimise (5), reduce (4), ensure (4)
  Evoking
    STUCORP: (none)
    PROFCORP: disposal (14), implementation (6), Barriers (5), Ordinance (4), construction^a (47), landfill^a (10)

Note. Figures in parentheses denote the number of texts in which the words were found to be key; italics = technical vocabulary.
^a Can signal either the problem or the solution element.

Lexicogrammatical Analysis of Problem


Here I report the LOGICAL relations analysis of only one key word, problem, and identify similarities or differences between its patterning in STUCORP and in PROFCORP. In addition, I compare some of the lexical patterns associated with problem in STUCORP with those in the BNC. (The tokens where problem occurs as a heading or subheading were excluded from the analysis.) A summary of the in-text tokens for problem in PROFCORP and STUCORP according to the five causal categories outlined in Table 1 shows that its frequency and distribution across causal and noncausal categories in the two corpora are quite different (see Table 3). In PROFCORP, 95% of the tokens for problem (39 of 41) fall into a causal category whereas in STUCORP only 32% of the tokens (150 of 473) occur in a causal relation. Below I discuss how problem functioned in causal and noncausal categories in PROFCORP and compare the lexicogrammatical patterning with that found in STUCORP.

TABLE 3
Causal Relation Categories for Instances of Problem in STUCORP and PROFCORP

                              STUCORP         PROFCORP
Relation                      n      %        n      %
Causal
  Reason-result              84     18       29     71
  Means-result                6      1        2      5
  Grounds-conclusion          5      1        1      2
  Means-purpose              48     10        6     15
  Condition-consequence       7      2        1      2
  Total                     150     32       39     95
Noncausal                   323     68        2      5
Overall total               473    100       41    100
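The causal shares quoted above follow directly from the token counts in Table 3. The sketch below reproduces that arithmetic; the counts are taken from the table, while the dictionary layout and the helper name `causal_share` are assumptions made for illustration.

```python
# Token counts for "problem" taken from Table 3.
counts = {
    "STUCORP": {
        "Reason-result": 84, "Means-result": 6, "Grounds-conclusion": 5,
        "Means-purpose": 48, "Condition-consequence": 7, "Noncausal": 323,
    },
    "PROFCORP": {
        "Reason-result": 29, "Means-result": 2, "Grounds-conclusion": 1,
        "Means-purpose": 6, "Condition-consequence": 1, "Noncausal": 2,
    },
}

def causal_share(corpus_counts):
    """Return (causal tokens, all tokens, causal percentage rounded to an integer)."""
    total = sum(corpus_counts.values())
    causal = total - corpus_counts["Noncausal"]
    return causal, total, round(100 * causal / total)

for corpus, corpus_counts in counts.items():
    causal, total, pct = causal_share(corpus_counts)
    print(f"{corpus}: {causal} of {total} tokens in a causal relation ({pct}%)")
```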

The Reason-Result Category of Problem in PROFCORP


In PROFCORP, the majority of the tokens for problem fall into the reason-result category, the area in which I found student writing to differ most widely from professional writing. The collocational patterning of problem in this category can be viewed from the perspective of either cause/reason (i.e., what problem is caused) or result/effect (i.e., what the cause of the problem is) (see Table 4). For the reason-result category, I also consider what is offered as the solution to a potential problem because it is also a type of causation, as explained below with regard to implicit causative verbs.

TABLE 4
Collocational Patterning of Problem in the Reason-Result Category

Collocate                 STUCORP    PROFCORP
Verbs
  Explicit causative         20         12
  Implicit causative         23          6
  Phrase with be              0          8
  All                        43         26
Causative nouns              20          1
Solution nouns               20          0
Prepositions                  1          2
Overall total                84         29

Problem occurs mostly with some type of causative verb in the reason-result relation. Of the 12 explicit causative verbs occurring with problem in PROFCORP, 10 signal cause/reason (e.g., . . . works at the tunnel portal will create a noise problem) whereas only 2 signal result/effect (e.g., The problem derives primarily from . . .). These findings are in keeping with other research on cause-effect markers (Flowerdew, 1998a, 1998c), in which the use of cause/reason verbs far outweighed result verbs in the types and the number of tokens found in professional writing.
Six tokens of problem collocate with implicit causative verbs (minimise, alleviate, eliminate, avert, resolve, address), defined by Fang and Kennedy (1992) as "those which entail the meaning of . . . make somebody/thing do something or make somebody/thing + adj." (p. 65). The verbs in this context can all be roughly paraphrased as make the problem better in some way. For example, minimise in the phrase should minimise much of the problem can be paraphrased as make [the problem] less severe. These data thus reveal that when problem collocates with explicit causative verbs, the verbs have a negative semantic prosody; that is, they indicate some type of adverse happening (e.g., . . . will create a noise problem; see Stubbs, 2001, for more on semantic prosody). However, when problem collocates with implicit causative verbs, the verbs take on a positive semantic prosody. Because these verbs indicate some fortuitous event, they can be seen as acting as a two-way signal for the problem-solution pattern (Hoey, 1983) with the verb acting as the solution element, as in, for example, . . . will minimise much of the problem.
Problem also collocates with the verb be, used as a main verb, in six phrases. Be is not normally considered a causative verb; however, I would argue that in the context of these environmental reports be takes on the semantics of a causative verb rather than acting as a stative verb because it implicitly means that a present problem could create a future one. To illustrate, an explicit causative verb, such as pose or present, could substitute for be in the examples below:
1. . . . it is considered unlikely that septicity would be a problem.
2. Increased noise levels are not expected to be a problem.

When problem occurs with existential there, as in Examples 3 and 4, be also seems to be acting as an event verb, but in this case it acts as a result/effect verb because it has the meaning of arise.
3. . . . there should not be any disposal problem.
4. . . . it will be unlikely that there will be a problem. . . .

In PROFCORP, then, the Inscribed signal problem occurs mostly in causal categories, specifically reason-result. Significantly, problem in this relation overwhelmingly collocates with various verbal devices rather than with other markers of causation, such as prepositions or nouns, and connectors do not figure at all. The two instances of problem in the noncausal category relate to elaboration of the problem (e.g., The problem with all these applications is that users . . . .).
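The verb categories used in this analysis can be pictured as a simple lookup over the verbs that collocate with problem. The sketch below is a hypothetical illustration, not the procedure followed in the study, where concordance lines were classified by hand: the verb lists contain only verbs cited as examples in this article, the input is assumed to be already lemmatised, and the function name `classify_collocate` is an assumption.

```python
from collections import Counter

# Verb lists taken from the examples discussed in the text; they are
# illustrative, not an exhaustive inventory of causative verbs.
EXPLICIT_CAUSE_REASON = {"create", "cause", "pose", "present"}
EXPLICIT_RESULT_EFFECT = {"derive", "arise", "come"}
IMPLICIT_CAUSATIVE = {"minimise", "alleviate", "eliminate", "avert", "resolve", "address"}

def classify_collocate(verb_lemma):
    """Assign a verb lemma collocating with 'problem' to a category."""
    lemma = verb_lemma.lower()
    if lemma in EXPLICIT_CAUSE_REASON:
        return "explicit causative (cause/reason)"
    if lemma in EXPLICIT_RESULT_EFFECT:
        return "explicit causative (result/effect)"
    if lemma in IMPLICIT_CAUSATIVE:
        return "implicit causative"
    if lemma == "be":
        return "phrase with be"
    return "unclassified"

# Hypothetical, hand-lemmatised collocates from a handful of concordance lines.
collocates = ["create", "minimise", "derive", "be", "alleviate", "discuss"]
tally = Counter(classify_collocate(verb) for verb in collocates)
for category, n in tally.items():
    print(f"{category}: {n}")
```

A lookup of this kind can only pre-sort candidates; whether be is functioning causatively, for instance, still depends on reading each concordance line in context.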

The Reason-Result Category of Problem in STUCORP


Although there are far more tokens for problem in STUCORP than in PROFCORP, 473 as opposed to 41 (see Table 3), only 32% (150 tokens) occur in a causal category. As shown in Table 4, these tokens figure most prominently in the reason-result relation, where problem collocates with a variety of causative verbs. The 11 explicit causative verbs used with problem indicating cause/reason, such as cause and create, are similar to those found in PROFCORP, but the range of verbs is far more restricted. I found only 9 instances in which students had tried to use explicit causative verbs denoting result/effect: come from (e.g., . . . and the other problem came from . . .) in 2 examples and rise/arise in 7 examples. However, rise/arise was used correctly in only 2 cases, which were of the pattern problem . . . has arisen. The main reason for the incorrect use of this verb is that students confused an explicit causative verb marking result/effect with ones marking cause/reason, as exemplified below:
5. *It rises a problem that . . . .
6. *The problem seems to be arised out of the fact that . . . .

Example 5 calls for a cause/reason verb such as create. In Example 6 the use of passive voice signals cause/reason; instead, the active voice should be used to signal result/effect.
I now examine the implicit causative verbs collocating with 23 tokens for problem. Whereas in PROFCORP all such verbs have the meaning of make the problem better, in STUCORP these verbs can be divided into two groups: 18 phrases in which the verb has a positive semantic prosody (e.g., solve) and 5 phrases in which the verb has a negative semantic prosody, conveying the meaning that the problem is exacerbated in some way. This difference between the use of implicit causative verbs in the two corpora can be accounted for by the way the report envisions the problem element. The reports in PROFCORP mainly discuss potential environmental problems that could arise from any planned construction work, whereas in STUCORP the problems already exist, as evidenced by primary and secondary source data in the student reports. The five implicit verbs with a negative semantic prosody are used either incorrectly or only marginally correctly. In some cases the student attempted to use an implicit verb as an explicit one, as in the example below:
7. *This situation will deteriorate the problem of . . . .

In other cases, the student used the passive, for example,


8. . . . the problem will probably be worsened.

In nativelike English (Pawley & Syder, 1983) this concept would more likely be expressed by an explicit causative verb + noun, derived from the implicit verb, such as
9. . . . will lead to a worsening of the problem.

I consulted the BNC to ascertain whether this patterning was a feature of student writing or in fact occurred in professional writing. A cross-comparison with the BNC showed that of 272 instances of worsened, only 14 were in the passive, and all were in the past tense, strongly suggesting that the passive use, although possible, is not usual.
Other novice writing characteristics were identified in STUCORP. Within the group of 18 implicit verbs denoting some kind of solution to the problem, there are 11 tokens for solve, 2 tokens for attend to, and 1 token each for resolve, ease, fix, reduce, and get rid of. However, get rid of belongs to an informal register that is inappropriate for the formal context of recommendation report writing and would be better expressed by eliminate. A second characteristic concerns the verb be: the analysis of the verbs in PROFCORP gave a basis for treating be as a causative verb in a few cases, but this function of be was not found in STUCORP. In this corpus, one problem leading to another is expressed by variations on the phrase This . . . problem (e.g., This causes a problem . . . ; This may create a problem . . .).
With regard to the two-way signaling of the problem and solution elements, in STUCORP, unlike PROFCORP, this relation is very often expressed in a lexicogrammatical phrase containing the two nouns solution(s) and problem. Of the 20 instances of such solution nouns collocating with problem (see Table 4), 10 occur in a phrase proposing a specific solution, for example,
10. Imposed penalty schemes may be a solution to this problem.


The other 10 phrases operate at a metadiscourse level, with solution(s) acting cataphorically and problem anaphorically beyond the sentence boundary, as in the examples below:
11. . . . and to suggest some solutions to this problem.
12. . . . we suggest some feasible solutions for the problem.

This kind of explicit signaling is not present in the PROFCORP data, quite possibly because two key subheadings in the reports (Environmental Impacts and Mitigating Measures) fulfill the same function; explicit signaling in the body of the reports would be considered redundant. Variations of the solutions . . . problem lexicogrammatical patterning were found in the BNC, but the specific pattern solution . . . problem was more common (323 instances compared with 45 instances of solutions . . . problem). One reason for the higher occurrence of solutions + problem in the STUCORP data could be that students lacked knowledge of the range of implicit verbs found in the PROFCORP data (e.g., alleviate, eliminate) and therefore overrelied on metalanguage.
Similarly, the student data lacked modals that were used by the professionals. In PROFCORP modals such as would, should, and could, combined with two-way Inscribed signaling verbs such as minimise or reduce, convey the possible degree of success of the proposed solution, for example, . . . should minimise much of the problem. However, STUCORP contains no instances of these modals in this context, and students used possible + solution on only four occasions to convey this epistemic use (e.g., . . . will be a possible solution to the problem). This suggests that students either saw no need for modal marking or, more likely, had a very limited repertoire of modal expressions just as they had a limited repertoire of verbs.
Similar to the findings with solution nouns, the patterning of problem with causative nouns is far more common in STUCORP than in PROFCORP. Students may have been compensating for their small repertoire of verbs by overusing such nouns. Interestingly, similar to the PROFCORP data, STUCORP contains only one instance of problem with a preposition (i.e., such problem may be due to the fact that . . .) and no occurrences of connectors with problem in the reason-result relation.

Other Categories of Problem in STUCORP


The remaining 323 tokens for problem constitute 68% of the total occurrences in STUCORP. Twenty-one of these tokens were found in sentences relating to the purpose of the investigation, for example,
13. In this project our aim is to investigate seriousness of copyright problem . . .

The remaining 302 tokens are in the sections of the reports describing and discussing the findings. The focus of the analysis is the premodification patterning of problem and the anaphoric and cataphoric referencing of this signal. I chose these aspects of the pattern in order to examine whether problem operated at a local level of coherence or played a more discourse-organising role, in which case the referencing would extend across sentence boundaries. About 25% of the tokens for problem are premodified by evaluative adjectives such as common, important, significant, severe, serious, main, and major. In these cases problem tends to be anaphoric but to operate at the sentence level rather than at the discourse level, for example,
14. . . . the lack of modem lines is really a serious problem.

In several cases, students used the referent it (e.g., It is really a serious problem) instead of the referent this, a common interlanguage error in the writing of Hong Kong students that has been noted by other researchers (see Lin, 2002; Milton, 2001). One striking use of premodification is the use of ordinatives (12 tokens), such as first, second, third, and next, and the deictic another (16 tokens). Phrases containing ordinals (e.g., The next/second/third problem . . .) and those with main, major, and minor, which usually combine with ordinals (e.g., The first major problem . . .), have cataphoric reference and are always sentence internal. The main lexicogrammatical pattern is problem + be + noun, with a few cases of the pattern problem + be + that-clause, as in the examples below:
15. The second major problem is the power failure problem.

16. The main problem is that the Division of Humanities could not allocate resources to establish such a centre now.

Twelve of the 16 tokens for another + problem display a similar type of patterning (e.g., Another problem is/was (that) the . . .). These findings are consonant with Schmid's (2000) corpus-based research on the types of nouns that are classified as Inscribed signals in this article. Schmid notes that the lexicogrammatical use of problem nouns is marked by a distinct preference for the patterns noun + be + that (i.e., A problem is that . . .) and the + be + noun (i.e., The . . . is a problem) (p. 122).
Overall, these data indicate that the students' choice of premodifier with problem varied with whether the noun phrase was used for cataphoric or anaphoric reference. In addition, as in the above examples, the anaphoric and cataphoric referencing is almost always sentence internal. What is most significant, however, is that the anaphoric and cataphoric referencing patterns associated with problem in noncausation phrases are markedly different from those of problem in causal phrases. In all causal categories, problem is invariably anaphoric but, most importantly, operates beyond the sentence boundary (e.g., . . . recommendation to solve the problem). Likewise, in PROFCORP, the same kind of anaphoric referencing at the discourse level is present in the causation-related phrases (e.g., . . . should minimise much of the problem). According to Francis (1986, 1994), problem is one of the most common discourse-organizing anaphoric nouns, which she terms A-Nouns (Schmid, 2000, refers to such discourse-organizing nouns as shell nouns). In the example provided by Francis (1994), problem is premodified by this and also happens to be involved in a causal relation, signaled by to get around:
the patient's immune system recognised the mouse antibodies and rejected them. This meant they did not remain in the system long enough to be fully effective. The second generation antibody now under development is an attempt to get around this problem by humanising the mouse antibodies, using a technique developed by . . . . (p. 85)

Francis's claim about the frequency of problem as a discourse-organizing anaphoric noun is generally true in both STUCORP and PROFCORP when problem is involved in some kind of causal relation. However, other data relating to noncausation phrases do not support Francis's premise. Moreover, Hoey (1997) remarks that Francis seems to be overstating the anaphoric importance of signaling nouns such as problem. He points out that nominal groups containing another also label a previous stretch of text. Thus, a full interpretation of another problem requires the reader to relate both an earlier and a later lexicalisation of problem, which suggests that the noun phrase functions both cataphorically and anaphorically. The anaphoric function of such nouns as problem is also called into question when it co-occurs with a demonstrative such as this. As Tony McEnery (personal communication, May 29, 2003) has pointed out, the burden of anaphoric reference is carried by this, and the noun problem has the function of evaluating what is being referred to.
Consulting the Applied Science component of the BNC (approximately 7 million words) sheds some light on this issue. An examination of the 222 concordance lines of This/this problem showed that 138 (62%) are causation based, with 131 combining with a verb signaling the solution element, for example, This problem was overcome by providing the lamps with locks. A totally different picture emerges from the concordance lines for problem with other premodifiers in the same subcomponent of the BNC. For example, of the 36 instances of another problem, only 5 are anaphoric, and they are not discourse based but sentence internal. The majority have cataphoric reference and are of the same patterning as those found in STUCORP. Moreover, only 5 are causation based, all employing the verb arise. Not surprisingly, ordinatives displayed patterning similar to that of another. As for evaluative adjectives (e.g., serious, common), examples from the same subcomponent of the BNC show the lexicogrammatical patterning to be very similar to that found in STUCORP, that is, having anaphoric reference within the sentence (e.g., . . . stray radiations become a serious problem . . .). However, in 8 of the 20 concordance lines of serious problem, the anaphoric sentence referent is this (e.g., This is a serious problem . . .), which depends on a previous stretch of discourse for its relexicalisation. Most significantly, though, of the 21 examples of serious problem and 20 examples of common problem examined in the BNC, there is only one instance of a causation-related sentence. Therefore, unlike the data for problem in causation-related phrases, these data do not support Francis's (1986, 1994) premise that problem is a common A-Noun.
In sum, the analysis of the above data in PROFCORP and STUCORP and further examples from the BNC suggests that Francis (1986, 1994) is correct in saying that problem functions anaphorically at the discourse level but that this statement is mainly applicable to its role in causal relations. Moreover, the BNC data confirm that when problem is immediately premodified by this or the, it plays an important role in causal relations, but this was not found to be the case with other premodifiers, such as evaluative adjectives and ordinatives. As Schmid (2000) points out, "Shell nouns and shell-noun phrases can only be studied appropriately if what they link up with is taken into account" (p. 8). These data from PROFCORP, STUCORP, and the BNC highlight the importance of also taking into account the semantic relations that a noun such as problem may be involved in when determining its discourse-organizing role.
The large number of occurrences of problem in noncausal categories is a difference between STUCORP and PROFCORP. However, the analysis here has shown that the occurrences have anaphoric and cataphoric referencing similar to that found in the Applied Science component of the BNC.
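The premodifier groupings discussed in this section (this or the, ordinatives, another, and evaluative adjectives) can be approximated automatically as a first pass before concordance lines are read by hand. The sketch below is a hypothetical illustration rather than the method used in the study: the word lists are limited to items named in the text, the function name `premodifier_class` is an assumption, and only the word immediately preceding problem is inspected.

```python
import re
from collections import Counter

# Word lists limited to the premodifiers named in the text.
PREMODIFIER_CLASSES = {
    "demonstrative/article": {"this", "the"},
    "ordinative": {"first", "second", "third", "next"},
    "another": {"another"},
    "evaluative adjective": {"common", "important", "significant",
                             "severe", "serious", "main", "major"},
}

def premodifier_class(sentence):
    """Classify the word immediately before the first 'problem' in a sentence."""
    match = re.search(r"\b(\w+)\s+problem\b", sentence, flags=re.IGNORECASE)
    if match is None:
        return None  # no premodified instance of "problem" found
    word = match.group(1).lower()
    for label, words in PREMODIFIER_CLASSES.items():
        if word in words:
            return label
    return "other"

# Example sentences adapted from those quoted in the article.
examples = [
    "This problem was overcome by providing the lamps with locks.",
    "The second major problem is the power failure problem.",
    "It is really a serious problem.",
    "Another problem is that the . . .",
]
tally = Counter(premodifier_class(s) for s in examples)
for label, n in tally.items():
    print(f"{label}: {n}")
```

Because only the immediately preceding word is checked, a phrase such as The second major problem is filed under the evaluative adjective major rather than the ordinative second; deciding whether the reference is anaphoric or cataphoric still requires reading the surrounding text.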

PEDAGOGIC CONSIDERATIONS
The comparison of the lexicogrammatical patterning of the Inscribed key-word signal problem in the student and professional corpora, in conjunction with the comparison of the student corpus with the BNC, has indicated aspects of student writing that mirrored professional writing and other aspects that differed. As far as the noncausal category is concerned, student writing displayed a type of lexicogrammatical patterning similar to that found in professional writing. A range of different types of premodifying adjectives were found with problem, and the anaphoric and cataphoric referencing was similar to that found in professional writing.
The main deficiencies in student writing surfaced in the patterning of problem in the causal categories. Causative verbs, which play an important rhetorical role in certain types of problem-solution texts, have been shown to be a major area of difficulty for students in several aspects. Previous corpus-based research carried out on Hong Kong tertiary students' English (Flowerdew, 2000a; Milton, 2001) has revealed that verb choice is problematic for students, with Milton commenting that "students more frequently err in their choice of correct verb than any other word class" (p. 73). Incorrect verb choice has also been uncovered in this study, as evidenced by the confusion between cause/reason and result/effect verbs. However, this research has also pinpointed other deficiencies in the verbal domain, namely, restricted lexical knowledge and semantic inappropriacy. I have surmised that students' lack of verbal lexical knowledge led them to overuse the pattern solution + problem instead of using the lexicogrammatical pattern of implicit causative verb (e.g., alleviate, minimise) + problem, which was found in both PROFCORP and the BNC. As Conrad (2000) points out, such corpus-based empirical data are important for language teaching as they can show which lexicogrammatical associations are most commonly used. Below I suggest ways to tackle students' deficiencies in the area of causative verbs.
In order to alert students to the differences between explicit cause/reason and result/effect verbs, teachers might have students concordance a selection of these verbs (e.g., arise from, pose, present, attribute to) in a subsection of the BNC and classify them by verb category.
Students could also concordance implicit causative verbs such as minimise, eliminate, and alleviate, noting their lexicogrammatical and collocational patterning; for example, in the BNC minimise frequently collocates with nouns having a negative semantic prosody, such as risk, trouble, and disaster. In the concordancing exercises, teachers could draw students' attention to the combination of epistemic modal verbs with these implicit causative verbs, combinations that were also found to be lacking in students' writing. In order to sensitize students to different registers of causative verbs, teachers could ask students to analyze the frequency and uses of pairs such as eliminate and get rid of. Because get rid of occurs more in spoken text in the BNC (e.g., Will we ever get rid of these mosquitoes?), students could infer that it may not be stylistically appropriate for formal report writing.
However, as Widdowson (2000) has pointed out, caveats are in order in the transfer of corpus findings to pedagogy. For example, if students were to analyze concordances of problem in the Applied Science subcorpus of the BNC, they would find some instances of problem collocating with get rid of even though it would not be an appropriate patterning to transfer to the type of report writing they are being asked to do. For this reason, concordancing should be used judiciously with students, and teachers may need to discuss some registerial and pragmatic aspects of the corpus data to avoid inappropriate transfer to students' writing (Flowerdew, 2001). Other caveats for the use of corpora in language teaching are summarised by Hunston (2002).
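For teachers without access to a dedicated concordancer, the concordancing exercises suggested above can be approximated with a few lines of code. The sketch below is a minimal key-word-in-context (KWIC) display, offered as an assumption-laden illustration: the function name `concordance`, the context width, and the three-sentence sample text (assembled from examples quoted in this article) are hypothetical, and a real exercise would run over a corpus file rather than a short string.

```python
import re

def concordance(text, node, width=30):
    """Return key-word-in-context lines for every occurrence of `node`."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", flags=re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Hypothetical sample text built from sentences quoted in the article.
sample = (
    "Imposed penalty schemes may be a solution to this problem. "
    "Will we ever get rid of these mosquitoes? "
    "The measures should minimise much of the problem."
)

for node in ("problem", "get rid of", "minimise"):
    print(f"--- {node} ---")
    for line in concordance(sample, node):
        print(line)
```

Students could then sort or annotate the resulting lines by verb category or by the modal that precedes the node, with the teacher pointing out which patterns suit formal report writing.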

CONCLUSION
Corpus linguistic techniques, combined with systemic-functional analyses, are a valuable tool for the investigation of the problem-solution pattern in a student and professional corpus of technical writing. In addition, contextual and situational factors, such as the report-writing guidelines, are important considerations in interpreting quantitative data on the Inscribed and Evoking key-word lexis in both corpora. General, large-scale corpora such as the BNC can play a valuable role in the analysis of small-scale corpora. I used the BNC in two ways: as the reference corpus for determining the key words in STUCORP and PROFCORP and for comparative purposes to see whether my findings were specific to my corpora or generalizable within a wider written domain. The student writing I investigated displayed a restricted use of the vocabulary and patterns commonly found in professional writing. It is this kind of apprentice writing that de Beaugrande (2001) singles out when he writes, "Our major problem is not so much bad English or incorrect English, as is often lamented, but rather insufficient English" (p. 10). The investigation of learner corpora can therefore provide useful insights for instructional purposes not only in terms of errors but also in terms of insufficient lexicogrammar.
ACKNOWLEDGMENTS
I thank my PhD supervisor, Michael Hoey, for his helpful advice on my thesis, on which this article is based. I also thank Susan Conrad and two anonymous reviewers for their useful comments on earlier drafts.

THE AUTHOR
Lynne Flowerdew coordinates a technical communications skills course at the Hong Kong University of Science and Technology. Her main research interests include corpus linguistics, text linguistics, English for specific purposes, and syllabus design.


REFERENCES
Aston, G., & Burnard, L. (1998). The BNC handbook. Edinburgh, Scotland: Edinburgh University Press. Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press. Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press. Biber, D., Conrad, S., Reppen, R., Byrd, P., & Helt, M. (2002). Speaking and writing in the university: A multidimensional comparison. TESOL Quarterly, 36, 948. Bondi, M. (2001). Small corpora and language variation: Reexivity across genres. In M. Ghadessy, M. A. Henry, & R. Roseberry (Eds.), Small corpus studies and ELT (pp. 135174). Amsterdam: Benjamins. Connor, U., Precht, K., & Upton, T. (2002). Business English: Learner data from Belgium, Finland and the U.S. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition and foreign language learning (pp. 175194). Amsterdam: Benjamins. Conrad, S. (2000). Will corpus linguistics revolutionize grammar teaching in the 21st century? TESOL Quarterly, 34, 548560. Crombie, W. (1985). Discourse and language learning: A relational approach to syllabus design. Oxford: Oxford University Press. de Beaugrande, R. (2001). Large corpora, small corpora and the learning of language. In M. Ghadessy, M. A. Henry, & R. Roseberry (Eds.), Small corpus studies and ELT (pp. 328). Amsterdam: Benjamins. Fang, X., & Kennedy, G. (1992). Expressing causation in written English. RELC Journal, 23, 6280. Flowerdew, L. (1998a). Concordancing on an expert and learner corpus in ESP. CLL Journal, 8(3), 37. Flowerdew, L. (1998b). Corpus linguistic techniques applied to text linguistics. System, 26, 541552. Flowerdew, L. (1998c). Integrating expert and interlanguage computer corpora ndings on causality: Discoveries for teachers and students. English for Specic Purposes, 17, 329345. Flowerdew, L. (2000a). Investigating referential and pragmatic errors in a learner corpus. 
In L. Burnard & T. McEnery (Eds.), Rethinking language pedagogy from a corpus perspective (pp. 145154). Frankfurt am Main, Germany: Peter Lang. Flowerdew, L. (2000b). Using a genre-based framework to teach organizational structure in academic writing. ELT Journal, 54, 369378. Flowerdew, L. (2001). The exploitation of small learner corpora in EAP materials design. In M. Ghadessy, M. A. Henry, & R. Roseberry (Eds.), Small corpus studies and ELT (pp. 363379). Amsterdam: Benjamins. Flowerdew, L. (2002). Corpus-based analyses in EAP. In J. Flowerdew (Ed.), Academic discourse (pp. 95114). London: Longman. Francis, G. (1986). Anaphoric nouns (Discourse Analysis Monographs 11). Birmingham, England: University of Birmingham, English Language Research. Francis, G. (1994). Labelling discourse: An aspect of nominal-group lexical cohesion. In M. Coulthard (Ed.), Advances in written text analysis (pp. 83101). London: Routledge. Gledhill, C. (2000). The discourse function of collocation in research article introductions. English for Specic Purposes, 19, 115135. Granger, S. (Ed.). (1998). Learner English on computer. London: Longman. Halliday, M. A. K. (1994). Introduction to functional grammar. London: Edward Arnold.

ANALYSIS OF THE PROBLEM-SOLUTION PATTERN

509

Halliday, M. A. K., & Martin, J. R. (1993). Writing science: Literary and discursive power. Pittsburgh, PA: University of Pittsburgh Press.
Hoey, M. (1983). On the surface of discourse. London: Allen & Unwin.
Hoey, M. (1997, April). Some text properties of certain nouns. Plenary paper presented at the Colloquium on Discourse Anaphora and Reference Resolution, Lancaster, England.
Hoey, M. (2001). Textual interaction. London: Routledge.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.
Hyland, K., & Milton, J. (1997). Qualification and certainty in L1 and L2 students' writing. Journal of Second Language Writing, 6, 183–205.
Jordan, M. P. (1984). Rhetoric of everyday English texts. London: Allen & Unwin.
Lin, L. (2002). Overuse, underuse and misuse: Using concordancing to analyse the use of "it" in writing of Chinese learners of English. In M. Tan (Ed.), Corpus studies in language education (pp. 63–76). Bangkok, Thailand: IELE Press.
Marco, M. J. L. (2000). Collocational frameworks in medical research papers: A genre-based study. English for Specific Purposes, 19, 63–86.
Martin, J. R. (2000). Beyond exchange: APPRAISAL systems in English. In S. Hunston & G. Thompson (Eds.), Evaluation in text (pp. 142–75). Oxford: Oxford University Press.
Martin, J. R. (in press). Sense and sensibility: Texturing evaluation. In J. Foley (Ed.), New perspectives on education and discourse. London: Continuum.
McEnery, T., & Kifle, N. (2002). Epistemic modality in argumentative essays of second-language writers. In J. Flowerdew (Ed.), Academic discourse (pp. 182–195). London: Longman.
Milton, J. (2001). Elements of a written interlanguage: A computational and corpus-based study of institutional influences in the acquisition of English by Hong Kong Chinese students (Research Reports, Vol. 2). Hong Kong: Hong Kong University of Science and Technology, Language Centre.
Paltridge, B. (2001). Genre and the language learning classroom. Ann Arbor: The University of Michigan Press.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. Richards & R. Schmidt (Eds.), Language and communication (pp. 191–226). London: Longman.
Schmid, H.-J. (2000). English abstract nouns as conceptual shells: From corpus to cognition. New York: Mouton de Gruyter.
Scott, M. (1997). PC analysis of key words and key key words. System, 25, 233–245.
Scott, M. (1999). WordSmith Tools [Computer software]. Oxford: Oxford University Press.
Scott, M. (2000). Mapping key words to problem and solution. In M. Scott & G. Thompson (Eds.), Patterns of text (pp. 109–127). Amsterdam: Benjamins.
Sinclair, J. M. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Stubbs, M. (2001). Words and phrases: Corpus studies of lexical semantics. Oxford: Blackwell.
Swales, J. M. (1990). Genre analysis: English in academic and research settings. Cambridge: Cambridge University Press.
Tribble, C. (2002). Corpora and corpus analysis: New windows on academic writing. In J. Flowerdew (Ed.), Academic discourse (pp. 131–149). London: Longman.
Upton, T., & Connor, U. (2001). Using computerized corpus analysis to investigate the textlinguistic discourse moves of a genre. English for Specific Purposes, 20, 313–329.


White, P. R. R. (2001). The Appraisal Website: Homepage. Retrieved June 14, 2003, from http://www.grammatics.com/appraisal
Widdowson, H. G. (2000). On the limitations of linguistics applied. Applied Linguistics, 21, 3–25.
Winter, E. O. (1986). Clause relations as information structure: Two basic text structures in English. In M. Coulthard (Ed.), Talking about text (pp. 88–108). Birmingham, England: University of Birmingham, English Language Research.


BRIEF REPORTS AND SUMMARIES


TESOL Quarterly invites readers to submit short reports and updates on their work. These summaries may address any areas of interest to Quarterly readers.

Edited by SUSAN CONRAD
Portland State University

The Corpus of English as Lingua Franca in Academic Settings


ANNA MAURANEN
University of Tampere
Tampere, Finland

The English language has established itself as the global lingua franca, that is, a vehicular language spoken by people who do not share a native language. This unprecedented spread of one originally ethnic language can be traced to British colonialism and later to the economic and political power of the United States, but the origins have ceased to be the prime motivation for the continued spread of the language. Most of its use today is by nonnative speakers (NNSs), and the number of people speaking it as a foreign or second language has surpassed the number of its native speakers (NSs) (about 80% of speakers of English are estimated to be bilingual users; see Crystal, 1997). As a consequence, voices in the English teaching profession and among scholars in the field (see, e.g., Kachru, 1996; Knapp, 2002; McArthur, 2001; Rampton, 1990; Seidlhofer, 2000; Widdowson, 1994) have questioned the NSs' status as the most relevant model for teaching English and have called for the development of models for international speakers that are more appropriate to the changed role of English.

In view of the growing recognition of the widespread use of English, it is surprising that English as a lingua franca (ELF) has been little described as a language form. Native or established world varieties of English (corresponding to the inner and outer circles of Kachru, 1985, but excluding the expanding circle) have attracted scholarly attention even among corpus studies (e.g., as in the International Corpus of English), and so has learner English (e.g., Granger, 1998), but only a very small body of empirical research (notably, work by Jenkins, 2000; Knapp, 1987; Meierkord, 1998, 2000; papers in Knapp & Meierkord, 2002) has focused on ELF. This variety of English therefore merits attention as an object of serious research.

In this report I present the outlines of a project at the English department of Tampere University, Finland, that aims to explore one variety of lingua franca English, namely, academic speaking. Findings on advanced learners' use (as reported in, e.g., Altenberg & Granger, 2002) provide useful hypotheses for ELF; so, for example, if sophisticated learners underuse delexical forms of frequent verbs, as Altenberg and Granger suggest, it is a reasonable prediction that speakers of lingua franca English resort to similar strategies. However, the angle of the research is quite different: Instead of seeing this underuse as a problem merely because it deviates from comparable NS use, such features, if typical, are regarded as acceptable characteristics of the variety unless there is evidence that they lead to misunderstandings and communicative dysfluencies in ELF discourse. In addition, L2 speakers who manage important parts of their lives using ELF fluently are not construed as learners as if they were on the way toward the (unattainable) goal of nativeness.

THE NEED FOR AN ELF CORPUS


Investigating ELF serves three kinds of research interests: theoretical, descriptive, and application oriented. My focus here is on the applications, but I first briefly discuss theoretical and descriptive interests.

Theoretical Interest
The theoretical interest in ELF arises from the special nature of English as a contact language, or vehicular language: in a broad sense, a new variety that emerges in situations where interlocutors do not share an L1. However, unlike most contact languages that have been widely studied, ELF is usually a language that both communicating parties have been formally taught at some point. Communicators thus tend to share some, albeit diverse, educational background in English. ELF can therefore be expected to display interference features that are due to variable learning and can be likened to contact-induced changes in any language, even though the contact is not between two particular languages but between English and a wide variety of others. As, for example, Thomason (2001, p. 75) points out, imperfect learning tends to cause structural or phonological rather than lexical changes in the target language and more often than not leads to simplification rather than complication of the target language structure. I prefer to use the term variable rather than imperfect learning (see also Brutt-Griffler, 2002, pp. 128–129, for a quite different critique of imperfect learning) because learner variation cannot be reduced to points on a single unidimensional scale. Variation also derives from different targets and practices of teaching. Nevertheless, Thomason's (2001) assumptions still appear valid.

Although such tendencies are not absolute (and depend heavily on the understanding of simplicity and complexity), it is fair to assume that ELF tends toward some kind of structural simplification, or generally unmarked features, because as a global language it has an exceptionally rich variety of L1s among its users. Simplification is far from a straightforward issue in linguistic analysis, so evidence for (or against) its manifestations in ELF will be significant. For example, although particularly frequent use of the most frequent lexical items has been observed in both translated language and learner language, plenty of at least anecdotal evidence shows nonnatives using the kind of lexis in speech that is more typical of writing in L1 English (e.g., Sinclair, 1991). This phenomenon might be seen as genre simplification, or as lexical complexification if the preferred spoken use of L1 speakers is seen as simple in some reasonable sense.

Moreover, a major mechanism of contact-induced change, negotiation, in which speakers change their language to approximate what they believe to be the patterns of another language or dialect (Thomason, 2001), is likely to be at work in ELF situations. Because ELF users have diverse backgrounds, their guesses about what they share are likely to be most accurate in the case of most widely shared, unmarked features.
As these tend to be generally the easiest to learn as well, one might predict that most pervasive ELF features will be generally, perhaps universally, unmarked ones.

Linguistic features that learners of English may have difficulties with but that are regarded as universal features of communication can also be tested on ELF data; for instance, Brazil's (1985, 1995) model of English discourse intonation distinguishes information that is being referred to (i.e., assumed to be shared) from that which is proclaimed (i.e., assumed to be newsworthy). Insofar as these categories are shared across languages, albeit in different realisations, they can be expected to find expression in ELF, given that the communicators already are competent speakers of at least one natural language. The marking of this distinction may, of course, not be identical to that of native English speakers. Along similar lines, McCarthy (2001, p. 39) suggests that discourse marking is a universal feature; if this is so, it should also be detectable in ELF data.

Recent work on both corpora (e.g., Sinclair, 1991) and language processing (e.g., Ellis, 1996, 2002; Wray, 2002) has challenged descriptions of language based on strict dichotomies of lexis and grammar. The units that most clearly distinguish even the most competent NNSs of a language from NSs seem to be collocations, idioms, and other (semi-)formulaic speech (e.g., Nattinger & DeCarrico, 1992). One of the questions that arises with respect to fluent ELF communities, especially fairly well-established ones (e.g., the European Union, academic programmes, disciplinary fields), is to what extent they might be developing their own formulae. Little evidence has emerged so far because the research that has explored such patterning has invariably constructed nonnatives as learners and has consequently sought differences between their use and that of NSs rather than approaching the nonnative language as a variety in its own right. It is therefore important to fill this gap and compile a database large enough to reveal possible patterning in ELF language studied for its own sake.

On the whole, the data ought to provide useful insights into controversies about universal versus culture-specific communicative features (for views emphasising cultural specificity in communicative habits see, e.g., Carbaugh, 1990; Scollon & Scollon, 1995). In brief, the theoretical interests in ELF data centre around the possibility of finding manifestations of simplification, evidence for universally unmarked features, and hypothesised universals of communication as well as evidence for self-regulative patterning.

Descriptive Interest
The second facet of interest in ELF is descriptive: What features of English constitute the core in an ELF perspective? The core elements of standard English have been claimed to constitute the basis of most standard reference works of English, such as grammars and dictionaries, especially those oriented toward the international audience. The enormous reference corpora that represent contemporary English (the Bank of English, http://titania.cobuild.collins.co.uk; the British National Corpus, http://www.hcu.ox.ac.uk/BNC/) likewise seek to cover the core, that is, what (native) speakers of English share. It is reasonable to assume that what constitutes the core of the language for NSs deviates from what may emerge as its lingua franca core. Clearly, the notion of core can be questioned, but there is no ad hoc reason to assume that an ELF core is inherently less plausible than a standard English core, a common, tacitly held assumption despite the recognised existence of several standard varieties. Some linguists (e.g., Biber, Conrad, & Reppen, 1998; Hyland, 2002) see variation according to function, genre, or region as more fundamental than the commonalities of any English. Theoretically, one can postulate strong centrifugal as well as centripetal tendencies operating on English worldwide (cf. Brutt-Griffler, 2002), but the relevance of a core needs to be resolved on empirical grounds.

English phonology in lingua franca communication has been fairly extensively described by Jenkins (2000), but clearly other aspects of language are of equal interest to English scholarship. The descriptive work that I hope will be undertaken in the near future is crucial for both testing theoretical hypotheses and developing practicable applications, and thus feeds into the other facets outlined here. Such research is by nature collaborative and requires an international network of scholars, which is beginning to take shape.

Practical Applications of ELF Research


The applications of theoretical and descriptive work on ELF are of considerable practical significance in today's world. The usefulness of reasonable and achievable targets of learning can be seen roughly from two viewpoints. The first relates to general educational goals and people's language rights, and the second to efficiency.

As to the first point, it is important for people to feel comfortable and appreciated when speaking a foreign language. Speakers should feel they can express their identities and be themselves in L2 contexts without being marginalised on account of features like foreign accents, lack of idiom, or culture-specific communicative styles as long as they can negotiate and manage communicative situations successfully and fluently. An international language can be seen as a legitimate learning target, a variety belonging to its speakers. Thus, deficiency models, that is, those stressing the gap that distinguishes NNSs from NSs, should be seen as inadequate for the description of fluent L2 speakers and discarded as the sole basis of language education in English. Moreover, learners with a lingua franca target should be particularly sensitized to interpersonal aspects of language and intercultural competence (as distinct from familiarity with the target culture) because the expected intercultural encounters are much less predictable than those in which L1 speakers (especially of a given nation or culture) constitute the other party. English also differs from many other languages that are taught as foreign languages; even if many are not confined to one nation (e.g., German, Spanish, French), they are nevertheless much more readily identifiable with a limited number of cultural contexts, and target culture expectations are more salient for them.

The second viewpoint on application relates to efficiency. Holding up an NS model as the target for international users of English is counterproductive because it sets up a standard that by definition is unachievable.
A more powerful solution is to capitalise on learners' strengths in acquiring and focusing on those aspects of the language that are relatively easy to learn (as core elements tend to be) and are most useful in communicating with other ELF speakers. What the core elements are and what is most useful for learners have been subjects of pedagogical debate for a long time, but the ensuing suggestions have never been based on empirical analyses of actual language use. One of the best known and most successful and systematic efforts at reducing English for learners to a manageable size was Basic English (Ogden, 1930, discussed by Seidlhofer, 2002, in an interesting way), but even that was based on a linguist's reasoning about simplicity, coverage, and usefulness, not on actual empirical evidence. Most foreign language teaching materials are based on traditions that have been handed down from one teacher generation to the next, and challenges to earlier practices have been based on ideology (e.g., communicative teaching, the notional-functional syllabus), innovative methods (e.g., direct teaching, total physical response), or educational theory (e.g., learning as habit formation, the natural approach, task-based learning). Recently, corpus studies (e.g., Aston & Burnard, 1998; Burdine, 2001; Lawson, 2001; McCarthy, 1998) have been able to challenge the linguistic underpinning of many such traditions by showing that they deviate from what is in fact common in the language. Yet even this recent criticism is based on corpus data gathered solely from NS usage.

The need for other than NS standards is recognised in the applied linguistics field; for example, McCarthy (2001) argues that as a programme for research within applied linguistics, identifying criteria for expert use of a language like English in different cultural contexts is an urgent one (p. 141). This argument follows along the lines of Crystal's (1997) world standard English, and there is certainly a good deal of work to be done here.
But there are caveats as well: First, a globally unified standard is probably an unrealistic as well as undesirable goal. Those countries where English is used as an adopted L2 in bilingual settings are probably inclined to and should continue developing their own local standards. In contrast, many international discourse communities, including the academic, which still make U.S. and United Kingdom native standards their reference point, would do well to rethink the relevance of these standards to their own needs. It may be realistic to start developing standards fairly modestly, either limiting the goal by level of language (like Jenkins's, 2000, phonology), or function (e.g., academic or educational or business English), or even region (e.g., European ELF). Secondly, particular discourse communities using ELF might plausibly develop their own norms of use, that is, standards of what is acceptable, comprehensible, and adequate for efficient communication more or less spontaneously. In other words, self-regulatory mechanisms operate in shaping the communicative practice of the discourse community. Therefore, the point of departure for establishing standards for teaching and assessment ought to be the empirical study of discourse communities that use ELF successfully.

ACADEMIC ELF IN ITS SOCIAL CONTEXT


The academic community can be considered a discourse community (Swales, 1990, 1998), that is, a social formation with its own discourses that serve both as resources and as products of the community. The particular discourses adopted by communities serve a gatekeeping function (Bourdieu & Passeron, 1977), and novices need to acquire these discourses before they can regard themselves as full members. At the same time, the discourses bring cohesion to the community and mark its identity. Following Killingsworth (1992), academic discourse communities can be divided into local, consisting of members who habitually interact, and global, defined exclusively by a commitment to particular actions and discourses. This distinction is made with written discourses in mind. With academic ELF discourse communities, the local community is relevant as the scene for spoken interaction, which needs to be performed and negotiated in the immediate context, whereas the global community is relevant as the general backdrop and the eventual target community for novice participants. Participants in an international encounter already presume or share global elements of the academic world, although the precise degree and nature of the shared elements are elusive and need to be renegotiated with each encounter. On the other hand, the ultimate goal of those who participate in international programmes is membership in international discourse communities. It is to be expected, then, that the primary identity constructed with the use of English is international, with its diverse associations. The discourses to be investigated in the Tampere project can therefore be characterised as (a) academic, with the dialectical global-local identity that this involves; (b) multicultural, because the actual face-to-face encounters involve engaging with cross-cultural interaction; and (c) international, with no clear or defined national anchorage.

A CORPUS OF ELF IN ACADEMIC SETTINGS (ELFA)


Making sense of academic lingua franca English requires a good database. A fairly large corpus is the best way to discover regularities in this highly variable and basically unexplored territory of English through quantitative and qualitative analyses. We at Tampere University are therefore compiling a corpus of spoken academic English (the ELFA corpus). The material is recorded in international degree programs and other university activities regularly carried out in English. The first phase aims at 0.5 million words, which is already substantial for a corpus of spoken language. Additional material is being recorded at Tampere Technological University and at international conferences. Both universities have some degree programs for international as well as Finnish students. These programs are run entirely in English, with students and some of the faculty coming from a variety of countries, mostly European.

The approach is not dissimilar to that adopted in two U.S. corpora. The Michigan Corpus of Academic Spoken English (MICASE; see Simpson, Briggs, Ovens, & Swales, 1999) took one university as the point of departure, and the TOEFL 2000 Spoken and Written Academic Language (T2K-SWAL) corpus (see Biber, Reppen, Clark, & Walter, 2001) was compiled by a team at Northern Arizona University and involved four universities (and is owned by Educational Testing Service). In our situation the most expedient solution seemed to be two universities close to one another.1

From a theoretical and descriptive viewpoint, delimiting the social context of a language variety is useful in controlling some of the variables that may get out of hand in a highly complex context like ELF. Also, it allows for a focus on the behaviour of a discourse community as a discourse community. For example, the characteristic unfolding of academic genres in chainlike formations, such as lecture courses or lectures followed up by seminars and student presentations (see Mauranen, 2001), allows the detection of self-regulatory mechanisms and the contrasting of those mechanisms with one-off recordings of individual groups. The latter have so far constituted practically all the data in the field of ELF, as they have among academic speech corpora.

Spoken rather than written language was chosen for two reasons.
First, speaking has been barely studied so far: both existing databases comprising academic speech were finished in the late 1990s (see Biber et al., 2001; Simpson et al., 1999), and although they are rendering exciting results as researchers begin to work with them, they are for the most part limited to L1 speakers. Although MICASE includes L2 speakers, the context is one where L1 English is the norm. Second, the written variety of academic discourse has a degree of consistency achieved through the normal editing cycles, although culture-specific rhetoric still surfaces (see, e.g., Mauranen, 1993). But in spoken interaction, the negotiation of meanings and deployment of interactive patterns are left to the interactants, without outside monitoring. The language must therefore adapt to the interactants' immediate communicative needs and resources, and the variety with its emerging regularities can be expected to differ widely from comparable written text.
1 MICASE has been particularly influential on account of my close contact and involvement with that project.


Because the social context is complex, the data to collect should be selected according to clear principles that reflect the social parameters involved. The uses of spoken English in a non-English-speaking university are virtually unexplored; written genres are much better covered (e.g., Ventola & Mauranen, 1990), and the preliminary sampling frame will undoubtedly be modified in the light of accumulating data. The data gathering needs to be informed by the local knowledge (Geertz, 1983) of the participants, established through interviews and background material distributed by the communities themselves.

SAMPLING FRAME FOR THE ELFA CORPUS


As a general principle, all data in the corpus are to be authentic in the sense that they are not elicited for research purposes but occur naturally. The corpus is to consist of complete speech events (i.e., complete individual sessions, not truncated), and their relation to other speech events is described on parameters such as participants' familiarity with each other. Compilation criteria are external; that is, they are not determined on the basis of linguistic register features but rather by socially based definitions of the prominent genres of the discourse community. Speech events with NSs of English are excluded as far as possible, but if present, they are coded. Sessions with speakers who all share an L1 are not included; neither are foreign language courses.

The basic unit of sampling is the speech event type (following MICASE), which is a looser term than genre and therefore perhaps more appropriate, as the discourses represent a variety of events, some (e.g., lectures) of which are much more firmly established as genres than others (e.g., workshops). Two fundamental selection criteria are genre/event type and discipline, whose primary categorization comes from the labels and descriptions given by the relevant discourse community (folk genres), for instance, lecture in political history or seminar in economics. Other external criteria involve institutional hierarchies and the ensuing systems of seniority, which affect the speakers' interpersonal relations: Peer sessions (student groups, conference presentations) and groups that are mixed with respect to academic status (lectures, seminars, generally sessions with teacher and students) are to be included in a natural balance, with emphasis on the asymmetrical event types because they dominate the discourse community. Because we cannot ascertain the actual balance a priori, a better understanding of the target discourse communities is likely to emerge as the data accumulate.
We observe the positioning of events in (intertextual) sequence: The chainlike nature of academic events, changing along familiarity parameters, leads to variety in interpersonal relations; thus, a new group or new situation presumably includes more initial face work than do seminars/lectures from later stages, when roles have been established and a comfortable degree of familiarity and formality has been negotiated. Because these situations are intercultural, the negotiation may be slower than in corresponding monocultural settings.

The main selection criteria for genres are all related to their perceived importance in one way or another:

1. prototypicality: the extent to which genres are shared and named by most disciplines, for example, lectures, seminars, thesis defences, and conference presentations
2. influence: genres that affect a large number of participants (or are widely consumed), for example, introductory lecture courses, examinations, and consultation hours
3. prestige: genres with high status in the discourse community, for example, guest lectures, plenary conference presentations, and opening/closing speeches

A final criterion rests on a different basis: selecting a particular genre for current theoretical discussion in the field. For example, dialogic events have been central in the incipient discussion on ELF whereas the lecture and the seminar have attracted most attention in studies of academic speaking. Conference papers might be seen as the nearest equivalent to the research article, which has occupied centre stage in research among written academic discourses.

The selection of disciplines can conveniently be considered at three levels: (a) disciplinary domain (e.g., social science, technology, biological sciences, arts and humanities), (b) discipline (e.g., sociology, political science, history, electronic engineering, literature, linguistics), and (c) subdiscipline (e.g., Finnish history, organic chemistry, Russian literature, experimental psychology). As with MICASE, a broad division is relevant to most research questions that are likely to arise in the research, and given the relatively small size of the present corpus, the highest level, disciplinary domain, is the realistic choice.
Moreover, the disciplinary selection and balance are unique to the two sample universities, and because we are dealing with English-medium events only, the choices are narrowed further.

Some language-internal categories also need to be taken into account in the sampling frame. A basic one is the distinction between monologic and dialogic speech, which can be made on the basis of the number of active participants (i.e., one vs. more than one). Both types need to be included, with an emphasis on dialogic events. Scripted speech (reading aloud) needs to be kept separate from unscripted (without notes) and semiscripted (with lecture notes). Finally, some facts, such as age, gender, nationality or cultural identity, and mother tongue, are worth coding if available but are not included in the sampling frame.
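The exclusion rules and coding categories described in this section can be pictured as a small metadata record per speech event. The following Python sketch is purely illustrative: the class names, field names, and the `eligible` and `ns_present` helpers are assumptions made for this example and do not describe the project's actual coding scheme or software.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Speaker:
    l1: str                        # mother tongue, coded if available
    native_english: bool = False   # NS presence is coded, not hidden


@dataclass
class SpeechEvent:
    event_type: str                # folk genre label, e.g. "lecture", "seminar"
    domain: str                    # disciplinary domain, e.g. "social science"
    speakers: List[Speaker]
    mode: str = "dialogic"         # "monologic" or "dialogic"
    scripting: str = "unscripted"  # "scripted", "semiscripted", "unscripted"
    complete: bool = True          # complete session, not truncated
    elicited: bool = False         # elicited for research purposes
    foreign_language_course: bool = False


def eligible(event: SpeechEvent) -> bool:
    """Apply the exclusion rules stated in the sampling frame."""
    if event.elicited or not event.complete:
        return False
    if event.foreign_language_course:
        return False
    # Sessions in which all speakers share one L1 are not included.
    if len({s.l1 for s in event.speakers}) < 2:
        return False
    return True


def ns_present(event: SpeechEvent) -> bool:
    """Flag events with native speakers of English so they can be coded."""
    return any(s.native_english for s in event.speakers)


# Hypothetical events, invented for illustration only.
mixed = SpeechEvent("seminar", "social science",
                    [Speaker("Finnish"), Speaker("German")])
shared_l1 = SpeechEvent("seminar", "technology",
                        [Speaker("Finnish"), Speaker("Finnish")])
print(eligible(mixed), eligible(shared_l1))
```

Keeping NS presence as a coded flag rather than a hard exclusion mirrors the statement that such events are excluded as far as possible but coded when they occur.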

USING THE ELFA CORPUS


Corpus research has brought important insights into linguistics (see, e.g., Biber et al., 1998; McEnery & Wilson, 1996; Sinclair, 1991). It has allowed scholars to see patterning that differs from traditional models and has shown the limitations of NS intuitions, among other things. It has provided insights into what is typical and frequent in language, and although its main strength is not in uncovering the limits of what is possible in a language, it sometimes does that as well, showing that NSs often say things they do not think they say. Spoken corpus data have been rarer than written, in part owing to the difficulties in obtaining and transcribing them, but they have been analysed in L1 English (e.g., Aijmer, 1996; Biber, Johansson, Leech, Conrad, & Finegan, 1999).

With corpus analysis, repeated patterning, especially unsuspected patterning, is much easier to detect than it is with qualitative analysis alone, which tends to focus on structuring in unfolding discourse. In analysing ELFA data one must expect surprises in recurrent forms; therefore, concordancing will be a helpful research tool (for further explanation of concordancing, see, e.g., O'Keeffe & Farr, this issue). On the other hand, many phenomena, often of a pragmatic nature, such as misunderstanding or expressing criticism, disagreement, or evaluation, are very hard to detect through concordancing. Detecting them requires looking into transcripts of target conversations, deriving hypotheses, and testing the hypotheses on the corpus (see, e.g., Mauranen, 2001, 2002, in press). For example, misunderstanding is quite commonly signaled by repeating the lexical item whose meaning or relevance is not captured in the conversation (Mauranen & Lappalainen, 2002). Concordancing could not capture such repetition, which, moreover, can involve lexical items that are quite unpredictable as potential objects of misunderstanding.
A speech corpus lends itself to such an approach easily because the transcripts will be available for qualitative analysis, together with the soundtrack as the need arises. In the example below, the relevance, not the semantics, of the United States is the problem:
S1: . . . democracies at the time of peace are hardly motivated to spend money on the military do you think that this this applies to the United States for?
S2: United [States]
S1: [it's] a demo- democracy [S2: yeah] and still spends a lot of money on on the mil-
S2: yeah but i think that and this is a personal conception . . .
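The repetition signal in the exchange above can be turned into a rough transcript-level search. The sketch below is an assumption-laden illustration, not a method used in the project: it flags short turns that consist only of words from the immediately preceding turn by a different speaker. The function names and the `max_len` threshold are hypothetical, and any hits would still need to be read in context, since an echo can also signal agreement or hesitation.

```python
import re


def tokens(text):
    """Lowercase word tokens; brackets and punctuation are dropped."""
    return re.findall(r"[a-z']+", text.lower())


def echo_turns(turns, max_len=4):
    """Return (index, speaker, words) for short turns that only repeat
    words from the immediately preceding turn by a different speaker."""
    hits = []
    for i in range(1, len(turns)):
        prev_speaker, prev_text = turns[i - 1]
        speaker, text = turns[i]
        words = tokens(text)
        if speaker == prev_speaker or not words or len(words) > max_len:
            continue
        if set(words) <= set(tokens(prev_text)):
            hits.append((i, speaker, words))
    return hits


# Turns taken from the example above; the embedded "[S2: yeah]"
# backchannel is left out for simplicity.
turns = [
    ("S1", "do you think that this this applies to the United States for?"),
    ("S2", "United [States]"),
    ("S1", "[it's] a demo- democracy and still spends a lot of money on on the mil-"),
    ("S2", "yeah but i think that and this is a personal conception"),
]
print(echo_turns(turns))  # [(1, 'S2', ['united', 'states'])]
```

Because the search compares adjacent turns rather than looking for a fixed word, it does not need to know in advance which lexical item will be repeated.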

The soundtrack is useful in explaining, for example, why importing cars was misheard as important class. For concordancing written or transcribed texts, a number of good, standard tools exist, and the same is true for common corpus statistics.
But handling spoken data in a satisfactory manner will require developments in the field. With lingua franca data, the need to include the soundtrack in the analysis is even more acute than with standard language material because the data are more likely to include unpredictable forms. If the transcription is normalized, it enables pattern searches, but this in turn requires that the soundtrack be available to balance the normalization. Either mispronunciations (e.g., separitism, constract) will be lost from searches if simply approximated or, if they are normalized (separatism, construct), the range of unorthodox pronunciations that seems to be accepted in interaction will be lost. The same is true of nonnativelike blends or neologisms of grammatical or discourse items (e.g., inspite that). Platforms that allow transcript and soundtrack concordancing simultaneously are being developed and undoubtedly will soon be available for research.

The data thus permit different research methods. The main ones used in the current project will be corpus analysis and discourse analysis, supplemented with interviews and nonparticipant observation. The diversity of approaches will allow us to test hypotheses and the validity of earlier findings on comparable data: not only NS academic speech but also data based on NS-NNS interaction and smaller scale ELF studies.

Preliminary findings suggest that ELFA has its own profile but resembles NS academic speaking in some respects. For instance, metadiscourse seems similarly linked to hedges, as in perhaps we could eh look an extract now (cf. Mauranen, 2001). On the other hand, the corpus is like other ELF discourse in that it contains a very large proportion of self-repairs. As in NS-NNS interaction, interactive repairs appear prominent, but in contrast to it, they tend not to repair grammar (Mauranen & Lappalainen, 2002).
The use of metadiscourse, hedging, and vagueness, and the construction of interpersonal relations in general, are among the discourse phenomena that will be interesting to compare with other findings; the amount of formulaic language use likewise will undoubtedly throw new light on the current discussion of its role in language processing (see, e.g., Ellis, 2002).
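The tension between normalized and unnormalized transcription noted in this section could be eased by storing both forms of each word, so that pattern searches run on the normalized layer while the transcribed form remains retrievable. A minimal sketch, assuming a hypothetical two-field token record (this is an illustration, not the ELFA transcription format):

```python
from typing import NamedTuple

class Token(NamedTuple):
    """One transcribed word together with its searchable normalized form."""
    original: str    # form as transcribed, e.g. "separitism"
    normalized: str  # standard spelling used for pattern searches

def search(tokens, word):
    """Return the transcribed forms of every token whose normalized form matches."""
    return [t.original for t in tokens if t.normalized == word.lower()]

utterance = [
    Token("separitism", "separatism"),
    Token("and", "and"),
    Token("constract", "construct"),
    Token("separatism", "separatism"),
]

variants = search(utterance, "separatism")
```

Here a single query on the normalized form returns both the orthodox and the unorthodox variant, which is the range that a purely normalized transcript would hide.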

PROSPECTIVE OUTCOMES
By launching another corpus project, we hope to be filling a gap in the current selection of English corpora. This corpus fits in well with the existing U.S. corpora of academic English and with the other ELF corpus that is currently being compiled (see the Vienna-Oxford International Corpus of English, e.g., in Seidlhofer, 2001). The two ELF corpora will complement each other while leaving room for corpora covering other domains of global English. As pointed out above, discovering regularities (and irregularities, for that matter) in lingua franca English invites
international cooperation. The range of variation is likely to be enormous although hardly infinite, and it is through research in specific contexts that we can hope to capture a general picture of the phenomenon and its development. It has been postulated that strong centripetal forces are operating along with centrifugal ones in the development of World English and that the centripetal forces arise as a function of English as a vehicle of international communication (Brutt-Griffler, 2002, pp. 177–178). The accumulating evidence ought to give an indication of the direction that these developments take. Many open questions remain in this research area, and few answers have been obtained yet. As sketched earlier in this article, some issues of theoretical interest relate to the patterns of discourse marking, formulaic expressions, simplification, and universally unmarked linguistic features. The descriptive findings contribute to the establishment of international standards for expert users of English, and the practical applications will operate on the basis of these standards. The main benefits should accrue to teaching English for international purposes, but contributions to other global uses of English in related fields are also likely, the expanding field of technical writing being a case in point.
THE AUTHOR
Anna Mauranen is professor of English philology at the University of Tampere. Her main research is in corpus linguistics, text, and discourse analysis. She has compiled a corpus of translated Finnish and been involved in a contrastive Finnish-English corpus and the Michigan Corpus of Academic Spoken English. Her publications include corpus linguistics and text linguistic work (e.g., Cultural Differences in Academic Rhetoric, Peter Lang, 1993).

REFERENCES
Aijmer, K. (1996). Conversational routines in spoken discourse. London: Longman.
Altenberg, B., & Granger, S. (2002). The grammatical and lexical patterning of make in native and non-native student writing. Applied Linguistics, 22, 173–189.
Aston, G., & Burnard, L. (1998). The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh, Scotland: Edinburgh University Press.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. London: Longman.
Biber, D., Reppen, R., Clark, V., & Walter, J. (2001). Representing spoken language in university settings: The design and construction of the spoken component of the T2K-SWAL Corpus. In R. C. Simpson & J. M. Swales (Eds.), Corpus linguistics in North America (pp. 48–57). Ann Arbor: University of Michigan Press.
Bourdieu, P., & Passeron, J.-C. (1977). Reproduction in education, society and culture. London: Sage.
Brazil, D. (1985). The communicative value of intonation. Birmingham, England: University of Birmingham, English Language Research.
Brazil, D. (1995). A grammar of speech. Oxford: Oxford University Press.
Brutt-Griffler, J. (2002). World English: A study of its development. Clevedon, England: Multilingual Matters.
Burdine, S. (2001). The lexical phrase as pedagogical tool: Teaching disagreement strategies in ESL. In R. C. Simpson & J. M. Swales (Eds.), Corpus linguistics in North America (pp. 195–210). Ann Arbor: University of Michigan Press.
Carbaugh, D. (Ed.). (1990). Cultural communication and intercultural contact. Mahwah, NJ: Erlbaum.
Crystal, D. (1997). English as a global language. Cambridge: Cambridge University Press.
Ellis, N. C. (1996). Sequencing in SLA: Phonological memory, chunking, and points of order. Studies in Second Language Acquisition, 18, 91–126.
Ellis, N. C. (2002). Frequency effects in language processing. Studies in Second Language Acquisition, 24, 143–188.
Geertz, C. (1983). Local knowledge: Further essays in interpretive anthropology. New York: Basic Books.
Granger, S. (Ed.). (1998). Learner English on computer. London: Longman.
Hyland, K. (2002). Specificity revisited: How far should we go now? English for Specific Purposes, 21, 385–392.
Jenkins, J. (2000). The phonology of English as an international language. Oxford: Oxford University Press.
Kachru, B. (1985). Standards, codification, and sociolinguistic realism: The English language in the outer circle. In R. Quirk & H. Widdowson (Eds.), English in the world: Teaching and learning the language and literatures (pp. 11–30). Cambridge: Cambridge University Press.
Kachru, B. (1996). English as lingua franca. In H. Goebl, P. Nelde, Z. Starý, & W. Wölck (Eds.), Kontaktlinguistik/Contact linguistics/Linguistique de contact (Vol. 1, pp. 906–913). Berlin: de Gruyter.
Killingsworth, M. J. (1992). Discourse communities local and global. Rhetoric Review, 11, 110–122.
Knapp, K. (1987). English as an international lingua franca and the teaching of intercultural communication. In W. Lörscher & R. Schulze (Eds.), Perspectives on language in performance (pp. 1022–1039). Tübingen, Germany: Narr.
Knapp, K. (2002, December). Self-regulation and standards in English as a lingua franca: Effects of types of speech communities. Paper presented at the 13th World Congress of Applied Linguistics, Singapore.
Knapp, K., & Meierkord, C. (Eds.). (2002). Lingua franca communication. Frankfurt, Germany: Peter Lang.
Lawson, A. (2001). Rethinking French grammar for pedagogy: The contribution of spoken corpora. In R. C. Simpson & J. M. Swales (Eds.), Corpus linguistics in North America (pp. 179–194). Ann Arbor: University of Michigan Press.
Mauranen, A. (1993). Cultural differences in academic rhetoric. Frankfurt, Germany: Peter Lang.
Mauranen, A. (2001). Reflexive academic talk: Observations from MICASE. In R. C. Simpson & J. M. Swales (Eds.), Corpus linguistics in North America (pp. 165–178). Ann Arbor: University of Michigan Press.
Mauranen, A. (2002). A good question: Expressing evaluation in academic speech. In G. Cortese & P. Riley (Eds.), Domain-specific English: Textual practices across communities and classrooms (pp. 115–140). Frankfurt, Germany: Peter Lang.
Mauranen, A. (in press). They're a little bit different: Variation in hedging in academic speech. In K. Aijmer & A.-B. Stenström (Eds.), Discourse patterns in spoken and written corpora. Amsterdam: Benjamins.
Mauranen, A., & Lappalainen, S. (2002, December). Signalling misunderstanding in ELF communication. Paper presented at the 13th World Congress of Applied Linguistics, Singapore.
McArthur, T. (2001). World English and world Englishes: Trends, tensions, varieties, and standards. Language Teaching, 34, 1–20.
McCarthy, M. (1998). Spoken language and applied linguistics. Cambridge: Cambridge University Press.
McCarthy, M. (2001). Issues in applied linguistics. Cambridge: Cambridge University Press.
McEnery, T., & Wilson, A. (1996). Corpus linguistics. Edinburgh, Scotland: Edinburgh University Press.
Meierkord, C. (1998, July). Lingua franca English: Characteristics of successful non-native-/non-native-speaker discourse. Erfurt Electronic Studies in English. Retrieved May 28, 2003, from http://webdoc.sub.gwdg.de/edoc/ia/eese/eese.html
Meierkord, C. (2000, January). Interpreting successful lingua franca interaction: An analysis of non-native/non-native small talk conversations in English. Linguistik Online, 5. Retrieved May 28, 2003, from http://www.linguistik-online.com/1_00/index.html
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. Oxford: Oxford University Press.
Ogden, C. K. (1930). Basic English: A general introduction with rules and grammar. London: Kegan Paul.
Rampton, B. (1990). Displacing the native speaker: Expertise, affiliation and inheritance. ELT Journal, 44, 97–101.
Scollon, R., & Scollon, S. (1995). Intercultural communication. Oxford: Blackwell.
Seidlhofer, B. (2000). Mind the gap: English as a mother tongue vs. English as a lingua franca. Vienna English Working Papers, 9, 51–68.
Seidlhofer, B. (2001). Closing a conceptual gap: The case for a description of English as a lingua franca. International Journal of Applied Linguistics, 11, 133–158.
Seidlhofer, B. (2002). Basic questions. In K. Knapp & C. Meierkord (Eds.), Lingua franca communication (pp. 269–302). Frankfurt, Germany: Peter Lang.
Simpson, R. C., Briggs, S. L., Ovens, J., & Swales, J. M. (1999). The Michigan Corpus of Academic Spoken English. Ann Arbor: The Regents of the University of Michigan.
Sinclair, J. M. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Swales, J. M. (1990). Genre analysis: English in academic and research settings. Cambridge: Cambridge University Press.
Swales, J. (1998). Other floors, other voices: A textography of a small university building. Mahwah, NJ: Erlbaum.
Thomason, S. G. (2001). Language contact. Edinburgh, Scotland: Edinburgh University Press.
Ventola, E., & Mauranen, A. (1990). Tutkijat ja englanniksi kirjoittaminen [Researchers and writing in English]. Helsinki, Finland: Helsinki University Press.
Widdowson, H. (1994). The ownership of English. TESOL Quarterly, 28, 377–389.
Wray, A. (1999). Formulaic language in learners and native speakers. Language Teaching, 32, 213–231.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.

Designing a Corpus for Translation and Language Teaching: The CEXI Experience
SILVIA BERNARDINI
University of Bologna at Forlì
Forlì, Italy
Building resources for the education of future language mediators is a widely felt need in Europe today. Translation and interpreting schools have experienced a boom in the past two decades (Caminade & Pym, 1995), with the consequent need to adapt language teaching practices and develop translation teaching approaches tuned to these specializations. A number of researchers working in this setting (see, e.g., the collection edited by Aston, 2001) have suggested that an inductive, data-driven (Johns, 1991) approach to learning with the aid of corpora may be effective, motivating, and well tuned to such settings. They have also, however, warned against the risk of adapting pedagogy to technology rather than technology to pedagogy, largely sharing Widdowson's (1984) view that it is the responsibility of applied linguists to consider the criteria for an educationally relevant approach to language and to avoid the uncritical assumption that applied linguistics must necessarily be the application of linguistics (p. 19), even corpus linguistics. Recently, multilingual corpora have started to become more widely available, complementing monolingual ones. More costly to build and arguably less relevant to the production of dictionaries and grammars,1 multilingual corpora tend to be smaller than monolingual ones and more difficult to access, both technically (e.g., in the case of parallel, aligned texts, in which each sentence in the original has to be made accessible in parallel with its translation) and legally (copyright problems are doubled). However, institutions in various parts of Europe (e.g., Finland, Germany, Norway, Portugal, and Sweden) have started to set up multilingual corpora for research purposes (to study translation practices and to do contrastive linguistics analysis) that can also be used for didactic ones. The approaches to building multilingual corpora are as varied as the corpus typologies and uses relevant to these fields.
This report describes the approach of the Corpus of English X Italian (CEXI) project (see Aston, Bernardini, & Zanettin, n.d.), whose aim is to provide a resource for the study and teaching of the English and Italian languages and of translation between them. In particular, it focuses on issues of corpus typology and design, relating these to intended uses and to current views of language and translation pedagogy.

1 Multilingual resources may in fact be extremely useful in building bilingual dictionaries and contrastive grammars.

CORPUS DESIGN AND CONSTRUCTION

Overview


CEXI is a small-scale project of several faculty members in the School for Interpreters and Translators at the University of Bologna at Forlì. Limited funding imposes severe limits on the size of the corpus. However, the small size has some advantages: Other academic institutions could replicate the corpus for different languages or text types or with different priorities in mind. More importantly, we have had to define the structure of the corpus in modular terms, with a carefully constructed core that future projects carried out by students as well as researchers can develop and extend. As discussed below, corpus construction and corpus comparison provide opportunities for group work in language and translation classes. Class activities include communicatively oriented and awareness-raising tasks that combine a focus on form and a focus on meaning (as recommended, e.g., by Skehan, 1998), complementing more traditional learning and teaching approaches.

Macrostructure
The X in the name CEXI represents the structure of this bidirectional, parallel corpus, which contains English and Italian texts. The corpus is bidirectional in the sense that it includes both Italian and English originals chosen according to comparability criteria such as subject, text type, and time of publication. These texts are matched with their translations in the other language, resulting in four components: (a) Italian originals, (b) their translations into English, (c) English originals, and (d) their translations into Italian. These components can be combined in different ways, as shown in Figure 1. For instance, to carry out a contrastive pragmatics study of the use of discourse markers in English and Italian, one would use the comparable corpora of English and Italian originals. On the other hand, the parallel corpora (original English → Italian translations and original Italian → English translations) allow research on translation strategies in the two directions (e.g., to find out whether translations tend to be more linguistically conservative than their source texts). Last, the translated comparable corpus (English and Italian translations) allows observation of translation norms irrespective of the direction of translation (e.g., to find out whether language produced under the constraints of translation is more or less explicit than language produced from scratch, i.e., for which no original text exists).
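The way the four components recombine into comparable, parallel, and monolingual views can be sketched in a few lines of Python. The dictionary keys, labels, and the select function are hypothetical illustrations and do not describe the actual CEXI software.

```python
# The four CEXI components, keyed by (language, status).
COMPONENTS = {
    ("it", "original"): "Italian originals",
    ("en", "translation"): "English translations",
    ("en", "original"): "English originals",
    ("it", "translation"): "Italian translations",
}

def select(language=None, status=None):
    """Return the labels of components matching a language and/or a status."""
    return sorted(
        label
        for (lang, stat), label in COMPONENTS.items()
        if language in (None, lang) and status in (None, stat)
    )

# Comparable corpus of originals, e.g. for contrastive pragmatics.
comparable_originals = select(status="original")
# Comparable corpus of translations, e.g. for translation norms.
comparable_translations = select(status="translation")
# Monolingual view: originals and translations in the same language.
monolingual_english = select(language="en")
# A parallel corpus pairs originals with their translations in the other language.
parallel_en_it = [COMPONENTS[("en", "original")], COMPONENTS[("it", "translation")]]
```

Keying each component by language and status keeps every combination mentioned in the text one filter away, which is the practical point of the X-shaped design.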
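FIGURE 1
Subcorpora of CEXI
[Figure not reproduced: a diagram of the four subcorpora, labeled Original English, Italian translations, Original Italian, and English translations.]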

Microstructure
CEXI is very small by current standards. The core corpus will be approximately 1 million words distributed across four main subcorpora (Table 1), each of which contains 20 samples of 10,000–15,000 words taken from 10 works each of contemporary fiction and nonfiction published in volume form. Each subcorpus thus contains both fiction and nonfiction texts (Table 2).
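The stated size follows directly from the sampling figures. The short check below works through the arithmetic; the variable names are illustrative, and the midpoint is only an approximation of the target size, not a reported count.

```python
SUBCORPORA = 4               # original and translated, English and Italian
SAMPLES_PER_SUBCORPUS = 20   # 10 fiction works + 10 nonfiction works
MIN_WORDS, MAX_WORDS = 10_000, 15_000  # words per sample

low = SUBCORPORA * SAMPLES_PER_SUBCORPUS * MIN_WORDS    # smallest possible corpus
high = SUBCORPORA * SAMPLES_PER_SUBCORPUS * MAX_WORDS   # largest possible corpus
midpoint = (low + high) // 2                            # roughly the 1 million target
per_subcorpus = midpoint // SUBCORPORA                  # words per subcorpus
per_text_type = per_subcorpus // 2                      # fiction or nonfiction share
```

With samples averaging 12,500 words, each subcorpus comes to about 250,000 words, split evenly between fiction and nonfiction, which matches the figures given in Tables 1 and 2.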
TABLE 1
Components of Main Subcorpora (Number of Words)

Component | Original English | Original Italian | Translated English | Translated Italian
Bilingual parallel
Originals in Italian and translations in English | - | 250,000 | Translation of 250,000 | -
Originals in English and translations in Italian | 250,000 | - | - | Translation of 250,000
Bilingual comparable
Originals in English and Italian | 250,000 | 250,000 | - | -
Comparisons of translations in English and Italian | - | - | Translation of 250,000 | Translation of 250,000
Monolingual comparable
Originals and translations in English | 250,000 | - | Translation of 250,000 | -
Originals and translations in Italian | - | 250,000 | - | Translation of 250,000

TABLE 2
Fiction and Nonfiction in Main Subcorpora (Number of Words)

Subcorpus | Fiction | Nonfiction
Original Italian | 125,000 | 125,000
Translated Italian | Translation of 125,000 | Translation of 125,000
Original English | 125,000 | 125,000
Translated English | Translation of 125,000 | Translation of 125,000

Sampling was restricted to book translations for three reasons. First, translation for the publishing industry has a somewhat prototypical status with respect to other markets and arguably represents the qualitatively highest peak of translation ability. Second, an independent sampling frame (Index Translationum on CD-ROM; UNESCO, 1998) was available to provide information about books translated in both directions. Similar data for other text types are extremely difficult to find. Last, the size of the corpus advised against the inclusion of widely different text types. We also excluded from the corpus texts from the Index that were judged significantly different from the prototype (adult fictional and informative prose). Thus children's literature, school textbooks, poetry, and drama were excluded. Clearly, these decisions will have to undergo empirical verification at some stage.

The time frame for texts included in the corpus is from 1965 to the present day. The arbitrariness of this choice was mitigated by considerations of a work's potential shelf life. For books that are still translated today, the time frame criterion was relaxed to 1945. All the works included in CEXI are therefore still in copyright: We wanted to give learners and researchers access to contemporary written language in the sort of texts that are not easily collectable off the World Wide Web.

Geographic variables were ignored at this stage; that is, we did not attempt to factor in the authors'/translators' or original publications'/translations' country of origin, given the difficulty of identifying an uncontroversial criterion for assigning a text to a certain variety of English. Future extensions may address this aspect, making possible, for example, contrastive studies of American and Australian English.

Format
CEXI is annotated according to a standard scheme, following the recommendations of the Text Encoding Initiative (Sperberg-McQueen & Burnard, 2002), in order to limit information loss, favor replicability of analyses and results, and ease the retrieval and interpretability of data
according to external criteria. The semiautomatic annotation procedures aim to
• preserve surface features of texts that would otherwise be lost in transfer to electronic support (e.g., paragraph and sentence boundaries)
• interpret hierarchical divisions (e.g., chapters, epigraphs, sections) and salient pieces of text (e.g., names, quotations, lines of poetry, typographical errors)
• add extratextual information about the original book (e.g., author, translator, date and place of publication) and the electronic edition (e.g., changes made to the text)

For the moment, the corpus is not morphologically analyzed, but we do not rule out the possibility of doing such an analysis in the future.

USE OF CEXI: LEARNING ABOUT AND LEARNING TO


As the corpus is still under construction at this writing (it is expected to be available for research purposes by the end of 2003), examples of its use are not yet available. The following hypothetical illustrations present applications that are relevant to many teaching settings.

Culture
Through their texts, corpora are rich in sociocultural insights about a language (community). Nearly half a century ago, Firth (1956/1968) suggested that much could be learned about a society's values from an analysis of the collocational patterns of such words as labor and leisure, a suggestion followed, most notably, by Stubbs (e.g., 1996, 2001). Though arguably secondary with respect to corpus work focusing on language use, this type of analysis may be very motivating for learners and provide interesting points of departure for communicative activities such as reports and discussions. All corpora provide the possibility of investigating culturally loaded words in some way, but here I consider two types of analyses possible with CEXI that not all corpora allow. All proper nouns in CEXI have been tagged as person, place, institution, or other. Users can thus retrieve these nouns and analyze their collocates and connotations. Similarly, all foreign words and expressions are signaled by a tag specifying the language they come from and their printed rendition (e.g., in italics or inverted commas). An analysis of Italian expressions in English fiction may, for instance, stimulate discussion about the perception English writers have of Italian culture (as signaled, perhaps, by references to
food, wine, and arts). A search in the opposite direction would then be in order, undoubtedly highlighting similarities and differences. Learning about a foreign culture, and in the process learning about one's own culture from a different perspective, is certainly one promising use for a comparable corpus. However, CEXI is also a parallel corpus, containing samples of originals and their translations. As such, it may offer a vantage point from which to observe and reflect on professional translation strategies, an important aspect of translator education from the point of view of sociocognitive apprenticeship (Kiraly, 2000) into a professional community. The treatment of culturally relevant expressions is one problem area in which parallel corpora may help raise the learners' awareness of possible mediation strategies. A learner who searched for the translations of institution names in CEXI, for instance, would find the following original English text and Italian translation:
You know Francesca leaves for school in England in a month. Can you believe it? She's almost sixteen. Where's she going? Cheltenham, of course, says Alex. (Leavitt, 1990a, p. 128)

Sai che Francesca partirà tra un mese per andare a scuola in Inghilterra? Riesci a crederci? Ha quasi sedici anni. Dove va? A Harrow, naturalmente, come suo papà dice Alex. (Leavitt, 1990b, p. 161)

Here the translator has made two important alterations: She has changed the name of the school Francesca is about to go to from Cheltenham, a prestigious British public school for girls, to Harrow, an equally prestigious school for boys, and has added come suo papà ("just like her dad"). There is no doubt that these choices are deliberate. Why, one might wonder, has the translator decided to distance her translation from its source text, inserting what is ultimately a referential mistake? A possible answer is that the translator thinks that her public is not likely to be acquainted with Cheltenham and that what matters is not the actual location but what it embodies: a place for the wealthy and the beautiful, the symbol of all the places Celia, the protagonist of this story, felt excluded from as a child. By making Francesca's father say proudly "just like her dad," the translator achieves two goals: First, she provides a framework for interpreting Harrow in case the reader lacks an appropriate cultural frame (i.e., that it has traditionally been a school for generation after generation of wealthy people); second, she clearly, though indirectly, reminds the reader of Celia's father, who died when she was a child, leaving her and her mother behind to live an unglamorous life.
The observation and discussion of strategies of this kind may prove valuable not only for translators in the narrow sense but arguably for any learner of a foreign language who is preparing to act as a mediator between two cultures.
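The tag-based searches described in this section (proper nouns by type, foreign expressions by language and rendition) might look roughly as follows in Python. This is a sketch under stated assumptions: the element and attribute names (name, foreign, type, lang, rend, s, p) are loosely modeled on TEI conventions, the sample sentences are invented, and the actual CEXI markup may differ.

```python
import xml.etree.ElementTree as ET

# Invented sample; the markup is an assumption, not the CEXI encoding.
SAMPLE = """
<text>
  <p>
    <s><name type="person">Francesca</name> leaves for
       <name type="institution">Cheltenham</name> in a month.</s>
    <s>They ordered <foreign lang="it" rend="italic">risotto</foreign>
       in <name type="place">London</name>.</s>
  </p>
</text>
"""

def names_by_type(root, name_type):
    """Return the text of every <name> element of the given type."""
    return [el.text for el in root.iter("name") if el.get("type") == name_type]

def foreign_words(root, lang):
    """Return (expression, printed rendition) pairs for one source language."""
    return [(el.text, el.get("rend")) for el in root.iter("foreign") if el.get("lang") == lang]

root = ET.fromstring(SAMPLE)
institutions = names_by_type(root, "institution")
italian = foreign_words(root, "it")
sentence_count = len(list(root.iter("s")))
```

Once institution names have been retrieved in this way, their renderings in the aligned translation can be inspected by hand, which is how an alteration such as Cheltenham to Harrow would come to a learner's attention.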

Discourse
Corpora may be, and indeed often have been, used to investigate the mechanisms that writers and speakers use for structuring discourse (see, e.g., Stubbs, 2001, on monolingual material; see Hasselgård, Johansson, & Hansen, 1999, on multilingual material). The teaching and learning of writing/reading and speaking/listening skills can also gain from the availability of corpora in the classroom. As in the case of cultural key words, CEXI may be searched for known discourse-relevant key words or expressions (e.g., modals, conjunctions, focusing devices of various kinds). These may then be concordanced in context and their equivalents found in the parallel texts. The comparable corpus of original texts can be queried further to make sure the translated texts conform to typical use in nontranslation settings. Or, instead of starting from a given expression, the learner can look for a given position in texts: Thanks to the annotations, corpus users can focus on salient bits of texts such as headings, the opening sentences of paragraphs or chapters in informative texts, or reporting clauses in narrative. The perspective, as always, may be monolingual, contrastive, translational, or any combination. What appears most promising about this kind of work is the possibility of combining the acquisition of competence about discourse strategies with the development of capacities for making discourse: Learners observe different aspects of discourse and then have a chance to try out the observed strategies when reporting and discussing their insights from corpora.
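Concordancing a key word in context and displaying its equivalent in the parallel text can be sketched as follows. The example assumes that the corpus is available as a list of sentence-aligned (source, translation) pairs; the function name and the two sentence pairs are invented for illustration.

```python
import re

def parallel_kwic(aligned_pairs, keyword, width=30):
    """Return (left context, hit, right context, aligned translation)
    for every whole-word occurrence of keyword on the source side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for source, target in aligned_pairs:
        for match in pattern.finditer(source):
            left = source[max(0, match.start() - width):match.start()]
            right = source[match.end():match.end() + width]
            hits.append((left, match.group(), right, target))
    return hits

# Invented sentence pairs standing in for aligned corpus text.
pairs = [
    ("However, the results were unclear.", "Tuttavia, i risultati non erano chiari."),
    ("She may, however, return.", "Potrebbe, però, tornare."),
]
hits = parallel_kwic(pairs, "however")
```

Showing the whole aligned sentence rather than a single word leaves it to the learner to locate the equivalent, which suits the exploratory use described above; here the two hits point to two different Italian renderings of the same connective.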

Language
The main use of corpora in language research so far has probably been the identification and description of lexical patterns in texts, from the rigid (like idioms; Moon, 1998) to the flexible and elusive (collocations, colligations, semantic prosodies, and preferences; Sinclair, 1996). This work has provided evidence supporting the view that absolutely fixed phrases are in fact very rare whereas there are very many recurring semantic patterns which have expected lexical realizations, but which can be highly variable (Stubbs, 2001, p. 243). These lexicalized phrases (or units of meaning, to use Sinclair's 1996 term) have long been hypothesized to have psychological reality (e.g., by Bolinger, 1976; see Nattinger & DeCarrico, 1992, for details): As individuals acquire a language, they memorize its typical prefabricated combinations as unanalyzed chunks together with their associated function in context. These chunks are subsequently analyzed and generalized into syntactic rules, and words are abstracted from their environment and become available as smaller units. However, the larger language chunk . . . continues to be available for ready access (Nattinger & DeCarrico, 1992, p. 12). The speaker's ability to continue to access these forms as preassembled chunks, ready for a given functional use in an appropriate context (p. 13) is an aspect of pragmatic competence and may account for observed nativelike fluency and nativelike selection of appropriate language in context (Pawley & Syder, 1983). If lexical phrases are indeed so central to nativelike command of a language, awareness of their existence and ability to use them appropriately in context should arguably be among the priorities of foreign language learning. CEXI allows learners not only to observe regularities and variation in the use of given words within lexical phrases but also to compare them in different ways, thus making them more salient. For instance, words like commit, cause, and somewhat are said to have negative semantic prosodies (i.e., a tendency to collocate with negatively connoted words). Learners may compare these words with their translations in the learners' native language so as to observe whether or not the same prosody applies in both languages. Concordances and frequency counts for somewhat in the imaginative and in the informative subcorpora may also be compared, leading to hypotheses on the specialization of certain words with respect to specific registers, (meta)genres, modes of expression, and so on. What are the typical collocates in the two subcorpora? Is negativity the only element they have in common? If somewhat is uncommon in fiction, how about other downtoners?
Innumerable questions can be asked of virtually any word or phrase, all potentially leading to further questions and further searches.
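A first pass at questions of this kind is a simple count of the words that occur near a node word. The sketch below is one possible way to do it, assuming plain running text as input; the function name, the window size, and the sample sentence are illustrative and not drawn from CEXI.

```python
import re
from collections import Counter

def collocates(text, node, window=3):
    """Count the words occurring within `window` tokens on either side of `node`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == node:
            neighbours = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample standing in for a fiction subcorpus.
fiction = "He was somewhat tired. The room felt somewhat gloomy and cold."
counts = collocates(fiction, "somewhat", window=1)
```

Running the same function over the imaginative and the informative subcorpora and comparing the most frequent collocates would give learners a concrete starting point for judging whether a negative prosody holds in both.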

Text Evaluation
A number of scholars (e.g., Maia, 1997; Varantola, in press) have pointed out that constructing as well as using corpora may be of pedagogic value. Corpus construction can be viewed as a communicatively oriented classroom activity in which learners are required to work in groups, designing their own corpus according to a set of goals, setting up criteria for inclusion and exclusion of given texts, and evaluating these texts according to the goals and criteria. Because corpus analysis is mainly about comparing and contrasting samples of use with a norm (Halliday, 1992, p. 68), activities of corpus construction would gain from the availability of different types of
reference corpora representing different norms (i.e., parallel and comparable corpora, large monolingual corpora, or even the Web). In this way, learners can not only observe aspects of language use in relatively specialized contexts (e.g., in language for specific purposes, under translational constraints) but also have an opportunity to make and test hypotheses as to the (in)applicability of their generalizations to a wider norm. The function of CEXI in this framework is to provide evidence of an intermediate norm, complementing the insights gained from the very specialized small corpora learners can build for themselves and from the general and very large ones available through international projects. We expect this process to be of mutual advantage to the parties involved, allowing the corpus to expand by incorporating relevant collections of texts as subcorpora and thus continuing to grow in size and variety after completion of the project.

CONCLUSION
This report has documented the ongoing construction of CEXI, a parallel, bidirectional corpus of English and Italian; its distinctive macrostructure; and some of its distinctive features (e.g., size, sampling frame, annotation). The description illustrates the didactic relevance of corpora in general and of this corpus in particular. Corpora, particularly if multilingual and appropriately annotated, can be used to enhance cultural, discoursal, linguistic, and text- and corpus-analytic competence and capacity through appropriate learning activities. The relevance of corpora to language learning is not limited to providing insights about how a given language is used; rather, their greatest potential lies in the multifarious opportunities for autonomous or group work of an inductive, exploratory nature that they offer to teachers and learners alike.
THE AUTHOR
Silvia Bernardini teaches computer-aided translation and English linguistics at the School for Interpreters and Translators of the University of Bologna. For the past 3 years she has been involved in the construction of CEXI.

REFERENCES
Aston, G. (Ed.). (2001). Learning with corpora. Houston, TX: Athelstan.
Aston, G., Bernardini, S., & Zanettin, F. (n.d.). CEXI: The Italian-English translational corpus. Retrieved June 14, 2003, from http://www.sitlec.unibo.it/cexi
Bolinger, D. (1976). Meaning and memory. Forum Linguisticum, 1, 1–10.
Caminade, M., & Pym, A. (1995). Les formations en traduction et interprétation: Essai de recensement mondial [Translator and interpreter training institutions: An attempt at a world survey]. Paris: Société Française des Traducteurs.
Firth, J. R. (1968). Descriptive linguistics and the study of English. In F. R. Palmer (Ed.), Selected papers of J. R. Firth 1952–1959 (pp. 96–113). London: Longman. (Original work published 1956)
Halliday, M. A. K. (1992). Language as system and language as instance: The corpus as a theoretical construct. In J. Svartvik (Ed.), Directions in corpus linguistics (pp. 30–43). Berlin: Mouton de Gruyter.
Hasselgård, H., Johansson, S., & Hansen, C. F. (Eds.). (1999). Information structure in parallel texts [Special issue]. Languages in Contrast, 2(1).
Johns, T. (1991). Should you be persuaded: Two samples of data driven learning materials. English Language Research Journal, 4, 1–16.
Kiraly, D. (2000). A social constructivist approach to translator education. Manchester, England: St. Jerome.
Leavitt, D. (1990a). A place I've never been. London: Penguin Books.
Leavitt, D. (1990b). Un luogo dove non sono mai stato [A place I've never been]. Milan: Mondadori.
Maia, B. (1997). Do-it-yourself corpora . . . with a little help from your friends. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), PALC '97: Practical applications in language corpora (pp. 403–410). Lodz, Poland: Lodz University Press.
Moon, R. (1998). Fixed expressions and idioms in English. Oxford: Oxford University Press.
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. Oxford: Oxford University Press.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and communication (pp. 191–226). London: Longman.
Sinclair, J. M. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, J. M. (1996). The search for units of meaning. Textus, 9, 75–106.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
Sperberg-McQueen, C. M., & Burnard, L. (Eds.). (2002). Text encoding initiative: The XML version of the TEI guidelines. Retrieved June 14, 2003, from http://www.tei-c.org/P4X
Stubbs, M. (1996). Text and corpus analysis. Oxford: Blackwell.
Stubbs, M. (2001). Words and phrases: Corpus studies of lexical semantics. Oxford: Blackwell.
UNESCO. (1998). Index translationum on CDROM (5th ed.). Paris: Author.
Varantola, K. (in press). Translators and disposable corpora. In F. Zanettin, S. Bernardini, & D. Stewart (Eds.), Corpora in translator education. Manchester, England: St. Jerome.
Widdowson, H. G. (1984). Explorations in applied linguistics 2. Oxford: Oxford University Press.

The International Corpus of Learner English: A New Resource for Foreign Language Learning and Teaching and Second Language Acquisition Research
SYLVIANE GRANGER
University of Louvain
Louvain-la-Neuve, Belgium
In the late 1950s, when corpus linguistics made its debut on the linguistic scene, it was a very modest enterprise in the hands of a small group of enthusiasts. Looking back on this period, Leech (1991), one of the pioneers of corpus linguistics, recalls that for years, corpus linguistics was the obsession of a small group which received little or no recognition from either linguistics or computer science (p. 25). Since that time, the group of enthusiasts has grown considerably and corpus linguistics has progressively infiltrated most, if not all, language-related disciplines. One of its major contributions has been in the field of variation studies. The diversification of corpora has given linguists a firm basis for comparing language varieties distinguished in terms of the medium (spoken vs. written), the field (general vs. specialized), and geographical status (World Englishes). For years, foreign/second language learner varieties remained conspicuously absent from corpus-based research. Only in the early 1990s did publishers and academics, concurrently but independently, start collecting and analyzing learner data. Two learner English corpora originated in that early period: the Longman Learners' Corpus (see Longman Corpus Network, 2003) and the International Corpus of Learner English (ICLE; see Granger, n.d.). As the latter is now being made available to the academic community (Granger, Dagneaux, & Meunier, 2002), it seems only fitting to describe the corpus in detail and, more importantly, to highlight the benefits it offers ESOL researchers and teachers.

DESIGN CRITERIA
A computer learner corpus (CLC) is an electronic collection of authentic texts produced by foreign or second language learners. Although all corpora need to be assembled according to explicit design criteria (Atkins & Clear, 1992), extra care has to be taken in collecting the data for learner corpora given the large number of variables affecting the learning/acquisition process. The ICLE is a very richly
documented corpus: More than 20 task and learner variables have been recorded for each of the texts in the corpus through a detailed profile questionnaire completed by all learners. As shown in Figure 1, some of these variables (medium, genre, average length, learner proficiency level) were used as corpus design criteria and are therefore shared by all texts in the corpus whereas others (gender, mother tongue background, essay topic) differ from text to text. All the variables have been stored in a database and can be used by researchers as queries to compile subcorpora that match certain criteria, thus allowing for interesting comparisons (e.g., female vs. male learners, German- vs. Spanish-speaking learners).

Learner Variables
The learners who have contributed data to the ICLE have a great deal in common. All are young adults (about 20 years old) who study English in a non-English-speaking country; that is, they are EFL rather than ESL learners. They are all university undergraduates specializing in English in their second, third, or fourth year, and their level can be roughly described as advanced, although individual learners and learner groups differ in proficiency. The corpus focuses on advanced interlanguage partly because of the wish to compensate for its relative neglect in comparison with lower proficiency levels, resulting in a dearth of pedagogical materials for the advanced learner.
FIGURE 1 ICLE Task and Learner Variables

International Corpus of Learner English

Shared features
Learner variables: age, learning context, proficiency level
Task variables: medium, field, genre, length

Variable features
Learner variables: gender, mother tongue, region, other FL, L2 exposure
Task variables: topic, task setting, timing, exam, reference tools

In spite of their similarities in terms of age, L2 status, and proficiency level, the learners display some significant differences, the most important one being mother tongue. The ICLE database covers 11 different mother tongue backgrounds: Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish. These groups are further subcategorized according to the geographical provenance of the learners, thus distinguishing between Dutch-speaking learners from the Netherlands and Belgium, or Finnish-speaking learners from Finland and Sweden. In addition, the learners' knowledge of other foreign languages is recorded. Another variable with a potentially significant impact on learner output is the amount of time learners have spent in an English-speaking country. ICLE learners differ considerably in this respect: 40% have never stayed in an English-speaking country whereas some 30% have lived in an English-speaking environment for 3 months or more. A last relevant variable is gender: The corpus contains data from both male and female learners, although the latter clearly constitute the majority (80%).

Task Variables
The ICLE data share a large number of task attributes. They consist exclusively of written productions of a particular genre, namely, essay writing, and represent general English rather than English for specific purposes. They are, on average, 700 words in length, unabridged. The topics are extremely varied, although the majority of them (85%) are argumentative. (Given the difficulty in collecting a sufficient number of argumentative essays, we allowed for the inclusion of a small portion, 25% at most, of literary essays in the data.) The query system allows researchers to select essays on the same or similar topics. For instance, by entering the key word women as a search term, the researcher can retrieve a subcorpus of essays on topics such as "Feminists have done more harm to the cause of women than good," "Women's Liberation," "Single women should not be allowed to have artificial insemination," and "Have real women disappeared?" The essays also include certain differences in task settings. Recorded variables pertaining to the task are whether there was a time limit for writing, whether the essay was part of an exam, and whether the learners were allowed to use language reference tools such as grammars or dictionaries.
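The kind of query described above, compiling a subcorpus from recorded variables or from a topic key word, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Essay record, its field names, and the select function are hypothetical and do not describe the actual ICLE database or its query interface, and the sample records are invented toy data.

```python
from dataclasses import dataclass


@dataclass
class Essay:
    # Hypothetical metadata fields, loosely modeled on the variables named in the text.
    mother_tongue: str
    gender: str
    timed: bool
    topic: str
    text: str


def select(essays, keyword=None, **criteria):
    """Return essays that match every metadata criterion and, if a keyword
    is given, whose topic contains that keyword (case-insensitive)."""
    hits = []
    for essay in essays:
        if any(getattr(essay, name) != value for name, value in criteria.items()):
            continue
        if keyword and keyword.lower() not in essay.topic.lower():
            continue
        hits.append(essay)
    return hits


# Invented toy records; the essay texts are placeholders.
essays = [
    Essay("German", "female", True, "Women's Liberation", "(essay text)"),
    Essay("Spanish", "male", False, "Have real women disappeared?", "(essay text)"),
    Essay("German", "male", False, "A toy topic unrelated to the key word", "(essay text)"),
]

women_essays = select(essays, keyword="women")          # topic-based subcorpus
german_essays = select(essays, mother_tongue="German")  # L1-based subcorpus
```

Criteria can be combined, for example select(essays, keyword="women", mother_tongue="German"), which mirrors the comparisons between learner groups mentioned earlier.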

Size and Representativeness


The ICLE database contains 3,640 essays, totaling 2.5 million words. Each of the 11 national (i.e., L1-differentiated) varieties comprises around 330 essays totaling approximately 200,000 words. Compared with
current very large corpora, such as the British National Corpus (100 million words) or the Bank of English (450 million words), the ICLE is very small. However, when it comes to learner language, size cannot simply be assessed in terms of the number of words. Equally important is the number of learners, and, in this respect, the ICLE, which contains writing by well over 3,000 learners, constitutes a solid empirical basis for second language acquisition (SLA) and foreign language teaching research. (There are slightly more essays than learners, as some learners contributed more than one essay without exceeding the maximum limit of 1,000 words per learner.) However, because of its limited number of words, the ICLE cannot be used for all types of linguistic investigation. It lends itself well to the analysis of high-frequency phenomena at all linguistic levels (morphology, grammar, lexis, discourse) but is unsuited for the study of infrequent linguistic items.

ANALYSIS OF THE CORPUS

Contrastive Interlanguage Analysis


The method most frequently used so far to analyze the ICLE is contrastive interlanguage analysis, an approach that consists in carrying out either a comparison of learner data with native speaker data (L2 vs. L1) or a comparison between different types of learner data (L2 vs. L2) (see Granger, 1996). The first type of comparison makes it possible to uncover the patterns of use distinguishing learner data from native data. These fall into two categories: qualitative differences (misuse) and quantitative differences (over- and underuse). This type of analysis is greatly facilitated by text retrieval programs such as WordSmith Tools (Scott, 1996), which uses the compare lists function to give researchers immediate access to the words or phrases that are significantly under- or overused by learners. The concord and collocate display functions are also extremely valuable as they shed light on the recurring patterns or collocates that learners use, whether correctly or incorrectly. The second type of comparison is essential to establish whether the differences uncovered are developmental or transfer related. With its wide range of mother-tongue backgrounds, the ICLE is an ideal resource to establish the importance of transfer in SLA. According to Odlin (1989), if the phenomenon of transfer is still incompletely understood, it is largely due to the heterogeneity of the data used:
A brief look at the studies cited will show considerable variation in the numbers of subjects, in the backgrounds of the subjects, and in the empirical data, which come from tape-recorded samples of speech, from student writing, from various types of tests, and from other sources. (p. 151)
A highly controlled learner corpus such as the ICLE, with its strict design criteria and rich documentation, should go some way in answering Odlin's call for improvements in data gathering (p. 151). Using this computer-aided contrastive approach, researchers have been able to uncover a wide range of patterns of under-, over-, and misuse in learner lexis, (lexico-)grammar, and discourse (see Centre for English Corpus Linguistics, 2002, for a comprehensive learner corpus bibliography based on the ICLE or other learner corpora). Among the many topics that have been analyzed so far on the basis of ICLE data are high-frequency words, Romance words, recurrent combinations, collocations and formulae, prefabricated language, lexical profiling, lexical variation, adjective intensification, the verb make, progressives, passives, modality, noun phrase complexity, demonstratives, contractions, logical connectors, causal links, conjunctions, participle clauses, direct questions, tense errors, lexical errors, part-of-speech tagging, and parsing.
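The quantitative side of the L2 vs. L1 comparison (over- and underuse) can be illustrated with a minimal Python sketch. It assumes a log-likelihood statistic, one common choice for comparing word frequencies across two corpora; it is not a description of how WordSmith Tools computes its lists. The function names, the toy token lists, and the threshold of 3.84 (the chi-square critical value for p < .05 with 1 degree of freedom) are all assumptions made for the example.

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood (G2) for a word occurring a times in corpus A
    (total_a tokens) and b times in corpus B (total_b tokens)."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


def compare_lists(learner_tokens, native_tokens, threshold=3.84):
    """List words whose relative frequency differs between the two corpora,
    labeled as learner overuse or underuse, strongest difference first."""
    learner, native = Counter(learner_tokens), Counter(native_tokens)
    n_learner, n_native = sum(learner.values()), sum(native.values())
    results = []
    for word in set(learner) | set(native):
        a, b = learner[word], native[word]
        g2 = log_likelihood(a, b, n_learner, n_native)
        if g2 >= threshold:
            direction = "overuse" if a / n_learner > b / n_native else "underuse"
            results.append((word, direction, round(g2, 2)))
    return sorted(results, key=lambda item: -item[2])


# Invented toy data: the learner sample uses "very" far more often.
learner_sample = ["very"] * 30 + ["good"] * 70
native_sample = ["very"] * 5 + ["good"] * 95
keywords = compare_lists(learner_sample, native_sample)
```

For an L2 vs. L2 comparison, the same function could be applied to two learner subcorpora with different mother tongue backgrounds instead of a learner and a native sample.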

Computer-Aided Error Analysis


Differences in frequency patterns are not the only differences between learner and native corpora. Learner writing, even at an advanced proficiency level, is characterized by a much higher error rate than native writing (e.g., in the French subcorpus of the ICLE, the rate is 1 error in every 16 words). As current spelling and grammar-checking programs are not capable of detecting, let alone correcting, the majority of these errors (Granger & Meunier, 1994), error annotation is the only solution for the time being. This time-consuming but highly rewarding process consists in annotating all errors (or errors in a particular category, e.g., verb complementation or modals) in the text files using a standardized system of error tags and an error editor to speed up the process (see Dagneaux, Denness, & Granger, 1998, for a detailed description). Once files have been error tagged, it is possible to search for any error category using a text retrieval program such as WordSmith Tools, sort the errors in various ways, and analyze them in the full context of the text. Although error tagging has not been used on a large scale yet, preliminary work shows the tremendous potential of the approach (see Granger, 1999, for an analysis of verb tense errors).
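To make the retrieval step concrete, the following Python sketch counts errors per category in an error-tagged text and derives a rough error rate. The tag format used here (a category code in parentheses before the erroneous word, with the correction between dollar signs) and the codes GVT and LS are hypothetical conventions adopted for the example; they are not claimed to reproduce the tagset described by Dagneaux, Denness, and Granger (1998). The sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical tag format: (CODE) erroneous_word $correction$
TAG = re.compile(r"\(([A-Z]+)\)\s*(\S+)\s*\$([^$]*)\$")


def error_profile(tagged_text):
    """Return (errors per category, number of words in the learner's own text)."""
    matches = TAG.findall(tagged_text)
    counts = Counter(code for code, _word, _correction in matches)
    # Strip tags and corrections so that only the learner's words are counted.
    plain = TAG.sub(r"\2", tagged_text)
    return counts, len(plain.split())


# Invented toy sentence with two tagged errors.
sample = "Yesterday he (GVT) go $went$ home and (LS) made $did$ his homework"
counts, n_words = error_profile(sample)
words_per_error = n_words / sum(counts.values())
```

In the toy sentence, two errors in eight words give one error in every four words; the same calculation over a full subcorpus would yield figures comparable to the rate of 1 error in every 16 words reported above.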

ENGLISH LANGUAGE TEACHING APPLICATIONS


Learner corpus research opens up exciting pedagogical perspectives in a wide range of areas of English language teaching (ELT) pedagogy: materials design, syllabus design, language testing, and classroom methodology. Here I limit discussion to the first area (for the use of learner
corpora in language testing, see Hasselgren, 2002; for classroom methodology, see Seidlhofer, 2002). The link between corpus-based research and teaching is based on the idea that corpus evidence suggests which language items and processes are most likely to be encountered by language users, and which therefore may deserve more investment of time in instruction (Kennedy, 1998, p. 281). The area where corpus information is used most extensively, to the point of having become standard practice, is ELT lexicography: All monolingual learners' dictionaries are now corpus based. Work has also started on the production of corpus-informed textbooks, although progress in this area is rather slow (however, see Carter, Hughes, & McCarthy, 2000; Thurstun & Candlin, 1997). As regards grammar, although a corpus-based ELT grammar has yet to be written, the frequency information contained in Biber, Johansson, Leech, Conrad, and Finegan's (1999) corpus-based grammar of spoken and written English could be used (and, it is hoped, will soon be used) to design one. Although the benefit of a corpus approach to teaching is evident, linguists are keen to point out that it is not a panacea (cf. Conrad, 1999, p. 17; McCarthy & Carter, 2001, p. 338). Perusal of native corpus data, however detailed, will never tell anything about the degree of difficulty of words and structures for learners. Learner corpora are the resource par excellence to access this type of information. Evidence of learner under-, over-, and misuse can help materials designers and teachers select and rank ELT material at a particular proficiency level. The benefits that can be derived from using learner corpora are apparent from the few CLC-informed ELT resources that exist. The Longman Essential Activator (LEA, 1997) is the first learners' dictionary to incorporate CLC data. The compilers of the dictionary used the Longman Learners' Corpus to find out how learners used the words covered in the LEA.
They then turned the information into help boxes designed to warn learners against typical errors (Gillard & Gadsby, 1998). Although the LEA targets all EFL/ESL learners irrespective of their L1 background, some CLC-based tools are tailor-made for particular groups of learners. Milton's WordPilot (n.d.) software is a writing kit especially designed for Hong Kong learners of English (see Milton, 1998). It contains error recognition exercises intended to sensitize learners to the most common errors attested in a Hong Kong learners' corpus. The program also includes a concordance tool and native corpora of specific genres intended to provide learners with authentic native examples of words with which they have difficulty. Allan's (2002) Web-based TeleNex network (see http://www.telenex.hku.hk) is designed to provide support to secondary-level English teachers in Hong Kong. The Web site contains both students' problems files that describe areas of learner difficulty
extracted from a learner corpus and teaching implications files intended to help teachers deal with these problems in the classroom. The IWiLL language learning environment (Wible, Kuo, Chien, Liu, & Tsao, 2001) is a highly interactive tool that allows students and teachers to create and use an online database of Taiwanese learners' essays and teachers' error annotations. Other Web-based projects (see Cowan, Choi, & Kim, 2003; Kindt & Wright, 2001) bear witness to the tremendous potential of the Internet for CLC-based ELT applications. Although the ICLE has not yet resulted in concrete ELT resources, its tremendous potential in this respect is obvious. Thanks to its differentiation of mother tongue backgrounds, users can distinguish problem areas shared by all learners at an advanced level from those that are specific to a particular learner group and can fine-tune teaching materials accordingly.

CONCLUSION
In the preface to the first volume devoted to learner corpora, Leech (1998) states that the concept of a learner corpus is an idea whose hour has come (p. xvi). At the time, however, most efforts were still expended on collecting data, establishing methodologies for learner corpus research, and trying them out in various case studies. The release of a learner corpus such as the ICLE marks the beginning of a new stage in the evolution of learner corpus research. The time has come to use the resource on a wider scale in both SLA and ELT. On a more theoretical level, the ICLE data can be used alongside other data types of a more experimental nature to give SLA theories a more solid empirical foundation, in particular as regards the important question of L1 transfer. On a practical level, the ICLE can help produce more learner-aware pedagogical material designed for advanced EFL learners in general or focused on the needs of one national learner population.
ACKNOWLEDGMENTS
I acknowledge the support in this research provided by the Belgian National Scientific Research Fund.

THE AUTHOR
Sylviane Granger is a professor of English language and linguistics and director of the Centre for English Corpus Linguistics at the University of Louvain. Her edited publications include Learner English on Computer (Longman, 1998) and Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching (with J. Hung and S. Petch-Tyson; Benjamins, 2002).
REFERENCES
Allan, Q. G. (2002). The TELEC Secondary Learner Corpus: A resource for teacher development. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition and foreign language teaching (pp. 195–211). Amsterdam: Benjamins.
Atkins, S., & Clear, J. (1992). Corpus design criteria. Literary and Linguistic Computing, 7(1), 1–16.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Harlow, England: Longman.
Carter, R., Hughes, R., & McCarthy, M. (2000). Exploring grammar in context. Cambridge: Cambridge University Press.
Centre for English Corpus Linguistics. (2002). List of publications. Retrieved May 22, 2003, from http://juppiter.tr.ucl.ac.be/FLTR/GERM/ETAN/CECL/publications.html
Conrad, S. (1999). The importance of corpus-based research for language teachers. System: An International Journal of Educational Technology and Applied Linguistics, 27, 1–18.
Cowan, R., Choi, H. E., & Kim, D. H. (2003). Four questions for error diagnosis and correction in CALL. CALICO Journal, 20, 451–463.
Dagneaux, E., Denness, S., & Granger, S. (1998). Computer-aided error analysis. System: An International Journal of Educational Technology and Applied Linguistics, 26, 163–174.
Gillard, P., & Gadsby, A. (1998). Using a learners' corpus in compiling ELT dictionaries. In S. Granger (Ed.), Learner English on computer (pp. 159–171). London: Addison Wesley Longman.
Granger, S. (1996). From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora. In K. Aijmer, B. Altenberg, & M. Johansson (Eds.), Languages in contrast (Lund Studies in English 88, pp. 37–51). Lund, Sweden: Lund University Press.
Granger, S. (Ed.). (1998). Learner English on computer. London: Addison Wesley Longman.
Granger, S. (1999). Use of tenses by advanced EFL learners: Evidence from an error-tagged computer corpus. In H. Hasselgard & S. Oksefjell (Eds.), Out of corpora: Studies in honour of Stig Johansson (pp. 191–202). Amsterdam: Rodopi.
Granger, S. (n.d.). International Corpus of Learner English. Retrieved May 22, 2003, from http://www.tr.ucl.ac.be/tr/germ/etan/cecl/Cecl-Projects/Icle/icle.htm
Granger, S., Dagneaux, E., & Meunier, F. (Eds.). (2002). The international corpus of learner English: Handbook and CD-ROM. Louvain-la-Neuve, Belgium: Presses Universitaires de Louvain. (Available from http://www.i6doc.com)
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer learner corpora, second language acquisition and foreign language teaching. Amsterdam: Benjamins.
Granger, S., & Meunier, F. (1994). Towards a grammar checker for learners of English. In U. Fries & G. Tottie (Eds.), Creating and using English language corpora (pp. 79–91). Amsterdam: Rodopi.
Hasselgren, A. (2002). Learner corpora and language testing: Smallwords as markers of learner fluency. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition and foreign language teaching (pp. 143–173). Amsterdam: Benjamins.
Kennedy, G. (1998). An introduction to corpus linguistics. London: Longman.
Kindt, D., & Wright, M. (2001). Integrating language learning and teaching with the construction of computer learner corpora. Retrieved May 22, 2003, from http://www.nufs.ac.jp/~dukindt/pages/SOCCpapers.html
Leech, G. (1991). The state of the art in corpus linguistics. In B. Altenberg & K. Aijmer (Eds.), English corpus linguistics (pp. 8–29). London: Longman.
Leech, G. (1998). Learner corpora: What they are and what can be done with them. In S. Granger (Ed.), Learner English on computer (pp. xiv–xx). London: Addison Wesley Longman.
Longman Corpus Network. (2003). The Longman learners' corpus. Retrieved May 22, 2003, from http://www.longman-elt.com/dictionaries/corpus/lclearn.html
Longman essential activator. (1997). Harlow, England: Addison Wesley Longman.
McCarthy, M., & Carter, R. (2001). Size isn't everything: Spoken English, corpus, and the classroom. TESOL Quarterly, 35, 337–340.
Milton, J. (1998). Exploiting L1 and interlanguage corpora in the design of an electronic language learning and production environment. In S. Granger (Ed.), Learner English on computer (pp. 186–198). London: Longman.
Milton, J. (n.d.). WordPilot [Computer software]. (Available from http://home.ust.hk/~autolang/download_WP.htm)
Odlin, T. (1989). Language transfer: Cross-linguistic influence in language learning. Cambridge: Cambridge University Press.
Scott, M. (1996). WordSmith Tools [Computer software]. Oxford: Oxford University Press.
Seidlhofer, B. (2002). Pedagogy and local learner corpora: Working with learning-driven data. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition and foreign language teaching (pp. 213–234). Amsterdam: Benjamins.
Thurstun, J., & Candlin, C. (1997). Exploring academic English: A workbook for student essay writing. Sydney, Australia: National Centre for English Language Teaching and Research.
Wible, D., Kuo, C.-H., Chien, F.-Y., Liu, A., & Tsao, N.-L. (2001). A Web-based EFL writing environment: Integrating information for learners, teachers, and researchers. Computers and Education, 37, 297–315.

The Multimedia Adult ESL Learner Corpus


STEPHEN REDER, KATHRYN HARRIS, and KRISTEN SETZLER
Portland State University
Portland, Oregon, United States
This report describes an innovative corpus project that will add several important dimensions to the emerging connections between corpus linguistics and TESOL. A multimedia learner corpus, the Multimedia Adult ESL Learner Corpus (MAELC), is being collected within an adult ESL instructional environment. This Lab School environment (see http://www.labschool.pdx.edu) is jointly operated by the Applied Linguistics Department at Portland State University and Portland Community College, an adult ESL provider. Low-level adult ESL classrooms within a regular program are continuously recorded with multiple video cameras and microphones. By the end of the 5-year project period
(August 1, 2001–July 31, 2006), the resulting corpus will contain approximately 5,000 hours of classroom language and instruction involving approximately 1,000 adult learners. With software developed to attach transcriptions and classroom activity codes to the digital media corpus, users can readily search for and play back video-audio clips that illustrate particular points of second language acquisition (SLA) or L2 pedagogy. This multimedia corpus and associated software tools will be available online to scholars and practitioners for research and professional development activities.

INNOVATIVE CHARACTERISTICS OF THE CORPUS


MAELC will add considerably to existing L2 learner corpora. Although a range of L2 learner corpora are already available (e.g., Granger, 1998; Granger, Hung, & Petch-Tyson, 2002), this corpus adds several aspects to the connections between corpus linguistics and TESOL: (a) It focuses on the early stages of adult SLA, (b) it is highly extensible and searchable in terms of both transcribed language and coded pedagogical activities, and (c) with associated software, it maintains persistent links between transcriptions and original audio-video recordings.

Focus on Early Stages of SLA


A multimedia corpus is a particularly appropriate way to capture language in the early stages of acquisition. Low-level learner language has traditionally been very difficult to research, in part because emergent L2 forms and nonverbally conducted communication are difficult to represent in transcripts. MAELC represents learner language through both transcripts and the associated video and audio recordings. The corpus includes learners from the very beginning stages throughout their acquisition process, making longitudinal studies possible on large numbers of learners. The recording of individual learners on a regular basis over time will make possible in-depth research on learner language development.

Corpus Extensibility and Searchability


The digitally recorded multimedia corpus is very large, with numerous cameras and microphones used to record each class (see Appendix A). Such a corpus would be extremely time-consuming for either practitioners or scholars to use unless its contents were indexed in ways that allow ready access for use in research or professional development. The project has developed specialized software, called ClassAction (see Appendix B), to attach activity and content codes and transcriptions of classroom language to the multimedia corpus. Our classroom activity
coding framework, described below, indexes and helps locate clips from the recorded language classrooms reflecting particular participation patterns, pedagogical activities, and so forth. Searches based on these activity codes will make it easy to examine various aspects of learner acquisition processes in the classroom context. We have also developed a transcription framework appropriate for early stages of SLA. Transcripts include information on what students actually said and what their target utterance was, facilitating the ready identification of certain types of errors in learner language. Users of the corpus can search and analyze the transcribed language data using corpus linguistics software. ClassAction enables other researchers to add their own structured codes, open-ended annotations, or transcription details (e.g., a layer of phonetic transcription or a layer of grammatical tags) to the corpus, which will be available to a community of users. The corpus can be searched in terms of combinations of classroom codes and student language, facilitating research directed at relationships between pedagogical activities and student language development.
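A search that combines classroom activity codes with transcribed student language might look like the following Python sketch. It is an illustration only: the Clip record, its field names, the code values, and the search and with_errors functions are hypothetical and do not describe ClassAction or the MAELC data format, and the two records are invented toy data.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start: str    # offset into the class recording (hypothetical format)
    codes: dict   # activity codes for the moment, e.g. {"participation": "Pair"}
    said: str     # transcription of what the student actually said
    target: str   # the utterance the student was aiming for


def search(clips, phrase=None, **codes):
    """Return clips matching every activity code and, optionally, a phrase
    in the transcribed student language (case-insensitive)."""
    hits = []
    for clip in clips:
        if any(clip.codes.get(name) != value for name, value in codes.items()):
            continue
        if phrase and phrase.lower() not in clip.said.lower():
            continue
        hits.append(clip)
    return hits


def with_errors(clips):
    """Return clips whose actual utterance differs from the target utterance."""
    return [clip for clip in clips if clip.said != clip.target]


# Invented toy records, not taken from MAELC.
clips = [
    Clip("0:18:04", {"participation": "Pair"},
         "what you do weekend", "what did you do on the weekend"),
    Clip("0:25:10", {"participation": "Whole class"},
         "I went to the park", "I went to the park"),
]

pair_clips = search(clips, phrase="weekend", participation="Pair")
error_clips = with_errors(pair_clips)
```

Keeping the actual and target utterances side by side in each record is what makes the error filter a one-line comparison; additional transcription layers could be added as further fields in the same way.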

Persistent Links with Recorded Media


Persistent links between the corpus of transcribed language and the original media recordings add a number of important dimensions to corpus-based research and professional development in TESOL. First, when researchers query the corpus, they not only receive the selected linguistic data but can also view and listen to the associated clips from the classrooms. This retrieval will greatly extend researchers' ability to interpret transcriptions and codes in the corpus as well as to add transcriptions or activity codes to the corpus. Language teachers and teacher educators can search the corpus and display selected clips to use in preservice and in-service education and professional development activities. Modules that have been developed on working with low-level learners and teachable moments (Kurzet, 2000) have included illustrative video clips, related readings, and discussion questions. Figure 1, a screen shot from the Toolbox module of ClassAction, illustrates how the transcription and coding of classroom language and activity are linked to the recorded media (see Lab School, 2003a, for the sample clip and associated corpus data referred to in this report). Onscreen is the Cinco de Mayo clip from a class on May 7, 2002 (0:18:04 into a 3-hour class). Six synchronized camera views are displayed (b), any one of which can be clicked upon to enlarge its size (a) and activate its audio. Classroom activity codes associated with this moment are also shown (c–f). The real-time scrolling transcript appears in the upper right window (i), with a variety of transcript layers displayed

548

TESOL QUARTERLY

in the tabbed window below it (h); here the target layer is shown. Users can view, enter, or edit additional datawithin the standard MAELC framework or within a custom frameworkthrough the tabbed window (g).
FIGURE 1 Screen Shot of a MAELC Clip in ClassAction Toolbox

Note. Cinco de Mayo clip from a class on May 7, 2002 (0:18:04 into 3-hour class). (a) current camera view; (b) six synchronized camera views (two pair-focused views on the ends separated by four fixed, whole-class views); (c) coded participation pattern (Pair); (d) coded prompt (Directions: Verbal); (e) coded information (Personal); (f) coded language (Question/Answer); (g) tabbed panel for viewing/editing transcriptions, annotations, markings, and custom codes linked to media; (h) tabbed panel for viewing synchronized nonorthographic layers of transcription (target, phonetic, morphophonemic, grammatical tag, translation, and notes layers; target layer shown); (i) synchronized orthographic transcription window. Key to transcript notation: <Name.#> = speaker ID; [Name.#> = addressee ID; <chn> = Chinese code switch; _ = false start; xxx = unrecoverable speech; (+) = pause of 0.5–1.0 second; (#) = pause of 1 second or more; ? = rising intonation; . = falling intonation; * = emergent lexical item.


CODING SYSTEM
The design of the MAELC coding system reflects the Lab School project's research focus on SLA as seen through student interaction in classrooms. In developing the coding system, we were guided by the need to index as many hours of classroom recordings as possible while maintaining consistency and reliability among numerous coders. We chose categories and category labels to maximize usability by researchers in SLA and L2 pedagogy as well as by L2 educational practitioners.

The MAELC coding system draws on the Communicative Orientation of Language Teaching (COLT) observation scheme (Fröhlich, Spada, & Allen, 1985) and the Foci for Observing Communications Used in Settings (FOCUS) (Fanselow, 1977) for categories. However, unlike those systems, our coding system takes advantage of the flexibility of coding recorded as opposed to real-time classes. We divide time along overlapping, parallel dimensions (termed segment types). Within any one time line, time is divided into segments, with the end of one segment marking the beginning of the next. (The COLT scheme, Part A, also segments the time line but allows multiple category labels to be assigned to each segment. ClassAction allows only one category label for each segment within a segment type.)

At the beginning of the clip shown in Figure 1 (18:04), two students are engaged in pair work that starts at 15:01 and ends at 22:28. Figure 2 shows a schematic of the overlapping segments that are coded for the entire activity. The activity that the students are engaged in begins at 11:04 with the teacher giving directions for the students to discuss their weekend. The teacher provides language for the students to use as a guide for their discussion. On the board she writes
How was your weekend? What did you do? Tell me more.

The students have a few minutes each to talk with their partner about their weekends. Beyond the initial question, the students use their own language, not needing the support of the language provided by the teacher. At 22:28 the teacher brings the class back together to wrap up the activity by eliciting details from the student pairs. The activity ends at 32:30. A new activity begins at 32:30 with the teacher asking the students questions on a different topic. The segment types or categories chosen for the MAELC coding system describe the organization of the classroom and the instructional activities in which the teachers and students engage. The codes describe what is observable from video recordings rather than inferences about the

FIGURE 2 Overlapping Segments in the MAELC Coding System

Time boundaries: 11:04, 15:01, 22:28, 32:30

Segment type: Participation pattern
11:04–15:01 Teacher fronted
15:01–22:28 Pair
22:28–32:30 Teacher fronted

Segment type: Pedagogical activity
11:04–32:30 Prompt: Teacher provided directions; Information: Student provided personal information; Language: Student provided questions and answers

32:30 onward: New activity

intentions of the teachers within the instructional process. The goal of the coding system is not to compare the teacher to an established set of expectations but to index a large corpus of recorded classes so that corpus users can readily locate and observe periods during which particular student or teacher behaviors of interest are likely to occur. Twenty-four hours per week of classes have been recorded continuously since September 2001. At this writing, half of these classes have been indexed using the MAELC coding system, and a portion of each coded class has been transcribed.
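The overlapping time lines described above can be sketched in a few lines of Python. This is only an illustration, not the ClassAction implementation: the class and function names are hypothetical, and the activity label "weekend discussion" is an invented shorthand for the activity in Figure 2. Each segment type is stored as a list of start times, because the end of one segment is the start of the next and each segment carries exactly one label.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


def to_seconds(clock: str) -> int:
    """Convert an 'MM:SS' or 'H:MM:SS' offset into seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


@dataclass
class Timeline:
    """One segment type: contiguous segments, one label per segment."""
    name: str
    starts: List[int]   # start offsets in seconds, ascending
    labels: List[str]   # labels[i] applies from starts[i] to starts[i + 1]

    def label_at(self, clock: str) -> Optional[str]:
        # Find the last segment that starts at or before the requested time.
        index = bisect_right(self.starts, to_seconds(clock)) - 1
        return self.labels[index] if index >= 0 else None


def make_timeline(name, boundaries):
    """Build a Timeline from (clock, label) pairs given in time order."""
    starts = [to_seconds(clock) for clock, _ in boundaries]
    labels = [label for _, label in boundaries]
    return Timeline(name, starts, labels)


def codes_at(clock, timelines):
    """Look up the label on every time line at one moment."""
    return {t.name: t.label_at(clock) for t in timelines}


# Times taken from Figure 2; "weekend discussion" is a hypothetical label.
participation = make_timeline("participation pattern", [
    ("11:04", "teacher fronted"),
    ("15:01", "pair"),
    ("22:28", "teacher fronted"),
])
activity = make_timeline("pedagogical activity", [
    ("11:04", "weekend discussion"),
    ("32:30", "new activity"),
])

print(codes_at("18:04", [participation, activity]))
```

Looking up 18:04, the moment shown in Figure 1, returns the pair segment on the participation time line and the weekend activity on the pedagogical activity time line, which matches the codes displayed in the screen shot.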

Participation Pattern
Participation pattern codes, shown in Part (c) of Figure 1, reflect the grouping of the class. Examples include teacher fronted, student fronted, individual private (students are working alone at their desks), individual public (students are working alone but in the public space; e.g., they are writing on the board), pair, group, and free movement. Indexing language classrooms in this way allows project researchers to locate and further analyze periods during which, for example, students are working and talking in pairs in the context of varied pedagogical activities.

Activity
The second way in which the time line is divided reflects the pedagogical activity. Activities are the components of daily instruction organized and planned by the teacher. In the MAELC coding system, each activity is described in three dimensions: the prompt that starts the activity (as indicated in Figure 1, Part [d]), the information used in the activity (as indicated in Figure 1, Part [e]), and the language students use to participate in the activity (as indicated in Figure 1, Part [f]). Each dimension includes information about who (teacher or student) provides it and about what is provided. In Figure 1 these dimensions are coded respectively as a teacher-provided prompt that is a set of directions, student-provided information that is personal, and student-provided language that is question/answer.

Using the MAELC coding system, users of the corpus can locate segments of time during which patterns of language or pedagogical behavior of interest may occur. An example is the location of pair participation pattern segments occurring during pedagogical activities that utilize students' personal information. Lab School researchers are comparing the language produced in these activities with the language produced in activities using information from other sources, such as textbooks. Current research focuses on analyzing the degree of information transfer that occurs, the negotiation of meaning that happens, and the ways in which the students work to co-construct meaning in these and other activity types (e.g., Garland, 2002).
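Locating pair segments that fall within activities built on students' personal information amounts to intersecting segments from two segment types. The following is a minimal sketch of that idea, assuming segments are stored as start and end offsets in seconds; the names Segment and overlaps are hypothetical and are not part of ClassAction. The times come from the activity described under Coding System (11:04, 15:01, 22:28, 32:30).

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int   # seconds from the start of the class recording
    end: int
    label: str


def overlaps(first: List[Segment], second: List[Segment]) -> List[Segment]:
    """Return the time spans where a segment from each list co-occurs."""
    found = []
    for a in first:
        for b in second:
            start, end = max(a.start, b.start), min(a.end, b.end)
            if start < end:   # keep only spans with nonzero duration
                found.append(Segment(start, end, f"{a.label} + {b.label}"))
    return sorted(found)


# Pair work runs 15:01-22:28; the personal-information activity 11:04-32:30.
pair_work = [Segment(901, 1348, "pair")]
personal_information = [Segment(664, 1950, "student-provided personal information")]

for hit in overlaps(pair_work, personal_information):
    print(hit)
```

The nested loop is adequate for a single class session; a corpus-wide search would presumably rely on an indexed database query instead.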

TRANSCRIPTION SYSTEM
Portions of students' language in the classroom are also being transcribed. The nature of the data in the corpus requires a broader notion of transcript than the field has previously associated with that term. A transcript in our project is not an isolated representation of language; it is a linked, multimedia combination of audio, video, and written representations. This type of corpus creates a powerful context for discussing our own representations and analyses of students' emerging language as well as how such representations limit and facilitate the various understandings of student language within the TESOL field.

The transcription includes an identification of the speaker and addressee and the modality (written or oral) as well as representations of the language produced. Because the transcription can be searched, corpus users can access high-quality language samples for use in longitudinal language acquisition research. Making this kind of research possible required careful consideration of the transcription system used in the corpus. The development of the MAELC transcription system was guided by the difficulties of representing both low-level student language and the dynamic classroom environment. Furthermore, the system had to strike a balance between the level of transcription detail and the limited project resources. The Lab School project's research focus on student language as seen in student-student interaction guides the types of information included in the transcription system as well as the principled selection of time periods for transcription.

Because the corpus is Web-based, other researchers will be able to use ClassAction to transcribe additional material and add layers of transcription detail to transcribed portions of the corpus (e.g., phonetic or morphophonemic layers; see Figure 1, Part [g]). (For information about obtaining access to the corpus and ClassAction software, see Lab School, 2003b.)

Transcription Segments and Bubbles


In order to anchor the transcription notation to its audio-video context, a transcription segment specifies a given camera, microphone, starting point, and ending point within a particular media file. The base transcription layer is a general, orthographic representation of language, including more detailed features than nontechnical transcripts usually include and fewer features than most conversation analysis approaches include. To avoid oversimplifying the complex nature of language produced by low-level learners in the classroom, we developed notation features specific to low-level discourse and protocols for transcribing classroom interaction that maximize the utility and extensibility of the initial transcription.

Rather than attempt to represent all language that might be heard in a particular camera view, we use the notion of a language bubble to focus our transcription efforts. A bubble contains language that is produced by a student wearing a wireless microphone, the student's interlocutors, and anyone else audible through that microphone who provides a discourse context for the student. The bubble concept sets up priorities within the multitude of simultaneous conversations that occur in the ESOL classroom, resulting in a usable transcript for the researcher/end user of the corpus. The system also has the benefit of targeting language recorded with high audio quality, that is, language that takes place around the wireless microphones.

Discourse that is transcribed in the student-student language bubbles includes information about the speaker and addressee, code-switching behaviors, the oral or written modality of the language produced, information on major phrase-level intonation, intraturn pauses, miscues and repairs, paralinguistic vocalizations (e.g., laughter), and nonlexical information used in lieu of lexical items. Although a transcription protocol for this type of language could include additional aspects of low-level learner language, we have reasoned that, as a first pass, this level of detail will provide a platform for our current analysis and a basis for future research.
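One way to picture a transcription segment and its language bubble is as a small record structure. The sketch below is an assumption about how such data could be organized, not a description of the MAELC file format: the field names, the media file name, the end time, and every utterance except the one by Chinh.1 are invented for illustration. It also simplifies the bubble to talk spoken by or addressed to the microphone wearer, whereas the protocol leaves the inclusion of other audible speakers to the transcriber's judgment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Utterance:
    speaker: str               # speaker ID, e.g. "Chinh.1"
    addressee: Optional[str]   # addressee ID when it can be identified
    modality: str              # "oral" or "written"
    text: str                  # base orthographic layer


@dataclass
class TranscriptionSegment:
    media_file: str
    camera: int
    microphone: str
    start: float               # seconds into the media file
    end: float
    utterances: List[Utterance] = field(default_factory=list)

    def in_bubble(self, wearer: str) -> List[Utterance]:
        """Keep utterances spoken by or addressed to the microphone wearer."""
        return [u for u in self.utterances
                if wearer in (u.speaker, u.addressee)]


# Hypothetical segment; only the Chinh.1 utterance comes from the report.
segment = TranscriptionSegment(
    media_file="class_2002-05-07.mpg",   # invented file name
    camera=5,
    microphone="wireless-1",
    start=967.0,                         # 16:07 into the class
    end=980.0,                           # invented end point
    utterances=[
        Utterance("Chinh.1", "Partner.2", "oral", "*Mes_ *Mesico holiday?"),
        Utterance("Partner.2", "Chinh.1", "oral", "yes holiday"),
        Utterance("Other.3", "Other.4", "oral", "xxx"),
    ],
)

for utterance in segment.in_bubble("Chinh.1"):
    print(utterance.speaker, utterance.text)
```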

Flagging of Emergent Language


Many of the issues we encountered around how to represent emergent language were also confronted by researchers in L1 acquisition. How could we represent lexical items that were imperfectly acquired? How could we represent what the learner intended to express so as to facilitate research in interlanguage development? We often found that attempts to represent emergent forms either were in conflict with the need to economize effort within our large-scale transcription enterprise or failed to meet reasonable standards of accuracy and reliability. Ultimately, to promote the broad and flexible use of the corpus by scholars, we decided to indicate systematically the occurrence of emergent forms in a way that would facilitate future research in this area even if we do not fully characterize such forms in the base transcription layer; these emergent items are flagged with asterisks. The asterisks serve as an easy way to locate nontarget items with meanings that may be reasonably inferred. In the interaction shown in Figure 1, a student attempts to understand the explanation of a Cinco de Mayo festival:
Actual transcript: 18:03 Chinh.1 *itsupendence <chn> xxx
Target: *independence <Chinese code switch> xxx

This item is flagged based on the substitution of the sound [ts] for [nd] in the lexical item. Extra or missing consonants also warrant a flag, although partial lexical items do not unless a lexical substitution is also occurring. (The example above also shows the marking of code switching, a feature common in the low-level learner language in the corpus.) Another example, also from the clip shown in Figure 1, shows the flagging of the substitution of [s] for [ks] and the missing final consonant of the target adjective (Mexican):
Actual transcript: 16:07 Chinh.1 *Mes_ *Mesico holiday?
Target: *Mex_ *Mexican holiday?

This example illustrates that, even though our protocol specifies that a sound substitution has occurred, it does so broadly. A researcher who wanted to add a layer of coding with a more principled phonetic representation could easily locate the relevant items. We include target transcriptions only in cases where the transcriber is reasonably sure of the representation; in cases of reasonable doubt, no target representation is attempted. In the example above, it is reasonable to assume that the target form in this case is the adjective form Mexican. The preceding and following utterances show the negotiation of the adjective form. However, the distinction is not always clear-cut. Transcribers make no attempt to determine what a speaker intended to say (e.g., whether a student intended to say Mexican or Mexico), and the decision to flag an item is based on consonant substitution, addition, or omission.
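Because emergent items carry an asterisk, locating them in the base layer can be done with a simple pattern search. The sketch below uses the two examples above; the function names are hypothetical, and pairing flagged items in the actual and target layers by position is an assumption made for illustration, not a documented feature of the transcription system.

```python
import re

# An asterisk followed by word characters; the false-start marker "_" is
# kept because the \w class also matches the underscore.
FLAGGED = re.compile(r"\*(\w+)")


def flagged_items(line: str):
    """Return the emergent (asterisk-flagged) items in one transcript line."""
    return FLAGGED.findall(line)


def pair_with_targets(actual: str, target: str):
    """Align flagged items in the actual and target layers by position."""
    return list(zip(flagged_items(actual), flagged_items(target)))


print(flagged_items("*itsupendence <chn> xxx"))
print(pair_with_targets("*Mes_ *Mesico holiday?", "*Mex_ *Mexican holiday?"))
```

A researcher adding a phonetic layer could start from a list like this and transcribe only the flagged items rather than every word in the corpus.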


We do not represent phonetically students' emergent language that contains sound substitutions (because doing so consistently throughout the corpus was not feasible), nor do we assume that such attempts are target forms, which would misrepresent the students' attempts toward accuracy. We hope that systematically flagging these items for further principled inquiry will help the field to understand better the role played by these forms in SLA.

Vowels, which are somewhat more like points on a continuum than many of the consonant sounds are, entail more difficult transcription decisions. We are still in the process of developing a good way to identify vowels that need to be flagged. The need for a flag is apparent in cases where vowel omissions or substitutions result in a change in lexical meaning. Less salient vowel substitutions have been difficult to pin down. When do vowel approximations that are evident in many accents qualify as interfering with comprehension, and when are they minor? We have not succeeded at defining, a priori, a consistent description of vowel errors that seem egregious within the broader orthographic (not phonetic) representation used in the base layer of the transcription. It is hoped that items already flagged in the corpus can serve as the basis for a bottom-up analysis of native speaker judgments in flagging vowel errors.

THE FUTURE OF MAELC


MAELC provides a new and perhaps unique view of low-level L2 development and the pedagogical context within which it occurs. The recording environment makes it possible to focus on emerging language in student-student interaction within classroom (as opposed to experimental) settings over time. A set of coding and transcription tools along with specialized software permits the project team to analyze the corpus in conducting research in SLA and pedagogy. Because the corpus and software will be accessible to researchers via the World Wide Web, we hope the corpus will grow to include additional codes, transcripts, and other types of annotation provided by a community of researchers.

Several obvious directions for studies based on this corpus are morphology acquisition studies that address how L1 backgrounds influence early language acquisition and studies of the development of form/function relationships. Because the corpus is searchable, researchers are able to look at how lexical choice varies according to activity type. Perhaps more than anything, this corpus will provide the data necessary for researchers to explore how emergent learner language differs from higher level learner language and native language acquisition.


ACKNOWLEDGMENTS
This project is supported in part by grant R309B6002 from the U.S. Department of Education, Educational Research and Development Centers Program, to the National Center for the Study of Adult Learning and Literacy, in which Portland State University is a partner.

THE AUTHORS
Stephen Reder is professor and chair of the Department of Applied Linguistics at Portland State University. His interests include adult literacy and language development and the relationships among literacy, technological change, and language. He is currently doing research in the ESOL Lab School and the Longitudinal Study of Adult Learning projects.

Kathryn Harris is a research associate at the Adult ESOL Lab School and an assistant professor in the Department of Applied Linguistics at Portland State University. She is interested in the process of adult beginning second language acquisition, in- and out-of-class influences on that acquisition, and how the acquisition process can be seen in classroom student-student interaction.

Kristen Setzler is a research associate and MA TESOL student at Portland State University. Her interests include second language acquisition, adult learning and literacy development, and the issues involved in the transcription of low-level learner language.

REFERENCES
Fanselow, J. F. (1977). Beyond RASHOMON: Conceptualizing and describing the teaching act. TESOL Quarterly, 11, 17–39.
Fröhlich, M., Spada, N., & Allen, P. (1985). Differences in the communicative orientation of L2 classrooms. TESOL Quarterly, 19, 27–57.
Garland, J. (2002). Co-construction of language and activity in low-level ESL pair activity. Unpublished master's thesis, Portland State University, Portland, OR.
Granger, S. (1998). The computerized learner corpus: A versatile new source of data for SLA research. In S. Granger (Ed.), Learner English on computer (pp. 3–18). New York: Addison Wesley Longman.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer learner corpora, second language acquisition and foreign language teaching. Amsterdam: John Benjamins.
Kurzet, R. (2002). Teachable moments: Videos of adult ESOL classrooms. Focus on Basics, 5(D), 8–11.
Lab School. (2003a). Cinco de mayo [Data file]. Portland, OR: Portland State University. Retrieved July 16, 2003, from http://www.labschool.pdx.edu/Viewer/viewer.php?TQ_XML3
Lab School. (2003b). Lab school software: ClassAction. Portland, OR: Portland State University. Retrieved July 16, 2003, from http://www.labschool.pdx.edu/ClassAction/


APPENDIX A Data Collection for MAELC


In the classroom, four fixed ceiling-mounted cameras focus on the whole class, and two remotely controlled, ceiling-mounted cameras focus on individual students wearing wireless microphones (rotated among the students in a class from day to day). The result is a view of the instruction, a close-up view of the students participating in the instructional activities, and high-quality audio recordings of their language. Classes are selected for coding and transcription in a stratified random design to provide approximately equal amounts of data for classes taught by each teacher and at each instructional level.
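A stratified random selection of this kind can be sketched as follows. The report does not give the sampling procedure in detail, so this is only an illustration under stated assumptions: each stratum is taken to be one combination of teacher and instructional level, the same number of classes is drawn from every stratum, and the teacher IDs, level names, and record layout are invented.

```python
import random
from collections import defaultdict


def stratified_sample(classes, per_stratum, seed=0):
    """Pick up to per_stratum recordings from each (teacher, level) stratum."""
    rng = random.Random(seed)          # seeded so the selection is repeatable
    strata = defaultdict(list)
    for recording in classes:
        strata[(recording["teacher"], recording["level"])].append(recording)
    chosen = []
    for key in sorted(strata):
        group = strata[key]
        chosen.extend(rng.sample(group, min(per_stratum, len(group))))
    return chosen


# Hypothetical catalog: 2 teachers x 2 levels x 5 recorded classes each.
recordings = [
    {"id": f"{teacher}-{level}-{n}", "teacher": teacher, "level": level}
    for teacher in ("T1", "T2")
    for level in ("A", "B")
    for n in range(5)
]

picked = stratified_sample(recordings, per_stratum=2, seed=1)
print(len(picked), "classes selected for coding")
```

Drawing the same number from every stratum yields roughly equal amounts of data per teacher and per level only if class sessions are of similar length, which is an additional assumption of this sketch.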

APPENDIX B ClassAction Software


ClassAction is a set of interrelated software programs: The Coder-Transcriber program is used by project staff to code and transcribe the data. The Toolbox program is used to view synchronized camera views (with audio) and to view, enter, and edit associated codes, transcripts, and annotations. The Query program is used to search the large ClassAction corpus of language and codes to identify clips illustrating selected features of language use and pedagogy. These selections (e.g., all the utterances of a particular speaker over time or Level A conversations between speakers with different L1s) are assembled into play lists (comprising one or more clips) that can be viewed using Toolbox. The Viewer program is a freely downloadable browser plug-in that has some but not all of the functionality of Toolbox. Play lists can be edited and published on the ClassAction server so that research articles and textbooks may incorporate multimedia examples of particular points of SLA, use, or pedagogy and make them publicly viewable with the Viewer program. Users can view all recorded classes, including those not coded by Lab School staff.
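The play list idea, for instance collecting all the utterances of a particular speaker over time, can be illustrated with a short sketch. The record layout and function name below are hypothetical and do not reflect the Query program's actual interface; the two Chinh.1 start times (16:07 and 18:03 on May 7, 2002) come from the transcript examples in this report, while the end times and the second speaker are invented.

```python
from typing import List, NamedTuple


class Clip(NamedTuple):
    class_date: str   # ISO date of the recorded class
    start: int        # seconds into the class
    end: int
    speaker: str


def play_list_for(speaker: str, clips: List[Clip]) -> List[Clip]:
    """Collect one speaker's clips in chronological order."""
    # Tuples sort by class_date first, then by start time within a class.
    return sorted(c for c in clips if c.speaker == speaker)


clips = [
    Clip("2002-05-07", 1083, 1090, "Chinh.1"),   # 18:03
    Clip("2002-05-07", 975, 982, "Partner.2"),   # invented clip
    Clip("2002-05-07", 967, 974, "Chinh.1"),     # 16:07
]

for clip in play_list_for("Chinh.1", clips):
    print(clip.class_date, clip.start, clip.end)
```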


REVIEWS
TESOL Quarterly welcomes evaluative reviews of publications relevant to TESOL professionals. Edited by SUSAN CONRAD Portland State University

Corpus Linguistics Texts


An Introduction to Corpus Linguistics.
Graeme Kennedy. Harlow, England: Addison Wesley Longman, 1998. Pp. xii + 315.

Corpus Linguistics.
Tony McEnery and Andrew Wilson. Edinburgh, Scotland: Edinburgh University Press, 1996. Pp. x + 209.

Corpus Linguistics: Investigating Language Structure and Use.


Douglas Biber, Susan Conrad, and Randi Reppen. Cambridge, England: Cambridge University Press, 1998. Pp. x + 300.

As linguists, teachers, and other language researchers have become more interested in language analysis based on corpus linguistic methods, the demand for publications that clearly explain and exemplify those methods has increased. Each of the three books reviewed here is aimed at a slightly different audience, but in combination they deliver all the essential information one would need to fully understand and utilize the history, current state, and broad applications of corpus linguistics.

Kennedy's An Introduction to Corpus Linguistics provides a thorough history of the development of corpus linguistics, beginning with investigations that used corpus methods before computers were readily available. The book is divided into chapters that outline the design and development of corpora over time (chapter 2), descriptions of English that have resulted from corpus methods (chapter 3), examples of the types of analyses that have been conducted (chapter 4), and the implications of those analyses (chapter 5).
TESOL QUARTERLY Vol. 37, No. 3, Autumn 2003


One of the great strengths of Kennedy's work is the care he takes to situate corpus linguistics within linguistic theory, current and past research paradigms, and the history of linguistics. As such, it is a valuable resource for teachers and other language researchers who want to make theoretically informed sense of the growing body of language analyses and teaching materials that have resulted from the use of corpus linguistic methods. The book is not, however, aimed at guiding readers who wish to become involved in the compilation and analyses of natural language corpora. For those readers, the other two books reviewed here will be of interest.

McEnery and Wilson's Corpus Linguistics is intended as the text for an undergraduate introductory course in corpus linguistics. The topics include early (pre-1950s) machine-readable corpora and the empirical studies that employed them (chapter 1), issues related to compiling and annotating corpora (chapter 2), the use of corpora for qualitative as well as quantitative analyses (chapter 3), a survey of the range of published corpus-based language studies (chapter 4), the uses of corpora in computational linguistics (chapter 5), and a case study on sublanguages using corpus linguistic methods (chapter 6). The book concludes with a look at the directions corpus linguistics might take in the future. Each chapter includes study questions (with suggested answers provided in an appendix) and a list of further readings. Two other appendixes give detailed information on the corpora mentioned in the text (including contact information) and descriptions and contact information for some corpus research software.

McEnery and Wilson have provided a good deal of information for students wishing to engage in corpus-based analyses, especially by walking through an entire sublanguage case study from hypothesis to analysis to interpretation of findings. In addition, the authors do an admirable job of presenting general information on the most common quantitative approaches used in corpus linguistics (e.g., chi-square, mutual information, Z-scores). However, some sections of the text would be quite challenging, if not overwhelming, for most undergraduate students. For example, the text-encoding information provided in chapter 2 is quite dense and extremely detailed, and the discussion in chapter 5 fails to define computational linguistics or natural language processing yet seems to use them interchangeably.

Corpus Linguistics: Investigating Language Structure and Use, by Biber, Conrad, and Reppen, could be used as the textbook in an introductory corpus linguistics course as well as by language researchers interested in engaging in corpus-based studies. The book is arranged according to the types of corpus-based language investigations that have been conducted, and each chapter includes the important research questions that apply to the chapter topic as well as the steps to set up the corpus-based investigation into that topic and the interpretation of the findings. Part 1 details the use of language features in the realms of lexicography, grammar, lexicogrammar, and discourse characteristics. Part 2 examines the characteristics of varieties of language, including register variation, language acquisition, and historical and stylistic variation. Part 3 takes a brief look at the future of corpus linguistics applications, and Part 4 provides a series of methodology boxes that guide the reader through such issues as corpus design, concordancing, part-of-speech tagging, and various quantitative analyses. An appendix provides information on commercially available corpora and corpus software, including relevant Web site addresses.

This book provides the best guidance for those interested in designing their own corpus-based research. The inclusion of the methodology boxes as well as the step-by-step arrangement of each chapter provides the reader with the basic information needed to begin corpus linguistic research. Moreover, the authors provide contact information for readily available corpora (written, spoken, historical, international varieties of English, and tagged) and the software tools necessary to begin analyses. Readers would need additional instruction in and practice with specific software tools before conducting their own studies, however.

In sum, each book reviewed here is highly recommendable to different audiences for different reasons. Kennedy's work helps readers and language researchers understand corpus linguistics from theoretical and historical perspectives; McEnery and Wilson provide numerous examples of corpus-based research and a very effective, step-by-step case study on sublanguages; and Biber, Conrad, and Reppen walk the reader through the research process from start to finish in a wide variety of linguistic analyses using corpus methods.
MARIE E. HELT California State University, Sacramento Sacramento, California, United States


Edited Volumes on Corpus Linguistics


Corpus Linguistics in North America: Selections From the 1999 Symposium.
Rita C. Simpson and John M. Swales (Eds.). Ann Arbor: University of Michigan Press, 2001. Pp. vi + 241.

Learner English on Computer.


Sylviane Granger (Ed.). London: Longman, 1998. Pp. xxii + 228.
The two collections of articles in the field of corpus linguistics reviewed here address issues and report studies of interest for experienced corpus linguists. However, they are also accessible to readers new to the field.

Corpus Linguistics in North America offers concrete, diverse samples of corpus-based scholarship to readers desiring a well-rounded introduction to corpus linguistics. This book, which presents selected papers from a 1999 symposium (the first corpus linguistics symposium in North America), is particularly useful to readers who are considering using corpus linguistics as a tool in their research. However, language teachers who are interested in using corpus-based materials or methods in their classrooms will also find the book relevant.

In Part 1, five articles describe a variety of corpus design projects and analytical tools. Two articles discuss corpora that are valuable resources for spoken academic English: Douglas Biber, Randi Reppen, Victoria Clark, and Jenia Walter provide a concise explanation of the spoken component of the TOEFL 2000 Spoken and Written Academic Language Corpus, and Chris Powell and Rita Simpson describe the Michigan Corpus of Academic Spoken English (MICASE), with particular reference to its Web interface, which makes corpus searches possible for all Internet users. Mark Davies' chapter, "Creating and Using Multimillion-Word Corpora From Web-Based Newspapers," may be of particular interest to readers, as he discusses making a corpus from texts that are publicly available on the World Wide Web. It will also appeal to readers wondering about corpus linguistics in languages other than English; his examples describe the Spanish and Portuguese corpora he has compiled. For those interested in world varieties of English, Charles Meyer discusses the International Corpus of English and the types of comparisons that the corpus will make possible. In the final article in Part 1, Susan Hockey discusses annotations that are often added to corpora, and some well-known software programs and the types of analyses that they facilitate. A bonus for readers new to corpus linguistics is the glossary for Part 1, which covers terms and acronyms related to computers generally (such as ASCII and batch file) and to corpus linguistics specifically (such as KWIC and tagger).

Part 2 includes many examples of actual corpus analyses. Three chapters discuss the application of corpus linguistics to language teaching: Biber argues the importance of findings about the frequency and use of verbs in English for material writers and teachers; Aaron Lawson uses a corpus-based study to critique the coverage of the subjunctive and demonstrative in French language textbooks; and Stephanie Burdine describes the teaching of lexical phrases for disagreement to ESL learners, based on corpus analysis. Two other articles (by John Swales and Bonnie Malczewski and by Anna Mauranen) describe analyses of spoken academic discourse in MICASE; the first focuses on expressions that mark episode boundaries, and the second, on metadiscourse. Two other types of corpus studies are illustrated with an account of the form and function of a single word, remember (Hongyin Tao), and the writing development over time of school-aged children (Reppen).

Corpus Linguistics in North America covers a considerable variety of topics and applications. Although portions of the book may include too many specific, methodological details to interest readers new to the field, all readers should come away with new research and teaching possibilities.

In contrast to the more general view of corpus linguistics in Corpus Linguistics in North America, Learner English on Computer focuses on a particular area, computer learner corpus (CLC) research. This relatively recent development within the field involves the application of the methods of corpus linguistics to examining large, electronic databases of learner language in order to increase the understanding of interlanguage and second language acquisition (SLA) processes. The book comprises 15 papers divided into three main sections.
Part 1 provides a comprehensive introduction to the field of corpus-based research into learner language: Chapters by Granger and by Fanny Meunier discuss the position of CLC research within corpus linguistics, SLA, and English language teaching; describe basic criteria for learner corpus compilation and computer-aided analysis of learner data; and illustrate techniques and software tools for analyzing learner corpora.

The eight studies of learner lexis, grammar, and discourse included in Part 2 provide concrete examples of how such techniques and tools can be used to investigate learner language. The papers are based on various subcorpora of the International Corpus of Learner English (ICLE; see Granger, this issue) and illustrate different applications of contrastive interlanguage analysis, an approach involving the comparison of corpora compiled from learners with different L1s as well as comparisons with corpora of native speaker language. Though the ICLE is limited to learners in European countries, some studies include comparison of as many as seven different L1 groups. The range of features studied is also impressive: vocabulary frequencies (Håkan Ringbom); adjective intensification (Gunter Lorenz); formulaic language (Sylvia DeCock, Granger, Geoffrey Leech, and Tony McEnery); adverbial connectors (Bengt Altenberg and Marie Tapper); direct questions in argumentative writing (Tuija Virtanen); features of reader/writer visibility, such as first- and second-person pronouns and emphatics (Stephanie Petch-Tyson); and frequencies of word categories and sequences of word categories (Granger and Paul Rayson; Jan Aarts and Granger).

Part 3 includes papers looking at how CLC-based studies can be used to improve English language teaching materials, such as grammars (Biber and Reppen), dictionaries (Patrick Gillard and Adam Gadsby), and writing textbooks (Przemyslaw Kaszubski). Teachers might be especially interested in the final two chapters, which give examples of teaching materials that incorporate the use of learner corpora, for both online exercises (John Milton) and classroom activities (Granger and Chris Tribble).

Learner English on Computer provides a useful introduction to the field of corpus-based research into learner language and surveys a wide array of applications of CLC methodology. The book will be of interest not only to anyone planning to carry out corpus-based studies of learner language but also to anyone with a more general interest in corpus linguistics, SLA, English language teaching, or computer-assisted language learning.
FEDERICA BARBIERI
Northern Arizona University
Flagstaff, Arizona, United States

SUZANNE E. B. ECKHARDT
Beloit College
Beloit, Wisconsin, United States

Useful Web Sites for Corpus Linguistics in TESOL


For readers interested in learning more about corpus linguistics and its applications, we have compiled a list of Web sites that we have found especially helpful. We hope that these sites will be useful for ESL professionals wishing to explore corpus linguistic applications for teaching and materials as well as for research.

GENERAL INFORMATION

Corpus Linguistics.
Michael Barlow. http://www.ruf.rice.edu/~barlow/corpus.html

Bookmarks for Corpus-Based Linguists.


David Lee. http://devoted.to/corpora

Links to Corpus Linguistics & Related Sites.


Przemek Kaszubski. http://www.hum.amu.edu.pl/~przemka/corplink.html

General corpus-related informational Web sites are plentiful. However, many are filled with dead links and outdated information. The three Web sites reviewed here are comprehensive, easy to navigate, and updated relatively frequently. All three include links to multiple, varied corpora; word lists; software; and other information relevant to corpus linguistics. Barlow's site is a starting point for a wide variety of information. The site provides links to texts and corpora in more than 21 languages and extensive links to software and taggers, including Barlow's own MonoConc and ParaConc (for parallel corpora in up to four languages). Lee's Bookmarks for Corpus-Based Linguists is notable for its amiable explanations of site use and purpose. The site primarily targets language teachers and applied linguists, but it includes an extensive set of links to sites of interest to computational linguists. Kaszubski's site incorporates additional information about lexicography, online publishers and journals, and text distributors.

CORPORA

Cobuild.
Collins Cobuild. http://titania.cobuild.collins.co.uk

MICASE: Michigan Corpus of Academic Spoken English.


University of Michigan, Humanities Text Initiative. http://www.hti.umich.edu/m/micase

British National Corpus.


Oxford University Computing Services. http://www.hcu.ox.ac.uk/BNC

American National Corpus.


American National Corpus Consortium. http://americannationalcorpus.org

International Corpus of English.


University College London. http://www.ucl.ac.uk/english-usage/ice/

Linguistic Data Consortium.


University of Pennsylvania. http://www.ldc.upenn.edu/

Of the numerous corpora available for purchase or use over the Internet, the six reviewed here are some of the best known. The first three (Collins Cobuild, the Michigan Corpus of Academic Spoken English [MICASE], and the British National Corpus [BNC]) allow at least basic searches for free. The last three (the American National Corpus [ANC], the International Corpus of English [ICE], and the Linguistic Data Consortium [LDC]) are not corpora that are searchable online but provide examples of the types of corpora available.

Collins Cobuild's teacher-friendly site has language games using concordance searches and instructions for creating activities for students using a concordancer. The MICASE site enables users to search the 2-million-word online corpus or browse transcripts by speaker and speech attributes. Search results can be sorted online, a useful feature. The BNC site includes information about its 100-million-word corpus and a bibliography of research articles using BNC data. The free search tool is limited, but the site is well designed and accessible.

Still under construction, the ANC project aims to build a resource for American English that is comparable to the BNC. Although the first installment of the corpus, planned for fall 2002, has been delayed, future developments on this site are worth watching for. The ICE is a collection of English varieties spoken in numerous countries. The sample sound files available on this site make it unique. Finally, the LDC site includes the consortium's catalogue of more than 200 corpora in a variety of languages. This site is not the best one for beginners, but it gives a sense of how many different corpora are available.

CONCORDANCERS

TACTWeb 1.0.
John Bradley and Geoffrey Rockwell. http://tactweb.humanities.mcmaster.ca/tactweb/doc/tact.htm

The Web Concordances and Workbooks.


R. J. C. Watt. http://www.dundee.ac.uk/english/wics/wics.htm

Athelstan.
http://www.athel.com

WordSmith Tools.
Mike Scott. http://www.lexically.net/wordsmith/version3/index.htm

The Survey of English Usage.


University College London. http://www.ucl.ac.uk/english-usage/home.htm

Conc: A Concordance Generator for the Macintosh.


SIL International. http://www.sil.org/computing/conc/

Emerging Technologies: Tools and Trends in Corpora Use for Teaching and Learning.
Bob Godwin-Jones. Language Learning & Technology, Vol. 5, No. 3 (September 2001), pp. 7–12. http://llt.msu.edu/vol5num3/emerging/default.html

Concordancers are the most commonly used type of software within corpus linguistics. For those unfamiliar with concordancers, an easy way to become more familiar with them is to try them on a Web site. A number of concordancers are available online and are generally associated with the corpora they were designed to search. Of those mentioned in the section on corpora, Cobuild is the simplest but the most limited in function. The MICASE concordancer allows certain parameters, such as speech event type, academic discipline, and speaker gender, to serve as search limiters. The Text Analysis Computing Tool (TACT) allows queries of a Shakespearean text and has links to other sites where the concordancer can be used. Online concordancing of English literary texts is possible at The Web Concordances and Workbooks. A demo version of its concordancer, Concordance, can be downloaded for use on other texts.

Free downloadable concordancers are also available, such as demo versions of ParaConc and MonoConc on Athelstan's site and WordSmith Tools on Mike Scott's site. The Survey of English Usage's ICECUP, from University College London, provides a sample of the ICE. SIL International gives a good introduction to concordancing along with its concordancer for the Macintosh computer, Conc. An excellent source for links to concordancing information (especially for teaching) is Emerging Technologies, a column in a special issue of Language Learning & Technology. (When downloading a concordancer, note that concordancing can be frustratingly slow on older computers with inadequate processing speeds.)
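
For readers who have not yet seen concordancer output, the core idea (listing every occurrence of a search word with a fixed window of surrounding text, the key-word-in-context, or KWIC, display) can be sketched in a few lines of Python. This is only an illustration and is not the method used by any of the tools reviewed here; the function name kwic, the width parameter, and the sample text are our own assumptions.

```python
import re


def kwic(text, keyword, width=30):
    """Return one key-word-in-context line per whole-word match of `keyword`.

    `width` is the number of characters of context kept on each side
    (a hypothetical parameter chosen for this sketch).
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword forms a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Invented sample text, used only to demonstrate the display.
sample = (
    "I remember the first lecture clearly. Do you remember "
    "what the professor said? Students often remember examples "
    "better than definitions."
)

for line in kwic(sample, "remember", width=25):
    print(line)
```

Sorting the returned lines by the text to the right or left of the keyword would approximate the sorted displays that some of the online tools offer.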


TEACHING APPLICATIONS

Concordancing with Language Learners: Why? When? What?


Vance Stevens. CÆLL Journal, Vol. 6, No. 2 (Summer 1995), pp. 2–10. http://www.ruf.rice.edu/~barlow/stevens.html

Is There Any Measurable Learning From Hands-On Concordancing?


Tom Cobb. System, Vol. 25, No. 3, pp. 301–315. http://www.er.uqam.ca/nobel/r21270/cv/Hands_on.html

The Academic Word List.


Averil Coxhead. http://www.vuw.ac.nz/lals/staff/averil_coxhead/

Virtual Data-Driven Learning Library.


Tim Johns. http://web.bham.ac.uk/johnstf/ddl_lib.htm

Specifically addressing ESL pedagogy, a number of Web sites illustrate how corpus linguistics is being used to enhance classroom teaching. Stevens describes the use of concordancers in teaching, including sources for both concordancers and texts. Cobb describes a study of students learning vocabulary by using a corpus and concordancers, one of the few empirical studies of the effectiveness of concordancing. For English for academic purposes educators, an excellent Web site is Coxhead's home page, which describes the corpus-based development of a new academic word list (see also Coxhead, 2000). The page has valuable links to other vocabulary-related sites. Finally, Johns' Virtual Data-Driven Learning Library covers concordance-based materials and studies, including handouts for grammar classes and vocabulary lessons.

Although we have found the above-mentioned sites especially helpful, new sites are constantly appearing. We encourage readers not only to explore the sites we suggest but also to search the Web for new sites that match their own interests.
REFERENCE
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34, 213–238.

LESLIE BOYD, JENNIFER GARLAND, CAROL RADICH, and DEBORAH SAARI
Portland State University
Portland, Oregon, United States


INFORMATION FOR CONTRIBUTORS


EDITORIAL POLICY
TESOL Quarterly, a professional, refereed journal, encourages submission of previously unpublished articles on topics of significance to individuals concerned with the teaching of English as a second or foreign language and of standard English as a second dialect. As a publication that represents a variety of cross-disciplinary interests, both theoretical and practical, the Quarterly invites manuscripts on a wide range of topics, especially in the following areas:
1. psychology and sociology of language learning and teaching; issues in research and research methodology
2. curriculum design and development; instructional methods, materials, and techniques
3. testing and evaluation
4. professional preparation
5. language planning
6. professional standards

Because the Quarterly is committed to publishing manuscripts that contribute to bridging theory and practice in our profession, it particularly welcomes submissions drawing on relevant research (e.g., in anthropology, applied and theoretical linguistics, communication, education, English education [including reading and writing theory], psycholinguistics, psychology, first and second language acquisition, sociolinguistics, and sociology) and addressing implications and applications of this research to issues in our profession. The Quarterly prefers that all submissions be written so that their content is accessible to a broad readership, including those individuals who may not have familiarity with the subject matter addressed. TESOL Quarterly is an international journal. It welcomes submissions from English language contexts around the world.

GENERAL INFORMATION FOR AUTHORS

Submission Categories


TESOL Quarterly invites submissions in five categories:

Full-length articles. Contributors are strongly encouraged to submit manuscripts of no more than 20–25 double-spaced pages or 8,500 words (including references, notes, and tables). Submit three copies plus three copies of an informative abstract of not more than 200 words. If possible, indicate the number of words at the end of the article. To facilitate the blind review process, authors' names should appear only on a cover sheet, not on the title page; do not use running heads. Submit manuscripts to the Editor of TESOL Quarterly:

INFORMATION FOR CONTRIBUTORS TESOL QUARTERLY Vol. 37, No. 3, Autumn 2003


Carol A. Chapelle
Department of English
203 Ross Hall
Iowa State University
Ames, IA 50011-1201 USA

The following factors are considered when evaluating the suitability of a manuscript for publication in TESOL Quarterly:
The manuscript appeals to the general interests of TESOL Quarterly's readership.
The manuscript strengthens the relationship between theory and practice: Practical articles must be anchored in theory, and theoretical articles and reports of research must contain a discussion of implications or applications for practice.
The content of the manuscript is accessible to the broad readership of the Quarterly, not only to specialists in the area addressed.
The manuscript offers a new, original insight or interpretation and not just a restatement of others' ideas and views.
The manuscript makes a significant (practical, useful, plausible) contribution to the field.
The manuscript is likely to arouse readers' interest.
The manuscript reflects sound scholarship and research design with appropriate, correctly interpreted references to other authors and works.
The manuscript is well written and organized and conforms to the specifications of the Publication Manual of the American Psychological Association (4th ed.).

Reviews. TESOL Quarterly invites succinct, evaluative reviews of professional books. Reviews should provide a descriptive and evaluative summary and a brief discussion of the significance of the work in the context of current theory and practice. Submissions should generally be no longer than 500 words. Send one copy by e-mail to the Review Editor:

Roberta Vann
rvann@iastate.edu

Review Articles. TESOL Quarterly also welcomes occasional review articles, that is, comparative discussions of several publications that fall into a topical category (e.g., pronunciation, literacy training, teaching methodology). Review articles should provide a description and evaluative comparison of the materials and discuss the relative significance of the works in the context of current theory and practice. Submissions should generally be no longer than 1,500 words. Submit two copies of the review article to the Review Editor at the address given above.


Brief Reports and Summaries. TESOL Quarterly also invites short reports on any aspect of theory and practice in our profession. We encourage manuscripts that either present preliminary findings or focus on some aspect of a larger study. In all cases, the discussion of issues should be supported by empirical evidence, collected through qualitative or quantitative investigations. Reports or summaries should present key concepts and results in a manner that will make the research accessible to our diverse readership. Submissions to this section should be 7–10 double-spaced pages, or 3,400 words (including references, notes, and tables). If possible, indicate the number of words at the end of the report. Longer articles do not appear in this section and should be submitted to the Editor of TESOL Quarterly for review. Send one copy of the manuscript each to:

Cathie Elder
Department of Applied Language Studies and Linguistics
University of Auckland
Private Bag 92019
Auckland, New Zealand

Paula Golombek
305 Sparks Building
Pennsylvania State University
University Park, PA 16802 USA

The Forum. TESOL Quarterly welcomes comments and reactions from readers regarding specific aspects or practices of our profession. Responses to published articles and reviews are also welcome; unfortunately, we are not able to publish responses to previous exchanges. Contributions to The Forum should generally be no longer than 7–10 double-spaced pages or 3,400 words. If possible, indicate the number of words at the end of the contribution. Submit three copies to the Editor of TESOL Quarterly at the address given above.

Brief discussions of qualitative and quantitative Research Issues and of Teaching Issues are also published in The Forum. Although these contributions are typically solicited, readers may send topic suggestions or make known their availability as contributors by writing directly to the Editors of these subsections.

Research Issues:
Patricia A. Duff
Department of Language and Literacy Education
University of British Columbia
2125 Main Mall
Vancouver, BC V6T 1Z4
Canada

Teaching Issues:
Bonny Norton
Department of Language and Literacy Education
University of British Columbia
2125 Main Mall
Vancouver, BC V6T 1Z4
Canada

Special-Topic Issues. Typically, one issue per volume will be devoted to a special topic. Topics are approved by the Editorial Advisory Board of the Quarterly. Those wishing to suggest topics or make known their availability as guest editors should contact the Editor of TESOL Quarterly. Issues will generally contain both invited articles designed to survey and illuminate central themes as well as articles solicited through a call for papers.

General Submission Guidelines


1. All submissions to the Quarterly should conform to the requirements of the Publication Manual of the American Psychological Association (4th ed.), which can be obtained from the American Psychological Association, Book Order Department, Dept. KK, P.O. Box 92984, Washington, DC 20090-2984 USA. Orders from the United Kingdom, Europe, Africa, or the Middle East should be sent to American Psychological Association, Dept. KK, 3 Henrietta Street, Covent Garden, London, WC2E 8LU, England. For more information, e-mail order@apa.org or consult http://www.apa.org/books/ordering.html.
2. All submissions to TESOL Quarterly should be accompanied by a cover letter that includes a full mailing address and both a daytime and an evening telephone number. Where available, authors should include an electronic mail address and fax number.
3. Authors of full-length articles, Brief Reports and Summaries, and Forum contributions should include two copies of a very brief biographical statement (in sentence form, maximum 50 words), plus any special notations or acknowledgments that they would like to have included. Double spacing should be used throughout.
4. TESOL Quarterly provides 25 free reprints of published full-length articles and 10 reprints of material published in the Reviews, Brief Reports and Summaries, and The Forum sections.
5. Manuscripts submitted to TESOL Quarterly cannot be returned to authors. Authors should be sure to keep a copy for themselves.
6. It is understood that manuscripts submitted to TESOL Quarterly have not been previously published and are not under consideration for publication elsewhere.
7. It is the responsibility of the author(s) of a manuscript submitted to TESOL Quarterly to indicate to the Editor the existence of any work already published (or under consideration for publication elsewhere) by the author(s) that is similar in content to that of the manuscript.
8. The Editor of TESOL Quarterly reserves the right to make editorial changes in any manuscript accepted for publication to enhance clarity or style. The author will be consulted only if the editing has been substantial.
9. The views expressed by contributors to TESOL Quarterly do not necessarily reflect those of the Editor, the Editorial Advisory Board, or TESOL. Material published in the Quarterly should not be construed to have the endorsement of TESOL.

Informed Consent Guidelines


TESOL Quarterly expects authors to adhere to ethical and legal standards for work with human subjects. Although we are aware that such standards vary among institutions and countries, we require authors and contributors to meet, as a minimum, the conditions detailed below before submitting a manuscript for review. TESOL recognizes that some institutions may require research proposals to satisfy additional requirements. If you wish to discuss whether or how your study met these guidelines, you may e-mail the managing editor of TESOL publications at tq@tesol.org or call 703-535-7852.

As an author, you will be asked to sign a statement indicating that you have complied with Option A or Option B before TESOL will publish your work.
A. You have followed the human subjects review procedure established by your institution.
B. If you are not bound by an institutional review process, or if it does not meet the requirements outlined below, you have complied with the following conditions.

Participation in the Research
1. You have informed participants in your study, sample, class, group, or program that you will be conducting research in which they will be the participants or that you would like to write about them for publication.
2. You have given each participant a clear statement of the purpose of your research or the basic outline of what you would like to explore in writing, making it clear that research and writing are dynamic activities that may shift in focus as they occur.
3. You have explained the procedure you will follow in the research project or the types of information you will be collecting for your writing.
4. You have explained that participation is voluntary, that there is no penalty for refusing to participate, and that the participants may withdraw at any time without penalty.
5. You have explained to participants if and how their confidentiality will be protected.
6. You have given participants sufficient contact information that they can reach you for answers to questions regarding the research.
7. You have explained to participants any foreseeable risks and discomforts involved in agreeing to cooperate (e.g., seeing work with errors in print).
8. You have explained to participants any possible direct benefits of participating (e.g., receiving a copy of the article or chapter).
9. You have obtained from each participant (or from the participant's parent or guardian) a signed consent form that sets out the terms of your agreement with the participants and have kept these forms on file (TESOL will not ask to see them).

Consent to Publish Student Work
10. If you will be collecting samples of student work with the intention of publishing them, either anonymously or with attribution, you have made that clear to the participants in writing.
11. If the sample of student work (e.g., a signed drawing or signed piece of writing) will be published with the student's real name visible, you have obtained a signed consent form and will include that form when you submit your manuscript for review and editing (see http://www.tesol.org/pubs/author/consent.html for samples).
12. If your research or writing involves minors (persons under age 18), you have supplied and obtained signed separate informed consent forms from the parent or guardian and from the minor, if he or she is old enough to read, understand, and sign the form.
13. If you are working with participants who do not speak English well or are intellectually disabled, you have written the consent forms in a language that the participant or the participant's guardian can understand.

GUIDELINES FOR QUANTITATIVE AND QUALITATIVE RESEARCH


Because of the importance of substantive findings reported in TESOL Quarterly, in addition to the role that the Quarterly plays in modeling research in the field, articles must meet high standards in reporting research. To support this goal, the Spring 2003 issue of TESOL Quarterly (Vol. 37, No. 1) contains guidelines for reporting quantitative research and three types of qualitative research: case studies, conversation analysis, and (critical) ethnography. Each set of guidelines contains an explanation of the expectations for research articles within a particular tradition and provides references for additional guidance. The guidelines are also published on TESOL's Web site (http://www.tesol.org/pubs/author/serials/tqguides.html).
