Collocation and knowledge production in an academic discourse community

Keith Stuart, Ana Botella Trelis
Universidad Politécnica de Valencia (Spain)

This paper analyses the discourse of science and technology through the study of lexical and grammatical co-selection in research articles. The corpus comprises 1,376 articles from leading specialist journals (a total of 6,104,323 tokens and 71,516 types, giving a type/token ratio of 1.17%). The main criterion for including these articles in the corpus is that they were written by members of the academic discourse community of the Universidad Politécnica de Valencia. The research follows on from work by Gledhill (2000), who analysed the most frequent collocations in a corpus of pharmaceutical research articles and proposed possible functions for these collocations. However, this paper takes a more lexico-grammatical approach, treating collocations as a system of preferred expressions of knowledge in scientific research. The concept of collocation used here is in the Hallidayan tradition, as an intermediate level between syntax and lexis which focuses on recurrent word patterns (Hunston & Francis, 2000). The paper's ultimate objective is to show that collocations, as a system of preferred expressions of knowledge in scientific research, can help us to analyse the knowledge produced in our academic discourse community.

Key words: corpus linguistics, co-selection, collocation, colligation, research articles

Introduction

An important discovery of corpus linguistics has been that there is a level of syntagmatic phrasal organisation which had been largely ignored. Units at this level may be described as n-grams, meaning recurrent strings of uninterrupted word-forms, or, as in Scott (1997), as word clusters. They form part of what has been termed collocation in the Firthian tradition. Evidence from corpora was needed because these syntagmatic structures did not fit into either lexis or grammar, and because they involve facts about frequency which depend on computer technology. As Leech (1992: 106) envisaged, the computer was going to do more than just act as a research tool; it was going to open up new ways of thinking about language by providing more data and better counting. Clear (1993: 274) pointed out that the use of computational (algorithmic and statistical) methods has led to a difference of scale in the corpus data that can be analysed, and that this in turn has led to a qualitative difference in observations about language based on corpus evidence.

This paper explores corpus evidence about collocations as a first step towards establishing, through collocational networks, a conceptual map of the knowledge being produced in our academic discourse community. It analyses the discourse of science and technology through the study of lexical and grammatical co-selection in research articles, in a corpus comprising 1,376 articles (a total of 6,104,323 tokens). The main criteria for choosing these journals for our corpus were that they are cited in the Science Citation Index (SCI), that they are read by our university lecturers and students, and that they are where our lecturers and postgraduate students try to publish their research. All the articles have been written by our lecturers and therefore represent the work of a single academic discourse community, in this case the Universidad Politécnica de Valencia.

The research follows on from work by Gledhill (2000), who analyses the most frequent collocations in a corpus of pharmaceutical research articles and proposes possible functions for these collocations. However, this paper takes a more lexico-grammatical approach, treating collocations as a system of preferred expressions of knowledge in scientific research. We do not, though, restrict the analysis to a strict collocational approach; rather, we investigate the notion of co-selection as describing the general phenomenon of words that habitually keep company, to paraphrase Firth.

A syntagmatic view of language takes account of the contribution of sense and syntax to meaning. The argument that sense and syntax (Sinclair, 1991), or meaning and pattern (Hunston & Francis, 2000), are associated is based on two pieces of evidence. Firstly, meanings tend to be distinguished by differing patterns, and secondly, words with the same pattern sometimes share aspects of meaning. Sinclair (1991: 170) refers to collocation as the occurrence of two or more words within a short space of each other in a text; this could logically refer to co-selection between lexical or grammatical items. Some authors (Firth, 1957; Hoey, 2005: 43) draw a distinction between collocation and colligation, using the former to refer to the co-occurrence of lexical items and the latter to the interrelationship of words and grammatical items (the grammatical company a word or word sequence keeps). Sinclair himself refers to colligation within a collocation context, in terms of collocational frameworks, which are units based on a grammatical, as opposed to a lexical, core (e.g., the/an...of) (Renouf & Sinclair, 1991: 128-143).

Analysis of lexical and grammatical co-selection in our corpus of research articles proceeded by asking three questions:

What are the collocations of X word or words in the corpus?
What meanings do X word or words tend to associate with?
What grammatical constructions (colligation) do X word or words tend to enter into?

The paper's ultimate objective is to analyse collocations as a system of preferred expressions of knowledge in scientific research, one that can help us to analyse the knowledge produced in our academic discourse community.

Method

Once the corpus had been designed and implemented, we proceeded to analyse the data by creating wordlists of technical and semi-technical terms through frequency counts and keyword identification. This process involved initially comparing a general English wordlist (from the 100-million-word BNC) with a wordlist from our corpus. Frequencies were compared and a keyword list was created from our corpus. To compute the "key-ness" of an item, the software used (WordSmith) computes the following values and cross-tabulates them (Scott, 2004):

its frequency in the smaller wordlist (our corpus)
the number of running words in the smaller wordlist (our corpus)
its frequency in the reference corpus (BNC)
the number of running words in the reference corpus (BNC)
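The cross-tabulation of these four values can be illustrated with a short sketch. It assumes the log-likelihood (G2) statistic, which is one of the tests WordSmith offers for keyness; the paper does not state which test was actually used. The function name and the BNC frequency in the example call are hypothetical, while the corpus size and the frequency of results are the figures reported in this paper.

```python
import math


def keyness_g2(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) for one word, from a 2x2 contingency table.

    Rows: study corpus, reference corpus.
    Columns: occurrences of the word, all other running words.
    """
    observed = [
        [freq_study, size_study - freq_study],
        [freq_ref, size_ref - freq_ref],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    grand_total = sum(row_totals)

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the word were equally likely in both corpora.
            expected = row_totals[i] * col_totals[j] / grand_total
            if observed[i][j] > 0:
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * g2


# 9,730 occurrences of "results" in 6,104,323 tokens (figures from this paper).
# The reference frequency of 20,000 is a hypothetical placeholder, not a BNC count.
score = keyness_g2(9730, 6104323, 20000, 100000000)
print(round(score, 1))
```

A word whose relative frequency is identical in both corpora scores zero; the further the two relative frequencies diverge, the higher the score.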

Once we had established the candidate terms to be analysed, the software extracted collocations for these terms and exported them to an Excel spreadsheet. Collocates of terms were extracted from the entire corpus within a span of five words on either side of the node term. The candidate terms selected for this analysis were the following:
1.- Semi-technical words which are very frequent in the corpus and constitute significant examples of both lexical and grammatical co-occurrences (collocations and colligations), for example, results, system, model, etc.
2.- Semi-technical and technical words which tend to appear next to or near certain terms, producing relevant semantic content which represents knowledge generated at our institution.
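The span-based extraction described above can be sketched as follows. This is only an illustration of the idea, not the software used in the study: the tokeniser is a deliberately crude assumption (lowercased alphabetic strings), and the function names are hypothetical. Left and right counts are kept separately because the position of a collocate relative to the node matters in the results that follow.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokeniser: lowercase alphabetic word-forms only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def collocates(tokens, node, span=5):
    """Count words occurring within `span` tokens to the left and right of `node`."""
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for word in tokens[max(0, i - span):i]:
            left[word] += 1
        for word in tokens[i + 1:i + 1 + span]:
            right[word] += 1
    return left, right


sample = "The results obtained with the model agree with the experimental results"
left, right = collocates(tokenize(sample), "results")
for word in sorted(set(left) | set(right)):
    # Total, total left, total right for each collocate.
    print(word, left[word] + right[word], left[word], right[word])
```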

Results

The first example we would like to present in this paper is the term results, as it is the most frequent semi-technical term in the UPV corpus (9,730 occurrences). Moreover, this term gives us clear examples of lexical associations, not only in three-word recurrent patterns (clusters) but also in longer strings. It is worth noting that the most frequent collocation found, the results obtained, is followed by different prepositions (depending on the noun group that follows the preposition).
TABLE 1. Obtained as a collocate of results

Cluster                        Frequency
the results obtained           636
the results obtained for       98
the results obtained with      97
the results obtained by        82
the results obtained in        82
the results obtained from      56

Collocates for the term results fall into three categories: evaluative adjectives (experimental, similar, good, different, previous), past participle adjectives/passive structures (obtained, shown, presented, compared) and active verbs (show, indicate, present). The position of a collocate relative to the node (before or after it) is clearly fixed in some of these examples and is therefore relevant in those cases.
TABLE 2. Most frequent collocates of results

Collocate       With       Total    Total Left    Total Right
obtained        results    1457     191           1266
experimental    results    627      558           69
show            results    591      90            501
discussion      results    526      66            460
shown           results    426      82            344
presented       results    284      48            236
similar         results    277      199           78
good            results    251      170           81
different       results    236      104           132
simulation      results    217      174           43
compared        results    201      68            133
between         results    194      136           58
indicate        results    186      8             178
agreement       results    185      112           73
analysis        results    176      96            80
observed        results    156      96            60
given           results    149      37            112
present         results    149      101           48
previous        results    149      88            61

It is also worth noting that the 5-word clusters containing results differ from the shorter patterns shown above.
TABLE 3. 5-word clusters with results

Cluster                                  Frequency
in agreement with the results            20
often leads to misleading results        12
according to the results obtained        10
basic notions and preliminary results    9
the basis of the results                 9
taking into account the results          9
on the basis of the results              9

The collocates and clusters found for the same word in the singular (result) differ substantially from those found in the plural form.
TABLE 4. 3-word clusters for result

Cluster                 Frequency
as a result             329
the following result    282
the result of           250
a result of             228
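Cluster counts of the kind shown in Tables 1, 3 and 4 can be approximated by counting every uninterrupted n-word string that contains the node. The sketch below is a minimal illustration under that assumption; the function name is hypothetical, the sample sentence is invented for the example, and the actual tool may apply further settings (such as minimum frequency thresholds) that are not modelled here.

```python
from collections import Counter


def clusters(tokens, node, n=3):
    """Count n-word uninterrupted strings (n-grams) that contain the node word."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if node in gram:
            counts[" ".join(gram)] += 1
    return counts


# Invented sample text, used only to exercise the function.
tokens = "as a result of this the result of the test is as a result clear".split()
for cluster, freq in clusters(tokens, "result", n=3).most_common(3):
    print(cluster, freq)
```

Because the singular and the plural are counted as separate nodes, running the same function with result and with results keeps the two sets of clusters apart, as in the tables above.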

System is the second most frequent semi-technical word in the corpus (8,205 occurrences). Two findings from the analysis of this term are worth mentioning. First, there are repeated lexical patterns indicating types of system, such as closed loop system, the decision system, the control system and the physical system. Second, system appears in longer noun groups: N periodic forward backward system, a net rewriting system, discrete time periodic control system. The plural systems shows no relevant clusters. Its principal collocates are:
TABLE 5. Collocates of systems

Collocate      With       Frequency
control        systems    218
fuzzy          systems    207
time           systems    152
linear         systems    133
distributed    systems    112

The next term selected for analysis was model (7,834 occurrences). It is a semi-technical word which shows a fixed structure (colligation), the X of the model, as in the correctness of the model, the values of the model, the structure of the model and the results of the model. We find both structures: parameters of the model and the set of model parameters. It co-occurs with other semi-technical terms, forming lexical units with an important semantic load in the texts belonging to our corpus: the conceptual model, the workflow model, the theoretical model, the mathematical model, the execution model, the plane wave model, the heat transfer model, the simulation model. A substantial number of noun groups are also found in 4- and 5-word clusters.
TABLE 6. 4- and 5-word clusters for model

Cluster                                          Frequency
the hybrid language model                        39
level iv fugacity model                          34
enterprise planning function descriptive model   24
zero dimensional diesel combustion model         14
age dependent reliability model                  13
a bernoulli mixture model                        8

Significant examples are offered by the term method in 3-word clusters, as it comes together with other nouns (proper names) providing us with relevant information related to scientific knowledge:
TABLE 7. 3-word clusters for method

Cluster                       Frequency
lax wendroff method           27
jacobi davidson method        24
subspace iteration method     21
second degree method          19
cauchy kovalevskaya method    18


Other noun groups, although less frequent, are formed with this semi-technical word: the traditional pile salting method, the discrete analytical stiffness derivative method, proposed shape restricted snake method, the split step Fourier method, etc.

Another term which shows a fixed pattern of use is performance, with the performance of being the most frequent cluster found (494). In our corpus this term usually collocates with words that have a positive meaning: evaluative adjectives such as best, better than, high and good, and verbs with positive connotations such as improve, boost and achieve.

The semi-technical term temperature tends to colligate with the preposition at (the pattern at room temperature is the most frequent, with 484 occurrences). Clusters with semantic content are also found with this term: glass transition temperature, the annealing temperature, cooling water temperature, burnt products temperature.

Other examples of semi-technical words in the corpus which show fixed lexical and grammatical patterns are samples and values. In the case of samples, we find a repeated use of passive structures with verbs indicating actions performed by scientists in this context.
TABLE 8. Passive structures with samples

Cluster                  Frequency
samples were taken       50
samples were prepared    31
samples were analysed    23
samples stored at        23
samples treated with     20
samples were dried       19
samples were analyzed    18

Both value and values are used in recurrent combinations with prepositions. We find patterns such as: of a/the X value of, for a/the value of, with the value/s of, with different values of, in the value of, from the values of. These examples with value are similar to Sinclair's collocational frameworks (Renouf & Sinclair, 1991).

One of the most common collocational frameworks in our corpus is the string preposition + the + x + of + y. The most frequent examples found for the preposition in are: in the case of (1,256), in the presence of (957), in the absence of (268), in the range of (141), etc. The results for on are: on the basis of (219), on the use of (124), on the surface of (95), etc. With at we have: at the beginning of (94), at the bottom of (92), at the centre of (90), at the end of (80).

Another collocational framework which is very frequent in the scientific writing of our corpus is under x conditions. Examples found in the corpus are: under these conditions, under the conditions, under certain conditions, under different conditions, under noncavitating conditions, under super critical conditions, etc. This kind of analysis can constitute a useful resource for scientists writing in English as an L2 and for ESP teachers and their students.
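A collocational framework such as preposition + the + x + of can be searched for with a simple pattern that fixes the grammatical words and leaves the lexical slot open. The sketch below assumes a regular expression over raw text and restricts the preposition to in, on and at, the three discussed above; the pattern, the function name and the sample sentence are illustrative assumptions, not the procedure used in the study.

```python
import re
from collections import Counter

# Fixed grammatical core (preposition ... the ... of) with one open lexical slot.
FRAME = re.compile(r"\b(in|on|at)\s+the\s+(\w+)\s+of\b", re.IGNORECASE)


def framework_counts(text):
    """Count fillers of the slot x in 'preposition + the + x + of'."""
    counts = Counter()
    for match in FRAME.finditer(text):
        preposition = match.group(1).lower()
        filler = match.group(2).lower()
        counts[(preposition, filler)] += 1
    return counts


# Invented sample text, used only to exercise the pattern.
sample = ("In the case of noise, and in the presence of acid, "
          "at the end of the run; in the case of failure.")
for (preposition, filler), freq in framework_counts(sample).most_common():
    print(f"{preposition} the {filler} of", freq)
```

The same approach extends to under x conditions by swapping in a pattern with under and conditions as the fixed elements.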


Semi-technical terms with a high degree of frequency in the corpus provide us with information about the knowledge in our community. The association of these terms with other technical terms and the relationships established between some of them will help the linguist to represent knowledge in terms of more or less complex semantic networks.
FIGURE 1. First step towards a collocational network with acid
Central node: ACID (4,083)
Collocates: Sites (416), Amino (379), Concentration (211), Citric (185), Groups (136), Strength (129), Solution (118), Coumaric (114), pH (104), Acetic (97), Catalysts (94), Membrane (88), Cinnamic (77)

FIGURE 2. First step towards a collocational network with sites
Central node: SITES (1,460)
Collocates: Acid (416), Active (174), Number (110), Binding (91), Strength (63), Surface (60), Strong (56), Zeolites (50), Frameworks (50), Concentration (49)
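The two figures can be read as the beginnings of a single graph in which terms are nodes and collocation frequencies are edge weights. The sketch below stores the figures' data in a plain dictionary and looks for collocates shared by both central nodes. The data structure and function names are assumptions made for illustration; only the terms and frequencies come from Figures 1 and 2.

```python
# Collocates and frequencies as given in Figures 1 and 2 (lowercased).
NETWORK = {
    "acid": {
        "sites": 416, "amino": 379, "concentration": 211, "citric": 185,
        "groups": 136, "strength": 129, "solution": 118, "coumaric": 114,
        "ph": 104, "acetic": 97, "catalysts": 94, "membrane": 88,
        "cinnamic": 77,
    },
    "sites": {
        "acid": 416, "active": 174, "number": 110, "binding": 91,
        "strength": 63, "surface": 60, "strong": 56, "zeolites": 50,
        "frameworks": 50, "concentration": 49,
    },
}


def connecting_nodes(network, first, second):
    """Collocates shared by two central nodes, with their frequency for each."""
    shared = set(network[first]) & set(network[second])
    return {word: (network[first][word], network[second][word]) for word in shared}


def edges(network):
    """Undirected edges of the merged network, keyed by the pair of terms."""
    merged = {}
    for node, node_collocates in network.items():
        for collocate, freq in node_collocates.items():
            merged[frozenset((node, collocate))] = freq
    return merged


print(sorted(connecting_nodes(NETWORK, "acid", "sites")))
print(len(edges(NETWORK)))
```

Adding further central nodes to the dictionary would extend the merged network in the same way, one set of collocates at a time.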


Conclusion

What is important about the figures above is not the figures themselves but the points at which they form connecting nodes (acid, sites, strength, concentration). These would expand the network, and it is possible that the whole corpus could be represented in this way. We have described methods and techniques for constructing networks of terms, extracted from text corpora, which show how knowledge in a subject domain is organised. Initial discussions with domain experts have validated the first results being produced with some degree of confidence. We have also shown how candidate terms, including collocations, can be extracted. Once collocational and semantic relational networks are produced, one can begin to describe the knowledge being produced by our academic discourse community, the Universidad Politécnica de Valencia.

References

Clear, J. (1993). From Firth principles: computational tools for the study of collocation in M. Baker, G. Francis & E. Tognini-Bonelli (eds.). Text and Technology: In Honour of John Sinclair. Amsterdam: John Benjamins, 271-292.
Firth, J.R. (1957). Papers in Linguistics 1934-1951. London: Oxford University Press.
Gledhill, C. (2000). Collocations in Science Writing. Tübingen: Gunter Narr Verlag Tübingen.
Hoey, M. (2005). Lexical Priming. London: Routledge.
Hunston, S. & G. Francis (2000). Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English. Amsterdam: John Benjamins.
Leech, G. (1992). Corpora and theories of linguistic performance in J. Svartvik (ed.). Directions in Corpus Linguistics. Berlin: Mouton de Gruyter, 105-122.
Renouf, A. & J. Sinclair (1991). Collocational frameworks in English in K. Aijmer & B. Altenberg (eds.). English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman, 128-143.
Scott, M. (1997). PC Analysis of Key Words -- and Key Key Words. System 25, 1: 1-13.
Scott, M. (2004). WordSmith Tools version 4. Oxford: Oxford University Press.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.