
On the design and use of non-traditional authorship attribution methods

Definition of non-traditional

The term non-traditional refers to the use of computational, statistical or mathematical
methods of authorship attribution developed mostly in the twentieth century, as
opposed to traditional methods of authorship largely carried out by literary and
religious scholars in previous centuries. So-called traditional methods include analysis
of literary aspects of style, such as the use of particular types of metaphor and
imagery and other literary devices. Up until the late nineteenth century, most types of
analysis were ‘traditional’. The literary scholar was required to have an appreciation
of a text’s aesthetic qualities and to be able to discourse on these eloquently. Part of
this knowledge lay in the ability to distinguish how different writers created artistic
effects. Emphasis moved, in the early twentieth century to a more critical analysis of
text, and the degree of focus on the author diminished. This approach was referred to
as the New Criticism, exemplified by F.R. Leavis.
The object of the New Criticism was to establish, by close analysis, the interpretation
of the text. The text was said to be independent of author or reader, to be an
autonomous object. The aim of literary criticism was not literature per se but the
“cultural health” of society (Leavis 1934: 2). Hence there was a social object to
literary criticism, particularly urgent in a period widely perceived to be one of social
disintegration. This approach was later challenged by the Post-Modernist movement,
which rejected the idea of textual autonomy in favour of intertextuality. The text was
still seen to be a social object, and the importance of the individual author in its
creation was likewise relegated to secondary importance, but no text could be seen to
be independent of society, and indeed of every member of it. Between New Criticism
and Post-Modernism, then, the significance of the author had been reduced to that of a
mere social functionary. The object was to give birth to the reader, even at the
expense, in Barthes’ words, of “the death of the author”.
Non-traditional methods of authorship attribution arose largely independently of these
social movements and were driven mostly by statisticians and mathematicians with an
interest in applying statistical theory to language. Early efforts included the work of
Mendenhall (1887), de Morgan and – later – Yule (1938, 1944). The most widely
known example of non-traditional attribution is that carried out by Mosteller and
Wallace (1964) on the authorship of the disputed Federalist Papers.

Assumptions of non-traditional methods

The chief assumption of many non-traditional methods is that an author has a
distinctive style profile, often called a ‘linguistic fingerprint’. With this taken for
granted, the difficulties in attributing authorship are characterised as being largely
those of a methodological nature:
It is often recognized that authors have inherent literary styles which serve as "fingerprints" for
their written works. Thus, in principle, one should be able to determine the authorship of
unsigned manuscripts by carefully analyzing the style of the text. The difficulty lies in
characterizing the style of each author, that is, determining which sets of features in a text most
accurately summarize an author's style. When doing a quantitative or statistical analysis of
literary style, the problem is finding adequate numerical representations of an author's inherent
style.
Peng and Hengartner 2002 (OL).

As noted in the above quotation, the difficulty in quantitative analysis is seen as that
of finding adequate material to quantify. The ‘problem’ – or one of them –
is thus often perceived to be the quantity of material available for study, rather than
what is to be studied. The ‘other’ problem is finding the set of features which most
accurately characterizes the “style of each author” (see above). Madigan et al., another
group of computational linguists, write in a similar vein: “Individuals have distinctive
ways of speaking and writing, and there exists a long history of linguistic and stylistic
investigation into authorship attribution”. They claim that, unlike other ‘biometric’
methods, “authorship attribution promises more accurate results and objective
measures of reliability” (Madigan et al., OL). Given the success of DNA methods and
even the humble fingerprint itself, this is an astonishing claim and is totally
unwarranted at the present time. Madigan et al claim their analysis focuses on “very
high-dimensional, topic-free document representations”. Their analysis relies on
complex statistical interpretations of function word densities. They appear to assume,
like many others, that function words are distributed uniformly across text types
whereas, in fact, even the most common word in the language, the, shows widely
differing distributions from one text type to another, as the following tables illustrate:

Table 1: Article Frequency in a range of newspaper articles and email messages

           News     Email
  the      0.074    0.044
  a        0.023    0.022
  an       0.004    0.004

From a sample of text in the author’s collection

Table 2: Distribution of some personal pronouns in a range of newspaper articles and email messages

           News     Email
  I        0.00     0.04
  you      0.00     0.02
  he/she   0.01     0.01

From a sample of text in the author’s collection

From the above tables we see that first and second person pronouns are much more
densely distributed in emails than in news articles, where they are virtually absent,
whereas determiners are more densely distributed in news articles than in emails.
Clearly, the most common function words are not uniformly distributed across text
types.
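The kind of comparison summarised in these tables can be reproduced with a few lines of code. The following sketch is purely illustrative: it assumes two hypothetical plain-text samples, news.txt and email.txt, counts a handful of function words in each, and prints their relative frequencies.

    import re
    from collections import Counter

    # Function words whose relative frequencies we want to compare across text types.
    TARGET_WORDS = ["the", "a", "an", "i", "you", "he", "she"]

    def relative_frequencies(text, targets=TARGET_WORDS):
        """Return the proportion of all word tokens accounted for by each target word."""
        tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenisation, for illustration only
        total = len(tokens) or 1                       # guard against empty input
        counts = Counter(tokens)
        return {word: counts[word] / total for word in targets}

    # Hypothetical sample files: one collection of news articles, one of email messages.
    for label, path in [("News", "news.txt"), ("Email", "email.txt")]:
        with open(path, encoding="utf-8") as handle:
            freqs = relative_frequencies(handle.read())
        print(label, {word: round(value, 3) for word, value in freqs.items()})

The exact densities obtained will depend on the tokenisation chosen; the point of the exercise is simply that the same function word yields markedly different proportions in different text types.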
The three assumptions commonly found in non-traditional studies, therefore, seem to
be open to challenge: that we each have a linguistic fingerprint, that the main obstacle
to finding this ‘fingerprint’ is one of sample size, and that we can find tokens which
are independent of text type or topic.

Authorship markers

It is commonly assumed – as the above quotation from Peng and Hengartner shows –
that authorship is best captured by the measurement of some specific set of markers.
Some authors will attest to the virtue of ‘their’ markers over and above the validity of
any other analyst’s authorship markers.

The nature of variation

Underlying the claim that we each have a linguistic fingerprint is an unstated
assertion: that we do not vary in our use of language or, if we do, the extent to which
we do so is not significant. Note, for example, Peng and Hengartner’s preoccupation
with ‘characterizing the style of each author’ (my italics).

Mathematical models

It is in the nature of scientific braggadocio that no other scientist’s method, unless
canonised by time or universal approbation, is of any worth. Hence, the modern
computational linguist resorts to ever more opaque mathematical models to illustrate
the existence of the linguistic fingerprint. Holmes (1985) provides an account of
several, including Bayesian and Poisson models. More recently there have been
varying types of factor analysis, including principal component analysis, as well as
machine-learning approaches such as neural networks.
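To give a concrete sense of one such technique, the following sketch applies principal component analysis to a small matrix of function-word frequencies. The document labels and frequency values are invented for illustration; the sketch does not reproduce the analysis of Holmes or of any other author cited here.

    import numpy as np
    from sklearn.decomposition import PCA

    # Rows are documents, columns are relative frequencies of a few function words
    # ('the', 'a', 'an', 'I', 'you'); all values below are invented for illustration.
    freqs = np.array([
        [0.074, 0.023, 0.004, 0.000, 0.000],  # news article A
        [0.071, 0.021, 0.005, 0.001, 0.000],  # news article B
        [0.044, 0.022, 0.004, 0.040, 0.020],  # email A
        [0.046, 0.020, 0.003, 0.038, 0.023],  # email B
    ])

    # Project the documents onto their first two principal components.
    pca = PCA(n_components=2)
    scores = pca.fit_transform(freqs)

    for label, (pc1, pc2) in zip(["news A", "news B", "email A", "email B"], scores):
        print(f"{label}: PC1 = {pc1:+.3f}, PC2 = {pc2:+.3f}")

In this toy example the documents separate by text type rather than by author, which is precisely the confound discussed in the preceding sections.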

The identity imperative

In its early stages authorship attribution was the offshoot of literary analysis: the quest
was to know which particular authors were worth reading, first from a moral and then
from an aesthetic perspective. We could only know the answer to this if we could recognise
the style of the ‘principal’ authors of the time. A parallel imperative was the quest for
authority: on whose authority was such-and-such a principle declared to be true? If we
could not state for certain the author of the principle, then the importance of the
principle was open to devaluation. Later, authorship became a kind of parlour game
for under-occupied intellectuals: who ‘wrote’ the Bible, was Shakespeare the author
of the plays and poems attributed to him? More recently, authorship became the
concern of linguists – particularly those engaged in forensic work. The authorship of
texts involved in criminal investigations needed to be ascertained, frequently as a
matter of some urgency. In the current security conscious climate, these questions
have become even more important. If the authors of certain types of terrorist
document can be identified, it is asserted, this will assist in the defeat of terrorism
(e.g. Abbasi and Chen, 2005). Given the current political climate, such assertions can
be likened to offering a desert dweller the promise of an everlasting water supply, and
are, in the view of this author, somewhat irresponsible.

Authorship and mythology

In fact, authorship attribution is surrounded by and embedded in mythology: the
mythology of the linguistic fingerprint, the mythology that individuals have
distinctive sets of authorship markers, the mythology that authorship attribution
‘puzzles’ can be solved by hurling giant computer resources at large bodies of text,
the mythology that enshrines a lack of individual variation, and so on. In my view,
these myths need to be explored and, where necessary, exploded. To this end, in this
research programme I will, specifically, be making the following claims:
(i) that authorship attribution can only be understood as an artefact of
authorship, which is itself a construct derived from author, and that this in
turn is a social construct. All this implies the need to understand the
philosophical and historical significance of the notion of author and the
history of authorship methods;
(ii) that authors vary, for identifiable reasons, and that this variation can be
measured;
(iii) that no particular type of authorship marker has automatic superiority over any
other set of authorship markers: almost any set of markers can be used
depending on text type/s, quantity of texts available for study, and the topics in
those texts;
(iv) that the choice of statistical method used to quantify error rates is a trivial one:
almost any appropriate standard statistical method, provided it is properly and
honestly carried out, is useful. There is thus no need to develop new methods
or make existing ones more complex or opaque;
(v) that any authorship attribution must be undertaken with an understanding of,
and preferably training in, linguistics or possibly psychology or some related
field such as anthropology;
(vi) that non-traditional authorship attribution, provided it is undertaken with care
by suitably trained linguists or those working under linguistic supervision
within a linguistic framework, and provided that the necessary precautions are
observed with regard to statements of probability, is not an inherently difficult
task, and can often be accomplished satisfactorily;
(vii) finally, that authors do not have a linguistic fingerprint, though they have
some core features which nevertheless vary, for a number of reasons that I
will specify.
