Keywords: authorship attribution, stylistics, statistics
Willard McCarty's recent posting on "Humanist" (Vol. 10, No. 137) "Communication and Memory" points out one of these problems, "...scholarship in the field is significantly inhibited, I would argue, by the low degree to which previous work in humanities computing and current work in related fields is known and recognized."
A major indication that there are problems in a field is when there is no consensus as to correct methodology or technique. Every area of authorship attribution studies has this problem -- research, experimental set-up, linguistic methods, statistical methods, and so on.
It seems that for every paper announcing an authorship attribution method that "works," or a variation of one of these methods, there is a counter paper pointing out crucial flaws.
The time has come to sit back, review, digest, and then present a theoretical framework to guide future authorship attribution studies.
The first paper, by David Holmes, will give the necessary history, scope, and present direction of authorship attribution studies with particular emphasis on recent trends.
The second paper, by Harald Baayen and Fiona Tweedie, will focus on one problem: the use of so-called constants in authorship attribution questions.
The third paper, by Joseph Rudman, will point out some of the problems that are keeping authorship attribution studies from being universally accepted and will offer suggestions on how these problems can be overcome.
Stylometry - the statistical analysis of literary style - complements traditional literary scholarship since it offers a means of capturing the often elusive character of an author's style by quantifying some of its features. Most stylometric studies employ items of language and most of these items are lexically based. A sound exposition of the rationale behind such studies has been provided by Laan (1995).
The main assumption underlying stylometric studies is that authors have an unconscious as well as a conscious aspect to their style. Every author's style is thought to have certain features that are independent of the author's will; since these features cannot be consciously manipulated by the author, they are considered to provide the most reliable data for a stylometric study. The two primary applications are attributional studies and chronological problems, yet a difference in date or author is not the only possible explanation for stylistic peculiarities. Variation in style can be caused by differences of genre or content, and similarity by literary processes such as imitation.
By measuring and counting stylistic traits, we hope to discover the 'characteristics' of a particular author. This paper looks at criteria which may serve as a basis of measurement within the context of stylometry's origins and historical development.
The idea of using sets (at least 50 strong) of common high-frequency words and conducting what is essentially a principal components analysis on the data has been developed by Burrows (1987) and represents a landmark in the development of stylometry. The technique is very much in vogue now as a reliable stylometric procedure and Holmes and Forsyth (1995) have successfully applied it to the classic 'Federalist Papers' problem. Examples of the technique will be displayed.
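The approach can be sketched in miniature as follows. This is a toy illustration, not the Burrows procedure itself: the word set, rate values, and text samples below are invented, and a real study would use at least 50 common words and many samples per author. Principal components are extracted here via the singular value decomposition of the centred rate matrix.

```python
# Sketch of a Burrows-style analysis on hypothetical data: rows are text
# samples, columns are relative frequencies of common function words.
import numpy as np

rates = np.array([  # invented per-1000-word rates for ["the", "of", "and", "to"]
    [62.1, 35.4, 28.9, 25.2],   # sample A1 (author A)
    [60.8, 36.0, 29.5, 24.7],   # sample A2 (author A)
    [55.3, 41.2, 22.1, 30.6],   # sample B1 (author B)
    [54.9, 40.7, 23.0, 31.1],   # sample B2 (author B)
])

centred = rates - rates.mean(axis=0)           # centre each word's column
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ vt.T                        # sample scores on each principal component

# Samples by the same author should cluster on the leading components.
print(scores[:, :2].round(2))
```

Plotting the first two component scores against each other is what typically reveals the authorial clusters.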
Mathematical models for the frequency distributions of the number of vocabulary items appearing exactly r times (r = 1, 2, 3, ...) have aroused the interest of statisticians ever since the work of Zipf (1932). The best fitting model appears to be that attributed to Sichel (1975) and this paper will cover the Sichel model in addition to looking at the behaviour of the once-occurring words (hapax legomena) and twice-occurring words (hapax dislegomena) as useful stylometric tools.
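Extracting the frequency spectrum, and with it the hapax legomena and dislegomena counts, is straightforward; a minimal sketch over a toy token list (the sample sentence is invented for illustration):

```python
# Count hapax legomena (words occurring once) and hapax dislegomena
# (words occurring twice) from a simple whitespace-tokenised text.
from collections import Counter

tokens = "the cat sat on the mat and the dog sat by the door".split()
freq = Counter(tokens)               # word -> number of occurrences
spectrum = Counter(freq.values())    # r -> number of words occurring exactly r times

v = len(freq)        # vocabulary size V(N)
v1 = spectrum[1]     # hapax legomena
v2 = spectrum[2]     # hapax dislegomena
print(v, v1, v2)     # → 9 7 1
```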
Stylometry, though, presents no threat to traditional scholarship. In the context of authorship attribution, stylometric evidence must be weighed in the balance along with that provided by more conventional studies made by literary scholars.
Burrows, J.F. "Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style", Literary and Linguistic Computing, 2, 61-70, (1987).
Forsyth, R.S. and Holmes, D.I. "Feature-Finding for Text Classification", Literary and Linguistic Computing, 11, 4, (1996).
Holmes, D.I. and Forsyth, R.S. "The 'Federalist' Revisited: New Directions in Authorship Attribution", Literary and Linguistic Computing, 10, 111-127, (1995).
Laan, N.M. "Stylometry and Method. The Case of Euripides", Literary and Linguistic Computing, 10, 271-278, (1995).
Lowe, D. and Matthews, R. "Shakespeare vs. Fletcher: A Stylometric Analysis by Radial Basis Functions", Computers and the Humanities, 29, 449-461, (1995).
Martindale, C. and McKenzie, D. "On the Utility of Content Analysis in Author Attribution: The 'Federalist'", Computers and the Humanities, 29, 259-270, (1995).
Mendenhall, T.C., "The Characteristic Curves of Composition", Science, IX, 237-249, (1887).
Merriam, T. and Matthews, R. "Neural Computation in Stylometry II: An Application to the Works of Shakespeare and Marlowe", Literary and Linguistic Computing, 9, 1-6, (1994).
Morton, A.Q., "The Authorship of Greek Prose", Journal of the Royal Statistical Society (A), 128, 169-233, (1965).
Morton, A.Q., Literary Detection, New York: Scribners, (1978).
Mosteller, F. and Wallace, D.L., Inference and Disputed Authorship: The Federalist, Reading: Addison-Wesley, (1964).
Sichel, H.S., "On a Distribution Law for Word Frequencies", Journal of the American Statistical Association, 70, 542-547, (1975).
Smith, M.W.A., "An Investigation of Morton's Method to Distinguish Elizabethan Playwrights", Computers and the Humanities, 19, 3-21, (1985).
Yule, G.U., "On Sentence Length as a Statistical Characteristic of Style in Prose with Application to Two Cases of Disputed Authorship", Biometrika, 30, 363-390, (1938).
Zipf, G.K., Selected Studies of the Principle of Relative Frequency in Language, Harvard University Press, (1932).
Text constants have been developed because the simplest measure of lexical richness, the vocabulary size V(N), varies with the number of tokens in the text, N. In order to remove this dependency, constants have been proposed that are supposed to be independent of N. These range from the simple type-token ratio to more complex measures such as Orlov's Zipf size (Orlov, 1983).
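Several of these measures can be computed directly from a token list. The sketch below uses the commonly cited formulations; note that exact parameter choices (such as Brunet's a = 0.172) vary across the literature, and the toy sentence is invented for illustration:

```python
# Commonly cited forms of several lexical 'constants'.
import math
from collections import Counter

def constants(tokens):
    n = len(tokens)                        # number of tokens, N
    freq = Counter(tokens)
    v = len(freq)                          # vocabulary size, V(N)
    spectrum = Counter(freq.values())      # r -> V(r, N)

    ttr = v / n                                            # type-token ratio
    w = n ** (v ** -0.172)                                 # Brunet's W (a = 0.172)
    # Honore's H; assumes not every word is a hapax (else division by zero)
    h = 100 * math.log(n) / (1 - spectrum[1] / v)
    s = spectrum[2] / v                                    # Sichel's S
    k = 10_000 * (sum(r * r * vr for r, vr in spectrum.items()) - n) / n ** 2  # Yule's K
    return ttr, w, h, s, k

vals = constants("to be or not to be that is the question".split())
print(tuple(round(x, 2) for x in vals))
```

On this ten-token example the type-token ratio is 0.8, Sichel's S is 0.25, and Yule's K is 400.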
Combinations of these constants have been used to investigate problems of authorship (see, for example, Holmes, 1992, and Baayen et al., 1996). The latter study discriminated between two authors at the lexical and syntactic levels using analyses of function words and lexical richness. It found that function words performed better than the constants at both levels, and that the inclusion of syntactic information improved the discrimination. Baayen et al. (1996) concluded that considering the text at a more abstract level, using the reduced variability of the syntactic vocabulary, increases the efficacy of the techniques. Nevertheless, the constants also tap into stylistic properties of texts at a fairly abstract level. In order to evaluate properly the discriminatory potential of the text constants, we must clarify whether and how effectively they capture similarities and differences between authors, and to what extent they are truly constant.
Figure 1 shows plots of four constants for Carroll's Alice in Wonderland that illustrate the main patterns in our survey of 15 measures of lexical richness. Measurements were taken at 20 equally spaced points in the text. The first panel shows Brunet's W to be an increasing function of the text size. The second plot, that of Honore's H, initially appears unstable but becomes less variable above N=13,000. The lower plots show Sichel's S and Yule's K; S is quite variable, while K descends sharply from its initial value and then rises with N.
To evaluate the extent to which violation of the randomness assumption is responsible for the observed variability in the values of the `constants', the order of words in the texts was completely randomised and the measurements retaken. One hundred such randomisations were carried out. The means of the randomisations are shown as points, the maxima and minima by + and - respectively. Crucially, only the mean values for K indicate that its value is theoretically truly constant for randomised text; those for W and H increase and decrease with text size respectively, while S rises and then decreases. It is clear from these graphs, of both the actual and randomised texts, that, far from being stable, the constants are as variable as V, the variable that they were intended to replace.
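The randomisation procedure can be sketched as follows, here using Yule's K on a toy text. The text, seed, prefix length, and run count below are invented for illustration; the study itself applied the procedure to the full texts and to all fifteen measures.

```python
# Randomisation check: shuffle the word order many times and recompute
# Yule's K on an n-token prefix each time, recording min, mean, and max.
import random
from collections import Counter

def yules_k(tokens):
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())  # r -> V(r, N)
    return 10_000 * (sum(r * r * vr for r, vr in spectrum.items()) - n) / n ** 2

def randomised_k(tokens, n, runs=100, seed=1):
    rng = random.Random(seed)
    pool = list(tokens)
    values = []
    for _ in range(runs):
        rng.shuffle(pool)                  # destroy the discourse structure
        values.append(yules_k(pool[:n]))   # measure on an n-token prefix
    return min(values), sum(values) / runs, max(values)

text = ("the quick brown fox jumps over the lazy dog " * 50).split()
print(randomised_k(text, n=200))
```

Repeating this at each of the measurement points along the text yields the randomised developmental curves against which the observed curve is compared.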
In sum, with the exception of K and possibly the Zipf size, constants are not constant in theory, and, without exception, none are constant in practice. The empirical values of the constants are co-determined by the way in which the randomness assumption is violated in running text, namely by coherence in lexical use at the discourse level (see Baayen, 1996).
Thus far we have considered a single text. It is possible that the variability we have observed is very small when compared to other texts and that discrimination is still possible between authors. In order to investigate this we analysed a total of fifteen texts, detailed in Table 1.
The resulting graphs for W, H, S and K are shown in Figure 2. Examining the first plot, it can be seen that, while the value of W varies with N, texts by the same author vary in the same way; the Carroll texts are coincident, as are the James texts and two of the Conan Doyle texts. It is also clear that this is not necessarily the case; the Baum texts are widely separated, as is the third Conan Doyle text from the other pair. A similar structure is found in the graph of H, with slightly different orderings. Turning to S, however, we find that the constant is so variable that it is impossible to separate authors, even at larger text sizes. The plot of K again yields a pattern in which texts are fairly well separated. The different ordering of the texts in this graph indicates that K is measuring a different facet of the lexical structure of these texts. The Conan Doyle texts now group together, as do the Baum texts, but now the Carroll texts diverge.
We have calculated the values for fifteen lexical richness constants and found that the resulting profiles could be classified into four families, exemplified in the graphs above. The largest family of constants is that to which W belongs. Honore's H represents a much smaller family. S comprises the family of constants that are of no discriminatory value. K makes up a family with D, variables that are theoretically constant given the urn model of word distribution within text. Some texts that are separated in the other families are coincident in this family, others are more divergent.
It is clear from the above that several constants measure the same facet of the vocabulary structure. Thus, only those constants with the greatest discriminatory sensitivity within a given family need to be considered. The developmental profiles of the constants show sensitivity to authorship, although this is not absolute in that texts written by the same author may diverge. We have also developed techniques for evaluating the statistical significance of patterns of similarity and dissimilarity in the developmental curves. While the variance of most constants is not known, so that comparisons on the basis of constants for full texts remain impressionistic, we can now evaluate in a more precise way whether or not the developmental profile of a constant differentiates between texts.
Almost all textual constants in our survey are highly variable, and assume values that change systematically as the text size is increased. Some constants are inherently variable, others are truly constant in theory. All constants are substantially influenced by the non-random way in which word usage is governed by discourse cohesion. This variability indicates that the constants cannot be relied on to compare texts of different lengths. Crucially, however, the developmental profiles of the majority of constants have an interesting discriminatory potential, in that they reveal consistent and interpretable patterns that pick up author-specific aspects of word use.
For authorship attribution studies, we strongly recommend the use of the developmental profiles of selected constants, rather than the isolated values of the constants for complete texts. Our data shows, however, that authors are not `prisoners' of their own developmental profile. The discourse structure of texts by the same author can be quite different, and the same holds for the kind of vocabulary an author exploits for a given text. Compared to the use of syntax, word use is more easily influenced by choices which are under the conscious control of authors. Consequently, the developmental profiles of constants are less reliable than syntax-based measures for the purpose of authorship attribution. At the same time, the developmental profiles capture essential differences in word use and discourse structure. From this perspective, we would like to defend their usefulness in the domain of quantitative stylistics.
Table 1. Texts analysed.

Author           Title                                        Key
-----------------------------------------------------------------
Baum, L. F.      The Wonderful Wizard of Oz                   b1
                 Tip Manufactures a Pumpkinhead               b2
Carroll, L.      Alice's Adventures in Wonderland             a1
                 Through the Looking-glass and what           a2
                 Alice found there
Conan Doyle, A.  The Hound of the Baskervilles                c1
                 The Valley of Fear                           c2
                 The Sign of Four                             c3
James            Confidence                                   j1
                 The Europeans                                j2
St Luke          Gospel according to St Luke (KJV)            L1
                 Acts of the Apostles (KJV)                   L2
London, J.       The Sea Wolf                                 l1
                 The Call of the Wild                         l2
Wells, H. G.     The War of the Worlds                        w1
                 The Invisible Man                            w2
A major indication that there are problems in a field is when there is no consensus as to correct methodology or technique. Every area of authorship attribution studies has this problem -- e.g. research, experimental set-up, linguistic methods, statistical methods.
It seems that for every paper announcing an authorship attribution method that "works," or a variation of one of these methods, there is a counter paper pointing out crucial flaws.
Most authorship attribution studies have been governed by expediency.
The problems with the use of statistics in many authorship attribution studies are many and varied. Too many researchers are led into the swampy quicksand of statistical studies by the ignis fatuus of a "more sophisticated statistical technique".
The corrections for many of the specific problems become apparent once the problem is pointed out and there is a consensus that it is indeed a problem. This paper will expand upon, and give examples from published studies of, the following problems. The paper will also give detailed solutions for these problems. One of the "solutions" will be the dissemination of a bibliography of over 500 entries.
Not really knowing the field of the questioned work (e.g. does someone trained in physics know enough about Plato, and all that is involved with the study of the classics, to do a valid authorship attribution study of a questioned Plato work?).
Not knowing the sub-disciplines of authorship attribution studies (e.g. linguistics, statistics, stylistics, computer science).
Not doing the necessary research for each step of the study. (The steps will be shown.)
Not doing a traditional authorship attribution study.
Not knowing when the flaws in the experimental set-up are fatal, and, therefore, not realizing that the study should not be done.
Taking shortcuts and making unverified assumptions with the experimental set-up, the data, and the statistical tests (e.g. poor or wrong controls,"cherry picking").
Ad hominem attacks and self-serving critiques.