Session

The State of Authorship Attribution Studies: (1) The History and the Scope; (2) The Problems -- Towards Credibility and Validity.

Joe Rudman

Carnegie Mellon University
rudman@cmphys.phys.cmu.edu

David I. Holmes

University of the West of England
david.holmes@csm.uwe.ac.uk

Fiona J. Tweedie

University of Glasgow, United Kingdom
fiona@stats.gla.ac.uk

R. Harald Baayen

Max Planck Institute for Psycholinguistics
baayen@mpi.nl

Keywords: authorship attribution, stylistics, statistics

Session Abstract

There are many serious problems with the science of authorship attribution studies. This session proposes to look at the history of the field, identify many of its major problems, and offer some solutions that will go a long way towards giving the field credibility and validity.

Willard McCarty's recent posting on "Humanist" (Vol. 10, No. 137) "Communication and Memory" points out one of these problems, "...scholarship in the field is significantly inhibited, I would argue, by the low degree to which previous work in humanities computing and current work in related fields is known and recognized."

A major indication that there are problems in a field is when there is no consensus as to correct methodology or technique. Every area of authorship attribution studies has this problem -- research, experimental set-up, linguistic methods, statistical methods....

It seems that for every paper announcing an authorship attribution method that "works" or a variation of one of these methods, there is a counter paper pointing out crucial flaws:

This widespread disagreement has not only kept authorship attribution studies out of most United States court proceedings, but it also threatens to undermine even the legitimate studies in the court of public and professional opinion.

The time has come to sit back, review, digest, and then present a theoretical framework to guide future authorship attribution studies.

The first paper, by David Holmes, will give the necessary history, scope, and present direction of authorship attribution studies with particular emphasis on recent trends.

The second paper, by Harald Baayen and Fiona Tweedie, will focus on one problem: the use of so-called constants in authorship attribution questions.

The third paper, by Joseph Rudman, will point out some of the problems that are keeping authorship attribution studies from being universally accepted and will offer suggestions on how these problems can be overcome.


Stylometry: Its Origins, Development and Aspirations.

David I. Holmes

Introduction

This opening paper of the session reviews the historical development of stylometry up to and including its current standing as a statistical tool within the humanities.

Stylometry - the statistical analysis of literary style - complements traditional literary scholarship since it offers a means of capturing the often elusive character of an author's style by quantifying some of its features. Most stylometric studies employ items of language and most of these items are lexically based. A sound exposition of the rationale behind such studies has been provided by Laan (1995).

The main assumption underlying stylometric studies is that authors have an unconscious as well as a conscious aspect to their style. Every author's style is thought to have certain features that are independent of the author's will, and since these features cannot be consciously manipulated by the author, they are considered to provide the most reliable data for a stylometric study. The two primary applications are attributional studies and chronological problems, yet a difference in date or author is not the only possible explanation for stylistic peculiarities. Variation in style can be caused by differences of genre or content, and similarity by literary processes such as imitation.

By measuring and counting stylistic traits, we hope to discover the 'characteristics' of a particular author. This paper looks at criteria which may serve as a basis of measurement within the context of stylometry's origins and historical development.

Word-length and sentence-length

The origins of stylometry may be traced back to the work of Mendenhall (1887) on word-lengths, and the idea of counting features of a text was extended by Yule (1938) to include sentence-lengths. Morton (1965) used sentence-lengths for tests of authorship of Greek prose, but we now know that neither of these measures is a wholly reliable indicator of authorship.

Function words

Word-usage offers a great many opportunities for discrimination. Some words vary considerably in their rate of use from one work to another by the same author; others show remarkable stability within an author. For discrimination purposes we need context-free or 'function' words, and this paper reviews the seminal work of Mosteller and Wallace (1964) on function word frequencies. Morton (1978) developed techniques of studying the position and immediate context of individual word-occurrences, but his method has come under much criticism, and Smith (1985) has demonstrated that it cannot reliably distinguish between the works of Elizabethan and Jacobean playwrights.

The idea of using sets (at least 50 strong) of common high-frequency words and conducting what is essentially a principal components analysis on the data has been developed by Burrows (1987) and represents a landmark in the development of stylometry. The technique is very much in vogue now as a reliable stylometric procedure and Holmes and Forsyth (1995) have successfully applied it to the classic 'Federalist Papers' problem. Examples of the technique will be displayed.
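As a rough illustration of the kind of analysis described above, the following sketch builds a table of relative frequencies for the most common words in a set of texts and projects the texts onto their first two principal components. It is a minimal sketch under stated assumptions, not a reconstruction of Burrows's own procedure: the tokeniser, the function names (common_word_profile, principal_components), the standardisation of each word's rate, and the use of NumPy's singular value decomposition are all choices made here for illustration. Every text is assumed to contain at least one word.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    # Hypothetical tokeniser: lower-case runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def common_word_profile(texts, n_words=50):
    """Return the n_words most frequent words in the pooled texts and a
    (texts x words) matrix of relative frequencies."""
    token_lists = [tokenize(t) for t in texts]
    pooled = Counter(tok for toks in token_lists for tok in toks)
    vocab = [word for word, _ in pooled.most_common(n_words)]
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        rows.append([counts[word] / len(toks) for word in vocab])
    return vocab, np.array(rows)


def principal_components(rates, n_components=2):
    """Project each text onto the leading principal components of the
    standardised word rates (assumption: equivalent to a PCA on the
    correlation matrix of the word rates)."""
    centred = rates - rates.mean(axis=0)
    std = centred.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero
    z = centred / std
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy usage with invented sample sentences; real studies use long texts.
samples = [
    "the cat and the dog sat on the mat and the rug",
    "a dog and a cat ran to the house of the king",
    "the king of the house sat on a mat and sang to the dog",
    "to the rug and to the mat ran a cat and a dog",
]
words, rate_matrix = common_word_profile(samples, n_words=5)
component_scores = principal_components(rate_matrix)
```

Plotting the two columns of component_scores against each other gives the kind of scatter diagram in which texts by the same author would be expected to cluster, if the method discriminates for the texts in question.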

Vocabulary distributions

One of the fundamental notions in stylometry is the measurement of what is termed the 'richness' or 'diversity' of an author's vocabulary. If we sample a text produced by a writer we might expect the extent of his/her vocabulary to be reflected in the frequency profile of word-usage. This paper reviews measures which may be thought of as 'indices of diversity'.

Mathematical models for the frequency distributions of the number of vocabulary items appearing exactly r times (r = 1, 2, 3, ...) have aroused the interest of statisticians ever since the work of Zipf (1932). The best fitting model appears to be that attributed to Sichel (1975) and this paper will cover the Sichel model in addition to looking at the behaviour of the once-occurring words (hapax legomena) and twice-occurring words (hapax dislegomena) as useful stylometric tools.
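The frequency distribution referred to above can be tabulated directly from a tokenised text. The sketch below is illustrative only: the function name frequency_spectrum and the toy sentence are assumptions made here, and no attempt is made to fit the Sichel model itself.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map r to the number of vocabulary items occurring exactly r times."""
    type_counts = Counter(tokens)          # occurrences of each word type
    return Counter(type_counts.values())   # number of types per occurrence count


# Toy example with an invented sentence.
toy_tokens = "the cat sat on the mat and the dog sat too".split()
spectrum = frequency_spectrum(toy_tokens)

hapax_legomena = spectrum[1]     # types occurring exactly once
hapax_dislegomena = spectrum[2]  # types occurring exactly twice
```

Summing r times the number of types at each r recovers the number of tokens in the text, which is a convenient check on the tabulation.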

Content analysis

Content analysis refers to tabulating the frequency of types of words in a text, the aim being to reach the denotative or connotative meaning of the text. Although content analysis should be useful in stylometry it has seldom been employed, but this paper will review the successful application of content analysis to the 'Federalist' problem by Martindale and McKenzie (1995).

Neural networks

Stylometry is essentially a case of pattern recognition. Neural networks have the ability to recognise the underlying organisation of data which is of vital importance for any pattern recognition problem, so their application in stylometry is both inevitable and welcome. The results achieved by Merriam and Matthews (1994) and by Lowe and Matthews (1995) will be discussed.

The future

As the amount of available computer-readable literary texts continues to increase, we can expect expansion in the use of automated pattern recognition techniques, such as neural networks, to act as 'assistants' to help in the resolution of outstanding authorship disputes. Automated feature finders will be developed (Forsyth and Holmes, 1996) to let the computer take over the task of finding the features that best discriminate between two candidate authors for a disputed text. There will be theoretical advances too, as in the change from lexically based techniques to syntactic annotation proposed by Baayen, Van Halteren and Tweedie (1996).

Stylometry, though, presents no threat to traditional scholarship. In the context of authorship attribution, stylometric evidence must be weighed in the balance along with that provided by more conventional studies made by literary scholars.

References

Baayen, H., Van Halteren, H. and Tweedie, F.J., "Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution", Literary and Linguistic Computing, 11, 121-131, (1996).

Burrows, J.F. "Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style", Literary and Linguistic Computing, 2, 61-70, (1987).

Forsyth, R.S. and Holmes, D.I. "Feature-Finding for Text Classification", Literary and Linguistic Computing, 11, 4, (1996).

Holmes, D.I. and Forsyth, R.S. "The 'Federalist' Revisited: New Directions in Authorship Attribution", Literary and Linguistic Computing, 10, 111-127, (1995).

Laan, N.M. "Stylometry and Method. The Case of Euripides", Literary and Linguistic Computing, 10, 271-278, (1995).

Lowe, D. and Matthews, R. "Shakespeare vs. Fletcher: A Stylometric Analysis by Radial Basis Functions", Computers and the Humanities, 29, 449-461, (1995).

Martindale, C. and McKenzie, D. "On the Utility of Content Analysis in Author Attribution: The 'Federalist'", Computers and the Humanities, 29, 259-270, (1995).

Mendenhall, T.C., "The Characteristic Curves of Composition", Science, IX, 237-249, (1887).

Merriam, T. and Matthews, R. "Neural Computation in Stylometry II: An Application to the Works of Shakespeare and Marlowe", Literary and Linguistic Computing, 9, 1-6, (1994).

Morton, A.Q., "The Authorship of Greek Prose", Journal of the Royal Statistical Society (A), 128, 169-233, (1965).

Morton, A.Q., Literary Detection, New York:Scribners, (1978).

Mosteller, F. and Wallace, D.L., Inference and Disputed Authorship: The Federalist, Reading:Addison-Wesley, (1964).

Sichel, H.S., "On a Distribution Law for Word Frequencies", Journal of the American Statistical Association, 70, 542-547, (1975).

Smith, M.W.A., "An Investigation of Morton's Method to Distinguish Elizabethan Playwrights", Computers and the Humanities, 19, 3-21, (1985).

Yule, G.U., "On Sentence Length as a Statistical Characteristic of Style in Prose with Application to Two Cases of Disputed Authorship", Biometrika, 30, 363-390, (1938).

Zipf, G.K., Selected Studies of the Principle of Relative Frequency in Language, Harvard University Press, (1932).


Lexical 'constants' in stylometry and authorship studies

Fiona J. Tweedie

R. Harald Baayen

Introduction

Various measures of lexical richness have been employed in stylometry and authorship attribution (see, e.g., Holmes, 1994, for a review). These measures have been advanced as characteristic constants whose value is not influenced by the text size. This study investigates in detail to what extent these measures are truly constant, how well they are suited for discriminating authors, and to what extent the values assumed by these measures are influenced by discourse structure (see Baayen, 1996).

Text constants have been developed because the simplest measure of lexical richness, the vocabulary size V(N), varies with the number of tokens in the text, N. In order to remove this dependency, constants have been proposed that are supposed to be independent of N. These range from the simple type-token ratio to more complex measures such as Orlov's Zipf size (Orlov, 1983).
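To make the quantities concrete, the sketch below computes the vocabulary size V(N), the type-token ratio, and four of the measures discussed in this paper (Brunet's W, Honore's H, Sichel's S and Yule's K) for a list of tokens. The abstract does not state the formulas; those used here are the forms commonly given in the stylometric literature and should be read as assumptions, as should the exponent a = 0.17 for W and the function name lexical_measures.

```python
import math
from collections import Counter


def lexical_measures(tokens, a=0.17):
    """Compute a few lexical richness measures for a non-empty token list.

    Assumed formulas (not taken from the abstract):
      TTR = V / N
      K   = 10^4 * (sum_r r^2 * V_r - N) / N^2    (Yule)
      W   = N ** (V ** -a)                        (Brunet)
      H   = 100 * log(N) / (1 - V_1 / V)          (Honore)
      S   = V_2 / V                               (Sichel)
    where V_r is the number of types occurring exactly r times.
    """
    n = len(tokens)
    type_counts = Counter(tokens)
    v = len(type_counts)
    spectrum = Counter(type_counts.values())
    v1, v2 = spectrum[1], spectrum[2]

    k = 1e4 * (sum(r * r * vr for r, vr in spectrum.items()) - n) / (n * n)
    w = n ** (v ** -a)
    # H is undefined when every type is a hapax legomenon.
    h = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    s = v2 / v
    return {"N": n, "V": v, "TTR": v / n, "K": k, "W": w, "H": h, "S": s}


# Toy example with an invented sentence.
measures = lexical_measures("the cat sat on the mat and the dog sat too".split())
```

Calling the function on successively longer prefixes of the same text is one way to see how far each value depends on N.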

Combinations of these constants have been used to investigate problems of authorship (see for example Holmes, 1992 and Baayen et al., 1996). The latter discriminated two authors at lexical and syntactic levels using analyses of function words and lexical richness. They found that function words performed better than the constants at both levels, and that the inclusion of syntactic information improved the discrimination. Baayen et al. (1996) concluded that considering the text at a more abstract level, using the reduced variability of the syntactic vocabulary, increases the efficacy of the techniques. Nevertheless, the constants also tap into stylistic properties of texts at a fairly abstract level. In order to properly evaluate the discriminatory potential of the text constants, we must clarify whether and how effectively they capture similarities and differences between authors, and to what extent they are truly constant.

Validity - Are the Constants Constant?

Figure 1 shows plots of four constants for Carroll's Alice in Wonderland that illustrate the main patterns in our survey of 15 measures of lexical richness. Measurements have been taken at 20 equally-spaced points in the text. The first panel shows Brunet's W to be an increasing function of the text size. The second plot, that of Honore's H, initially appears unstable, but is less variable above N=13,000. The lower plots show Sichel's S and Yule's K; S is quite variable, while K descends sharply from its initial value, then rises with N.

Figure 1

To evaluate the extent to which violation of the randomness assumption is responsible for the observed variability in the values of the 'constants', the order of words in the texts was completely randomised and the measurements retaken. One hundred such randomisations were carried out. The means of the randomisations are shown as points, the maxima and minima by + and - respectively. Crucially, only the mean values for K indicate that its value is theoretically truly constant for randomised text; those for W and H increase and decrease with text size, while S rises then decreases. It is clear from these graphs, both of the actual and randomised texts, that far from being stable, the constants are as variable as V, the variable that they were intended to replace.
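The measurement and randomisation procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names, the fixed random seed, and the choice of Yule's K (in its commonly cited form) as the example measure are made here for illustration, and the text is assumed to contain at least as many tokens as there are measurement points.

```python
import random
from collections import Counter


def yule_k(tokens):
    # Assumed formula: K = 10^4 * (sum_r r^2 * V_r - N) / N^2.
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())
    return 1e4 * (sum(r * r * vr for r, vr in spectrum.items()) - n) / (n * n)


def developmental_profile(tokens, measure, n_points=20):
    """Evaluate `measure` on the first N tokens at n_points equally
    spaced text sizes N, returning (N, value) pairs."""
    sizes = [len(tokens) * i // n_points for i in range(1, n_points + 1)]
    return [(size, measure(tokens[:size])) for size in sizes]


def randomised_profiles(tokens, measure, n_points=20, n_runs=100, seed=0):
    """Shuffle the word order n_runs times and summarise the measure at
    each measurement point by its mean, minimum and maximum."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        shuffled = tokens[:]
        rng.shuffle(shuffled)
        profile = developmental_profile(shuffled, measure, n_points)
        runs.append([value for _, value in profile])
    summary = []
    for values in zip(*runs):  # one tuple of n_runs values per point
        summary.append({"mean": sum(values) / len(values),
                        "min": min(values),
                        "max": max(values)})
    return summary


# Toy usage with an invented token sequence.
toy_tokens = ["a", "b", "c", "a", "d"] * 8
observed = developmental_profile(toy_tokens, yule_k)
randomised = randomised_profiles(toy_tokens, yule_k)
```

Comparing the observed profile with the randomised mean, minimum and maximum at each text size gives an indication of how much of the variation comes from the ordering of words in running text rather than from the measure itself.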

In sum, with the exception of K and possibly the Zipf size, constants are not constant in theory, and, without exception, none are constant in practice. The empirical values of the constants are co-determined by the way in which the randomness assumption is violated in running text, namely by coherence in lexical use at the discourse level (see Baayen, 1996).

Developmental Profiles

Thus far we have considered a single text. It is possible that the variability we have observed is very small when compared to other texts and that discrimination is still possible between authors. In order to investigate this we analysed a total of fifteen texts, detailed in Table 1.

The resulting graphs for W, H, S and K are shown in Figure 2. Examining the first plot, it can be seen that, while the value of W varies with N, texts by the same author vary in the same way; the Carroll texts are coincident, as are the James texts and two of the Conan Doyle texts. It is also clear that this is not necessarily the case; the Baum texts are widely separated, as is the third Conan Doyle text from the other pair. A similar structure is found in the graph of H, with slightly different orderings. Turning to S, however, we find that the constant is so variable that it is impossible to separate authors, even at larger text sizes. The plot of K again yields a pattern in which texts are fairly well separated. The different ordering of the texts in this graph indicates that K is measuring a different facet of the lexical structure of these texts. The Conan Doyle texts now group together, as do the Baum texts, but now the Carroll texts diverge.

Figure 2

We have calculated the values for fifteen lexical richness constants and found that the resulting profiles could be classified into four families, exemplified in the graphs above. The largest family of constants is that to which W belongs. Honore's H represents a much smaller family. S comprises the family of constants that are of no discriminatory value. K makes up a family with D, variables that are theoretically constant given the urn model of word distribution within text. Some texts that are separated in the other families are coincident in this family, others are more divergent.

It is clear from the above that several constants measure the same facet of the vocabulary structure. Thus, only those constants with the greatest discriminatory sensitivity within a given family need to be considered. The developmental profiles of the constants show sensitivity to authorship, although this is not absolute in that texts written by the same author may diverge. We have also developed techniques for evaluating the statistical significance of patterns of similarity and dissimilarity in the developmental curves. While the variance of most constants is not known, so that comparisons on the basis of constants for full texts remain impressionistic, we can now evaluate in a more precise way whether or not the developmental profile of a constant differentiates between texts.

Conclusions

Almost all textual constants in our survey are highly variable, and assume values that change systematically as the text size is increased. Some constants are inherently variable, others are truly constant in theory. All constants are substantially influenced by the non-random way in which word usage is governed by discourse cohesion. This variability indicates that the constants cannot be relied on to compare texts of different lengths. Crucially, however, the developmental profiles of the majority of constants have an interesting discriminatory potential, in that they reveal consistent and interpretable patterns that pick up author-specific aspects of word use.

For authorship attribution studies, we strongly recommend the use of the developmental profiles of selected constants, rather than the isolated values of the constants for complete texts. Our data shows, however, that authors are not 'prisoners' of their own developmental profile. The discourse structure of texts by the same author can be quite different, and the same holds for the kind of vocabulary an author exploits for a given text. Compared to the use of syntax, word use is more easily influenced by choices which are under the conscious control of authors. Consequently, the developmental profiles of constants are less reliable than syntax-based measures for the purpose of authorship attribution. At the same time, the developmental profiles capture essential differences in word use and discourse structure. From this perspective, we would like to defend their usefulness in the domain of quantitative stylistics.

References

Baayen, R. H. (1996) The Randomness Assumption in Word Frequency Statistics. In G. Perissinotto (Ed.), Research in Humanities Computing 5, Oxford: Oxford University Press, 17-31.

Baayen, R. H., van Halteren, H. and Tweedie, F. J. (1996) Outside the Cave of Shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing: 11(3) 121-131.

Holmes, D. I. (1992), A Stylometric Analysis of Mormon Scripture and Related Texts, Journal of the Royal Statistical Society Series A: 155(1) 91-120.

Holmes, D. I. (1994), Authorship Attribution, Computers and the Humanities: 28(2) 87-106.

Orlov, J. K. (1983), Ein Modell der Häufigkeitsstruktur des Vokabulars. In H. Guiter and M. Arapov (eds), Studies in Zipf's Law, Bochum: Brockmeyer, 154-233.

Table 1: Texts used in this study

Author           Title                              Key
--------------------------------------------------------
Baum, L. F.      The Wonderful Wizard of Oz         b1
                 Tip Manufactures a Pumpkinhead     b2
Carroll, L.      Alice's Adventures in Wonderland   a1
                 Through the Looking-glass and      a2
                   what Alice found there
Conan Doyle, A.  The Hound of the Baskervilles      c1
                 The Valley of Fear                 c2
                 The Sign of Four                   c3
James, H.        Confidence                         j1
                 The Europeans                      j2
St Luke          Gospel according to St Luke (KJV)  L1
                 Acts of the Apostles (KJV)         L2
London, J.       The Sea Wolf                       l1
                 The Call of the Wild               l2
Wells, H. G.     The War of the Worlds              w1
                 The Invisible Man                  w2

The State of Authorship Attribution Studies: Problems and Solutions.

Joseph Rudman

Introduction:

There are major problems in the science of "non-traditional" authorship attribution studies (those using statistics and the computer). This paper will show that the problems exist, will list and explain some of the major problems, and will offer some suggestions on how these problems can be resolved.

Problems exist:

Non-traditional authorship attribution research has had enough time and effort -- well over 300 studies and 30 years -- to pass through the "shake-down" phase and enter one marked by steady, solid, and scientific studies that force a consensus among its practitioners.

A major indication that there are problems in a field is when there is no consensus as to correct methodology or technique. Every area of authorship attribution studies has this problem -- e.g. research, experimental set-up, linguistic methods, statistical methods.

It seems that for every paper announcing an authorship attribution method that "works" or a variation of one of these methods, there is a counter paper pointing out crucial flaws, e.g.:

This widespread disagreement has not only kept authorship attribution studies out of most United States court proceedings, but it also threatens to undermine even the legitimate studies in the court of public and professional opinion.

Most authorship attribution studies have been governed by expediency, e.g.:

There is a lack of experimental memory. Researchers working in the same "area" of authorship attribution fail to cite and make use of pertinent previous efforts. Willard McCarty's recent posting on "Humanist" (Vol. 10, No. 137) "Communication and Memory" points this out, "...scholarship in the field is significantly inhibited, I would argue, by the low degree to which previous work in humanities computing and current work in related fields is known and recognized."

The problems with the use of statistics by authorship attribution researchers are many and varied. Too many researchers are led into the swampy quicksand of statistical studies by the ignis fatuus of a "more sophisticated statistical technique".

Problems and suggested solutions:

The "umbrella" problem is that most non-traditional authorship attribution researchers do not understand what constitutes a valid study. They do not understand that it is a scientific experiment and must be approached and carried out as such.

The corrections for many of the specific problems become apparent once the problem is pointed out and there is a consensus that there is a problem. This paper will expand upon, expound, and give examples from published studies of the following problems. The paper will also give the detailed solutions for these problems. One of the "solutions" will be the dissemination of a bibliography of over 500 entries.

Problem 1:

Not really knowing the field of the questioned work (e.g. does someone trained in physics know enough about Plato and all that is involved with the study of the classics to do a valid authorship attribution study of a questioned Plato work?).

Not knowing the sub-disciplines of authorship attribution studies (e.g. linguistics, statistics, stylistics, computer science).

Problem 2:

Not doing the necessary research for each step of the study. (The steps will be shown.)

Not doing a traditional authorship attribution study.

Problem 3:

Not knowing when the flaws in the experimental set-up are fatal. And, therefore, not realizing that the study should not be done.

Problem 4:

Taking shortcuts and making unverified assumptions with the experimental set-up, the data, and the statistical tests (e.g. poor or wrong controls, "cherry picking").

Problem 5:

Ad hominem attacks and self-serving critiques.