Does the Canon Represent a Sampling Problem? A Two Part Series

February 11, 2016

NovelTM

The most recent pamphlet from the Stanford Literary Lab takes up the question of the representativeness of the literary canon. Is the canon — that reduced subset of literary texts that people actually read long after they have been published — a smaller version of the field of literary production more generally? Or is it substantially different? And if so, how? What are the selection biases that go into constructing the canon?

The Stanford pamphlet offers some really interesting initial insights as these questions relate to the British novel of the nineteenth century. That’s actually not as arbitrary a time period as it may sound. As I’ve shown elsewhere, if we look at world translations, the nineteenth century marks the cut-off for texts that still circulate widely. Anything earlier and you are entering the more rarefied world of scholarship and education, not popular reading.

The Stanford findings tell us that the canon is different in at least two ways: first, it has a higher degree of unpredictability at the level of word sequence (combinations of words); and second, it has a narrower, and slightly lower, range of vocabulary richness. More common sets of words are appearing in less predictable patterns. That’s a very nice, and neat, way of summarizing what makes a work of “literature.”

Needless to say, there may be many other ways in which more canonical literature differs from its winnowed brethren. This is what I will be exploring in the second part of this series. Here I want to take up the question of whether the canon might actually tell us the same things as a much larger sample of novels more generally, or if not the same then highly similar. As researchers we face choices about how many texts we look at, which ones, and what state those texts arrive in. Not unlike other fields that are wrestling with the question of whether size matters (how big is your N), computational literary studies needs to be addressing these questions as well. Understanding the biases and the efficacy of samples, whether it be the so-called “canon,” the “archive,” “women’s writing,” “contemporary writing,” or any number of other textual categories, is going to be a key area of research as we move forward. There won’t be one answer, but having as many examples as possible to draw on will help us reach more consensus when it comes time to select data sets for particular questions.

So the question becomes something like this: yes, we can find some differences between canonical novels and the less well-remembered. This has been found to be true in another study if we take “downloads” as a measure of prestige. But do those differences matter? The obvious answer is yes, everything matters! But it also depends on the task. Take the following example.

In my current project, I am looking at the predictability of fictional texts and more specifically what features help us predict whether a text is “true” (i.e. non-fiction) or “imaginary” (fiction). Following on the work of Ted Underwood who has developed methods to make these predictions, I’m interested in better understanding what the predictive features have to tell us about fictionality more generally. When texts signal to readers that they are not about something real, what techniques do they use?

I began this process with a very small sample of 100 highly canonical works of fiction (novels, novellas, and classical epic fiction in prose) and a counter-corpus of non-fiction of the same size (essays, histories, philosophy, advice manuals, etc.). I computed the predictability of each class and came out with about 96% accuracy. I then reran this process controlling for narration, point of view, and even dialogue (by removing it). That is, I looked only at third-person novels, only at histories, and only at narration (because novels contain so much dialogue, which is far less common in any other kind of text). I did so for a group of nineteenth-century texts in both German and English (n=200) and a group of contemporary texts only in English (n=400). The predictability actually increased (98%) and was constant between languages and across time.

Like you, I began to worry about the size of my N. So I reran this process on a collection of 18,000 documents in English drawn from the Hathi Trust, half from Ted Underwood’s fiction data set and half randomly sampled from the non-fiction pile. Overall, the story stayed largely the same. The accuracy was 95% and the list of features most indicative of fiction was the same, with some slight reordering and shifting of effect sizes. In other words, for my question the canon worked just fine. There was very little knowledge gained by expanding out my sample. In fact, because of the OCR errors in the larger collection, there were important facets of those texts, like punctuation, that I could not reliably study there but could observe in my small sample.

Of course, some things did change and it is those details I want to explore here because they give us leads as to how the canon and the archive might also be different from each other beyond conditional entropy and type-token ratios. When we use a larger text collection, in what ways does it change our understanding of the problem and in what ways does it not alter the picture?

Below you will see a series of tables describing the features that I explored and their relative increase in one corpus over another. The features are all drawn from the Linguistic Inquiry and Word Count software (LIWC), which I have used elsewhere on other tasks. I won’t go into the details here, but I like LIWC because of its off-the-shelf ease of use and the way its categories are well aligned with the types of stylistic and psychologically oriented questions we tend to ask in literary studies. We’ll want to develop much more expanded feature sets in the future, such as these, but for now LIWC gives us a way of generalizing about a text’s features that can help us understand the broader nature of what makes a group cohere. It also helps with the problem of feature reduction, which is nice, and I’ve found that for long, psychologically oriented texts like novels it performs as well as, if not better than, bag-of-words in classification tests. Of course, the interpretation of the features needs to be handled with a great deal of caution due to its vocabulary-driven nature, but when isn’t that true?
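To make the "vocabulary-driven" point concrete, here is a minimal sketch of how a dictionary-based feature can be computed: each category is a word list, and the feature is the share of a text's tokens that fall in that list. The category lists and the function name `category_rates` below are invented for illustration. They are not the actual LIWC dictionaries, and the tokenizer is a simplifying assumption.

```python
import re

# Toy category word lists, invented for illustration only.
# These are NOT the real LIWC dictionaries.
TOY_CATEGORIES = {
    "home": {"home", "garden", "closet"},
    "you": {"you", "your", "yours"},
}


def category_rates(text, categories=TOY_CATEGORIES):
    """Return each category's share of the text's tokens, as a percentage."""
    # Simplifying assumption: a token is a run of lowercase letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(token in words for token in tokens) / len(tokens)
        for name, words in categories.items()
    }


sample = "You left your coat in the closet at home, and you forgot the garden gate."
rates = category_rates(sample)
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}% of tokens")
```

Because the score depends entirely on which words are on the list, a category can miss synonyms or count a word used in an unrelated sense, which is one reason the interpretation calls for caution.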

In the table below you see a list of the features that were most indicative of fiction according to the small canonical sample. They are ranked by their increase relative to the non-fiction corpus to which they were compared. Alongside those numbers you can see the corresponding levels and ranks within the significantly larger Hathi corpus. I cut off the features below a 50% increase from one corpus to another. Remember, in each case the sample is being compared to a control corpus of non-fiction of the same relative size. Ideally, this allows us to compare how the collections give us slightly different portraits of what makes “fiction” unique.
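The ranking procedure just described can be sketched as follows. This assumes that "% increase" means the relative difference between a feature's rate in fiction and its rate in the non-fiction control, which is my reading rather than a stated formula. The function names and the feature rates in the example are invented for illustration.

```python
def percent_increase(fiction_rate, nonfiction_rate):
    """Relative increase of a feature in fiction over non-fiction, in percent."""
    return 100.0 * (fiction_rate - nonfiction_rate) / nonfiction_rate


def rank_features(fiction, nonfiction, cutoff=50.0):
    """Rank features by percent increase, dropping those below the cutoff."""
    increases = {
        feature: percent_increase(rate, nonfiction[feature])
        for feature, rate in fiction.items()
        # Skip features that are absent (or zero) in the control corpus.
        if nonfiction.get(feature, 0) > 0
    }
    kept = [(f, v) for f, v in increases.items() if v >= cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [(rank, f, v) for rank, (f, v) in enumerate(kept, start=1)]


# Invented rates (percent of tokens), purely to exercise the functions.
toy_fiction = {"feature_a": 1.8, "feature_b": 0.5, "feature_c": 6.0}
toy_nonfiction = {"feature_a": 0.6, "feature_b": 0.25, "feature_c": 7.0}

ranking = rank_features(toy_fiction, toy_nonfiction)
for rank, feature, increase in ranking:
    print(rank, feature, f"{increase:.1f}%")
```

Running the same function once per collection, each against its own control corpus, gives two rankings that can then be laid side by side.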

Features	% Increase (Canon)	% Increase (Hathi)	Difference (% Increase)	Rank (Canon)	Rank (Hathi)	Difference (Rank)
exclamation	485.9191656	214.3074747	-271.6116909	1	4	-3
you	308.7453646	228.7512467	-79.99411799	2	3	-1
assent	243.622449	258.679788	15.057339	3	1	2
QMark	238.15893	148.6655116	-89.49341835	4	8	-4
Quote	235.7273919	228.9431077	-6.784284149	5	2	3
Apostro	213.725122	75.34737064	-138.3777514	6	22	-16
I	200.9248816	163.252316	-37.67256553	7	5	2
hear	165.7366071	157.5999112	-8.136695931	8	6	2
family	140.5416667	103.4734248	-37.06824188	9	10	-1
shehe	139.8259188	156.0277929	16.20187417	10	7	3
swear	119.7452229	50.76492771	-68.98029522	11	29	-18
ppron	99.89550084	112.4959023	12.60040142	12	9	3
friend	87.75137112	99.76247191	12.01110079	13	11	2
body	81.4178127	75.70947344	-5.708339264	14	21	-7
filler	76.92307692	83.79134347	6.868266543	15	15	0
percept	73.83399209	93.38763387	19.55364178	16	12	4
past	73.71890405	84.53014313	10.81123909	17	14	3
home	63.74434389	81.79697353	18.05262964	18	17	1
social	61.76244341	79.38343963	17.62099621	19	18	1
sexual	61.16504854	70.18107798	9.01602944	20	23	-3
see	61.06848125	83.55935611	22.49087486	21	16	5
sad	57.35031847	78.66689203	21.31657355	22	19	3
anx	57.02984538	89.89178755	32.86194217	23	13	10

Beginning with the first table, we see that 16 of 23 features are within 0-3 rank positions of each other. This represents a very strong degree of congruence between the two collections. For those features that are not well matched in terms of the rankings, we see that features like swearing, apostrophes, and body words appear to be significantly over-represented in the canon, while anxiety and sight words are under-represented. This is mostly borne out if we look at the differences in effect size: the three largest are exclamation marks, apostrophes, and question marks, with “you,” swearing, and “I” also showing significant decreases in the Hathi collection. Conversely, anxiety, seeing, and sadness show the largest increases in the Hathi collection, with “home” not far behind. The mis-ranking of “body” words that we saw above does not seem to register at the level of actual increase within the collections (the figures the two collections report differ by less than 6 percentage points).
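The rank comparison can be checked directly from the table. The sketch below copies the two rank columns (canon rank, Hathi rank) for each feature and counts how many fall within three positions of each other. The variable names are my own, and the sign convention (canon rank minus Hathi rank) follows the table's rank-difference column.

```python
# (canon rank, Hathi rank) for each feature, copied from the table above.
RANKS = {
    "exclamation": (1, 4), "you": (2, 3), "assent": (3, 1), "QMark": (4, 8),
    "Quote": (5, 2), "Apostro": (6, 22), "I": (7, 5), "hear": (8, 6),
    "family": (9, 10), "shehe": (10, 7), "swear": (11, 29), "ppron": (12, 9),
    "friend": (13, 11), "body": (14, 21), "filler": (15, 15),
    "percept": (16, 12), "past": (17, 14), "home": (18, 17),
    "social": (19, 18), "sexual": (20, 23), "see": (21, 16), "sad": (22, 19),
    "anx": (23, 13),
}

# Negative shift: the feature ranks higher in the canon than in Hathi.
shifts = {feature: canon - hathi for feature, (canon, hathi) in RANKS.items()}

close = [f for f, shift in shifts.items() if abs(shift) <= 3]
far = sorted((f for f in shifts if f not in close), key=lambda f: shifts[f])

print(f"{len(close)} of {len(RANKS)} features are within 3 rank positions")
for feature in far:
    print(f"{feature}: rank shift {shifts[feature]:+d}")
```

The features outside the three-position band are the ones discussed above: swearing, apostrophes, body words, and question marks on the canon side, and anxiety, seeing, and perception words on the Hathi side.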

So what does this tell us? First, punctuation seems to be the most variable between the collections, which again might have something to do with OCR. But for the other types of words, it seems like we are seeing the ways in which each collection has a particular semantic bias (how significant that bias is remains a different question). The canon seems slightly more oriented towards family concerns (and dialogue, through the I/you prevalence), while the Hathi collection seems to put somewhat more emphasis on negative emotions as well as the space of the “home” (literally words having to do with houses, like “home”, “garden”, “closet”, etc.). Interestingly, these more specific dictionaries usually account for about 0.4-0.6% of the words in a given novel, meaning roughly 400 words per mid-length novel or about 1-2 per page. That’s neither small nor large. Just the word “you,” for example, accounts for about 1.3% of tokens, or roughly three times as many instances (while swearing occurs at about one tenth the rate of family words).
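The back-of-the-envelope figures above can be reproduced with a few lines of arithmetic. The novel length (100,000 words) and page length (300 words) are my assumptions for a "mid-length novel" and a typical page; only the percentage rates come from the discussion above.

```python
NOVEL_WORDS = 100_000   # assumption: length of a mid-length novel
WORDS_PER_PAGE = 300    # assumption: words on a typical printed page


def occurrences(rate_percent, total_words=NOVEL_WORDS):
    """Expected number of matching words in a novel for a given rate."""
    return rate_percent / 100.0 * total_words


def per_page(rate_percent, words_per_page=WORDS_PER_PAGE):
    """Expected number of matching words on a single page."""
    return rate_percent / 100.0 * words_per_page


for rate in (0.4, 0.6, 1.3):
    print(f"{rate}% of tokens: about {occurrences(rate):.0f} words per novel, "
          f"{per_page(rate):.1f} per page")

# How much more frequent "you" (1.3%) is than a 0.4% category.
print(f"ratio: {1.3 / 0.4:.2f}")
```

Under these assumptions the 0.4-0.6% range works out to roughly 400-600 words per novel and between one and two per page, while “you” at 1.3% is about two to three times as frequent, depending on which end of the range it is compared to.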

There are two caveats to all of this. First, my canonical sample is not “the canon”; it is a sample of the canon, and different samples might perform somewhat differently. Second, the Hathi Trust does not exclusively represent the “archive” or “non-canon.” It contains many canonical as well as non-canonical novels. The same could be said for the non-fiction side of things. These samples overlap to a certain degree. As I said, this isn’t about directly comparing the canon to the forgotten, but rather about first finding out whether using a larger sample affects a particular type of test.

The answer to that question in this case is provisionally: not by much. I am glad to have both collections to see how they perform relative to one another. But I would feel confident if someone undertook a similar project and used a smaller sample as the basis for their claims. I would be curious if others disagree.

In the next post I will look more exclusively at comparing the canon to the archive.