LIWC for Literature: Releasing Data on 25,000 Documents

Increasing emphasis is being placed in the humanities on sharing data. Projects like the Open Syllabus Project, for example, have made a tremendous effort in discovering, collecting, and cleaning large amounts of data relevant to humanities research. Much of our data, however, is still locked up behind copyright and paywalls within university libraries, even when the underlying information is part of the public domain. This is one of the main inhibitors to the field’s development.

In an effort to contribute to the opening of closed humanities data, I am sharing LIWC tables for 25,000+ documents that were used in a recent study I did on “fictionality.” The documents consist of both fiction and non-fiction texts drawn from a number of different periods (the nineteenth century canon, Hathi Trust nineteenth-century documents, the twentieth century repositories of Gutenberg and Amazon, and multiple contemporary literary genres from mysteries to prizewinners) as well as two separate languages (German and English). They allow us to explore a variety of literary historical questions across broad swaths of time and place.

LIWC stands for Linguistic Inquiry and Word Count; it is a lexicon-based tool that aggregates individual words into larger semantic and syntactic categories. Some of these categories, like punctuation marks or personal pronouns, are more straightforward than others, like “cognitive insight” or the thematics of “home.” But as we know in literary studies, even straightforward marks can have multiple meanings.
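To make the mechanics concrete, here is a minimal sketch of what a lexicon-based tool like LIWC does under the hood: count how often each category’s words appear and express that as a share of all tokens. The category word lists below are invented stand-ins, not LIWC’s actual (proprietary) dictionaries, and the real tool additionally handles wildcards, hierarchical categories, and punctuation.

```python
from collections import Counter
import re

# Toy stand-ins for LIWC-style dictionaries. The real LIWC dictionaries are
# proprietary, far larger, and use wildcard matching (e.g., "friend*").
LEXICON = {
    "ppron": {"i", "you", "she", "he", "we", "they"},
    "family": {"mother", "father", "sister", "brother", "family"},
    "home": {"home", "house", "garden", "kitchen", "door"},
}

def liwc_like_profile(text):
    """Return each category's share of total tokens, as a percentage."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        category: 100 * sum(counts[w] for w in words) / total
        for category, words in LEXICON.items()
    }

print(liwc_like_profile("You and I walked home to see my mother and father."))
```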

LIWC is far from perfect, though much work has been done to address the polysemy of individual words. Nevertheless, I want to make the case that it can be an effective tool for addressing three problems within the computational study of literature and culture.

1) It gives us a useful way of beginning to categorize the lexical orientation of different populations of texts. Unlike topic modeling, where labels are provided after the fact, LIWC categories allow us to test hypotheses in advance. The categories are independent of the texts we are observing.

2) It gives us a way of reducing the dimensional complexity of linguistic features. Given enough documents, novels, poetry, or plays, you can easily end up with tens if not hundreds of thousands of word types when you’re building a given model. That’s often way too many variables from which to make statistical inferences. LIWC offers a very straightforward way of reducing the number of dimensions according to categories that are intrinsically relevant to the study of literature. There is plenty more to be done here to better understand the correlation between features or how much information is lost in this process. But with ca. 80 dimensions you are on much better footing for a variety of modeling tasks than you are with three- or ten-thousand.

3) Finally, these aggregate features allow us to share data that is otherwise not sharable. This is a huge problem in the humanities right now. LIWC provides a solution. Again, it’s not perfect. But it is better than keeping the data locked up. A brief sketch of what this reduction-and-sharing workflow can look like follows below.
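As a rough sketch of points 2 and 3, a document-term matrix with tens of thousands of word-type columns is collapsed into a handful of category percentages, and only that aggregate table is written out for distribution. The lexicon, document names, and file name below are illustrative placeholders, not the actual data behind the release.

```python
import csv

# Toy category lexicon; real LIWC has roughly 80 categories.
LEXICON = {
    "ppron": {"i", "you", "she", "he"},
    "family": {"mother", "father", "sister"},
    "home": {"home", "house", "garden"},
}

# A tiny stand-in for a document-term matrix that would otherwise have
# tens of thousands of word-type columns.
doc_term_counts = {
    "novel_001": {"you": 12, "mother": 4, "home": 3, "whale": 7, "the": 90},
    "essay_001": {"i": 2, "evidence": 9, "archive": 5, "the": 60},
}

rows = []
for doc_id, counts in doc_term_counts.items():
    total = sum(counts.values())
    row = {"doc_id": doc_id}
    for category, words in LEXICON.items():
        # Collapse thousands of word types into one percentage per category.
        row[category] = round(100 * sum(counts.get(w, 0) for w in words) / total, 3)
    rows.append(row)

# The aggregate table is what gets shared; the underlying texts never leave the library.
with open("liwc_features.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["doc_id"] + sorted(LEXICON))
    writer.writeheader()
    writer.writerows(rows)
```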

I hope in the future that people will do more of this kind of sharing of transformed data. While it is always better to have the underlying data so you can understand and be in control of the process of transformation (let alone collection), we can at least start to generate some shared data sets. This is a point nicely discussed by Sarah Allison in the Journal of Cultural Analytics (CA) as well as in a forthcoming piece by Andrew Goldstone.

Perhaps the most important point, though, is that LIWC is just one way of reducing the complexity of a text into higher-level categories, which in LIWC are largely semantically oriented. We can argue about those categories and the degree of ambiguity within them. More importantly, though, we also need to think beyond purely semantic models of texts (as many are increasingly doing). We need “LIWC for literature” in a different sense, that is, in the creation of new kinds of literary features derived from texts that aren’t purely semantic. Here I’m thinking of plot features, character features, dialogue features, you name it. There is so much work to do to identify features specific to literary texts. These will ultimately help not only in the sharing of data, but also in the process of literary modeling more generally.

This is a core area of exciting new research that I hope people increasingly engage with.

Does the Canon Represent a Sampling Problem? A Two-Part Series

The most recent pamphlet from the Stanford Literary Lab takes up the question of the representativeness of the literary canon. Is the canon — that reduced subset of literary texts that people actually read long after they have been published — a smaller version of the field of literary production more generally? Or is it substantially different? And if so, how? What are the selection biases that go into constructing the canon?

The Stanford pamphlet offers some really interesting initial insights as these questions relate to the British novel of the nineteenth century. That’s actually not as arbitrary a time period as it may sound: as I’ve shown elsewhere, if we look at world translations, the nineteenth century marks the cut-off for texts that still circulate widely. Anything earlier and you are entering the more rarefied world of scholarship and education, not popular reading.

The Stanford findings tell us that the canon is different in at least two ways: first, it has a higher degree of unpredictability at the level of word sequence (combinations of words); and second, it has a narrower, and slightly lower, range of vocabulary richness. More common sets of words are appearing in less predictable patterns. That’s a very nice, and neat, way of summarizing what makes a work of “literature.”
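For readers who want a concrete handle on those two measures, here is a minimal sketch of how one might estimate the unpredictability of word sequences (as the conditional entropy of a bigram model) and vocabulary richness (as the type-token ratio). The pamphlet’s own implementation will differ in its tokenization, smoothing, and sampling choices, and the file path below is a placeholder.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Vocabulary richness: distinct word types per token."""
    return len(set(tokens)) / len(tokens)

def bigram_conditional_entropy(tokens):
    """H(next word | current word) in bits, estimated from raw bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = defaultdict(int)
    for (w1, _), n in bigrams.items():
        context_totals[w1] += n
    total = sum(bigrams.values())
    entropy = 0.0
    for (w1, _), n in bigrams.items():
        p_joint = n / total                      # p(current, next)
        p_conditional = n / context_totals[w1]   # p(next | current)
        entropy -= p_joint * math.log2(p_conditional)
    return entropy

# "sample_novel.txt" is a placeholder path for any plain-text novel.
tokens = tokenize(open("sample_novel.txt", encoding="utf-8").read())
print(type_token_ratio(tokens), bigram_conditional_entropy(tokens))
```

Both measures are sensitive to text length, so in practice corpora are usually compared on equal-sized samples.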

Needless to say, there may be many other ways in which more canonical literature differs from its winnowed brethren. This is what I will be exploring in the second part of this series. Here I want to take up the question of whether the canon might actually tell us the same things as a much larger sample of novels, or if not the same things, then highly similar ones. As researchers we face choices about how many texts we look at, which ones, and in what state those texts arrive. Not unlike other fields that are wrestling with the question of whether size matters (how big is your N), computational literary studies needs to address these questions as well. Understanding the biases and the efficacy of samples, whether it be the so-called “canon,” the “archive,” “women’s writing,” “contemporary writing,” or any number of other textual categories, is going to be a key area of research as we move forward. There won’t be one answer, but having as many examples as possible to draw on will help us reach more consensus when it comes time to select data sets for particular questions.

So the question becomes something like this: yes, we can find some differences between canonical novels and the less well-remembered ones. This has been found to be true in another study if we take “downloads” as a measure of prestige. But do those differences matter? The obvious answer is yes, everything matters! But it also depends on the task. Take the following example.

In my current project, I am looking at the predictability of fictional texts and, more specifically, what features help us predict whether a text is “true” (i.e. non-fiction) or “imaginary” (fiction). Following on the work of Ted Underwood, who has developed methods to make these predictions, I’m interested in better understanding what the predictive features have to tell us about fictionality more generally. When texts signal to readers that they are not about something real, what techniques do they use?

I began this process with a very small sample of 100 highly canonical works of fiction (novels, novellas, and classical epic fiction in prose) and a counter-corpus of non-fiction of the same size (essays, histories, philosophy, advice manuals, etc.). I computed the predictability of each class and came out with about 96% accuracy. I then reran this process controlling for narration, point of view, and even dialogue (by removing it): I looked only at third-person novels, only at histories, and only at narration (novels contain far more dialogue than any other kind of text). I did so for a group of nineteenth-century texts in both German and English (n=200) and a group of contemporary texts in English only (n=400). The predictability actually increased (98%) and was constant between languages and across time.
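For readers who want to see the shape of such a test, here is a hedged sketch, not the study’s actual pipeline: cross-validate a simple classifier on a table of LIWC features in which each row is a document and a label column marks fiction versus non-fiction. The file name and column names are placeholder assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder file: one row per document, ~80 LIWC columns, plus a
# "label" column marking fiction vs. non-fiction and a "doc_id" column.
df = pd.read_csv("liwc_features_labeled.csv")
X = df.drop(columns=["doc_id", "label"])
y = df["label"]

# Standardize the features, fit a regularized logistic regression,
# and report mean accuracy over 10-fold cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}")
```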

Like you, I began to worry about the size of my N. So I reran this process on a collection of 18,000 documents in English drawn from the Hathi Trust, half from Ted Underwood’s fiction data set and half randomly sampled from the non-fiction pile. Overall, the story stayed largely the same. The accuracy was 95%, and the features most indicative of fiction were the same, with some slight reordering and shifting of effect sizes. In other words, for my question the canon worked just fine. There was very little knowledge gained by expanding my sample. In fact, because of OCR errors in the larger collection, there were important facets of those texts, like punctuation, that I could not reliably study but could observe in my sample.
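The feature comparison that follows can be reconstructed in outline like this: for each LIWC category, compute the percent increase of its mean value in fiction over its mean value in the matched non-fiction corpus, then rank the categories; doing this once per collection and joining on category name yields the ranks and differences reported below. This is my own reconstruction under assumed column names, not the study’s exact code.

```python
import pandas as pd

# Placeholder file: per-document LIWC values plus a "label" column
# ("fiction" or "nonfiction") and a "doc_id" column.
df = pd.read_csv("liwc_features_labeled.csv")
means = df.drop(columns=["doc_id"]).groupby("label").mean()

# Percent increase of each category's mean in fiction relative to non-fiction.
pct_increase = 100 * (means.loc["fiction"] - means.loc["nonfiction"]) / means.loc["nonfiction"]
ranked = pct_increase.sort_values(ascending=False)

# Keep only categories that increase by more than 50%, as in the table below.
print(ranked[ranked > 50])
```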

Of course, some things did change and it is those details I want to explore here because they give us leads as to how the canon and the archive might also be different from each other beyond conditional entropy and type-token ratios. When we use a larger text collection, in what ways does it change our understanding of the problem and in what ways does it not alter the picture?

Below you will see a series of tables describing the features that I explored and their relative increase in one corpus over another. The features are all drawn from the Linguistic Inquiry and Word Count software (LIWC), which I have used elsewhere on other tasks. I won’t go into the details here, but I like LIWC because of its off-the-shelf ease of use and the way its categories are well aligned with the types of stylistic and psychologically oriented questions we tend to ask in literary studies. We’ll want to develop much more expanded feature sets in the future, such as these, but for now LIWC gives us a way of generalizing about a text’s features that can help us understand the broader nature of what makes a group cohere. It also helps with the problem of feature reduction, which is nice, and I’ve found that for long, psychologically oriented texts like novels it performs as well as, if not better than, bag-of-words features in classification tests. Of course, the interpretation of the features needs to be handled with a great deal of caution given its vocabulary-driven nature, but when isn’t that true?

In the table below you see a list of the features that were most indicative of fiction according to the small canonical sample. They are ranked by their increase relative to the non-fiction corpus to which they were compared. Alongside those numbers you can see their levels and ranks within the significantly larger Hathi corpus. I cut off the features below a 50% increase from one corpus to the other. Remember, in each case the sample is being compared to a control corpus of non-fiction of the same relative size. Ideally, this allows us to compare how the collections give us slightly different portraits of what makes “fiction” unique.

| Feature | % Increase (Canon) | % Increase (Hathi) | Difference (Hathi - Canon) | Rank (Canon) | Rank (Hathi) | Rank Difference (Canon - Hathi) |
| --- | --- | --- | --- | --- | --- | --- |
| exclamation | 485.9 | 214.3 | -271.6 | 1 | 4 | -3 |
| you | 308.7 | 228.8 | -80.0 | 2 | 3 | -1 |
| assent | 243.6 | 258.7 | 15.1 | 3 | 1 | 2 |
| QMark | 238.2 | 148.7 | -89.5 | 4 | 8 | -4 |
| Quote | 235.7 | 228.9 | -6.8 | 5 | 2 | 3 |
| Apostro | 213.7 | 75.3 | -138.4 | 6 | 22 | -16 |
| I | 200.9 | 163.3 | -37.7 | 7 | 5 | 2 |
| hear | 165.7 | 157.6 | -8.1 | 8 | 6 | 2 |
| family | 140.5 | 103.5 | -37.1 | 9 | 10 | -1 |
| shehe | 139.8 | 156.0 | 16.2 | 10 | 7 | 3 |
| swear | 119.7 | 50.8 | -69.0 | 11 | 29 | -18 |
| ppron | 99.9 | 112.5 | 12.6 | 12 | 9 | 3 |
| friend | 87.8 | 99.8 | 12.0 | 13 | 11 | 2 |
| body | 81.4 | 75.7 | -5.7 | 14 | 21 | -7 |
| filler | 76.9 | 83.8 | 6.9 | 15 | 15 | 0 |
| percept | 73.8 | 93.4 | 19.6 | 16 | 12 | 4 |
| past | 73.7 | 84.5 | 10.8 | 17 | 14 | 3 |
| home | 63.7 | 81.8 | 18.1 | 18 | 17 | 1 |
| social | 61.8 | 79.4 | 17.6 | 19 | 18 | 1 |
| sexual | 61.2 | 70.2 | 9.0 | 20 | 23 | -3 |
| see | 61.1 | 83.6 | 22.5 | 21 | 16 | 5 |
| sad | 57.4 | 78.7 | 21.3 | 22 | 19 | 3 |
| anx | 57.0 | 89.9 | 32.9 | 23 | 13 | 10 |

Beginning with the first table, we see that 16 of 23 features fall within zero to three rank positions of each other, a very strong degree of congruence between the two collections. For the features that are not well matched in terms of ranking, we see that features like swearing, apostrophes, and body words appear to be significantly over-represented in the canon, while anxiety and sight words are under-represented. This is mostly borne out if we look at the differences in effect size: the three largest decreases are exclamation marks, apostrophes, and question marks, with “you,” swearing, and “I” also dropping significantly in the Hathi collection. Conversely, anxiety, seeing, and sadness all show the highest increases in the Hathi collection, with “home” not far behind. The “body” mis-ranking we saw above does not register at the level of actual increase within the collections (there is only about a five-point difference in the percent increases the two collections report).

So what does this tell us? First, punctuation seems to be the most variable between the collections, which, again, might have something to do with OCR. But for the other types of words, it seems we are seeing the ways in which each collection has a particular semantic bias (how significant that bias is remains a different question). The canon seems slightly more oriented towards family concerns (and towards dialogue, given the prevalence of “I” and “you”), while the Hathi collection puts somewhat more emphasis on negative emotions as well as the space of the “home” (literally words having to do with houses, like “home,” “garden,” “closet,” etc.). Interestingly, these more specific dictionaries usually account for about 0.4–0.6% of the words in a given novel, meaning about 400 words in a mid-length novel, or about one to two per page. That’s neither small nor large. The word “you,” for example, accounts for about 1.3% of tokens, or roughly three times as many instances (while swearing occurs at about one-tenth the rate of family words).

Two caveats to all of this. First, my canonical sample is not “the canon”; it is a sample of the canon, and different samples might perform somewhat differently. Second, the Hathi Trust collection does not exclusively represent the “archive” or the “non-canon”: it contains many canonical as well as non-canonical novels. The same could be said for the non-fiction side of things. These samples overlap to a certain degree. As I said, this isn’t about directly comparing the canon to the forgotten, but about first finding out whether using a larger sample impacts a particular type of test.

The answer to that question in this case is, provisionally: not by much. I am glad to have both collections to see how they perform relative to one another. But I would feel confident if someone undertook a similar project and based their claims on a smaller sample. I would be curious to hear if others disagree.

In the next post I will look more exclusively at comparing the canon to the archive.

Quantifying the Weepy Bestseller

I have a new piece appearing in The New Republic. In a number of recent book reviews, literary critics and novelists arrive at the consensus that to be a great writer, one must avoid being “sentimental.” One famous novelist describes it as a “cardinal sin” of writing. But is it actually true? Using a computer science method called “sentiment analysis,” we tested this claim on a large corpus of novels from the early twentieth century to the present, and found the opposite. Writers who win book prizes and get reviewed in the New York Times are not any less sentimental than novelists who write popular fiction, such as romances or bestsellers. The only group for which this did not hold was the 50 most canonical novels written since about 1950. Our analysis tells us that if you want to write one of the most important books of the next half century, then you should tone down the sentiment. But if you want to be reviewed in a major newspaper, sell books, or win prizes, go ahead and emote away.
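For readers curious what “sentiment analysis” means operationally, here is a minimal lexicon-based sketch: score each text by the share of its words that appear in positive and negative word lists. The word lists below are tiny illustrative stand-ins; the published analysis relied on far larger sentiment vocabularies.

```python
import re
from collections import Counter

# Tiny illustrative word lists; real sentiment lexicons contain thousands of
# entries (and often weights), and the published analysis used richer tools.
POSITIVE = {"love", "joy", "happy", "tender", "hope"}
NEGATIVE = {"grief", "tears", "sorrow", "fear", "despair"}

def sentiment_rates(text):
    """Return positive and negative words as shares of all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    positive = sum(counts[w] for w in POSITIVE) / total
    negative = sum(counts[w] for w in NEGATIVE) / total
    return positive, negative

print(sentiment_rates("Her tears gave way to hope, and hope gave way to joy."))
```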

But the larger point for us is the way our cultural taste-makers are often wrong or extremely biased in their assumptions about what matters. We found that a computer, ironically, can paint a more nuanced picture of what makes great literature.

Here is an excerpt:

If you want to be a great writer, should you withhold your sentimental tendencies? The answer for most critics and writers seems to be yes. Sentimentality is often seen as a useful way of distinguishing between serious literature and the not-so-serious, probably best-selling kind. “Sentimentality,” James Baldwin wrote, is “the ostentatious parading of excessive and spurious emotion…the mark of dishonesty, the inability to feel.” While sentimentality is false, grandiose, manipulative, and over-boiled, high literature is subtle, nuanced, cool, and true. As Roland Barthes, the dean of high cultural criticism, once remarked: “It is no longer the sexual which is indecent, it is the sentimental.” This sentiment (yes, sentiment) has been around since at least the early twentieth century and is still a subject of debate in the review pages of numerous media outlets today. But is it true? Whether you are for subtlety or against sentimentality, is this a good way to think about writing your next novel?

Read more here.