LIWC for Literature: Releasing Data on 25,000 Documents
Increasing emphasis is being placed in the humanities on sharing data. Projects like the Open Syllabus Project, for example, have made a tremendous effort in discovering, collecting, and cleaning large amounts of data relevant to humanities research. Much of our data, however, is still locked up behind copyright and paywalls within university libraries, even when the underlying information is part of the public domain. This is one of the main inhibitors to the field’s development.
In an effort to contribute to the opening of closed humanities data, I am sharing LIWC tables for 25,000+ documents that were used in a recent study I did on “fictionality.” The documents consist of both fiction and non-fiction texts drawn from a number of different periods (the nineteenth century canon, Hathi Trust nineteenth-century documents, the twentieth century repositories of Gutenberg and Amazon, and multiple contemporary literary genres from mysteries to prizewinners) as well as two separate languages (German and English). They allow us to explore a variety of literary historical questions across broad swaths of time and place.
LIWC stands for Linguistic Inquiry and Word Count and is a lexicon-based software tool that aggregates individual words into larger semantic and syntactic categories. Some of these categories, like punctuation marks or personal pronouns, are more straightforward than others, like “cognitive insight” or the thematics of “home.” But as we know in literary studies, even straightforward marks can have multiple meanings.
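The basic mechanism is simple enough to sketch. The snippet below uses a tiny hypothetical lexicon of my own invention (the real LIWC dictionaries are proprietary and far larger) to show how a lexicon-based tool turns raw text into per-category proportions:

```python
import re
from collections import Counter

# A hypothetical mini-lexicon in the spirit of LIWC: each category maps to a
# set of words. These words and categories are illustrative, not LIWC's own.
LEXICON = {
    "pronoun": {"i", "you", "she", "he", "they", "we"},
    "negation": {"no", "not", "never", "none"},
}

def category_counts(text):
    """Return the fraction of tokens in `text` that fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for cat, words in LEXICON.items():
            if tok in words:
                counts[cat] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {cat: counts[cat] / total for cat in LEXICON}

print(category_counts("I never said she took it, not once."))
```

The output is a small vector of proportions per document, which is exactly the kind of aggregate representation the tables I am releasing contain.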
LIWC is far from perfect. Much work has been done to address the problem of polysemy in individual words. Nevertheless, I want to make the case that it can be an effective tool for solving three problems within the computational study of literature and culture.
1) It gives us a useful way of beginning to categorize the lexical orientation of different populations of texts. Unlike topic modeling, where labels are provided after the fact, LIWC categories allow us to test hypotheses in advance. The categories are independent of the texts we are observing.
2) It gives us a way of reducing the dimensional complexity of linguistic features. Given enough documents, novels, poetry, or plays, you can easily end up with tens if not hundreds of thousands of word types when you’re building a given model. That’s often way too many variables from which to make statistical inferences. LIWC offers a very straightforward way of reducing the number of dimensions according to categories that are intrinsically relevant to the study of literature. There is plenty more to be done here to better understand the correlation between features or how much information is lost in this process. But with ca. 80 dimensions you are on much better footing for a variety of modeling tasks than you are with three or ten thousand.
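To make the reduction concrete, here is a minimal sketch contrasting a full bag-of-words matrix (one column per word type) with a category-level matrix (one column per lexical category). The corpus and categories are toy assumptions standing in for a real corpus and LIWC's ~80 dimensions:

```python
import re
from collections import Counter

# Toy corpus; real corpora easily exceed tens of thousands of word types.
docs = [
    "She walked home. She was not happy.",
    "I think they never understood what we wanted.",
]

# Hypothetical categories standing in for LIWC's ~80 dimensions.
CATEGORIES = {
    "pronoun": {"i", "you", "she", "he", "they", "we"},
    "negation": {"no", "not", "never"},
    "home": {"home", "house", "kitchen"},
}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

# Full bag-of-words representation: one column per word type.
vocab = sorted({tok for d in docs for tok in tokenize(d)})
bow = [[Counter(tokenize(d))[w] for w in vocab] for d in docs]

# Reduced representation: one column per category (token proportions).
reduced = [
    [sum(1 for t in tokenize(d) if t in words) / len(tokenize(d))
     for words in CATEGORIES.values()]
    for d in docs
]

print(len(vocab), "word-type columns ->", len(CATEGORIES), "category columns")
```

Even on this toy corpus the vocabulary already outnumbers the categories; at corpus scale the gap becomes several orders of magnitude, which is what makes the reduced matrix tractable for statistical modeling.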
3) Finally, these aggregate features allow us to share data that is otherwise not sharable. This is a huge problem in the humanities right now. LIWC provides a solution. Again, it’s not perfect. But it is better than keeping the data locked up.
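This is why aggregate tables sidestep copyright: because only per-category proportions are stored, the underlying text cannot be reconstructed from them, and the derived table can circulate freely. A minimal sketch of exporting such a table, with hypothetical document IDs, scores, and filename:

```python
import csv

# Hypothetical per-document scores (fractions of tokens per category).
rows = [
    {"doc_id": "novel_001", "pronoun": 0.081, "negation": 0.012, "home": 0.004},
    {"doc_id": "novel_002", "pronoun": 0.064, "negation": 0.019, "home": 0.001},
]

# The table holds only aggregate proportions, not the (possibly copyrighted)
# text itself, so it can be shared where the source documents cannot.
with open("liwc_features.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["doc_id", "pronoun", "negation", "home"])
    writer.writeheader()
    writer.writerows(rows)
```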
I hope in the future that people will do more of this kind of sharing of transformed data. While it is always better to have the underlying data so you can understand and be in control of the process of transformation (let alone collection), we can at least start to generate some shared data sets. This is a point nicely discussed by Sarah Allison in the Journal of Cultural Analytics (CA) as well as in a forthcoming piece by Andrew Goldstone.
Perhaps the most important point, though, is that LIWC is just one way of reducing the complexity of a text into higher-level categories, and in LIWC those categories are largely semantically oriented. We can argue about those categories and the degree of ambiguity within them. More importantly, though, we need to also think beyond purely semantic models of texts (as many are increasingly doing). We need “LIWC for literature” in a different sense: the creation of new kinds of literary features, derived from texts, that aren’t purely semantic. Here I’m thinking of plot features, character features, dialogue features, you name it. There is so much work to do to identify features specific to literary texts. These will ultimately help not only in the sharing of data, but also in the process of literary modeling more generally.
This is a core area of exciting new research that I hope people increasingly engage with.