Where’s the data? Notes from an international forum on limited use text mining

I’m attending a two-day workshop on issues related to data access for text and data mining (TDM). We are 25 participants from different areas: researchers who do TDM, librarians who oversee digital content, content providers who package and sell data to academic libraries (principally large publishers), and lawyers.

I am excited to be here because these issues strike me as both complicated and intractable. I have tried for several years to gain greater access to data in our university library, with no success. I have also worked extensively with limited-use data and wished I could share it more openly. Whenever I ask how the situation can improve, a circle of finger-pointing begins in which everyone points at someone else and nothing changes.

The overarching question that we are all implicitly asking ourselves: Will anything change after our meeting?

Here we go.

The Replication Crisis I: Restoring confidence in research through replication clusters

Much has been written about the so-called “replication crisis” going on across the sciences today. These issues affect literary and cultural studies in many ways, though not always in the most straightforward way. “Replication” has a complicated fit with more interpretive disciplines, and its implications warrant careful thought. In the next few weeks I’ll be writing some posts about this to try to generate a conversation around the place of replication in the humanities.

Are novels getting easier to read?

I’ve been experimenting with using readability metrics lately (code for the below is here). They offer a very straightforward way of measuring textual difficulty, usually consisting of some ratio of sentence and word length. They date back to the work of Rudolf Flesch, who developed the “Flesch Reading Ease” metric. Today, there are over 30 such measures.

Flesch was a Viennese immigrant who fled Austria from the Nazis and came to the U.S. in 1938. He ended up as a student in Lyman Bryson’s Readability Lab at Columbia University. The study of “readability” emerged as a full-fledged science in the 1930s when the U.S. government began to invest more heavily in adult education during the Great Depression. Flesch’s insight, which was based on numerous surveys and studies of adult readers, was simple. While there are many factors behind what makes a book or story comprehensible (i.e. “readable”), the two most powerful predictors are sentence length and word length. The longer a book’s sentences and the more long words it uses, the more difficult readers will likely find it. Flesch reduced this insight to a single predictive, if somewhat bizarre, formula:


Reading Ease = 206.835 − 1.015 × (W / St) − 84.6 × (Sy / W)

where W = # words, St = # sentences, Sy = # syllables
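To make the arithmetic concrete, here is a minimal Python sketch of the standard Flesch Reading Ease formula, 206.835 − 1.015 × (W / St) − 84.6 × (Sy / W). It is not the koRpus implementation used for the results in this post: the function names are hypothetical, and the syllable counter is a crude vowel-group heuristic that I am assuming is good enough for illustration. Real tools use hyphenation dictionaries, so scores will differ somewhat.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of vowels, dropping a silent final 'e'.

    This heuristic is an assumption for illustration, not a linguistic standard.
    """
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as 'table'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Compute Flesch Reading Ease from word, sentence, and syllable counts."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    w = len(words)
    st = len(sentences)
    sy = sum(count_syllables(word) for word in words)
    return 206.835 - 1.015 * (w / st) - 84.6 * (sy / w)
```

Short sentences made of one-syllable words score above 100, which is why the scale is best read comparatively rather than as a percentage.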


According to Flesch’s measure, Rudyard Kipling’s The Jungle Book has a higher readability score (87.5) than James Joyce’s Ulysses (81.0). Presidential inaugural speeches have been getting more readable over time. The question that I began to ask was, have novels as well?

The answer, at first glance, is yes. Considerably so. Below you see a plot of the mean readability score per decade for a sample of ca. 5,000 English-language novels. These novels are drawn from the Stanford Literary Lab collection and the Chicago Text Lab. The higher the value, the more “readable” (i.e. less difficult) a text is assumed to be. The calculations are made by taking 20 sample passages of 15 sentences each from every novel and calculating the Flesch Reading Ease for each passage. Then for every decade I use a bootstrapping process to estimate the mean reading ease for that decade. Error bars give you some idea of the variability around the mean per decade. What this masks is a very high variability at the passage level. Nevertheless, the overall average is clearly moving up in significant ways.
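The sampling and bootstrapping steps can be sketched roughly as follows, in Python rather than the R used for the actual analysis. Several details are assumptions on my part for the sake of the example: passages are drawn at random start positions and may overlap, the bootstrap resamples a flat list of scores for a decade, and the interval is a simple 95% percentile interval. The function names are hypothetical.

```python
import random
import statistics


def sample_passages(sentences, n_passages=20, passage_len=15, rng=None):
    """Draw n_passages runs of passage_len consecutive sentences from one novel.

    Start positions are random and passages may overlap (an assumption).
    """
    rng = rng or random.Random()
    if len(sentences) < passage_len:
        raise ValueError("novel is shorter than one passage")
    last_start = len(sentences) - passage_len
    starts = [rng.randint(0, last_start) for _ in range(n_passages)]
    return [sentences[s:s + passage_len] for s in starts]


def bootstrap_mean(scores, n_resamples=1000, rng=None):
    """Estimate a mean by resampling with replacement.

    Returns (mean of resample means, lower bound, upper bound), where the
    bounds are roughly the 2.5th and 97.5th percentiles of the resample means.
    """
    rng = rng or random.Random()
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(scores, k=len(scores))
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return statistics.fmean(means), lower, upper
```

In this sketch, the scores passed to `bootstrap_mean` would be the reading-ease values of all passages from novels published in one decade, and the two bounds would supply the error bars.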

One question that immediately came to mind was the extent to which these scores are being driven by an increase in dialogue. Dialogue is notably “simpler” in structure with considerably shorter sentences, and potentially shorter words to capture spoken language. I wondered whether this might be behind this change.

Below you see a second graph with the percentage of quotation marks per decade. Here I simply calculated the number of quotation-mark pairs per novel and used bootstrapping to estimate the decade mean. As you can see, they rise in very similar fashion, though with a noticeable break where the two data sets are joined together. Mark Algee-Hewitt has a lot to say on this issue of combining data sets. It’s interesting that typographic things like quotation marks are way more problematic for this issue than something more complex like “readability.” A lot also depends on my very simple model of dialogue. It could just be that quotation marks become more standardized and thus appear more frequent, but I don’t think that’s entirely the case. Either way, this could definitely use improvement.

With these caveats in mind, there is a very strong correlation between the number of quotation marks used per decade and the readability of novels (r = 0.86). It suggests that dialogue is a big part of this shift towards more readable novels.
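For anyone who wants to try this on their own texts, here is a hedged Python sketch of the two ingredients: a crude quotation-mark count and a Pearson correlation. Normalizing by word count is my assumption for the example (the post only says “percentage”), and the sketch ignores single-quote dialogue conventions entirely. Function names are hypothetical.

```python
import math


def quote_pair_rate(text: str) -> float:
    """Quotation-mark pairs per 1,000 words.

    Counts straight and curly double quotes only; single-quote dialogue,
    common in British editions, is not handled here.
    """
    marks = sum(text.count(ch) for ch in '"\u201c\u201d')
    words = len(text.split())
    if words == 0:
        raise ValueError("text contains no words")
    return (marks // 2) / words * 1000


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would be the per-decade quotation rates and `ys` the per-decade mean reading-ease scores. With only a couple of dozen decades as data points, a high r should be read with some caution.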

But what if we remove dialogue? Are novel sentences outside of dialogue getting simpler, too?

I don’t have an answer to that yet, and it will be important for nuancing this issue. Either way, what we are seeing is that the novel, as represented in these two collections, follows a very straightforward trajectory towards shorter sentences and words over the past two centuries. Much of that can be explained by greater reliance on dialogue, but that too is an important part of the readability story.

Why has this been the case? Commercialization, growth of the reading public…I don’t know. I think these are potential explanations, but they require more data to show causality. What I can say, based on the work I’m doing with Richard So on fan fiction, is that fan-based writing (non-professional, yet high volume) does not exhibit significantly higher readability scores than “canon” does (i.e. the novels on which fanfic is based). In other words, in this one case expanding the user/reader base doesn’t correlate with simpler texts as you might expect.

It also looks as though readability has plateaued. Perhaps we’re seeing a cultural maximum being achieved in terms of the readability of novels. Then again, only time will tell.


* The other nice thing about readability is there is a great R package called koRpus to implement it. You can access the code through GitHub here.

LLCU 255: Intro to Literary Text Mining — New Syllabus 2017

Less but better. That’s the essentialist’s motto and that’s the one I use every year when I revise my syllabus. I keep removing things and students keep learning more every year. While there is clearly a ceiling for this approach, it works remarkably well as a pedagogical tactic. Here’s the full syllabus.

This year’s class will focus on three things:

  1. understanding what text mining or literary modeling is. I am always struck by how few students have ever heard of this field.
  2. being able to undertake a variety of analytical tasks, including preparing your data, significance testing, clustering, machine learning, sentiment analysis, and social network analysis.
  3. starting to generate ideas about how to apply these tools to good questions.

It’s the last one that is always the hardest. Learning how to use R may seem intimidating at first, but devising creative models and measures for complex literary concepts is the most challenging part of this research.

The most rewarding part of this class is seeing the mental transformation of students when the light bulb goes on: oh, you mean I can test my beliefs on more than one text!?! That’s awesome!

Just Review, a student-led project on gender bias in book reviewing

For years, women have been aware that their books are less likely to get reviewed in the popular press and they are also less likely to serve as reviewers of books. Projects like VIDA and CWILA were started to combat this kind of exclusion. Over time they have managed to make some change happen in the industry. Although nowhere near parity, more women are being reviewed in major outlets than they were five or ten years ago.

Just Review was started in response to the belief that things were getting better. Just because you have more female authors being reviewed doesn’t mean those authors aren’t being pigeon-holed or stereotyped into writing about traditionally feminine topics. “Representation” is more than just numbers. It’s also about the topics, themes, and language that circulate around particular identities. In an initial study run out of our lab, we found there was a depressing association between certain kinds of words and the book author’s gender (even when controlling for the reviewer’s gender). Women were strongly associated with all the usual tropes of domesticity and sentimentality, while men were associated with public-facing terms related to science, politics and competition. It seemed like we had made little progress from a largely Victorian framework.

To address this, we created an internship in “computational cultural advocacy” in my lab focused on “women in the public sphere.” We recruited five amazing students from a variety of different disciplines and basically said, “Go.”

The team of Just Review set about understanding the problem in greater detail, working together to identify their focus more clearly (including developing the project title Just Review) and reaching out to stakeholders to learn more about the process. By the end, they created a website, advocacy tools to help editors self-assess, recommendations for further reading, and a computational tool that identifies a book’s theme based on labels from Goodreads.com. If you know the author’s gender and the book’s ISBN (or even title), you can create a table that lists the themes of the books reviewed by a website. When we did this for over 10,000 book reviews in the New York Times, we found that there are strong thematic biases at work, even in an outlet prized for its gender equality.
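The kind of table described above can be sketched in a few lines of Python. This is not the Just Review tool itself: it assumes the theme labels have already been looked up for each reviewed book, and the function name and input format are hypothetical. For each theme it reports the share of each gender’s reviewed books that carry that label.

```python
from collections import Counter, defaultdict


def theme_table(reviews):
    """Tabulate theme shares by author gender.

    reviews: iterable of (author_gender, list_of_theme_labels), one per
    reviewed book (a hypothetical input format).
    Returns {theme: {gender: share of that gender's reviewed books}}.
    """
    totals = Counter()             # reviewed books per gender
    counts = defaultdict(Counter)  # theme -> gender -> number of books
    for gender, themes in reviews:
        totals[gender] += 1
        for theme in set(themes):  # count each theme once per book
            counts[theme][gender] += 1
    return {
        theme: {g: counts[theme][g] / totals[g] for g in totals}
        for theme in counts
    }
```

Comparing shares rather than raw counts matters because outlets typically review more books by one gender than the other. A theme “shows bias” in this sketch when its shares differ markedly across genders, and deciding how large a difference counts would need a significance test.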

Topics are identified from Goodreads. Topics that show no bias are omitted.

Beyond the important findings that they have uncovered, the really salient point about this project is the way it has been student-led from the beginning. It shows that with mentoring and commitment young women can become cultural advocates. They can take their academic interests and apply them to existing problems in the world and effect change. There has been a meme for a while in the digital humanities that it is a field alienating to women and feminists. I hope this project shows just how integral a role data and computation can play in promoting ideals of gender equality.

We will be creating a second iteration in the fall that will focus on getting the word out and tracking review sites more closely with our new tools. Congratulations to this year’s team who have made a major step in putting this issue on the public’s radar.


LIWC for Literature: Releasing Data on 25,000 Documents

Increasing emphasis is being placed in the humanities on sharing data. Projects like the Open Syllabus Project, for example, have made a tremendous effort in discovering, collecting, and cleaning large amounts of data relevant to humanities research. Much of our data, however, is still locked up behind copyright and paywalls within university libraries, even when the underlying information is part of the public domain. This is one of the main inhibitors to the field’s development.

In an effort to contribute to the opening of closed humanities data, I am sharing LIWC tables for 25,000+ documents that were used in a recent study I did on “fictionality.” The documents consist of both fiction and non-fiction texts drawn from a number of different periods (the nineteenth century canon, Hathi Trust nineteenth-century documents, the twentieth century repositories of Gutenberg and Amazon, and multiple contemporary literary genres from mysteries to prizewinners) as well as two separate languages (German and English). They allow us to explore a variety of literary historical questions across broad swaths of time and place.

LIWC stands for Linguistic Inquiry and Word Count and is a lexicon-based tool that aggregates individual words into larger semantic and syntactic categories. Some of these categories, like punctuation marks or personal pronouns, are more straightforward than others, like “cognitive insight” or the thematics of “home.” But as we know in literary studies, even straightforward marks can have multiple meanings.

LIWC is far from perfect. Much work has been done to address the problems of polysemy of individual words. Nevertheless I want to make the case that it can be an effective tool for solving three problems within the computational study of literature and culture.

1) It gives us a useful way of beginning to categorize the lexical orientation of different populations of texts. Unlike topic modeling, where labels are provided after the fact, LIWC categories allow us to test hypotheses in advance. The categories are independent of the texts we are observing.

2) It gives us a way of reducing the dimensional complexity of linguistic features. Given enough documents, novels, poetry, or plays, you can easily end up with tens if not hundreds of thousands of word types when you’re building a given model. That’s often way too many variables from which to make statistical inferences. LIWC offers a very straightforward way of reducing the number of dimensions according to categories that are intrinsically relevant to the study of literature. There is plenty more to be done here to better understand the correlation between features or how much information is lost in this process. But with ca. 80 dimensions you are on much better footing for a variety of modeling tasks than you are with three- or ten-thousand.

3) Finally, these aggregate features allow us to share data that is otherwise not sharable. This is a huge problem in the humanities right now. LIWC provides a solution. Again, it’s not perfect. But it is better than keeping the data locked up.
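To illustrate the second and third points, here is a minimal Python sketch of lexicon-based aggregation. The two-category lexicon below is a made-up toy, not the LIWC dictionary (which is proprietary and far larger), and the sketch matches exact word forms only. What it shows is the general move: a document with thousands of word types is reduced to a handful of category percentages, which can be shared without sharing the text.

```python
import re
from collections import Counter

# Hypothetical toy lexicon for illustration only; not the LIWC dictionary.
TOY_LEXICON = {
    "pronoun": {"i", "you", "she", "he", "we", "they"},
    "home": {"house", "kitchen", "family", "garden"},
}


def category_profile(text, lexicon=TOY_LEXICON):
    """Return the percentage of tokens falling into each lexicon category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no tokens")
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    # One number per category, regardless of vocabulary size.
    return {cat: 100 * counts[cat] / len(tokens) for cat in lexicon}
```

A table with one row per document and one column per category is the sort of derived data set that can be released even when the underlying texts cannot.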

I hope in the future that people will do more of this kind of sharing of transformed data. While it is always better to have the underlying data so you can understand and be in control of the process of transformation (let alone collection), we can at least start to generate some shared data sets. This is a point nicely discussed by Sarah Allison in the Journal of Cultural Analytics (CA) as well as in a forthcoming piece by Andrew Goldstone.

Perhaps the most important point though is that LIWC is just one way of reducing the complexity of a text into higher-level categories, which in LIWC are largely semantically oriented. We can argue about those categories and the degree of ambiguity within them. More importantly though we need to also think beyond purely semantic models of texts (as many are increasingly doing). We need “LIWC for literature” in a different sense, i.e. in the creation of new kinds of literary features that are derived from texts that aren’t purely semantic. Here I’m thinking of plot features, character features, dialogue features, you name it. There is so much work to do to identify features specific to literary texts. These will ultimately help not only in the sharing of data, but also in the process of literary modeling more generally.

This is a core area of exciting new research that I hope people increasingly engage with.

Congratulations to Eva Portelance, ARIA Intern for 2016

Eva Portelance presented her work this past week that was completed under an Arts Undergraduate Research Internship (ARIA). Her project focuses on the computational detection of narrative frames. It involves three steps: developing a theoretical definition of a frame, writing code to detect narrative frames and comparing the results to existing methods of text segmentation, and developing literary use cases for such an algorithm.

Literary theorists have long been interested in the question of narrative frames. Whether it involves changes in point of view, time, setting, or character-clusters, narrative frames are crucial ways through which information is communicated and structured in narrative form. Being able to detect these boundaries allows us to better understand the pacing and orientation of framing in literary narratives.

Portelance’s current approach is able to detect frames with about 67% accuracy across different kinds of novels. We measure these predictions against hand-annotated frames. We have since augmented these annotations with other reader annotations and are in the process of assessing how much agreement there is among readers about when and where a frame happens. While our performance leaves room for improvement (it is a very hard task), the next best segmenting tool captures frames with about 18% accuracy!
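One common way to quantify agreement between two readers is Cohen’s kappa, which corrects raw agreement for what would be expected by chance. The Python sketch below is offered only as an illustration of that general technique; the post does not say which agreement measure the project uses. It assumes each reader’s annotations have been converted to one label per sentence (1 = a new frame starts here, 0 = it does not), which is my assumption about the data format.

```python
def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Because frame boundaries are rare compared with non-boundaries, raw percentage agreement tends to look high even between readers who disagree about most frames, which is why a chance-corrected measure is worth considering.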

Portelance’s project also includes a second dimension which involves aggregating frames into larger “plotlines.” While we’re still debating the best way to do this — and whether “plotline” is the best way to understand them — the ability to cluster material by larger narrative threads gives us the ability to understand just how narratively diverse a given novel or work of fiction might be.

It offers, we hope, one more way of beginning to account for literary expression beyond purely semantic-level analysis.