Are novels getting easier to read?

I’ve been experimenting with readability metrics lately (code for the below is here). They offer a very straightforward way of measuring textual difficulty, usually as some ratio of sentence and word length. They date back to the work of Rudolf Flesch, who developed the “Flesch Reading Ease” metric. Today, there are over 30 such measures.

Flesch was a Viennese immigrant who fled Austria from the Nazis and came to the U.S. in 1938. He ended up as a student in Lyman Bryson’s Readability Lab at Columbia University. The study of “readability” emerged as a full-fledged science in the 1930s, when the U.S. government began to invest more heavily in adult education during the Great Depression. Flesch’s insight, based on numerous surveys and studies of adult readers, was simple. While there are many factors behind what makes a book or story comprehensible (i.e. “readable”), the two most powerful predictors are sentence length and word length. The longer a book’s sentences, and the more long words it uses, the more difficult readers will likely find it. Flesch distilled this insight into a single predictive, and somewhat bizarre, formula:


Reading Ease = 206.835 - 1.015 (W / St) - 84.6 (Sy / W)

where W = # words, St = # sentences, Sy = # syllables
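For readers who want to see the moving parts, here is a minimal Python sketch of the formula. It is illustrative only: the syllable counter is a crude vowel-group heuristic, so its scores will differ somewhat from dictionary-based tools like the koRpus package I rely on for the actual analysis.

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels,
    # then drop one for a (probably) silent final 'e'.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text):
    # Split into sentences and words, then apply Flesch's formula.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    W, St = len(words), len(sentences)
    Sy = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (W / St) - 84.6 * (Sy / W)
```

Longer sentences and more syllables per word push the score down, which is exactly the behavior the formula is meant to capture.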


According to Flesch’s measure, Rudyard Kipling’s The Jungle Book has a higher readability score (87.5) than James Joyce’s Ulysses (81.0). Presidential inaugural speeches have been getting more readable over time. The question I began to ask was: have novels been getting more readable as well?

The answer, at first glance, is yes. Considerably so. Below you see a plot of the mean readability score per decade for a sample of ca. 5,000 English-language novels, drawn from the Stanford Literary Lab collection and the Chicago Text Lab. The higher the value, the more “readable” (i.e. less difficult) a text is assumed to be. The calculations are made by taking twenty 15-sentence sample passages from each novel and calculating the Flesch reading ease for every passage. For every decade I then use a bootstrapping process to estimate the mean reading ease for that decade. Error bars give you some idea of the variability around the mean per decade. What this masks is a very high variability at the passage level. Nevertheless, the overall average is clearly moving up in significant ways.
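The bootstrapping step is simple enough to sketch. The following hypothetical Python function (stdlib only; my actual code is in R) resamples a decade’s passage-level scores with replacement and returns a point estimate plus a 95% percentile interval, which is what the error bars report:

```python
import random

def bootstrap_mean(scores, n_boot=1000, seed=42):
    # Resample the passage scores with replacement many times,
    # recording the mean of each resample.
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    # Point estimate plus a 95% percentile interval for error bars.
    return (sum(means) / n_boot,
            means[int(0.025 * n_boot)],
            means[int(0.975 * n_boot)])
```

Running this once per decade over that decade’s pooled passage scores yields the series plotted above.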

One question that immediately came to mind was the extent to which these scores are being driven by an increase in dialogue. Dialogue is notably “simpler” in structure, with considerably shorter sentences and potentially shorter words to capture spoken language. I wondered whether this might be behind the change.

Below you see a second graph with the percentage of quotation marks per decade. Here I simply calculated the number of quotation-mark pairs per novel and used bootstrapping to estimate the decade mean. As you can see, they rise in very similar fashion, though with a noticeable break where the two data sets are joined together. Mark Algee-Hewitt has a lot to say on this issue of combining data sets. It’s interesting that typographic features like quotation marks are far more problematic in this respect than something more complex like “readability.” A lot also depends on my very simple way of modelling dialogue. It could just be that quotation marks become more standardized over time and thus appear more frequently, but I don’t think that’s entirely the case. Either way, this could definitely use improvement.
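The tally itself can be sketched along these lines. This hypothetical Python version (not my actual counting code) counts straight and curly double-quote pairs and normalizes per 1,000 words, which makes novels of different lengths comparable; it inherits all the bluntness just described:

```python
import re

def dialogue_share(text):
    # Count double-quote pairs as a crude proxy for dialogue.
    # Straight quotes come in interchangeable pairs; curly quotes
    # have distinct opening and closing characters.
    straight = text.count('"') // 2
    curly = min(text.count('\u201c'), text.count('\u201d'))
    words = len(re.findall(r"\w+", text))
    pairs = straight + curly
    # Rate per 1,000 words, guarding against empty input.
    return 1000 * pairs / max(words, 1)
```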

With these caveats in mind, there is a very strong correlation between the number of quotation marks used per decade and the readability of novels (r = 0.86). It suggests that dialogue is a big part of this shift towards more readable novels.
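For reference, the correlation here is an ordinary Pearson r computed over the paired decade means, which needs no libraries at all:

```python
import math

def pearson_r(xs, ys):
    # Plain Pearson correlation between two equal-length series
    # (undefined if either series is constant).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Feeding in the decade-level readability means as one series and the quotation-mark means as the other yields the r = 0.86 reported above.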

But what if we remove dialogue? Are novel sentences outside of dialogue getting simpler, too?

I don’t have an answer to that yet. While it will be an important facet for nuancing this issue, what we are seeing either way is that the novel, as represented in these two collections, follows a very straightforward trajectory towards shorter sentences and words over the past two centuries. Much of that can be explained by a greater reliance on dialogue, but dialogue too is an important part of the readability story.

Why has this been the case? Commercialization, growth of the reading public…I don’t know. I think these are potential explanations, but they require more data to show causality. What I can say, based on the work I’m doing with Richard So on fan fiction, is that fan-based writing (non-professional, yet high-volume) does not exhibit significantly higher readability scores than the “canon” does (i.e. the novels on which fanfic is based). In other words, in this one case expanding the user/reader base doesn’t correlate with simpler texts as you might expect.

It also looks as though readability has plateaued. Perhaps we’re seeing a cultural maximum being achieved in terms of the readability of novels. Then again, only time will tell.


* The other nice thing about readability is that there is a great R package, koRpus, that implements it. You can access the code through GitHub here.

Congratulations to Eva Portelance ARIA Intern for 2016

Eva Portelance presented her work this past week that was completed under an Arts Undergraduate Research Internship (ARIA). Her project focuses on the computational detection of narrative frames. It involves three steps that include a theoretical definition of a frame, writing code to detect narrative frames and comparing those to existing methods of text segmentation, and developing literary use cases for such an algorithm.

Literary theorists have long been interested in the question of narrative frames. Whether it involves changes in point of view, time, setting, or character-clusters, narrative frames are crucial ways through which information is communicated and structured in narrative form. Being able to detect these boundaries allows us to better understand the pacing and orientation of framing in literary narratives.

Portelance’s current approach is able to detect frames with about 67% accuracy across different kinds of novels. We measure these predictions against hand-annotated frames. We have since augmented these annotations with other reader annotations and are in the process of assessing how much agreement there is among readers about when and where a frame happens. While our performance leaves room for improvement (it is a very hard task), the next best segmenting tool captures frames with about 18% accuracy!

Portelance’s project also includes a second dimension which involves aggregating frames into larger “plotlines.” While we’re still debating the best way to do this — and whether “plotline” is the best way to understand them — the ability to cluster material by larger narrative threads gives us the ability to understand just how narratively diverse a given novel or work of fiction might be.

It offers, we hope, one more way of beginning to account for literary expression beyond purely semantic-level analysis.


I am pleased to announce the publication of a new piece I have written that appears today in CA: Journal of Cultural Analytics. The aim of the piece is to take a first look at the ways in which fictional language distinguishes itself from non-fiction using computational approaches. When authors set out to write an imaginary narrative as opposed to an ostensibly “true” one, what kinds of language do they use to signal such fictionality? One of the interesting findings that the piece offers is the way such signalling has remained remarkably constant for the past two centuries. Using a classification algorithm trained on nineteenth-century fiction, we can still predict contemporary fiction with above 91% accuracy (down from about 95% when tested against data from its own time period). These results hold across at least one other European language (German). In the future I hope to be able to test more languages to better understand just how constant such fictional discourse can be said to be.
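The train-on-one-period, test-on-another setup can be illustrated with a toy sketch. To be clear, this is not the classifier used in the piece; it is a hypothetical nearest-centroid model over relative word frequencies, with invented snippets, shown only to make the logic of cross-period prediction concrete:

```python
from collections import Counter

def vectorize(texts, vocab):
    # Represent each text as relative word frequencies over a fixed vocabulary.
    vecs = []
    for t in texts:
        counts = Counter(t.lower().split())
        total = sum(counts.values()) or 1
        vecs.append([counts[w] / total for w in vocab])
    return vecs

def centroid(vecs):
    # Average the feature vectors of one training class.
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def classify(vec, centroids):
    # Assign the vector to the class with the nearest centroid.
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(vec, centroids[label]))

# Toy example: "train" on one period's snippets, then predict an unseen
# snippet from another period (all texts invented for illustration).
vocab = ["felt", "saw", "heard", "report", "data", "percent"]
fiction = ["she felt cold and saw the sea", "he heard a voice and felt fear"]
nonfiction = ["the report lists data and percent figures",
              "data in the report show percent change"]
models = {"fiction": centroid(vectorize(fiction, vocab)),
          "nonfiction": centroid(vectorize(nonfiction, vocab))}
test_vec = vectorize(["they felt the rain and heard thunder"], vocab)[0]
label = classify(test_vec, models)
```

The point of the design is that the centroids are fixed at training time, so any accuracy on later texts reflects features that remained stable across periods.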

In addition to seeing the constancy of these features across time and languages, the piece also highlights the specific nature of those features. As I argue in the piece, fictional language distinguishes itself most strongly by an attention to a phenomenological investment: an attention to a language of sensing and perceiving embodied individuals. It is this heightened focus on sense perception — the world’s feltness — that makes fiction stand out as a genre. When we look at the ways novels in particular distinguish themselves from other kinds of fictional texts, we see a very interesting case of a language of “doubt” and “prevarication” emerge, suggesting that the novel does not put us into the world in a fundamentally realist way, but inserts people into the world in a skeptical, testing, hypothetical relationship to the world around them.

This piece is part of a nascent project to use computation to better understand creative human practices. The aim is not to replace human judgments about literary meaning or quality, but to make more transparent the semantic profiles of different types of cultural practices. Computation can be a useful tool in showing us how different cultures use different kinds of writing to convey meaning to readers over time. It helps us transcend the impressionistic ideas we develop when we read a smaller sample of novels or stories and test the extent to which these beliefs hold across much broader collections of writing.

While the original text data could not be shared in this project, all derived data has been shared as part of the article. One of the advantages of using non-word-based feature sets, as I do in the piece, is that the derived data can then be freely shared.

Identity: NovelTM Annual Workshop 2016

I am very pleased to announce the upcoming workshop for the NovelTM research group. This year’s theme is “Identity” and will be taking place at the Banff Research Centre in Banff, Alberta. For two days participants will meet and share new work that uses computational modelling to understand the various ways that novels construct identity — both through the fictional entities that populate novels and the actual readers whose identities are constructed through the large-scale configuration of different types of writing. This year’s special guest lecture will be offered by David Mimno.

Last year’s papers will be available next week at CA: Journal of Cultural Analytics.

This year’s participants will be presenting on the following topics:

  • Elizabeth Evans and Matthew Wilkens, “Race and Geography of London Writing, 1880-1940.”
  • Hoyt Long, Anthony Detwyler, and Yuancheng Zhu, “Repetition Madness in Modern Japanese and Chinese Fiction.”
  • Laura Mandell, “Gender and the Novel.”
  • Matt Erlin and Lynne Tatlock, “Reading Formations in Late Nineteenth-Century America: The Case of the Muncie Public Library.”
  • Mark Algee-Hewitt, J.D. Porter and Hannah Walser, “Representing Race and Ethnicity in American Fiction, 1789-1964.”
  • Richard Jean So, Hoyt Long, and Yuancheng Zhu, “Modeling White-Black Literary Relations, 1880-2000.”
  • Susan Brown, Abigel Lemak, Colin Faulkner, and Rob Warren, “Cultural Formations: Translation and Intersectionality in a Linked Data Ontology.”
  • Ted Underwood and David Bamman, “The Gendering of Character in English-Language Fiction.”
  • Andrew Piper and Hardik Vala, “Characterization.”


CA Fall Preview: Food, Folklore and Lots of Novels

We have some exciting new material that will be appearing shortly in CA: Journal of Cultural Analytics, which I thought I would share here.

Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith will have a new piece out on the relationship between food menus and social class. As they argue in their piece, “The language used to discuss food offers a powerful way to examine associations of culture and identity and better understand how our cultural norms around food are shaped by the lens of class status and economics. We address this question by exploring Bourdieu’s notions of distinction as reflected in the language of US restaurant menus. Our study uses 5 million words of dish descriptions from 6500 menus drawn from 7 US cities, comparing the different linguistic strategies used in restaurants in four different price classes.”

Timothy Tangherlini, David Mimno, and Peter Broadwell have a new piece on classification and folklore that takes an in-depth look at the long history of folklore classification and what happens when a machine gets hold of these systems. In addition to providing a practical tool for identifying intersections between folkloric stories, they also show how confusion matrices can be used as a tool for understanding the ways in which stories can occupy multiple different generic categories at once. As they write in their piece challenging the prevailing CS view of ground truth labels, their approach “highlights the fundamentally different perspective that humanists have on classification as a tool. Our goal is not to create a system that mimics the decisions of a human annotator, but rather to better represent the porous boundaries between labels.”

The NovelTM research group will be producing a cluster on “The Novel and Genre,” with contributions from Ted Underwood, Matthew Wilkens, Matthew Jockers and Gabi Kirilloff, Matt Erlin, Andrew Piper, Mark Algee-Hewitt and his team Laura Eidem, Ryan Heuser, Anita Law, and Tanya Llewellyn. The aim of this cluster is to address how literary scholars have historically grouped novels, whether as subcategories like detective fiction, gothic fiction, white-male fiction, or as marketing devices used by publishers in the eighteenth century (the tale, romance, history), or as subject to different kinds of characterization, or even more fundamentally through the novel’s distinction from non-fiction. In each article, the cluster will explore how computational approaches can shed light onto the coherence and affinities between novels and between different kinds of groupings of novels. What does such a computational understanding of genre allow us to do and say about the history of the novel?

Is data good for creative writing? My interview with @DIYMFA Radio


This past Spring I did an interview with Gabriela Pereira, host of @DIYMFA Radio. It’s a fantastic series of podcasts for aspiring writers who want to learn more about the craft without paying enormous sums of money to attend an MFA program.

In the interview, we talk about how and whether data can be useful for creative writers. It’s clear a lot of new computational work is coming out that tries to predict whether a book will be successful or not. In my own work, I’ve been interested in trying to assess whether degrees like the MFA have a noticeable impact on the style of novels. In that piece, we found no noticeable differences, suggesting that the MFA does not set writers apart from those who forego the degree.

The idea we discuss in the interview is how these tools can help us be more creative as writers rather than just help publishers weed out their slush pile or come up with marketing budgets. As I discuss in the interview, data can make us more self-conscious as writers, teaching us about stylistic weaknesses or giving suggestions about how to expand our plots or make our characters more complex. At the same time, it can also help writers better understand the markets they are writing for. Publishing can often seem like a black box to those on the outside. Data can make that world more transparent.

You can locate the interview here.

How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading


This new essay published in Post45 is about the relationship between prizewinning novels and their economic counterparts, bestsellers. It is about the ways in which social distinction is symbolically manifested within the contemporary novel and how we read social difference through language. Not only can we observe very strong stylistic differences between bestselling and prizewinning writing, but this process of cultural distinction appears to revolve most strongly around the question of time. The high cultural work of prizewinning novels appears most strongly defined by an attention to childhood, nature, and retrospection, while the economic work of bestsellers is defined by a diligent attention to the moment. As the forthcoming work of James English has shown, it is these temporal frameworks, or what Bakhtin might have called “chronotopes,” that emerge as some of the more meaningful ways to distinguish the work of cultural capital from that of economic capital.

The approach we use draws on the emerging field of textual analytics within the framework of Bourdieu’s theory of the literary field. Our interest lies in exploring a larger population of works, but also the ways in which groups of works help to mutually define one another through their differences. As Bourdieu writes, “Only at the level of the field of positions is it possible to grasp both the generic interests associated with the fact of taking part in the game and the specific interests attached to different positions.” We wanted to test the extent to which “bestsellers” and “prizewinners” cohere as categories and how this coherence may be based on meaningful, and meaningfully distinguishing, textual features.

Our aim is to help us see the values and ideological investments that accrue around different cultural categories (the act of “position taking,” in Bourdieu’s words) and the ways in which these horizons of expectation help maintain positions of power and social hierarchy. Our project is ultimately about asking how forms of social and symbolic distinction correspond and how that knowledge may be used to critique normative assumptions about what counts as significant within the literary field. The high cultural investment in retrospection that we see on display should not be taken as a default but can also be seen in a critical light, as privileging a more regressive, backward-looking narrative mode.

For the full article, you can go here.