Why are non-data-driven representations of data-driven research in the humanities so bad?

One of the more frustrating aspects of working in data-driven research today is the representation of such research by people who do not use data. Why? Because those representations are not subject to the same rules of evidence. If you don’t like data, it turns out you can say whatever you want about people who do use data.

Take for example this sentence, from a recent special issue in Genre:

At the heart of much data-driven literary criticism lies the hope that the voice of data, speaking through its visualizations, can break the hermeneutic circle.

Where is the evidence for this claim? If you’re wondering who has been cited so far in the piece you can guess it’s Moretti. That’s it. Does it matter that others have made the exact opposite claim? For example, in this piece:

In particular, I want us to see the necessary integration of qualitative and quantitative reasoning, which, as I will try to show, has a fundamentally circular and therefore hermeneutic nature.

But does a single piece of counter-evidence really matter? Wouldn’t the responsible thing be to try to account for some summary judgment of all “data-driven literary criticism” and its relationship to interpretive practices?

To be concerned about the hegemony of data and data science today is absolutely reasonable and warranted. Data-driven research has a powerful multiplier effect in its ability to be covered by the press and circulate as social certainty. Projects like “Calling Bullshit” by Carl Bergstrom and Jevin West are all the more urgent for this reason.

But there is another dimension of calling bullshit that we shouldn’t overlook. It’s when people invent statements to confirm their prior belief systems. To suggest that data is omnipotent in its ability to shape public opinion misses one of the great tragedies of facticity of our time: climate collapse (a phrase I prefer to climate “change” which is too wishy washy a word for where we’re headed — “change is good!”).

In other words, calling bullshit is a multidimensional problem. It’s not just about data certainty. It’s also about certainty in the absence of data. It’s about rhetorical tactics that are used to represent phenomena without adequate evidence, something that happens all too often in the humanities these days when it comes to understanding things as disparate as the novel or our own discipline.

As authors, journal editors, peer-reviewers, researchers and teachers we need to wake up to this problem and stop allowing it to pass with a mild nod of the head. We need to start asking that hard question: Where’s your evidence for that?

Literary Text Mining Syllabus


It’s that time of year and so I’m posting the latest syllabus for my data and literature class. I have found over the years that every time I create a new class I always start with too much and gradually winnow as the years go by (until there is nothing left and I teach a new class…). This year is no different. Here are some things I’ve learned:

  • there are very few good readings on text mining for undergraduates. did you enjoy reading full-blown research articles when you were in university? every year I take more and more off because they are just too confusing.
  • apparently undergrads in the Arts hate programming. who knew?! I lose 50% of my class with the first assignment.
  • this is particularly sad because I think in the programming lies all the knowledge. Run type-token ratio and see that there are just 5% unique words in Jane Austen’s Pride and Prejudice and you have gained more of an insight into the novel than all of Jameson’s work combined could ever teach you.
  • teaching statistics on top of literary theory and computer programming really is the proverbial straw that breaks the camel’s back. I do it anyway.
  • if you’re going to do this anyway, then you need to go very slowly. You can review one study for a few weeks to understand the whole process from modelling to data selection to significance testing. Maybe one study for a whole semester.
  • ultimately I see myself waging guerrilla warfare — hopefully my students will circulate through humanities departments and constantly ask annoying questions in their other classes like, “How big is your sample?” or “Can you be more explicit about how you generalize from your one example?” Because let’s be clear, at 99-1 we’re still the underdog…
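
For anyone who wants to try the type-token ratio mentioned in the list above, here is a minimal sketch. It is not the code used in the class: the tokenizer (a simple regular expression) and the function name type_token_ratio are assumptions made for illustration, and a different tokenizer will give somewhat different numbers.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return unique word forms (types) divided by total words (tokens).

    Assumption: a "word" is any run of letters or apostrophes after
    lowercasing. This is a deliberately crude tokenizer.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Opening sentence of Pride and Prejudice, used only as a short demo.
    sample = (
        "It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife."
    )
    print(type_token_ratio(sample))
```

One caution when using it: the ratio falls as a text gets longer, because common words keep repeating, so a single sentence will score far higher than a whole novel. Comparisons between books are only fair on samples of the same length.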

Looking forward to another awesome semester of empowering students to be critical readers and creative analysts.



The New Young Adult Fiction. More Human, More Me.

What difference does an editor make?

This was the question posed by a recent profile of the highly successful editor of young adult fiction, Julie Strauss-Gabel, who manages the imprint Dutton Children’s Books. Her titles have consistently performed well over recent years and it was a timely reminder of the impact that a good editor can have on writers’ careers. In an age of dwindling resources in publishing – yet another sign of the disappearing middle – a good editor appears to make a big difference.

Here at .txtLAB, we were curious to see whether Strauss-Gabel’s books were recognizably different from other successful young adult fiction. Was the success of her books due more to extrinsic factors, like reputation or marketing, or did her list have something unique about it in terms of content? Understanding what makes these books stand out is not only a way for other editors to keep up with the competition. It could also be useful for aspiring writers who don’t have access to high-powered agents or editors (or pre-established reputations like John Grisham).


So what difference does a Dutton imprint make? To find out, we compared the 22 most recent Dutton books to a collection of 200 titles drawn from the Goodreads Best of 2014 young adult fiction list and Amazon’s list of best-selling YA fiction. We used the popular data mining tool, LIWC, developed by James W. Pennebaker, which compares linguistic features across 85 different dimensions, including grammatical features like pronouns and punctuation and more complex phenomena like social, cognitive, and perceptual processes.

The first and most salient point that we found is that there is a statistically significant difference between Ms. Strauss-Gabel’s books and other popular young adult fiction. This is not something that should be taken for granted. But her editorial sensibility has indeed produced a unique signature among her books.
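
To make "statistically significant difference" concrete, here is a minimal sketch of one way such a comparison could be run on a single feature. The post does not say which test we used, so the permutation test below is an assumption chosen because it needs few statistical prerequisites. The function name and the scores in the demo are invented for illustration and are not LIWC output.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns an estimated p-value: the share of random relabelings whose
    mean difference is at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign books to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up per-book scores for one feature (illustration only).
    imprint_scores_made_up = [6.1, 5.8, 6.4, 6.0, 5.9]
    other_scores_made_up = [4.9, 5.2, 5.0, 5.4, 4.8]
    print(permutation_test(imprint_scores_made_up, other_scores_made_up))
```

With 85 features in play, running a test like this on each one separately would call for some correction for multiple comparisons before any single result is trusted.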

When we looked more closely at which features made her books stand out, we found Dutton books were defined by stylistic aspects like the use of first person pronouns (I, We), a vocabulary of inclusivity (and, with, plus), and a greater use of conjunctions and commas, suggesting more complex sentence structure (sentences were also longer on average). An emphasis on time-words also emerged, suggesting a greater degree of narrative sophistication (or at least diversity). Finally, the Dutton books tended to focus more explicitly on “humans” (adults, boys, girls, etc.), suggesting an investment in the description of people rather than emotions. Perhaps surprisingly, the only significant theme to emerge was “money.”

Recent discussions of Ms. Strauss-Gabel’s list have indeed emphasized the sophistication of her taste (the Times spoke of her “high-quality”), and we can see this reflected in the way her books rely on longer and more complex sentences, an important point if you’re trying to guide your own young adult towards more high-brow material. More interesting, and so far unnoticed, is the way her books are more focused on individuals as well as social belonging. The features that were most indicative of non-Dutton books, for example, were body-related words (heart, head, hands), words related to feelings (caress, feel, grab), and a range of negative emotions (anger and anxiety being the highest). These struck me as the more stereotypically teenage: an attention to physique, physical sensation, and emotional negativity captures the adolescent imaginary rather well. It is decidedly interesting that the Dutton books don’t fit this mold, indicating a potential new trend towards more humanistic YA fiction, away from the dystopian worlds of The Hunger Games and fantasy conflict.

These trends became even more pronounced when we subsetted our Amazon list by the top- and bottom-selling books. Top sellers emphasized death far more than their bottom counterparts, as well as the pronouns “we” and “they,” suggesting a highly binary, and collective, moral universe. They focused more on space and the present tense, while the lesser-selling books focused on the past tense, positive emotions, friends, religion, sex and family. If you wanted to make it to the top of the heap in the last few years in young adult fiction, then your best bet was to go negative, stay in the present, and avoid families and positive emotions. Once again, the Dutton books seem to stand out in how different they are.

One question we had was whether Ms. Strauss-Gabel’s list was more diverse than other samples of young adult fiction. As the New York Times noted, “Ms. Strauss-Gabel’s books are strikingly diverse.” According to three different measures of similarity, we found that the Dutton books tended to be significantly less similar to each other than comparable sample sizes of young adult fiction from our other collections. There appears to be more linguistic range to the Dutton books than in the typical subset of the genre as a whole, one more indication of these books’ sophistication.
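
As a rough illustration of what "less similar to each other" can mean, here is a minimal sketch that averages the similarity of every pair of texts in a group. The post does not name the three measures we used, so cosine similarity over raw word counts is an assumption, as are the function names and the simple tokenizer.

```python
import math
import re
from collections import Counter
from itertools import combinations


def word_counts(text):
    """Count lowercase word forms (crude regex tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 to 1)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(texts):
    """Average cosine similarity over every pair of texts in a group."""
    vectors = [word_counts(t) for t in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two texts")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

A lower average means a more varied group. To keep the comparison fair, the same function would be run on many random subsets of the larger collection, each the same size as the Dutton list, so that sample size alone does not drive the result.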

This last point made us curious whether the sophistication of Dutton Books indicated that they were in fact more “adult” than “young adult.” As the Times pointed out, more and more adults are reading YA fiction these days. Is this a dumbing down of readers or a scaling-up of the genre? When we compared our young adult fiction collection with a collection of bestselling fiction, it turned out that the non-Dutton books tended to be slightly more similar to their adult counterparts, though the significance of this was minimal. What this suggests is that the Dutton books in particular are not mimicking bestselling adult books in any overtly recognizable way. Indeed, young adult fiction in general tends to look more like itself than adult bestsellers. While many people are critical of the rising popularity of young adult books, I was pleased to see that genre differences continued to exist for readers of different ages. Young adult doesn’t seem to be blending into adult (or vice versa), at least for now.

These were just some of the things that we were able to learn in our lab in a few days studying these books. The insights are of course more coarse than a highly trained human reader might be able to offer. But they are also more generalizable and less dependent on individual judgment. Every writer knows someone whom she can ask for an opinion of her manuscript. But not every writer has access to an understanding of a genre as a whole or trends in readers’ taste. This is where computers can be useful and, I feel, democratizing. They can make something as complex as the publishing industry – which can look like a secret society from the outside – seem more transparent.