I made a list the other day of all of the letters, names, and new terms I have had to learn to undertake the computational study of literature and culture. It was very long. It made me realize that when researchers speak of the “bilingualism” of interdisciplinary work, we should take this idea very literally. I feel like I’m learning German all over again. It started as a novelty (Ich is so funny sounding!), then a frustration (I have no idea what you’re saying), and then magically you could do something with it (ich hätte gern ein Bier). And then you waited, and waited, and waited until you stopped noticing you were thinking in this other thing.
I listened to a beautiful podcast the other day by Chimamanda Ngozi Adichie on “the danger of the single story.” Her point was that when we only tell one kind of story about a person or a place we cheapen our understanding. She began with her experience as an African writer, one who all too often only hears one kind of story about a whole continent. As she remarked, it’s not that stereotypes aren’t true, it’s that they make one story the only story. Having more than one story gives us a richer understanding of the world.
How does this connect to the lab? Well, our aim is to use quantity to better understand literature and creativity. Adichie’s point is that when we focus on single things we get locked into single versions of them. We need quantity to help envision and imagine alternatives. Often when we use quantity we do so to reduce diversity into a single summary-like assessment. Fan fiction tends to look this way or the nineteenth-century novel behaves this way. I’m hoping that as we move forward we can begin to locate the diversity within quantity, the different kinds of stories that are available to us within the large quantities of stories that we have been telling ourselves for centuries. The goal is to make our generalizations more flexible, while still being based on something more than our personal beliefs or single pieces of evidence.
Katherine Bode has written an excellent new piece asking us to reflect more on the data we use for computational literary studies. Her argument is that many of the current data sets available, which rely on date of first publication as a criterion for selection, miss the more socially embedded ways literary texts have circulated in the past.
Her thinking is deeply informed by the fields of bibliography, book history, and textual studies, most importantly by the work of D.F. McKenzie. McKenzie was the one who first showed how New Critical reading practices that used modern editions to build their arguments missed the historical specificity of the texts they were analyzing. Instead McKenzie wanted us to think of the sociology of texts, the ways in which texts (as books, manuscripts, flyers, illustrations) circulate in social settings.
Bode’s intervention is coming at a crucial time. For people working in the field, there is an increasing awareness of the different ways data sets represent their objects of study. We’re well past the days of believing that having “many” books solves the problem of historical representation. Bode’s piece suggests two important directions for future study, which I would put under the headings of “historical representation” and what Matthew Lincoln has called the problem of “commensurability” (perhaps also known as the problem of “sample bias”).
Bode’s first point is that representing literary history as a collection of works with dates of first publication ignores much of the way those works existed in the world. Maybe they were reprinted numerous times, or anthologized in parts, or began as serial productions, or were heavily revised in later editions. In 1910, people did not only read books published in 1910. This circulatory identity of literature privileges in many ways a more “reader-centred” point of view. The question is less what texts writers wrote when than what texts were available to readers and in what forms. I have a whole book about the impact that material format plays on literary production in the nineteenth century, so I am deeply sympathetic to this point of view.
Bode gives us concrete ideas about how to build data sets that are more attuned to these historical frameworks. “Fiction in newspapers” is the particular answer she gives, but there are plenty more options — looking at specific library collections, household collections, books that were reviewed in the periodical press, or bestseller lists or prizewinning lists in the twentieth century. These all put us in touch with the historical filters that shaped when, where, and how books mattered to readers.
As historians of reading know, however, just having a representation of what circulated does not quite get at “reading.” We still do not know what readers did with these books: how many of them were actually read, whether they were read repeatedly, quickly, or fragmentarily, gifted, regifted, shared, burned, or used as doorstops. Bode’s suggested approach is important and useful because of the way it allows us to observe “textual availability” or even “textual circulation” in a specific time and place. But it is equally important to see it as only one possible solution to the problem of historical representation, one more centred on reception than production. It does indeed orient us more towards a reading environment, but it stops short of being able to understand readers or reading. For this, other kinds of data would be needed (along the lines of the new study by my colleagues Matthew Erlin and Lynn Tatlock on reader behavior in a lending library in Muncie, Indiana).
If Bode’s example is both useful and limited in equal measure, the example she gives of the problem she wants to solve — the data set of first editions — is far more legitimate than she makes it out to be. The aim of structuring data in this way is to focus on writerly behaviour — what stylistic tendencies were available at what points in time. Dating novels in this way is no different from a critical edition that organizes its poems by composition date — in each case we are trying to recreate the process through which writing changed over time, putting a fixed marker in the ground for every poetic output. Like the eclectic editions of earlier bibliographic traditions, such collecting practices aim to recover not the textual environment in all of its complexity, but one regulated by a sense of temporal change. Such an approach overplays, to be sure, the historical specificity of texts and dates — did it all happen in that year? And as we know, poets’ works change, too, so what about variants? Trying to capture writerly behaviour through first editions misses all of the pre- and post-publication work, the messiness of creativity that was once upon a time the object of text-genetic criticism (a field that is, interestingly, not discussed by Bode).
But the point is, all data sets have limitations. Each data set will represent a set of historical transactions differently, and each has limits on what it can and cannot tell us about the past. A text-genetic approach will tell us something about the developmental process of works with respect to authors and their intervening agents of editors, readers, and booksellers. A first edition approach will allow us to approximate new items that enter into the literary field while ignoring questions of penetration and circulation (how many were printed, how many were bought, how many were read). And Bode’s approach will allow us to better understand this circulatory world of what’s “out there” at a given time in a given medium.
This brings me to the concerns I have with how Bode frames the issue. I would have thought it went without saying that using Moretti today as a proxy for the state of the field is no longer acceptable. Want to find an outrageous quote that informs no one’s work today? Use Moretti. But if you want to understand what people are actually doing, then you need to turn elsewhere. Bode makes the claim that early practitioners did not share their data. Fair enough. But the new journal, Cultural Analytics, does. It is over a year old (full disclosure: Bode is on the board). We have an entire dataverse established where authors deposit their code and their data for others to use and review. I personally just released tables of derived data on over 25,000 documents from the nineteenth century to the present, with features drawn from LIWC. Again, it’s not perfect, but it’s definitely a start.
Similarly, to suggest that current cultural analysts imagine that their data sets stand unproblematically for some larger “whole” or population is an unfair representation. Ted Underwood tests multiple different sets of data according to different bibliographic criteria of selection in his piece, “The Life Cycles of Genres.” I test no fewer than 17 different data sets to better understand the uniqueness of fiction-writing in the past two centuries in my piece, “Fictionality.” Peter M. Broadwell et al. test a single data set of Danish folklore, but they do so against previous scholars’ classifications to better understand how machinic labels (mis)align with scholarly judgments. In none of these cases do the authors think their data represents something stable or definitive. They are all aware of the contingency of their data sets, of their partial ability to capture some aspect of history. And they build that contingency into their methodology. Of course we could do more, we can always do more. But we first need to acknowledge the existence of the work that is actually happening.
All of this was clearly stated (well, clear to me) in my introduction to the journal Cultural Analytics, where I write:
This then is one of the major contributions, and challenges, of cultural analytics. Rather than abandon generalization, the task before us is to reflect not simply on the acts of cultural representation, of Auerbach’s notion of “represented reality,” but on the representativeness of our own evidence, to reconstruct as contingently as possible the whole about which we are speaking. Instead of embodying the whole like the cultural critic – having read the entire archive or seen all the images – the cultural analyst focuses instead on the act of construction itself. The cultural analyst is self-conscious about being implicated in the knowledge that is being created.
Similarly, in a piece from 2015, I tried to provide a model of the process of literary modeling that showed just how circular and contingent the relationship between part (data) and whole (history) was (Fig. 1). And I have a new piece forthcoming in PMLA that lays out the contingencies of representation that flow through the entire process of data modeling.
Once we acknowledge the contingency of data, however, a major issue is raised, one that, as Matthew Lincoln has pointed out, is omitted from Bode’s piece: that of commensurability. How can we assess how these various contingent representations relate to one another? What methods can we use to account for the different kinds of answers that different kinds of data representations give to our questions? Bode’s piece stops here, ironically suggesting that one data set is enough, the one she is building from Australian newspapers. It may be the case that she has access to “all” newspapers ever printed in Australia (though I’d be surprised). But are they all equally accessible in terms of textual quality (OCR) and what about other types of representations of fiction, say, books? Small presses? Manuscript circulation?
The point is that there is nothing wrong with the data set Bode wants to use, but in its singularity — in its singular standing for history — it risks running into the very same problem she accuses Moretti of. We absolutely need methods that reflect on the “representativeness” of data and how different representations differ. That is our job as cultural historians. Far from being discredited by this point of contingency, data offers critical tools to make these assessments rather than take information at face value, as the New Critics did with their paperback editions.
If there is one larger point to take away from all of this, it is that the whole process of data modeling is very messy and complicated. We really need to get past the discourse of finger-pointing and move towards one that is mutually supportive and acknowledges the work that is being done, rather than citing straw men from the past. Building data sets takes a lot of time. There are tons of interpretive questions built into the process. Feeling unsatisfied with a data set is the default, especially in the humanities given our very poor infrastructure for data. And building new methods takes a lot of time. There are tons of interpretive questions built into the process. Feeling unsatisfied with methods is also a default, especially given how rudimentary all of this is.
But waiting for the perfect data set or the perfect model is a bit like waiting for the white whale. And thinking that one data set solves all problems is equally problematic. People should try to build data and models to answer questions they care about and be supportive of the work other people are doing. It’s not going to happen overnight and there is no one right answer. More than better data sets, we need a better discourse about how to engage with each other’s work, because there is a lot of ongoing effort out there.
For years, women have been aware that their books are less likely to get reviewed in the popular press and they are also less likely to serve as reviewers of books. Projects like VIDA and CWILA were started to combat this kind of exclusion. Over time they have managed to make some change happen in the industry. Although nowhere near parity, more women are being reviewed in major outlets than they were five or ten years ago.
Just Review was started in response to the belief that things were getting better. Just because you have more female authors being reviewed doesn’t mean those authors aren’t being pigeon-holed or stereotyped into writing about traditionally feminine topics. “Representation” is more than just numbers. It’s also about the topics, themes, and language that circulates around particular identities. In an initial study run out of our lab, we found there was a depressing association between certain kinds of words and the book author’s gender (even when controlling for the reviewer’s gender). Women were strongly associated with all the usual tropes of domesticity and sentimentality, while men were associated with public facing terms related to science, politics and competition. It seemed like we had made little progress from a largely Victorian framework.
To address this, we created an internship in “computational cultural advocacy” in my lab focused on “women in the public sphere.” We recruited five amazing students from a variety of different disciplines and basically said, “Go.”
The team of Just Review set about understanding the problem in greater detail, working together to identify their focus more clearly (including developing the project title Just Review) and reaching out to stakeholders to learn more about the process. By the end, they created a website, advocacy tools to help editors self-assess, recommendations for further reading, and a computational tool that identifies a book’s theme based on labels from Goodreads.com. If you know the author’s gender and the book’s ISBN (or even title), you can create a table that lists the themes of the books reviewed by a website. When we did this for over 10,000 book reviews in the New York Times, we found that there are strong thematic biases at work, even in an outlet prized for its gender equality.
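The gender-by-theme tally behind such a table can be sketched in a few lines of Python. This is a toy illustration, not the Just Review tool itself: the records and theme labels below are invented stand-ins for what would actually be pulled from Goodreads labels and a review outlet.

```python
from collections import Counter

# Hypothetical review records: (author gender, Goodreads-style theme labels).
reviews = [
    ("f", ["romance", "family"]),
    ("m", ["politics", "war"]),
    ("f", ["family", "memoir"]),
    ("m", ["science", "politics"]),
]

def theme_counts_by_gender(records):
    """Tally how often each theme label appears per author gender."""
    counts = {}
    for gender, themes in records:
        counts.setdefault(gender, Counter()).update(themes)
    return counts

table = theme_counts_by_gender(reviews)
# table["f"] and table["m"] now hold per-gender theme frequencies,
# ready to be compared for the kind of thematic skew described above.
```

Scaled up to thousands of reviews, the same tally is what lets an editor self-assess whether, say, "family" clusters around female authors and "politics" around male ones.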
Beyond the important findings that they have uncovered, the really salient point about this project is the way it has been student-led from the beginning. It shows that with mentoring and commitment young women can become cultural advocates. They can take their academic interests and apply them to existing problems in the world and effect change. There has been a meme for a while in the digital humanities that it is a field alienating to women and feminists. I hope this project shows just how integral a role data and computation can play to promote ideals of gender equality.
We will be creating a second iteration in the fall that will focus on getting the word out and tracking review sites more closely with our new tools. Congratulations to this year’s team who have made a major step in putting this issue on the public’s radar.
Eleanor & Park is a beautiful young adult novel about two kids who fall in love after meeting, as so many kids do, on the school bus. It also contains a perfect challenge for the computational study of culture. Think of it as an alternative to the Turing test.
Here’s the back story. We’ve begun studying young adult novels in our lab and as part of that project we read Eleanor & Park. The novel is interesting because it is about not fitting in socially — who doesn’t identify with that as a teenager? — and also about the really intense attachments that kids develop when they first fall in love (ditto). But as it progresses (spoiler alert), Eleanor and Park are separated by social forces larger than themselves. It ends on a sweet, and sad, note of her sending him a postcard after a long stretch of silence between them. Park finally gets something in the mail after months of thinking about her. “Eleanor hadn’t written him a letter, it was a postcard. Just three words long.”
Yeah, try not getting all goose-bumpy on that one. But the point is that for you and me this is a really easy linguistic puzzle to solve. We know exactly what she wrote. That slight delay in recognition followed by the absolute certainty of what it said provides the ultimate emotional rush. This is literature in a nutshell — saying something deeply emotionally resonant in a slightly indirect way. We humans just eat this stuff up.
But could a computer figure this out? Now I know, as computational challenges go, this one is relatively easy to code for:
i <- c("I love you")
But would it be possible to infer these three words just from the context without any rules other than knowledge of the characters, the novel’s plot, and the limit of three contiguous words (“trigram” for short)? And if so, wouldn’t this be a better challenge than the Turing test? Rather than guess whether a piece of software is “human,” isn’t a more realistic first step the ability to infer emotional expressions that are latent within a text?
While this challenge might be relatively easy to solve I think the larger point is an important one for those of us who want to study culture computationally. One of the biggest challenges is how to infer latent meaning in highly symbolic documents. Websites and scientific articles are often very explicit in terms of what they are “about”. But a poem? Or a novel? Not so much. How can we develop methods that allow us to capture these deeper truths about our creations, the way some words point to other words? If we could, wouldn’t this be one of the criteria for making progress on the way towards creating emotionally intelligent beings?
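To make the challenge concrete, here is a minimal sketch of one naive strategy: score candidate trigrams against the surrounding context with a smoothed bigram model. Everything here is invented for illustration (the `context` string is a stand-in for the novel's actual text, and the candidate list would in reality be enormous); a serious attempt would need far richer models of plot and character.

```python
from collections import Counter

# Stand-in for the novel's text; a real attempt would use the full book.
context = "park read the card it said i love you and he smiled".split()

# Bigram and unigram counts drawn from the context.
bigrams = Counter(zip(context, context[1:]))
unigrams = Counter(context)

def score_trigram(w1, w2, w3):
    """Score a three-word candidate by chained, add-one-smoothed bigram
    probabilities, so unseen word pairs are unlikely but not impossible."""
    def p(a, b):
        return (bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
    return p(w1, w2) * p(w2, w3)

candidates = [("see", "you", "soon"), ("i", "miss", "you"), ("i", "love", "you")]
best = max(candidates, key=lambda t: score_trigram(*t))
# best → ("i", "love", "you")
```

Of course, this toy only works because the invented context already contains the answer's word pairs; inferring a *latent* trigram, one never stated in the text, is exactly where the challenge gets interesting.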
I am pleased to add this year’s syllabus for my graduate course, “LLCU 614, Cultural Analytics: The Computational Study of Culture.” The aim of the course is twofold: 1) to introduce students in the humanities to the computational and quantitative methods for studying culture in order to move beyond the use of anecdotal evidence and 2) to introduce students in computer science to the importance of theory for studying culture, i.e. to avoid a naive approach to data analysis. As I mention in my opening class, this course is about valuing different ways of looking at cultural questions and also conceding major methodological flaws in our current disciplinary orientations. Everyone in the room has something valuable to add given their disciplinary training and everyone also brings essential blind spots to the study of culture, including myself. This course is about making us all more sophisticated cultural analysts.
I am pleased to announce the publication of a new piece I have written that appears today in CA: Journal of Cultural Analytics. The aim of the piece is to take a first look at the ways in which fictional language distinguishes itself from non-fiction using computational approaches. When authors set out to write an imaginary narrative as opposed to an ostensibly “true” one, what kinds of language do they use to signal such fictionality? One of the interesting findings that the piece offers is the way such signalling has remained remarkably constant for the past two centuries. Using a classification algorithm trained on nineteenth-century fiction, we can still predict contemporary fiction with above 91% accuracy (down from about 95% when tested against data from its own time period). These results hold across at least one other European language (German). In the future I hope to be able to test more languages to better understand just how constant such fictional discourse can be said to be.
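The shape of such a cross-period test can be illustrated with a minimal nearest-centroid classifier in Python: train on one period's feature vectors, evaluate on another's. The two-dimensional feature values below are invented for illustration; the actual piece uses a proper classification algorithm over a much larger feature set.

```python
# Toy feature vectors per document: (sense-perception rate, abstraction rate).
# All values are invented for illustration.
train = [((0.9, 0.2), "fiction"), ((0.8, 0.3), "fiction"),
         ((0.2, 0.9), "nonfiction"), ((0.3, 0.8), "nonfiction")]  # "19th c."
test = [((0.85, 0.25), "fiction"), ((0.25, 0.85), "nonfiction")]  # "contemporary"

def centroid(vectors):
    """Mean vector of a list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def train_centroids(data):
    """One centroid per class label."""
    by_label = {}
    for vec, label in data:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vs) for label, vs in by_label.items()}

def predict(model, vec):
    """Assign the label whose centroid is nearest (squared Euclidean distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(vec, model[lab])))

model = train_centroids(train)   # fit on one period
accuracy = sum(predict(model, v) == lab for v, lab in test) / len(test)
```

The point of the design is the train/test split across time: if accuracy stays high on the later period, the features that signal fictionality have remained stable, which is the pattern the piece reports.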
In addition to seeing the constancy of these features across time and languages, the piece also highlights the specific nature of those features. As I argue in the piece, fictional language distinguishes itself most strongly by a phenomenological investment: an attention to a language of sensing and perceiving embodied individuals. It is this heightened focus on sense perception — the world’s feltness — that makes fiction stand out as a genre. When we look at the ways novels in particular distinguish themselves from other kinds of fictional texts, we see a very interesting language of “doubt” and “prevarication” emerge, suggesting that the novel does not put us into the world in a fundamentally realist way, but inserts people into it in a skeptical, testing, hypothetical relationship to their surroundings.
This piece is part of a nascent project to use computation to better understand creative human practices. The aim is not to replace human judgments about literary meaning or quality, but to make more transparent the semantic profiles of different types of cultural practices. Computation can be a useful tool in showing us how different cultures use different kinds of writing to convey meaning to readers over time. It helps us transcend the impressionistic ideas we develop when we read a smaller sample of novels or stories and test the extent to which these beliefs hold across much broader collections of writing.
While the original text data could not be shared in this project, all derived data has been shared as part of the article. One of the advantages of using non-word-based feature sets, as I do in the piece, is that the derived data can then be freely shared.
At some point, theory was declared over. Which was a polite way of saying we can get back to doing what we’ve always done. Which, it turns out, was theory.
The humanities represent an amazing collection of individuals who have over the ages developed an extraordinary array of theories about people, the past, creativity, and social life. There is a richness of theoretical models about how human life works — at different points in time and in different places. Traditionally we have not called this theory, but evidence. Because we only knew of one kind of evidence for the questions we were asking. “Theory” in its heyday was understood as an aberration because it was not evidence-based — it was about theory, rather than “the text.” The death of theory was a way of saying let’s get back to doing evidence-based research.
What the rise of data and the computational modelling of language and social experience have allowed us to see is the way our evidence was really more like theory. It was, until recently, only ever tested against exemplary cases. This is what I mean by saying that we have been doing theory all along. Data changes that, not in the sense of replacement, as many have argued, but in the sense of complementarity, or even extension. Data expands the applicability of our theoretical models. It combines the evidential richness of quantitative models with the conceptual richness of our theoretical models. It makes research better.
This is what I see as the great research challenge in the coming years. Not distant reading or close reading, but how they enter into conversation with each other and challenge each other’s point of view. Data can be a way of testing theoretical models, but theoretical models can be used to challenge under-sophisticated computational models. Anyone who has ever read papers in computer or social science will be familiar with the under-theorization that often (but definitely not always) accompanies that work. Anyone who has ever read humanities papers is deeply aware of the limited nature of the evidence to support overly broad claims. There are huge gains to be made in facilitating this two-way conversation. This is the era of rapprochement.
Computation, meet theory. Theory, computation.