Where’s the data? Notes from an international forum on limited use text mining

I’m attending a two-day workshop on issues related to data access for text and data mining (TDM). We are 25 participants from different areas: researchers who do TDM, librarians who oversee digital content, content providers who package and sell data to academic libraries (principally large publishers), and lawyers.

I am excited to be here because these issues strike me as both complicated and intractable. For several years I have tried, without success, to gain greater access to data in our university library. I have also worked extensively with limited use data and wished I could be more open with it. Whenever I ask how the situation can improve, a circle of finger-pointing begins in which everyone points at someone else and nothing changes.

The overarching question that we are all implicitly asking ourselves: Will anything change after our meeting?

Here we go.


An Open Letter to the MLA

Dear Prof. Taylor,

I am writing to you as a member of the MLA who has concerns about the practices and policies relating to the society’s data and their impact on research. This is an issue that affects many scholarly organizations. For this reason I have chosen to write an open letter.

The MLA has emerged as an important champion of the principles of open access scholarship. The creation of the MLA Commons is a recent positive example of such proactive work.

It is all the more troubling to realize that such open access does not apply to the MLA’s own data. I was recently served with a take-down notice by my university library for publicly sharing data and code used in a recent publication in the PMLA. The data was drawn from the MLA database and represented two years’ worth of records, one collection from 2015 and one from 1970. When I contacted the MLA to ask for the data outside of such corporate mediation, I was refused. Here we have a case where data from the MLA was used to support an article published in the flagship journal of the MLA, and that data is now being suppressed from public view.

The MLA database is an essential source of knowledge about the practices within our field. As we have begun to learn, metadata alone can reveal a great deal of information about the behaviour of a community. In my own work I am interested in studying the concentration of attention surrounding literary authors, especially with respect to gender and racial diversity and how such concentration has changed (or not) over time.
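To make the idea of “concentration of attention” concrete, here is a minimal sketch of the kind of measurement such metadata makes possible. Nothing here reflects the actual MLA data or my actual methods; the records and author names are hypothetical, and the Gini coefficient is just one standard way of summarizing how unevenly records are distributed across authors:

```python
from collections import Counter

def gini(counts):
    """Gini coefficient of non-negative counts:
    0 = attention spread evenly; values near 1 = highly concentrated."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula computed over the sorted values.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

# Hypothetical metadata records: the author each record is about.
records = ["Austen", "Austen", "Austen", "Austen",
           "Dickens", "Dickens", "Brontë"]
per_author = Counter(records)
print(round(gini(list(per_author.values())), 3))  # → 0.286
```

Comparing such a figure across decades (say, 1970 versus 2015) is one way to ask whether scholarly attention has become more or less concentrated over time, which is exactly why systematic access to the records matters.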

Below I attach a screenshot of the licensing agreement that my university has signed with ProQuest, which distributes the data for the MLA to our university library. As you can see, clauses (i)–(k) all violate essential norms of research. Not being able to mine a database (i) means that it has been walled off from standard research practices. Not being able to communicate materials received from the service (j) means that the evidentiary bases of claims using the data cannot be publicly shared or externally validated. And not being able to download parts of the service in a systematic manner (k) means that we cannot study the contents of the database in any responsible fashion. These clauses all favour a mode of interaction with information that is both out of date and prohibitive in terms of the accepted norms of academic research today.

The MLA, along with numerous other scholarly organizations, has contracted out the organization of and access to its data to third parties, most of which are private, for-profit initiatives. These parties’ business models are in direct conflict with the scholarly mission of the society, indeed of any academic society. While this arrangement may have been initially convenient, not to mention profitable, it is no longer an acceptable way of curating data in an academic context. Libraries need to stop signing license agreements that limit access to data in the library. And scholarly organizations need to stop signing license agreements that limit access to and the public circulation of their data. Anything short of this represents a serious abrogation of scholarly responsibility.

I would be happy to work with you to craft data policies that are more in line with the values and norms of scholarship. The MLA has an opportunity once again to take the lead in this important matter.


Andrew Piper

Data, data, data. Why Katherine Bode’s new piece is so important and why it gets so much wrong about the field

Katherine Bode has written an excellent new piece asking us to reflect more on the data we use for computational literary studies. Her argument is that many of the current data sets available, which rely on date of first publication as a criterion for selection, miss the more socially embedded ways literary texts have circulated in the past.

Her thinking is deeply informed by the fields of bibliography, book history, and textual studies, most importantly by the work of D.F. McKenzie. McKenzie was the one who first showed how New Critical reading practices that used modern editions to build their arguments missed the historical specificity of the texts they were analyzing. Instead McKenzie wanted us to think of the sociology of texts, the ways in which texts (as books, manuscripts, flyers, illustrations) circulate in social settings.

Bode’s intervention is coming at a crucial time. For people working in the field, there is an increasing awareness of the ways different data sets represent objects of study in different ways. We’re well past the days of believing that having “many” books solves the problem of historical representation. Bode’s piece suggests two important directions for future study, which I would put under the heading of “historical representation” and what Matthew Lincoln has called the problem of “commensurability” (perhaps also called the problem of “sample bias”).

Bode’s first point is that representing literary history as a collection of works with dates of first publication ignores much of the way those works existed in the world. Maybe they were reprinted numerous times, or anthologized in parts, or began as serial productions, or were heavily revised in later editions. In 1910, people did not only read books published in 1910. This circulatory identity of literature privileges in many ways a more “reader-centred” point of view. The question is less what texts writers wrote and when, than what texts were available to readers and in what forms. I have written a whole book about the impact that material format has on literary production in the nineteenth century, so I am deeply sympathetic to this point of view.

Bode gives us concrete ideas about how to build data sets that are more attuned to these historical frameworks. “Fiction in newspapers” is the particular answer she gives, but there are plenty more options: specific library collections, household collections, books reviewed in the periodical press, or bestseller lists and prizewinning lists in the twentieth century. These all put us in touch with the historical filters that shaped when, where, and how books mattered to readers.

As historians of reading know, however, just having a representation of what circulated does not quite get at “reading.” We still do not know what readers did with these books: how many of them were actually read, whether they were read multiply, quickly, fragmentarily, gifted, regifted, shared, burned, or used as doorstops. Bode’s suggested approach is important and useful because of the way it allows us to observe “textual availability” or even “textual circulation” in a specific time and place. But it is equally important to see it as only one possible solution to the problem of historical representation, one that is more centred on reception than production. It does indeed orient us towards a reading environment, but it stops short of being able to understand readers or reading. For this, other kinds of data would be needed (like the work of my colleagues Matthew Erlin and Lynn Tatlock, who have a new study out on reader behaviour in a lending library in Muncie, Indiana).

If Bode’s example is both useful and limited in equal measure, the example she gives of the problem she wants to solve — the data set of first editions — is far more legitimate than she makes it out to be. The aim of structuring data in this way is to focus on writerly behaviour — what stylistic tendencies were available at what points in time. Dating novels in this way is no different from a critical edition that organizes its poems by composition date: in each case we are trying to recreate the process through which writing changed over time, putting a fixed marker in the ground for every poetic output. Like the eclectic editions of earlier traditions of bibliographers, such collecting practices try to recover not the textual environment in all of its complexity, but one regulated by a sense of temporal change. Such an approach overplays, to be sure, the historical specificity of texts and dates — did it all happen in that year? And as we know, poets’ works change, too, so what about variants? Trying to capture writerly behaviour through first editions misses all of the pre- and post-work that precedes and follows publication, the messiness of creativity that was once upon a time the object of text-genetic criticism (a field that, interestingly, Bode does not discuss).

But the point is, all data sets have limitations. Each data set will represent a set of historical transactions differently, and each has limits on what it can and cannot tell us about the past. A text-genetic approach will tell us something about the developmental process of works with respect to authors and their intervening agents of editors, readers, and booksellers. A first edition approach will allow us to approximate new items that enter into the literary field while ignoring questions of penetration and circulation (how many were printed, how many were bought, how many were read). And Bode’s approach will allow us to better understand this circulatory world of what’s “out there” at a given time in a given medium.

This brings me to the concerns I have with how Bode frames the issue. I would have thought it went without saying that using Moretti today as a stand-in for the field is no longer acceptable. Want to find an outrageous quote that informs no one’s work today? Use Moretti. But if you want to understand what people are actually doing, you need to look elsewhere. Bode makes the claim that early practitioners did not share their data. Fair enough. But the new journal, Cultural Analytics, does. It is over a year old (full disclosure: Bode is on the board). We have an entire dataverse established where authors deposit their code and their data for others to use and review. I personally just released tables of derived data on over 25,000 documents from the nineteenth century to the present, with features drawn from LIWC. Again, it’s not perfect, but it’s definitely a start.

Similarly, to suggest that current cultural analysts imagine that their data sets stand unproblematically for some larger “whole” or population is an unfair representation. Ted Underwood tests multiple data sets according to different bibliographic criteria of selection in his piece “The Life Cycles of Genres.” I test no fewer than 17 different data sets to better understand the uniqueness of fiction-writing in the past two centuries in my piece “Fictionality.” Peter M. Broadwell et al. test a single data set of Danish folklore, but they do so against previous scholars’ classifications to better understand how machinic labels (mis)align with scholarly judgments. In none of these cases do the authors think their data represent something stable or definitive. They are all aware of the contingency of their data sets’ ability to capture some aspect of history, and they build that contingency into their methodology. Of course we could do more; we can always do more. But we first need to acknowledge the work that is actually happening.

All of this was clearly stated (well, clear to me) in my introduction to the journal Cultural Analytics, where I write:

This then is one of the major contributions, and challenges, of cultural analytics. Rather than abandon generalization, the task before us is to reflect not simply on the acts of cultural representation, of Auerbach’s notion of “represented reality,” but on the representativeness of our own evidence, to reconstruct as contingently as possible the whole about which we are speaking. Instead of embodying the whole like the cultural critic – having read the entire archive or seen all the images – the cultural analyst focuses instead on the act of construction itself. The cultural analyst is self-conscious about being implicated in the knowledge that is being created.

Similarly, in a piece from 2015, I tried to provide a model of the process of literary modeling that showed just how circular and contingent the relationship between part (data) and whole (history) was (Fig. 1). And I have a new piece forthcoming in PMLA that lays out the contingencies of representation that flow through the entire process of data modeling.

Once we acknowledge the contingency of data, however, a major issue is raised, one that, as Matthew Lincoln has pointed out, is omitted from Bode’s piece: that of commensurability. How can we assess how these various contingent representations relate to one another? What methods can we use to account for the different kinds of answers that different kinds of data representations give to our questions? Bode’s piece stops here, ironically suggesting that one data set is enough: the one she is building from Australian newspapers. It may be the case that she has access to “all” newspapers ever printed in Australia (though I’d be surprised). But are they all equally accessible in terms of textual quality (OCR)? And what about other types of representations of fiction? Books, small presses, manuscript circulation?

The point is that there is nothing wrong with the data set Bode wants to use, but in its singularity — in its singular standing for history — it risks running into the very same problem she accuses Moretti of. We absolutely need methods that reflect on the “representativeness” of data and on how different representations differ. That is our job as cultural historians. Far from being discredited by this point of contingency, data offers critical tools to make these assessments rather than take information at face value, as the New Critics did with their paperback editions.

If there is one larger point to take away from all of this, it is that the whole process of data modeling is messy and complicated. We really need to get past the discourse of finger-pointing and move towards one in which we support each other and acknowledge the work being done rather than citing straw men from the past. Building data sets takes a lot of time, and there are tons of interpretive questions built into the process. Feeling unsatisfied with a data set is the default, especially in the humanities, given our very poor infrastructure for data. Building new methods likewise takes a lot of time and raises just as many interpretive questions. Feeling unsatisfied with methods is also a default, especially given how rudimentary all of this is.

But waiting for the perfect data set or the perfect model is a bit like waiting for the white whale. And thinking that one data set solves all problems is equally problematic. People should try to build data and models to answer questions they care about and be supportive of the work other people are doing. It’s not going to happen overnight, and there is no one right answer. Far from needing better data sets, we need a better discourse about how to engage with each other’s work, because there is a lot of ongoing effort out there.



Why your dissertation needs data


Dear Future Graduate Students,

It’s that time of year to start thinking about grad school. Recruiting is not easy for me. My general sentiment around graduate training is: let them decide. Advertising and persuasion are for places like Trump University, not scholarship. But I think we are at a bit of a crossroads in our field, and I am concerned that too many people aren’t making good choices, potentially because of what they’re hearing from their faculty. After all, the ratio of people doing computational humanities to those who are not is tiny, so the messaging is bound to be skewed. It seems important, therefore, to go out on a limb (yes, it feels like a limb) and try to articulate why you should orient your work towards a more data-driven approach. So here goes.

Why does your dissertation need data? Because it opens up so many more questions. When your only method is to read as much as possible, first, you’ll always come up short. You can never read enough, and you’ll always know it. This is one of the reasons we like to parade our erudition: to cover over what we know we don’t know. Second, you have no principled way of making judgments about all that you have read as a whole. You have no way to contextualize those insights, to put them in conversation with the things you haven’t read. To put it another way, you have no way to generalize about what you are finding. If you want to talk about the politics of modernism or the spectrality of televisual personalities, watching or reading alone isn’t going to get you there in a convincing way. Data isn’t the be-all and end-all. But it does solve problems. It answers questions that you will not otherwise be able to pose.

There’s another reason too, one that I think is almost more important because it isn’t about a particular subject area. Rather, it’s about your position in the field more generally. Every day thousands of dissertations are uploaded to ProQuest. And every day we know a little bit less about our respective fields. The more research there is, the harder it is to have a sense of the field as a whole — and where your place is within it.

I remember, very distinctly, a moment I had one day wandering through the stacks as a graduate student at Columbia University, the home of Melvil Dewey. I remember thinking to myself, holy s%*t, look at all these books. What is the point of me writing one more? The aggregate value of one more book decreases every day. But the ability to use data to understand that whole to which you yourself are a contributor: that is invaluable. And you can’t get there by reading alone. Only data can do this, for better and for worse.

I know people will tell you it’s a bad idea, or that it’s a fad. It’s not. It’s an essential part of the research process. You should be looking for programs that will help you integrate data into your research, guide you towards using it effectively and thoughtfully, and above all champion methodological plurality rather than dogma. If you’re hearing something else, then you aren’t being given very good advice.