Measuring Unreading

In a new piece out in the Goethe Yearbook, I and my co-author, student James Manalad, use text re-use algorithms to better understand citational practices within scholarly publications. In particular we look at how Goethe’s collected works are directly quoted in 68 volumes of the Goethe Jahrbuch over the course of the twentieth century.

How scholars cite major authors or works can give us insights into the methodological and ideological frameworks of these interpretive communities. In particular, we’re interested not just in what is cited, but also what the absence of citation can tell us about modes of scholarly attention. As we write in the piece:

Our aim in drawing attention to the unread spaces of Goethe’s corpus is to initiate a conversation about the meaning of concentration and repetition in the practice of critical reading. 

In tracing the ways Goethe’s works are quoted, we can begin to observe the semantic and thematic underpinnings of cultural capital, of what “version” of Goethe scholars reproduce. For example, our models suggest that there is a strong gendered aspect to the history of Goethe quotation within the Jahrbuch, with a reliance on valuing Goethe’s language of universality and patriarchy over and above more domestic concerns that also find expression in his poetry. The unread spaces of his corpus — unread in the sense of unreproduced (a distinction we discuss in the piece) — provide a form of commentary in relief of such disciplinary investments.

Text re-use algorithms offer researchers in computational literary studies a valuable tool to understand disciplinary practices. In our piece we use a tool developed by Jonathan Reeve (who also has an informative talk on this as well), which allows you to map the passages of a source corpus (in our case Goethe’s works) to a target corpus (in this case the Goethe Jahrbuch).

The interesting aspect about these algorithms is the way they handle ambiguity. Here is an example of a “match” in our piece:

Notice how optical character recognition (OCR) errors like “binn” and “fo” are still able to be matched due to the fuzzy matching technique, just as the differing punctuation—the question marks in the original versus the excla- mation points in the Goethe Jahrbuch—are similarly resolved. But it is also important to point out that this tool does not capture the entirety of the quoted line. The quotation extends further backward and forward in ways that our algorithm does not capture. In other words, we cannot use it to measure the exact number of quoted words. However, it does an excellent job of identifying when a sequence of words from Goethe’s corpus is being quoted and attributing that quotation to one or more source texts.

One of my main goals with this piece is to offer a case study of how we can use text-reuse algorithms to better understand disciplinary behaviour. I think there is a ton of potential there that I hope researchers will take up. The piece is part of a broader special section on “the great unread” and digital methods and I encourage you to read all of the contributions.