Data Sets


This dataset includes derived data on a collection of ca. 2,700 books in English published between 2001 and 2021 and spanning 12 different genres.


We present a new dataset with detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset supplements the May 2022 Hathifile by supplying missing predicted fiction tags for an additional 400,000 non-English volumes.


We present a new dataset, built on prior work, consisting of 1,671,370 randomly sampled pages of English-language prose published between 1800 and 2000 and roughly evenly divided between fictional and non-fictional writing.

Can We Be Wrong?

All of the data and code that accompany the book Can We Be Wrong? The Problem of Textual Evidence in a Time of Data (Cambridge 2020). Specifically, this dataset contains ~3k sentences manually annotated for “generalizations” within the fields of literary studies, history, and sociology, and ~15k automatically annotated sentences from 9 academic journals from a single recent year (2016).


All of the data and code that accompany the book Enumerations: Data and Literary Study (Chicago 2018).

Please cite: Andrew Piper, Enumerations: Data and Literary Study (Chicago 2018).


A collection of 450 novels in German, French, and English that span 1770 to 1930. Each language is represented by 150 novels with a roughly even distribution across time, length, and gender. The data can be downloaded here, and the metadata is here. Please cite:

Piper, Andrew (2016): txtlab Multilingual Novels. figshare.

Academic Publishing I: Prestige Data

Metadata on institutional affiliation for 5,000+ academic articles published in four prestige journals within the humanities (PMLA, Critical Inquiry, New Literary History, Representations). Included in the metadata are the author’s institutional affiliation at time of publication, the author’s PhD institution, and the author’s gender. 3,500 authors are represented from close to 350 PhD-granting institutions and 725 authorial institutions. We also include supplementary data on gender and publication on another ~3,800 articles published since 2010 in 16 further journals. The data is available here.

Please cite: Chad Wellmon and Andrew Piper, “Publication, Power and Patronage: On Inequality and Academic Publishing,” Critical Inquiry (July 2017).

Academic Publishing II: MLA Author Data

This data represents 1,937 and 6,252 bibliographic records of articles in the field of literary studies published in 1970 and 2015, respectively. The data was downloaded from the MLA database using the ProQuest interface in January 2017. The full data cannot be accessed due to the licensing policies of the MLA.

Please cite: Andrew Piper, “Think Small: On Literary Modeling.” PMLA 132.3 (2017): 651-658.

LIWC for Literature

LIWC tables for 25,000+ documents, consisting of both fiction and non-fiction texts drawn from different periods (the nineteenth-century canon, HathiTrust nineteenth-century documents, the twentieth-century repositories of Gutenberg and Amazon, and multiple contemporary literary genres from mysteries to prizewinners) as well as two separate languages (German and English). The data is available here.

Please cite: Andrew Piper, “Fictionality,” Cultural Analytics (December 2016). DOI: 10.22148/16.011

Race and Film

This data set contains character dialogue from 780 Hollywood movies produced between 1970 and 2014. Characters have been labeled by their racial and ethnic identity using IMDb. The data set is available here.

Please cite: Vicky Svaikovsky, Anne Meisner, Eve Kraicer, and Matthew Sims, “Racial Lines: Race Ethnicity and Dialogue in 780 Hollywood Films, 1970-2014.”

20C Poetry

A table of derived word counts from a collection of 75,297 English-language poems. A table with the top 20K words is located here, and three tables of POS, hypernyms, and word counts are located here.