This dataset includes derived data on a collection of ca. 2,700 books in English published between 2001 and 2021 and spanning 12 different genres.
We present a new dataset built on prior work consisting of 1,671,370 randomly sampled pages of English-language prose roughly divided between modes of fictional and non-fictional writing and published between the years 1800 and 2000.
Can We Be Wrong?
All of the data and code that accompany the book Can We Be Wrong? The Problem of Textual Evidence in a Time of Data (Cambridge 2020). Specifically, this dataset contains ~3k sentences manually annotated for “generalizations” within the fields of literary studies, history, and sociology and ~15k automatically annotated sentences from 9 academic journals from a single recent year (2016).
All of the data and code that accompany the book Enumerations: Data and Literary Study (Chicago 2018).
Please cite: Andrew Piper, Enumerations: Data and Literary Study (Chicago 2018).
A collection of 450 novels in German, French, and English that span 1770 to 1930. Each language is represented by 150 novels with a roughly even distribution across time, length, and gender. The data can be downloaded here, and the metadata is here. Please cite:
Piper, Andrew (2016): txtlab Multilingual Novels. figshare.
A collection of 1,211 novels published between 2000-2015. They are categorized by the following 6 groups: Bestsellers (BS), Prizewinners (PW), Novels reviewed in the New York Times (NYT), Mysteries (MYST), Romances (ROM), and Science Fiction (SCIFI). Metadata is available here.
Please cite: Andrew Piper and Eva Portelance, “How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading,” Post45 (2016).
Academic Publishing I: Prestige Data
Metadata on institutional affiliation for 5,000+ academic articles published in four prestige journals within the humanities (PMLA, Critical Inquiry, New Literary History, Representations). Included in the metadata are the author’s institutional affiliation at time of publication, the author’s PhD institution, and the author’s gender. 3,500 authors are represented from close to 350 PhD-granting institutions and 725 authorial institutions. We also include supplementary data on gender and publication on another ~3,800 articles published since 2010 in 16 further journals. The data is available here.
Please cite: Chad Wellmon and Andrew Piper, “Publication, Power and Patronage: On Inequality and Academic Publishing,” Critical Inquiry (July 2017): http://bit.ly/2B93Jpu.
Academic Publishing II: MLA Author Data
This data represents 1,937 and 6,252 bibliographic records of articles in the field of literary studies published in 1970 and 2015, respectively. The data was downloaded from the MLA database using the ProQuest interface in January 2017. The full data cannot be accessed due to the licensing policies of the MLA.
Please cite: Andrew Piper, “Think Small: On Literary Modeling.” PMLA 132.3 (2017): 651-658.
LIWC for Literature
LIWC tables for 25,000+ documents, consisting of both fiction and non-fiction texts drawn from different periods (the nineteenth-century canon, Hathi Trust nineteenth-century documents, the twentieth-century repositories of Gutenberg and Amazon, and multiple contemporary literary genres from mysteries to prizewinners) as well as two separate languages (German and English). The data is available here.
Please cite: Andrew Piper, “Fictionality,” Cultural Analytics (December 2016): DOI: 10.22148/16.011
Race and Film
This data set contains character dialogue from 780 Hollywood movies produced between 1970 and 2014. Characters have been labeled by their racial and ethnic identity using IMDB. The data set is available here.
Please cite: Vicky Svaikovsky, Anne Meisner, Eve Kraicer, and Matthew Sims, “Racial Lines: Race Ethnicity and Dialogue in 780 Hollywood Films, 1970-2014.”