Data Sets

Check out our lab dataverse and our citizen science dataverse for a full list of datasets accompanying published articles.

CR4-NarrEmote

We introduce “Citizen Readers for Narrative Emotions”, a large-scale, open-vocabulary dataset of narrative emotions derived through our citizen science initiative. Over a four-month period, 3,738 volunteers contributed more than 200,000 emotion annotations across 43,000 passages from long-form fiction and non-fiction, spanning 150 years, twelve genres, and multiple Anglophone cultural contexts. To facilitate model training and comparability, we provide mappings to both dimensional (Valence-Arousal-Dominance) and categorical (NRC Emotion) frameworks.

CR4-Interact

This dataset is part of the Citizen Readers initiative, a citizen science project to promote more open and transparent training data for AI modeling and the study of storytelling. It consists of 13,395 passages labeled by 1,915 participants for character interactions according to six types. Please cite: Piper, Andrew, Michael Xu, and Derek Ruths. “The Social Lives of Literary Characters: Combining citizen science and language models to understand narrative social networks.” In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pp. 472-482. 2024.

NarraDetect

This dataset includes over 13,000 passages sampled from 18 diverse genres ranging from narrative fiction (novels, fables) to narrative non-fiction (memoirs, biographies) to non-narrative non-fiction (Supreme Court decisions, scientific abstracts). Additionally, it contains a manually annotated subset of 400 passages labeled for a scalar concept of “narrativity,” i.e. how narrative a passage feels. You can read the full paper here.

Mini Worldlit

A dataset of 1,192 manually curated works of contemporary fiction from 13 countries representing nine languages and five continents.

StorySeeker

A dataset of ca. 500 annotated posts on Reddit to facilitate the task of narrative detection in online conversations.

CONLIT

This dataset includes derived data on a collection of ca. 2,700 books in English published between 2001 and 2021 and spanning 12 different genres.

MultiHathi

We present a new dataset with detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset bolsters the May 2022 Hathifile by supplying missing predicted fiction tags on an additional 400,000 non-English volumes.

Hathi1M

We present a new dataset built on prior work consisting of 1,671,370 randomly sampled pages of English-language prose roughly divided between modes of fictional and non-fictional writing and published between the years 1800 and 2000.

Can We Be Wrong?

All of the data and code that accompany the book, Can We Be Wrong? The Problem of Textual Evidence in a Time of Data (Cambridge 2020). Specifically, this dataset contains ~3k manually annotated sentences for “generalizations” within the fields of literary studies, history, and sociology and ~15k automatically annotated sentences from 9 academic journals from a single recent year (2016).

Enumerations

All of the data and code that accompany the book, Enumerations: Data and Literary Study (Chicago 2018).

Please cite: Andrew Piper, Enumerations: Data and Literary Study (Chicago 2018).

Novel450

A collection of 450 novels in German, French, and English that span 1770 to 1930. Each language is represented by 150 novels with a roughly even distribution across time, length, and gender. The data can be downloaded here, and the metadata is here. Please cite:

Piper, Andrew (2016): txtlab Multilingual Novels. figshare.

https://dx.doi.org/10.6084/m9.figshare.2062002.v3

Academic Prestige

Metadata on institutional affiliation for 5,000+ academic articles published in four prestige journals within the humanities (PMLA, Critical Inquiry, New Literary History, Representations). Included in the metadata are the author’s institutional affiliation at time of publication, the author’s PhD institution, and the author’s gender. 3,500 authors are represented from close to 350 PhD-granting institutions and 725 authorial institutions. We also include supplementary data on gender and publication on another ~3,800 articles published since 2010 in 16 further journals. The data is available here.

Please cite: Chad Wellmon and Andrew Piper, “Publication, Power and Patronage: On Inequality and Academic Publishing,” Critical Inquiry (July 2017): http://bit.ly/2B93Jpu.

LIWC for Literature

LIWC tables for 25,000+ documents, consisting of both fiction and non-fiction texts drawn from different periods (the nineteenth-century canon, HathiTrust nineteenth-century documents, the twentieth-century repositories of Gutenberg and Amazon, and multiple contemporary literary genres from mysteries to prizewinners) as well as two separate languages (German and English). The data is available here.

Please cite: Andrew Piper, “Fictionality,” Cultural Analytics (December 2016): DOI: 10.22148/16.011

Race and Film

This data set contains character dialogue from 780 Hollywood movies produced between 1970 and 2014. Characters have been labeled by their racial and ethnic identity using IMDb. The data set is available here.

Please cite: Vicky Svaikovsky, Anne Meisner, Eve Kraicer, and Matthew Sims, “Racial Lines: Race Ethnicity and Dialogue in 780 Hollywood Films, 1970-2014.”

20C Poetry

A table of derived word counts from a collection of 75,297 English-language poems. A table with the top 20K words is located here, and three tables of POS, hypernyms, and word counts are located here.