Introducing the CONLIT dataset of contemporary literature

October 12, 2022

Data, Narrative Studies, News

Excited to announce the release of a new dataset curated by my lab. Special thanks go to Joey Love and Eve Kraicer for their work in bringing this to fruition.

This dataset includes derived data on a collection of ca. 2,700 books in English published between 2001 and 2021 and spanning 12 different genres. The data was manually collected to capture popular writing aimed at a range of readerships across fiction (1,934) and non-fiction (820). Genres are grouped by cultural capital (bestsellers, prizewinners, elite book reviews), stylistic affinity (mysteries, science fiction, biography, etc.), and age level (middle-grade and young adult). The dataset allows researchers to explore the effects of audience, genre, and instrumentality (i.e., fictionality) on the stylistic behavior of authors within the recent past across different classes of professionally published writing.

Access to well-defined collections of contemporary writing is extremely limited today due to intellectual property restrictions, corporate control of data, and the absence of clear consensus surrounding literary categorization. Our dataset is designed to provide researchers with freely accessible derived data on a robust collection of professionally published writing in English produced since 2001 and spanning 12 genre categories. While the term “genre” has been understood in multiple ways within the research community over the years, we define genre for our purposes as a form of institutionally framed classification. Under this definition, a book's genre is the distinct category of writing that a given institution assigns to it.

Below is a list of the features we provide:

FEATURE	DESCRIPTION	ANNOTATION TYPE
Category	Fiction or non-fiction	Manual
Genre	Twelve categories	Manual
Publication Date	Date of first publication	Manual
Author Gender	Perceived authorial gender	Manual
POS	Part-of-speech uni- and bigrams	Computational
Supersense	Frequency of 41 word supersenses	Computational
Word Frequencies	Word frequencies for every book/1,000-word passage	Computational
Token Count	Work length measure	Computational
Total Characters	Estimated total number of named characters	Computational
Protagonist Concentration	Percentage of all character mentions by main character	Computational
Avg. Sentence Length	Average length of all sentences per book	Computational
Avg. Word Length	Average length of all words per book	Computational
Tuldava Score	Reading difficulty measure	Computational
Event Count	Estimated number of diegetic events	Computational
Goodreads Avg. Rating	Average user rating on Goodreads	Computational
Goodreads Total Ratings	Total number of ratings on Goodreads as of June 2022	Computational
Average Speed	Measure of narrative pace	Computational
Minimum Speed	Measure of narrative distance	Computational
Volume	Measure of topical heterogeneity	Computational
Circuitousness	Measure of narrative non-linearity	Computational
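
To give a feel for what a few of the simpler computational features measure, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the pipeline used to build the dataset: the function names, the naive regex-based sentence and word splitting, and the toy inputs are all made up for this example, and real character mentions would have to come from an NLP tool rather than a hand-written list.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration only; the dataset's own
# tokenization and character detection are not reproduced here.


def avg_sentence_length(text: str) -> float:
    """Mean number of word tokens per sentence, using a naive split on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(lengths) / len(lengths)


def avg_word_length(text: str) -> float:
    """Mean number of characters per word token."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def protagonist_concentration(mentions: list[str]) -> float:
    """Percentage of all character mentions that belong to the
    most frequently mentioned character."""
    if not mentions:
        return 0.0
    counts = Counter(mentions)
    top_count = counts.most_common(1)[0][1]
    return 100 * top_count / len(mentions)


# Toy inputs, invented for this example.
sample_text = "Ann met Bob. Ann left early."
sample_mentions = ["Ann", "Bob", "Ann", "Ann"]

print(avg_sentence_length(sample_text))            # 3.0
print(avg_word_length(sample_text))                # 3.5
print(protagonist_concentration(sample_mentions))  # 75.0
```

In this toy case the most-mentioned character accounts for three of four mentions, so the concentration is 75%. Scores in the dataset itself depend on how sentences, words, and character mentions were actually detected, so values computed this way will not necessarily match.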
