Introducing the CONLIT dataset of contemporary literature

I'm excited to announce the release of a new dataset curated by my lab. Special thanks go to Joey Love and Eve Kraicer for their work in helping bring it to fruition.

This dataset includes derived data on a collection of ca. 2,700 books in English published between 2001 and 2021 and spanning 12 different genres. The data was manually collected to capture popular writing aimed at a range of readerships across fiction (1,934) and non-fiction (820). The genre categories reflect cultural capital (bestsellers, prizewinners, elite book reviews), stylistic affinity (mysteries, science fiction, biography, etc.), and age level (middle-grade and young adult). The dataset allows researchers to explore the effects of audience, genre, and instrumentality (i.e., fictionality) on the stylistic behavior of authors within the recent past across different classes of professionally published writing.

Access to well-defined collections of contemporary writing is extremely limited today due to intellectual property restrictions, corporate control of data, and the absence of clear consensus surrounding literary categorization. Our dataset is designed to give researchers freely accessible derived data on a robust collection of professionally published writing in English produced since 2001 across 12 genre categories. While the term “genre” has been understood in multiple ways within the research community over the years, we define genre for our purposes as a form of institutionally framed classification. Under this definition, a book's genre is the category of writing under which a given institution labels it.

Below is a list of the features we provide:

| Feature | Description | Method |
| --- | --- | --- |
| Category | Fiction or non-fiction | Manual |
| Genre | Twelve categories | Manual |
| Publication Date | Date of first publication | Manual |
| Author Gender | Perceived authorial gender | Manual |
| POS | Part-of-speech uni- and bigrams | Computational |
| Supersense | Frequency of 41 word supersenses | Computational |
| Word Frequencies | Word frequencies for every book/1,000-word passage | Computational |
| Token Count | Work length measure | Computational |
| Total Characters | Estimated total number of named characters | Computational |
| Protagonist Concentration | Percentage of all character mentions by main character | Computational |
| Avg. Sentence Length | Average length of all sentences per book | Computational |
| Avg. Word Length | Average length of all words per book | Computational |
| Tuldava Score | Reading difficulty measure | Computational |
| Event Count | Estimated number of diegetic events | Computational |
| Goodreads Avg. Rating | Average user rating on Goodreads | Computational |
| Goodreads Total Ratings | Total number of ratings on Goodreads as of June 2022 | Computational |
| Average Speed | Measure of narrative pace | Computational |
| Minimum Speed | Measure of narrative distance | Computational |
| Volume | Measure of topical heterogeneity | Computational |
| Circuitousness | Measure of narrative non-linearity | Computational |
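
For readers wondering what some of the simpler computational features measure, here is a rough Python sketch that computes naive versions of token count, average sentence length, average word length, and protagonist concentration on a toy passage. It is an illustration only: the function names, the regex-based tokenization, and the hand-written list of character mentions are all assumptions made for this example, not the pipeline used to build the dataset.

```python
import re
from collections import Counter


def simple_features(text):
    """Compute naive length features for one non-empty text (illustrative only)."""
    # Rough heuristic: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word token is any run of letters or apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    return {
        "token_count": len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),  # words per sentence
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),  # letters per word
    }


def protagonist_concentration(mentions):
    """Share of all character mentions that go to the most-mentioned character."""
    counts = Counter(mentions)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(mentions)


# Toy inputs, invented for this example.
sample = "Ana ran home. She was late! Ben waited."
features = simple_features(sample)
concentration = protagonist_concentration(["Ana", "Ana", "Ben", "Ana"])

print(features)
print(concentration)
```

In practice, the character mentions would have to come from a named-entity and coreference step, which this sketch leaves out entirely, and a proper tokenizer would handle abbreviations, numbers, and hyphenated words that the regular expressions above ignore.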