Introducing the CONLIT dataset of contemporary literature
I'm excited to announce the release of a new dataset curated by my lab. Special thanks go to Joey Love and Eve Kraicer for their work in helping bring this to fruition.
This dataset includes derived data on a collection of ca. 2,700 books in English published between 2001 and 2021 and spanning 12 different genres. The data was manually collected to capture popular writing aimed at a range of different readerships across fiction (1,934 books) and non-fiction (820 books). The genres fall into three groups: forms of cultural capital (bestsellers, prizewinners, elite book reviews), stylistic affinity (mysteries, science fiction, biography, etc.), and age level (middle-grade and young adult). The dataset allows researchers to explore the effects of audience, genre, and instrumentality (i.e., fictionality) on the stylistic behavior of authors within the recent past across different classes of professionally published writing.
Access to well-defined collections of contemporary writing is extremely limited today due to intellectual property restrictions, corporate control of data, and the absence of clear consensus surrounding literary categorization. Our dataset is designed to provide researchers with freely accessible derived data on a robust collection of professionally published writing in English produced since 2001, spanning 12 different genre categories. While the term “genre” has been understood in multiple ways within the research community over the years, we define genre for our purposes as a form of institutionally framed classification. Under this definition, a book's genre is the category of writing that a given institution assigns to it.
Below is a list of the features we provide:
FEATURE | DESCRIPTION | ANNOTATION TYPE |
---|---|---|
Category | Fiction or non-fiction | Manual |
Genre | Twelve categories | Manual |
Publication Date | Date of first publication | Manual |
Author Gender | Perceived authorial gender | Manual |
POS | Part-of-speech uni- and bigrams | Computational |
Supersense | Frequency of 41 word supersenses | Computational |
Word Frequencies | Word frequencies for every book/1,000-word passage | Computational |
Token Count | Work length measure | Computational |
Total Characters | Estimated total number of named characters | Computational |
Protagonist Concentration | Percentage of all character mentions by main character | Computational |
Avg. Sentence Length | Average length of all sentences per book | Computational |
Avg. Word Length | Average length of all words per book | Computational |
Tuldava Score | Reading difficulty measure | Computational |
Event Count | Estimated number of diegetic events | Computational |
Goodreads Avg. Rating | Average user rating on Goodreads | Computational |
Goodreads Total Ratings | Total number of ratings on Goodreads as of June 2022 | Computational |
Average Speed | Measure of narrative pace | Computational |
Minimum Speed | Measure of narrative distance | Computational |
Volume | Measure of topical heterogeneity | Computational |
Circuitousness | Measure of narrative non-linearity | Computational |
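
To make a few of the simpler features in the table concrete, here is a minimal Python sketch that approximates token count, average word length, average sentence length, total characters, and protagonist concentration for a toy text. It is an illustration only and not the pipeline used to build the dataset: the function name `simple_features`, the regex-based tokenization and sentence splitting, and the hand-made `character_mentions` counts are all assumptions introduced for this example.

```python
import re
from collections import Counter


def simple_features(text, character_mentions):
    """Rough approximations of a few table features (hypothetical helper).

    `character_mentions` maps a character name to its number of mentions;
    here it is supplied by hand rather than extracted automatically.
    Assumes non-empty text and at least one character mention.
    """
    # Naive tokenization: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total_mentions = sum(character_mentions.values())
    return {
        "token_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "total_characters": len(character_mentions),
        # Share of all character mentions that go to the most-mentioned character.
        "protagonist_concentration": 100 * max(character_mentions.values()) / total_mentions,
    }


text = "Ada met Ben. Ada left."
mentions = Counter({"Ada": 2, "Ben": 1})
features = simple_features(text, mentions)
print(features)
```

In this toy case the most-mentioned character accounts for two of three mentions, so the protagonist concentration is about 67 percent. Real character counts would come from a named-entity and coreference step, which this sketch does not attempt.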