Mini Worldlit: A new dataset spanning 13 countries, 9 languages, and 5 continents

The study of world literature is crucial for understanding cross-cultural differences and the global landscape of literary production (Damrosch 2003). Despite its importance, we still lack datasets that support large-scale comparative analysis across literary cultures. In recent years, there have been renewed calls within the digital humanities to embrace multilingualism in the development of both new datasets and computational methods (Spence & Brandao 2021; Nilsson-Fernàndez & Dombrowski 2022; Viola & Spence 2023; Viola 2024). As more literary texts are digitized, the need for datasets that reflect a diverse, multilingual, and transnational literary world has become increasingly urgent.

A number of new initiatives are underway to build multilingual literary collections, but each has notable limitations: they do not focus on narrative, exclude more recent works, or lack global representation. Large-scale collections, on the other hand, can suffer from unclear curation standards, limiting their applicability to the study of literary history.

To address these limitations, we introduce Mini Worldlit, a dataset of 1,192 manually curated works of contemporary fiction from 13 countries representing nine languages and five continents. Beyond its geographic and linguistic breadth, the value of Mini Worldlit lies in its careful curation: books were selected using consistent cross-cultural criteria, overseen by a team of scholarly experts. While it can only cover a tiny fraction of the world’s languages and cultures, Mini Worldlit provides a template for future commensurable cross-cultural sampling to facilitate the further exploration of global literary cultures.

What’s in the Dataset?

The dataset includes fiction from:

📍 Argentina (Spanish) – 97 books
📍 Canada (English) – 95 books
📍 Denmark (Danish) – 100 books
📍 Germany (German) – 98 books
📍 India (English) – 94 books
📍 Israel (Hebrew) – 100 books
📍 Japan (Japanese) – 95 books
📍 Mexico (Spanish) – 76 books
📍 Netherlands/Belgium (Dutch) – 98 books
📍 Nigeria (English) – 71 books
📍 South Korea (Korean) – 100 books
📍 South Africa (English) – 77 books
📍 Turkey (Turkish) – 91 books
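
As a quick sanity check, the per-collection counts above can be tallied to confirm the headline figures. The snippet below is only an illustrative sketch: the counts are copied from the list above, and the variable names are our own rather than part of the dataset's files.

```python
# (country, language, number of books), copied from the list above.
COLLECTIONS = [
    ("Argentina", "Spanish", 97),
    ("Canada", "English", 95),
    ("Denmark", "Danish", 100),
    ("Germany", "German", 98),
    ("India", "English", 94),
    ("Israel", "Hebrew", 100),
    ("Japan", "Japanese", 95),
    ("Mexico", "Spanish", 76),
    ("Netherlands/Belgium", "Dutch", 98),
    ("Nigeria", "English", 71),
    ("South Korea", "Korean", 100),
    ("South Africa", "English", 77),
    ("Turkey", "Turkish", 91),
]

total = sum(count for _, _, count in COLLECTIONS)
languages = {language for _, language, _ in COLLECTIONS}

# Books per language, since several collections share a language.
per_language = {}
for _, language, count in COLLECTIONS:
    per_language[language] = per_language.get(language, 0) + count

print(f"{total} books, {len(COLLECTIONS)} collections, {len(languages)} languages")
for language, count in sorted(per_language.items(), key=lambda kv: -kv[1]):
    print(f"{language:10s} {count:4d} ({count / total:.1%})")
```

The tally gives 1,192 books across 13 collections and nine languages. English is the largest language group, with 337 books spread over four collections, which is worth keeping in mind for any analysis that pools the data by language.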

How Was the Dataset Built?

While a single definition of “world literature” is contested (Damrosch 2003; Anderson 2004; Cheah 2016), here we define world literature as a collection of literary collections. By this we mean that, for our purposes, world literature consists of sets of written fictional stories that are locally recognized and geographically and linguistically bounded. As we describe in more detail in our sampling criteria below, each of these choices has important implications: we focus on fiction as the most publicly prominent dimension of the category of literature; we focus on locally recognized fiction as a means of selecting works that have achieved some degree of meta-cultural acknowledgment (in our case book and/or literary reviews); and finally, we focus on linguistically and geographically bounded entities.

We used the following criteria when sampling books:

  • Regional coherence. We define a “collection” as consisting of books published in a single language deriving from a bounded geographic space. This may be defined as “Dutch fiction published in Belgium or Holland,” “English fiction published in India,” or “Korean fiction published in South Korea.” For this reason, books must not be translations from foreign languages and we prioritize authors whose location or upbringing aligns with the region and language of interest.
  • Audience coherence. We aim to collect books written for roughly similar audiences. For each region we select a small set of literary or cultural reviews and sample books from these lists.
  • Stylistic coherence. Books must be written in the third person. We impose this constraint to ensure that the linguistic behaviours of books are aligned. All books are manually inspected prior to inclusion.
  • Historical coherence. Books should ideally have been published within 1–2 years of the sampling exercise. However, if not enough books met the above criteria, we allowed researchers to sample further back in time. The full range of publication dates included in the collection can be seen in Figure 1. Over 90% of the data appeared during the five-year period between 2017 and 2021.
  • Quantitative coherence. We aim to sample ca. 100 books from each region. Due to availability limitations, this number can fluctuate as seen in Table 1.
  • Availability. All books must be purchasable as physical copies so that they can be digitized manually, in keeping with copyright restrictions on the derivative use of electronic editions.
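
For readers who want to apply comparable criteria to their own candidate lists, the rules above can be approximated as a simple filter. This is a hypothetical sketch, not the procedure used to build Mini Worldlit, where selection and inspection were done manually. The `Candidate` record, its field names, and the `sample_region` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical metadata record; field names are illustrative only.
    title: str
    is_translation: bool           # regional coherence: no translations
    narration: str                 # stylistic coherence: "first" or "third"
    from_review_list: bool         # audience coherence: drawn from selected reviews
    physical_copy_available: bool  # availability: purchasable in print
    pub_year: int                  # historical coherence: prefer recent books


def is_eligible(book: Candidate) -> bool:
    """Apply the hard constraints: all four must hold."""
    return (
        not book.is_translation
        and book.from_review_list
        and book.narration == "third"
        and book.physical_copy_available
    )


def sample_region(candidates: list[Candidate], target: int = 100) -> list[Candidate]:
    """Keep eligible books, most recent first, up to the target size.

    Sorting by year means older books are used only when recent ones
    run out, which mirrors the soft historical constraint.
    """
    eligible = [book for book in candidates if is_eligible(book)]
    eligible.sort(key=lambda book: book.pub_year, reverse=True)
    return eligible[:target]
```

In this sketch the historical criterion is treated as a preference rather than a cutoff, so a region with few recent eligible titles simply reaches further back in time and may still fall short of the target of 100.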

The Future of Mini Worldlit

While this dataset is a significant step forward, it is only the beginning. Future expansions aim to include more languages, broader regional representation, and additional literary traditions.

We invite scholars, linguists, and computational researchers to explore the dataset, contribute insights, and help us build a truly global literary resource.

For full details, access the dataset here: https://doi.org/10.5334/johd.248