Introducing the MultiHathi multilingual fiction dataset

I'm excited to announce a new dataset for the study of multilingual fiction, developed by my student Sil Hamilton.

Prior work in the Digital Humanities has highlighted the importance of multilingual corpora for cultural study (Mahony, 2018; Spence & Brandao, 2021; Gil & Ortega, 2016). To help address the scarcity of such corpora, we present researchers with a dependable list of volumes with predicted fictionality tags representing over 500 languages. In doing so, we significantly increase the number of readily available texts tagged as fiction or non-fiction currently provided by the HathiTrust Digital Library.

Specifically, this dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction held in the HathiTrust Digital Library, written after 1799 in 521 different languages. The dataset bolsters the May 2022 Hathifile by supplying missing fiction tags, predicted with a bespoke BERT-based multilingual classifier. Our classifier completes the catalogue with an additional 400,000 non-English volumes predicted to be works of fiction, capturing 95% of all works presently provided by HathiTrust. For each work we provide metadata including its genre at the level of fiction or non-fiction, its length in pages, its original language, and the year it was published.
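For readers who want to work with the metadata, here is a minimal sketch of counting predicted non-English fiction volumes per language. It assumes a tab-separated layout, and the column names (`htid`, `lang`, `year`, `pages`, `predicted_genre`), the language codes, and the sample rows are all hypothetical placeholders, so check the released files for the actual schema.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real dataset may use different
# column names, language codes, and genre labels.
SAMPLE_TSV = (
    "htid\tlang\tyear\tpages\tpredicted_genre\n"
    "vol.001\tfre\t1857\t412\tfiction\n"
    "vol.002\tger\t1901\t230\tnon-fiction\n"
    "vol.003\tfre\t1922\t198\tfiction\n"
    "vol.004\teng\t1813\t350\tfiction\n"
)


def fiction_counts_by_language(handle, exclude_langs=("eng",)):
    """Count volumes tagged as fiction per language, skipping excluded languages."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        if row["predicted_genre"] == "fiction" and row["lang"] not in exclude_langs:
            counts[row["lang"]] += 1
    return counts


counts = fiction_counts_by_language(io.StringIO(SAMPLE_TSV))
print(counts.most_common())
```

To run this against a real file, replace the `io.StringIO` object with an open file handle, for example `open(path, newline="", encoding="utf-8")`.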

We also present insight into how multilingual classifiers can be trained with monolingual data, itself a discovery with implications for the study of lower-resource languages. We equipped XLM-RoBERTa with an additional classification layer and trained this layer for five epochs on 144,000 examples of 512-word spans of English fiction and non-fiction drawn from the CONLIT dataset (Piper, 2022). We found that our model performs well (an F1-score of at least 80%) in all tests despite having been trained only on English samples.
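To make the "512-word spans" concrete, here is a rough sketch of how a text could be cut into training examples of that size. The post does not say how the spans were actually drawn, so whitespace tokenization, non-overlapping windows, and dropping the short final remainder are all assumptions, and `split_into_spans` is a hypothetical helper rather than part of any released code.

```python
def split_into_spans(text, span_len=512, drop_short=True):
    """Split text into consecutive, non-overlapping spans of span_len words.

    Words are approximated by whitespace-separated tokens (an assumption).
    If drop_short is True, a final span shorter than span_len is discarded.
    """
    words = text.split()
    spans = [words[i:i + span_len] for i in range(0, len(words), span_len)]
    if drop_short and spans and len(spans[-1]) < span_len:
        spans.pop()
    return [" ".join(span) for span in spans]


# Toy document of 1,100 "words" to show the behaviour.
toy_text = " ".join(f"w{i}" for i in range(1100))
spans = split_into_spans(toy_text)
print(len(spans), [len(s.split()) for s in spans])
```

Each resulting span would then be paired with its fiction or non-fiction label before being passed to the classifier.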

This dataset now joins other Hathi-based datasets for the large-scale study of written culture.