txtLAB450. A Multilingual Data Set of Novels for Teaching and Research
I am very pleased to share a collection of 450 novels that we have assembled. The novels were published in English, French, and German during the long nineteenth century (1770-1930). They are labeled according to language, year of publication, author, title, author gender, point of view, and length in words. They are also labeled for use with the stylo package in R. They are drawn exclusively from full-text collections and thus should be free of the kinds of errors common in OCR’d texts. The novels are available for download here and the metadata here.
As Alan Liu recently remarked, putting together stable, small- to medium-sized data sets for use in the classroom and in our own research is a major requirement for digital research and pedagogy. These sets have been assembled with an eye to balance within and between languages in terms of gender, word length, date of publication, and point of view. We have tried to avoid over-representing any single author and to keep the collections relatively balanced between long and short works.
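For readers who want to check this balance for themselves, here is a minimal sketch in Python that tallies novels by any combination of metadata labels. It assumes the metadata has been saved as a CSV file with a header row. The file name `metadata.csv` and the column names `language` and `gender` used in the usage note are placeholders of mine, not the actual names in the download, so adjust them to match the file.

```python
import csv
from collections import Counter


def load_metadata(path):
    """Read a CSV file with a header row into a list of dicts, one per novel."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_by(rows, *columns):
    """Count rows for each combination of values in the given columns.

    The column names are whatever the header row of the metadata file uses;
    none are hard-coded here.
    """
    return Counter(tuple(row[column] for column in columns) for row in rows)
```

With the metadata saved locally, something like `count_by(load_metadata("metadata.csv"), "language", "gender")` would return one count per language and gender pair, which makes any imbalance easy to spot.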
Of course, more could always be done to refine or expand these collections. But we feel this offers a very good start for students and researchers who are interested in studying how their insights hold across three major European languages. For those who are interested, this collection was the basis of my essay on “conversional novels” in New Literary History.
We are hoping to add more languages to the collection as time goes on. If you wish to help us, please do contact me. We would really appreciate it.