Hathi1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust

Really pleased to announce the release of a new data set that I’ve been working on with my collaborator Sunyam Bagga. It builds on prior work by Ted Underwood and his team to develop parallel corpora of fiction and non-fiction writing spanning the past two centuries.

The data consist of 1,671,370 randomly sampled pages of English-language prose, roughly evenly divided between fictional and non-fictional writing and published between 1800 and 2000. As we describe in greater detail in the piece, this data set has the following affordances:

  • The page as historical unit. Rather than sample entire documents we focus on the “page” as a distinct historical unit of analysis.
  • Comparative framework. Rather than focus on a single genre of writing, the parallel nature of the corpus allows us to better understand linguistic changes that are specific to fictional and non-fictional writing.
  • Single model. We update Underwood’s method by using a single predictive model across the entire time period. This corrects for anomalies that arise when models trained on two different time periods are combined.
  • Enriched metadata. We provide 107 enriched features for every page, including part-of-speech tags, difficulty measures, and supersense tags as produced by BookNLP, among many others. While full text remains the gold standard for text data, these precomputed features let researchers begin working with the data immediately.
  • Portability and Accessibility. By focusing on page data, we provide a data set that is easily portable and can be analyzed on a local machine. The enriched features require no further computational processing, so researchers have immediate access to the data set. Because we also provide HathiTrust IDs, researchers can use the HathiTrust Data Capsule system to work with otherwise inaccessible in-copyright texts.
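To give a sense of what page-level, local analysis might look like, here is a minimal sketch in pandas. The column names (`htid`, `year`, `genre`, `difficulty`) and the toy rows are purely hypothetical stand-ins; the released data set's actual schema may differ.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the page-level feature table.
# Column names and values are illustrative, not the dataset's real schema.
pages = pd.DataFrame({
    "htid": ["mdp.001", "mdp.002", "uc1.003", "uc1.004"],
    "year": [1850, 1923, 1871, 1990],
    "genre": ["fiction", "non-fiction", "fiction", "non-fiction"],
    "difficulty": [7.2, 9.1, 6.8, 10.4],
})

# Select fiction pages published before 1900.
c19_fiction = pages[(pages["genre"] == "fiction") & (pages["year"] < 1900)]

# Because features are precomputed, summary statistics require no
# further text processing.
print(len(c19_fiction))                               # → 2
print(round(c19_fiction["difficulty"].mean(), 2))     # → 7.0
```

Because every row carries a HathiTrust ID, a filtered subset like this can also serve as a seed list for full-text work inside the Data Capsule environment.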

We see this as an exciting example of how research in the digital humanities can build on prior work to generate better and more accessible data sets. We are extremely appreciative of the work of Underwood and his team, who created the conditions for this work at such a high level.

Our next steps are to see whether we can leverage this approach to generate multilingual data sets to match the English-language collection. More to come!

The data can be accessed here: https://doi.org/10.7910/DVN/HAKKUA.