Detecting footnotes in 32 million pages of ECCO

I’m very proud to be part of the collaborative project that produced this new article in the Journal of Cultural Analytics, where we outline the process we used to visually detect footnotes in 32 million page images. (Full disclosure: I am the editor; to avoid conflict of interest, all pieces with my name on them are double-blind peer-reviewed and the acceptance/rejection process is handled by a separate editor.) The results are part of a multi-year study of the visual practices of scientific communication called “The Visibility of Knowledge.” Footnotes are one of the four primary practices we are studying, alongside tables, diagrams, and illustrations.

The aim of the project is to use emerging techniques in document image analysis to study large-scale historical practices. In our case, this means developing techniques to detect particular visual features across large, heterogeneous text collections. We have begun with ECCO because it is one of the most widely used historical databases in the field. It also covers the period when scholars believe these kinds of visual scientific practices began to be implemented with greater frequency.

Our second data set is a collection of proceedings of the national academies of science of five European countries, spanning the 17th century to the early 20th.

Our goals for this first paper are twofold:

a) to describe the interdisciplinary undertaking, which involves researchers at McGill, ETS, and the University of Virginia, and

b) to release metadata related to our feature of interest.

Users of ECCO can now subset documents by the percentage of footnoted pages or download a table of all footnoted pages within ECCO and inspect them.
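As a rough sketch of how such a metadata table might be used: the snippet below filters a small stand-in DataFrame by the percentage of footnoted pages. The column names (`document_id`, `pct_footnoted_pages`) and the sample values are illustrative assumptions, not the actual schema of the released data.

```python
import pandas as pd

# Stand-in for the released footnote metadata; in practice you would load it
# with pd.read_csv(). Column names and values here are hypothetical.
records = pd.DataFrame({
    "document_id": ["doc_0001", "doc_0002", "doc_0003", "doc_0004"],
    "pct_footnoted_pages": [0.0, 12.5, 48.0, 3.2],
})

# Subset to documents where at least 10% of pages contain a detected footnote.
heavily_footnoted = records[records["pct_footnoted_pages"] >= 10.0]
print(heavily_footnoted["document_id"].tolist())
```

The same boolean-indexing pattern would let users combine the footnote percentage with other ECCO metadata fields (date, author, and so on) once the tables are joined.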

Our next step is to parse ECCO more finely by genre. It is a very heterogeneous text collection, and understanding the genre divisions within the corpus will give us a more fine-grained picture of these typographical practices. Coming soon.