The Visibility of Knowledge

I am very pleased to announce that our new collaboration with Chad Wellmon and Mohamed Cheriet has been funded by the Social Sciences and Humanities Research Council of Canada. The project is called “The Visibility of Knowledge: The Computational Study of Scientific Illustration in the Long Nineteenth Century.” Our aim is to study how scientific knowledge became visible to readers over the course of the eighteenth and nineteenth centuries using new computational techniques in image detection. The project is based on previous work that was part of a Digging into Data Challenge Grant.

The nineteenth century is traditionally understood as the great age of scientific popularization, when scientific knowledge moved from a specialized writing practice to one that was increasingly engaged with by a broader public. Bringing together a team of humanists and computer scientists, we are interested in understanding how visual techniques, such as the use of footnotes, diagrams, tables and mimetic representations of objects, were used to engage the public and make new ideas accessible. Rather than focus exclusively on language, we want to know more about the different cultures of scientific representation that accompanied this writing.

Drawing our data from the digitized collections of Gale’s ECCO, NCCO, and HathiTrust will allow us to dramatically alter the scale of our analysis compared to previous histories of science. How widespread were practices of scientific illustration during this period, and in what fields do we see different practices gaining more or less traction? Do we find certain types of practices aligning with different types of audiences or publics? What is the relationship between popular and specialized publications with regard to illustration? Our ideal outcome is not only new knowledge about the past, but also a collective resource for others to use. We hope our efforts at visually analyzing books will provide ample new data for other historians of culture.

One of the principal aims of this project is to continue the effort to bring humanists and computer scientists together to better understand our cultural past and present. Understanding how science was received by a broader reading and viewing public in the past not only gives us a more expansive and potentially more diverse portrait of scientific illustration than previous accounts have allowed; it also provides a key historical dimension to contemporary debates about the openness and accessibility of scholarly ideas today. How ideas were visualized in print through different kinds of techniques offers important context for our present moment of thinking about the visual nature of intellectual production and its accessibility to a wider audience. It can help us see the long history of how text and image have intersected within the practices of knowledge diffusion.

Footnote Detection

What does it take to understand a page visually at the machine level?

That is our guiding question and one that we are initially applying to the problem of detecting footnotes in a large corpus of German Enlightenment periodicals.

Why footnotes?

According to much received literary critical wisdom, one of the defining features of the Enlightenment is not only a growth in periodicals — more printed material circulating in a more timely fashion than ever before — but also more footnotes, or more broadly speaking more indices — that is, more print that points to print. As the world of print became increasingly heterogeneous and faster paced, it evolved mechanisms for indexing and pointing to this growing amount of material in order to make it intelligible. Indexicality, as my colleague Chad Wellmon has argued, is one of the core features of Enlightenment.

Below I give some idea of what it means to identify footnotes through a process we call visual language processing. This work has been produced by the stellar efforts of the members of the Synchromedia Lab at Montréal’s ETS, directed by Mohamed Cheriet. These members include Ehsan Arabnejad, Youssouf Chherawala, Rachid Hedjam, and Hossein Ziaei Nafchi.

 

Step 0: Know thy page.

Here is an image of a sample page. This gives some indication of the overall layout and quality of the images. The good news is that unlike manuscripts, these pages are very regular. The bad news is that regular is a relative term. The very bad news is the quality of the reproduction.

[Figure: sample page from the periodicals corpus]

 

Step 0.1: Hypotheses

To begin, the humanist team member (in this case Andrew Piper) identified five possible features in advance of the project that he thought might be computationally identifiable and would also be of interest to literary historians. These are:

  1. Footnotes per page. If the thesis is that Enlightenment = indexicality, then we should find more footnotes as time progresses in our data set. The first, most basic question is: how many footnotes are there in these data, and can we get a per-page measure?
  2. Footnoted Words. Can we capture the words that are most often footnoted, and if so, what can this tell us about Enlightenment concerns? When authors footnote, do these cluster around common semantic patterns?
  3. Footnote Length. Does length matter? Are longer footnotes indicative of a certain paratextual style that is useful for grouping documents into categories? When authors use long versus very short footnotes, do they tend to be working in similar genres?
  4. Language Networks. Can we detect which language is being referred to in the footnote? Knowing the degree to which footnotes become more or less internationally minded as the Enlightenment progresses into the nineteenth century will give us an indication of the nationalization of print discourse during this period. Because German printing uses different typefaces for Latin and German scripts (roman versus Fraktur), we should be able to identify at least these distinctions (with perhaps Greek and Arabic added in).
  5. Citation Networks. This is the holy grail: can we extract the references that are cited in the footnote? Can we construct a citation network for our periodicals, a record of all the references that are mentioned in German periodicals over an eighty-year period, to better understand the groupings and relationships that print articulates about itself?

 

Step 1: Enhance images

Once these hypotheses were in place, the next step was to prepare a sample set of pages for detection. This involves two steps:

First, try to repair the incompleteness of the reproduction by focusing on a) reducing background noise and b) enhancing the stroke completion of the letters.

a) Reducing background noise

[Figure: background noise reduction, before and after]

b) Repairing holes in strokes

[Figure: stroke hole repair, before and after]
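To make these two steps concrete, here is a minimal sketch using OpenCV. This is illustrative only, under my own assumptions, and not the Synchromedia Lab's actual pipeline: non-local means denoising suppresses background speckle, and a morphological closing bridges small holes in letter strokes.

```python
# Minimal sketch of page enhancement with OpenCV; illustrative only,
# not the Synchromedia Lab's actual pipeline.
import cv2
import numpy as np

def enhance_page(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # a) Reduce background noise while preserving letter edges.
    denoised = cv2.fastNlMeansDenoising(gray, h=15)

    # Binarize: Otsu picks a global threshold separating ink from paper.
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # b) Close small holes in strokes with a small structuring element.
    kernel = np.ones((2, 2), np.uint8)
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
```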

Second, pages are not all uniformly aligned, so we need to correct for skew in order to perform word segmentation and line measurements (and to remove shadows prior to OCR).


[Figure: skew correction example]
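One common way to estimate skew, sketched below under the same caveat (this is a standard recipe, not necessarily the lab's method): fit a minimum-area rectangle around all the ink pixels and rotate the page by the resulting angle.

```python
# Sketch of skew correction on a binarized page (ink = white).
import cv2
import numpy as np

def deskew(binary):
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]  # angle of the tightest bounding box
    # Note: OpenCV's angle convention changed across versions; the
    # normalization below assumes the older (-90, 0] range.
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = binary.shape
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(binary, M, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```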

 

Step 2: Footnote Markers

Once the pages have been processed, we can search for our first feature. Can we reliably find footnote markers? In modern texts these are usually numbers; in the periodicals they were most often a single asterisk (*) or a double asterisk (**). These three slides should give a good idea of the task and its challenges. The results on our sample set for accurately detecting footnote markers were:

Precision = 85.58% and Recall = 71.65%.

[Figures: footnote marker detection examples]

[Figure: footnote marker detection results]
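For readers curious how such a detector might work, here is a hedged sketch using normalized template matching: crop one clean asterisk from a page, slide it across the image, and keep every position where the correlation exceeds a threshold. The lab's detector is more sophisticated than this; precision and recall figures like those above would be measured against hand-labeled markers.

```python
# Sketch of asterisk detection via template matching; illustrative only.
import cv2
import numpy as np

def find_markers(page, template, threshold=0.7):
    """Return (x, y) positions where the template matches the page."""
    result = cv2.matchTemplate(page, template, cv2.TM_CCOEFF_NORMED)
    ys, xs = np.where(result >= threshold)
    # Nearby hits should be merged (non-maximum suppression) in practice.
    return list(zip(xs.tolist(), ys.tolist()))

# Hypothetical usage: reuse a cropped marker as the template.
# page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
# template = page[1120:1140, 210:230]  # hypothetical crop of one asterisk
# markers = find_markers(page, template)
```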

 

 

Step 3: Footnoted Words

Next, we began looking to see if we could capture just those words that appear immediately before the footnote marker. Is there something common to the footnote vocabulary of the German Enlightenment?

[Figures: footnoted word detection examples]
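Here is a sketch of how the word preceding a marker might be isolated, assuming a binarized, deskewed page and a detected marker position: smear letters together horizontally so each word becomes a single blob, then take the blob on the same line whose right edge sits closest to the left of the marker.

```python
# Sketch of word segmentation and marker-adjacent word selection.
import cv2

def word_boxes(binary):
    """Bounding boxes of word-sized ink blobs."""
    # A wide, short kernel merges letters within a word but not across words.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (12, 3))
    smeared = cv2.dilate(binary, kernel)
    contours, _ = cv2.findContours(smeared, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]

def word_before_marker(boxes, mx, my, line_tol=10):
    """Box on the marker's line whose right edge is nearest to its left."""
    candidates = [(x, y, w, h) for (x, y, w, h) in boxes
                  if abs(y + h // 2 - my) < line_tol and x + w <= mx]
    return max(candidates, key=lambda b: b[0] + b[2], default=None)
```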

 

Step 4: Footnote Length

Next, we concerned ourselves with trying to identify the length of footnotes. This involved a process of line segmentation instead of word segmentation.

[Figure: line segmentation example]

[Figure: footnote length measurement example]
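Line segmentation is often done with a horizontal projection profile, sketched below: rows containing ink form bands, blank rows form gaps, and footnote length can then be counted in line bands below the marker. Whether this matches the lab's actual segmentation method is my assumption.

```python
# Sketch of line segmentation via a horizontal projection profile.
import numpy as np

def segment_lines(binary):
    """Return (top, bottom) row indices for each band of text."""
    inked = (np.asarray(binary) > 0).sum(axis=1) > 0  # rows containing ink
    lines, start = [], None
    for row, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = row                 # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, row))  # the line ends at the first blank row
            start = None
    if start is not None:
        lines.append((start, len(inked)))
    return lines
```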

 

Step 5: Language Detection

Finally, we began to explore how we could identify words in different languages. We found a very interesting way of detecting language at the level of the line by comparing the distribution of letter heights. It turns out that German and Arabic scripts, for example, place very different emphases on up and down strokes. We are still working on trying to understand this at the level of the single word or phrase.

[Figure: letter-height distributions across scripts]
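Here is a minimal sketch of the letter-height idea; the histogram comparison (a chi-square distance) is my own assumption rather than the lab's published measure. Connected-component heights within a line are normalized into a histogram and compared against reference histograms for known scripts.

```python
# Sketch of script detection from letter-height distributions.
import cv2
import numpy as np

def height_histogram(line_img, bins=16):
    """Normalized histogram of connected-component heights in a line image."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(line_img)
    heights = stats[1:, cv2.CC_STAT_HEIGHT]  # skip the background component
    hist, _ = np.histogram(heights, bins=bins, range=(1, line_img.shape[0]))
    return hist / max(hist.sum(), 1)

def chi_square(p, q, eps=1e-9):
    return 0.5 * float(np.sum((p - q) ** 2 / (p + q + eps)))

# Hypothetical usage: classify a line by its nearest reference distribution.
# references = {"fraktur": ..., "roman": ..., "greek": ...}
# script = min(references, key=lambda s: chi_square(h, references[s]))
```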

 

Step 6: Citation Network

Our last, very high-level question is: can we detect the words that indicate the titles to which footnotes refer? This is a much more complex problem. To be continued…

Digging into Data: Global Currents

Detail, Canterbury Psalter (1147)

What can you learn from the visual features of a page?

This is the question that lies at the centre of our Digging into Data project, the awards for which were announced yesterday. A vast amount of our textual heritage has so far been resistant to large-scale data analysis, whether non-western scripts or early- or pre-print documents. These are works that do not lend themselves well to current OCR technology, and thus to the usual approaches of data and text mining.

Partnering with Mohamed Cheriet at the Synchromedia Lab at the École de technologie supérieure in Montreal and Lambert Schomaker of the Artificial Intelligence Lab at Groningen University in the Netherlands, we will be applying new image-processing techniques to better understand the relations between pages at the visual level. Rather than OCR a text and compare the relations between words, we want to know how pages correlate with one another through their visual features. How much semantic information is contained in the visual dimensions of a page, and what other kinds of information are encoded there — whether indices of scribal communities or perhaps styles of ornamentation that marked different periods or cultures? Although we think of texts as things that we read, texts are first and foremost visual objects. Our goal is thus to account for new kinds of texts and new kinds of textual information that have so far been missing from the big data turn.
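As a toy illustration of what comparing pages at the visual level could mean in its crudest form (the project's actual feature set will be much richer), the sketch below reduces each page to its row and column ink profiles and compares pages by cosine similarity.

```python
# Sketch of OCR-free page comparison via crude layout fingerprints.
import cv2
import numpy as np

def page_fingerprint(path, size=256):
    """Concatenated row/column ink profiles of a page on a fixed grid."""
    gray = cv2.resize(cv2.imread(path, cv2.IMREAD_GRAYSCALE), (size, size))
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    rows = (binary > 0).sum(axis=1)  # ink per row
    cols = (binary > 0).sum(axis=0)  # ink per column
    return np.concatenate([rows, cols]).astype(float)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
```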

This is just the first step, however. Our second principal question is: knowing something about the visual relations between pages, can we create larger maps of connections between texts in corpora of writing that represent different world cultures at different historical junctures? Can we understand the networks of literary exchange that existed and helped define these different cultural formations of the past? To this end, we are bringing together four different databases for our analysis that have been curated by researchers at McGill and Stanford: post-classical Islamic philosophy (Robert Wisnovsky), Chinese Women’s Writing from the Ming-Qing Dynasties (Grace Fong), the Anglo-Saxon Middle Ages (Elaine Treharne), and the European Enlightenment (Andrew Piper/Mark Algee-Hewitt). These collections bring together writings from diverse spans of time and space, from 1050, the beginnings of Islamic post-classical philosophy and the Anglo-Saxon high Middle Ages, to 1900, the onset of various global modernisms across China, the Middle East, and Europe. Together they comprise 1,194,000 pages. Uniting each of these domains, we would suggest, is the shared sense of being a culture in transition. Our aim is thus as capacious as it is straightforward: how are these different transitional periods and places characterized by networks of shared ideas?

Partnering with Derek Ruths of the Network Dynamics Lab at McGill University, we will be asking some of the following questions regarding our different cultural collections (a sketch of the corresponding network measures follows the list):

  • What texts were most central to a particular epoch? What do such texts have in common with one another?
  • What texts play a mediating role between different communities of writing within a corpus? To make the bridge between different clusters of texts, what kinds of writing does one most often pass through?
  • How are these different cultures themselves composed of different textual communities? Do we find that different periods or cultures are marked by different degrees of community structure (many smaller communities versus a few strong concentrations)?
  • How do ideas move across time? Are there strong correlations between works from similar time periods or do we find periods more defined by anachronism or recycling?
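A hedged sketch of the network measures these questions imply, using networkx on a toy graph: nodes are texts and the weighted edges stand in for hypothetical visual-similarity scores (the real graphs will be built from the page-level comparisons described above).

```python
# Sketch of centrality, bridging, and community measures on a toy graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_weighted_edges_from([   # hypothetical similarity scores
    ("text_a", "text_b", 0.8),
    ("text_b", "text_c", 0.6),
    ("text_c", "text_d", 0.7),
    ("text_a", "text_c", 0.5),
])

# Which texts sit at the core of an epoch?
centrality = nx.eigenvector_centrality(G, weight="weight")

# Which texts bridge otherwise separate clusters? (Betweenness treats
# weights as distances, so we leave the toy graph unweighted here.)
bridging = nx.betweenness_centrality(G)

# Are there many small communities or a few strong concentrations?
communities = greedy_modularity_communities(G, weight="weight")
```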

These are just some of the questions we hope to answer over the next two years. The value, we think, lies in the way this allows us to put into practice a model of comparative globalism, one that places major, often transnational regional cultures from diverse parts of the world in conversation with one another while at the same time preserving the uniqueness of their cultural differences. Our goal is not a flattening of the world into a single, unified cultural account, but the study of the communicative underpinnings that maintain these differences. The quantitative study of literary networks, we argue, allows for a renewed project of comparative inquiry, one that enables artifacts of a very different nature, whether of medium, script, language, or epoch, to be put into conversation with one another. This project would mark the first cross-cultural comparative study of literary networks of its kind.