LLCU 255: Intro to Literary Text Mining — New Syllabus 2017

Less but better. That’s the essentialist’s motto and that’s the one I use every year when I revise my syllabus. I keep removing things and students keep learning more every year. While there is clearly a ceiling for this approach, it works remarkably well as a pedagogical tactic. Here’s the full syllabus.

This year’s class will focus on three things:

  1. understanding what text mining or literary modeling is. I am always struck by how few students have ever heard of this field.
  2. being able to undertake a variety of analytical tasks, including preparing your data, significance testing, clustering, machine learning, sentiment analysis, and social network analysis.
  3. starting to generate ideas about how to apply these tools to good questions.

It’s the last one that is always the hardest. Learning how to use R may seem intimidating at first, but devising creative models and measures for complex literary concepts is the most difficult part of this research.

The most rewarding part of this class is seeing the mental transformation in students when the light bulb goes on: oh, you mean I can test my beliefs on more than one text!?! That’s awesome!

Congratulations to this year’s students!

We have had an excellent year at .txtLAB. I want to send out a special thanks to all of the students who have been contributing to the lab. You’ve made it a great place to work. Here is a list of projects that we’ve been working on this year:

Culture + Computation: New Syllabus in Cultural Analytics LLCU 614


I am pleased to add this year’s syllabus for my graduate course, “LLCU 614, Cultural Analytics: The Computational Study of Culture.” The aim of the course is twofold: 1) to introduce students in the humanities to the computational and quantitative methods for studying culture in order to move beyond the use of anecdotal evidence and 2) to introduce students in computer science to the importance of theory for studying culture, i.e. to avoid a naive approach to data analysis. As I mention in my opening class, this course is about valuing different ways of looking at cultural questions and also conceding major methodological flaws in our current disciplinary orientations. Everyone in the room has something valuable to add given their disciplinary training and everyone also brings essential blind spots to the study of culture, including myself. This course is about making us all more sophisticated cultural analysts.

Literary Text Mining Syllabus


It’s that time of year and so I’m posting the latest syllabus for my data and literature class. I have found over the years that every time I create a new class I start with too much and gradually winnow as the years go by (until there is nothing left and I teach a new class…). This year is no different. Here are some things I’ve learned:

  • there are very few good readings on text mining for undergraduates. did you enjoy reading full-blown research articles when you were in university? every year I take more and more off because they are just too confusing.
  • apparently undergrads in the Arts hate programming. who knew?! I lose 50% of my class with the first assignment.
  • this is particularly sad because I think in the programming lies all the knowledge. Run a type-token ratio and see that there are just 5% unique words in Jane Austen’s Pride and Prejudice and you have gained more of an insight into the novel than all of Jameson’s work combined could ever teach you.
  • teaching statistics on top of literary theory and computer programming really is the proverbial straw that breaks the camel’s back. I do it anyway.
  • if you’re going to do this anyway, then you need to go very slowly. You can review one study for a few weeks to understand the whole process from modelling to data selection to significance testing. Maybe one study for a whole semester.
  • ultimately I see myself waging guerrilla warfare — hopefully my students will circulate through humanities departments and constantly ask annoying questions in their other classes like, “How big is your sample?” or “Can you be more explicit about how you generalize from your one example?” Because let’s be clear, at 99-1 we’re still the underdog…
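
For readers who have not met the type-token ratio mentioned above, here is a minimal sketch of the idea: count the distinct word forms (types) and divide by the total number of words (tokens). It is written in Python rather than the R used in class, and the function name and the very simple tokenizer (lowercased runs of letters and apostrophes) are my own assumptions for illustration, not the course’s actual code.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms (types) divided by total words (tokens)."""
    # Simplifying assumption: a "word" is any lowercased run of letters
    # and straight apostrophes; real projects usually tokenize more carefully.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


# The opening sentence of Pride and Prejudice, used here as a tiny sample.
sample = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)
print(round(type_token_ratio(sample), 2))
```

Keep in mind that the ratio depends heavily on text length: a single sentence repeats few words, so it scores far higher than a whole novel would, and comparisons are only meaningful between samples of similar size.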

Looking forward to another awesome semester of empowering students to be critical readers and creative analysts.



txtLAB450. A Multilingual Data Set of Novels for Teaching and Research

I am very pleased to share a collection that we have assembled of 450 novels published in English, French, and German during the long nineteenth century (1770-1930). The novels are labeled according to language, year of publication, author, title, author gender, point of view, and word length. They have also been labeled for use with the stylo package in R. They are drawn exclusively from full-text collections and thus should not contain the kinds of errors found in OCR’d texts. The novels are available for download here and the metadata here.

As Alan Liu recently remarked, putting together stable, small to medium sized data sets for use in the classroom and our own research is a major requirement for digital research and pedagogy. These sets have been assembled with an eye to balance within and between languages — in terms of gender, word length, historical dates, and point of view. We have tried to avoid the over-representation of any single author and tried to ensure that the collections are relatively balanced between long and short works.

Of course more could always be done to refine and/or expand these collections. But we feel this offers a very good start for students and researchers who are interested in studying how their insights work across three major European languages. For those who are interested, this collection was the basis of my essay on “conversional novels” in New Literary History.

We are hoping to add more languages to the collection as time goes on. If you wish to help us, please do contact me. We would really appreciate it.