As part of my new book, I have made the code and all derived text data freely available online. The underlying text data has been shared as far as copyright restrictions would allow. As I mentioned in my initial post, this entailed a massive amount of information as well as labor: close to 150,000 files that accompany a 250-page book and about 7,000 lines of code.
Others have written about the importance of sharing code and data for furthering academic research. On that front I don’t have anything new to say. Enumerations is part of a widespread movement that is trying to make research more transparent, including in the humanities. (For example, the journal I edit, Cultural Analytics, has a code and data repository to accompany every article.) Nevertheless, it is important to point out how novel this is for the humanities, which have traditionally been deeply opaque about their evidence and methods. How did one choose a passage from a novel or poem to focus on from among all possible passages? What were the steps taken to arrive at the insight about the passage, the document to which it belongs, or the historical period? Imagine asking a researcher in the humanities to turn over all notes, library records, and archive slips and you get an idea of the transparency gap that currently exists between what we might call bibliographic and computational research.
Two other points are worth making. Given the novelty of this practice in the humanities, it is inevitable that a) there are better ways to do it and b) there will be errors. The beauty of a scientific system is that, through replication and review, these improvements can be made over time. It is crucial that as we embark down this road we do so with generosity and collegiality. When people share (anything) they are making themselves vulnerable. It is imperative that we take that vulnerability seriously if we want more people to participate. Openness isn’t just about producers, but also recipients.
In the months to come I will be trying to make the pieces of code used in the book more modular and accessible so that people can use them for teaching or their own research. For now, everything is written so that you can follow the choices and discoveries of each chapter. I hope you have fun with it because I certainly did.