In a new paper, I attempt to replicate the findings of the recent study “The rise and fall of biodiversity in literature” by Langer et al. (2021). Using a large corpus from Project Gutenberg (N ≈ 15,000) and a dictionary-matching method covering over 240K biological taxa, Langer et al. find that the frequency and diversity of biological taxa have been declining steadily since the first half of the nineteenth century, echoing prior work in cultural analytics.
My paper applies the original paper’s three primary measures to the original dataset and to two additional datasets, and compares the dictionary-based method with an alternative supervised machine learning method. I find that the trajectory of biological tokens in fiction in the new datasets is directionally opposite to that shown by Langer et al., regardless of the method used (i.e., taxa have risen rather than fallen since the first half of the nineteenth century), but that their breakpoint estimation appears largely robust to within ±15 years.
Based on this analysis, I suggest that the discrepancy between our results is due to corpus construction rather than choice of method. I find that conditioning only on fiction in the original dataset yields results more similar to those from the two alternative datasets used here. A key takeaway is that we need to account for the effects of genre when examining historical writing.
In addition to emphasizing the importance of corpus construction for cultural analytics, these findings raise larger questions about the difficulty of interpreting lexical items as indices of social attitudes, pointing to a need for future work.