The Replication Crisis I: Restoring confidence in research through replication clusters

Much has been written about the so-called “replication crisis” going on across the sciences today. These issues impact literary and cultural studies in many ways, though not always straightforward ones. “Replication” has a complicated fit with more interpretive disciplines, and its implications are worth thinking through. In the next few weeks I’ll be writing some posts about this to try to generate a conversation around the place of replication in the humanities.

One of the first ways this has manifested itself is in the journal I edit, Cultural Analytics, which has just put out a call for publishing null results as one answer to this problem. By focusing on projects that fail, we remove the pressure to always find “significance,” and thus to overestimate effects in the real world. This is as true for quantitative disciplines as it is for qualitative ones. When was the last time you read an article in literary studies that said, “I didn’t find anything significant here”?

There are many other suggested solutions out there and more coming. Here I would like to propose a new, far more experimental one. I call it replication clusters.

When it comes to replication (i.e., can I get a result similar to yours?), people often focus on “duplication.” If I use your code and your data, does it work? This is a good reality check. Are there weird decisions the researcher made that might impact the results? Are there buried “degrees of freedom” that make the results more ambiguous than they appear? Or are there outright “this is not a good way to run this algorithm” problems?

The problem of course is that replicating someone else’s code is a giant pain in the ass. And there is very little payoff unless you are the kind of person who enjoys the “burn on you, you made a mistake” moment. It’s possible you’ll discover an error, but it’s also possible you’ll waste a lot of time doing exactly what someone else already did.

But that’s the point of replication! We need it. But if it only happens when there’s a feeding frenzy (see the Cornell food guy, hence the food metaphor) and otherwise basically doesn’t, that tells us something about human behaviour that will be very hard to overcome. Replication is fundamentally boring and tedious. Those who do it probably also have (impure?) motives of their own, so their efforts are likely to be marred by biases that in turn create uncertainty about the replication itself. If Researcher A wanted to find an effect and Researcher B didn’t want to find one, how are the odds of overstatement any different? It doesn’t feel like much of a resolution.

What might be more valuable is to do what researchers call “conceptual replication.” This is where you “extend” the method or approach to new data, testing the generalizability of the findings. To my mind this is a much more interesting way of replicating, and it answers a question that needs answering: do these results hold when I look at different texts? It gives you an incentive to learn something new and to gain confidence that some theory or hypothesis is “true” (or at least more durable). The more evidence we have from different sources that all continue to show X, the more we can say X is true. (In my next post I’ll talk about the problem of the “X is true” model for interpretive sciences.)

This is where the idea of replication clusters comes in. Typically we measure the power of an idea by citation count: if a lot of people cite something, it must be more valid. This metric is flawed in too many ways to be useful. Plenty of highly cited (and fashionable) insights turn out to be hard to replicate (power poses, anyone?).

Replication clusters, on the other hand, are indices of how many times something similar has been tested in different ways. “Similarity” is key here, not duplication. Similar but different. That’s how truth builds.

How might we do this? I have no idea! But there might be a way, through text analytics, to identify articles that extend an initial study in a confirmatory way. These would be highly similar to the source article but lack critical language toward it. It feels doable, just hard. The outcome would be a metric that tells you how many times the idea of an article has been conceptually replicated (i.e., extended). That is way better than citation counts.
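To make that slightly more concrete, here is a minimal sketch of what such a detector could look like, using off-the-shelf tools (scikit-learn’s TF-IDF vectorizer and cosine similarity). The similarity threshold and the hand-curated list of “critical” phrases are illustrative placeholders I am assuming for the example, not validated choices, and a real metric would need far more than keyword matching.

```python
# Sketch: counting candidate "conceptual replications" of a source article.
# Assumes plain-text abstracts or full texts; the threshold and the
# critical-phrase lexicon are placeholders, not validated parameters.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical lexicon of phrases signaling critique rather than extension.
CRITICAL_PHRASES = {
    "fails to", "overlooks", "contrary to", "does not replicate",
    "we find no evidence", "calls into question",
}

def replication_cluster_size(source_text, candidate_texts, sim_threshold=0.35):
    """Count candidates that look like confirmatory extensions of the source:
    topically similar (TF-IDF cosine similarity above a threshold) and free
    of overtly critical language toward prior work."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([source_text] + list(candidate_texts))
    sims = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

    cluster = 0
    for text, sim in zip(candidate_texts, sims):
        is_similar = sim >= sim_threshold
        is_critical = any(p in text.lower() for p in CRITICAL_PHRASES)
        if is_similar and not is_critical:
            cluster += 1
    return cluster
```

Run against a corpus of later articles, this would return a rough count of how many read as confirmatory extensions. The real work, of course, would be validating the threshold and the lexicon against hand-coded examples.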

And yeah, I just used the word truth a bunch of times because today more than ever we need methods that help us achieve consensus around shared facts. Science critique is great. Science denial, not so much.

Next I’ll dive into why “replication” poses a problem for interpretation.