In my previous post I tried to illustrate how different runs of the same topic modelling process can produce topics that differ slightly in meaning. If you keep k and all other parameters constant but change the random seed, you'll see the kind of variation that I showed.
The question that I want to address here is whether we can put a number on that variation, so that we can see which topics are subject to more semantic variability than others.
I’ve gone ahead and written a script in R that, for a given topic, finds the most similar topic in each of the other runs and calculates the average difference between them. You can download it on GitHub.
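The underlying idea is simple enough to sketch. The snippet below is a minimal Python illustration of the metric, not the R script itself: it assumes each run is represented as a k × vocabulary matrix of topic-word weights, and uses cosine distance as the notion of "difference" (the actual script may use a different measure).

```python
import numpy as np

def topic_stability(runs):
    """For each topic in the first run, find the most similar topic
    (by cosine similarity of topic-word weights) in every other run,
    and average the resulting distances (1 - similarity).

    `runs` is a list of (k x vocab_size) NumPy arrays, one per run.
    Returns one stability score per topic: lower means more stable."""
    reference = runs[0]
    scores = []
    for topic in reference:
        dists = []
        for other in runs[1:]:
            # Cosine similarity between this topic and every topic
            # in the other run; keep only the best match.
            sims = (other @ topic) / (
                np.linalg.norm(other, axis=1) * np.linalg.norm(topic)
            )
            dists.append(1 - sims.max())
        scores.append(np.mean(dists))
    return np.array(scores)
```

Because each topic is matched to its nearest counterpart rather than to the topic with the same index, the score is unaffected by the arbitrary ordering of topics across runs: a topic that reappears (in any position) with near-identical word weights scores close to zero.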