Topic Stability, Part 2

In my previous post I tried to illustrate how different runs of the same topic modelling process can produce topics that appear to be slightly semantically different from one another. If you keep k and all other parameters constant, but change your initial seed, you’ll see the kind of variation that I showed.

The question that I want to address here is whether we can put a number to that variation, so that we can understand which topics are subject to more semantic variability than others.

I’ve gone ahead and written a script in R that calculates the average difference between a given topic and the most similar topic to it in each of the other runs. You can download it from GitHub.

I use KL-divergence to calculate the difference between topics, meaning I am asking how much information is lost when we approximate a topic’s term probability distribution by the most similar topic from another model. I then take the average KLD of all these comparisons for a given topic. I have run 10 models (small N!) and so that is the basis of this average.
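The calculation described above can be sketched roughly as follows. This is a minimal Python illustration and not the R script itself. It assumes each run is a list of topics, each topic is a list of word probabilities over the same vocabulary in the same order, the natural log is used, and a small `eps` is added to avoid taking the log of zero. The function names `kl_divergence` and `topic_stability` are made up for the example.

```python
import math
from statistics import mean, stdev

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps smoothing (an assumption) avoids log(0)."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def topic_stability(reference, other_runs):
    """For each topic in `reference`, take the smallest KLD to any topic in
    each other run, then return (mean, sd) of those minima per topic."""
    results = []
    for topic in reference:
        closest = [min(kl_divergence(topic, cand) for cand in run)
                   for run in other_runs]
        sd = stdev(closest) if len(closest) > 1 else 0.0
        results.append((mean(closest), sd))
    return results

# Toy data: three "runs" of two topics over a four-word vocabulary.
run_a = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
run_b = [[0.1, 0.1, 0.2, 0.6], [0.6, 0.2, 0.1, 0.1]]  # similar topics, reordered
run_c = [[0.65, 0.15, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]

stability = topic_stability(run_a, [run_b, run_c])
for i, (m, s) in enumerate(stability, start=1):
    print(f"topic {i}: mean KLD = {m:.3f}, sd = {s:.3f}")
```

In the toy data the first topic has a close match in both other runs, while the second has no good match in `run_c`, so its mean KLD comes out higher. Note that KLD is asymmetric, so the direction of the comparison matters; the direction used here (reference topic first) is one reasonable choice rather than a confirmed detail of the original script.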

When I run this on my 10 models, I see a fairly wide range of “stability” across topics (mean KLD = 0.87, standard deviation = 0.68; the full table is pasted in at the bottom, along with a table of topic words). The topic with the lowest average information loss between runs is the “French” topic (jean, roman, pierre, saint, review, french, donne, france, rousseau, marie), followed by the junk topic (think, people, know, way, say, just, things, something, time, now) and, more interestingly, the “race” topic (black, white, african, race, racial, slave, africa, slavery, south, negro). These have average KLD scores of roughly 0.15 to 0.16.

The topics with the highest divergence are the French-Arab studies topic (french, france, arabic, first, arab, studies, two, many, since, time, century) with a score of 3.5 (!) and some sort of novelists topic (romance, dickens, tom, house, gothic, hawthorne, melville, poe, mark, conrad) with a score of 2.6. Interestingly, the next highest, at 2.2, is the poetry topic (poetry, poet, poets, poetic, eliot, poems, poem, prose, pound, eliots), which looks very stable when you observe the top few words but must have a lot more variability further down the list.

Two things immediately come to mind. The first is that these high-scoring topics score far above both the average and the low scorers. They definitely feel like outliers. Frankly, I wouldn’t invoke the French-Arab studies topic as a topic at all. It is too unstable to support a deep dive: it essentially turns up as something different each time. It might even make sense to subset any analysis of a model by removing topics with this much variability. Since they are so evanescent, do we really want to conclude anything based on them? In this sense, I think this kind of measure can be very valuable in gaining confidence about the stability of topics or in subsetting on a group of more stable topics.
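One way to make that subsetting concrete is to standardize each topic’s mean KLD and drop topics above some cutoff. The sketch below is only an illustration: the function name `flag_unstable` and the cutoff of 2.0 are assumptions, not anything from the R script. It uses the overall mean (0.87) and standard deviation (0.68) reported above, and a handful of rounded mean KLD values from the table at the bottom. The “norm” column in that table appears consistent with this kind of z-score.

```python
def flag_unstable(mean_kld, overall_mean, overall_sd, z_cutoff=2.0):
    """Return {topic: z-score} for topics whose standardized mean KLD
    exceeds z_cutoff. The cutoff is an arbitrary, illustrative choice."""
    flagged = {}
    for topic, score in mean_kld.items():
        z = (score - overall_mean) / overall_sd
        if z > z_cutoff:
            flagged[topic] = round(z, 2)
    return flagged

# Rounded mean KLD values for five topics from the table (k=60).
scores = {12: 0.150, 13: 3.468, 21: 0.624, 26: 2.565, 50: 2.166}

unstable = flag_unstable(scores, overall_mean=0.87, overall_sd=0.68)
stable = [t for t in scores if t not in unstable]

print("unstable:", unstable)  # topics 13 and 26 exceed the cutoff
print("stable:", stable)
```

With a cutoff of 2.0 the poetry topic (50) narrowly stays in, which shows how much the choice of threshold matters for borderline topics.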

The second idea comes from that poetry topic, which really surprised me. I wonder what would happen if you conditioned on just the top words. Then again, are variations on the word “poet” really a topic? Maybe what this is telling us is that the topic of “poetry” is actually very unstable depending on how you model it. While you will have a strong register of top words, underneath that tip is an iceberg of variability.
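Conditioning on the top words could be tried by restricting both topics to the n most probable words of the first, renormalizing, and recomputing the divergence. The sketch below is hypothetical: `top_n_kld` is a made-up helper, and the two toy topics are constructed so that they share an identical ten-word head but have mirrored tails, which is the “iceberg” situation in miniature.

```python
import math

def kld(p, q, eps=1e-12):
    """KL(p || q) in nats after eps smoothing and renormalization."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def top_n_kld(p, q, n=10):
    """KLD restricted to the n highest-probability words of p."""
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:n]
    return kld([p[i] for i in top], [q[i] for i in top])

# Toy topics: the same 10-word head (half the mass), mirrored 990-word tails.
head = [0.05] * 10
weights = [1 + i / 989 for i in range(990)]
tail = [0.5 * w / sum(weights) for w in weights]
p = head + tail
q = head + tail[::-1]

full = kld(p, q)
head_only = top_n_kld(p, q, n=10)
print(f"full vocabulary KLD: {full:.4f}")
print(f"top-10 KLD: {head_only:.4f}")
```

Here the top-10 divergence is essentially zero while the full-vocabulary divergence is not, so all of the measured instability lives below the words one would normally read off a topic.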

If we return to the culture studies topic I discussed in the previous post, we can see that it scores almost dead in the middle of the pack (29th in terms of stability). In other words, that variability I was seeing was average for this model, neither low nor high, i.e. “some.”

The next step is to compare this with document probabilities. Does semantic instability correlate with document instability? In other words, when a topic changes its word distributions, does this mean we are getting very different distributions of articles?

Second, I want to compare this with David Mimno’s work on topic “coherence”. Are stability and coherence capturing the same thing?

Finally, I could imagine that running this across numerous values of k might be valuable. Do we see a significant decline in average topic stability as k increases? I assume stability will decrease as k rises, but maybe there is something telling about where it declines faster. This might be another diagnostic for choosing k. But that’s a lot of run time, so don’t expect anything on this anytime soon.

Table of average stability of topics for a model of k=60.
topic  mean.kld           sd.kld               norm
1      0.603100842078638  0.230915792878045   -0.39279399596302
2      0.805643453733019  0.356186351211379   -0.0932849378235169
3      0.64061335365065   0.588805499885999   -0.33732252298452
4      0.528174881032252  0.274482013239219   -0.503590454248515
5      0.583970149062244  0.2084634829349     -0.421083429974433
6      0.190253779078275  0.056951708329921   -1.003289900923
7      0.402938269702216  0.356544338468519   -0.688783580759832
8      0.303668611933252  0.225769818280089   -0.835578180954619
9      1.29045783167664   0.991560624499197    0.62363233960432
10     0.196456080347557  0.164411572935192   -0.994118273302088
11     1.82494150765407   0.660901543356964    1.41399788824307
12     0.149650787743166  0.00381496807766735 -1.06333140820432
13     3.4677716198774    0.489826836269115    3.84332619390767
14     1.83244837957986   1.83295495597478     1.42509864438373
15     0.551530627274636  0.541245901097331   -0.469053240068922
16     0.821705356399711  0.214815710545396   -0.0695334649836277
17     1.10872230747958   0.0689206746338174   0.354891676895614
18     1.9183879913102    0.912655741718207    1.55218149354808
19     1.13186421945851   0.568926216018039    0.389112684533177
20     0.409540966448783  0.324305411765304   -0.679019869965385
21     0.623535304865389  0.369176907175886   -0.362576617738974
22     0.238213132186912  0.152658998958459   -0.932370203772754
23     0.879419026058423  0.70003563685839     0.0158103878484456
24     0.508744205455446  0.167600151602236   -0.532323486243552
25     1.66686739314703   0.812198141340613    1.18024643828632
26     2.56522074499583   0.38933505724029     2.50868278109318
27     0.378354212976699  0.308451705418931   -0.725137154039868
28     0.926668555642327  0.395542827059524    0.0856804363662265
29     0.818009243335869  0.847667218603971   -0.0749990770515584
30     0.902639038322064  0.0950740866633539   0.0501468859624224
31     0.50280805384327   0.161708533825336   -0.54110154614815
32     2.08833566545292   0.984692582131755    1.80349092122714
33     0.990417038270873  0.869125701555744    0.179948244181793
34     1.83648823196099   0.574072283729001    1.43107255956044
35     0.802304373127328  0.527592660763758   -0.0982225896050051
36     1.91055138196471   0.543188679720419    1.5405931395474
37     0.543742808773208  0.170529485521647   -0.480569444808615
38     0.462626360229903  0.199618149241667   -0.600520061147482
39     1.24296433666249   0.544508195688859    0.553401528204912
40     0.149879535910968  0.100551829891209   -1.06299314778727
41     0.55001387041567   0.381533228164756   -0.471296138067249
42     0.255330099135012  0.173561822793647   -0.907058558944265
43     0.761580195453493  0.575755697849621   -0.158443300688801
44     0.382752705722451  0.439826585819993   -0.718632900891589
45     0.898774409087935  0.521228270671737    0.044432081297281
46     0.642371615071389  0.343398431851845   -0.334722501103757
47     0.2420202276836    0.0851796646471162  -0.926740476886415
48     1.282954513257     0.545090316965067    0.612536838196336
49     0.434228253321627  0.286988871910663   -0.642513645527426
50     2.16558643464701   0.475135345015491    1.91772518004747
51     0.304262990970094  0.15160499103407    -0.834699245386205
52     0.284085867008806  0.173391575452376   -0.86453608487518
53     0.289378949015896  0.349967134121609   -0.856708961517533
54     0.761574435124193  0.145389460295225   -0.158451818752181
55     1.17196281590585   0.693481099973222    0.448408319938301
56     1.03266129387122   0.384598207158776    0.242416764278966
57     0.40298297721012   0.0786439033495061  -0.688717469715469
58     0.157750220300787  0.0227775658441129  -1.05135440545422
59     1.02507793606749   0.667504182775091    0.231202904959287
60     0.278656840368924  0.259287244350614   -0.872564235802255

Table of top 10 topic words by topic, k=60.
Topic 1: education, school, years, many, work, percent, state, studies, members, schools
Topic 2: first, passage, two, part, however, chapter, use, reference, later, second
Topic 3: science, human, nature, scientific, natural, world, animal, knowledge, medical, animals
Topic 4: name, without, moment, place, self, writing, question, time, nothing, first
Topic 5: much, though, might, perhaps, less, many, seems, least, course, fact
Topic 6: mother, family, father, child, children, daughter, young, life, home, mothers
Topic 7: history, historical, past, time, memory, present, future, events, historians, event
Topic 8: law, public, legal, state, authority, case, justice, laws, rights, private
Topic 9: german, jewish, germany, jews, berlin, goethe, mann, jew, anti, nazi
Topic 10: film, films, cinema, media, image, images, television, camera, screen, visual
Topic 11: theory, modern, aesthetic, critique, historical, benjamin, social, ideology, political, critical
Topic 12: jean, roman, pierre, saint, review, french, donne, france, rousseau, marie
Topic 13: french, france, arabic, first, arab, studies, two, many, since, time
Topic 14: italian, spanish, dante, juan, madrid, maria, spain, italy, vita, florence
Topic 15: language, words, speech, word, meaning, linguistic, use, metaphor, discourse, verbal
Topic 16: poem, poems, poet, lines, poetry, line, poets, verse, poetic, stanza
Topic 17: city, space, place, joyce, house, irish, urban, home, stephen, ulysses
Topic 18: james, mrs, mary, miss, henry, jane, williams, woolf, lawrence, jamess
Topic 19: milton, paradise, adam, allegory, pastoral, lost, miltons, book, blake, spenser
Topic 20: letter, letters, years, wrote, time, first, year, life, later, two
Topic 21: cultural, culture, identity, social, discourse, political, within, power, politics, studies
Topic 22: women, female, woman, male, womens, men, gender, feminist, sexual, sex
Topic 23: art, work, painting, artist, aesthetic, artistic, artists, image, arts, visual
Topic 24: form, structure, system, order, within, object, process, terms, two, reality
Topic 25: english, translation, latin, old, greek, anglo, saga, translations, studies, original
Topic 26: romance, dickens, tom, house, gothic, hawthorne, melville, poe, mark, conrad
Topic 27: god, christian, religious, christ, church, religion, spiritual, gods, divine, faith
Topic 28: narrative, narrator, story, voice, reader, narratives, narrators, events, narration, first
Topic 29: king, shakespeare, shakespeares, henry, queen, hamlet, richard, sir, kings, court
Topic 30: research, social, study, analysis, studies, different, role, use, example, theory
Topic 31: novel, fiction, novels, characters, character, reader, world, readers, fictional, realism
Topic 32: russian, east, chinese, european, western, west, soviet, journal, japanese, china
Topic 33: music, song, folklore, songs, folk, musical, oral, dance, tradition, performance
Topic 34: story, tale, stories, tales, chaucer, myth, version, legend, chaucers, two
Topic 35: social, class, economic, society, labor, work, money, working, economy, market
Topic 36: hero, action, irony, tragedy, character, tragic, comic, heroic, epic, characters
Topic 37: literary, literature, work, writers, criticism, writing, works, critical, critics, studies
Topic 38: indian, colonial, world, native, land, british, national, european, people, west
Topic 39: london, century, english, john, england, thomas, george, william, victorian, eighteenth
Topic 40: think, people, know, way, say, just, things, something, time, now
Topic 41: death, violence, life, dead, murder, suffering, fear, blood, guilt, crime
Topic 42: political, politics, state, national, revolution, party, power, government, people, movement
Topic 43: body, desire, self, subject, freud, sexual, object, bodies, power, pleasure
Topic 44: love, marriage, woman, lady, lover, desire, lovers, passion, beauty, wife
Topic 45: life, self, world, experience, human, reality, sense, consciousness, individual, personal
Topic 46: philosophy, thought, truth, knowledge, theory, philosophical, reason, idea, ideas, human
Topic 47: play, plays, stage, drama, scene, theater, audience, dramatic, theatre, act
Topic 48: century, medieval, roman, early, rome, renaissance, middle, ages, late, two
Topic 49: two, form, first, three, used, forms, second, type, line, word
Topic 50: poetry, poet, poets, poetic, eliot, poems, poem, prose, pound, eliots
Topic 51: american, york, america, states, united, americans, john, literature, early, boston
Topic 52: romantic, wordsworth, nature, imagination, coleridge, mind, sublime, shelley, wordsworths, romanticism
Topic 53: said, day, old, back, time, now, two, head, came, little
Topic 54: language, students, learning, teaching, foreign, student, teachers, study, teacher, english
Topic 55: moral, good, men, virtue, upon, nature, johnson, life, satire, great
Topic 56: must, question, fact, view, might, problem, argument, since, evidence, seems
Topic 57: text, reading, texts, reader, read, readers, writing, textual, interpretation, meaning
Topic 58: black, white, african, race, racial, slave, africa, slavery, south, negro
Topic 59: world, image, light, dream, vision, images, earth, nature, sun, life
Topic 60: book, books, edition, first, work, published, manuscript, page, two, works
