Topic Stability, Part 2
In my previous post I tried to illustrate how different runs of the same topic modelling process can produce topics that appear to be slightly semantically different from one another. If you keep k and all other parameters constant, but change your initial seed, you’ll see the kind of variation that I showed.
The question that I want to address here is whether we can put a number to that variation, so that we can understand which topics are subject to more semantic variability than others.
I’ve gone ahead and written a script in R that calculates, for a given topic, the average difference between it and the most similar topic in each of the other runs. You can download it from GitHub.
I use KL-divergence to calculate the difference between topics, meaning I am asking how much information is lost when we approximate a topic’s term probability distribution by the most similar topic from another model. I then take the average KLD of all these comparisons for a given topic. I have run 10 models (small N!) and so that is the basis of this average.
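To make the calculation concrete, here is a minimal sketch in Python. It is not the R script mentioned above. It assumes that "most similar" means the candidate topic with the lowest KLD, that natural logs are used, and that every topic is a probability vector over the same vocabulary. The function names, the `eps` guard, and the toy data are all hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats: information lost when q is used to approximate p."""
    # eps guards against division by zero; it is an assumption of this sketch.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def topic_stability(reference_run, other_runs):
    """For each topic in reference_run, return (mean, sd) of its KLD to the
    most similar topic (lowest KLD) found in each of the other runs."""
    results = []
    for topic in reference_run:
        best = [min(kl_divergence(topic, cand) for cand in run) for run in other_runs]
        mean = sum(best) / len(best)
        if len(best) > 1:
            var = sum((b - mean) ** 2 for b in best) / (len(best) - 1)
        else:
            var = 0.0
        results.append((mean, math.sqrt(var)))
    return results

# Toy data: 3 runs, 2 topics each, over a 3-word vocabulary.
runs = [
    [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]],
    [[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]],  # same topics, different order
    [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]],  # slightly shifted topics
]
stability = topic_stability(runs[0], runs[1:])
for i, (mean_kld, sd_kld) in enumerate(stability, start=1):
    print(f"topic {i}: mean.kld={mean_kld:.4f} sd.kld={sd_kld:.4f}")
```

Matching on the minimum KLD means topic order within a run does not matter, which is what you want since topic numbering is arbitrary from one seed to the next.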
When I run this on my 10 models what I see is that there is a fairly wide range of “stability” across topics (mean KLD = 0.87, standard deviation = 0.68; the table of scores and the table of topic words are pasted in at the bottom). The topic with the lowest average information loss between runs is the “French” topic (jean, roman, pierre, saint, review, french, donne, france, rousseau, marie) followed by the junk topic (think, people, know, way, say, just, things, something, time, now) but more interestingly also the “race” topic (black, white, african, race, racial, slave, africa, slavery, south, negro). These have average KLD scores between roughly 0.15 and 0.16.
The topics with the highest divergence are the French-arab studies topic (french, france, arabic, first, arab, studies, two, many, since, time, century) with a score of 3.5 (!) and some sort of novelists topic (romance, dickens, tom, house, gothic, hawthorne, melville, poe, mark, conrad) with a score of 2.6. Interestingly the next highest is the poetry topic which looks very stable when you observe the top few words but obviously must have a lot more variability further down the list (poetry, poet, poets, poetic, eliot, poems, poem, prose, pound, eliots).
Two things immediately come to mind. The first is that these high-scoring topics score far above both the average and the low scorers. They definitely feel like outliers. Frankly, I wouldn’t invoke the French-arab studies topic as a topic. It is too unstable to do a deep dive into it. Essentially it keeps turning up as something different each time. It might even make sense to subset any analysis of a model by removing topics with this much variability. Since they are so evanescent, do we really want to conclude anything based on them? In this sense, I think this kind of measure can be very valuable for gaining confidence about the stability of topics or for subsetting on a group of more stable topics.
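One way to do that kind of subsetting is to standardize each topic's mean KLD and drop anything past a cutoff. The sketch below uses a handful of mean.kld values from the table at the bottom of this post (rounded) and the model-wide mean and standard deviation reported above. The cutoff of 2 standard deviations is purely a hypothetical choice for illustration.

```python
# mean.kld values copied from the table in this post, rounded to 3 decimals.
mean_kld = {12: 0.150, 13: 3.468, 21: 0.624, 26: 2.565, 50: 2.166}

MODEL_MEAN, MODEL_SD = 0.87, 0.68  # reported mean and sd across all 60 topics
CUTOFF = 2.0  # hypothetical threshold, in standard deviations

def z_score(kld):
    """Standardize a topic's mean KLD against the whole model."""
    return (kld - MODEL_MEAN) / MODEL_SD

stable = sorted(t for t, k in mean_kld.items() if z_score(k) <= CUTOFF)
unstable = sorted(t for t, k in mean_kld.items() if z_score(k) > CUTOFF)

print("keep:", stable)
print("drop:", unstable)
```

With this cutoff the French-arab studies topic (13) and the novelists topic (26) are dropped, while the poetry topic (50) narrowly stays in, so the choice of threshold clearly matters.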
The second idea that occurs to me comes from that poetry topic. That really surprised me. I wonder what would happen if you condition on just the top words. Then again, are variations on the word “poet” really a topic? Maybe what this is telling us is that the topic of “poetry” is actually very unstable depending on how you model it. While you will have a strong register of top words, underneath that tip is an iceberg of variability.
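Conditioning on the top words could be sketched as follows: keep only a topic's n most probable terms, renormalize both distributions over those terms, and compute KLD on what is left. This is just one possible reading of "condition on the top words"; the function names and toy numbers are hypothetical, and I have not run this on the actual models.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def top_n_kld(p, q, n):
    """KLD restricted to p's n most probable terms, with both distributions
    renormalized over those terms. Assumes q has some mass on them."""
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:n]
    p_top = [p[i] for i in top]
    q_top = [q[i] for i in top]
    p_sum, q_sum = sum(p_top), sum(q_top)
    return kl_divergence([x / p_sum for x in p_top], [x / q_sum for x in q_top])

# Toy topics: identical on the two top words, different in the tail.
p = [0.4, 0.3, 0.1, 0.1, 0.1]
q = [0.4, 0.3, 0.2, 0.05, 0.05]

full = kl_divergence(p, q)
head = top_n_kld(p, q, 2)
print(f"full vocabulary: {full:.4f}")
print(f"top 2 words only: {head:.4f}")
```

In the toy example the two topics are indistinguishable on their top words, and all of the divergence comes from the tail. That is the iceberg pattern the poetry topic seems to show.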
If we return to the culture studies topic I discussed in the previous post, we can see that it scores almost dead in the middle of the pack (29th in terms of stability). In other words, that variability I was seeing was average for this model, neither low nor high, i.e. “some.”
Next steps are going to be to compare this to document probabilities. Does semantic instability correlate with document instability? In other words, when a topic changes its word distributions, does this mean we are getting very different distributions of articles?
Second, I want to compare this with David Mimno’s work on topic “coherence”. Are stability and coherence capturing the same thing?
Finally, I could imagine that running this across numerous values of k might be valuable: do we see a significant decline in the average topic stability as k increases? I assume stability will decrease as k rises, but maybe there is something telling about the point at which it declines faster. This might be another diagnostic for choosing k. But that’s a lot of run time, so don’t expect anything on this anytime soon.
Table of average stability of topics for a model of k=60. The norm column is mean.kld standardized across the 60 topics (z-score).

topic | mean.kld | sd.kld | norm |
---|---|---|---|
1 | 0.603100842078638 | 0.230915792878045 | -0.39279399596302 |
2 | 0.805643453733019 | 0.356186351211379 | -0.0932849378235169 |
3 | 0.64061335365065 | 0.588805499885999 | -0.33732252298452 |
4 | 0.528174881032252 | 0.274482013239219 | -0.503590454248515 |
5 | 0.583970149062244 | 0.2084634829349 | -0.421083429974433 |
6 | 0.190253779078275 | 0.056951708329921 | -1.003289900923 |
7 | 0.402938269702216 | 0.356544338468519 | -0.688783580759832 |
8 | 0.303668611933252 | 0.225769818280089 | -0.835578180954619 |
9 | 1.29045783167664 | 0.991560624499197 | 0.62363233960432 |
10 | 0.196456080347557 | 0.164411572935192 | -0.994118273302088 |
11 | 1.82494150765407 | 0.660901543356964 | 1.41399788824307 |
12 | 0.149650787743166 | 0.00381496807766735 | -1.06333140820432 |
13 | 3.4677716198774 | 0.489826836269115 | 3.84332619390767 |
14 | 1.83244837957986 | 1.83295495597478 | 1.42509864438373 |
15 | 0.551530627274636 | 0.541245901097331 | -0.469053240068922 |
16 | 0.821705356399711 | 0.214815710545396 | -0.0695334649836277 |
17 | 1.10872230747958 | 0.0689206746338174 | 0.354891676895614 |
18 | 1.9183879913102 | 0.912655741718207 | 1.55218149354808 |
19 | 1.13186421945851 | 0.568926216018039 | 0.389112684533177 |
20 | 0.409540966448783 | 0.324305411765304 | -0.679019869965385 |
21 | 0.623535304865389 | 0.369176907175886 | -0.362576617738974 |
22 | 0.238213132186912 | 0.152658998958459 | -0.932370203772754 |
23 | 0.879419026058423 | 0.70003563685839 | 0.0158103878484456 |
24 | 0.508744205455446 | 0.167600151602236 | -0.532323486243552 |
25 | 1.66686739314703 | 0.812198141340613 | 1.18024643828632 |
26 | 2.56522074499583 | 0.38933505724029 | 2.50868278109318 |
27 | 0.378354212976699 | 0.308451705418931 | -0.725137154039868 |
28 | 0.926668555642327 | 0.395542827059524 | 0.0856804363662265 |
29 | 0.818009243335869 | 0.847667218603971 | -0.0749990770515584 |
30 | 0.902639038322064 | 0.0950740866633539 | 0.0501468859624224 |
31 | 0.50280805384327 | 0.161708533825336 | -0.54110154614815 |
32 | 2.08833566545292 | 0.984692582131755 | 1.80349092122714 |
33 | 0.990417038270873 | 0.869125701555744 | 0.179948244181793 |
34 | 1.83648823196099 | 0.574072283729001 | 1.43107255956044 |
35 | 0.802304373127328 | 0.527592660763758 | -0.0982225896050051 |
36 | 1.91055138196471 | 0.543188679720419 | 1.5405931395474 |
37 | 0.543742808773208 | 0.170529485521647 | -0.480569444808615 |
38 | 0.462626360229903 | 0.199618149241667 | -0.600520061147482 |
39 | 1.24296433666249 | 0.544508195688859 | 0.553401528204912 |
40 | 0.149879535910968 | 0.100551829891209 | -1.06299314778727 |
41 | 0.55001387041567 | 0.381533228164756 | -0.471296138067249 |
42 | 0.255330099135012 | 0.173561822793647 | -0.907058558944265 |
43 | 0.761580195453493 | 0.575755697849621 | -0.158443300688801 |
44 | 0.382752705722451 | 0.439826585819993 | -0.718632900891589 |
45 | 0.898774409087935 | 0.521228270671737 | 0.044432081297281 |
46 | 0.642371615071389 | 0.343398431851845 | -0.334722501103757 |
47 | 0.2420202276836 | 0.0851796646471162 | -0.926740476886415 |
48 | 1.282954513257 | 0.545090316965067 | 0.612536838196336 |
49 | 0.434228253321627 | 0.286988871910663 | -0.642513645527426 |
50 | 2.16558643464701 | 0.475135345015491 | 1.91772518004747 |
51 | 0.304262990970094 | 0.15160499103407 | -0.834699245386205 |
52 | 0.284085867008806 | 0.173391575452376 | -0.86453608487518 |
53 | 0.289378949015896 | 0.349967134121609 | -0.856708961517533 |
54 | 0.761574435124193 | 0.145389460295225 | -0.158451818752181 |
55 | 1.17196281590585 | 0.693481099973222 | 0.448408319938301 |
56 | 1.03266129387122 | 0.384598207158776 | 0.242416764278966 |
57 | 0.40298297721012 | 0.0786439033495061 | -0.688717469715469 |
58 | 0.157750220300787 | 0.0227775658441129 | -1.05135440545422 |
59 | 1.02507793606749 | 0.667504182775091 | 0.231202904959287 |
60 | 0.278656840368924 | 0.259287244350614 | -0.872564235802255 |

Table of the top ten words for each topic in the same model.

Topic 1 | education | school | years | many | work | percent | state | studies | members | schools |
Topic 2 | first | passage | two | part | however | chapter | use | reference | later | second |
Topic 3 | science | human | nature | scientific | natural | world | animal | knowledge | medical | animals |
Topic 4 | name | without | moment | place | self | writing | question | time | nothing | first |
Topic 5 | much | though | might | perhaps | less | many | seems | least | course | fact |
Topic 6 | mother | family | father | child | children | daughter | young | life | home | mothers |
Topic 7 | history | historical | past | time | memory | present | future | events | historians | event |
Topic 8 | law | public | legal | state | authority | case | justice | laws | rights | private |
Topic 9 | german | jewish | germany | jews | berlin | goethe | mann | jew | anti | nazi |
Topic 10 | film | films | cinema | media | image | images | television | camera | screen | visual |
Topic 11 | theory | modern | aesthetic | critique | historical | benjamin | social | ideology | political | critical |
Topic 12 | jean | roman | pierre | saint | review | french | donne | france | rousseau | marie |
Topic 13 | french | france | arabic | first | arab | studies | two | many | since | time |
Topic 14 | italian | spanish | dante | juan | madrid | maria | spain | italy | vita | florence |
Topic 15 | language | words | speech | word | meaning | linguistic | use | metaphor | discourse | verbal |
Topic 16 | poem | poems | poet | lines | poetry | line | poets | verse | poetic | stanza |
Topic 17 | city | space | place | joyce | house | irish | urban | home | stephen | ulysses |
Topic 18 | james | mrs | mary | miss | henry | jane | williams | woolf | lawrence | jamess |
Topic 19 | milton | paradise | adam | allegory | pastoral | lost | miltons | book | blake | spenser |
Topic 20 | letter | letters | years | wrote | time | first | year | life | later | two |
Topic 21 | cultural | culture | identity | social | discourse | political | within | power | politics | studies |
Topic 22 | women | female | woman | male | womens | men | gender | feminist | sexual | sex |
Topic 23 | art | work | painting | artist | aesthetic | artistic | artists | image | arts | visual |
Topic 24 | form | structure | system | order | within | object | process | terms | two | reality |
Topic 25 | english | translation | latin | old | greek | anglo | saga | translations | studies | original |
Topic 26 | romance | dickens | tom | house | gothic | hawthorne | melville | poe | mark | conrad |
Topic 27 | god | christian | religious | christ | church | religion | spiritual | gods | divine | faith |
Topic 28 | narrative | narrator | story | voice | reader | narratives | narrators | events | narration | first |
Topic 29 | king | shakespeare | shakespeares | henry | queen | hamlet | richard | sir | kings | court |
Topic 30 | research | social | study | analysis | studies | different | role | use | example | theory |
Topic 31 | novel | fiction | novels | characters | character | reader | world | readers | fictional | realism |
Topic 32 | russian | east | chinese | european | western | west | soviet | journal | japanese | china |
Topic 33 | music | song | folklore | songs | folk | musical | oral | dance | tradition | performance |
Topic 34 | story | tale | stories | tales | chaucer | myth | version | legend | chaucers | two |
Topic 35 | social | class | economic | society | labor | work | money | working | economy | market |
Topic 36 | hero | action | irony | tragedy | character | tragic | comic | heroic | epic | characters |
Topic 37 | literary | literature | work | writers | criticism | writing | works | critical | critics | studies |
Topic 38 | indian | colonial | world | native | land | british | national | european | people | west |
Topic 39 | london | century | english | john | england | thomas | george | william | victorian | eighteenth |
Topic 40 | think | people | know | way | say | just | things | something | time | now |
Topic 41 | death | violence | life | dead | murder | suffering | fear | blood | guilt | crime |
Topic 42 | political | politics | state | national | revolution | party | power | government | people | movement |
Topic 43 | body | desire | self | subject | freud | sexual | object | bodies | power | pleasure |
Topic 44 | love | marriage | woman | lady | lover | desire | lovers | passion | beauty | wife |
Topic 45 | life | self | world | experience | human | reality | sense | consciousness | individual | personal |
Topic 46 | philosophy | thought | truth | knowledge | theory | philosophical | reason | idea | ideas | human |
Topic 47 | play | plays | stage | drama | scene | theater | audience | dramatic | theatre | act |
Topic 48 | century | medieval | roman | early | rome | renaissance | middle | ages | late | two |
Topic 49 | two | form | first | three | used | forms | second | type | line | word |
Topic 50 | poetry | poet | poets | poetic | eliot | poems | poem | prose | pound | eliots |
Topic 51 | american | york | america | states | united | americans | john | literature | early | boston |
Topic 52 | romantic | wordsworth | nature | imagination | coleridge | mind | sublime | shelley | wordsworths | romanticism |
Topic 53 | said | day | old | back | time | now | two | head | came | little |
Topic 54 | language | students | learning | teaching | foreign | student | teachers | study | teacher | english |
Topic 55 | moral | good | men | virtue | upon | nature | johnson | life | satire | great |
Topic 56 | must | question | fact | view | might | problem | argument | since | evidence | seems |
Topic 57 | text | reading | texts | reader | read | readers | writing | textual | interpretation | meaning |
Topic 58 | black | white | african | race | racial | slave | africa | slavery | south | negro |
Topic 59 | world | image | light | dream | vision | images | earth | nature | sun | life |
Topic 60 | book | books | edition | first | work | published | manuscript | page | two | works |